How do I parse entire Common Crawl database using node?


I want to get as many HTML files from the Common Crawl database as possible. I'm quite lost on how to do it and don't even know where to start. I've seen many people do it in Python, but I don't know how to adapt the code to JavaScript. I found this package: https://www.npmjs.com/package/commoncrawl
But this package can only search; it cannot go through every single website in the database.
Also, I only want the raw HTML of each website, plus a way to get the website's link. Shouldn't be that hard.

The commoncrawl package looks like it is used for navigating their CDX indexes.
If you want the underlying HTML itself, you need to read the WARC files. Consider using something like node-warc.
I wrote a blog post that introduces the WARC format and provides examples of how to fetch and search HTML from the Common Crawl in Node, Java, Go and Python. You can find the Node code on GitHub here. Hope that helps!
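
To make the WARC point a bit more concrete, here is a minimal sketch in plain Node.js of what a single WARC "response" record contains and how the URL and raw HTML can be pulled out of it. It is not the code from the blog post and does not use node-warc; parseWarcRecord and its helpers are hypothetical names. It assumes the record is already uncompressed and held in a Buffer (Common Crawl's WARC files are gzipped, so a real pipeline would decompress first) and that the page is UTF-8.

```javascript
// Minimal sketch: split one uncompressed WARC "response" record into its
// WARC headers, the captured HTTP headers, and the HTML body.
// All function names here are hypothetical helpers, not a library API.

// Split a buffer at the first blank line (CRLF CRLF).
function splitAtBlankLine(buf) {
  const sep = Buffer.from('\r\n\r\n');
  const i = buf.indexOf(sep);
  if (i === -1) return [buf, Buffer.alloc(0)];
  return [buf.subarray(0, i), buf.subarray(i + sep.length)];
}

// Turn "Name: value" lines into an object with lower-cased keys.
function parseHeaderLines(text) {
  const headers = {};
  for (const line of text.split('\r\n')) {
    const colon = line.indexOf(':');
    if (colon === -1) continue; // skips "WARC/1.0" and the HTTP status line
    headers[line.slice(0, colon).trim().toLowerCase()] = line.slice(colon + 1).trim();
  }
  return headers;
}

// Returns { url, httpHeaders, html } for response records, null otherwise.
function parseWarcRecord(record) {
  const [warcHead, block] = splitAtBlankLine(record);
  const warcHeaders = parseHeaderLines(warcHead.toString('utf8'));
  if (warcHeaders['warc-type'] !== 'response') return null;

  // Content-Length gives the size of the captured HTTP message in bytes.
  const length = Number(warcHeaders['content-length']);
  const payload = Number.isFinite(length) ? block.subarray(0, length) : block;

  const [httpHead, body] = splitAtBlankLine(payload);
  return {
    url: warcHeaders['warc-target-uri'],
    httpHeaders: parseHeaderLines(httpHead.toString('utf8')),
    html: body.toString('utf8'), // assumption: the page is UTF-8 encoded
  };
}

// Usage with a hand-built sample record:
const http = 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>Hello</body></html>';
const sample = Buffer.from(
  'WARC/1.0\r\n' +
  'WARC-Type: response\r\n' +
  'WARC-Target-URI: http://example.com/\r\n' +
  `Content-Length: ${Buffer.byteLength(http)}\r\n` +
  '\r\n' + http + '\r\n\r\n'
);
const page = parseWarcRecord(sample);
console.log(page.url, page.html.length);
```

For real crawl data, a dedicated WARC library is probably the safer choice, since it would also deal with gzip members, record boundaries and non-UTF-8 pages.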

Related

On-Disk Text Processing With Javascript

I have some html files that I need to do automated processing on, basically regex replaces, but also some more complex actions like copying select blocks of text from one file to another.
I want to create a series of scripts that will let me do this processing (it will need to be done more than once on different batches of files). It would be trivial to use Go for this (read the file into memory, regex, save to disk) but I am the only member of the project that's familiar with Go.
Javascript is a tad more ubiquitous, and I do have project members who are familiar with the language, so it's a better fit in that respect. If I'm not around later, someone else could edit the scripts.
Is there a simple way to write some JS scripts to do on-disk text processing? I'm looking for a cross-platform solution (OSX, Windows). Ideally, once the scripts are written, they can be executed by double-clicking an icon--there will be "not computer people" involved at some point.
Also, I'd like to be able to do some kind of alert/message box to inform the user of the success/failure of the script. (This may be a tall order, and is of secondary importance.)
What I've looked at:
Node.js was the first thing that popped into my head, because I know that it has file system access tools, and obviously regex capacity. But I've never used Node before, and based on the tutorials I've read, it seems like overkill for something this simple.
There's a whole slew of "javascript compiling" tools that you can find by googling around. Some are not cross-platform, some seem old or not actively maintained, etc. None of them caught my eye as easy to pick up and just write some JS scripts with.
Any thoughts?

Node.js is a simple solution, and with its framework you can create your script and later modify it to suit your needs. This way you will not be locked in by someone else's code. And it is not that difficult to use.
Here is a quick tutorial on accessing files using Node.js:
http://www.sitepoint.com/accessing-the-file-system-in-node-js/
And here is a quick tutorial on using a Node module called Cheerio. It allows you to access HTML files using "jQuery-like syntax". You don't need to use regex.
http://maxogden.com/scraping-with-node.html
I once worked on a project for a client that required parsing through hundreds of HTML files to check and replace certain image files based on certain criteria. I wasn't familiar with Node at the time, so I read some tutorials and wrote the script in about an hour.
And as long as Node.js is on your PATH, you can run it from the command line.
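
As a rough idea of how small such a script can be, here is a minimal sketch using only Node's built-in fs and path modules. The function name replaceInHtmlFiles, the directory and the pattern are all made up for illustration; it uses a plain regex replace (what the question asked about) rather than Cheerio.

```javascript
const fs = require('fs');
const path = require('path');

// Apply one regex replacement to every .html file directly inside `dir`.
// Returns the names of the files that were actually changed.
// (Hypothetical helper; it does not recurse into subdirectories.)
function replaceInHtmlFiles(dir, pattern, replacement) {
  const changed = [];
  for (const name of fs.readdirSync(dir)) {
    if (path.extname(name).toLowerCase() !== '.html') continue;
    const file = path.join(dir, name);
    const before = fs.readFileSync(file, 'utf8');
    const after = before.replace(pattern, replacement);
    if (after !== before) {
      fs.writeFileSync(file, after, 'utf8'); // overwrite only when changed
      changed.push(name);
    }
  }
  return changed;
}

// Example call (hypothetical directory and file names):
// const changed = replaceInHtmlFiles('./pages', /old-logo\.png/g, 'new-logo.png');
// console.log('Updated files:', changed);
```

Saved as, say, process.js, this would be run with `node process.js`; printing the list of changed files gives at least a basic success report for the person running it.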

Some tips:
You need some kind of DOM HTML parser; it doesn't have to be a JS one.
You can do this in Java using the jTidy or jSoup libraries (I've used the second one a few times). Java is a pretty simple language to learn if you know JS, and an IDE like NetBeans helps a lot, so it can be done quickly.
You can use PhantomJS to create some job files and write shell/batch code to run them on some files. You might need to write a generator for the job files (e.g. taking a list of files, creating a job file for each, and running them).
You can use Node.js, which isn't much overkill; I'm sure no solution will be trivial.
You can create an ETL process with, for example, Pentaho ETL (which has JS embedded as one of two scripting languages, but without a DOM parser; for that you would need a bit of Java and some library, in a way similar to this article).
You can also do this in PHP with Simple HTML DOM Parser, so you can make an online service (or one on a local server) that takes those HTML files and outputs the processed ones.

First, I think you underestimate the complexity. The statement
"It would be trivial to use Go for this (read the file into memory, regex, save to disk) but I am the only member of the project that's familiar with Go."
is probably false. Parsing HTML with RegExp is just a bad idea. (Google it and you will see why.)
Second, if you can trivially write the code using RegExps in Go, you can just as easily write the same thing in JavaScript. Both support RegExp and file operations. If you are unsure about the JavaScript/Node.js details, I suggest writing the trivial solution in Go and then translating it into JavaScript with a colleague.
Since JavaScript is a scripting language, writing command-line utilities in Node.js is straightforward.
Some pointers to get you started
RegExp in Javascript
Building command line apps in Node.js

Parse markdown on the fly

I want to add Markdown support to my forums.
I researched many possible approaches, and this is what I came up with:
A simple approach would involve pagedown on the client side and php-markdown on the server.
My approach is to save pure Markdown to the database and convert it to HTML when displaying (with pagedown). Since I already have a security layer on the server side (an HTML element whitelist) and all the necessary stuff, I don't see anything to lose here.
What do I gain in this case? Well, I have to modify pagedown to use custom buttons and patterns, and it would be hard for me to maintain both PHP and JavaScript versions.
My question is: is this a good approach?
To break this question down:
Is there any serious overhead on the client side in loading about 30 posts and converting them to HTML? (Performance)
Given that I check an element whitelist, is there any security issue I need to know about? (Security)

I wouldn't use client-side Markdown engines. From a few quick Google searches, the general opinion is that they are very CPU intensive. Loading 30 posts would add quite a bit of overhead.
If you stored the Markdown in the DB, rendered it to HTML on the fly, and then employed some caching (memcached or Redis), that could work quite well.
In regards to security, there's a good read here; it would require some extra sanitising, removing scripts/links/redirects, etc.
Further reading
http://functionn.in/resource/remarked-js
How to use Markdown & MySQL?

Converting Markdown on the client side is not recommended, as @Lex has stated. Instead, you can use an online service to convert the Markdown to HTML for you.
Have a look at http://daringfireball.net/projects/markdown/dingus. You can use curl or something similar to post Markdown to the site and then scrape the page to extract the HTML part.
You can also have a look at http://parsedown.org/

I have two options to suggest:
Strapdown - lets you create Markdown documents without server-side processing; as you can see on their page, it works even without code, just by using static files
markdown-js - lets you create Markdown documents with client-side processing (JavaScript)

Here is how I do it:
I save the Markdown source in the DB and, at render time, cache the result (in a file, Memcached, or any cache storage engine you want). This way I keep the original in the database and I'm not wasting resources compiling Markdown on every page visit; instead I serve the cached copy until it has expired or been deleted because of a change.
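
The caching idea above can be sketched roughly as follows. This is only an illustration, not the code I actually run: createRenderCache is a hypothetical name, the in-memory Map stands in for a file, Memcached or Redis, and `render` is whatever Markdown-to-HTML function you already use. Instead of expiring entries, this sketch re-renders whenever the stored Markdown's hash changes.

```javascript
const crypto = require('crypto');

// Wrap any Markdown-to-HTML function with a small cache.
// Hypothetical helper: the Map is a stand-in for a real cache store.
function createRenderCache(render) {
  const cache = new Map(); // postId -> { hash, html }

  return function renderCached(postId, markdown) {
    // Hash the source so an edited post invalidates its cached HTML.
    const hash = crypto.createHash('sha1').update(markdown).digest('hex');
    const hit = cache.get(postId);
    if (hit && hit.hash === hash) return hit.html; // unchanged: serve cached copy

    const html = render(markdown); // changed or new: render once and store
    cache.set(postId, { hash, html });
    return html;
  };
}

// Usage with a trivial stand-in renderer (not a real Markdown engine):
const renderCached = createRenderCache((md) => '<p>' + md + '</p>');
console.log(renderCached(1, 'hello'));
console.log(renderCached(1, 'hello')); // second call is served from the cache
```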

Accounts in serverside JavaScript (node.js)

I'm making an online game and I was wondering: how can users create accounts? Do I need a database, and if so, which one? Also, how do I get information from the users? I believe by using an HTML tag, but how exactly? I'm not using PHP for the server side, but Node.js.

There are so many answers to this question it's tough to begin at one place.
I'll just suggest some technologies because that seems to be what you're looking for. Ultimately I recommend you research this area and make up your own mind on what you'd like to use.
You can use a database to store the user information. NoSQL is popular nowadays so I'll go for a MongoDB solution http://www.mongodb.org/
You don't exactly need to use HTML tags as there are template solutions written for node.js. I recommend jade https://github.com/visionmedia/jade
There are frameworks and middleware created to make all of this easier. Check it out here https://github.com/joyent/node/wiki/modules
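
To give a feel for the moving parts, here is a minimal sketch using only Node's built-in modules. Everything in it is an assumption for illustration: the form field names (username, password), the function names, and the in-memory Map, which merely stands in for a real database such as MongoDB. It parses the body of an HTML form POST, stores a salted password hash instead of the password itself, and checks a login attempt.

```javascript
const crypto = require('crypto');

const users = new Map(); // stand-in for a database collection

// Parse the body of an HTML <form method="post"> submission
// (application/x-www-form-urlencoded), e.g. "username=a&password=b".
function parseSignupForm(body) {
  const params = new URLSearchParams(body);
  return {
    username: (params.get('username') || '').trim(),
    password: params.get('password') || '',
  };
}

// Create an account, storing only a random salt and the derived hash.
function createAccount({ username, password }) {
  if (!username || password.length < 8) {
    throw new Error('invalid username or password');
  }
  if (users.has(username)) throw new Error('username already taken');
  const salt = crypto.randomBytes(16);
  const hash = crypto.scryptSync(password, salt, 64);
  users.set(username, { salt, hash });
}

// Check a login attempt against the stored hash.
function checkPassword(username, password) {
  const user = users.get(username);
  if (!user) return false;
  const candidate = crypto.scryptSync(password, user.salt, 64);
  return crypto.timingSafeEqual(candidate, user.hash);
}

// Usage: the string below is what a browser would send for the form fields.
createAccount(parseSignupForm('username=player1&password=correct+horse'));
console.log(checkPassword('player1', 'correct horse'));
```

In a real server the form body would arrive through an HTTP request handler or one of the frameworks mentioned above, and sessions, validation and persistence would still need to be added.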

Advice on starting JS or JQuery for file processing

My knowledge of web technologies (JS, JQ) is limited and I want to start learning them. As a starting point I want to do some file processing, because it is something I have to do for my work and was planning to do in Java. What I basically need to do is go through a list of text files (assembly files) in a folder, search for routines, and then list them. This is the first step and is a trivial task in Java.
But I wanted to take this a step further and do it in the browser, so that others in my team can also use it without installing anything (and also to impress them a little bit in the process, since I'm the new guy in the team :-)).
So when I input the folder, the script will go through the files and search and will display results in a web page. Basically first page will be a list of files in the folder, and clicking a file name will take me to another page which displays the routines in that file.
Sorry to bother you with details, but what I actually want to know are:
Is this possible with JS? (to search for text patterns in a file)
Should I start with JS or JQ? (I think many would recommend starting with JS, but since this is a side project done purely in my own time, would you suggest starting with JQ because, from what I have read, it's relatively simpler for a beginner to learn?)
Or should I just do the processing in Java and then interface the results to a webpage?
Any advice is appreciated.
Thank you very much.

Java and JavaScript have nothing to do with each other; jQuery is a library written to simplify the use of JavaScript with some handy shortcuts.
I'm afraid JavaScript would not be able to parse text files, as its main use is manipulating content inside the browser window and it is limited by various security policies.
To parse files you have to choose a server-side language.

Maybe you can use Java to handle the file processing and then send the results to a JS script, which will show them to users.
JS's abilities are limited.

For security reasons, JavaScript is sandboxed within the browser, and has basically no access to the local file system. From what you have described, it sounds like your best option is to use Java to process ...whatever...
This function has nothing to do with web browsing. Why is a browser the best tool for the job, anyway?
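
On the narrower question of whether JavaScript can search for text patterns: once a file's contents are available as a string (for example when the script runs under Node.js rather than inside a browser page), the search itself is an ordinary regular-expression task. A minimal sketch follows; findRoutines is a hypothetical name, and it assumes that a "routine" starts at a label, i.e. an identifier at the very beginning of a line followed by a colon, which may not match your assembler's syntax.

```javascript
// Assumption: a routine starts at a label such as "init_uart:" placed at the
// start of a line. Adjust the pattern to your assembler's actual syntax.
function findRoutines(source) {
  const label = /^([A-Za-z_.$][\w.$]*):/;
  const routines = [];
  source.split(/\r?\n/).forEach((text, index) => {
    const match = label.exec(text);
    if (match) routines.push({ name: match[1], line: index + 1 });
  });
  return routines;
}

// Usage with a small inline sample instead of a real file:
const sampleSource = [
  'start:',
  '    mov r0, #1',
  'loop:',
  '    b loop',
].join('\n');
console.log(findRoutines(sampleSource));
```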

Store website not in database but in HTML

Could you recommend a good technology for my problem? I have a blog, and currently it is a classic dynamic page: MySQL + PHP + JavaScript. But I would like to rewrite it. When a user event happens, e.g. a user posts a comment, I wouldn't store it in a database but would store the changes in an HTML file. So the logic is not in the PHP file but in the HTML file. How can I do that in an easy way?

You could theoretically parse the static HTML file after a user event (e.g. a POST), append the result to the HTML, and write it back to the file. However, I wouldn't recommend it, as the script would have to be fairly complex to handle the HTML correctly.
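
To give a sense of what even the simplest version involves, here is a sketch of the "append to the HTML" step as a pure string function. The marker comment, the markup of a comment entry and the function names are all hypothetical conventions; the sketch escapes user input but does not deal with concurrent writes, editing, deleting or searching, which is where the complexity really starts.

```javascript
// Hypothetical convention: the page contains this marker where new
// comments should be inserted.
const MARKER = '<!-- comments-end -->';

// Escape user-supplied text so it cannot inject markup into the page.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Return a copy of `html` with one comment inserted just before the marker.
function appendComment(html, author, text) {
  const at = html.indexOf(MARKER);
  if (at === -1) throw new Error('comment marker not found');
  const entry =
    `<div class="comment"><b>${escapeHtml(author)}</b>: ${escapeHtml(text)}</div>\n`;
  return html.slice(0, at) + entry + html.slice(at);
}

// Usage on an in-memory page; a real script would read and write the file.
const pageHtml = '<body>\n<h1>Post</h1>\n' + MARKER + '\n</body>';
console.log(appendComment(pageHtml, 'Ann', 'Nice post <3'));
```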

If you want to edit static HTML files when the user posts a comment, then it is a bad approach IMHO. How would you implement searching, for example?
Don't try to replicate what a database does, because the people who wrote the database spent their whole lives doing just that.
You might want to search for existing blog solutions instead of writing one from scratch. There are many open-source projects written in many languages / platforms.
