Webscraper in node.js, JS modifies DOM - javascript

I'm trying to write a web scraper to get some sales leads. The problem is that in modern web design, most websites use JavaScript to modify the DOM (usually with React, Angular, or even just some jQuery). So if I scrape a website with the request Node.js package and pass the HTML to cheerio, I'm simply not able to parse the code and get the info I want. Instead, all I can see are some React.js components ¯\_(ツ)_/¯
Any resources on this topic will be helpful, thanks in advance.

That's because the request package will not execute any of the JavaScript on the page; it just downloads the HTML as-is. If you want to see the actual page the way a browser does, you would need something that executes all of the page's JavaScript and gives you the resulting DOM.
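To illustrate (a minimal sketch; the URL and the #root selector are hypothetical), this is roughly what happens when you point request and cheerio at a client-rendered site:

const request = require('request');
const cheerio = require('cheerio');

request('https://example.com/products', (err, res, html) => {
  if (err) throw err;
  const $ = cheerio.load(html);
  // On a client-rendered site this is often just an empty mount point,
  // because the data only appears after React runs in a browser.
  console.log($('#root').html());
});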
Luckily, there are some other options here:
You could open the developer tools on the website you want to scrape and look for the XHR requests that fetch the data you need. Then you can call those URLs directly.
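For example (a minimal sketch; the endpoint URL and response shape are hypothetical, so find the real ones in the Network tab of the developer tools):

const request = require('request');

request({
  url: 'https://example.com/api/products?page=1', // hypothetical XHR endpoint
  json: true,                                     // parse the JSON response for us
}, (err, res, body) => {
  if (err) throw err;
  // No cheerio needed: the data arrives as a plain object
  body.items.forEach(item => console.log(item.name));
});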
You could use a headless browser such as PhantomJS or CasperJS. These packages execute the page's JavaScript against the downloaded DOM, so you end up with roughly what a real browser would render.
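A minimal PhantomJS sketch (run with the phantomjs binary rather than node; the URL and the fixed delay are assumptions, and real pages may need smarter readiness checks):

var page = require('webpage').create();

page.open('https://example.com/products', function (status) {
  if (status !== 'success') { phantom.exit(1); }
  // Give the page's scripts a moment to render, then dump the live DOM
  setTimeout(function () {
    console.log(page.content); // fully rendered HTML, ready for cheerio
    phantom.exit();
  }, 2000);
});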

Related

React.js server side rendering with PHP

I would like to develop themes/plugins for WordPress based on React. To make them search-engine friendly, I need the markup to be rendered initially on the server (server-side rendering).
The only way to do this, as far as I know, is to use react-php-v8js, which requires the PECL V8js extension. This is a problem since I have no control over the platform on which these themes/plugins will be run.
Is there a way to make React and WordPress work together without having to install additional extensions? Perhaps by building/compiling React files into PHP?
There's an article that describes how to do this:
https://sebastiandedeyne.com/server-side-rendering-javascript-from-php/
But it's a fairly complex setup, and it requires using Composer. That can be difficult in WordPress projects, since WordPress tends to eschew modern PHP architecture entirely.
If you're looking for a library to help with SSR in PHP:
https://github.com/spatie/server-side-rendering
Best of luck on it.
If you want your content to be indexed by search engines without JS, you can print your minimal content with WordPress, just the bare minimum plus crucial meta tags, and maybe localize some initial state for your React app to boot from. A bare-bones theme such as http://underscores.me/ would be sufficient. When JS is available, you can replace the whole WordPress-generated content with the React-rendered version.
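On the client, that handoff might look like this (a minimal sketch assuming a JSX build step; the element id, the global state variable, and the App component are hypothetical names for whatever your theme localizes):

import React from 'react';
import ReactDOM from 'react-dom';
import App from './App'; // hypothetical root component

// WordPress already printed crawlable markup; once JS loads, React takes over.
const initialState = window.__APP_STATE__ || {}; // localized by PHP via wp_localize_script
ReactDOM.render(<App {...initialState} />, document.getElementById('app'));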
The ideal approach is to have React generate the content for you, but that's hard until Node.js or the PECL V8js extension is available everywhere.
If you can at least install Node.js and launch a node process, then it should be fine, although it is not so simple.
You would need to generate an SSR build of your assets and use it in a node process that listens on a socket and writes back the HTML result.
In your controller you can open a socket to your node process (something like stream_socket_client(...)) and then send a function written as a JavaScript string over that socket (something like stream_socket_sendto($sock, "getResultForMyWidget(someParams){...}")). That function gets evaluated in the node process, which returns the response to the controller (the HTML produced by ReactDOMServer.renderToString for the component you want to render).
That's it for the big picture.
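The node side of that pipeline could look roughly like this (a minimal sketch; the port, the bundle path, and the JSON-props protocol are assumptions that simplify the eval-a-function setup described above):

const net = require('net');
const React = require('react');
const ReactDOMServer = require('react-dom/server');
const MyWidget = require('./ssr-bundle').MyWidget; // hypothetical pre-built SSR bundle

net.createServer((socket) => {
  socket.on('data', (chunk) => {
    // The real setup evals a JS function string sent by PHP; here we just
    // treat the payload as JSON props to keep the sketch short.
    const props = JSON.parse(chunk.toString());
    const html = ReactDOMServer.renderToString(React.createElement(MyWidget, props));
    socket.end(html); // the PHP controller reads this back as the rendered markup
  });
}).listen(3001, '127.0.0.1');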
There is a Symfony plugin that illustrates this very clearly (see this github) and comes with a dummy node server process that shows how to listen on the socket, eval the incoming function, and return the HTML result. See also the example in the sandbox for a bigger picture and an in-depth implementation. You should be able to adapt it to WordPress.

How to use Javascript as a backend script to make server side api calls without using HTML in Jint

This link sets up the context for this question.
I am trying to use JavaScript as the backend code for my mobile application (Windows), which has a native UI (not an HTML UI). That means I don't have HTML, and hence I don't have a DOM.
I have successfully been able to create functions that do some local computation, like add, call them from my C# or Java code, and get the return values.
Now I am running into two problems:
1) I am trying to take my JavaScript functions one step further by calling server APIs from the JavaScript using XMLHttpRequest. But I am getting a script execution exception in my C# (this usually occurs if the script was unable to run). I think this is because XMLHttpRequest needs the DOM, and I don't have a DOM.
If my first problem is solved, then:
2) How do I get hold of a different JavaScript file when I don't have a DOM? For example, let's say I want to use jQuery to simplify my requests with $.ajax etc. How do I load the jQuery library when I have neither the luxury of a script tag nor the luxury of $.getScript, because I am trying to get jQuery itself?
One possible solution
There is one solution I can think of off the top of my head: use a WebView, load your HTML in the WebView, and use Jint for the JavaScript. But then the question arises: how do I use a WebView in tandem with Jint?
I would be glad if someone can point me in the right direction.
Thanks in advance.

How to load static resources from server-side javascript in CouchDB

For CouchDB, I know that show functions can generate HTML / images / XML feeds on the fly.
In that case, though, the resources have to be embedded in the script itself, encoded (e.g. base64 for an image), as in here.
What is the best way to load static resources that are attachments of design documents, e.g. something as simple as JSON, or images, and process them with server-side JavaScript?
The script file itself is an attachment in the design doc. The variable doc is not available.
Is there any way to do this similar to Node.js? Or do we use a trick in a context like _show or _list to show the document with id _design/ddoc?
Doing a REST request inside that environment is, I believe, also not possible, as XMLHttpRequest is not available. Is establishing a DB connection also not possible?
This is supposed to be a simple question; I wonder if I am missing something in CouchDB.
In order to serve a website directly, you need to use URL rewrites. You'd rewrite / to go to one of your show functions to bootstrap your site with basic HTML and JS (probably embedded).
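A minimal sketch of what that design document might look like (the names and exact rewrite rules are assumptions; check the rewrite docs for your CouchDB version). The / route bootstraps the page from a show function, and everything else falls through to the design document's attachments:

{
  "_id": "_design/site",
  "rewrites": [
    { "from": "/",  "to": "_show/index" },
    { "from": "/*", "to": "*" }
  ],
  "shows": {
    "index": "function (doc, req) { return { headers: { 'Content-Type': 'text/html' }, body: '<html>...</html>' }; }"
  }
}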
A lot of this work has been done already by CouchApps (basic tutorial here). This is by far the easiest way to get started. This seems to be the way http://npmjs.org is served.
This isn't the place for a walkthrough, so hopefully this gives you enough information to get started.
If your site needs server-side logic (WebSockets, for example), this solution won't work for you. All you get with a CouchApp is a database, HTML, CSS, and JavaScript.

How can I load an html file without issuing an http request using jquery?

I will try to summarize as best I can what I need and what is blocking me from doing it.
What I need
I need to append script tags to the head of an HTML file, BUT during my "build" process. I'm using Ant as an automation build tool, and I would like to avoid placing tokens in my HTML file and replacing them with Ant; I would also like to avoid any halfway solution using regular-expression matching. What I would really like is plain JavaScript running through the Rhino JavaScript interpreter, executed easily from an Ant task, that finally adds the script tag dynamically.
What is blocking me?
I really don't know any way to load an HTML file without issuing a GET or POST HTTP request. Because I'm building my code from source, I don't have it under an HTTP server, so I wish I could find some way to load the HTML DOM into a JavaScript variable and then write it back with the new script tag that I need.
I need all the DOM manipulation features without having a browser that renders the HTML file.
Best!
Demian
From what I understand, you would like to have a valid DOM object from an HTML file, as if you were running in a browser, but "offline", e.g. be able to run a jQuery selector on the DOM and edit it?
You could start by looking into an embedded open-source browser (http://www.chromium.org/).
But I would look into Node.js; see this question: Can I use jQuery with Node.js?
As far as I understand, this will allow you to do DOM traversal and modification without a browser, as in the sketch below.
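For instance, with cheerio (one option; jsdom plus jQuery would also work), assuming a local build file whose path is hypothetical:

const fs = require('fs');
const cheerio = require('cheerio');

// Read the HTML straight from disk -- no HTTP request involved
const html = fs.readFileSync('build/index.html', 'utf8');
const $ = cheerio.load(html);

// Append the script tag during the build, then write the file back
$('head').append('<script src="bundle.js"></script>');
fs.writeFileSync('build/index.html', $.html());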

Web crawler: Using Perl's MozRepl module to deal with Javascript

I am trying to save a couple of web pages using a web crawler. Usually I prefer doing it with Perl's WWW::Mechanize module. However, as far as I can tell, the site I am trying to crawl uses a lot of JavaScript, which seems hard to avoid. Therefore I looked into the following Perl modules:
WWW::Mechanize::Firefox
MozRepl
MozRepl::RemoteObject
The Firefox MozRepl extension itself works perfectly. I can use the terminal to navigate the website just the way it is shown in the developer's tutorial, in theory. However, I know nothing about JavaScript and am therefore having a hard time using the modules properly.
So here is the page I'd like to start from: Morgan Stanley
For a couple of the firms listed beneath 'Companies - as of 10/14/2011', I'd like to save their respective pages. E.g., clicking on the first listed company (i.e. '1-800-Flowers.com, Inc') calls a JavaScript function with two arguments -> dtxt('FLWS.O','2011-10-14'), which produces the desired new page. That page is what I'd like to save locally.
With perl's MozRepl module I thought about something like this:
use strict;
use warnings;
use MozRepl;

# Connect to the MozRepl extension running inside Firefox
my $repl = MozRepl->new;
$repl->setup;

# Open the coverage page in a new browser window
$repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');

# Switch the REPL context into the page's content
$repl->repl_enter({ source => "content" });

# Call the page's own JS function to load the company page
$repl->execute('dtxt("FLWS.O", "2011-10-14")');
Now I'd like to save the resulting HTML page.
So again: the code I'd like to produce should visit the pages of a couple of firms and simply save each one. (Here are e.g. three firms: MMM.N, FLWS.O, SSRX.O)
Is it correct that I cannot get around the page's JavaScript functions and therefore cannot use WWW::Mechanize?
Following question 1, are the mentioned Perl modules a plausible approach to take?
And finally, if the first two questions can be answered with yes, it would be really nice if you could help me out with the actual coding. E.g., in the above code, the essential missing piece is a 'save' command. (Maybe using Firefox's saveDocument function?)
The web works via HTTP requests and responses. If you can discover the proper request to send, then you will get the proper response. If the target site uses JS to form the request, then you can either execute the JS, or analyse what it does so that you can do the same in the language you are using.
An even easier approach is to use a tool that will capture the resulting request for you, whether the request is created by JS or not; then you can craft your scraping code to create the request that you want.
The "Web Scraping Proxy" from AT&T is such a tool. You set it up, then navigate the website as normal to get to the page you want to scrape, and the WSP will log all requests and responses for you. It logs them in the form of Perl code, which you can then modify to suit your needs.
