Web crawler: Using Perl's MozRepl module to deal with Javascript

Web crawler: Using Perl's MozRepl module to deal with Javascript - javascript

I am trying to save a couple of web pages by using a web crawler. Usually I prefer doing it with perl's WWW::Mechanize modul. However, as far as I can tell, the site I am trying to crawl has many javascripts on it which seem to be hard to avoid. Therefore I looked into the following perl modules
WWW::Mechanize::Firefox
MozRepl
MozRepl::RemoteObject
The Firefox MozRepl extension itself works perfectly. I can use the terminal for navigating the web site just the way it is shown in the developer's tutorial - in theory. However, I have no idea about javascript and therefore am having a hard time using the moduls properly.
So here is the source i like to start from: Morgan Stanley
For a couple of listed firms beneath 'Companies - as of 10/14/2011' I like to save their respective pages. E.g. clicking on the first listed company (i.e. '1-800-Flowers.com, Inc') a javascript function gets called with two arguments -> dtxt('FLWS.O','2011-10-14'), which produces the desired new page. The page I now like to save locally.
With perl's MozRepl module I thought about something like this:
use strict;
use warnings;
use MozRepl;
my $repl = MozRepl->new;
$repl->setup;
$repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');
$repl->repl_enter({ source => "content" });
$repl->execute('dtxt("FLWS.O", "2011-10-14")');
Now I like to save the produced HTML page.
So again, the desired code I like to produce should visit for a couple of firms their HTML site and simply save the web page. (Here are e.g. three firms: MMM.N, FLWS.O, SSRX.O)
Is it correct, that I cannot go around the page's javascript functions and therefore cannot use WWW::Mechanize?
Following question 1, are the mentioned perl modules a plausible approach to take?
And finally, if you say the first two questions can be anwsered with yes, it would be really nice if you can help me out with the actual coding. E.g. in the above code, the essential part which is missing is a 'save'-command. (Maybe using Firefox's saveDocument function?)

The web works via HTTP requests and responses.
If you can discover the proper request to send, then you will get the proper response.
If the target site uses JS to form the request, then you can either execute the JS,
or analyse what it does so that you can do the same in the language that you are using.
An even easier approach is to use a tool that will capture the resulting request for you, whether the request is created by JS or not, then you can craft your scraping code
to create the request that you want.
The "Web Scraping Proxy" from AT&T is such a tool.
You set it up, then navigate the website as normal to get to the page you want to scrape,
and the WSP will log all requests and responses for you.
It logs them in the form of Perl code, which you can then modify to suit your needs.

Related

Webscraper in node.js, JS modifies DOM

I'm trying to write a webscraper, to get some sales leads. The problem is that in modern webdesign, most of websites uses some JavaScript to modify DOM (usually using React, Angular, or even just some jQuery). The problem is, that if I scrap some website by request node.js package, and pass html code to cheerio, then I'm simply not able to parse the code and get the info I want. Instead, all I can see are some React.js components ¯_ツ_/¯
Any resources on this topic will be helpful, thanks in advance.

Because the request package will not execute any of the javascript on the page. It will just download the html as is. If you want to see the actual page like a browser does, you would have to create a javascript parser that executes all javascript code in the state you want it to.
Luckily, there are some other options here:
You could take a look at the developer tools on the website you want to scrape and try to find the xhr requests that fetches the data you need. Then you can call this url directly.
You could use headless browser scraping like PhantomJS or CasperJS. These are packages that will try and modify the downloaded dom as good as possible with the included javascript resources.

Get the minimal amount of code used in JQuery/JSApi (legally)?

I have built a really simple script using JQuery/JSApi. I want to deploy it to my raspberry pi. As such, I need to make it only use the minimal amount of code possible (Pi is already full!).
It will not have a network connection, so let's say I just want to grab one file from the JSApi (as an example - I will not actually do this as it isn't legal).
So, I opened fiddler, opened my webpage, and saw what dependencies it had. After loading the JSApi, it fetched the following two files :
GET www.google.com/uds/api/visualization/1.0/69d4d6122bf8841d4832e052c2e3bf39/format+en,default+en.I.js HTTP/1.1
GET www.google.com/uds/?file=visualization&v=1 HTTP/1.1
So, two questions - firstly, is there any legal way for me to get this file and host it locally for JSApi?
Assuming no is the answer to this, let's assume the files are JQuery modules - where I believe this would be allowed. How would I grab them, and point to them locally? When I try to navigate to either of the addresses above, for example, I get an error message or nothing loads - so it is not possible to include these modules (or other JQuery modules) separately ?
Thank you!

I don't know if this will work in your exact setup, but you can do something like the following to get it pared down to the bare essentials:
Write up your webpage with the full jQuery package and any modules necessary (per your second example)
Pass all the javascript you're using through something like Google's Closure Compiler: https://closure-compiler.appspot.com/home
Include that compiled javascript, which will include only the absolutely necessary functions.

Angularjs vs SEO vs pushState

After reading this thread I decided to use pushstate api in my angularjs application which is fully API-based (independent frontend and independent backend).
Here is my test site: http://huyaks.com/index.html
I created a sitemap and uploaded to google webmaster tools.
From what I can see:
google indexed the main page, indexed the dynamic navigation (cool!) but did not index
any of dynamic urls.
Please take a look.
I examined the example site given in the related thread:
http://html5.gingerhost.com/london
As far as I can see, when I directly access a particular page the content which is presumed to be dynamic is returned by the server therefore it's indexed. But it's impossible in my case since my application is fully dynamic.
Could you, please, advise, what's the problem in my particular case and how to fix it?
Thanks in advance.
Note: this question is about pushState way. Please do not advise me to use escaped fragment or 3-d party services like prerender.io. I'd like to figure out how to use this approach.

Evidently Quentin didn't read the post you're referring to. The whole point of http://html5.gingerhost.com/london is that it uses pushState and proves that it doesn't require static html for the benefit of spiders.
"This site uses HTML5 wizrdry [sic] to load the 'actual content' asynchronusly [sic] to the rest of the code: this makes it faster for users, but it's still totally indexable by search engines."
Dodgy orthography aside, this demo shows that asynchronously-loaded content is indexable.

As far as I can see, when I directly access a particular page the content which is presumed to be dynamic is returned by the server
It isn't. You are loading a blank page with some JavaScript in it, and that JavaScript immediately loads the content that should appear for that URL.
You need to have the server produce the HTML you get after running the JavaScript and not depend on the JS.

Google does interpret Angular pages, as you can see on this quick demo page, where the title and meta description show up correctly in the search result.
It is very likely that if they interpret JS at all, they interpret it enough for thorough link analysis.
The fact that some pages are not indexed is due to the fact that Google does not index every page they analyze, even if you add it to a sitemap or submit it for indexing in webmaster tools. On the demo page, both the regular and the scope-bound link are currently not being indexed.
Update: so to answer the question specifically, there is no issue with pushState on the test site. Those pages simply do not contain value-adding content for Google. (See their general guidelines).

Sray, I recently opened up the same question in another thread and was advised that Googlebot and Bingbot do index SPAs that use pushState. I haven't seen an example that ensures my confidence, but it's what I'm told. To then cover your bases as far as Facebook is concerned, use open graph meta tags.
I'm still not confident about pushing forward without sending HTML snippets to bots, but like you I've found no tutorial telling how to do this while using pushState or even suggesting it. But here's how I imagine it would work using Symfony2...
Use prerender or another service to generate static snippets of all your pages. Store them somewhere accessible by your router.
In your Symfony2 routing file, create a route that matches your SPA. I have a test SPA running at localhost.com/ng-test/, so my route would look like this:
# Adding a trailing / to this route breaks it. Not sure why.
# This is also not formatting correctly in StackOverflow. This is yaml.
NgTestReroute:
----path: /ng-test/{one}/{two}/{three}/{four}
----defaults:
--------_controller: DriverSideSiteBundle:NgTest:ngTestReroute
--------'one': null
--------'two': null
--------'three': null
--------'four': null
----methods: [GET]
In your Symfony2 controller, check user-agent to see if it's googlebot or bingbot. You should be able to do this with the code below, and then use this list to target the bots you're interested in (http://www.searchenginedictionary.com/spider-names.shtml)...
if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
// what to do
}
If your controller finds a match to a bot, send it the HTML snippet. Otherwise, as in the case with my AngularJS app, just send the user to the index page and Angular will correctly do the rest.
Also, has your question been answered? If it has, please select one so I and others can tell what worked for you.
HTML snippets for AngularJS app that uses pushState?

Converting a web app into an embeddable <script> tag

I just did a proof of concept/demo for a web app idea I had but that idea needs to be embedded on pages to work properly.
I'm now done with the development of the demo but now I have to tweak it so it works within a tag on any websites.
The question here is:
How do I achieve this without breaking up the main website's stylesheets and javascript?
It's a node.js/socket.io/angularjs/bootstrap based app for your information.
I basically have a small HTML file, a few css and js files and that's all. Any idea or suggestions?

If all you have is a script tag, and you want to inject UI/HTML/etc. into the host page, that means that an iframe approach may not be what you want (although you could possibly do a hybrid approach). So, there are a number of things that you'd need to do.
For one, I'd suggest you look into the general concept of a bookmarklet. While it's not exactly what you want, it's very similar. The problems of creating a bookmarklet will be very similar:
You'll need to isolate your JavaScript dependencies. For example, you can't load a version of a library that breaks the host page. jQuery for example, can be loaded without it taking over the $ symbol globally. But, not all libraries support that.
Any styles you use would also need to be carefully managed so as to not cause issues on the host page. You can load styles dynamically, but loading something like Bootstrap is likely going to cause problems on most pages that aren't using the exact same version you need.
You'll want your core Javascript file to load quickly and do as much async work as possible as to not affect the overall page load time (unless your functionality is necessary). You'll want to review content like this from Steve Souders.
You could load your UI via a web service or you could construct it locally.
If you don't want to use JSONP style requests, you'll need to investigate enabling CORS.
You could use an iframe and PostMessage to show some UI without needing to do complex wrapping/remapping of the various application dependencies that you have. PostMessage would allow you to send messages to tell the listening iFrame "what to do" at any given point, while the code that is running in the host page could move/manipulate the iframe into position. A number of popular embedded APIs have used this technique over the years. I think DropBox was using it for example.

newbie question about javascript embed code?

I am a javascript newbie. I am trying to write a requirements document, and need some help describing what I am looking for. We want our application to generate a javascript snippet like this:
<script src="http://www.jotform.com/jsform/10511502633"></script>
This will load a web form.
So my question is:
- How does a single script load an entire web form? Is this a JSON?
- What is this called? Is this a cross browser javascript?
- Can anyone point me in the direction of learning more about what this is?
Thank you for your help!

The javascript file is just hosted on an external site. It appears to be dynamically generated, so feel free to use some fancy words ;) But basically, you just include it here, as if it was on your own site.
You could say "The application will generate the required script-tags to include dynamically generated javascript file from an external, third-party site".
Offcourse you need to take special cautions for cases when the include won't work, because the other site is not reachable (site is down, DNS does not work, file is moved on other webserver, your application is on an intranet/behind a proxy/firewall...). Why can't you copy their file and mirror it locally? Or use a reliable Content Delivery Network, like Google or Amazon.

There are many names for this type of inclusion. The most common being widget.
What does it actually do:
take an id of some sort as parameter
use the id to fetch some specific data (most likely from a database)
generate some js and html based on the id/data
usually this involves iframes of some sort.
To use a script rather than an html iframe has multiple advantages
you can change what is actually delivered to the users browsers without changing the include
you can resize the iframe to fit certain predefined sizes
you can inject the necessary things into the page the widget is included (of course you need to make sure this is sanctioned)
We use this all the time and we never regreted it.
If you don't want to build the widget infrastructure yourself you can always use one of the widget providers like widgetbox:
http://www.widgetbox.com/widgets/make/
With those you are up and running in no time.

This is typically called a script include.

Google have lots of these types of items, and even they call them by many names,
widgets, custom javascript, snippets, custom code, etc. It really depending on who you are writing for... I would go with "cross platform embeddable javascript code" meaning that it would need to load all its dependancies. Also specify which browsers need to be supported and what should happen is the user has javascript turned off.
EDIT :
Actually since we are talking unique IDs, you will need 2 parts probably, the user/site unique "cross platform embeddable javascript code" and whatever serverside code to support it. Basically this is an API that is accessed using your own javascript widget. Feel free you point to examples in your requirements document, programmers love examples.

Develop Reference

JavaScript is the programming language of the Web.