State of HTML after onload JavaScript

Many webpages use onload JavaScript to manipulate their DOM. Is there a way I can automate accessing the state of the HTML after these JavaScript operations?
A tool like wget is not useful here because it just downloads the original source.
Is there perhaps a way to use a web browser rendering engine?
Ideally I am after a solution that I can interface with from Python.
thanks!

The only good way I know to do such things is to automate a browser, for example via Selenium RC. If you have no way to deduce that the page has finished running the relevant JavaScript, then, just like a real live user visiting that page, you'll just have to wait a while, grab a snapshot, wait some more, grab another, and check that there was no change between them to convince yourself that it's really finished.
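
A minimal sketch of that wait-and-compare approach, using the current Selenium WebDriver Python bindings rather than the older Selenium RC (Chrome and a matching chromedriver are assumed to be installed; the URL is a placeholder):

    # Keep snapshotting the rendered DOM until two snapshots match.
    import time
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("http://example.com/")      # placeholder URL

    previous = ""
    while True:
        time.sleep(2)                      # give onload scripts time to run
        current = driver.page_source       # HTML after JavaScript modified the DOM
        if current == previous:            # no change between snapshots: assume it's done
            break
        previous = current

    print(current)
    driver.quit()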

Please see related info on Stack Overflow:
screen-scraping
Screen Scraping from a web page with a lot of Javascript

Related

how to get javascript generated content from webpage directly?

I am currently using selenium with PhantomJS to scrape javascript generated content from a web page. While this does get me the results I am looking for, it is a slow approach as I need to wait for the page to load before scraping. Is there a way to directly run the javascript that generates the content I am looking for? If there is, will it be a faster approach than I am currently using?
Thanks!
Unfortunately, there is not. I've encountered this problem several times and the only solution I've come up with is to approach the problem the way you are already doing.
Since the content is generated by JavaScript, the only way to fetch it is to have a browser render it, which means using Selenium with whichever driver you prefer.
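
There is no way around rendering the page, but if you know which element the JavaScript eventually produces, an explicit wait at least avoids sleeping longer than necessary. A rough sketch with the Python bindings, assuming a hypothetical element id (note that PhantomJS support was deprecated in later Selenium releases):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS()         # or whichever driver you prefer
    driver.get("http://example.com/")      # placeholder URL

    # Return as soon as the generated element shows up, or raise after 10 s.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "generated-content"))  # hypothetical id
    )
    print(element.text)
    driver.quit()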

How Can I Scrape Data From Websites That Don't Return Simple HTML

I have been using requests and BeautifulSoup for Python to scrape html from basic websites, but most modern websites don't just deliver html as a result. I believe they run javascript or something (I'm not very familiar, sort of a noob here). I was wondering if anyone knows how to, say, search for a flight on Google Flights and scrape the top result, aka the cheapest price?
If this were simple html, I could just parse the html tree and find the text result, but this does not appear when you view the "page source". If you inspect the element in your browser, you can see the price inside html tags as if you were looking at the regular page source of a basic website.
What is going on here that the inspect element has the html but the page source doesn't? And does anyone know how to scrape this kind of data?
Thanks so much!
You're spot on -- the page markup is getting added with javascript after the initial server response. I haven't used BeautifulSoup, but from its documentation, it looks like it doesn't execute javascript, so you're out of luck on that front.
You might try Selenium, which is basically a virtual browser -- people use it for front-end testing. It executes javascript, so it might be able to give you what you want.
But if you're specifically looking for Google Flights information, there's an API for that :) https://developers.google.com/qpx-express/v1/
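
If you do go the Selenium route, one workable pattern is to let the browser execute the JavaScript and then hand the rendered HTML to BeautifulSoup, which you already know. A minimal sketch; the URL and the CSS selector are placeholders, not the real Google Flights markup:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("http://example.com/")          # placeholder URL
    html = driver.page_source                  # markup *after* JavaScript has run
    driver.quit()

    soup = BeautifulSoup(html, "html.parser")
    price = soup.select_one(".price")          # hypothetical selector
    print(price.get_text(strip=True) if price else "not found")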
You might consider using Scrapy, which will allow you to scrape a page along with a lot of other spider functionality. Scrapy has a great integration with Splash, which is a library you can use to execute the JavaScript in a page. Splash can be used stand-alone, or you can use the scrapy-splash plugin for the integration.
Note that Splash essentially runs its own server to do the JavaScript execution, so it's something that runs alongside your main script and gets called from it. Scrapy manages this via 'middleware', a set of processes that run on every request: in your case you would fetch the page, run the JavaScript in Splash, and then parse the results.
This may be a slightly lighter-weight option than plugging into Selenium or the like, especially if all you're trying to do is render the page rather than render it and then interact with various parts in an automated fashion.
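
A minimal sketch of what the Scrapy + Splash setup might look like, assuming Splash is running locally on port 8050 and the scrapy-splash package is installed (middleware names and priorities may differ between versions):

    import scrapy
    from scrapy_splash import SplashRequest

    class RenderedSpider(scrapy.Spider):
        name = "rendered"
        custom_settings = {
            "SPLASH_URL": "http://localhost:8050",
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_splash.SplashCookiesMiddleware": 723,
                "scrapy_splash.SplashMiddleware": 725,
            },
            "SPIDER_MIDDLEWARES": {
                "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
            },
        }

        def start_requests(self):
            # args={'wait': ...} gives the page's JavaScript time to run inside Splash
            yield SplashRequest("http://example.com/", self.parse, args={"wait": 1.0})

        def parse(self, response):
            # response here contains the HTML after Splash executed the JavaScript
            yield {"title": response.css("title::text").extract_first()}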

HttpClient with Selenium or HTMLUNIT?

Alright, so I'm in a small pickle. I'm running into issues with JSoup since the page needs JavaScript to finish loading some of the page. Fortunately, I've worked around it in the past (parsed the raw JavaScript code), but it's very tedious. As of late, I tried to make a program to log in to a website, but it requires a token from a form element. That form element is not visible unless JavaScript is executed, so it won't show up at all for me to even extract. So I decided to look into Selenium.
First question, is this the library I should be looking into? The reason why I'm so bent on using HttpClient is because some of these websites are very high in traffic and don't load up all the way, BUT I don't need these pages to load up all the way. I just need them to load up enough to where I can retrieve the login token. I prefer to communicate with the webserver with raw JSON/POST methods once I discover the methods required vs. having Selenium automate a click/wait/type sequence.
Basically, I only need selenium to load up 1/4 of the page, just to retrieve request tokens. The rest of my program will send POST methods using HttpClient.
Or should I just let Selenium do all the work? My goal is speed. I need to log in and purchase an item fast.
Edit: Actually, I might go with HtmlUnit because it's very minimal. I only need to scrape information, and I don't want to run Selenium's standalone server. Is this the better approach?
Basically, HtmlUnit is quicker than Selenium, so if you are going for speed you should use that. Anyway, keep in mind that Selenium has its own HtmlUnitDriver implementation. So, as another option, you could use Selenium with HtmlUnit. The difference between them is that HtmlUnit is itself a browser without a GUI, whereas Selenium works by driving a real browser's features. You may want to take a look at this other question for further details: Selenium vs HtmlUnit?

Web scraping a website with dynamic javascript content

So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. Problem is, when I use urllib to grab the html of a page, it's not the entire page because some of it is generated via JavaScript. Is there any way to get around this?
There are basically two main options to proceed with:
using browser developer tools, see what AJAX requests are made to load the page and simulate them in your script; you will probably need to use the json module to load the response JSON string into a Python data structure (a sketch of this follows below)
use tools like selenium that open up a real browser. The browser can also be "headless", see Headless Selenium Testing with Python and PhantomJS
The first option is more difficult to implement and it's, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in the sense that you get what any other real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page - you may not need BeautifulSoup at all. But, anyway, this option is slower than the first one.
Hope that helps.
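
A minimal sketch of the first option, assuming you have already found the AJAX endpoint in the browser's developer tools (Network tab); the URL and the JSON shape below are hypothetical:

    import json
    import requests

    response = requests.get(
        "http://example.com/api/items",              # hypothetical JSON endpoint
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    data = json.loads(response.text)                 # or simply response.json()
    for item in data.get("items", []):               # hypothetical response structure
        print(item)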

analyse a .aspx site with paging with __doPostBack function

I want to analyse some data on a webpage, but here's the problem: the site has more pages, which get loaded with a __doPostBack function.
How can I "simulate" going one page further, analyse that page, and so on?
At this time I analyse the data with JSoup in java - but I'm open to use some other language if it's necessary.
A postback-based system (.NET, Prado/PHP, etc.) works in a manner that keeps a complete snapshot of the browser contents on the server side. This is called a pagestate. Any attempt to manipulate it with a client that is not JavaScript-capable is almost sure to fail.
What you need is a JavaScript-capable browser. The easiest solution I found is to use the framework Firefox is written in - XUL - to create such a desktop application. What you do is basically create a desktop application with a single browser element in it, which you can then script from the application itself without restrictions of the security container. Alternatively, you could also use the Greasemonkey plugin to do your bidding. The latter is a bit easier to get started with, but it's fairly limited since it's running on a per-page basis.
With both solutions you then have access to the page's DOM to gather data and you can also fire events (like clicking on a button). Unfortunately you have to learn JavaScript for this to work.
I used an automation library called Selenium, which you can use from a lot of languages (C#, Java, Perl, ...).
For more information on how to get started, this link is very helpful.
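
A rough sketch of the Selenium route with the Python bindings (the same idea works with the Java bindings); the URL, pager link text, and page count are hypothetical:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://example.com/list.aspx")       # placeholder URL

    for page in range(2, 6):                         # pages 2..5, adjust as needed
        html = driver.page_source                    # analyse the current page here
        # The pager links call __doPostBack from their own onclick handler, so an
        # ordinary click lets the site's JavaScript perform the postback itself.
        driver.find_element(By.LINK_TEXT, str(page)).click()

    driver.quit()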
As well as Selenium, you can use http://watin.org/
