Python BeautifulSoup on JavaScript tables with multiple pages

I used to have a Python script that pulled data from the table below using Mechanize and BeautifulSoup. However, the site recently changed the table to be rendered with JavaScript, and I'm having trouble working with it because the table spans multiple pages.
http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2011&month=0&season1=&ind=0&team=25&players=0
For example, in the link above, how could I grab the data from both page 1 and page 2 of the table? FWIW, the URL doesn't change.

Your best bet is to run a headless browser, e.g. PhantomJS, which understands all the intricacies of JavaScript, the DOM, etc., but you will have to write your code in JavaScript. The benefit is that you can do whatever you want: parsing HTML with BeautifulSoup is fine for a while, but it becomes a headache in the long term. So why scrape when you can access the DOM directly?

Mechanize doesn't handle javascript.
You could observe what requests are made when you click the button (using Firebug in Firefox or the Developer Tools in Chrome), then try to reverse engineer the JavaScript running behind the page and replicate it in your Python code; for that, take a look at Spidermonkey. Or:
Try using Selenium.
Selenium is a functional testing framework that automates the browser to perform certain operations, which in turn exercise the underlying behaviour of your code.
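Putting that together, here is a minimal, hedged sketch that drives the FanGraphs page with Selenium and hands each rendered page to BeautifulSoup. The locators (the rgMasterTable class and the "Next" link text) are assumptions; confirm the real ones in the developer tools.

```python
# Sketch: page through a JavaScript-rendered table with Selenium,
# parsing each page with BeautifulSoup. Locators are assumptions.
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get("http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all"
           "&qual=0&type=8&season=2011&month=0&season1=&ind=0&team=25&players=0")

rows = []
while True:
    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find("table", class_="rgMasterTable")  # hypothetical class
    if table is not None:
        rows.extend(table.find_all("tr"))
    try:
        # Click the pager's "Next" link; the link text is an assumption.
        driver.find_element(By.LINK_TEXT, "Next").click()
    except NoSuchElementException:
        break  # no more pages
    time.sleep(2)  # crude wait for the table to re-render

driver.quit()
print("collected", len(rows), "rows")
```

A WebDriverWait on the table contents would be more robust than the fixed sleep, but the sleep keeps the sketch short.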

Related

How Can I Scrape Data From Websites That Don't Return Simple HTML

I have been using requests and BeautifulSoup in Python to scrape HTML from basic websites, but most modern websites don't just deliver HTML as a result. I believe they run JavaScript or something (I'm not very familiar, sort of a noob here). I was wondering if anyone knows how to, say, search for a flight on Google Flights and scrape the top result, i.e. the cheapest price?
If this were simple HTML, I could just parse the HTML tree and find the text result, but the price does not appear when you view the page source. If you inspect the element in your browser, you can see the price inside HTML tags as if you were looking at the regular page source of a basic website.
What is going on here that the inspect element has the html but the page source doesn't? And does anyone know how to scrape this kind of data?
Thanks so much!
You're spot on -- the page markup is being added with JavaScript after the initial server response. I haven't used BeautifulSoup, but from its documentation it looks like it doesn't execute JavaScript, so you're out of luck on that front.
You might try Selenium, which is basically a virtual browser -- people use it for front-end testing. It executes javascript, so it might be able to give you what you want.
But if you're specifically looking for Google Flights information, there's an API for that :) https://developers.google.com/qpx-express/v1/
You might consider using Scrapy, which lets you scrape a page and provides a lot of other spider functionality. Scrapy has great integration with Splash, a library you can use to execute the JavaScript in a page. Splash can be used stand-alone, or you can use the scrapy-splash plugin.
Note that Splash essentially runs its own server to do the JavaScript execution, so it's something that runs alongside your main script and gets called from it. Scrapy manages this via 'middleware', a set of processes that run on every request: in your case, you fetch the page, run the JavaScript in Splash, and then parse the results.
This may be a slightly lighter-weight option than plugging into Selenium or the like, especially if all you're trying to do is render the page rather than render it and then interact with various parts in an automated fashion.
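As a rough illustration, here is a minimal spider sketch following the scrapy-splash README. It assumes a Splash instance is already running (e.g. via Docker) on localhost:8050; the URL and selector are placeholders.

```python
# Sketch of a Scrapy spider that renders JavaScript through Splash.
import scrapy
from scrapy_splash import SplashRequest

class RenderedSpider(scrapy.Spider):
    name = "rendered"

    # These would normally live in settings.py (see the scrapy-splash README).
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression."
            "HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        # 'wait' gives the page time to finish running its JavaScript.
        yield SplashRequest("http://example.com", self.parse,
                            args={"wait": 2.0})

    def parse(self, response):
        # response.text is the rendered HTML, so normal selectors work.
        for title in response.css("h1::text").getall():
            yield {"title": title}
```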

How to get the result of a javascript function from a python code using Beautiful Soup?

I want to scrape data from a website using Beautiful Soup in Python. The site changes the values of a drop-down menu based on the user's selection, and there is no API call involved. On taking a closer look, I observed that a JavaScript function is called internally to produce the drop-down values, so those values are not in the page source. Since there is no API call, I can't simply request them. Can anyone tell me how to call a JavaScript function from Python code? I'm using Beautiful Soup for the web scraping.
Thanks
You might be interested in the Pyv8 module; it lets you embed a javascript interpreter in Python code, but does not include a browser DOM. I give a short example in Why is BeautifulSoup not finding a specific table class?
For javascript that makes more extensive use of browser features, you may prefer ghost.py, a headless Webkit-based browser with a Python API.
Failing that, if you gave the page url, we could take a look at the javascript and see if there's a quick way to duplicate the call in Python.
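To give an idea of the PyV8 route, here is a minimal sketch. The optionsFor function is a hypothetical stand-in for whatever function the page actually defines; you would paste the real function body out of the page's source.

```python
# Sketch: evaluate a self-contained JavaScript function with PyV8.
# PyV8 embeds the V8 engine but has no browser DOM, so this only works
# for functions that don't touch document/window.
import PyV8

js_source = """
function optionsFor(category) {          // hypothetical stand-in
    return category === 'a' ? ['x', 'y'] : ['z'];
}
"""

ctxt = PyV8.JSContext()
ctxt.enter()
try:
    ctxt.eval(js_source)                   # define the function
    result = ctxt.eval("optionsFor('a')")  # call it
    print(list(result))                    # JS arrays are iterable from Python
finally:
    ctxt.leave()
```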
You can't. BeautifulSoup is an HTML parser.
You want to do more than parse HTML; you want to evaluate Javascript.
Perhaps you are looking for a Javascript-capable browser, like Selenium.
Beautiful Soup can't execute JavaScript, so it can't see content that JavaScript loads. You should use something like Selenium.

Web scraping a website with dynamic javascript content

So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, I don't get the entire page, because some of it is generated via JavaScript. Is there any way to get around this?
There are basically two main options to proceed with:
using browser developer tools, see what AJAX requests are made to load the page and simulate them in your script; you will probably need the json module to load the response JSON string into a Python data structure
use tools like Selenium that open up a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS
The first option is more difficult to implement and, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in the sense that you get what any other real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page - you may not need BeautifulSoup at all. But, either way, this option is slower than the first one.
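To illustrate the first option, here is a hedged sketch; the endpoint, parameters, and response structure are hypothetical stand-ins for whatever request you actually observe in the network tab.

```python
# Sketch: replay an AJAX call found in the browser's network tab.
import json
import requests

resp = requests.get(
    "http://example.com/api/data",                   # hypothetical endpoint
    params={"page": 1},
    headers={"X-Requested-With": "XMLHttpRequest"},  # some sites check this
)
data = json.loads(resp.text)  # or simply resp.json()
for item in data["items"]:    # key depends on the site's response format
    print(item)
```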
Hope that helps.

Evaluate javascript on a local html file (without browser)

This is part of a project I am working on for work.
I want to automate a Sharepoint site, specifically to pull data out of a database that I and my coworkers only have front-end access to.
I FINALLY managed to get mechanize (in Python) to accomplish this using Python-NTLM, and by patching part of its source code to fix a recurring error.
Now I am at what I hope is my final roadblock: part of the form I need to submit seems to be the output of a JavaScript function :| and lo and behold... Mechanize does not support JavaScript. I don't want to emulate the JavaScript functionality myself in Python, because I would ideally like a reusable solution...
So, does anyone know how I could evaluate the JavaScript in the local HTML I download from SharePoint? I just want to run the JavaScript somehow (to complete the loading of the page), but without a browser.
I have already looked into Selenium, but it's pretty slow for the amount of work I need to get done... I am currently looking into PyV8 to try to evaluate the JavaScript myself... but surely there must be an app or library (or anything) that can do this?
Well, in the end I came down to the following possible solutions:
Run Chrome headless and collect the html output (thanks to koenp for the link!)
Run PhantomJS, a headless browser with a javascript api
Run HTMLUnit; same thing but for Java
Use Ghost.py, a Python-based headless browser (that I haven't seen suggested anywhere, for some reason!)
Write a DOM-based javascript interpreter based on Pyv8 (Google v8 javascript engine) and add this to my current "half-solution" with mechanize.
For now, I have decided to either use Ghost.py or my own modification of the PySide/PyQt WebKit setup (which is how Ghost works) to evaluate the JavaScript, as apparently they can run quite fast if you optimize them not to download images and disable the GUI.
Hopefully others will find this list useful!
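For anyone curious what the PySide/QtWebKit route looks like, here is the classic minimal pattern (essentially what Ghost.py wraps); example.com is a placeholder.

```python
# Sketch: load a page in an off-screen QtWebKit page, let its JavaScript
# run, then grab the rendered HTML -- no visible browser window needed.
import sys

from PySide.QtCore import QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until quit() in the callback below

    def _load_finished(self, ok):
        # By now the page's JavaScript has executed.
        self.html = self.mainFrame().toHtml()
        self.app.quit()

page = Render("http://example.com")
print(page.html)  # feed this to your HTML parser of choice
```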
Well, you will need something that understands both the DOM and JavaScript, so that comes down to a headless browser of some sort. Maybe you can take a look at the Selenium WebDriver, but I guess you already did that. I don't think there is an easy way of doing this without running the stuff in an actual browser engine.

Analyse a .aspx site with paging via the __doPostBack function

I want to analyse some data on a webpage, but here's the problem: the site has multiple pages, which get loaded via a __doPostBack function.
How can I "simulate" going one page further and analyse that page, and so on?
At the moment I analyse the data with JSoup in Java - but I'm open to using another language if necessary.
A postback-based system (.NET, Prado/PHP, etc.) works by keeping a complete snapshot of the browser contents on the server side. This is called a page state. Any attempt to manipulate it with a client that is not JavaScript-capable is almost certain to fail.
What you need is a JavaScript-capable browser. The easiest solution I found is to use the framework Firefox is written in - XUL - to create such a desktop application: basically a desktop application with a single browser element in it, which you can then script from the application itself, without the restrictions of the security container. Alternatively, you could use the Greasemonkey plugin to do your bidding. The latter is a bit easier to get started with, but it's fairly limited, since it runs on a per-page basis.
With both solutions you then have access to the page's DOM to gather data and you can also fire events (like clicking on a button). Unfortunately you have to learn JavaScript for this to work.
I used an automation library, Selenium, which is available for a lot of languages (C#, Java, Perl, ...).
For more information on how to get started, see this link.
As well as Selenium, you can use http://watin.org/
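To make the Selenium route concrete, here is a minimal Python sketch (the Java and C# bindings are analogous). The page URL, control name, and page count are hypothetical; copy the real __doPostBack arguments from the pager link's href.

```python
# Sketch: drive an ASP.NET pager by firing __doPostBack from Selenium.
# The real browser keeps the hidden __VIEWSTATE fields consistent for us.
import time

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/list.aspx")  # hypothetical page

for page in range(2, 6):  # pages 2..5; adjust to the real page count
    # Arguments are hypothetical -- take them from the pager link's href.
    driver.execute_script(
        "__doPostBack('ctl00$MainContent$grid', 'Page${}')".format(page))
    time.sleep(2)  # crude wait for the postback navigation to finish
    html = driver.page_source  # hand this to JSoup/BeautifulSoup/etc.
    print(page, len(html))

driver.quit()
```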
