Scraping websites with Javascript enabled?

I'm trying to scrape and submit information to websites that rely heavily on JavaScript for most of their functionality. The websites won't even work when I disable JavaScript in my browser.
I've searched for solutions on Google and SO, and someone suggested I should reverse engineer the JavaScript, but I have no idea how to do that.
So far I've been using Mechanize, and it works on websites that don't require JavaScript.
Is there any way to access websites that use JavaScript by using urllib2 or something similar?
I'm also willing to learn JavaScript, if that's what it takes.

I wrote a small tutorial on this subject; it might help:
http://koaning.io.s3-website.eu-west-2.amazonaws.com/dynamic-scraping-with-python.html
Basically, you have the Selenium library drive a Firefox browser; the browser waits until all the JavaScript has loaded before passing you the HTML string. Once you have this string, you can parse it with BeautifulSoup.
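A minimal sketch of that approach, assuming Firefox plus a matching driver are installed (the URL is just a placeholder):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()            # starts a real Firefox instance
driver.get("http://example.com")        # placeholder URL; Selenium waits for the page to load
html = driver.page_source               # the HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title)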

I've had exactly the same problem. It is not simple at all, but I finally found a great solution, using PyQt4.QtWebKit.
You will find the explanation on this webpage: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
I've tested it, I currently use it, and it's great!
Its great advantage is that it can run on a server with only an X display, without a full graphical environment.
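The core of the recipe from that post looks roughly like this (a sketch, assuming a Python 2 / PyQt4 environment with QtWebKit available, not a drop-in solution):

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL, let WebKit execute its JavaScript, then capture the HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()                 # blocks until _load_finished quits the event loop

    def _load_finished(self, ok):
        self.html = self.mainFrame().toHtml()
        self.app.quit()

page = Render("http://example.com")      # placeholder URL
print(page.html)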

You should look into using Ghost, a Python library that wraps the PyQt4 + WebKit hack.
This makes g the WebKit client:
import ghost
g = ghost.Ghost()
You can grab a page with g.open(url) and then g.content will evaluate to the document in its current state.
Ghost has other cool features, like injecting JS and some form filling methods, and you can pass the resulting document to BeautifulSoup and so on: soup = bs4.BeautifulSoup(g.content).
So far, Ghost is the only thing I've found that makes this kind of thing easy in Python. The only limitation I've come across is that you can't easily create more than one instance of the client object, ghost.Ghost, but you could work around that.
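Putting those pieces together, a typical fetch looks something like this (a sketch that only uses the calls described above):

import bs4
import ghost

g = ghost.Ghost()                        # the WebKit client
g.open("http://example.com")             # placeholder URL; loads the page and runs its JavaScript
soup = bs4.BeautifulSoup(g.content)      # g.content is the document in its current state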

Check out crowbar. I haven't had any experience with it, but I was curious about the answer to your question so I started googling around. I'd like to know if this works out for you.
http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

Maybe you could use Selenium WebDriver, which has Python bindings, I believe. I think it's mainly used as a tool for testing websites, but I guess it should be usable for scraping too.

I would actually suggest using Selenium. It's mainly designed for testing web applications from a "user perspective", but it is basically a Firefox driver. I've actually used it for this purpose, although I was scraping a dynamic AJAX webpage. As long as the JavaScript form has a recognizable anchor text that Selenium can "click", everything should sort itself out.
Hope that helps
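For example, with the Python bindings, clicking a link by its anchor text can look like this (a sketch; the anchor text and URL are placeholders, and find_element_by_link_text is the older-style Selenium API):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com")                     # placeholder URL
driver.find_element_by_link_text("Next").click()     # "Next" stands in for the real anchor text
html = driver.page_source                            # the page after the AJAX update
driver.quit()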

Related

How to bypass JavaScript detection when using requests in Python?

So there is a problem with JavaScript and requests (in Python), which is that requests does not execute JavaScript when fetching a webpage.
The website I'm working with (https://access.paylocity.com/) requires JavaScript, and without it, it replaces the page content with just a line of text at the top saying, "Please enable JavaScript to view the page content."
(I could be wrong here, but) I think one solution is to use Selenium, although that would replace requests, which I'm fine with as long as there are no other ways of fixing/bypassing this JavaScript detection.
(For those wondering, this Python project of mine is supposed to automatically fetch the events on the Paylocity calendar, then port those events to another calendar that I use every day. It's also just intended for myself.)
Edit: Here is the code I have if that will help https://pastecode.io/s/GXTUO1BgtR (I didn't know where to paste my code, so I decided on that website. If I should change it, please comment or say something about it.)
Since the website you're working with loads its content dynamically with JavaScript, as far as I can tell, I think you have no choice but to use Selenium. I had a project of my own a couple of weeks ago and ran into a similar problem, which I was also able to solve using Selenium. But I'm no expert; I'm just giving my thoughts on this.
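A rough sketch of what that could look like; the element ID being waited on is purely an assumption, so pick something that only exists once the calendar has actually rendered:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://access.paylocity.com/")

# Wait for the JavaScript-rendered content to appear ("calendar" is a hypothetical ID).
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, "calendar"))
)
html = driver.page_source
driver.quit()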

Call JavaScript (3rd party library) from Python

I've already searched quite a bit but came to no clear conclusion, as some projects (pyv8) seem to be dead and I'm not sure if they are suitable at all. The 3rd party lib requires a DOM, e.g. a container element in which it runs. It also uses WebAssembly and in general is pretty heavy.
I'm not sure if libs like pyv8 would actually be suitable for that. Another approach would be to go with Selenium and headless Chrome, or a local Node.js service, but both of those sound very heavy. Oh, and the lib must work on Windows, as that's simply company policy (Windows servers), so PyMiniRacer is out.
What are my other options?
Consider taking a look at this post: How do I call a Javascript function from Python?.
However, if your objective is to access JS code in a webpage, for reasons such as web scraping, you could also consider using Selenium WebDriver + Python to do so. Take a look at this medium.com post: How to Run JavaScript in Python | Web Scraping | Web Testing
Other Resources:
https://www.quora.com/How-do-we-use-JavaScript-with-Python
Python to JS: https://pypi.org/project/javascripthon/
P.S.: I am not sure if this will help you. There is another library (PyExecJS) which is no longer maintained, but I think you have looked it up already.
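If you do go the Selenium route, calling into JavaScript that already lives in a page is done with execute_script; a minimal sketch (someLibFunction is a hypothetical stand-in for whatever the 3rd-party lib exposes, and the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://example.com/page-with-the-lib.html")   # placeholder URL

# Run JavaScript in the page's context and return the result to Python.
result = driver.execute_script("return window.someLibFunction('input');")
print(result)
driver.quit()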

Evaluate javascript on a local html file (without browser)

This is part of a project I am working on for work.
I want to automate a Sharepoint site, specifically to pull data out of a database that I and my coworkers only have front-end access to.
I FINALLY managed to get mechanize (in Python) to accomplish this using Python-NTLM, and by patching part of its source code to fix a recurring error.
Now I am at what I hope is my final roadblock: part of the form I need to submit seems to be the output of a JavaScript function, and lo and behold... Mechanize does not support JavaScript. I don't want to emulate the JavaScript functionality myself in Python, because I would ideally like a reusable solution...
So, does anyone know how I could evaluate the JavaScript in the local HTML I download from SharePoint? I just want to run the JavaScript somehow (to complete the loading of the page), but without a browser.
I have already looked into Selenium, but it's pretty slow for the amount of work I need to get done... I am currently looking into PyV8 to try and evaluate the JavaScript myself... but surely there must be an app or library (or anything) that can do this?
Well, in the end I came down to the following possible solutions:
Run Chrome headless and collect the html output (thanks to koenp for the link!)
Run PhantomJS, a headless browser with a javascript api
Run HTMLUnit; same thing but for Java
Use Ghost.py, a Python-based headless browser (which I haven't seen suggested anywhere, for some reason!)
Write a DOM-based JavaScript interpreter based on Pyv8 (Google's V8 JavaScript engine) and add this to my current "half-solution" with mechanize.
For now, I have decided to either use Ghost.py or my own modification of the PySide/PyQt WebKit approach (which is how Ghost works) to evaluate the JavaScript, as apparently they can run quite fast if you optimize them not to download images and to disable the GUI.
Hopefully others will find this list useful!
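If you try the headless Chrome route, one way to drive it from Python is Selenium; a sketch, assuming Chrome and a matching chromedriver are installed (the local path is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)

driver.get("file:///tmp/sharepoint_page.html")   # placeholder path to the downloaded HTML
rendered = driver.page_source                    # the document after its JavaScript has run
driver.quit()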
Well, you will need something that understands both the DOM and JavaScript, so that comes down to a headless browser of some sort. Maybe you can take a look at the Selenium WebDriver, but I guess you already did that. I don't think there is an easy way of doing this without running the stuff in an actual browser engine.

python with javascript

I am working on parsing an HTML page.
I tried spynner, selenium and mechanize, but wasn't able to solve the JavaScript issue in this case.
Can anyone let me know how I can work around this issue to fetch the data on the next page?
When I worked with selenium, on this URL you first have to pick the data in another select box and then proceed, but using selenium I only get the same URL back after clicking on the next page,
and the same happens with spynner.
From what I can tell, mechanize doesn't support JavaScript. So if you're doing any automation with JavaScript-heavy sites, mechanize is probably not the way to go. Rather, you probably need Python to script a fully functional web browser. You can do this with Mozilla via PyXPCOM, with Ruby and WATIR, or with spynner. Of these options, I'd probably try spynner first, as spynner is well integrated with Python.
Good luck with your project, and happy coding!
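For what it's worth, a spynner session roughly follows this shape (a sketch only; I'm assuming the Browser/load/html interface shown in spynner's examples, so check the current docs):

import spynner

browser = spynner.Browser()
browser.load("http://example.com")   # placeholder URL; spynner executes the page's JavaScript
html = browser.html                  # the rendered document, ready for parsing
browser.close()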

Emulate javascript _dopostback in python, web scraping

Here LINK it is suggested that it is possible to "Figure out what the JavaScript is doing and emulate it in your Python code." This is what I would like help doing, i.e. my question: how do I emulate javascript:__doPostBack?
Code from the website (full page source here LINK):
<a style="color: Black;" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSearchResults','Page$2')">2</a>
Of course, I have basically no idea where to go from here.
Thanks in advance for your help and ideas.
OK, there are lots of posts asking how to CLICK a JavaScript button when web scraping with Python libraries (mechanize, BeautifulSoup and similar). I see a lot of "that is not supported" responses that point to THIS non-Python solution. I think a Python solution to this problem would be of great benefit to many. In that light, I am not looking for answers like "use x, y or z", which are not Python code or require interacting with a browser.
The mechanize page is not suggesting that you can emulate JavaScript in Python. It is saying that you can change a hidden field in a form, thus tricking the web server into believing that a human1 has selected the field. You still need to analyse the target yourself.
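For the ASP.NET __doPostBack case in the question, that analysis usually boils down to reproducing a few hidden WebForms fields in a POST. A rough sketch with requests and BeautifulSoup (the URL is a placeholder, and the hidden field names should be verified against the actual page source):

import requests
from bs4 import BeautifulSoup

url = "http://example.com/SearchResults.aspx"    # placeholder URL
session = requests.Session()

# Fetch the page once to pick up the WebForms state fields.
soup = BeautifulSoup(session.get(url).text, "html.parser")
data = {
    "__EVENTTARGET": "ctl00$ContentPlaceHolder1$gvSearchResults",   # first argument of __doPostBack
    "__EVENTARGUMENT": "Page$2",                                    # second argument of __doPostBack
    "__VIEWSTATE": soup.find("input", {"name": "__VIEWSTATE"})["value"],
    "__EVENTVALIDATION": soup.find("input", {"name": "__EVENTVALIDATION"})["value"],
}

# Emulate the postback that javascript:__doPostBack(...) would have triggered.
page2 = session.post(url, data=data)
print(page2.status_code)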
There will be no Python-based solution to this problem, unless you wish to create a JavaScript interpreter in Python.
My thoughts on this problem have led me to three possible solutions:
create an XULRunner application
browser automation
attempt to interpret the client-side code
Of those three, I've only really seen discussion of 2. I've seen something close to 1 in a commercial scraping application, where you basically create scripts by browsing on sites and selecting things on the pages that you would like the script to extract in the future.
1 could possibly be made to work with a Python script by accepting a serialisation (JSON?) of WSGI Request objects, getting the app to fetch the URL, then sending the processed page back as a WSGI Response object. You could possibly wrap some middleware around urllib2 to achieve this. Overkill, probably, but kind of fun to think about.
2 is usually achieved via Selenium RC (Remote Control), a testing-centric tool. It provides a few methods like getHtmlSource, but most people I've heard of who use it don't like its API.
3 I have no idea about. node.js is very hot right now, but I haven't touched it. I've never been able to build SpiderMonkey on my Ubuntu machine, so I haven't touched that either. My hunch is that in order to do this, you would provide the HTML source and your details to a JS interpreter, which would need to fake being your User-Agent etc. in case the JavaScript wanted to reconnect with the server.
1 Well, more technically, a JavaScript-compliant User-Agent, which is almost always a web browser used by a human.
