This is part of a project I am working on for work.
I want to automate a Sharepoint site, specifically to pull data out of a database that I and my coworkers only have front-end access to.
I FINALLY managed to get mechanize (in python) to accomplish this using Python-NTLM, and by patching part of it's source code to fix a reoccurring error.
Now, I am at what I would hope is my final roadblock: Part of the form I need to submit seems to be output of a JavaScript function :| and lo and behold... Mechanize does not support javascript. I don't want to emulate the javascript functionality myself in python because I would ideally like a reusable solution...
So, does anyone know how I could evaluate the javascript on the local html I download from sharepoint? I just want to run the javascript somehow (to complete the loading of the page), but without a browser.
I have already looked into selenium, but it's pretty slow for the amount of work I need to get done... I am currently looking into PyV8 to try and evaluate the javascript myself... but surely there must be an app or library (or anything) that can do this??
Well, in the end I came down to the following possible solutions:
Run Chrome headless and collect the html output (thanks to koenp for the link!)
Run PhantomJS, a headless browser with a javascript api
Run HTMLUnit; same thing but for Java
Use Ghost.py, a python-based headless browser (that I haven't seen suggested anyyyywhere for some reason!)
Write a DOM-based javascript interpreter based on Pyv8 (Google v8 javascript engine) and add this to my current "half-solution" with mechanize.
For now, I have decided to use either use Ghost.py or my own modification of the PySide/PyQT Webkit (how ghost works) to evaluate the javascript, as apparently they can run quite fast if you optimize them to not download images and disable the GUI.
Hopefully others will find this list useful!
Well you will need something that both understands the DOM and understand Javascript, so that comes down to a headless browser of some sort. Maybe you can take a look at the selenium webdriver, but I guess you already did that. I don't hink there is an easy way of doing this without running the stuff in an actually browser engine.
Related
I am working on a scrapy app to scrapte some data on a web page
But there is some data loaded by ajax, and thus python just cannot execute that to get the data.
Is there any lib that simulate the behavior of a browser?
For that you'd have to use a full-blown Javascript engine (like Google V8 in Chrome), to get the real functionality of the browser and how it interacts. However, you could possibly get some information by looking up all URLs in the source and doing a request to each, hoping for some valid data. But in overall, you're stuck without a full Javascript engine.
Something like python-spidermonkey. A wrapper to the Javascript engine of Mozilla. However using it might be rather complicated, but that's dependant on your specific application.
You'd basically have to build a browser, but seems Python-people have made it simple. With PyWebkitGtk you'd get the dom and using either python-spidermonkey mentioned before or PyV8 mentioned by Duncan you'd theoretically get the full functionality needed for a browser/webscraper.
The problem is that you don't just have to be able to execute some Javascript (that's easy), you also have to emulate the browser DOM and that's a lot of work.
If you want to be able to run Javascript then you can use PyV8. Install it with easy_install PyV8 and then you can execute any standalone javascript:
>>> import PyV8
>>> ctxt = PyV8.JSContext()
>>> ctxt.enter()
>>> ctxt.eval("(function(a,b) { return [a+b, a*b, a/b, a-b] })(13,29)")
<_PyV8.JSArray object at 0x01F26A30>
>>> list(_)
[42, 377, 0.4482758620689655, -16]
You can also pass in classes defined in Python, so in principle might be able could emulate enough of the DOM for your purposes.
An AJAX request is a normal web request which is executed asynchronously. All you need is the URL which the JavaScript code sends to the server. Use that URL with urllib to get at the same data.
The simplest way to get work done is by using 'PyExecJS'.
https://pypi.python.org/pypi/PyExecJS
PyExecJS is a porting of ExecJS from Ruby. PyExecJS automatically picks the best runtime available to evaluate your JavaScript program, then returns the result to you as a Python object.
I use Macbook and installed node.js so pyexecjs can use node.js javascript runtime.
pip install PyExecJS
Test code:
import execjs
execjs.eval("'red yellow blue'.split(' ')")
Good luck!
A little update for 2020
i reviewed the results recommended by Google Search. Turns out the best choice is #4 and then #1 #2 are deprecated!
#4 https://github.com/sqreen/PyMiniRacer is actually the most straight forward installation.
#3 https://github.com/kovidgoyal/dukpy is based lightweight JS engine. I could not find any limitations compared with v8. No significant benefit in terms of performance either.
Deprecated
#1 https://github.com/sony/v8eval has received very little maintenance for 2 years. takes 20 minutes to build and fails... (filed a bug report for that https://github.com/sony/v8eval/issues/34
#2 https://github.com/doloopwhile/PyExecJS
is discontinued by the owner
I would like to setup a simple web browser that download a html page , parse it, generate a dom and execute the javascript code. I would like to know if there is a simple project(so not firefox which is good but too big to just understand this piece of logic) showing if it is the right way to handle this or someone to explain me if i am missing something. No particular language( but preferably be python, c#/c++/c ). I am stuck now at integrating the javascript engine, i don't know what to do.
Thx
I don't think its easy to pull off a javascript engine on your own. You could however use an open source engine (like WebKit's JS engine for example) and integrate it in your project.
More Infos:
http://www.webkit.org
google chrome is open source too with a neat javascript engine v8.
http://code.google.com/chromium/
http://code.google.com/p/v8/
another way could be nodejs. it's server side javascript using the v8 engine. so there is no rendering, just pure javascript. maybe thats enough if you do not need the rendering.
http://nodejs.org/
You might want to use the WebBrowser class from .NET for that purpose.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx
Here LINK it is suggested that it is possible to "Figure out what the JavaScript is doing and emulate it in your Python code: " This is what I would like help doing ie my question. How do I emulate javascript:__doPostBack ?
Code from a website (full page source here LINK:
<a style="color: Black;" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSearchResults','Page$2')">2</a>
Of course I have basically know idea where to go from here.
Thanks in advance for your help and ideas
Ok there are lots of posts asking how to CLICK a javascript button when web scraping with python libraries mechanize, beautifulsoup....,similar. I see a lot of "that is not supported" responses use THIS non python solution. I think a python solution to this problem would be of great benefit to many. In that light I am not looking for answers like use x,y or z which are not python code or require interacting with a browser.
The mechanize page is not suggesting that you can emulate JavaScript in Python. It is saying that you can change a hidden field in a form, thus tricking the web server that a human1 has selected the field. You still need to analyse the target yourself.
There will be no Python-based solution to this problem, unless you wish to create a JavaScript interpreter in Python.
My thoughts on this problem have led me to three possible solutions:
create an XULRunner application
browser automation
attempt to interpret the client-side code
Of those three, I've only really seen discussion of 2. I've seen something
close to 1 in a commercial scraping application, where you basically create
scripts by browsing on sites and selecting things on the pages that you
would like the script to extract in the future.
1 could possibly made to work with a Python script by accepting a
serialisation (JSON ?) of wsgi Request objects, getting the app to fetch the
URL, then sending the processed page as a wsgi Response object. You could
possibly wrap some middleware around urllib2 to achieve this. Overkill
probably, but kind of fun to think about.
2 is usually achieved via Selenium RC (Remote Control), a testing-centric
tool. It provides a few methods like getHtmlSource but most people that I've
heard using it get don't like its API.
3 I have no idea about. node.js is very hot right now, but I haven't
touched it. I've never been able to build spidermonkey on my Ubuntu
machine, so I haven't touched that either. My hunch is that in order to do
this, you would provide the HTML source and your details to a JS
interpreter, that would need to fake being your User-Agent etc in case the
JavaScript wanted to reconnect with the server.
1 well, more technically, a JavaScript compliant User-Agent, which is almost always a web browser used by a human
I'm trying to scrape and submit information to websites that heavily rely on Javascript to do most of its actions. The website won't even work when i disable Javascript in my browser.
I've searched for some solutions on Google and SO and there was someone who suggested i should reverse engineer the Javascript, but i have no idea how to do that.
So far i've been using Mechanize and it works on websites that don't require Javascript.
Is there any way to access websites that use Javascript by using urllib2 or something similar?
I'm also willing to learn Javascript, if that's what it takes.
I wrote a small tutorial on this subject, this might help:
http://koaning.io.s3-website.eu-west-2.amazonaws.com/dynamic-scraping-with-python.html
Basically what you do is you have the selenium library pretend that it is a firefox browser, the browser will wait until all javascript has loaded before it continues passing you the html string. Once you have this string, you can then parse it with beautifulsoup.
I've had exactly the same problem. It is not simple at all, but I finally found a great solution, using PyQt4.QtWebKit.
You will find the explanations on this webpage : http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
I've tested it, I currently use it, and that's great !
Its great advantage is that it can run on a server, only using X, without a graphic environment.
You should look into using Ghost, a Python library that wraps the PyQt4 + WebKit hack.
This makes g the WebKit client:
import ghost
g = ghost.Ghost()
You can grab a page with g.open(url) and then g.content will evaluate to the document in its current state.
Ghost has other cool features, like injecting JS and some form filling methods, and you can pass the resulting document to BeautifulSoup and so on: soup = bs4.BeautifulSoup(g.content).
So far, Ghost is the only thing I've found that makes this kind of thing easy in Python. The only limitation I've come across is that you can't easily create more than one instance of the client object, ghost.Ghost, but you could work around that.
Check out crowbar. I haven't had any experience with it, but I was curious about the answer to your question so I started googling around. I'd like to know if this works out for you.
http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Maybe you could use Selenium Webdriver, which has python bindings I believe. I think it's mainly used as a tool for testing websites, but I guess it should be usable for scraping too.
I would actually suggest using Selenium. Its mainly designed for testing Web-Applications from a "user perspective however it is basically a "FireFox" driver. I've actually used it for this purpose ... although I was scraping an dynamic AJAX webpage. As long as the Javascript form has a recognizable "Anchor Text" that Selenium can "click" everything should sort itself out.
Hope that helps
I'm looking for a way to write a non-GUI bot using Mozilla Framework. The bot should be able to work like normal browser (automatically download relevant JS files, make XMLHTTPRequests, run JS operations, modify DOM), except no GUI will be needed.
I wonder if it is possbile to build XULRunner without X, GTK/KDE (without any GUI dependencies), as I will run the bot on FreeBSD server 6.4.
It may sound a bit weird but I need a bot with capacity to operate like browser, runs JS, modifies DOM, submit forms running on non-GUI environments.
I've looked into other browsers such as Lynx, Links, Hulahop, Chrome V8 engine, WebKit JavascriptCore but yet to find desirable output.
It's a part of school project, thesis. We will use to observe price change of budget airlines and after one year long data collection, we need to deduce pricing strategy and customer behavior. It is a serious Final Year Project.
Any hint or help is greatly appreciated! Thank you in advance!
Regards.
You should be able to make progress with selenium. It's a record/test/play tool but its core is manipulating the DOM.
Update from Grundlefleck's comment: As for launching the actual tests there is selenium remote-control, which allows you to write your tests in Java, Ruby, plain HTML and other possible drivers.
Yes, it is possible (but it might very well require LOTS of code changes).
No, I do not know any of the details.
I would not recommend this approach for your purposes. From your comment, it sounds like you are trying to scrape webpages. If you really need to use JavaScript, you can use a stand-alone JavaScript-engine (Mozilla's is available here). Otherwise, I would use Beautiful Soup with Python or Twill. You might also want to read this question.