Emulate JavaScript __doPostBack in Python, web scraping - javascript

Here LINK it is suggested that it is possible to "figure out what the JavaScript is doing and emulate it in your Python code". That is what I would like help doing, i.e. my question: how do I emulate javascript:__doPostBack?
Code from the website (full page source here LINK):
<a style="color: Black;" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSearchResults','Page$2')">2</a>
Of course, I basically have no idea where to go from here.
Thanks in advance for your help and ideas.
OK, there are lots of posts asking how to CLICK a JavaScript button when web scraping with Python libraries such as mechanize, BeautifulSoup and similar. I see a lot of "that is not supported" responses pointing to THIS non-Python solution. I think a Python solution to this problem would be of great benefit to many, so I am not looking for answers like "use x, y or z" that are not Python code or require interacting with a browser.

The mechanize page is not suggesting that you can emulate JavaScript in Python. It is saying that you can change a hidden field in a form, thus tricking the web server into believing that a human[1] selected the field. You still need to analyse the target yourself.
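In the specific case of the ASP.NET __doPostBack link in the question, that analysis usually shows that the JavaScript only fills in two hidden form fields (__EVENTTARGET and __EVENTARGUMENT) and re-submits the form, so the "click" can be reproduced with a plain POST. A rough sketch with requests and BeautifulSoup (the URL is a placeholder; the field names are the standard ASP.NET WebForms ones):

import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute the actual search-results page.
URL = "http://example.com/SearchResults.aspx"

session = requests.Session()
soup = BeautifulSoup(session.get(URL).text, "html.parser")

# Echo back every hidden field the server rendered (__VIEWSTATE and friends).
data = {
    field["name"]: field.get("value", "")
    for field in soup.select("input[type=hidden]")
    if field.get("name")
}

# These two fields are what __doPostBack('ctl00$ContentPlaceHolder1$gvSearchResults', 'Page$2') sets.
data["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gvSearchResults"
data["__EVENTARGUMENT"] = "Page$2"

page_two = session.post(URL, data=data)
print(page_two.status_code)

Whether this is enough depends on the site; the Session object at least keeps the cookies between the GET and the POST.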
There will be no general Python-based solution for arbitrary JavaScript, though, unless you wish to create a JavaScript interpreter in Python.
My thoughts on this problem have led me to three possible solutions:
1. create an XULRunner application
2. browser automation
3. attempt to interpret the client-side code
Of those three, I've only really seen discussion of option 2. I've seen something close to option 1 in a commercial scraping application, where you basically create scripts by browsing sites and selecting the things on the pages that you would like the script to extract in the future.
Option 1 could possibly be made to work with a Python script by accepting a serialisation (JSON?) of WSGI Request objects, getting the app to fetch the URL, then sending the processed page back as a WSGI Response object. You could possibly wrap some middleware around urllib2 to achieve this. Overkill, probably, but kind of fun to think about.
Option 2 is usually achieved via Selenium RC (Remote Control), a testing-centric tool. It provides a few methods like getHtmlSource, but most people that I've heard from who use it don't like its API.
Option 3 I have no idea about. node.js is very hot right now, but I haven't touched it. I've never been able to build SpiderMonkey on my Ubuntu machine, so I haven't touched that either. My hunch is that in order to do this, you would provide the HTML source and your details to a JS interpreter, which would need to fake being your user agent etc. in case the JavaScript wanted to reconnect with the server.
[1] Well, more technically, a JavaScript-compliant user agent, which is almost always a web browser used by a human.

Related

Navigate between pages without changing the link

This question might have been asked in the past, but I cannot seem to find the answer. I would like to navigate between pages without having to redirect to a brand new xxx.html file. Basically, I want to keep only one HTML file, the index.html.
In order to understand what I mean, here is a small preview of this functionality I want to achieve.
Preview
As you can see, the piece of clothing does not have its own individual HTML file. What method is used to achieve this?
What you are seeing is called a Single-Page Application (SPA). There are a lot of frameworks with which you can create a page like this. If you are going for plain HTML/CSS/JavaScript, it will be a lot harder to do correctly.
What you're seeing here is a dynamic webpage that is taking advantage of client-side technology to create this effect. To help further explain, let's quickly go over some web development terminology:
Client-Side: Code that is executed on the user's computer (in this case in their web browser).
Server-Side: Code that is executed on a server, then a response of some sort is sent to the client.
With server-side code, the value cannot change unless a new call is made to the server to get a new response. This is because the code isn't actually running on the computer the user is using; it's running on some other computer, possibly thousands of miles away. With client-side code, however, dynamic changes can be made in real time because the code is actually running on the user's computer.
When it comes to server-side code, we as developers have a myriad of options. Any language that can send an HTTP response to a web browser could theoretically be used as a server-side language. In 2018, that's basically every major language in existence! That being said, some popular options today include Python, Ruby, Java, and JavaScript (Node.js).
When it comes to client-side code, however, we're limited by what can run in a user's web browser. In general, modern web browsers only understand JavaScript. However, while the language has gotten better over the years, writing code in pure JavaScript can sometimes be cumbersome, so there are libraries that help make writing JavaScript easier (such as jQuery), and there are even languages that compile down to JavaScript to add new syntax and functionality (such as TypeScript and CoffeeScript).
If you'd like to start writing dynamic web applications, a good place to start would be to learn the basics of JavaScript. Then, maybe start learning jQuery, or front-end libraries such as Angular or React. Good luck!
You will have to use JavaScript for this. Either you can load all the content at once and just show/hide the parts you need, or you can use AJAX to fetch the content and then render it without a page reload.

Python Beautiful Soup (HTML Parsing)

I am a beginner in Python 3.6, using BeautifulSoup to perform web scraping.
Once I have run requests.get() and prettified the output, I notice that the webpage does not return the values; it seems to contain code which is related to the values instead.
Here is the link to the specific webpage:
http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=AngeliqueKerber&f=r1
I am trying to extract the hand which the player uses in tennis, highlighted in yellow in the picture below:
[Picture of what I am trying to obtain]
I would appreciate feedback concerning the outline of the question; if it is confusing (or non-standard), feedback such as this will help me in the future to ensure I am asking questions appropriately.
There are two options (mostly).
The first one is easier but slower: browser emulation. You just use the site as a normal user would, with a browser. There is a Python module for this task: selenium. It uses a specific webdriver to drive the browser. There are plenty of webdrivers available (for example, chromedriver to drive Chrome). There are also headless solutions (PhantomJS, for example).
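A minimal sketch of that route (assuming selenium and a matching browser driver are installed; the URL is the one from the question):

from selenium import webdriver

driver = webdriver.Chrome()   # or webdriver.Firefox(), or a headless driver
driver.get("http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=AngeliqueKerber&f=r1")
html = driver.page_source    # the HTML after the page's JavaScript has run
driver.quit()

You can then feed html to BeautifulSoup as usual.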
The other way is smarter and faster: XMLHttpRequests (XHRs). Basically, the site uses some hidden API to get its info via JS, and you try to find out exactly how. In most cases you can use the Inspect Element tools of your browser: switch to the Network tab, clear it, and load the results again. Then filter it to show only XHRs. Such an API usually returns JSON values that are easily converted into a Python dictionary using the json() method of the Response object.
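Once you have found such a request in the Network tab, replaying it is straightforward (a sketch; the endpoint below is a made-up placeholder, not the site's real one):

import requests

# Placeholder endpoint copied from the browser's Network tab.
resp = requests.get("http://example.com/api/player?name=AngeliqueKerber")
data = resp.json()   # parse the JSON body into Python dicts/lists
print(data)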
Here's a really great GitHub repository that someone made for this website, practically an API; you can change/edit a few things (fork it) and then use it the way you want to.
HERE
It uses the Selenium webdriver, but it's high quality.

Evaluate javascript on a local html file (without browser)

This is part of a project I am working on for work.
I want to automate a Sharepoint site, specifically to pull data out of a database that I and my coworkers only have front-end access to.
I FINALLY managed to get mechanize (in Python) to accomplish this using Python-NTLM, and by patching part of its source code to fix a recurring error.
Now, I am at what I would hope is my final roadblock: part of the form I need to submit seems to be the output of a JavaScript function :| and, lo and behold... mechanize does not support JavaScript. I don't want to emulate the JavaScript functionality myself in Python, because I would ideally like a reusable solution...
So, does anyone know how I could evaluate the JavaScript in the local HTML I download from SharePoint? I just want to run the JavaScript somehow (to complete the loading of the page), but without a browser.
I have already looked into Selenium, but it's pretty slow for the amount of work I need to get done... I am currently looking into PyV8 to try to evaluate the JavaScript myself... but surely there must be an app or library (or anything) that can do this??
Well, in the end I came down to the following possible solutions:
Run Chrome headless and collect the HTML output (thanks to koenp for the link!)
Run PhantomJS, a headless browser with a JavaScript API
Run HTMLUnit; the same thing, but for Java
Use Ghost.py, a Python-based headless browser (that I haven't seen suggested anywhere, for some reason!)
Write a DOM-based JavaScript interpreter based on PyV8 (Google's V8 JavaScript engine) and add this to my current "half-solution" with mechanize.
For now, I have decided to either use Ghost.py or my own modification of the PySide/PyQt WebKit approach (which is how Ghost works) to evaluate the JavaScript, as apparently they can run quite fast if you optimize them to not download images and to disable the GUI.
Hopefully others will find this list useful!
Well, you will need something that understands both the DOM and JavaScript, so that comes down to a headless browser of some sort. Maybe you can take a look at the Selenium webdriver, but I guess you already did that. I don't think there is an easy way of doing this without running the stuff in an actual browser engine.

Scraping websites with Javascript enabled?

I'm trying to scrape and submit information to websites that rely heavily on JavaScript to do most of their actions. The website won't even work when I disable JavaScript in my browser.
I've searched for some solutions on Google and SO, and there was someone who suggested I should reverse-engineer the JavaScript, but I have no idea how to do that.
So far I've been using Mechanize, and it works on websites that don't require JavaScript.
Is there any way to access websites that use JavaScript by using urllib2 or something similar?
I'm also willing to learn JavaScript, if that's what it takes.
I wrote a small tutorial on this subject; this might help:
http://koaning.io.s3-website.eu-west-2.amazonaws.com/dynamic-scraping-with-python.html
Basically, what you do is have the selenium library pretend that it is a Firefox browser; the browser will wait until all the JavaScript has loaded before it passes you the HTML string. Once you have this string, you can then parse it with BeautifulSoup.
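Roughly, that flow looks like this (a sketch assuming selenium and a Firefox driver are installed; the URL is a placeholder):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/js-heavy-page")            # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")   # HTML after the JS has run
driver.quit()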
I've had exactly the same problem. It is not simple at all, but I finally found a great solution, using PyQt4.QtWebKit.
You will find the explanation on this webpage: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
I've tested it, I currently use it, and it's great!
Its great advantage is that it can run on a server, using only X, without a graphical environment.
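The usual PyQt4.QtWebKit pattern for this looks roughly as follows (a sketch; the URL is a placeholder):

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # Load a URL, let WebKit execute its JavaScript, then keep the rendered HTML.
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, result):
        self.html = self.mainFrame().toHtml()
        self.app.quit()

page = Render("http://example.com/js-heavy-page")  # placeholder URL
print(page.html)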
You should look into using Ghost, a Python library that wraps the PyQt4 + WebKit hack.
This makes g the WebKit client:
import ghost
g = ghost.Ghost()
You can grab a page with g.open(url) and then g.content will evaluate to the document in its current state.
Ghost has other cool features, like injecting JS and some form filling methods, and you can pass the resulting document to BeautifulSoup and so on: soup = bs4.BeautifulSoup(g.content).
So far, Ghost is the only thing I've found that makes this kind of thing easy in Python. The only limitation I've come across is that you can't easily create more than one instance of the client object, ghost.Ghost, but you could work around that.
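Putting the calls mentioned above together, a minimal sketch (the URL is a placeholder):

import bs4
import ghost

g = ghost.Ghost()                                    # the WebKit client
g.open("http://example.com/js-heavy-page")           # loads the page and runs its JavaScript
soup = bs4.BeautifulSoup(g.content, "html.parser")   # parse the rendered document
print(soup.title)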
Check out crowbar. I haven't had any experience with it, but I was curious about the answer to your question so I started googling around. I'd like to know if this works out for you.
http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Maybe you could use Selenium WebDriver, which has Python bindings, I believe. I think it's mainly used as a tool for testing websites, but I guess it should be usable for scraping too.
I would actually suggest using Selenium. It's mainly designed for testing web applications from a "user perspective"; however, it is basically a "Firefox" driver. I've actually used it for this purpose... although I was scraping a dynamic AJAX webpage. As long as the JavaScript form has recognizable "anchor text" that Selenium can "click", everything should sort itself out.
Hope that helps.

How do I use Mechanize to process JavaScript?

I'm connecting to a website and logging in.
The website redirects me to new pages, and Mechanize deals with all the cookie and redirection jobs, but I can't get the last page. I used Firebug and did the same job again manually, and saw that there are two more pages I have to pass through with Mechanize.
I took a quick look at those pages and saw that there is some JavaScript and HTML code, but I couldn't understand it because it doesn't look like normal page code. What are those pages for? How can they redirect to other pages? What should I do to get past them?
If you need to handle pages with JavaScript, try WATIR or Selenium; those drive a real web browser and can thus handle any JavaScript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
Your other option would be understanding what the JavaScript on the offending page does and bypassing it manually, but that seems onerous.
At present, Mechanize doesn't handle JavaScript. There's talk of eventually merging Johnson's capabilities into Mechanize, but until that happens, you have two options:
Figure out the JavaScript well enough to understand how to traverse those pages.
Automate an actual browser that does understand JavaScript using Watir.
"What are those pages for? How can they redirect to other pages? What should I do to pass these?"
Sometimes work is done on those pages. Sometimes the JavaScript is there to prevent automated access like what you're trying to do :). A lot of websites have unnecessary checks to make sure you have a "good" browser, so make sure that your user_agent is set to something common, like IE. Sometimes setting the user_agent to look like an old browser will let you get past without JavaScript.
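If you are on the Python side, setting a common user agent with the mechanize library looks roughly like this (the UA string and URL are just placeholders):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # don't let robots.txt block the automated visit
# Pretend to be a common desktop browser (placeholder user-agent string).
br.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")]
response = br.open("http://example.com/login")   # placeholder URL
print(response.read()[:200])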
Website automation is fun because you have to outsmart the website and its software developers, using multiple strategies. Like the others said, Watir is the best tool for getting past JavaScript at the moment.
