Get HTML page after jQuery and JavaScript execution with Python

I've read the content of a webpage into a variable in Python, but I'm not getting the final structure (the one modified by Ajax and jQuery in general).
How can I solve this?
I would like to get the HTML as it looks when I save the page from the browser.
That's my code:
import urllib.request

urlAddress = "http:// ... /"
getPage = urllib.request.urlopen(urlAddress)
outputPage = getPage.read().decode("utf-8")  # decode the response bytes to text
print(outputPage)

You can't do this by just fetching the page source from the server. You need to do one of the following:
Use a headless browser or a similar solution (Selenium, Splash, PhantomJS, ...) to run the JS code in the page itself and inspect the results (see the sketch below).
Figure out what the JS code actually does and recreate it in Python. If it makes another call to the server, you can see that in the XHR tab of Chrome's Developer Tools.
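For the first option, here is a minimal sketch with Selenium driving headless Firefox; the URL is a placeholder, and Selenium 4+ with geckodriver installed is assumed:
# Minimal sketch of the headless-browser option.
# Assumes Selenium 4+ and geckodriver on PATH; the URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run without opening a visible window
browser = webdriver.Firefox(options=options)
browser.get("http://example.com/")
html = browser.page_source  # HTML after the page's JavaScript has executed
browser.quit()
print(html)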

Related

How to get the whole source HTML code of a webpage that is generated by JavaScript, using Java / Webdriver?

I am a newbie in programming and I have a task I need to solve. I am trying to get the HTML source code of a webpage using the Java / Webdriver method getPageSource(). The problem is that the page is generated somehow, probably by JavaScript, so the result I get is HTML containing just the page skeleton: a table that is empty, not filled with data. But there is a tag like <script type="text/javascript" src="/x/js/main.c0e805a3.js"></script> at the very bottom of that HTML.
The question is: how can I force Webdriver to run that JavaScript and give me the result, the whole source HTML with data? I already tried using js.executeScript("window.location = '/x/js/main.c0e805a3.js'"); before calling getPageSource(), but without success.
Any help will be appreciated, thanks!
There are quite a few setups now that can run the JavaScript on a web page. The most well known, I think, is likely Selenium, since it has been around for a while. Others include Karate, Puppeteer, and even an old tool called Rhino. Puppeteer is a Google project that runs on Node.js (server-side JavaScript). Comparing and contrasting libraries is discouraged here, so I won't.
I haven't had the time to try Selenium yet, but I write HTML parsing, searching, and updating code all the time. If your only goal is to load a page whose contents are dynamically filled in by AJAX calls (that is, you only want the HTML you would normally see when you visit the site's web page, and you are not concerned with simulating button presses), then the tool I have been using for that is called Splash. Splash does let you invoke JavaScript, but if all you want is to watch the JS on a page dynamically load its tables, then literally all you have to do is start the tool and add one line to your program.
On Google Cloud Platform, the two commands below will start a Splash proxy server. If you are writing your code on AWS (Amazon) or Azure (Microsoft), it would likely be similar. If you are running your code on a local machine in an office, you will have to research how to start it.
Install Docker. Make sure Docker version >= 17 is installed.
Pull the image:
$ sudo docker pull scrapinghub/splash
Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
Then, in your code, all you have to do is the following:
// If your original code looked like this:
URL url = new URL("https://en.wikipedia.org/wiki/Christopher_Columbus");
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", USER_AGENT);
return new BufferedReader(new InputStreamReader(con.getInputStream()));
Change the first line of code in this example to the following, and (theoretically) any dynamically loaded HTML tables that are completed by the page's onload events will be filled in automatically before the HTML page is returned.
// Add this line to your methods
String splashProxy = "http://localhost:8050/render.html?url=";
URL url = new URL(splashProxy + "https://en.wikipedia.org/wiki/Christopher_Columbus");
For most websites, any initial tables that are filled by JS/jQuery/AJAX will be filled in. If you are willing to learn the Lua programming language, you can also start invoking Splash's methods there. It has been pretty convenient for my purposes, since I am not writing web page testing code (code that simulates user button presses). If that is what you are doing, Selenium is likely worth the time spent learning and studying its API.
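Since the original question asks about Python, the same Splash endpoint can be called from Python as well. A minimal sketch with requests, assuming Splash is listening on localhost:8050 as started above:
# Minimal sketch: fetch a JS-rendered page through a local Splash instance.
# Assumes the Splash container from the Docker commands above is running.
import requests

splash = "http://localhost:8050/render.html"
params = {
    "url": "https://en.wikipedia.org/wiki/Christopher_Columbus",
    "wait": 2.0,  # give the page's scripts a couple of seconds to run
}
r = requests.get(splash, params=params)
html = r.text  # HTML after Splash has executed the page's JavaScript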

Python JavaScript scrape automatically

Python novice here.
I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried
pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and
requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the HTML source code these methods retrieve.
If I go to my browser's developer tools and copy the "final" HTML as shown there in the "Elements" tab, the whole information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?
Alternatively, if there's no direct way to obtain the info from the HTML, there might be a second possibility. The site allows you to download the information as an Excel file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.
Please excuse me if this question is poorly worded, and thank you very much in advance.
Many thanks!
Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested. But I'm not really sure how to make it work so that it first runs all the JavaScript so the page actually contains the data.
I think you will either want to use a library to render the page, or look for the site's underlying API calls (the second option is described further below). This answer seems to apply to Python. I will also copy the code from that answer for completeness.
You can pip install selenium from a command line, and then run something like:
from selenium import webdriver
from urllib.request import urlopen  # urllib2 is urllib.request in Python 3

url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.txt'

# Download the raw page and save it to a local file
conn = urlopen(url)
data = conn.read().decode('utf-8')
conn.close()
file = open(file_name, 'wt')
file.write(data)
file.close()

# Open the local copy in Firefox so its JavaScript can execute
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source  # HTML after scripts have run
browser.quit()
I think you could probably skip the file write and just pass the URL straight to that browser.get call, but I'll leave that to you to find out.
The other thing you can do is look for the AJAX calls in a browser's developer tools, e.g. in Chrome via the three dots -> More tools -> Developer tools, or by pressing F12. Then look at the Network tab. There will be various requests. Click one, click the Preview tab, and go through each until you find a response that looks like JSON data. You are effectively looking for the API calls the site uses to fetch the data it renders. Once you find one, click the Headers tab and you will see a Request URL.
E.g. this one, https://sa-tb.nl/api/widget/chart/survey/4/sector/38, has lots of data.
The problem here is that it may or may not be repeatable (the API may change, IDs may change). But you have a similar problem with plain HTML scraping, as the HTML could change just as easily.
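As a sketch, the endpoint above can be called directly with requests; this assumes it returns JSON and needs no special headers or authentication, both of which are worth verifying in the Network tab:
# Minimal sketch: call the API endpoint found in the Network tab directly.
# Assumes the endpoint returns JSON and requires no extra headers.
import requests

r = requests.get('https://sa-tb.nl/api/widget/chart/survey/4/sector/38')
data = r.json()  # the parsed JSON payload behind the rendered page
print(data)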

Efficient practice to scrape a page with client-side output?

I want a script that will scrape a certain web page every hour, and will look for a certain string inside that page.
However, when I open that page and use "view source", I cannot see that string in the source. I was told that this is because the string I'm looking for comes from an element that is rendered on the client side (JavaScript), so I can only see it when I manually inspect that element, with the Chrome console for example.
Which practice / programming language / environment would be the most efficient for achieving what I want, considering that I want to run the script from my web host's server, which has 2.25GB of RAM?
Someone suggested that I use PyQt4, but my web host warned me that this would kill my RAM and hurt server performance. I should note that the script is supposed to be very simple, scraping only a single page, once an hour.
It seems the problem could be solved with PhantomJS, as it mimics a real browser's behavior and can extract information produced by client-side code.
For PhantomJS with JavaScript, you may check testing-javascript-with-phantomjs.
For how to use PhantomJS with Python, please take a look at this.
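As a rough illustration (not taken from the linked answer), here is a minimal sketch using the PhantomJS driver that shipped with older Selenium releases; Selenium 4 removed this driver and PhantomJS itself is no longer maintained, so treat it as legacy:
# Legacy sketch: requires Selenium 3.x and a phantomjs binary on PATH.
# The URL is a placeholder.
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://example.com/')
html = driver.page_source  # HTML after client-side scripts have run
driver.quit()
print(html)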
Hope it helps~
I cannot see that string in the source
If you only need to fetch one string from the page, you can program Python to do the same thing the JS does.
If the JS sends an AJAX request (GET or POST), you make the same request in pure Python, thus fetching the missing string.
Suppose the in-page script performs the following (NB: the code might also be in pure JS; see here for an example):
$.ajax({
    url: "test.html",
    context: document.body
}).done(function() {
    $( this ).addClass( "done" );
});
so in your Python scripting you request the 'test.html' file:
import requests

base = 'http://example.com/'
r = requests.get(base + 'test.html')

print(r.headers['content-type'])
# 'application/json; charset=utf8'
print(r.text)
# '{"data":"<string>"...'
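To close the loop on the hourly check from the question, a minimal sketch; the URL and the search string are placeholders, not from the original post:
# Minimal sketch of the hourly check; URL and target string are placeholders.
import time
import requests

while True:
    r = requests.get('http://example.com/test.html')
    if 'expected string' in r.text:
        print('found it')
    time.sleep(3600)  # wait an hour between checks
On a web host, a cron job that runs the script once an hour would be a lighter-weight alternative to the sleep loop.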

Get element from website with Python without opening a browser

I'm trying to write a Python script that parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which opens a browser that runs the scripts needed to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft that prints the contents, including the element of interest, <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried urllib in Python and wget in bash, but both lack the required JavaScript interpretation.
As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply fetch and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. Based on the URL, the product number appears to be 910000800509. If you look at the Network tab in Chrome dev tools (or your browser's equivalent), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request with just the requests module (requests parses the JSON for you, so the json module isn't needed):
import requests

url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']  # pull the price out of the JSON
print(price)
# £13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled. Based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate this, is:
Selenium
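Since the question specifically asks for "without the browser opening", Selenium can run headless. A minimal sketch, assuming Selenium 4+ with geckodriver, using the CSS class from the markup quoted in the question:
# Minimal sketch: headless Firefox, extracting just the price element.
# Assumes Selenium 4+ and geckodriver on PATH.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # no browser window is opened
browser = webdriver.Firefox(options=options)
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
price = browser.find_element(By.CSS_SELECTOR, 'span.prod-price-inner').text
browser.quit()
print(price)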

jQuery load to hide content

There is JavaScript on my webpage, but I need to hide it from my users (I don't want them to be able to see it, because it contains some answers to the game).
So I tried using jQuery's .load() to hide the content (I load the content from an external JS file with that call). But it failed to load. So I tried AJAX and it failed too.
Maybe the problem comes from the fact that I'm trying to load a file located in my root directory, while the original page is located in "root/public_html/main/pages":
<script type="text/javascript">
$(document).ready(function() {
    $.ajax({
        url : "../../../secret_code.js",
        dataType: "text",
        success : function (data) {
            $("#ajaxcontent").html(data);
        }
    });
});
</script>
1) Why can't I load a file from the root directory with AJAX or the load method?
2) Is there another way around?
PS: I'm putting the file in the root directory so people can't access it directly from their browsers...
1) If the file isn't accessible via web browsers, then it's not accessible via AJAX (AJAX is part of the web browser).
2) Try /secret_code instead of ../../../secret_code.js.
What is your system setup? Are you using a CMS?
Even if you add the JavaScript to the page after page load, a user with a tool like Firebug can still go and view it. I don't think what you are doing is really going to secure it. An alternative solution is to minify and obfuscate the JavaScript you use in your production environment. This produces nearly unreadable but still functioning JavaScript code. There are a number of tools you can run your code through to minify and obfuscate it. Here is one tool you could use: http://www.refresh-sf.com/yui/
If that isn't enough, then maybe you could put the answers to the game on your server side and pull them via AJAX. I don't know your setup, so I don't know if that is viable for you.
Navigate to the URL, not the directory, like:
$.ajax({
    url : "http://domain.com/js/secret_code.js",
    ..
Even if you load your content dynamically, it's quite easy to see the content of the file using Firebug, Fiddler, or any kind of proxy. I suggest you use an obfuscator. It will be harder for users to find the answers.
Take a look at the jQuery.getScript() function; it's designed for loading JavaScript files over AJAX and should do what you need.
Try jQuery's $.getScript() method for loading external script files. However, you can easily see the contents of the script file using Firebug or the developer toolbar!
Security first:
You can't access your root directory with JavaScript, because if that were possible, people could read out your database passwords, FTP passwords, and so on.
You can only load files that are accessible directly from browsers, for example http://www.mydomain.com/secret_code.js
If it can't be accessed directly by the browser, it can't be accessed by the browser via AJAX. You can, however, use .htaccess to prevent users from opening a JS file directly, though that doesn't keep them from looking at it in the Google Chrome or Firebug consoles.
If you want to keep it secret, don't let it reach the browser.
