Efficient practice to scrape a page with Client-side output?

Efficient practice to scrape a page with Client-side output? - javascript

I want a script that will scrape a certain web page every hour, and will look for a certain string inside that page.
However, when I enter that page and use `view:source", I cannot see that string in the source. I was told that it's because the string I'm looking for comes from an element that is rendered on the client side (javascript), and thus I can see it only when I manually inspect that element with Chrome console for example.
Which practice / programming language / environment, would be the most efficient to achieve what I want, considering that I want to run that script from my webhost server, which has 2.25GB RAM?
Someone suggested that I will use Pyqt4, but my web-host warned me that this will kill my RAM and hurt server performance. I should note that the script supposed to be very simple, and scrape only a single page, once in an hour.

It seems that problem could be solved with PhantomJS, as it mocks real browser's action, which extracts information from client code.
For PhantomJS with Javascript, you may check testing-javascript-with-phantomjs
For how to use PhantomJS with python, please take a look at this
Hope it helps~

I cannot see that string in the source
If you only need to fetch one string of the page you might program to do the same what js performs.
If JS sends ajax request (GET or POST), you also do it using pure Python thus fetching the missing string.
Suppose in-page script performs the following (NB. code might be in pure JS see here an example):
$.ajax({
url: "test.html",
context: document.body
}).done(function() {
$( this ).addClass( "done" );
});
so in your Python scripting you request the 'test.html' file:
import requests
base='http://example.com/'
r = requests.get( base + 'test.html')
thus getting the data desired:
print r.headers['content-type']
// 'application/json; charset=utf8'
print r.text
// u'{"data":"<string>"...'

Related

Is there a way to programmatically run a javascript function from the source of a website

On basketball-reference.com, there is an injury page that shows all of the current injuries in the NBA. I'd like to begin archiving this data to keep a record of whose injured in the NBA daily. Apart from simply being a basketball stat nut, this is will be an input to a Bayesian Model that predicts a players playing time from his teammates injuries.
Now, I could simply go to his page once a day, click the Get Table as a CSV" button, and copy and paste that into a file, but this seems like a cron job.
I could grab the raw html and parse it but the web page already has a get_csv_output(e) function in its sr-min.js file readily available. In fact, if I open up the developer console and type in
get_csv_output("injuries")
I get all of the csv dumped out as a string. It feels an awful lot like reinventing the wheel when I could simply use this function.
Somehow there is a disconnect in my mind though. I don't grok how I can visit a page, run a js function, and save the output without spinning up a full chrome driver instance through selenium or something. This feels like a simple problem with a simple solution that I just don't know.
I don't particularly care what language the solution is in, although I'd prefer a python, bash, or some other light weight solution.
Please let me know if I'm being naive.
Edit: The page is https://www.basketball-reference.com/friv/injuries.cgi
Edit 2: The accepted answer is an excellent solution for future reference.
I ended up doing
curl https://www.basketball-reference.com/friv/injuries.cgi | python3 convert_injury_html_to_csv.py > "$(date +'%Y%m%d')".tsv
Where the python script is...
import sys
from bs4 import BeautifulSoup
def parse_injury_html(html_doc):
soup = BeautifulSoup(html_doc, "html.parser")
injuries_table = soup.find(id="injuries")
for row in injuries_table.tbody.find_all("tr"):
if row.get('class', None) == "thead":
continue
name = row.th
team, update, description = row.find_all("td")
yield((name.string, team.string, update.string, description.string))
def main():
for (name, team, update, description) in parse_injury_html(sys.stdin.read()):
print(f"{name}\t{team}\t{update}\t{description}")
if __name__ == '__main__':
main()

Just executing this function won't do no good because it must be executed in context of that injuries page. If you look at its code, it effectively parses html data. Weird way of doing things but I saw worse. Nevermind.
The easiest solution will be using something that opens the page and calls the function just like you do it in devtools. Barmar suggested Selenium, but I personally prefer puppeteer. It is run via NodeJS, it opens Chrome in windowless mode and executes any open API on any site. In our case - the get_csv_output function.
After that you may do whatever you want with the result string. Dump it to DB or save to file.
An example of puppeteer code.

You could more directly just run the code in that JS function. Node.js is a standalone JS engine, so you may be able to use it to run the exact same function.
That function is most likely just making HTTP requests to download the data from a server, perhaps with some mild data manipulations. The networking layer between node and browser JS are not the same, but there are polyfills available. If the JS function is using the fetch API, you can use node-fetch, or if it's using XHR-style requests, xmlhttprequest.
Since the code is probably a simple data fetch, it might be simple enough to reverse-engineer what's going on and write your own script yourself in whatever language you prefer to make the same type of HTTP request. Watching what's going on in the network tab of your developer tools should tell you where it's getting its data.

Execute JS in server side with PHP [duplicate]

for a project at school I am trying to make a website that can show your grades in a prettier way than it's being done now.
I have been able to log in to the site using cURL and now I want to get the grades in a string so I can edit it with PHP.
The only problem is that cURL gets the html source code when it hasn't been edited by the javascript that gets the grades.
So basically I want the code that you get when you open firebug or inspector in a string so I can edit it with php.
Does anyone have an idea on how to do this? I have seen several posts that say that you have to wait till the page has loaded, but I have no clue on how to make my site wait for another third-party site to be loaded.
The code that I am waiting to be executed and of which I want the result is this:
<script type="text/javascript">
var widgetWrapper = $("#objectWrapper325");
if (widgetWrapper[0].timer !== undefined) {
clearTimeout( jQuery('#objectWrapper325')[0].timer );
}
widgetWrapper[0].timer = setTimeout( function() {
if (widgetWrapper[0].xhr !== undefined) {
widgetWrapper[0].xhr.abort();
}
widgetWrapper[0].xhr = jQuery.ajax({
type: 'GET',
url: "",
data: {
"wis_ajax": 1,
"ajax_object": 325,
'llnr': '105629'
},
success: function(d) {
var goodWidth = widgetWrapper.width();
widgetWrapper.html(d);
/* update width, needed for bug with standard template */
$("#objectWrapper325 .result__overview").css('width',goodWidth-$("#objectWrapper325 .result__subjectlabels").width());
}
});
}, 500+(Math.random()*1000));
</script>

First you have to understand a subtle but very important difference between using cURL to get a webpage, and using your browser visiting that same page.
1. Loading a page with a browser
When you enter the address on the location bar, the browser converts the url into an ip address . Then it tries to reach the web server with that address asking for a web page. From now on the browser will only speak HTTP with the web server. HTTP is a protocol made for carrying documents over network. The browser is actually asking for an html document (A bunch of text) from the web server. The web server answers by sending the web page to the browser. If the web page is a static page, the web server is just picking an html file and sending it over network. If it's a dynamic page, the web server use some high level code (like php) to generate to the web page then send it over.
Once the web page has been downloaded, the browser will then parse the page and interprets the html inside which produces the actual web page on the browser. During the parsing process, when the browser finds script tags it will interpret their content as javascript, which is a language used in browser to manipulate the look of the web page and do stuff inside the browser.
Remember, the web server only sent a web page containing html content he has no clue of what's javascript.
So when you load a web page on a browser the javascript is ONLY interpreted once it is downloaded on the browser.
2. What is cURL
If you take a look at curl man page, you'll learn that curl is a tool to transfer data from/to servers which can speak some supported protocols and HTTP is one of them.
When you download a page with curl, it will try to download the page the same way your browser does it but will not parse or interpret anything. cURL does not understand javascript or html, all it knows about is how to speak to web servers.
3. Solution
So what you need in your case is to download the page like cURL does it and also somehow make the javascript to be interpreted as if it was inside a browser.
If you had follwed me up to here then you're ready to take a look at CasperJS.

Get Html page after jQuery and Javascript execution with Python

I've imported the content of a webpage into a variable in python, but I'm not getting the final structure (the one that's modified by Ajax and jQuery in general).
How could I solve this?
I would like to get the html as the one I see if I save the page from the browser.
That's my code:
import urllib.request
urlAddress = "http:// ... /"
getPage = urllib.request.urlopen(urlAddress)
outputPage = str(getPage.read())
print(outPage)

You can't by just getting the page source from the server. You need to do one of the following:
Use the headless browser or similar solution (Selenium, Splash, PhantomJS, ...) to run the JS code in the page itself and see the results.
Figure out what the JS code actually does, and recreate the same in Python. If it's doing another call to the server, you can see that in the XHR tab in Developer Tools on Chrome.

Get element from website with python without opening a browser

I'm trying to write a python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this, without selenium's webdiver, in order to open a browser which handles the scripts to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
edit: I've previously tried to use urllib or in bash wget, which both lack the required javascript interpretation.

As other answers mentioned, this webpage requires javascript to render content, so you can't simply get and process the page with lxml, Beautiful Soup, or similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. It appears that the product number is 910000800509 based on the url. If you look at the networking tab in Chrome dev tools (or your brower's equivalent dev tools), you'll see that a GET request is being made to following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request like this with just the json and requests modules:
import json
import requests
url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']
print price
£13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.

How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse :
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if javascript is enabled, based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automatize, is:
selenium

How do I protect JavaScript files?

I know it's impossible to hide source code but, for example, if I have to link a JavaScript file from my CDN to a web page and I don't want the people to know the location and/or content of this script, is this possible?
For example, to link a script from a website, we use:
<script type="text/javascript" src="http://somedomain.example/scriptxyz.js">
</script>
Now, is possible to hide from the user where the script comes from, or hide the script content and still use it on a web page?
For example, by saving it in my private CDN that needs password to access files, would that work? If not, what would work to get what I want?

Good question with a simple answer: you can't!
JavaScript is a client-side programming language, therefore it works on the client's machine, so you can't actually hide anything from the client.
Obfuscating your code is a good solution, but it's not enough, because, although it is hard, someone could decipher your code and "steal" your script.
There are a few ways of making your code hard to be stolen, but as I said nothing is bullet-proof.
Off the top of my head, one idea is to restrict access to your external js files from outside the page you embed your code in. In that case, if you have
<script type="text/javascript" src="myJs.js"></script>
and someone tries to access the myJs.js file in browser, he shouldn't be granted any access to the script source.
For example, if your page is written in PHP, you can include the script via the include function and let the script decide if it's safe" to return it's source.
In this example, you'll need the external "js" (written in PHP) file myJs.php:
<?php
$URL = $_SERVER['SERVER_NAME'].$_SERVER['REQUEST_URI'];
if ($URL != "my-domain.example/my-page.php")
die("/\*sry, no acces rights\*/");
?>
// your obfuscated script goes here
that would be included in your main page my-page.php:
<script type="text/javascript">
<?php include "myJs.php"; ?>;
</script>
This way, only the browser could see the js file contents.
Another interesting idea is that at the end of your script, you delete the contents of your dom script element, so that after the browser evaluates your code, the code disappears:
<script id="erasable" type="text/javascript">
//your code goes here
document.getElementById('erasable').innerHTML = "";
</script>
These are all just simple hacks that cannot, and I can't stress this enough: cannot, fully protect your js code, but they can sure piss off someone who is trying to "steal" your code.
Update:
I recently came across a very interesting article written by Patrick Weid on how to hide your js code, and he reveals a different approach: you can encode your source code into an image! Sure, that's not bullet proof either, but it's another fence that you could build around your code.
The idea behind this approach is that most browsers can use the canvas element to do pixel manipulation on images. And since the canvas pixel is represented by 4 values (rgba), each pixel can have a value in the range of 0-255. That means that you can store a character (actual it's ascii code) in every pixel. The rest of the encoding/decoding is trivial.

The only thing you can do is obfuscate your code to make it more difficult to read. No matter what you do, if you want the javascript to execute in their browser they'll have to have the code.

Just off the top of my head, you could do something like this (if you can create server-side scripts, which it sounds like you can):
Instead of loading the script like normal, send an AJAX request to a PHP page (it could be anything; I just use it myself). Have the PHP locate the file (maybe on a non-public part of the server), open it with file_get_contents, and return (read: echo) the contents as a string.
When this string returns to the JavaScript, have it create a new script tag, populate its innerHTML with the code you just received, and attach the tag to the page. (You might have trouble with this; innerHTML may not be what you need, but you can experiment.)
If you do this a lot, you might even want to set up a PHP page that accepts a GET variable with the script's name, so that you can dynamically grab different scripts using the same PHP. (Maybe you could use POST instead, to make it just a little harder for other people to see what you're doing. I don't know.)
EDIT: I thought you were only trying to hide the location of the script. This obviously wouldn't help much if you're trying to hide the script itself.

Google Closure Compiler, YUI compressor, Minify, /Packer/... etc, are options for compressing/obfuscating your JS codes. But none of them can help you from hiding your code from the users.
Anyone with decent knowledge can easily decode/de-obfuscate your code using tools like JS Beautifier. You name it.
So the answer is, you can always make your code harder to read/decode, but for sure there is no way to hide.

Forget it, this is not doable.
No matter what you try it will not work. All a user needs to do to discover your code and it's location is to look in the net tab in firebug or use fiddler to see what requests are being made.

From my knowledge, this is not possible.
Your browser has to have access to JS files to be able to execute them. If the browser has access, then browser's user also has access.
If you password protect your JS files, then the browser won't be able to access them, defeating the purpose of having JS in the first place.

I think the only way is to put required data on the server and allow only logged-in user to access the data as required (you can also make some calculations server side). This wont protect your javascript code but make it unoperatable without the server side code

I agree with everyone else here: With JS on the client, the cat is out of the bag and there is nothing completely foolproof that can be done.
Having said that; in some cases I do this to put some hurdles in the way of those who want to take a look at the code. This is how the algorithm works (roughly)
The server creates 3 hashed and salted values. One for the current timestamp, and the other two for each of the next 2 seconds. These values are sent over to the client via Ajax to the client as a comma delimited string; from my PHP module. In some cases, I think you can hard-bake these values into a script section of HTML when the page is formed, and delete that script tag once the use of the hashes is over The server is CORS protected and does all the usual SERVER_NAME etc check (which is not much of a protection but at least provides some modicum of resistance to script kiddies).
Also it would be nice, if the the server checks if there was indeed an authenticated user's client doing this
The client then sends the same 3 hashed values back to the server thru an ajax call to fetch the actual JS that I need. The server checks the hashes against the current time stamp there... The three values ensure that the data is being sent within the 3 second window to account for latency between the browser and the server
The server needs to be convinced that one of the hashes is
matched correctly; and if so it would send over the crucial JS back
to the client. This is a simple, crude "One time use Password"
without the need for any database at the back end.
This means, that any hacker has only the 3 second window period since the generation of the first set of hashes to get to the actual JS code.
The entire client code can be inside an IIFE function so some of the variables inside the client are even more harder to read from the Inspector console
This is not any deep solution: A determined hacker can register, get an account and then ask the server to generate the first three hashes; by doing tricks to go around Ajax and CORS; and then make the client perform the second call to get to the actual code -- but it is a reasonable amount of work.
Moreover, if the Salt used by the server is based on the login credentials; the server may be able to detect who is that user who tried to retreive the sensitive JS (The server needs to do some more additional work regarding the behaviour of the user AFTER the sensitive JS was retreived, and block the person if the person, say for example, did not do some other activity which was expected)
An old, crude version of this was done for a hackathon here: http://planwithin.com/demo/tadr.html That wil not work in case the server detects too much latency, and it goes beyond the 3 second window period

As I said in the comment I left on gion_13 answer before (please read), you really can't. Not with javascript.
If you don't want the code to be available client-side (= stealable without great efforts),
my suggestion would be to make use of PHP (ASP,Python,Perl,Ruby,JSP + Java-Servlets) that is processed server-side and only the results of the computation/code execution are served to the user. Or, if you prefer, even Flash or a Java-Applet that let client-side computation/code execution but are compiled and thus harder to reverse-engine (not impossible thus).
Just my 2 cents.

You can also set up a mime type for application/JavaScript to run as PHP, .NET, Java, or whatever language you're using. I've done this for dynamic CSS files in the past.

I know that this is the wrong time to be answering this question but i just thought of something
i know it might be stressful but atleast it might still work
Now the trick is to create a lot of server side encoding scripts, they have to be decodable(for example a script that replaces all vowels with numbers and add the letter 'a' to every consonant so that the word 'bat' becomes ba1ta) then create a script that will randomize between the encoding scripts and create a cookie with the name of the encoding script being used (quick tip: try not to use the actual name of the encoding script for the cookie for example if our cookie is name 'encoding_script_being_used' and the randomizing script chooses an encoding script named MD10 try not to use MD10 as the value of the cookie but 'encoding_script4567656' just to prevent guessing) then after the cookie has been created another script will check for the cookie named 'encoding_script_being_used' and get the value, then it will determine what encoding script is being used.
Now the reason for randomizing between the encoding scripts was that the server side language will randomize which script to use to decode your javascript.js and then create a session or cookie to know which encoding scripts was used
then the server side language will also encode your javascript .js and put it as a cookie
so now let me summarize with an example
PHP randomizes between a list of encoding scripts and encrypts javascript.js then it create a cookie telling the client side language which encoding script was used then client side language decodes the javascript.js cookie(which is obviously encoded)
so people can't steal your code
but i would not advise this because
it is a long process
It is too stressful

use nwjs i think helpful it can compile to bin then you can use it to make win,mac and linux application

This method partially works if you do not want to expose the most sensible part of your algorithm.
Create WebAssembly modules (.wasm), import them, and expose only your JS, etc... workflow. In this way the algorithm is protected since it is extremely difficult to revert assembly code into a more human readable format.
After having produced the wasm module and imported correclty, you can use your code as you normallt do:
<body id="wasm-example">
<script type="module">
import init from "./pkg/glue_code.js";
init().then(() => {
console.log("WASM Loaded");
});
</script>
</body>

Develop Reference

JavaScript is the programming language of the Web.