How to scrape websites from within browser?

I would like to scrape a website by just running code in a browser. In this case, the scraper has to run on a specific machine, and I cannot install any software on that machine. However, there is already a browser installed (recent version of Firefox), and I can configure the browser however I want.
What I would like is a javascript solution for scraping, contained in a webpage on site A, that can scrape site B. It seems like this would run into some CORS-type problems; I assume that part of the solution is to disable any cross-origin checks in the browser.
What I have tried so far: I looked up "web scraping in javascript". This brings up a lot of material intended to run in Node.js with cheerio (for example, this tutorial), and also tools like pjscrape, which requires PhantomJS. However, I couldn't find anything equivalent that is intended to run in a browser.
P.S. This is interesting: "Firefox setting to enable cross domain ajax request". Apparently Chrome's --disable-web-security flag takes care of the cross-origin/cross-domain issues. Is there a Firefox equivalent?
P.S. It looks like the ForceCORS extension for Firefox is also useful: http://www-jo.se/f.pfleger/forcecors I'm not sure if I'll be able to install it though.
P.S. This is a nice collection of ways to allow cross-domain in different browsers: http://romkey.com/2011/04/23/getting-around-same-origin-policy-in-web-browsers/ Sadly, the suggested Firefox solution doesn't work in versions >=5.
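To make the goal concrete, here is a minimal sketch of the kind of in-page scraper described above. The function name scrapeText, its options, and the example URL and selector are all made up for illustration. It assumes the cross-origin check has already been dealt with (disabled in the browser, or site B sends CORS headers); otherwise the request is blocked.

```javascript
// Sketch only: scrapeText, its parameters, and the selector are illustrative names.
// In a real page, fetchImpl defaults to the browser's fetch and parser to a DOMParser.
// The request still fails unless site B sends CORS headers or the check is disabled.
async function scrapeText(url, selector, options = {}) {
  const fetchImpl = options.fetchImpl || fetch;
  const parser = options.parser || new DOMParser();

  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  const html = await response.text();

  // Parse the markup into a detached document; scripts in it are not executed.
  const doc = parser.parseFromString(html, 'text/html');
  return Array.from(doc.querySelectorAll(selector), (node) => node.textContent.trim());
}
```

From a page on site A this could be called as scrapeText('https://site-b.example/list', 'h2.title').then(console.log), where the URL and selector are placeholders.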

Edit: it looks like the import.io service shut down and the URL now points to something completely different. Consider this answer obsolete.
Try doing it with import.io (basically a scraping service with a REST API).
As soon as I have an example JavaScript call to the API, I can provide it. Or you can check the docs yourself.
Import.io allows you to structure the data you find on webpages into rows and columns, using simple point and click technology.
First you locate your data: navigate to a website using our browser (download it from us here: http://import.io).
Then, enter our dedicated data extraction workflow by clicking the pink IO button in the top right of the Browser.
We will guide you through structuring the data on the page. You teach import.io how to extract the data by showing us examples of where the data is. We create learning algorithms that generalize from these examples to work out how to get all the data on the website.
The data you collect is stored on our cloud servers to be downloaded and shared. And every time you publish to our platform we create an API to get the data programmatically so you can easily integrate live web data into your applications or third party analytics and visualization software.
EDIT:
If the data recognition works in the browser, you can access the data by going to "simple API integration" and copying the URL.
You can paste the URL here:
// Called when the response has loaded; `this` is the XMLHttpRequest
function reqListener() {
    var data = JSON.parse(this.responseText);
    console.log(data);
    return data;
}
var oReq = new XMLHttpRequest();
oReq.addEventListener("load", reqListener);
// Third argument true = asynchronous request
oReq.open("GET", "yourUrlFromClipboardComesHere", true);
oReq.send();
xhr request source

Related

Scrape website after form submit and data is loaded

I have to scrape a website. After reviewing it, I realised that I don't need to submit any form; I already have the URLs needed to get the data.
I'm using Node.js and Phantom.
I think the source of my problem is related to the session or cookies.
In my web browser I can open https://www.infosubvenciones.es/bdnstrans/GE/es/convocatorias and click the blue form button labelled "Procesar consulta". The table below is then filled. In the dev tools Network tab you can see an XHR request with a link similar to https://www.infosubvenciones.es/bdnstrans/busqueda?type=convs&_search=false&nd=1594848133517&rows=50&page=1&sidx=4&sord=desc and, if you open it in a new tab, the data is displayed. But if you open that link in a different web browser, you get 0 results.
That's exactly what happens to me with Node.js and Phantom, and I don't know how to fix it.

If you want to give Scrapy a try, https://docs.scrapy.org/en/latest/topics/dynamic-content.html explains how to deal with this type of scenario, and I would suggest reading it after completing the tutorial.
The page can also be handy if you use another scraping framework, as there’s not much that is Scrapy-specific, and for the Python-specific parts I’m sure there will be JavaScript counterparts.
As for Cheerio and Phantom, I’m not familiar with them, but it is most likely doable with them as well.
It’s doable with any web client, it’s just a matter of knowing how to use the tool for this purpose. Most of the work involves using your web browser tools to understand how the website works underneath.
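As an illustration of what "understanding how the website works underneath" can lead to, here is a minimal Node.js sketch. It rests on an assumption that is not confirmed for this site: that the server ties the search results to a session cookie set when the form page is loaded. The names cookieHeaderFrom and fetchWithSession are made up for the example.

```javascript
// Sketch only: cookieHeaderFrom and fetchWithSession are illustrative names, and it is an
// assumption that the site ties the search results to a session cookie.

// Turn a list of Set-Cookie header values into a single Cookie request header.
function cookieHeaderFrom(setCookieValues) {
  return setCookieValues
    .map((value) => value.split(';')[0].trim()) // keep "name=value", drop attributes
    .filter((pair) => pair.includes('='))
    .join('; ');
}

// Load the form page first to obtain the cookies, then request the data URL with them.
async function fetchWithSession(pageUrl, dataUrl, fetchImpl = fetch) {
  const pageResponse = await fetchImpl(pageUrl);
  const headers = pageResponse.headers;
  // getSetCookie() is only present in newer runtimes; fall back to get() otherwise.
  const setCookies = typeof headers.getSetCookie === 'function'
    ? headers.getSetCookie()
    : [headers.get('set-cookie')].filter(Boolean);

  const dataResponse = await fetchImpl(dataUrl, {
    headers: { Cookie: cookieHeaderFrom(setCookies) },
  });
  return dataResponse.text();
}
```

If the data still comes back empty, the site may also expect the request sent by the "Procesar consulta" button to be replayed before the data URL is fetched; the Network tab shows which requests happen in between.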

Is it possible to change script running clientside on a webpage?

So I'm playing a game online on my laptop and it is a pretty simple HTML5 game, but the optimization is nonexistent. The game, with a few circles on my screen, is using 85% of my CPU.
Logically, I started profiling the game and was trying to figure out if I could optimize it for my laptop. Now, what I'm wondering is how could I run the game but with a tweaked version of the JS script that is running.
If I try to save the page and run it from my laptop it of course has CORS issues so I can not access the server.
Can I somehow change the script that is executing in my browser while still staying "inside the webpage" that is running, so that I can make XHR requests to the server as normal?
Also, although this is not in the title of the question, can I somehow proxy the XHR requests to my script without breaking the CORS rule?
Why is it so different if I run the same thing from my browser and from a saved HTML on my desktop? I have the same IP and am doing the same thing but from the url it "feels like" I'm running it from somewhere else. Can I somehow imitate that I'm running "from the webpage" but instead run it from a modified saved html?

You could proxy, given there isn't a cross domain protection mechanism or some sort of logging in (which complicates stuff).
What you could very well do is use a browser extension which allows you to add CSS, HTML and JavaScript.
I'm not entirely savvy on extensions, so I'm not sure you can modify existing code, but I'm guessing that if you can add arbitrary JS code you may very well be able to replace the script tag containing the game with a personally modified script based on it. It's worth a try...
Link to getting started with chrome extensions
Update:
If you're set on doing it, proxying amounts to requesting a URL with your application and doing something with the page (HTML) instead of the original source. I assume you want to change the page and serve it to your browser.
With this in mind you will need the following. I don't know C#, so you'll have to google around for libraries and utilities:
a way to request URLs (see link at bottom)
a way to modify the page, you need a DOM crawler
a way to start said process and serve it to your browser by hitting your own URL, meaning you need some sort of web server
I found the following question specifically on proxying with C#
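The three pieces listed above (request the URL, modify the page, serve it from your own URL) can be sketched in a few lines. The sketch below uses Node.js rather than C#, and string replacement rather than a DOM crawler, purely to keep it short; rewriteScriptSrc, startProxy, and every URL and port are hypothetical.

```javascript
// Sketch only: rewriteScriptSrc, startProxy and all URLs/ports are illustrative.
const http = require('node:http');

// Replace the src of the original game script with the URL of a modified copy.
function rewriteScriptSrc(html, originalSrc, replacementSrc) {
  return html.split('src="' + originalSrc + '"').join('src="' + replacementSrc + '"');
}

// Fetch the upstream page on each request, patch it, and serve the result locally.
function startProxy(upstreamUrl, originalSrc, replacementSrc, port) {
  const server = http.createServer(async (req, res) => {
    try {
      const upstream = await fetch(upstreamUrl);
      const html = rewriteScriptSrc(await upstream.text(), originalSrc, replacementSrc);
      res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
      res.end(html);
    } catch (error) {
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Upstream request failed: ' + error.message);
    }
  });
  return server.listen(port);
}
```

A call such as startProxy('https://game.example/', 'game.js', '/my-game.js', 8080) would serve the patched page at http://localhost:8080/. This only patches the one page: the modified script and any requests the game makes to its server would also have to be served or forwarded by the same local server for them to count as same-origin.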

How to develop Chrome extension to check web sites periodically?

I would like to make a small application to check special offers on some web site. This application should access this site periodically (once every few hours), parse the HTML to find the offer and notify me about the offer somehow.
I would like to develop it in JavaScript as a Chrome extension. Do you know about any examples of such an extension I can learn from?

Chrome extensions have the features available that you're after:
Request permission to the website you want to fetch.
Make a background page with a setInterval that makes an ajax request to the website and checks the contents.
Use notifications to notify the user of an update. After notifying, store the contents locally so you know when the live contents have been updated again.
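The three steps above might look roughly like the following background script. This is a sketch under several assumptions: OFFER_URL, the "offer" class name, and the helper names are invented for the example, and the manifest is assumed to request the "alarms", "notifications" and "storage" permissions plus host access to the site. It uses chrome.alarms instead of the setInterval mentioned above, on the assumption of a Manifest V3 extension, where the background script is a service worker that can be stopped between runs.

```javascript
// Sketch only: OFFER_URL, the "offer" class and the helper names are hypothetical.
const OFFER_URL = 'https://shop.example/offers';

// Pull the text of the first element with class "offer" out of the raw HTML.
// A regular expression is a crude stand-in for a real parser.
function extractOffer(html) {
  const match = /<[^>]*class="[^"]*\boffer\b[^"]*"[^>]*>([\s\S]*?)<\//.exec(html);
  return match ? match[1].trim() : null;
}

// Fetch the page, compare with the stored offer, and notify only when it changed.
async function checkOffer() {
  const response = await fetch(OFFER_URL);
  const offer = extractOffer(await response.text());
  const { lastOffer } = await chrome.storage.local.get('lastOffer');
  if (offer && offer !== lastOffer) {
    chrome.notifications.create('offer-update', {
      type: 'basic',
      iconUrl: 'icon.png',
      title: 'New offer',
      message: offer,
    });
    await chrome.storage.local.set({ lastOffer: offer });
  }
}

// Only wire up the schedule when running inside an extension.
if (typeof chrome !== 'undefined' && chrome.alarms) {
  chrome.alarms.create('check-offer', { periodInMinutes: 180 });
  chrome.alarms.onAlarm.addListener((alarm) => {
    if (alarm.name === 'check-offer') checkOffer();
  });
}
```

Storing the last seen offer before comparing is what keeps the extension from notifying again every three hours about an offer that has not changed.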

Instead of making an extension, would it not be better to just subscribe to a particular website's RSS feed? You can download a nifty extension called RSS Live Links to give you updates, and you can subscribe to their offers feed.

How to check the authenticity of a Chrome extension?

The Context:
You have a web server which has to provide exclusive content only if your client has your specific Chrome extension installed.
You have two possibilities to provide the Chrome extension package:
From the Chrome Web Store
From your own server
The problem:
There is a plethora of ways to detect that a Chrome extension is installed:
Inserting an element when a web page is loaded by using Content Scripts.
Sending specific headers to the server by using Web Requests.
Etc.
But there seems to be no solution to check if the Chrome extension which is interacting with your web page is genuine.
Indeed, as the source code of the Chrome extension can be viewed and copied by anyone who wants to, there seems to be no way to know if the Chrome extension currently interacting with your web page is the one you have published or a version cloned (and maybe somewhat altered) by another person.
It seems that you are only able to know that some Chrome extension is interacting with your web page in an "expected way" but you cannot verify its authenticity.
The solution?
One solution may consist in using information contained in the Chrome extension package and which cannot be altered or copied by anyone else:
Sending the Chrome extension's ID to the server? But how?
The ID has to be sent by you and your JavaScript code and there seems to be no way to do it with an "internal" Chrome function.
So if someone else just sends the same ID to your server (some kind of Chrome extension ID spoofing), then your server will consider their Chrome extension a genuine one!
Using the private key which served when you packaged the application? But how?
There seems to be no way to access or use in any way this key programmatically!
One other solution may consist in using NPAPI Plugins and embedding authentication methods like GPG, etc. But this solution is not desirable, mostly because of the big "Warning" section in its API docs.
Is there any other solution?
Notes
This question attempts to raise a real security problem in the Chrome extension's API: How to check the authenticity of your Chrome extension when it comes to interact with your services.
If there are any missing possibilities, or any misunderstandings please feel free to ask me in comments.

I'm sorry to say but this problem as posed by you is in essence unsolvable because of one simple problem: You can't trust the client. And since the client can see the code then you can't solve the problem.
Any information coming from the client side can be replicated by other means. It is essentially the same problem as trying to prove that when a user logs into their account it is actually the user not somebody else who found out or was given their username and password.
The internet security models are built around two parties trying to communicate without a third party being able to imitate one of them, or to modify or listen to the conversation. Without hiding the source code of the extension, the client becomes indistinguishable from the third party (a file among copies: there is no way to determine which is which).
If the source code is hidden it becomes a whole other story. Now the user or malicious party doesn't have access to the secrets the real client knows and all the regular security models apply. However it is doubtful that Chrome will allow hidden source code in extensions, because it would produce other security issues.
Some source code can be hidden using NPAPI Plugins as you stated, but it comes with a price as you already know.
Coming back to the current state of things:
Now it becomes a question of what is meant by interaction.
If interaction means that, while the user is on the page, you want to know whether it is your extension or some other one, then the closest you can get is to list your page in the extension's manifest under the app section, as documented here.
This will allow you to ask on the page whether the app is installed by using
chrome.app.isInstalled
This will return a boolean showing whether your app is installed or not. The command is documented here.
However this does not really solve the problem, since the extension may be installed, but not enabled and there is another extension mocking the communication with your site.
Furthermore the validation is on the client side so any function that uses that validation can be overwritten to ignore the result of this variable.
If however the interaction means making XMLHttpRequests then you are out of luck. Can't be done using current methods because of the visibility of source code as discussed above.
However, if the goal is to limit your site's usability to authorized entities, I suggest using regular means of authentication: having the user log in will allow you to create a session. This session will be propagated to all requests made by the extension, so you are down to regular client login trust issues like account sharing. These can of course be managed by making the user log in via, say, their Google account, which most people are reluctant to share, and further mitigated by blocking accounts that seem to be misused.

I would suggest doing something similar to what Git does (have a look at http://git-scm.com/book/en/Git-Internals-Git-Objects to understand how Git implements it), i.e.
Create SHA1 values of the content of every file in your chrome-extension, and then create another SHA1 value of the concatenated SHA1 values obtained earlier.
In this way, you can share the SHA1 value with your server and authenticate your extension, as the SHA1 value will change if anyone changes any of your files.
Explaining it in more detail with some pseudo code:
function get_authentication_key() {
    // get_all_files_in_extension() and get_file_content() stand for whatever
    // mechanism the extension uses to enumerate and read its own files
    var files = get_all_files_in_extension(),
        concatenated_sha_values = '',
        authentication_key;
    for (file in files) {
        concatenated_sha_values += Digest::SHA1.hexdigest(get_file_content(file));
    }
    $.ajax({
        url: 'http://example.com/getauthkey',
        type: 'post',
        // send the concatenated digests so the server has something to hash
        data: { string: concatenated_sha_values },
        async: false,
        success: function (data) {
            authentication_key = data;
        }
    });
    // You may return either the SHA value of the concatenated values or the concatenated SHA values
    return authentication_key;
}
// Server side code
post('/getauthkey') do
    // One can apply several types of encryption algorithms to the string passed, to make it unbreakable
    authentication_key = Digest::<encryption>.hexdigest($_POST['string']);
    return authentication_key;
end
This method allows you to check whether any kind of file has been changed, be it an image file, a video file or any other file. I would be glad to know if this can be broken as well.
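For anyone who wants to try the idea, here is a runnable sketch of the hashing step using the Web Crypto API instead of the pseudo code's Digest calls. The function names are made up, and the file contents are passed in as plain strings because how an extension enumerates and reads its own files is left open here. As the other answer points out, a client that can read this code can also send the same value, so this shows the computation only.

```javascript
// Sketch only: file contents are passed in as strings; how an extension reads its own
// files is left out, and all function names are illustrative.

// Web Crypto is global in browsers and recent Node.js; older Node.js exposes it via node:crypto.
const subtle = globalThis.crypto && globalThis.crypto.subtle
  ? globalThis.crypto.subtle
  : require('node:crypto').webcrypto.subtle;

// SHA-1 of a string, as a lowercase hex digest.
async function sha1Hex(text) {
  const digest = await subtle.digest('SHA-1', new TextEncoder().encode(text));
  return Array.from(new Uint8Array(digest), (byte) => byte.toString(16).padStart(2, '0')).join('');
}

// Hash every file, concatenate the digests in a fixed order, and hash the result again.
async function authenticationKey(filesByName) {
  const names = Object.keys(filesByName).sort();
  let concatenated = '';
  for (const name of names) {
    concatenated += await sha1Hex(filesByName[name]);
  }
  return sha1Hex(concatenated);
}
```

Sorting the file names first keeps the key stable regardless of the order in which the files happen to be listed; changing the content of any single file changes the final digest.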

Change server HTML app into self-contained desktop app

I wrote a simple web server that takes the public link to a google document containing image urls and names and outputs a print-friendly HTML photo directory with its contents.
I created it for a volunteer organization that I will no longer be able to stay involved in. I need to pass on the ability to generate that directory to my successor.
I'm not confident that I can trust myself to maintain that web application for the long term the organization needs. I'm hoping that instead I can change it to a self contained program, that members of the org could email around to whoever needed to generate the directory.
My first thought was to make a .html file they could open in a browser, but I can't download the CSV data from Google with Ajax, because it is cross domain. After googling, there doesn't seem to be a way around this.
Is there a straightforward framework? I would guess I could do it with Adobe AIR, but I'd prefer something that simply removed the cross domain security feature.
I could take the time to embed a UIWebView into a Mac app, but since I want to write the app primarily in HTML, I'd have to create a bridge to let the web view make a cross domain request anyway right? Also it's not cross platform.
Any other ideas? How can I package my app as a desktop application instead of a web service?

You can get around the cross-domain XHR restriction using Flash. CrossXhr can do it from apps served by regular HTTP servers. I've never tried it with a static webapp served from the file system. Follow the instructions here:
http://code.google.com/p/crossxhr/wiki/CrossXhr
