Webscraping without Node js possible?

I currently have a simple webpage that consists of just a .js, a .css and an .html file. I do not want to use anything from Node.js.
Given these constraints, is it possible to search the content of external webpages using JavaScript (e.g. by running a web worker in the background)?
For example, I would like to:
Get the first URL of a Google image search.
Edit:
I tried it and it worked fine; however, after two weeks I now get this error:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at ....
(Reason: CORS header ‘Access-Control-Allow-Origin’ missing).
Any ideas how to solve that?
Here is the error as described by Firefox:
https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS/Errors/CORSMissingAllowOrigin

Yes, this is possible. Just use the XMLHttpRequest API:
var request = new XMLHttpRequest();
request.open("GET", "https://bypasscors.herokuapp.com/api/?url=" + encodeURIComponent("https://duckduckgo.com/html/?q=stack+overflow"), true); // last parameter must be true
request.responseType = "document";
request.onload = function (e) {
    if (request.readyState === 4) {
        if (request.status === 200) {
            var a = request.responseXML.querySelector("div.result:nth-child(1) > div:nth-child(1) > h2:nth-child(1) > a:nth-child(1)");
            console.log(a.href);
            document.body.appendChild(a);
        } else {
            console.error(request.status, request.statusText);
        }
    }
};
request.onerror = function (e) {
    console.error(request.status, request.statusText);
};
request.send(null); // not a POST request, so don't send extra data
Note that I had to use a proxy to bypass CORS issues; if you want to do this, run your own proxy on your own server.
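If you do run your own proxy, the request URL can be built in one place so the target is always validated and encoded. The following is only a sketch: the proxy address proxy.example.com and the names buildProxiedUrl and fetchDocument are made up for illustration, and it assumes your proxy accepts the target in a url query parameter, like the one used above.

```javascript
// Hypothetical proxy endpoint; replace it with a proxy you run yourself.
const PROXY_BASE = "https://proxy.example.com/api/?url=";

// Build the URL that asks the proxy to fetch `targetUrl` on our behalf.
function buildProxiedUrl(targetUrl, proxyBase = PROXY_BASE) {
  // new URL() throws a TypeError on a malformed target, before any request is sent.
  const parsed = new URL(targetUrl);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("Only http(s) targets are supported: " + targetUrl);
  }
  return proxyBase + encodeURIComponent(parsed.href);
}

// Browser-only usage: fetch through the proxy and parse the HTML into a document.
async function fetchDocument(targetUrl) {
  const response = await fetch(buildProxiedUrl(targetUrl));
  if (!response.ok) {
    throw new Error(response.status + " " + response.statusText);
  }
  const html = await response.text();
  return new DOMParser().parseFromString(html, "text/html");
}
```

In a page you would call fetchDocument("https://duckduckgo.com/html/?q=stack+overflow") and then run querySelector on the returned document, as in the XMLHttpRequest version.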

Yes, it is theoretically possible to do “web scraping” (i.e. parsing webpages) on the client. There are several restrictions however and I would question why you wouldn’t choose a program that runs on a server or desktop instead.
Web workers are able to request HTML content using XMLHttpRequest, and then parse the incoming XML programmatically. Note that the target webpage must send the appropriate CORS headers if it belongs to a foreign domain. You could then pick out content from the resulting HTML.
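As far as I know, DOM-based parsing (DOMParser, responseXML) is not available inside a worker, so parsing there means working on the raw text. Here is a rough sketch of that idea; firstLinkHref is a name invented for this example, and the regular expression only handles quoted href attributes and does not decode HTML entities, so treat it as an illustration rather than a real HTML parser.

```javascript
// Return the absolute URL of the first <a href="..."> found in `html`,
// or null if there is none. `baseUrl` is used to resolve relative links.
function firstLinkHref(html, baseUrl) {
  // Matches <a ... href="value"> or <a ... href='value'> (quoted values only).
  const match = /<a\b[^>]*?\shref\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(html);
  if (!match) {
    return null;
  }
  const raw = match[1] !== undefined ? match[1] : match[2];
  return new URL(raw, baseUrl).href;
}

// Worker-side usage (only runs inside a web worker, where importScripts exists).
// The page is expected to post a message of the form { url: "..." }.
if (typeof importScripts === "function") {
  self.onmessage = async (event) => {
    const response = await fetch(event.data.url);
    const html = await response.text();
    self.postMessage(firstLinkHref(html, event.data.url));
  };
}
```

The fetch inside the worker is still subject to CORS, so this only works for pages that send the appropriate headers or that are fetched through your own proxy.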
Parsing content generated with CSS and JavaScript will be harder. You will either have to construct sandboxed content on your host page from the input stream, or run some kind of parser, which doesn’t seem very feasible.
In short, the answer to your question is yes, because you have the tools to do a network request and a Turing-complete language with which to build any kind of parsing and scraping that you wanted. So technically anything is possible.
But the real question is: would it be wise? Would you ever choose this approach when other technologies are at hand? Well, no. For most cases I don’t see why you wouldn’t just write a server side program using e.g. headless Chrome.
If you don’t want to use Node - or aren’t able to deploy Node for some reason - there are many web scraping packages and prior art in languages such as Go, C, Java and Python. Search the package manager of your preferred programming language and you will likely find several.

I have heard about Python for scraping too, but Node.js + Puppeteer kick ass, and they are pretty easy to learn.

Related

How do you use simple REST APIs with JavaScript

How do you get data from a REST API with JavaScript? I have several basic APIs that I would like to get data from, none of which require authentication. All of the APIs return the data I want as JSON. For example, https://www.codewars.com/api/v1/users/MrAutoIt. I thought this would be a very simple process using XMLHttpRequest, but it appears the same-origin policy is giving me problems.
I have tried following several tutorials, but either they don’t work across domains or I don’t understand them. I tried to post links to the tutorials, but I don't have a high enough reputation on here yet.

If you are trying to access a web service that is not on the same host:port as the webpage that is issuing the request, you will bump into the same origin policy. There are several things you can do, but all of them require the owner of the service to do things for you.
1) Since same origin policy does not impact scripts, allow the service to respond by JSONP instead of JSON; or
2) Send Access-Control-Allow-Origin header in the web service response that grants your webpage access
If you cannot get the service owner to grant you access, you can make a request serverside (e.g. from Node.js or PHP or Rails code) from a server that is under your control, then forward the data to your web page. However, depending on terms of service of the web service, you may be in breach, and you risk them banning your server.

In fact, it depends on what your server REST API supports regarding JSONP or CORS. You also need to understand how CORS works because there are two different cases:
Simple requests. This case applies when the HTTP method is GET, HEAD or POST. For POST, only the following content types qualify: text/plain, application/x-www-form-urlencoded, multipart/form-data.
Preflighted requests. When a request is not a simple request, the browser first sends a request with the HTTP method OPTIONS to check what is allowed in the context of cross-domain requests.
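The two cases above can be sketched as a small check. This is a simplification under stated assumptions: needsPreflight is a name invented here, it only looks at the method and the Content-Type, and it ignores other triggers such as custom request headers.

```javascript
// Methods and content types that keep a cross-origin request "simple".
const SIMPLE_METHODS = ["GET", "HEAD", "POST"];
const SIMPLE_CONTENT_TYPES = [
  "text/plain",
  "application/x-www-form-urlencoded",
  "multipart/form-data",
];

// Return true if the browser would be expected to send an OPTIONS preflight.
// Simplified: custom request headers, which also trigger a preflight, are ignored.
function needsPreflight(method, contentType) {
  if (!SIMPLE_METHODS.includes(method.toUpperCase())) {
    return true;
  }
  if (contentType == null) {
    return false; // no Content-Type header set by the caller
  }
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  return !SIMPLE_CONTENT_TYPES.includes(mime);
}
```

For example, a POST with Content-Type application/json is preflighted, while the same POST sent as application/x-www-form-urlencoded is not.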
That said, you do not have to add anything to your AJAX requests yourself to enable CORS: the browser sets headers such as Origin, Access-Control-Request-Headers and Access-Control-Request-Method on its own, and the server has to answer with the matching Access-Control-Allow-* headers.
Most JS libraries and frameworks, such as Angular, support this approach.
With jQuery (see http://api.jquery.com/jquery.ajax/), some configuration is possible at this level through the crossDomain and xhrFields > withCredentials options.
With Angular (see How to enable CORS in AngularJs):
angular
    .module('mapManagerApp', [ (...) ])
    .config(['$httpProvider', function($httpProvider) {
        delete $httpProvider.defaults.headers.common['X-Requested-With'];
    }]);
If you want to use low-level JS API for AJAX, you need to consider several things:
use XMLHttpRequest in Firefox 3.5+, Safari 4+ and Chrome, and the XDomainRequest object in IE8 and IE9
set xhr.withCredentials to true if you want to send credentials with AJAX and CORS.
Here are some links that could help you:
Understanding and using CORS: https://templth.wordpress.com/2014/11/12/understanding-and-using-cors/
4 jQuery Cross-Domain AJAX Request methods: http://jquery-howto.blogspot.fr/2013/09/jquery-cross-domain-ajax-request.html#cors (see "1. CORS (Cross-Origin Resource Sharing)")
Unleash your AJAX requests with CORS: http://dev.housetrip.com/2014/04/17/unleash-your-ajax-requests-with-cors/
Using CORS for Cross Domain AJAX requests: http://techblog.constantcontact.com/software-development/using-cors-for-cross-domain-ajax-requests/
Cross origin resource sharing cors AJAX requests between jQuery and Node.js: http://www.bennadel.com/blog/2327-cross-origin-resource-sharing-cors-ajax-requests-between-jquery-and-node-js.htm
Hope it helps,
Thierry

Here is how you get data.
var request = new XMLHttpRequest();
request.open('GET', 'https://www.codewars.com/api/v1/users/MrAutoIt', true);
request.onload = function() {
    if (this.status >= 200 && this.status < 400) {
        var resp = this.response; // Success! this is your data.
    } else {
        // We reached our target server, but it returned an error
    }
};
request.onerror = function() {
    // There was a connection error of some sort
};
request.send();
As far as running into same origin policy... You should be requesting from an origin you control, or you can try disabling Chrome's web security, or installing an extension such as Allow-Control-Allow-Origin * to force headers.
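Since responseType is not set in the request above, the response arrives as text and still has to be parsed as JSON. A minimal sketch of that step follows; readJsonResponse is a helper name made up for this example, and the sample payload in the usage note is invented, not the real API output.

```javascript
// Turn an HTTP status and a response body into either parsed data or an error.
function readJsonResponse(status, body) {
  if (status < 200 || status >= 400) {
    return { ok: false, error: "HTTP " + status };
  }
  try {
    return { ok: true, data: JSON.parse(body) };
  } catch (err) {
    // The server answered, but the body was not valid JSON.
    return { ok: false, error: "Invalid JSON: " + err.message };
  }
}
```

Inside request.onload you could then write var result = readJsonResponse(this.status, this.responseText); and check result.ok before reading result.data. With a made-up body such as '{"username":"MrAutoIt"}', result.data.username would be "MrAutoIt".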

For a get method you could have something like this:
@section scripts{
    <script type="text/javascript">
        $(function()
        {
            $.getJSON('/api/contact', function(contactsJsonPayload)
            {
                $(contactsJsonPayload).each(function(i, item)
                {
                    $('#contacts').append('<li>' + item.Name + '</li>');
                });
            });
        });
    </script>
}
In this tutorial check the topic: Exercise 3: Consume the Web API from an HTML Client

CORS response not received on different computers

This is my first question on SO, but you have all helped me enormously in the past from existing posts - so thank you!
I am working on a web/database system using localhost through XAMPP, but I need to back up an SQL file to my one&one online server. I am using CORS for cross-domain requests with JS to make the backup, and it works on my PC but not on my client's. The request onload works for us both, as the files are saved, but my client does not receive the response message confirming that the file has been saved! Does anyone know why this might be? We are both running IE9 and the same XAMPP version.
Code I am using for CORS request is:
var request = new XMLHttpRequest();
request.open('POST', "http://www.mysite/Backups", true);
request.onload = function()
{
    if (request.status === 200)
    {
        // response functions here
    }
};
request.send("Content=" + backupContent);
Hope this is in the correct question format. It's my first time, remember!

A year ago I had a very similar problem with IE. Your client is using IE, which suggests a fairly large and serious organisation, so I bet they also have specific IE security settings.
Go to your IE security preferences and restrict everything you can. I cannot tell you the exact name of the setting, as I no longer have Internet Explorer, but this way you can reproduce the behaviour.
How to solve the issue? Usually they won't agree to change their security settings, so the only thing that worked for me was using JSONP instead of CORS. I know: not modern, ugly... But it works.
This is just a guess, I trust that everything is done correctly on your side.

Access to restricted URI denied using AngularJS and PHP on server side [duplicate]

"Access to restricted URI denied" code: "1012" [Break On This Error]
xhttp.send(null);
function getXML(xml_file) {
    if (window.XMLHttpRequest) {
        var xhttp = new XMLHttpRequest(); // Creates an instance of the XMLHttpRequest object
    }
    else {
        var xhttp = new ActiveXObject("Microsoft.XMLHTTP"); // for IE 5/6
    }
    xhttp.open("GET", xml_file, false);
    xhttp.send(null);
    var xmlDoc = xhttp.responseXML;
    return (xmlDoc);
}
I'm trying to get data from an XML file using JavaScript. I'm using Firebug to test and debug on Firefox.
The above error is what I'm getting. The same code works in other places where I have used it before, so why is it acting weird here?
Can someone help me understand why it's occurring?
Update:
http://jquery-howto.blogspot.com/2008/12/access-to-restricted-uri-denied-code.html
I found this link explaining the cause of the problem, but I didn't understand what the given solution means. Can someone elaborate?

Another possible cause of this is when you are working with a .html file directly on the file system. For example, if you're accessing it using this url in your browser: C:/Users/Someguy/Desktop/MyProject/index.html
If that page then has to make an ajax request, the request will fail because ajax requests to the filesystem are restricted. To fix this, set up a webserver that points localhost to C:/Users/Someguy/Desktop/MyProject and access the page from http://localhost/index.html

Sounds like you are breaking the same origin policy.
Sub domains, different ports, different protocols are considered different domains.
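To see which pairs of URLs count as the same origin, the comparison can be sketched with the standard URL API. The function name sameOrigin and the example.com addresses are only illustrative, and pages loaded from file: URLs are treated here as never matching anything, which is a simplification of how browsers handle them.

```javascript
// Return true if both URLs share scheme, host and port.
function sameOrigin(urlA, urlB) {
  const a = new URL(urlA);
  const b = new URL(urlB);
  // file: URLs and other opaque origins are treated as never matching.
  const opaque = (u) => u.protocol === "file:" || u.origin === "null";
  if (opaque(a) || opaque(b)) {
    return false;
  }
  return a.origin === b.origin;
}
```

With this, http://example.com/a and http://example.com/b match, while http://api.example.com, https://example.com and http://example.com:8080 are each a different origin from http://example.com.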

Try adding Access-Control-Allow-Origin:* header to the server side script that feeds you the XML. If you don't do it in PHP (where you can use header()) and try to read a raw XML file, you probably have to set the header in a .htaccess file by adding Header set Access-Control-Allow-Origin "*". In addition you might need to add Access-Control-Allow-Headers:*.
I'd also recommend replacing the * in production with your own URL, so that not everybody is allowed to read your data.
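The answer above sets the header in PHP or .htaccess; the same allow-list idea, written in JavaScript purely for illustration, might look like the sketch below. The origin https://www.example.com and the function name corsHeadersFor are assumptions, and the returned object would still have to be applied to the response by whatever server you use.

```javascript
// Hypothetical list of sites that may read the data; replace with your own.
const ALLOWED_ORIGINS = ["https://www.example.com"];

// Given the Origin header of a request, return the CORS headers to send back.
// Unknown origins get no CORS headers, so the browser blocks the read.
function corsHeadersFor(requestOrigin, allowedOrigins = ALLOWED_ORIGINS) {
  if (!requestOrigin || !allowedOrigins.includes(requestOrigin)) {
    return {};
  }
  return {
    "Access-Control-Allow-Origin": requestOrigin,
    // Tell caches that the response depends on the Origin request header.
    "Vary": "Origin",
  };
}
```

Echoing back only origins from the list has the same effect as hard-coding your own URL in place of the *, and it extends to more than one allowed site.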

Without code it is impossible to say, but you could be running afoul of the cross-site ajax limitation: you cannot make ajax requests to other domains.

Not able to access Web API method using XMLHttpRequest, but able to using RESTClient plugin etc.

I have run into a problem. Please help with your expertise.
I am developing a web solution for a company. They have provided me with a Web API method (REST). This API is on their domain, and I am to access it from my domain. The client has already whitelisted my domain.
I am trying to call this method using the code below, but no luck. I am getting the error below.
Error NS_ERROR_DOM_BAD_URI: Access to restricted URI denied
function GetCustomerInfo()
{
    var Url = "http://webapi.company.com/vx.x/customer/get?format=xml&mobile=9876543210";
    myXmlHttp = new XMLHttpRequest();
    myXmlHttp.withCredentials = true;
    myXmlHttp.onreadystatechange = ProcessRequest;
    myXmlHttp.open("GET", Url, true, "UID", "PWD");
    myXmlHttp.send();
}
function ProcessRequest()
{
    if (this.readyState == this.DONE)
    {
        if (this.status == 200 && this.responseXML != null)
        {
            alert("Received Response from Server");
        }
        else
        {
            alert("Some Problem");
        }
    }
}
I am able to access this method from the RESTClient plugin in Firefox.
I am even able to access this method by putting the credentials in the URL, as below, in the browser address bar. I get the expected response XML.
http://UID:PWD@webapi.company.com/vx.x/customer/get?format=xml&mobile=9876543210
Please enlighten me as to where I am going wrong. Perhaps JSONP can help me. But why am I not able to access this API using XMLHttpRequest?
Regards
Rajul

The same origin policy of the browser does not allow you to send XMLHttpRequests to a different domain. The reason you can access it through a firefox plugin or the address bar is that the same origin policy is not applied there.
You are right, JSONP could solve your problem, although you may run into trouble because you do not control the serverside.
In response to your comment: In order to use JSONP effectively, the server will need to return not only the data you need in JSON format, but also javascript code to invoke a callback when the request is done. If you do not control the data that is returned, you can not add the necessary code for this. The wikipedia article gives a good example for the general case.
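To make the callback idea concrete, here is a sketch of both halves. It assumes the common convention of passing the callback name in a callback query parameter; buildJsonpBody (what a JSONP-capable server would send) and loadJsonp (the browser side) are names invented for this example.

```javascript
// Server side: wrap the JSON data in a call to the function named by the client.
function buildJsonpBody(callbackName, data) {
  // Only accept a plain identifier, so the response cannot run arbitrary code.
  if (!/^[A-Za-z_$][\w$]*$/.test(callbackName)) {
    throw new Error("Invalid callback name: " + callbackName);
  }
  return callbackName + "(" + JSON.stringify(data) + ");";
}

// Browser side: register a global callback, then load the service URL as a script.
function loadJsonp(url, callbackName, onData) {
  window[callbackName] = function (data) {
    delete window[callbackName]; // clean up once the data has arrived
    onData(data);
  };
  const script = document.createElement("script");
  script.src = url + (url.includes("?") ? "&" : "?") +
    "callback=" + encodeURIComponent(callbackName);
  document.head.appendChild(script);
}
```

A plain JSON response loaded through a script tag would be parsed and then thrown away; it is the surrounding function call that hands the data to your code, which is why the server has to cooperate.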
I have never used CORS, thus can not give you much information on it. It seems like a better solution, but I imagine it is not incredibly compatible across browsers. Also, as I understand it, you will need control of the server as well, as it seems to require additional HTTP headers.

Javascript access another webpage

I know very, very little JavaScript, but I'm interested in writing a script which needs information from another webpage. Is there a JavaScript equivalent of something like urllib2? It doesn't need to be very robust, just enough to process a simple GET request and store the results, with no need to store cookies or anything.

There is the XMLHttpRequest, but that would be limited to the same domain of your web site, because of the Same Origin Policy.
However, you may be interested in checking out the following Stack Overflow post for a few solutions around the Same Origin Policy:
Ways to circumvent the same-origin policy
UPDATE:
Here's a very basic (non cross-browser) example:
var xhr = new XMLHttpRequest();
xhr.open('GET', '/questions/3315235', true);
xhr.onreadystatechange = function() {
    if (xhr.readyState === 4) {
        console.log(xhr.responseText);
    }
};
xhr.send(null);
If you run the above in Firebug, with Stack Overflow open, you'd get the HTML of this question printed in your JavaScript console:
Screenshot: http://img217.imageshack.us/img217/5545/fbugxml.png

You could issue an AJAX request and process it.

Write your own server, which runs the script to load the data from websites. Then from your web page, ask your server to fetch the data from websites and send them back to you.
see http://www.storminthecastle.com/2013/08/25/use-node-js-to-extract-data-from-the-web-for-fun-and-profit/
