JavaScript access another webpage - javascript

I know very, very little of JavaScript, but I'm interested in writing a script which needs information from another webpage. Is there a JavaScript equivalent of something like urllib2? It doesn't need to be very robust, just enough to send a simple GET request and store the results; there's no need to handle cookies or anything.

There is XMLHttpRequest, but it is limited to the same domain as your web site, because of the Same Origin Policy.
However, you may be interested in checking out the following Stack Overflow post for a few solutions around the Same Origin Policy:
Ways to circumvent the same-origin policy
UPDATE:
Here's a very basic (non cross-browser) example:
var xhr = new XMLHttpRequest();
xhr.open('GET', '/questions/3315235', true);
xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) {
        console.log(xhr.responseText);
    }
};
xhr.send(null);
If you run the above in Firebug, with Stack Overflow open, you'd get the HTML of this question printed in your JavaScript console:
(Screenshot: the Firebug console with the question's HTML printed.)

You could issue an AJAX request and process it.

Write your own server-side script which loads the data from other websites. Then, from your web page, ask your server to fetch the data and send it back to you.
See http://www.storminthecastle.com/2013/08/25/use-node-js-to-extract-data-from-the-web-for-fun-and-profit/
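For illustration, here is a minimal sketch of such a proxy using only Node's built-in modules; the port, the url query parameter, and the permissive CORS header are my own assumptions, not anything prescribed by the linked article:
// proxy.js - hypothetical minimal proxy: fetches the page named in the
// "url" query parameter and relays its body back to the browser.
const http = require("http");
const https = require("https");

http.createServer(function (req, res) {
    const target = new URL(req.url, "http://localhost").searchParams.get("url");
    if (!target) {
        res.writeHead(400);
        return res.end("Missing ?url= parameter");
    }
    const client = target.startsWith("https") ? https : http;
    client.get(target, function (upstream) {
        res.writeHead(upstream.statusCode, {
            "Content-Type": upstream.headers["content-type"] || "text/html",
            "Access-Control-Allow-Origin": "*" // lets your page read the response
        });
        upstream.pipe(res); // stream the remote body straight through
    }).on("error", function (err) {
        res.writeHead(502);
        res.end(String(err));
    });
}).listen(8080);
Your page would then request http://localhost:8080/?url=... instead of the remote site directly.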

Related

Web scraping without Node.js possible?

I currently have a simple webpage which consists of just a .js, a .css, and an .html file. I do not want to use any Node.js stuff.
Given these constraints, I would like to ask whether it is possible to search the content of external webpages using JavaScript (e.g. by running a web worker in the background).
E.g. I would like to:
Get the first URL link of a Google image search.
Edit:
I tried it and it worked fine; however, after two weeks I now get this error:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at ....
(Reason: CORS header ‘Access-Control-Allow-Origin’ missing).
Any ideas how to solve that?
Here is the error described by firefox:
https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS/Errors/CORSMissingAllowOrigin
Yes, this is possible. Just use the XMLHttpRequest API:
var request = new XMLHttpRequest();
request.open("GET", "https://bypasscors.herokuapp.com/api/?url=" + encodeURIComponent("https://duckduckgo.com/html/?q=stack+overflow"), true); // last parameter must be true
request.responseType = "document";
request.onload = function (e) {
    if (request.readyState === 4) {
        if (request.status === 200) {
            var a = request.responseXML.querySelector("div.result:nth-child(1) > div:nth-child(1) > h2:nth-child(1) > a:nth-child(1)");
            console.log(a.href);
            document.body.appendChild(a);
        } else {
            console.error(request.status, request.statusText);
        }
    }
};
request.onerror = function (e) {
    console.error(request.status, request.statusText);
};
request.send(null); // not a POST request, so don't send extra data
Note that I had to use a proxy to bypass CORS issues; if you want to do this, run your own proxy on your own server.
Yes, it is theoretically possible to do “web scraping” (i.e. parsing webpages) on the client. There are several restrictions, however, and I would question why you wouldn't choose a program that runs on a server or desktop instead.
Web workers are able to request HTML content using XMLHttpRequest, and then parse the incoming XML programmatically. Note that the target webpage must send the appropriate CORS headers if it belongs to a foreign domain. You could then pick out content from the resulting HTML.
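A rough sketch of that division of labour follows; the file names and target URL are placeholders, and the target must send Access-Control-Allow-Origin headers:
// worker.js - fetch the HTML inside a worker. Workers have no DOMParser,
// so the raw text is handed back to the main thread for parsing.
self.onmessage = function (e) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", e.data.url, true);
    xhr.onload = function () {
        self.postMessage({ html: xhr.responseText });
    };
    xhr.send(null);
};

// main.js - parse the returned HTML and pick out content.
var worker = new Worker("worker.js");
worker.onmessage = function (e) {
    var doc = new DOMParser().parseFromString(e.data.html, "text/html");
    console.log(doc.querySelector("a")); // e.g. grab the first link
};
worker.postMessage({ url: "https://example.com/" });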
Parsing content generated with CSS and JavaScript will be harder. You will either have to construct sandboxed content on your host page from the input stream, or run some kind of parser, which doesn’t seem very feasible.
In short, the answer to your question is yes, because you have the tools to do a network request and a Turing-complete language with which to build any kind of parsing and scraping you want. So technically anything is possible.
But the real question is: would it be wise? Would you ever choose this approach when other technologies are at hand? Well, no. For most cases I don’t see why you wouldn’t just write a server side program using e.g. headless Chrome.
If you don’t want to use Node - or aren’t able to deploy Node for some reason - there are many web scraping packages and prior art in languages such as Go, C, Java and Python. Search the package manager of your preferred programming language and you will likely find several.
I've heard about Python for scraping too, but Node.js + Puppeteer kicks ass... and is pretty easy to learn.

To read the HTTP responseText of a particular event (or of all the responses) from my website through JavaScript

Is it possible to read all the HTTP requests'/responses' headers and bodies in a webpage through JavaScript or through any of its frameworks?
(Screenshot of the required field.)
For example: the way I could view/copy them through my browser's developer tools.
To my understanding, if I could get hold of the object or event that fires these requests, then I could access its responseText property to fulfil my requirement.
My question is: how do I do that? Is it even possible to get the responseText of all the responses received in my webpage?
(As the page has been rendered successfully, I should presumably be able to access them as well, shouldn't I?)
I'm just a beginner, so I'm not sure if my question is meaningful. Thanks for all the replies.
If I'm understanding correctly, you're asking how to retrieve detailed network traffic information for a given website. This is browser specific: Chrome for example exposes the chrome.devtools.network object which you can interrogate. See https://developer.chrome.com/extensions/devtools_network
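As a sketch of what that looks like (assuming an extension whose manifest declares a "devtools_page"; the file name is a placeholder):
// devtools.js - runs inside a Chrome DevTools extension page.
// Logs the URL, status, and body of every finished network request.
chrome.devtools.network.onRequestFinished.addListener(function (entry) {
    entry.getContent(function (body, encoding) {
        console.log(entry.request.url, entry.response.status);
        console.log(body); // the responseText you are after
    });
});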
Try the following JavaScript code:
var req = new XMLHttpRequest();
req.open('GET', document.location, false); // synchronous request for the current page
req.send(null);
var headers = req.getAllResponseHeaders().toLowerCase(); // one string, one header per line
alert(headers);
It will get all the HTTP response headers.

Not able to access Web API method using XMLHttpRequest, but able to via the RESTClient plugin etc.

I have run into a problem. Please help with your expertise.
I am developing a web solution for a company. They have provided me with a Web API method (REST). This API is in their domain; I am to access it from my domain. The client has also already whitelisted my domain.
I am trying to call this method using the code below, but no luck. I am getting this error:
Error NS_ERROR_DOM_BAD_URI: Access to restricted URI denied
function GetCustomerInfo()
{
    var Url = "http://webapi.company.com/vx.x/customer/get?format=xml&mobile=9876543210";
    myXmlHttp = new XMLHttpRequest();
    myXmlHttp.withCredentials = true;
    myXmlHttp.onreadystatechange = ProcessRequest;
    myXmlHttp.open("GET", Url, true, "UID", "PWD");
    myXmlHttp.send();
}
function ProcessRequest()
{
    if (this.readyState == this.DONE)
    {
        if (this.status == 200 && this.responseXML != null)
        {
            alert("Received Response from Server");
        }
        else
        {
            alert("Some Problem");
        }
    }
}
I am able to access this method from the RESTClient plugin in Firefox.
I am even able to access this method by copying the credentials into the URL in the browser address bar, as below; I get the anticipated response XML:
http://UID:PWD@webapi.company.com/vx.x/customer/get?format=xml&mobile=9876543210
Please enlighten me as to where I am wrong. Perhaps JSONP can come to my help, but why am I not able to access this API using XMLHttpRequest?
Regards
Rajul
The same origin policy of the browser does not allow you to send XMLHttpRequests to a different domain. The reason you can access the method through a Firefox plugin or the address bar is that the same origin policy is not applied there.
You are right, JSONP could solve your problem, although you may run into trouble because you do not control the server side.
In response to your comment: in order to use JSONP effectively, the server needs to return not only the data you need in JSON format, but also JavaScript code that invokes a callback when the request is done. If you do not control the data that is returned, you cannot add the necessary code for this. The Wikipedia article gives a good example for the general case.
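That general case looks roughly like this; a sketch only, since the format=jsonp and callback parameters here are hypothetical, and the real API would have to support something like them:
// The server must reply with a script that calls the named callback, e.g.:
//   handleCustomer({"mobile": "9876543210"});
function handleCustomer(data) {
    alert("Received response from server"); // runs when the injected script executes
}

var script = document.createElement("script");
script.src = "http://webapi.company.com/vx.x/customer/get?format=jsonp&callback=handleCustomer"; // hypothetical parameters
document.body.appendChild(script);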
I have never used CORS, thus can not give you much information on it. It seems like a better solution, but I imagine it is not incredibly compatible across browsers. Also, as I understand it, you will need control of the server as well, as it seems to require additional HTTP headers.
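For completeness, those additional headers would look something like this on the server; a sketch written as a Node handler, though any server technology can set the same headers:
const http = require("http");

http.createServer(function (req, res) {
    // Headers the API server must send so the browser lets a cross-origin
    // XMLHttpRequest from your page read the response:
    res.setHeader("Access-Control-Allow-Origin", "http://your-domain.example"); // the calling page's origin (placeholder)
    res.setHeader("Access-Control-Allow-Credentials", "true"); // needed because withCredentials is set
    res.end("<customer>...</customer>");
}).listen(8080);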

XMLHttpRequest not containing web page even though web page was received successfully [duplicate]

This question already has answers here (closed 10 years ago):
Possible duplicate: Ways to circumvent the same-origin policy
I am making a personal web page that extracts the lottery Powerball numbers and displays them. I have had success for all but this one link:
var xmlHttp = new XMLHttpRequest();
xmlHttp.open("GET", "http://www.powerball.com/powerball/pb_numbers.asp", false); // synchronous request
xmlHttp.send(null);
document.body.innerHTML = xmlHttp.responseText;
I checked xmlHttp.status and it is 0. However, using the Live HTTP Headers app I see that the request is sent and I do get a successful HTTP/1.0 200 OK, where the page is received on my end. But there is nothing in the xmlHttp object: no responseText, just status 0, as if the GET was never initialized.
EDIT: I do not see an Access-Control-Allow-Origin directive in the returned headers. Why is this, if I am being restricted because I am from a different domain?
You can't use XHR to read data from different origins. Since the request is made as the user of the browser, it is sent with everything that might authenticate the user, so it might retrieve confidential information (which you could then use XHR to copy to your own server).
See this Stack Overflow question for workarounds.
I'm not sure how it works nor its capabilities, but you seem to have an answer above on why it doesn't work. I recommend you use AJAX instead; it's simple and works just great.
Here's an example where I use it:
var site = $.ajax({
    url: "http://google.com",
    type: "GET",
    dataType: "html",
    async: false
}).responseText;
document.body.innerHTML = site;
Good luck,
Your problem here is the same origin policy. You won't be able to get any data from that website using AJAX unless that website provides a JSONP API (and even then it's not technically AJAX).
You can achieve what you are doing to some extent with an iframe but you will have to include the entire page and not just the relevant part of it.
If what you need to do is web scraping, then you will have to use some server-side proxy to do it.
Some tools that might help you:
YQL
Yahoo Pipes
Notable Web scraping tools on Wikipedia
An alternative to cross-domain AJAX is to:
write a proxy which requests the remote server using cURL;
call that proxy file from your AJAX call, as sketched below.
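The browser side of that two-step approach might look like this; a sketch, where /proxy.php is a hypothetical endpoint that performs the cURL request on the server:
// Ask your own server's proxy for the remote page; the proxy's response
// is same-origin, so reading it is allowed.
var xhr = new XMLHttpRequest();
xhr.open("GET", "/proxy.php?url=" + encodeURIComponent("http://www.powerball.com/powerball/pb_numbers.asp"), true);
xhr.onload = function () {
    if (xhr.status === 200) {
        document.body.innerHTML = xhr.responseText;
    }
};
xhr.send(null);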

How to Request a Password Protected Page in Javascript

I'm working on a very simple Sidebar Gadget to analyze my router's monthly bandwidth usage and determine how far ahead or behind I am for that month. The data is located in a router-hosted, password-protected webpage (I'm using DD-WRT, if it matters). I'd like to pass a page request to the router with JavaScript, along with all the authentication information, to retrieve the page all in one go, but I can't seem to find the proper syntax for that. Any ideas?
Here's my code so far:
var ua = navigator.userAgent.toLowerCase();
if (!window.ActiveXObject) {
    req = new XMLHttpRequest();
} else if (ua.indexOf('msie 5') == -1) {
    req = new ActiveXObject("Msxml2.XMLHTTP");
} else {
    req = new ActiveXObject("Microsoft.XMLHTTP");
}
req.open('GET', 'http://192.168.1.1/Status_Internet.asp', false, "username", "password");
req.send(null);
if (req.status == 200) {
    dump(req.responseText);
} else {
    document.write("Error");
}
document.write("Second Error");
Firebug indicates that it throws an error on req.send(null), specifically:
"Access to restricted URI denied" code: "1012"
It may be because of the same-origin policy, but in that case what can I use instead of an XMLHttpRequest?
It is because of the same-origin policy, the alternative is an iframe but that will not really give you what you wish for.
If it is HTTP auth, you used to be able to request the page with http://username:pass@site, but I must admit I haven't tried to use that for a long time, so I don't know if it is still supported.
EDIT:
If this doesn't work, maybe you can use basic HTTP auth as described here: http://en.wikipedia.org/wiki/Basic_access_authentication - but this would require you to use a server-side proxy, since you can't manipulate the request headers from JavaScript when XHR is not an option.
You need to add an Authorization header to the request. I'm not an AJAX expert, so I'm not sure if you can modify header fields in AJAX requests. If you can't, then you're doomed.
If you can, then this Wikipedia article contains an example of what it must look like.
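If setting headers is possible in your environment, a Basic auth header is just the base64-encoded credentials. A sketch, where the URL and credentials are placeholders:
// Send HTTP Basic auth credentials explicitly on the request.
var req = new XMLHttpRequest();
req.open("GET", "http://192.168.1.1/Status_Internet.asp", true);
req.setRequestHeader("Authorization", "Basic " + btoa("username:password")); // base64-encode "user:pass"
req.onload = function () {
    console.log(req.status, req.responseText);
};
req.send(null);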
That's a security feature: you see, there are malicious scripts that used a similar technique to hack the routers (who changes their router password anyway? Apparently not Joe X. Schmoe).
Under the Same Origin Policy, AJAX requests are limited to the domain (and port) whence they originated: therefore, from a local page, you can't make a request to 192.168.1.1:80.
There is a possible workaround - a server-side proxy (e.g. a PHP script that fetches the router pages for you) inside your network.
If the same-origin policy is not the issue, then load jQuery on your page and use jQuery's $.ajax method:
$.ajax({
    url: "path/to/your/page.html", // self-explanatory
    password: "your password",     // a string with your password
    success: function (data) {     // on success:
        $("#someDiv").html(data);  // load in the retrieved page
    }
});
See the jQuery ajax API for details.
