I am trying to scrape some information from a page using the jsdom.env function. However, the page returned in the env() callback says that access to the server is denied, instead of showing the content I see when I load the same URL in a browser.
There seems to be a difference between how the browser loads the page and how jsdom loads it. Is this something that can be configured in the jsdom module?
Edit:
Example URL: http://www.bestbuy.com/site/HP+-+20%22+Widescreen+Flat-Panel+LCD+Monitor/1422209.p?id=1218257754431&skuId=1422209
Update:
The issue was that jsdom does not specify the User-Agent HTTP header. See the detailed answer below.
The problem was that jsdom does not specify a 'User-Agent' HTTP header, which the bestbuy.com server checks for. If it is empty, access is denied. Currently, there is no way of specifying this through jsdom - https://github.com/tmpvar/jsdom/issues/196
A workaround that worked for me was to use the request module to get the page content and then pass it on to jsdom to work on. The request module allows you to specify a User-Agent.
Example:
var request = require('request'),
    getPage = function (someUri, callback) {
        // Send a User-Agent header so the server does not deny access.
        request({uri: someUri, headers: {'User-Agent': 'Mozilla/5.0'}}, function (error, response, body) {
            console.log("Fetched " + someUri + " OK!");
            callback(body);
        });
    };

getPage('http://www.bestbuy.com/', function (body) {
    console.log(body);
});
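To then hand the fetched HTML over to jsdom, something like the following worked with the old jsdom.env API (a sketch, not from the original answer; the exact env signature depends on your jsdom version):

var jsdom = require('jsdom');

getPage('http://www.bestbuy.com/', function (body) {
    // body is the raw HTML fetched with the proper User-Agent header
    jsdom.env(body, function (errors, window) {
        if (errors) { return console.error(errors); }
        // work with the document as you would in a browser
        console.log(window.document.title);
    });
});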
By default, cross-domain AJAX calls are not possible.
More info here: http://m.snook.ca/archives/javascript/cross_domain_aj
Related
I am using the code below for redirection: if the user's country is not India, redirect; otherwise stay on the same page.
<script src="https://code.jquery.com/jquery-1.9.1.min.js"></script>
<script type="text/javascript">
function preloadFunc()
{
$.ajax('http://ip-api.com/json')
.then(
function success(response) {
if(response.country!="India")
{window.location.replace("https://www.google.com/");}
}
window.onpaint = preloadFunc();
</script>
Here is what happens when you try to make the http call from a site loaded over https:
jquery-1.9.1.min.js:5 Mixed Content: The page at 'https://******' was loaded over HTTPS, but requested an insecure XMLHttpRequest endpoint 'http://ip-api.com/json'. This request has been blocked; the content must be served over HTTPS.
If you try to use https for this call you get:
jquery-1.9.1.min.js:5 GET https://ip-api.com/json 403 (Forbidden)
and if you try https://ip-api.com/json direct in your browser you get
{"status":"fail","message":"SSL unavailable for this endpoint, order a key at https://members.ip-api.com/"}
Incidentally, you also have two JS syntax errors in your code. Here is a corrected version (not that it helps in getting the ip stuff returned over https I'm afraid).
<script src="https://code.jquery.com/jquery-1.9.1.min.js"></script>
<script type="text/javascript">
function preloadFunc() {
    $.ajax('https://ip-api.com/json')
        .then(function success(response) {
            console.log(response);
            if (response.country != "India") {
                window.location.replace("https://www.google.com/");
            }
        });
}
window.onpaint = preloadFunc();
</script>
There are two problems:
You cannot make an ajax request using a non-secure method (http) when your page is loaded using a secure method (https). So, if your page is loaded over https, make ajax calls only via https.
When doing that, the other problem is the security violation that happens when you use window.location.replace. The replace method rewrites the current page history in the browser and redirects the page. But the limitation is that the origin of the destination should be the same as where the page is served.
Use one of the following methods to redirect if you want to navigate away from the current origin.
window.location = 'https://www.google.com'
window.location.href = 'https://www.google.com'
That endpoint doesn't support https. Hit it directly and check.
I'm trying to use NodeJS to scrape a website that requires a login by POST.
Then once I'm logged in I can access a separate webpage by GET.
The first problem right now is logging in. I've tried to use request to POST the login information, but the response I get does not appear to be logged in.
var request = require('request');

exports.getstats = function (req, res) {
    request.post({url: requesturl, form: lform}, function (err, response, body) {
        res.writeHead(200, {"Content-Type": "text/html"});
        res.write(body);
        res.end();
    });
};
Here I'm just forwarding the page I get back, but the page I get back still shows the login form, and if I try to access another page it says I'm not logged in.
I think I need to maintain the client side session and cookie data, but I can find no resources to help me understand how to do that.
As a follow-up, I ended up using zombie.js to get the functionality I needed.
You need to make a cookie jar and use the same jar for all related requests.
var cookieJar = request.jar();
request.post({url : requesturl, jar: cookieJar, form: lform}, ...
That should in theory allow you to scrape pages with GET as a logged-in user, but only once you get the actual login code working. Based on your description of the response to your login POST, that may not be actually working correctly yet, so the cookie jar won't help until you fix the problems in your login code first.
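For reference, a sketch of the full flow with a shared jar (requesturl and lform come from the question; statsurl is a placeholder for the page you want to scrape afterwards):

var request = require('request');
var cookieJar = request.jar();

// Log in first; the session cookie ends up in cookieJar.
request.post({url: requesturl, jar: cookieJar, form: lform}, function (err, response, body) {
    if (err) { return console.error(err); }
    // Re-use the same jar so the session cookie is sent along.
    request.get({url: statsurl, jar: cookieJar}, function (err, response, body) {
        if (err) { return console.error(err); }
        console.log(body); // the page as a logged-in user
    });
});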
request.jar() didn't work for me, so I am using the response headers to make another request, like this:
request.post({
    url: 'https://exampleurl.com/login',
    form: {"login": "xxxx", "password": "xxxx"}
}, function (error, response, body) {
    request.get({
        url: "https://exampleurl.com/logged",
        headers: response.headers // forward the login response headers to the next request
    }, function (error, response, body) {
        // The full html of the authenticated page
        console.log(body);
    });
});
Actually, this way works fine for me. =D
Request manages cookies between requests if you enable it:
Cookies are disabled by default (else, they would be used in
subsequent requests). To enable cookies, set jar to true (either in
defaults or options).
const request = require('request').defaults({jar: true});

request('http://www.google.com', function () {
    request('http://images.google.com');
});
I'm trying to load a cross-domain HTML page using AJAX, but unless the dataType is "jsonp" I can't get a response. However, when using jsonp the browser expects a script MIME type but receives "text/html".
My code for the request is:
$.ajax({
    type: "GET",
    url: "http://saskatchewan.univ-ubs.fr:8080/SASStoredProcess/do?_username=DARTIES3-2012&_password=P#ssw0rd&_program=%2FUtilisateurs%2FDARTIES3-2012%2FMon+dossier%2Fanalyse_dc&annee=2012&ind=V&_action=execute",
    dataType: "jsonp",
}).success(function (data) {
    $('div.ajax-field').html(data);
});
Is there any way of avoiding using jsonp for the request? I've already tried using the crossDomain parameter but it didn't work.
If not, is there any way of receiving the HTML content via jsonp? Currently the console says "unexpected <" in the jsonp reply.
jQuery Ajax Notes
Due to browser security restrictions, most Ajax requests are subject to the same origin policy; the request can not successfully retrieve data from a different domain, subdomain, port, or protocol.
Script and JSONP requests are not subject to the same origin policy restrictions.
There are some ways to overcome the cross-domain barrier:
CORS Proxy Alternatives
Ways to circumvent the same-origin policy
Breaking The Cross Domain Barrier
There are some plugins that help with cross-domain requests:
Cross Domain AJAX Request with YQL and jQuery
Cross-domain requests with jQuery.ajax
Heads up!
The best way to overcome this problem is to create your own proxy in the back end, so that your proxy points to the services on other domains, because the same-origin policy restriction does not exist in the back end. But if you can't do that in the back end, then pay attention to the following tips.
**Warning!**
Using third-party proxies is not a secure practice, because they can keep track of your data. They can be used with public information, but never with private data.
The code examples shown below use jQuery.get() and jQuery.getJSON(), both are shorthand methods of jQuery.ajax()
CORS Anywhere
2021 Update
The public demo server (cors-anywhere.herokuapp.com) will be very limited by January 31st, 2021
The demo server of CORS Anywhere (cors-anywhere.herokuapp.com) is meant to be a demo of this project. But abuse has become so common that the platform where the demo is hosted (Heroku) has asked me to shut down the server, despite efforts to counter the abuse. Downtime becomes increasingly frequent due to abuse and its popularity.
To counter this, I will make the following changes:
The rate limit will decrease from 200 per hour to 50 per hour.
By January 31st, 2021, cors-anywhere.herokuapp.com will stop serving as an open proxy.
From February 1st, 2021, cors-anywhere.herokuapp.com will only serve requests after the visitor has completed a challenge: The user (developer) must visit a page at cors-anywhere.herokuapp.com to temporarily unlock the demo for their browser. This allows developers to try out the functionality, to help with deciding on self-hosting or looking for alternatives.
CORS Anywhere is a node.js proxy which adds CORS headers to the proxied request.
To use the API, just prefix the URL with the API URL. (Supports https: see github repository)
If you want to automatically enable cross-domain requests when needed, use the following snippet:
$.ajaxPrefilter(function (options) {
    if (options.crossDomain && jQuery.support.cors) {
        var http = (window.location.protocol === 'http:' ? 'http:' : 'https:');
        options.url = http + '//cors-anywhere.herokuapp.com/' + options.url;
        //options.url = "http://cors.corsproxy.io/url=" + options.url;
    }
});

$.get(
    'http://en.wikipedia.org/wiki/Cross-origin_resource_sharing',
    function (response) {
        console.log("> ", response);
        $("#viewer").html(response);
    });
Whatever Origin
Whatever Origin is an open source alternative to anyorigin.com that provides cross-domain JSONP access.
To fetch the data from google.com, you can use this snippet:
// It is good to specify the charset you expect.
// You can use the charset you want instead of utf-8.
// See details for the scriptCharset and contentType options:
// http://api.jquery.com/jQuery.ajax/#jQuery-ajax-settings
$.ajaxSetup({
    scriptCharset: "utf-8", // or "ISO-8859-1"
    contentType: "application/json; charset=utf-8"
});

$.getJSON('http://whateverorigin.org/get?url=' +
    encodeURIComponent('http://google.com') + '&callback=?',
    function (data) {
        console.log("> ", data);
        // If the expected response is text/plain
        $("#viewer").html(data.contents);
        // If the expected response is JSON
        //var response = $.parseJSON(data.contents);
    });
CORS Proxy
CORS Proxy is a simple node.js proxy to enable CORS requests for any website.
It allows javascript code on your site to access resources on other domains that would normally be blocked due to the same-origin policy.
CORS-Proxy gr2m (archived)
CORS-Proxy rmadhuram
How does it work?
CORS Proxy takes advantage of Cross-Origin Resource Sharing, which is a feature that was added along with HTML 5. Servers can specify that they want browsers to allow other websites to request resources they host. CORS Proxy is simply an HTTP Proxy that adds a header to responses saying "anyone can request this".
This is another way to achieve the goal (see www.corsproxy.com). All you have to do is strip http:// and www. from the URL being proxied, and prepend the URL with www.corsproxy.com/
$.get(
    'http://www.corsproxy.com/' +
    'en.wikipedia.org/wiki/Cross-origin_resource_sharing',
    function (response) {
        console.log("> ", response);
        $("#viewer").html(response);
    });
The http://www.corsproxy.com/ domain now appears to be an unsafe/suspicious site. NOT RECOMMENDED TO USE.
CORS proxy browser
Recently I found this one; it involves various security-oriented Cross-Origin Resource Sharing utilities. But it is a black box with Flash as the backend.
You can see it in action here: CORS proxy browser
Get the source code on GitHub: koto/cors-proxy-browser
You can use Ajax-cross-origin, a jQuery plugin.
With this plugin you can use jQuery.ajax() cross-domain. It uses Google services to achieve this:
The AJAX Cross Origin plugin uses Google Apps Script as a proxy JSON
getter where JSONP is not implemented. When you set the crossOrigin
option to true, the plugin replaces the original url with the Google
Apps Script address and sends it as an encoded url parameter. The Google
Apps Script uses Google server resources to get the remote data, and
returns it back to the client as JSONP.
It is very simple to use:
$.ajax({
    crossOrigin: true,
    url: url,
    success: function (data) {
        console.log(data);
    }
});
You can read more here:
http://www.ajax-cross-origin.com/
If the external site doesn't support JSONP or CORS, your only option is to use a proxy.
Build a script on your server that requests that content, then use jQuery ajax to hit the script on your server.
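As a rough sketch of that idea (assuming a Node.js back end with the express and request modules; any server-side stack works the same way), the proxy endpoint and the jQuery call could look like this:

// server.js - proxies any URL passed as ?url=... and serves it from your own origin
var express = require('express');
var request = require('request');
var app = express();

app.get('/proxy', function (req, res) {
    request(req.query.url).pipe(res);
});

app.listen(3000);

// client - a same-origin request, so no cross-domain restriction applies
$.get('/proxy?url=' + encodeURIComponent('http://example.com/page.html'), function (html) {
    $('div.ajax-field').html(html);
});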
Just put this at the top of your PHP page and it will work without an API:
header('Access-Control-Allow-Origin: *'); //allow everybody
or
header('Access-Control-Allow-Origin: http://codesheet.org'); //allow just one domain
or
$http_origin = $_SERVER['HTTP_ORIGIN']; //allow multiple domains
$allowed_domains = array(
    'http://codesheet.org',
    'http://stackoverflow.com'
);

if (in_array($http_origin, $allowed_domains))
{
    header("Access-Control-Allow-Origin: $http_origin");
}
I'm posting this in case someone faces the same problem I am facing right now. I've got a Zebra thermal printer, equipped with the ZebraNet print server, which offers an HTML-based user interface for editing multiple settings, seeing the printer's current status, etc. I need to get the status of the printer, which is displayed on one of those HTML pages served by the ZebraNet server, and, for example, alert() a message to the user in the browser. This means that I have to get that HTML page in JavaScript first.
Although the printer is within the LAN of the user's PC, the Same Origin Policy is still standing firmly in my way. I tried JSONP, but the server returns HTML and I haven't found a way to modify its functionality (if I could, I would have already set the magic header Access-Control-Allow-Origin: *).
So I decided to write a small console app in C#. It has to be run as Admin to work properly, otherwise it throws an exception :D. Here is some code:
// Create a listener.
HttpListener listener = new HttpListener();
// Add the prefixes.
//foreach (string s in prefixes)
//{
//    listener.Prefixes.Add(s);
//}
listener.Prefixes.Add("http://*:1234/"); // accept connections from everywhere,
// because the printer is accessible only within the LAN (no port forwarding)
listener.Start();
Console.WriteLine("Listening...");
// Note: The GetContext method blocks while waiting for a request.
HttpListenerContext context;
string urlForRequest = "";
HttpWebRequest requestForPage = null;
HttpWebResponse responseForPage = null;
string responseForPageAsString = "";

while (true)
{
    context = listener.GetContext();
    HttpListenerRequest request = context.Request;
    urlForRequest = request.RawUrl.Substring(1, request.RawUrl.Length - 1); // remove the slash, which separates the port number from the arg sent
    Console.WriteLine(urlForRequest);

    // Request for the html page:
    requestForPage = (HttpWebRequest)WebRequest.Create(urlForRequest);
    responseForPage = (HttpWebResponse)requestForPage.GetResponse();
    responseForPageAsString = new StreamReader(responseForPage.GetResponseStream()).ReadToEnd();

    // Obtain a response object.
    HttpListenerResponse response = context.Response;
    // Send back the response.
    byte[] buffer = System.Text.Encoding.UTF8.GetBytes(responseForPageAsString);
    // Get a response stream and write the response to it.
    response.ContentLength64 = buffer.Length;
    response.AddHeader("Access-Control-Allow-Origin", "*"); // the magic header in action ;-D
    System.IO.Stream output = response.OutputStream;
    output.Write(buffer, 0, buffer.Length);
    // You must close the output stream.
    output.Close();
}
//listener.Stop();
All the user needs to do is run that console app as Admin. I know it is way too ... frustrating and complicated, but it is sort of a workaround to the Domain Policy problem in case you cannot modify the server in any way.
Edit: from JS I make a simple AJAX call:
$.ajax({
    type: 'POST',
    url: 'http://LAN_IP:1234/http://google.com',
    success: function (data) {
        console.log("Success: " + data);
    },
    error: function (e) {
        alert("Error: " + e);
        console.log("Error: " + e);
    }
});
The html of the requested page is returned and stored in the data variable.
To get the data from an external site using a local proxy, as suggested by jherax, you can create a PHP page that fetches the content for you from the respective external URL and then send a GET request to that PHP page.
var req = new XMLHttpRequest();
req.open('GET', 'http://localhost/get_url_content.php', false); // synchronous request
req.send(null); // the request is not performed until send() is called
if (req.status == 200) {
    alert(req.responseText);
}
As a PHP proxy you can use https://github.com/cowboy/php-simple-proxy
Your URL doesn't work these days, but your code can be updated with this working solution:
var url = "http://saskatchewan.univ-ubs.fr:8080/SASStoredProcess/do?_username=DARTIES3-2012&_password=P#ssw0rd&_program=%2FUtilisateurs%2FDARTIES3-2012%2FMon+dossier%2Fanalyse_dc&annee=2012&ind=V&_action=execute";
url = 'https://google.com'; // TEST URL
$.get("https://images"+~~(Math.random()*33)+"-focus-opensocial.googleusercontent.com/gadgets/proxy?container=none&url=" + encodeURI(url), function(data) {
$('div.ajax-field').html(data);
});
<div class="ajax-field"></div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
You need a CORS proxy which forwards your request from your browser to the requested service with the appropriate CORS headers. A list of such services is in the code snippet below. You can also run the provided code snippet to see the ping to such services from your location.
$('li').each(function () {
    var self = this;
    ping($(this).text()).then(function (delta) {
        console.log($(self).text(), delta, ' ms');
    });
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdn.rawgit.com/jdfreder/pingjs/c2190a3649759f2bd8569a72ae2b597b2546c871/ping.js"></script>
<ul>
<li>https://crossorigin.me/</li>
<li>https://cors-anywhere.herokuapp.com/</li>
<li>http://cors.io/</li>
<li>https://cors.5apps.com/?uri=</li>
<li>http://whateverorigin.org/get?url=</li>
<li>https://anyorigin.com/get?url=</li>
<li>http://corsproxy.nodester.com/?src=</li>
<li>https://jsonp.afeld.me/?url=</li>
<li>http://benalman.com/code/projects/php-simple-proxy/ba-simple-proxy.php?url=</li>
</ul>
Figured it out.
Used this instead.
$('.div_class').load('http://en.wikipedia.org/wiki/Cross-origin_resource_sharing #toctitle');
I am using superagent (installed with npm) to get information from an api. Here's the code in a javascript file:
const http = require('superagent');

http
  .get('https://random.dog/woof.json')
  .end(function (err, res) {
    console.log(err);
    console.log(res.body);
  });
I can test this in my terminal by typing node app.js. Two messages appear in the console, first null, then { url: 'https://random.dog/2d394360-33e1-4c27-9e64-d65a2ab82d5b.jpg' }, which is what I am looking for. I then use a browserify command (browserify app.js -o bundle.js) to make my javascript file usable in an html file. Here is my html file's code:
<html>
  <head>
  </head>
  <body>
    <h1>Text</h1>
    <script src="bundle.js"></script>
  </body>
</html>
Relatively simple. This was just to make sure everything was smooth. I opened the HTML file in my browser (the latest version of Firefox) and opened the developer console. This error appeared. I was mildly annoyed. I had used this exact same API when I was coding a Discord bot and had experienced no issues. So, naturally, I changed browsers. Same error. I did some research and was still a bit confused, so I tried to set a header. New JS file:
const http = require('superagent');

http
  .get('https://random.dog/woof.json')
  .set('Access-Control-Allow-Origin', '*')
  .end(function (err, res) {
    console.log(err);
    console.log(res.body);
  });
This time, this error appeared. It seemed to be along the same lines.
Fortunately, I own a little website, so I uploaded these html and js files to the server. I had the exact same error. I even changed the .set('Access-Control-Allow-Origin', '*') to .set('Access-Control-Allow-Origin', 'http://example.com') (with example.com being the domain of my website, of course). There was no difference.
I decided to see if I could just make the request using the javascript in the html file, without calling in any other sources. I tried this code:
var HttpClient = function () {
    this.get = function (aUrl, aCallback) {
        var anHttpRequest = new XMLHttpRequest();
        anHttpRequest.onreadystatechange = function () {
            if (anHttpRequest.readyState == 4 && anHttpRequest.status == 200)
                aCallback(anHttpRequest.responseText);
        }
        anHttpRequest.open("GET", aUrl, true);
        anHttpRequest.send(null);
    }
}

var client = new HttpClient();
client.get('http://random.dog/woof.json', function (response) {
    console.log(response);
});
and opened the new html file in firefox. I had the same error as the first time.
Why am I receiving these errors? What can I do to fix these errors? Thanks in advance.
An Ajax request to a different domain is blocked by default by the same-origin policy. The only way to allow such an Ajax request is via CORS, which requires the server to have CORS enabled.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS
https://www.html5rocks.com/en/tutorials/cors/
It is the server that must have the Access-Control-Allow-Origin header, not the client. As an example, try calling out to https://api.github.com/ rather than https://random.dog/woof.json, you'll find that you can access that URL because it has the CORS headers enabled.
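For example, the same superagent call pointed at a CORS-enabled API works from the browser (a sketch based on the code in the question):

const http = require('superagent');

http
  .get('https://api.github.com/')
  .end(function (err, res) {
    // err is null here because api.github.com sends Access-Control-Allow-Origin
    console.log(err);
    console.log(res.body);
  });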
Historically JSON-P was also used as a workaround for the same-origin policy but it is generally inferior to CORS and also requires server support.
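For completeness, JSON-P works by loading the response as a script that calls a function you define, so it only works if the server wraps its JSON accordingly (a sketch with a hypothetical callback-aware endpoint; random.dog does not actually support this):

function handleWoof(data) { // the server would have to emit handleWoof({...})
    console.log(data.url);
}

var script = document.createElement('script');
script.src = 'https://random.dog/woof.json?callback=handleWoof'; // hypothetical
document.body.appendChild(script);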
A third way to solve this problem would be to reverse proxy the remote server through the server you use for your site so that the origins match. This approach can work well in some circumstances but brings its own scaling and security considerations.
I was wondering if there is a way to check whether an external file is online or available with AngularJS (or JavaScript), and if I could determine its size in some way (that would be a plus). I first thought of using the head function of $http, but it turns out this doesn't always work, and I can't seem to find the size in the header information...
var file = { src: '//code.jquery.com/jquery.min.js' };
$http.head(file.src)
    .success(function (data, status, headers) {
        file.size = headers(['content-length']); // always empty
        file.status = 'online';
    })
    .error(function () {
        file.status = 'offline';
    });
Anyone any idea's?
The header you are looking for is HTTP content-length.
$http.head(file.src)
    .then(function (response) {
        // response.headers is the header getter function
        file.size = response.headers('content-length');
        file.status = 'online';
    });
Of course, since you are running on the browser you'll only be able to perform the request to your server or to any server that specify the CORS header Access-Control-Allow-Origin.
For everything else, you'll have to proxy the requests through your own server. Do notice, however, that not all servers respond to HEAD requests - in which case you'll have to perform a full GET and either cancel the request after a timeout (file is online and downloading) or an error is thrown (file is offline).
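A sketch of that GET fallback (assuming $q and $timeout are injected alongside $http; the 3-second cut-off is an arbitrary choice, not part of the original answer):

function checkFile(file) {
    var canceller = $q.defer();
    var timedOut = false;

    // Abort the GET after 3 seconds; if no error has occurred by then,
    // the file is reachable and still downloading.
    $timeout(function () {
        timedOut = true;
        canceller.resolve();
    }, 3000);

    $http.get(file.src, {timeout: canceller.promise})
        .then(function (response) {
            file.status = 'online';
            file.size = response.headers('content-length'); // may still be null
        })
        .catch(function () {
            file.status = timedOut ? 'online' : 'offline';
        });
}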
Ok. I have come to the conclusion that there is no (reliable) way to do this with JavaScript at this point in time. I tried implementing a solution by injecting a script tag into a hidden iframe and then checking the result with the onload or onerror events. However, when there is an error in the external file you are loading, your code stops as well. So even this method is not viable.
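For reference, a simplified sketch of that onload/onerror check (without the hidden iframe; fileUrl is a placeholder for the external file's URL):

var script = document.createElement('script');
script.src = fileUrl;
script.onload = function () { console.log('online'); };
script.onerror = function () { console.log('offline'); };
document.getElementsByTagName('head')[0].appendChild(script);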
So from this point I will search for the solution on the server-side.