I have been following a YouTube tutorial to learn web scraping. I managed to complete it, with some modifications because the website it targets has since changed its structure. The code used in the tutorial can be found here: https://github.com/beaucarnes/fcc-project-tutorials/blob/master/node-web-scraping/index.js. Now I want to modify this code to meet my own objective:
"Use a search engine from another website ec.europa.eu, and try and return the data (i.e. JOB-TITLE) from that page into my NODE console."
In the YouTube tutorial, the presenter named the event (request) used to retrieve the data but didn't explain how he found it. On the website I want to retrieve information from, 390 events are called on the page. I want to identify which Request URL is called when the search form is submitted. Screenshots are provided below:
I have searched through the events trying to find the one called for the search engine. I highlighted in the figure the event name that made the most sense to me, but I'm unsure whether it is the right one.
I also tried to find the event (Request URL) called by Stack Overflow's search engine, but couldn't find which JS event was called in Inspector > Network.
My objective is to be able to identify the specific events called on any website. Any information would be much appreciated, thanks! :D
UPDATE:
const cheerio = require('cheerio');
const Table = require('cli-table');
const rp = require('request-promise');
const table = new Table({
head: ['Job Title', 'URL']
});
const options = {
url: 'https://ec.europa.eu/eures/eures-searchengine/page/jv-search/search?lang=en&app=2.4.1-build-2',
json: true
}
rp(options).then(
(data) => {
console.log("DONE");
}
).catch(
(err) => {
console.log(err);
}
);
This will return the following error:
StatusCodeError: 500 - undefined
at new StatusCodeError (C:\Users\loizo\Desktop\eures_test\node_modules\request-promise-core\lib\errors.js:32:15)
at Request.plumbing.callback (C:\Users\loizo\Desktop\eures_test\node_modules\request-promise-core\lib\plumbing.js:104:33)
at Request.RP$callback [as _callback] (C:\Users\loizo\Desktop\eures_test\node_modules\request-promise-core\lib\plumbing.js:46:31)
at Request.self.callback (C:\Users\loizo\Desktop\eures_test\node_modules\request\request.js:185:22)
at Request.emit (events.js:315:20)
....
You're almost there. Within the network tools you can use the overview timeline to look at segments of the requests.
Open up the network tools of the site and make sure you clear all the requests first. Then do a search.
This is far easier to show in person or in a video, but here's a set of images to guide you in looking at specific parts of the requests made when an action is performed on a website.
See the images here. I've explained them individually below.
Image1:
Here I've already loaded the page you provided, clicked Inspect and opened the Network tab.
I'm clicking the button you can see in red to clear all of the recorded requests.
Image2:
This is what it should look like when you clear the requests
Image3:
I've done a search for developer and you can see the requests for this action down below.
Image4:
Now, in the overview, you can select portions of that action's requests and responses. Here I'm homing in on the first part of that action. You just have to click and drag; get a feel for this yourself.
I can now see those 5 requests down below.
The first four requests are GET requests and don't really tell us much.
The fifth request is a POST request. This is the one that posts data; the information on the right-hand side of the image tells you where it posts to and what response it gets back.
Image5:
Here is the same image as before, but I've scrolled down a bit to see the payload. These are the key things that need to be sent along with the POST HTTP request to do a search on this website.
Coding Example
Note the comments on my post. Below is a code example that gets the JSON data you desire.
A caveat: I have never coded in Node.js before, so please be mindful of that! It does work, however.
const cheerio = require('cheerio');
const Table = require('cli-table');
const rp = require('request-promise');
const table = new Table({
head: ['Job Title', 'URL']
});
const options = {
method: 'POST',
url: 'https://ec.europa.eu/eures/eures-searchengine/page/jv-search/search?lang=en&app=2.4.1-build-2',
json: true,
body: {
"keywords":[{"keyword":"developer","specificSearchCode":"EVERYWHERE"}],"positionScheduleCodes":[],"positionOfferingCodes":[],"educationLevelCodes":[],"euresFlagCodes":[],"nutsCodes":[],"notSpecifiedInNutsCodes":[],"requiredExperienceCodes":[],"solidarityContextCodes":[],"otherBenefitsCodes":[],"occupationUris":[],"includeJobsWithoutBenefits":false,"requiredLanguages":[],"includeJobsWithoutRequiredLanguages":false,"sortSearch":"BEST_MATCH","resultsPerPage":10,"page":1,"sessionId":"g07h0s8tfmmtfr5u9lible"
},
headers: {
'Connection': 'keep-alive',
'ajax-call': 'true',
'Accept': 'application/json, text/plain, */*',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
'Content-Type': 'application/json;charset=UTF-8',
'Origin': 'https://ec.europa.eu',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'https://ec.europa.eu/eures/eures-searchengine/page/main?lang=en',
'Accept-Language': 'en-US,en;q=0.9',
}
}
rp(options).then(
(data) => {
console.log("Got results =", data);
}
).catch(
(err) => {
console.log(err);
}
);
Explanation
In terms of additions to your own code, I've specified that we're making a POST request.
To get the additional things you need for a successful HTTP request, right-click the request in the network tools; among the options, you can copy it as cURL (bash). I used https://curl.trillworks.com/ to convert the curl command; you can select Node.js as the output.
I copied the headers as found on that website.
The body {} should contain our payload; in this case I copied the data string from curl.trillworks.
I get the desired output.
Additional Information
Reverse engineering HTTP requests is about mimicking the browser's requests closely enough that the server does not treat you as a bot.
You can try making a request to the server without anything else. In this case that did not work: you get a status code 500 error.
You have to think about the headers, cookies and parameters required to mimic the request.
Here you just needed the headers and the parameters (that is, the search terms you enter when you do a search on this website).
Remember it's a POST HTTP request: you are giving the server information and expecting a response based on that post.
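To search for something other than "developer", the payload can be built by a small function rather than pasted as one long literal. The following is a minimal sketch: buildSearchBody is a hypothetical helper name, the field names are copied from the captured request above, and whether the server actually validates sessionId is not known, so it is simply passed through.

```javascript
// Hypothetical helper: rebuilds the captured search payload for any keyword.
// Field names come from the captured POST body; nothing here is an official API.
function buildSearchBody(keyword, page, resultsPerPage, sessionId) {
  return {
    keywords: [{ keyword: keyword, specificSearchCode: 'EVERYWHERE' }],
    positionScheduleCodes: [],
    positionOfferingCodes: [],
    educationLevelCodes: [],
    euresFlagCodes: [],
    nutsCodes: [],
    notSpecifiedInNutsCodes: [],
    requiredExperienceCodes: [],
    solidarityContextCodes: [],
    otherBenefitsCodes: [],
    occupationUris: [],
    includeJobsWithoutBenefits: false,
    requiredLanguages: [],
    includeJobsWithoutRequiredLanguages: false,
    sortSearch: 'BEST_MATCH',
    resultsPerPage: resultsPerPage,
    page: page,
    // Assumption: copied from a browser session; the server may or may not check it.
    sessionId: sessionId
  };
}

// Usage with the options object from the example:
// options.body = buildSearchBody('developer', 1, 10, 'g07h0s8tfmmtfr5u9lible');
```

Changing the page argument should then request later result pages, assuming the server pages results the way the captured request suggests.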
Additional Links
Request Docs: https://www.npmjs.com/package/request-promise
This was helpful in writing the request for JSON. It didn't fully explain that the body parameter could contain any data we wanted; I took that leap and it worked.
https://beshaimakes.com/js-scrape-data#case-1--using-apis-directly
Useful for additional scraping background, plus a bit about JSON scraping. It doesn't quite explain the headers part, but hopefully with this example you can follow along.
https://stackabuse.com/the-node-js-request-module/
Useful in getting my head around the Request library, found this after I made your code work.
I'm trying to write a piece of Zap code with Run JavaScript to test the HTTP header response of a URL GET. Specifically, I'm interested in the return status and the location (basically, if it's a 302, want to know what the redirect location is).
fetch('https://www.example.com/', { method: 'GET', redirect: 'manual' })
.then(function(res) {
return res.json();
})
.then(function(json) {
var output = {status: json.status, location: json.headers.get('location')};
callback(null, output);
})
.catch(callback);
I've tried the above, but (a) the test always returns rawHTML (which suggests it's following the redirect), and (b) the output variables in the Send Outbound Zap step don't pick up anything useful (again, "Raw HTML", "ID", "Runtime Meta Logs", etc., but nothing about my headers).
You may not be able to access the Location header due to the same origin policy in most browsers: https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy
Furthermore, you can't stop the AJAX call from following a redirect, so that may cause you issues: How to prevent jQuery ajax from following a redirect after a post?
It looks like you are using the new built-in fetch function: https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch
If so, in the provided example, I don't think you need the .json() call. I got the code to run as shown below, but there is no redirect at example.com, so I'm not sure exactly how your situation will behave. Also, keep in mind the same-origin policy, which will likely prohibit you from accessing the location header.
var callback = function(a,b){
console.log(a,b)
};
fetch('https://www.example.com/', { method: 'GET', redirect: 'manual' })
.then(function(res) {
console.log (res)
var output = {status: res.status, location: res.headers.get('location')};
callback(null, output);
})
.catch(callback);
If you control the server resource, then you could possibly do something on the server, like adding another header that won't be blocked, many sites do that adding a X-Location option that browsers don't block.
We're using node-fetch (https://github.com/bitinn/node-fetch#options) under the hood, so look at its redirect option, which defaults to 'follow'. The documentation even mentions this exact case: set it to manual to extract redirect headers.
You might try experimenting with a local Node.js REPL to figure it out. If you see it working locally but not in Zapier, just report a bug to contact#zapier.com.
I managed to get this code working:
fetch('https://www.example.com/', { method: 'GET', redirect: 'manual', follow: 0 })
.then(function(res) {
var output = {status: res.status, location: res.headers._headers.location};
callback(null, output);
})
.catch(callback);
The underlying issue appears to be (as evidenced by the output variables with "id" and "rawHTML") that the fields were somehow "stuck". When I (1) deleted the Run Javascript step, (2) reinserted a new one with the above code, the correct output fields were then returned and subsequently became available to the Send Outbound Email step.
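The working code reads res.headers._headers, which looks like an internal field rather than a documented one. A minimal sketch of the same check using the documented headers.get() accessor follows. readRedirect is a hypothetical name, and fetchImpl is passed in only so the function can be tried with a stub; inside a Run JavaScript step you would pass the global fetch.

```javascript
// Sketch: report the status and Location header without following the redirect.
// Assumes the response's headers object exposes get(), as the Fetch API and
// node-fetch both document.
function readRedirect(fetchImpl, url) {
  return fetchImpl(url, { method: 'GET', redirect: 'manual' })
    .then(function (res) {
      return {
        status: res.status,
        // get() returns null when the header is absent
        location: res.headers.get('location')
      };
    });
}

// Usage inside the Zap (callback is provided by the Run JavaScript step):
// readRedirect(fetch, 'https://www.example.com/')
//   .then(function (output) { callback(null, output); })
//   .catch(callback);
```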
How to detect the Internet connection is offline in JavaScript?
Almost all major browsers now support the window.navigator.onLine property, and the corresponding online and offline window events. Run the following code snippet to test it:
console.log('Initially ' + (window.navigator.onLine ? 'on' : 'off') + 'line');
window.addEventListener('online', () => console.log('Became online'));
window.addEventListener('offline', () => console.log('Became offline'));
document.getElementById('statusCheck').addEventListener('click', () => console.log('window.navigator.onLine is ' + window.navigator.onLine));
<button id="statusCheck">Click to check the <tt>window.navigator.onLine</tt> property</button><br /><br />
Check the console below for results:
Try setting your system or browser in offline/online mode and check the log or the window.navigator.onLine property for the value changes.
Note however this quote from Mozilla Documentation:
In Chrome and Safari, if the browser is not able to connect to a local area network (LAN) or a router, it is offline; all other conditions return true. So while you can assume that the browser is offline when it returns a false value, you cannot assume that a true value necessarily means that the browser can access the internet. You could be getting false positives, such as in cases where the computer is running a virtualization software that has virtual ethernet adapters that are always "connected." Therefore, if you really want to determine the online status of the browser, you should develop additional means for checking.
In Firefox and Internet Explorer, switching the browser to offline mode sends a false value. Until Firefox 41, all other conditions return a true value; since Firefox 41, on OS X and Windows, the value will follow the actual network connectivity.
(emphasis is my own)
This means that if window.navigator.onLine is false (or you get an offline event), you are guaranteed to have no Internet connection.
If it is true however (or you get an online event), it only means the system is connected to some network, at best. It does not mean that you have Internet access for example. To check that, you will still need to use one of the solutions described in the other answers.
I initially intended to post this as an update to Grant Wagner's answer, but it seemed too much of an edit, especially considering that the 2014 update was already not from him.
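The two-step reasoning above (false is conclusive, true needs a real request) can be sketched as one function. This is only an illustration: checkConnectivity and the three result labels are hypothetical names, and probe stands for whatever request you choose to make against your own server.

```javascript
// Sketch: trust navigator.onLine only when it says "offline";
// otherwise confirm with a real request supplied by the caller.
async function checkConnectivity(nav, probe) {
  if (nav.onLine === false) {
    return 'offline';          // no network at all, no point in probing
  }
  try {
    await probe();             // any resolved probe means the request got through
    return 'online';
  } catch (err) {
    return 'unreachable';      // on some network, but the probe failed
  }
}

// Browser usage (the favicon URL is an assumption, use any small resource you host):
// checkConnectivity(window.navigator, () =>
//   fetch('/favicon.ico?d=' + Date.now(), { method: 'HEAD' })
// ).then((state) => console.log(state));
```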
You can determine that the connection is lost by making failed XHR requests.
The standard approach is to retry the request a few times. If it doesn't go through, alert the user to check the connection, and fail gracefully.
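A minimal sketch of that retry approach, assuming the request is wrapped in a function that returns a promise. The name retry and its parameters are illustrative; attempts is the total number of tries and delayMs is the pause between them.

```javascript
// Sketch: run task up to `attempts` times, waiting delayMs between tries.
// Resolves with the first success, rejects with the last error.
function retry(task, attempts, delayMs) {
  return Promise.resolve()
    .then(task)
    .catch(function (err) {
      if (attempts <= 1) {
        throw err; // out of tries: let the caller alert the user
      }
      return new Promise(function (resolve) { setTimeout(resolve, delayMs); })
        .then(function () { return retry(task, attempts - 1, delayMs); });
    });
}

// Usage (saveDraft is a hypothetical function returning a promise):
// retry(saveDraft, 3, 2000)
//   .catch(function () { alert('Please check your connection.'); });
```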
Sidenote: To put the entire application in an "offline" state may lead to a lot of error-prone work of handling state.. wireless connections may come and go, etc. So your best bet may be to just fail gracefully, preserve the data, and alert the user.. allowing them to eventually fix the connection problem if there is one, and to continue using your app with a fair amount of forgiveness.
Sidenote: You could check a reliable site like Google for connectivity, but this may not be as useful as just trying to make your own request, because while Google may be available, your own application may not be, and you're still going to have to handle your own connection problem. Trying to send a ping to Google would be a good way to confirm that the internet connection itself is down, so if that information is useful to you, then it might be worth the trouble.
Sidenote: Sending a ping could be achieved in the same way that you would make any kind of two-way Ajax request, but sending a ping to Google, in this case, would pose some challenges. First, we'd have the same cross-domain issues that are typically encountered in making Ajax communications. One option is to set up a server-side proxy, wherein we actually ping Google (or whatever site) and return the results of the ping to the app. This is a catch-22, because if the internet connection is actually the problem, we won't be able to get to the server, and if the connection problem is only on our own domain, we won't be able to tell the difference. Other cross-domain techniques could be tried, for example, embedding an iframe in your page which points to google.com, and then polling the iframe for success/failure (examine the contents, etc.). Embedding an image may not really tell us anything, because we need a useful response from the communication mechanism in order to draw a good conclusion about what's going on. So again, determining the state of the internet connection as a whole may be more trouble than it's worth. You'll have to weigh these options for your specific app.
IE 8 will support the window.navigator.onLine property.
But of course that doesn't help with other browsers or operating systems. I predict other browser vendors will decide to provide that property as well given the importance of knowing online/offline status in Ajax applications.
Until that happens, either XHR or an Image() or <img> request can provide something close to the functionality you want.
Update (2014/11/16)
Major browsers now support this property, but your results will vary.
Quote from Mozilla Documentation:
In Chrome and Safari, if the browser is not able to connect to a local area network (LAN) or a router, it is offline; all other conditions return true. So while you can assume that the browser is offline when it returns a false value, you cannot assume that a true value necessarily means that the browser can access the internet. You could be getting false positives, such as in cases where the computer is running a virtualization software that has virtual ethernet adapters that are always "connected." Therefore, if you really want to determine the online status of the browser, you should develop additional means for checking.
In Firefox and Internet Explorer, switching the browser to offline mode sends a false value. All other conditions return a true value.
if(navigator.onLine){
alert('online');
} else {
alert('offline');
}
There are a number of ways to do this:
AJAX request to your own website. If that request fails, there's a good chance it's the connection at fault. The JQuery documentation has a section on handling failed AJAX requests. Beware of the Same Origin Policy when doing this, which may stop you from accessing sites outside your domain.
You could put an onerror in an img, like <img src="http://www.example.com/singlepixel.gif" onerror="alert('Connection dead');" />.
This method could also fail if the source image is moved / renamed, and would generally be an inferior choice to the ajax option.
So there are several different ways to try and detect this, none perfect, but in the absence of the ability to jump out of the browser sandbox and access the user's net connection status directly, they seem to be the best options.
As olliej said, using the navigator.onLine browser property is preferable to sending network requests and, according to developer.mozilla.org/En/Online_and_offline_events, it is even supported by old versions of Firefox and IE.
Recently, the WHATWG has specified the online and offline events, in case you need to react to navigator.onLine changes.
Please also pay attention to the link posted by Daniel Silveira, which points out that relying on those events or that property for syncing with the server is not always a good idea.
You can use $.ajax()'s error callback, which fires if the request fails. If textStatus equals the string "timeout" it probably means connection is broken:
function (XMLHttpRequest, textStatus, errorThrown) {
// typically only one of textStatus or errorThrown
// will have info
this; // the options for this ajax request
}
From the doc:
Error: A function to be called if the request fails. The function is passed three arguments: The XMLHttpRequest object, a string describing the type of error that occurred and an optional exception object, if one occurred. Possible values for the second argument (besides null) are "timeout", "error", "notmodified" and "parsererror". This is an Ajax Event.
So for example:
$.ajax({
type: "GET",
url: "keepalive.php",
success: function(msg){
alert("Connection active!")
},
error: function(XMLHttpRequest, textStatus, errorThrown) {
if(textStatus == 'timeout') {
alert('Connection seems dead!');
}
}
});
window.navigator.onLine is what you are looking for, but there are a few things to add. If your app needs to keep checking (for example, to notice when the user suddenly goes offline), you also need to listen for changes by adding event listeners to window. To detect the user going offline, you can do:
window.addEventListener("offline",
()=> console.log("No Internet")
);
and for checking if online:
window.addEventListener("online",
()=> console.log("Connected Internet")
);
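If the listeners need to be removed again later (for example when a view is torn down), both can be wrapped in one function that returns a cleanup callback. This is a sketch only: watchConnection is a hypothetical name, and target is a parameter so the same code works with window or with a test double.

```javascript
// Sketch: subscribe to both events and get back a function that unsubscribes.
function watchConnection(target, onChange) {
  const handleOnline = () => onChange(true);
  const handleOffline = () => onChange(false);
  target.addEventListener('online', handleOnline);
  target.addEventListener('offline', handleOffline);
  // Call the returned function to stop listening.
  return function stop() {
    target.removeEventListener('online', handleOnline);
    target.removeEventListener('offline', handleOffline);
  };
}

// Browser usage:
// const stop = watchConnection(window, (isOnline) =>
//   console.log(isOnline ? 'Connected Internet' : 'No Internet'));
// stop(); // later, when no longer needed
```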
The HTML5 Application Cache API specifies navigator.onLine, which is currently available in the IE8 betas, WebKit (eg. Safari) nightlies, and is already supported in Firefox 3
I had to make an Ajax-based web app for a customer who works a lot with schools. These schools often have a bad internet connection, so I use this simple function to detect whether there is a connection. It works very well!
I use CodeIgniter and jQuery:
function checkOnline() {
setTimeout(doOnlineCheck, 20000); // pass the function itself instead of a string to be evaluated
}
function doOnlineCheck() {
//if the server can be reached it returns 1, otherwise it times out
var submitURL = $("#base_path").val() + "index.php/menu/online";
$.ajax({
url : submitURL,
type : "post",
dataType : "text", // "msg" is not a dataType that jQuery recognises
timeout : 5000,
success : function(msg) {
if(msg==1) {
$("#online").addClass("online");
$("#online").removeClass("offline");
} else {
$("#online").addClass("offline");
$("#online").removeClass("online");
}
checkOnline();
},
error : function() {
$("#online").addClass("offline");
$("#online").removeClass("online");
checkOnline();
}
});
}
An Ajax call to your own domain is the easiest way to detect whether you are offline:
$.ajax({
type: "HEAD",
url: document.location.pathname + "?param=" + new Date(),
error: function() { return false; },
success: function() { return true; }
});
This is just to illustrate the concept; it should be improved.
For example, a 404 error should still mean that you are online.
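One possible improvement along those lines is to look at the status code inside the error callback: XMLHttpRequest reports a status of 0 when no HTTP response arrived at all, whereas a 404 or 500 proves that a server answered. A minimal sketch, with classifyStatus and its labels as hypothetical names:

```javascript
// Sketch: decide what an XHR status code says about connectivity.
function classifyStatus(status) {
  if (status === 0) {
    return 'no-response';   // network failure, timeout or aborted request
  }
  if (status >= 200 && status < 400) {
    return 'ok';            // the request succeeded
  }
  return 'online-with-error'; // e.g. 404 or 500: the server answered, so we are online
}

// Usage inside the error callback of the call above:
// error: function (xhr) { console.log(classifyStatus(xhr.status)); }
```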
I know this question has already been answered, but I would like to add my 10 cents explaining what's better and what's not.
Window.navigator.onLine
I noticed some answers mentioned this option, but none of them mentioned its caveats.
This option uses window.navigator.onLine, a property of the browser's Navigator interface that is available in most modern browsers. It is not really a viable option for checking internet availability because, firstly, it is browser-centric and, secondly, browsers implement this property differently.
In Firefox: The property returns a boolean value, with true meaning online and false meaning offline, but the caveat here is that "the value is only updated when the user follows links or when a script requests a remote page." Hence, if the user goes offline and you query the property from a JS function or script, the property will always return true until the user follows a link.
In Chrome and Safari: If the browser is not able to connect to a local area network (LAN) or a router, it is offline; all other conditions return true. So while you can assume that the browser is offline when it returns a false value, you cannot assume that a true value necessarily means that the browser can access the internet. You could be getting false positives, such as in cases where the computer is running a virtualization software that has virtual ethernet adapters that are always "connected".
The statements above simply show that the browser alone cannot tell you whether there is internet access, so this option is unreliable.
Sending Request to Own Server Resource
This involves making an HTTP request to a resource on your own server: if it is reachable, assume internet availability; otherwise assume the user is offline. There are a few caveats to this option.
No server is 100% reliable, so if for some reason your server is not reachable, it would be falsely assumed that the user is offline even though they are connected to the internet.
Multiple requests to the same resource can return a cached response, making the HTTP response result unreliable.
If you are confident your server is always online, you can go with this option.
Here is a simple snippet to fetch your own resource:
// This fetches your website's favicon, so replace path with favicon url
// Notice the appended date param which helps prevent browser caching.
fetch('/favicon.ico?d='+Date.now())
.then(response => {
if (!response.ok)
throw new Error('Network response was not ok');
// At this point we can safely assume the user has connection to the internet
console.log("Internet connection available");
})
.catch(error => {
// The resource could not be reached
console.log("No Internet connection", error);
});
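The snippet above waits for as long as the browser lets the request hang. A variation with an explicit timeout, sketched below, uses AbortController to cancel the request and the fetch cache: 'no-store' option as a second guard against cached responses. probeWithTimeout is a hypothetical name, fetchImpl is passed in only to allow testing with a stub, and unlike the snippet above this sketch treats any HTTP response (even an error status) as proof that the server was reached.

```javascript
// Sketch: resolve to true if any response arrives within timeoutMs, else false.
function probeWithTimeout(fetchImpl, url, timeoutMs) {
  const controller = new AbortController();
  // Abort the request if it has not settled in time.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return fetchImpl(url, {
    method: 'HEAD',
    cache: 'no-store',            // bypass the HTTP cache entirely
    signal: controller.signal
  })
    .then(() => true)             // a response of any status reached us
    .catch(() => false)           // network error or abort
    .finally(() => clearTimeout(timer));
}

// Browser usage:
// probeWithTimeout(fetch, '/favicon.ico?d=' + Date.now(), 5000)
//   .then((reachable) => console.log(reachable ? 'Connection available' : 'Connection lost'));
```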
Sending Request to Third-Party Server Resource
We all know CORS is a thing.
This option involves making an HTTP request to an external server resource: if it is reachable, assume internet availability; otherwise assume the user is offline. The major caveat is cross-origin resource sharing, which acts as a limitation. Most reputable websites block cross-origin requests, but a few allow them.
Below is a simple snippet to fetch an external resource, the same as above but with an external resource URL:
// Firstly you trigger a resource available from a reputable site
// For demo purpose you can use the favicon from MSN website
// Also notice the appended date param which helps skip browser caching.
fetch('https://static-global-s-msn-com.akamaized.net/hp-neu/sc/2b/a5ea21.ico?d='+Date.now())
.then(response => {
// Check if the response is successful
if (!response.ok)
throw new Error('Network response was not ok');
// At this point we can safely say the user has connection to the internet
console.log("Internet available");
})
.catch(error => {
// The resource could not be reached
console.log("No Internet connection", error);
});
Finally, for my personal project I went with the second option, which requests a resource on my own server, because many factors determine whether there is an "Internet connection" on a user's device; it cannot be decided from your website's container alone or from a limited browser API.
Remember that your users can also be in an environment where some websites or resources are blocked, prohibited or otherwise inaccessible, which in turn affects the logic of the connectivity check. The best bet is:
Try to access a resource on your own server, because this is your users' environment (typically I use the website's favicon because the response is very light and it is not frequently updated).
If there is no connection to the resource, simply say "Error in connection" or "Connection lost" when you need to notify the user, rather than assuming a broad "No internet connection", which depends on many factors.
I think it is a very simple way.
var x = confirm("Are you sure you want to submit?");
if (x) {
if (navigator.onLine == true) {
return true;
}
alert('Internet connection is lost');
return false;
}
return false;
The problem with methods like navigator.onLine is that they are not supported by some browsers and mobile versions. An option that helped me a lot was to use the classic XMLHttpRequest method, and also to allow for the case where the file is served from cache, by treating any XMLHttpRequest.status of at least 200 and less than 304 as online.
Here is my code:
var xhr = new XMLHttpRequest();
//index.php is in my web
xhr.open('HEAD', 'index.php', true);
// Attach the listener before sending so that no state change can be missed
xhr.addEventListener("readystatechange", processRequest, false);
xhr.send();
function processRequest(e) {
if (xhr.readyState == 4) {
//If you use a cache storage manager (service worker), it is likely that the
//index.php file will be available even without internet, so do the following validation
if (xhr.status >= 200 && xhr.status < 304) {
console.log('On line!');
} else {
console.log('Offline :(');
}
}
}
I was looking for a client-side solution to detect if the internet was down or my server was down. The other solutions I found always seemed to be dependent on a 3rd party script file or image, which to me didn't seem like it would stand the test of time. An external hosted script or image could change in the future and cause the detection code to fail.
I've found a way to detect it by looking for an xhrStatus with a 404 code. In addition, I use JSONP to bypass the CORS restriction. A status code other than 404 shows the internet connection isn't working.
$.ajax({
url: 'https://www.bing.com/aJyfYidjSlA' + new Date().getTime() + '.html',
dataType: 'jsonp',
timeout: 5000,
error: function(xhr) {
if (xhr.status == 404) {
//internet connection working
}
else {
//internet is down (xhr.status == 0)
}
}
});
How about sending an opaque http request to google.com with no-cors?
fetch('https://google.com', {
method: 'GET', // *GET, POST, PUT, DELETE, etc.
mode: 'no-cors',
}).then((result) => {
console.log(result)
}).catch(e => {
console.error(e)
})
The reason for setting no-cors is that I was receiving CORS errors even when disabling the network connection on my PC, so I was getting blocked by CORS with or without an internet connection. Adding no-cors makes the request opaque, which apparently bypasses CORS and lets me simply check whether I can connect to Google.
FYI: I'm using fetch here to make the HTTP request.
https://www.npmjs.com/package/fetch
My way.
<!-- the file named "tt.jpg" should exist in the same directory -->
<script>
function testConnection(callBack)
{
document.getElementsByTagName('body')[0].innerHTML +=
'<img id="testImage" style="display: none;" ' +
'src="tt.jpg?' + Math.random() + '" ' +
'onerror="testConnectionCallback(false);" ' +
'onload="testConnectionCallback(true);">';
testConnectionCallback = function(result){
callBack(result);
var element = document.getElementById('testImage');
element.parentNode.removeChild(element);
}
}
</script>
<!-- usage example -->
<script>
function myCallBack(result)
{
alert(result);
}
</script>
<a href=# onclick=testConnection(myCallBack);>Am I online?</a>
Just use navigator.onLine: if it is true you're online, otherwise you're offline.
Send a HEAD request in the error handler of the original request:
$.ajax({
url: "/your_url",
type: "POST or GET",
data: your_data,
success: function(result){
//do stuff
},
error: function(xhr, status, error) {
//detect if user is online and avoid the use of async
$.ajax({
type: "HEAD",
url: document.location.pathname,
error: function() {
//user is offline, do stuff
console.log("you are offline");
}
});
}
});
You can try this; it returns true if the network is connected:
function isInternetConnected(){return navigator.onLine;}
Here is a snippet of a helper utility I have. This is namespaced javascript:
network: function() {
var state = navigator.onLine ? "online" : "offline";
return state;
}
You should use this with method detection else fire off an 'alternative' way of doing this. The time is fast approaching when this will be all that is needed. The other methods are hacks.
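A minimal sketch of that feature-detection idea: use navigator.onLine when the browser exposes it as a boolean, and otherwise fall back to an alternative check supplied by the caller. getNetworkState and the fallback parameter are hypothetical names for illustration.

```javascript
// Sketch: prefer navigator.onLine when it exists, otherwise use a fallback.
function getNetworkState(nav, fallback) {
  if (nav && typeof nav.onLine === 'boolean') {
    return nav.onLine ? 'online' : 'offline';
  }
  // Property not supported: defer to the caller's alternative method.
  return fallback();
}

// Browser usage (the fallback here just reports that the state is unknown):
// var state = getNetworkState(window.navigator, function () { return 'unknown'; });
```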
There are two answers to this, for two different scenarios:
If you are using JavaScript on a website (i.e. any front-end code)
The simplest way to do it is:
<h2>The Navigator Object</h2>
<p>The onLine property returns true if the browser is online:</p>
<p id="demo"></p>
<script>
document.getElementById("demo").innerHTML = "navigator.onLine is " + navigator.onLine;
</script>
But if you're using JS on the server side (i.e. Node etc.), you can determine that the connection is lost by making failed XHR requests.
The standard approach is to retry the request a few times. If it doesn't go through, alert the user to check the connection, and fail gracefully.