casperjs: trouble scraping an item viewed number from a page - javascript

The page I want to scrape is http://v.qq.com/page/k/9/2/k0188qdxy92.html. It is hosted in China, so it takes some time to load. The only data I want from this page is the play count of the video, shown to the lower right of the player; its selector is shown in the picture.
When you open the page, you will notice that this number is displayed much later than the other parts of the page.
var time1 = Date.now();
var time2;
var casper = require('casper').create();
var url = 'http://v.qq.com/page/k/9/2/k0188qdxy92.html';
casper.start(url,function(){
time2 = Date.now();
console.log((time2-time1)/1000);
this.echo(this.fetchText('.played_count em'));
})
casper.run();
This is what I tried at first. Yesterday it worked, but today it just prints a blank line every time and returns to the shell. I think this is probably because the number is requested asynchronously and the network is slow. So I added a wait to the script:
var time1 = Date.now();
var time2;
var casper = require('casper').create();
var url = 'http://v.qq.com/page/k/9/2/k0188qdxy92.html';
casper.start(url);
casper.wait('6000',function(){
time2 = Date.now();
console.log((time2-time1)/1000);
this.echo(this.fetchText('.played_count em'));
})
casper.run();
Although it is slow to open the page, 60s is absolutely enough. However, this is what I get:
You can see that I get the right number in only 1 out of 4 attempts, which means my code works, but something else is preventing me from always getting the right data. What is it? Could it be the network, or some script on the page?
I also tried using waitForSelector and waitFor, but every time I get an error message like waittimeout expired, exiting, even though I set the waitTimeout option to 30000 or 60000. I am really stuck here. I am new to casperjs, but I've successfully scraped similar data from other video sites' pages. What is so special about this one?

Related

HTML or Javascript for switching url at a specific time

I am from Germany, sorry for my English.
I am searching for a simple HTML or Javascript code to switch to another URL at a specific time.
I run a landing page with an offer that closes on April 16 at ten o'clock, for example. I need a little script or code that redirects to another URL once the offer has closed.
I am thankful for any help.
Best regards
Marco
If the person has the landing page loaded and is viewing it, and you want to send them to a different URL when the offer closes, you can calculate how much time until that happens when they arrive at the landing page then send them to the new URL when that time is reached.
setTimeout will run a function after a timer has expired, and that function can change the window.location sending them to the new URL.
When the landing page loads get the current time and the time the offer expires:
let now = new Date();
let expires = new Date('2021-04-16T22:00:00'); // ISO 8601 format, parsed as local time
Subtracting gives you the number of milliseconds until you reach the expires time, which is convenient because that's what the setTimeout function wants for its "delay" parameter.
Verbosely, this could look like:
const now = new Date();
const expires = new Date('2021-04-16T22:00:00'); // ISO 8601 format, parsed as local time
const delay = expires - now;
window.setTimeout(function() {
window.location = 'https://example.com/';
}, delay);
<div>
<p>This is the landing page</p>
</div>
This does not loop waiting for time to expire, as DCR warns in a comment; it just sets a timer, which the browser then takes care of.
All the math could be collapsed without using variables, so it becomes
window.setTimeout(function() {...etc...}, new Date('2021-04-16T22:00:00') - new Date());
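A slightly more defensive variant of the timer above, as a rough sketch. It clamps the delay to zero when the offer has already closed, and caps it at the largest delay that setTimeout handles reliably (a signed 32-bit number of milliseconds, roughly 24.8 days), re-arming the timer if the cap was hit. The helper names redirectDelay and scheduleRedirect and the example URL are made up for illustration.

```javascript
// Largest delay setTimeout handles reliably: a signed 32-bit number of milliseconds.
const MAX_TIMEOUT_MS = 2 ** 31 - 1;

// Hypothetical helper: milliseconds to wait before the next check or redirect.
function redirectDelay(expires, now) {
  const remaining = expires.getTime() - now.getTime();
  if (remaining <= 0) return 0; // offer already closed: redirect right away
  return Math.min(remaining, MAX_TIMEOUT_MS);
}

// Hypothetical helper: redirects to url once expires is reached (browser only).
function scheduleRedirect(expires, url) {
  setTimeout(function () {
    if (Date.now() >= expires.getTime()) {
      window.location = url;
    } else {
      scheduleRedirect(expires, url); // the delay was capped, so wait again
    }
  }, redirectDelay(expires, new Date()));
}

// Usage on the landing page (assumed date and URL):
// scheduleRedirect(new Date('2021-04-16T22:00:00'), 'https://example.com/');
```

For an offer that closes within a few days the cap never comes into play, so this behaves like the shorter version.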
Here is a script based on the time you mentioned. After that time, the link changes automatically.
"https://jsfiddle.net/rounak1/2pxLu5df/2/"
Here is the code:
let d1 = new Date()
var d2 = new Date('2021-04-16 22:00:00.00');
if (d1 > d2) {
window.location = 'https://www.google.com/'
}
I believe that this problem is better handled by your backend that can redirect the user to another page before the page loads since, if you do this in javascript, the page will have to load first.
However, since you want this to be implemented in Javascript you can do something like below:
var time = moment("16/04/2021 10:00", "DD/MM/YYYY HH:mm");
var now = new Date();
if (now > time) { // redirect only once the offer has ended
  window.location.replace("replace with your url"); // Without allowing the user to hit the back button
}
In the above, time stores the datetime when the offer ends in the format enclosed as a string (read about this here: https://momentjs.com/docs/#/parsing/string-format/). This is compared to the time now to implement a redirect.
Read more about the redirect here: https://www.w3schools.com/howto/howto_js_redirect_webpage.asp
The above requires the usage of the moment.js library which you can find here: https://momentjs.com/
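If pulling in moment.js only for this comparison feels heavy, the same check can be sketched with the built-in Date and an ISO 8601 string. The +02:00 offset assumes the offer closes at 10:00 German summer time; the name offerHasClosed and the URL are placeholders, not part of the answer above.

```javascript
// Assumed closing time: 16 April 2021, 10:00 in Germany (CEST, UTC+2).
const OFFER_ENDS = new Date('2021-04-16T10:00:00+02:00');

// Hypothetical helper: true once the given moment is past the closing time.
function offerHasClosed(now) {
  return now.getTime() > OFFER_ENDS.getTime();
}

// Browser-only part: redirect without keeping the landing page in the history.
if (typeof window !== 'undefined' && offerHasClosed(new Date())) {
  window.location.replace('https://example.com/offer-closed'); // placeholder URL
}
```

Stating the offset explicitly means visitors in other time zones are redirected at the same instant, rather than at 10:00 by their own clock.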

Keep the JS/jQuery code working in Safari when the tab is not active

I have the JS/jQuery code shown below, and I want it to keep working when the session tab is not active.
The following code works perfectly fine in Google Chrome, but it doesn't work in Safari.
jQuery(document).ready(function ($) {
let lastActivity = <?php echo time(); ?>; // Line A
let now = <?php echo time(); ?>;
let logoutAfter = 3600; // page will log out after 3600 seconds (1 hour) if there is no activity
let userName = "<?php echo $_SESSION['user_name']; ?>";
let timer = setInterval(function () {
now++;
let delta = now - lastActivity;
console.log(delta); // Line A
if (delta > logoutAfter) {
clearInterval(timer);
//DO AJAX REQUEST TO close.php
$.ajax({
url: "/control/admin.php",
type: 'GET', // GET also fine
data: {action: 'logout', user_name: userName},
success: function (data) {
window.location.href = "admin.php";
},
error: function (jqXHR, textStatus, errorThrown) {
alert(textStatus);
}
});
}
}, 1000); //<-- you can increase it( till <= logoutAfter ) for better performance as suggested by #"Space Coding"
});
The value logged at Line A doesn't get incremented in Safari when the tab is not active, but in Google Chrome it works as expected.
You can replace the counter (which counts seconds) with a calculation of the time difference.
let lastActivity = new Date();
let logoutAfter = 3600 * 1000; // getTime() differences are in milliseconds
...
let delta = (new Date()).getTime() - lastActivity.getTime();
if (delta > logoutAfter) {
...
}
P.S. This works even if the script itself is frozen while the tab is inactive. The interval handler will be called at the moment the user activates the tab.
This approach will not work properly with multiple tabs open. If the user opens a new tab and starts working in it, the earlier tab will log the user out because they are not active in that tab.
To overcome this, I suggest checking the last active time on the server with an ajax call instead of doing it with JavaScript only.
According to this very thorough (but old) answer, setInterval() execution on inactive tabs is limited to max 1/s, on both Safari and Chrome - but not stopped. There are also plenty of questions here on SO about Javascript getting paused or de-prioritised on inactive tabs, some of which include solutions:
How can I make setInterval also work when a tab is inactive in Chrome?
iOS 5 pauses JavaScript when tab is not active
Safari JavaScript setTimeout stops when minimized
Chrome: timeouts/interval suspended in background tabs?
Probably the best option to do what you are trying is to use Web workers:
Web Workers are a simple means for web content to run scripts in background threads. The worker thread can perform tasks without interfering with the user interface.
There is an example of how to do that in an answer to one of the questions above.
But there is also a much simpler option, though you should evaluate if it is safe considering you are relying on this to log users out.
My testing of your code reflects the question I linked to earlier which describes setInterval() being slowed, but not stopped. For me, Safari (v 13.1, macOS 10.14.6) does not actually fully pause Javascript, but slows down execution of the loop, by increasing amounts. I see this by opening the dev console, and watching the output of the console.log(delta) messages - they slow right down, first running only every 2s, then 4s, and so on, though sometimes faster. But they do not stop.
That output also gives a hint about the problem, and the solution. The delta values shown on the console do not represent the real time difference since lastActivity. They are just incrementing numbers. If you see a delta value appear on the console 10 seconds after the last one, it should logically be +10, right? But it is not, it is just one higher.
And that's the problem here - the code is not counting the true time difference, it is just counting iterations of the loop:
let timer = setInterval(function () {
now++; // <-- problem
This code correctly sets now to the current time only if setInterval() runs exactly every second. But we know that when the tab is inactive, it does not. In that case it is just counting the number of times the loop runs, which has no relation to the real time elapsed.
To solve this problem, we have to determine now based on the real time. To do that, let's switch to using JS to calculate our timestamps (PHP is rendered only once, on page load, so if you use it inside the loop it will just stay fixed at the initial value):
// Note that JS gives us milliseconds, not seconds
let lastActivity = Date.now();
let now = Date.now();
let logoutAfter = 3600 * 1000;
let timer = setInterval(function () {
// PHP won't work, time() is rendered only once, on page load
// let now = <?php echo time(); ?>;
now = Date.now();
let delta = now - lastActivity;
console.log('New timer loop, now:', now, '; delta:', delta);
Now, even if there is a pause of 10s between iterations, delta will be the true measure of time elapsed since the page was loaded. So even if the user switches away to another tab, every time the loop runs, it will correctly track time, even if it doesn't happen every second.
So what does this mean in your case?
According to your report, JS is not running at all in the inactive tab. In that case, it can happen that the tab stays in the logged-in state, long past the time the user should have been logged out. However, assuming JS starts up again when you switch back the tab, the very first iteration of the loop will correctly calculate the time elapsed. If it is greater than your logout period, you will be logged out. So even though the tab stayed logged in longer than it should have, the user can't use it, since as soon as they switch to it they will be logged out. Note that "as soon" actually means "within 1 second plus the time it takes for the AJAX query to successfully log the user out".
In my testing, JS does not stop in an inactive Safari tab, but slows right down. In this case, it would mean that the user would be automatically logged out on the inactive tab, though not right at the time they should be. If the loop runs say every 8s, it could mean that the user would be logged out up to 7s later than they should have been. If iterations slow down even more, the delay can potentially be even more. Assuming JS starts up again as normal as soon as the user switches back the tab, behaviour will be exactly as above, the first iteration in that case will log them out.
EDIT
Here's simplified, complete code, and a JSFiddle showing it running and working.
jQuery(document).ready(function($) {
let lastActivity = Date.now();
let now = Date.now();
let logoutAfter = 3600 * 1000;
let timer = setInterval(function() {
now = Date.now();
let delta = now - lastActivity;
console.log('New timer loop, now:', now, '; delta:', delta);
if (delta > logoutAfter) {
alert('logout!');
}
}, 1000);
});

Why does setInterval not increment my clock properly in JavaScript?

I want to display the actual time in New York. I have a html div:
<div id="time"></div>
and also - I have a php script that returns the actual time:
<?php
date_default_timezone_set('UTC');
echo time();
?>
and it does it as a timestamp.
Now, I've created a js script:
var serverTime;
moment.tz.add('America/New_York|EST EDT|50 40|0101|1Lz50 1zb0 Op0');
function fetchTimeFromServer() {
$.ajax({
type: 'GET',
url: 'generalTime.php',
complete: function(resp){
serverTime = resp.responseText;
function updateTimeBasedOnServer(timestamp) { // Take in input the timestamp
var calculatedTime = moment(timestamp).tz("America/New_York");
var dateString = calculatedTime.format('h:mm:ss A');
$('#time').html(dateString + ", ");
};
var timestamp = serverTime*1000;
updateTimeBasedOnServer(timestamp);
setInterval(function () {
timestamp += 1000; // Increment the timestamp at every call.
updateTimeBasedOnServer(timestamp);
}, 1000);
}
})
};
fetchTimeFromServer();
setInterval(function(){
fetchTimeFromServer();
}, 5000);
The idea behind it is that I want to fetch the time from the server, display it on my webpage, increment it every second for five seconds, and then fetch the time from the server again (to stay consistent with the time on the server). After that it should keep going the same way: fetch the time, increment it for 5 seconds, fetch it again, and so on.
It works... almost. After the webpage stays open for some time I can see the actual time, but it 'blinks', and I can see that it shows different times - it's hard to explain, but it looks like there is some time already in that div and new time tries to overlay it for each second. Seems like the previous time (content of this div) is not removed... I don't know how to create a jsfiddle with a call to remote server to fetch time from php, so I only have this information pasted above :(
What might be the problem here?
Since JavaScript is single threaded, setInterval may not actually run your function exactly after the delay. It adds the function to a queue to be run as soon as the thread is ready for it. If there are other events in the queue, it will take longer than the interval period to run. Multiple intervals or timeouts all add calls to the same queue for processing. To address this, you could use HTML5 web workers or try using setTimeout recursively.
Here is a good read on web workers: https://msdn.microsoft.com/en-us/hh549259.aspx
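Separately from timer accuracy, the code in the question starts a new setInterval on every call to fetchTimeFromServer and never clears the previous one, so after a few fetches several intervals write different times into the same div. That would produce the "blinking" described. A minimal sketch of keeping a single interval, assuming a hypothetical createClock helper and a render callback that updates the div:

```javascript
// Hypothetical helper: a clock that keeps at most one interval running.
function createClock(render) {
  let timerId = null;
  let timestamp = 0;
  return {
    // Call with the server time in seconds each time a fresh value arrives.
    sync: function (serverSeconds) {
      if (timerId !== null) clearInterval(timerId); // drop the previous ticker
      timestamp = serverSeconds * 1000;
      render(timestamp);
      timerId = setInterval(function () {
        timestamp += 1000;
        render(timestamp);
      }, 1000);
    },
    stop: function () {
      if (timerId !== null) clearInterval(timerId);
      timerId = null;
    }
  };
}
```

In the question's code, the complete callback would call clock.sync(resp.responseText) on one shared clock instead of creating its own setInterval, with render formatting the timestamp and writing it into #time.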

Is it possible to know how long a user has spent on a page?

Say I have a browser extension which runs JS on the pages the user visits.
Is there an "outLoad" event or something of the like to start counting and see how long the user has spent on a page?
I am assuming that your user opens a tab, browses some webpage, then goes to another webpage, comes back to the first tab, and so on, and that you want to calculate the exact time spent by the user. Also note that a user might open a webpage and leave it open but walk away, then come back an hour later and use the page again. You would not want to count the time they are away from the computer as time spent on the webpage. To handle this, the following code does a focus check every 5 minutes. Your measured time might therefore be off by up to 5 minutes, but you can adjust the focus-check interval to your needs. Also note that a user might just watch a video for more than 5 minutes, in which case the following code will not count that time. You would have to run more intelligent code that checks whether a Flash video or something similar is playing.
Here is what I do in the content script (using jQuery):
$(window).on('unload', window_unfocused);
$(window).on("focus", window_focused);
$(window).on("blur", window_unfocused);
setInterval(focus_check, 300 * 1000);
var start_focus_time = undefined;
var last_user_interaction = undefined;
function focus_check() {
if (start_focus_time != undefined) {
var curr_time = new Date();
//Lets just put it for 4.5 minutes
if((curr_time.getTime() - last_user_interaction.getTime()) > (270 * 1000)) {
//No interaction in this tab for last 5 minutes. Probably idle.
window_unfocused();
}
}
}
function window_focused(eo) {
last_user_interaction = new Date();
if (start_focus_time == undefined) {
start_focus_time = new Date();
}
}
function window_unfocused(eo) {
if (start_focus_time != undefined) {
var stop_focus_time = new Date();
var total_focus_time = stop_focus_time.getTime() - start_focus_time.getTime();
start_focus_time = undefined;
var message = {};
message.type = "time_spent";
message.domain = document.domain;
message.time_spent = total_focus_time;
chrome.runtime.sendMessage(message); // chrome.extension.sendMessage is deprecated
}
}
onbeforeunload should fit your request. It fires right before page resources are being unloaded (page closed).
<script type="text/javascript">
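// Reports a page event to the server, which records the time it was received.
function send_data(eventName){
$.ajax({
url:'something.php',
type:'POST',
data:{event: eventName}, // placeholder payload; send whatever your backend needs
success:function(data){
//get your time in response here
}
});
}
//insert this data in your data base and notice your timestamp
window.onload=function(){ send_data('load'); }
window.onbeforeunload=function(){ send_data('unload'); }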
</script>
Now calculate the difference between the two times. You will get the time spent by the user on the page.
For those interested, I've put some work into a small JavaScript library that times how long a user interacts with a web page. It has the added benefit of more accurately (not perfectly, though) tracking how long a user is actually interacting with the page. It ignores times when a user switches to a different tab, goes idle, minimizes the browser, etc.
Edit: I have updated the example to include the current API usage.
http://timemejs.com
An example of its usage:
Include in your page:
<script src="http://timemejs.com/timeme.min.js"></script>
<script type="text/javascript">
TimeMe.initialize({
currentPageName: "home-page", // page name
idleTimeoutInSeconds: 15 // time before user considered idle
});
</script>
If you want to report the times yourself to your backend:
xmlhttp=new XMLHttpRequest();
xmlhttp.open("POST","ENTER_URL_HERE",true);
xmlhttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
var timeSpentOnPage = TimeMe.getTimeOnCurrentPageInSeconds();
xmlhttp.send(timeSpentOnPage);
TimeMe.js also supports sending timing data via websockets, so you don't have to try to force a full http request into the document.onbeforeunload event.
The start_time is when the user first requests the page, and you get the end_time by firing an ajax notification to the server just before the user leaves the page:
window.onbeforeunload = function () {
// Ajax request to record the page leaving event.
$.ajax({
url: "im_leaving.aspx", cache: false
});
};
You also have to keep the user session alive for users who stay on the same page for a long time (keep_alive.aspx can be an empty page):
var iconn = self.setInterval(
function () {
$.ajax({
url: "keep_alive.aspx", cache: false });
}
,300000
);
Then you can additionally get the time spent on the site by checking (each time the user leaves a page) whether they are navigating to an external page/domain.
Revisiting this question: I know this wouldn't be much help in a Chrome extension environment, but you could just open a websocket that does nothing but ping every second. When the user leaves, the connection dies, so you know to a precision of 1 second how long they spent on the site, and you can handle that however you want.
Try out active-timeout.js. It uses the Visibility API to check when the user has switched to another tab or has minimized the browser window.
With it, you can set up a counter that runs until a predicate function returns a falsy value:
ActiveTimeout.count(function (time) {
// `time` holds the active time passed up to this point.
return true; // runs indefinitely
});
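For readers who prefer not to add a library, the underlying idea can be sketched directly on top of the Page Visibility API (document.visibilityState and the visibilitychange event). This is not how active-timeout.js is implemented; createActiveTimer is a hypothetical helper written only for this illustration.

```javascript
// Hypothetical helper: accumulates the time during which the page was visible.
// All times are millisecond timestamps such as Date.now().
function createActiveTimer(startMs, visible) {
  let total = 0;
  let since = visible ? startMs : null; // null while the page is hidden
  return {
    show: function (nowMs) {
      if (since === null) since = nowMs;
    },
    hide: function (nowMs) {
      if (since !== null) {
        total += nowMs - since;
        since = null;
      }
    },
    elapsed: function (nowMs) {
      return total + (since === null ? 0 : nowMs - since);
    }
  };
}

// Browser-only wiring: pause while the tab is hidden, resume when it is shown.
if (typeof document !== 'undefined') {
  const timer = createActiveTimer(Date.now(), document.visibilityState === 'visible');
  document.addEventListener('visibilitychange', function () {
    if (document.visibilityState === 'visible') timer.show(Date.now());
    else timer.hide(Date.now());
  });
}
```

Calling timer.elapsed(Date.now()) at any point then gives the visible time so far in milliseconds.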

How to measure a time spent on a page?

I would like to measure a time (in seconds in integers or minutes in floats) a user spends on a page. I know there is an unload event which I can trigger when they leave the page. But how to get a time they have already spent there?
The accepted answer is good, but (as an alternative) I've put some work into a small JavaScript library that times how long a user is on a web page. It has the added benefit of more accurately (not perfectly, though) tracking how long a user is actually interacting with the page. It ignores times when a user switches to a different tab, goes idle, minimizes the browser, etc. The Google Analytics method suggested in the accepted answer has the shortcoming (as I understand it) that it only checks when a new request is handled by your domain. It compares the previous request time against the new request time, and calls that the 'time spent on your web page'. It doesn't actually know if someone is viewing your page, has minimized the browser, has switched tabs to 3 different web pages since last loading your page, etc.
Edit: I have updated the example to include the current API usage.
Edit 2: Updating domain where project is hosted
https://github.com/jasonzissman/TimeMe.js/
An example of its usage:
Include in your page:
<!-- Download library from https://github.com/jasonzissman/TimeMe.js/ -->
<script src="timeme.js"></script>
<script type="text/javascript">
TimeMe.initialize({
currentPageName: "home-page", // page name
idleTimeoutInSeconds: 15 // time before user considered idle
});
</script>
If you want to report the times yourself to your backend:
xmlhttp=new XMLHttpRequest();
xmlhttp.open("POST","ENTER_URL_HERE",true);
xmlhttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
var timeSpentOnPage = TimeMe.getTimeOnCurrentPageInSeconds();
xmlhttp.send(timeSpentOnPage);
TimeMe.js also supports sending timing data via websockets, so you don't have to try to force a full http request into the document.onbeforeunload event.
If you use Google Analytics, they provide this statistic, though I am unsure exactly how they get it.
If you want to roll your own, you'll need to have some AJAX request that gets sent to your server for logging.
jQuery has a .unload(...) method you can use like:
$(document).ready(function() {
var start = new Date();
$(window).unload(function() {
var end = new Date();
$.ajax({
url: "log.php",
data: {'timeSpent': end - start},
async: false
})
});
});
See more here: http://api.jquery.com/unload/
The only caveat here is that it relies on the browser's unload event, which doesn't always fire with enough time to make an AJAX request like this, so realistically you will lose a lot of data.
Another method would be to periodically poll the server with some type of "STILL HERE" message that can be processed more consistently, but obviously way more costly.
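Browsers also offer navigator.sendBeacon, which queues a small POST that the browser delivers even while the page is going away. A minimal sketch, reusing the log.php endpoint and timeSpent field from the example above; buildTimingPayload is a hypothetical helper and pagehide is just one reasonable event to hook.

```javascript
// Hypothetical helper: form-encoded body carrying the elapsed milliseconds.
function buildTimingPayload(startMs, endMs) {
  const params = new URLSearchParams();
  params.append('timeSpent', String(Math.max(0, endMs - startMs)));
  return params;
}

// Browser-only part: report once when the page is being hidden or unloaded.
if (typeof window !== 'undefined' && typeof navigator !== 'undefined' && navigator.sendBeacon) {
  const start = Date.now();
  window.addEventListener('pagehide', function () {
    // sendBeacon returns false if the browser could not queue the request.
    navigator.sendBeacon('log.php', buildTimingPayload(start, Date.now()));
  });
}
```

Delivery is still best effort, so the server should tolerate missing reports.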
In addition to Jason's answer, here's a small piece of code that should do the trick if you prefer not to use a library. It takes into account when the user switches tabs or focuses another window.
let startDate = new Date();
let elapsedTime = 0;
const focus = function() {
startDate = new Date();
};
const blur = function() {
const endDate = new Date();
const spentTime = endDate.getTime() - startDate.getTime();
elapsedTime += spentTime;
};
const beforeunload = function() {
const endDate = new Date();
const spentTime = endDate.getTime() - startDate.getTime();
elapsedTime += spentTime;
// elapsedTime contains the time spent on page in milliseconds
};
window.addEventListener('focus', focus);
window.addEventListener('blur', blur);
window.addEventListener('beforeunload', beforeunload);
๐—จ๐˜€๐—ฒ ๐—ฝ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐—ป๐—ฐ๐—ฒ.๐—ป๐—ผ๐˜„()
Running inline code to record the time at which the user arrived blocks the loading of the page. Instead, use performance.now(), which returns how many milliseconds have elapsed since the user first navigated to the page. Date.now(), by contrast, measures clock time, which can differ from navigation time by a second or more due to factors such as time resynchronization and leap seconds. performance.now() is supported in IE10+ and all evergreen browsers (evergreen = browsers that update themselves automatically). The earliest version of Internet Explorer still around today is Internet Explorer 11 (the last version), since Microsoft discontinued Windows XP in 2014.
(function(){"use strict";
var performance = window.performance, round = Math.round;
var secondsSpentElement = document.getElementById("seconds-spent");
var millisecondsSpentElement = document.getElementById("milliseconds-spent");
requestAnimationFrame(function updateTimeSpent(){
var timeNow = performance.now();
secondsSpentElement.value = round(timeNow/1000);
millisecondsSpentElement.value = round(timeNow);
requestAnimationFrame(updateTimeSpent);
});
})();
Seconds spent on page: <input id="seconds-spent" size="6" readonly="" /><br />
Milliseconds spent here: <input id="milliseconds-spent" size="6" readonly="" />
I'd say your best bet is to keep track of the timing of requests per session ID at your server. The time the user spent on the last page is the difference between the time of the current request, and the time of the prior request.
This won't catch the very last page the user visits (i.e. when there isn't going to be another request), but I'd still go with this approach, as you'd otherwise have to submit a request at onunload, which would be extremely error prone.
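The server-side bookkeeping described above can be sketched as follows, assuming the server has already collected one session's requests as a list ordered by time. The function name timePerPage and the record shape are hypothetical; a real implementation would read these from the access log or session store.

```javascript
// Hypothetical helper: requests is one session's page requests, oldest first,
// each shaped like { path: '/pricing', time: 1618560000000 } (time in milliseconds).
function timePerPage(requests) {
  const result = [];
  // The last request has no successor, so its duration stays unknown.
  for (let i = 0; i + 1 < requests.length; i++) {
    result.push({
      path: requests[i].path,
      ms: requests[i + 1].time - requests[i].time
    });
  }
  return result;
}
```

For three requests at 0 s, 40 s and 100 s this yields 40 s for the first page and 60 s for the second, and nothing for the third, which is the limitation mentioned above.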
I think the best way is to store the times from the onload and unload event handlers, e.g. in cookies, and then analyze them in server-side scripts.
Regarding the accepted answer, I don't think that is the best solution, because according to the jQuery docs:
The exact handling of the unload event has varied from version to
version of browsers. For example, some versions of Firefox trigger the
event when a link is followed, but not when the window is closed. In
practical usage, behavior should be tested on all supported browsers
and contrasted with the similar beforeunload event.
Another thing is that you shouldn't start the timer after the document has loaded, because the result of the subtraction can then be inaccurate.
So the better solution is to attach it to the onbeforeunload event, in a script at the end of the <head> section, like this:
<script>
var startTime = (new Date()).getTime();
window.onbeforeunload = function (event) {
var timeSpent = (new Date()).getTime() - startTime,
xmlhttp= new XMLHttpRequest();
xmlhttp.open("POST", "your_url");
xmlhttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
xmlhttp.send(timeSpent);
};
</script>
Of course, if you want to count the time using an idle detector, you can use:
https://github.com/serkanyersen/ifvisible.js/
TimeMe is a wrapper for the package linked above.
<body onLoad="myFunction()">
<script src="jquery.min.js"></script>
<script>
var arr = [];
window.onbeforeunload = function(){
var d = new Date();
var n = d.getTime();
arr.push(n);
var diff= n-arr[0];
var sec = diff/1000;
var r = Math.round(sec);
return "Time spent on page: "+r+" seconds";
};
function myFunction() {
var d = new Date();
var n = d.getTime();
arr.push(n);
}
</script>
I've found using beforeunload event to be unreliable, actually failing more often than not. Usually the page has been destroyed before the request gets sent, and you get a "network failure" error.
As others have stated, there is no sure-fire way to tell how long a user has been on a page. You can send up some clues however.
Clicking and scrolling are pretty fair indicators that someone is actively viewing the page. I would suggest listening for click and scroll events, and sending a request whenever one is fired, though not more often than say, every 30 or 60 seconds.
One can use a little intelligence in the calculations, eg, if there were events fired every 30 seconds or so for 5 minutes, then no events for 30 minutes, then a couple more events fired, chances are, the user was getting coffee during the 30 minute lapse.
let sessionid;
function utilize(action) {
// This just gets the data on the server, all the calculation is done server-side.
let href = window.location.href;
let timestamp = Date.now();
sessionid = sessionid || timestamp;
let formData = new FormData();
formData.append('sessionid', sessionid);
formData.append('timestamp', timestamp);
formData.append('href', href);
formData.append('action', action || "");
let url = "/php/track.php";
let response = fetch(url, {
method: "POST",
body: formData
});
}
let inhibitCall = false;
function onEvent() {
// Don't allow an update any more often than every 30 seconds.
if (!inhibitCall) {
inhibitCall = true;
utilize('update');
setTimeout(() => {
inhibitCall = false;
}, 30000);
}
}
window.addEventListener("scroll", onEvent);
window.addEventListener("click", onEvent);
utilize("open");
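The server-side calculation mentioned above is not shown in this answer. One way it could look, sketched in JavaScript for consistency with the client code (the answer's endpoint is PHP): gaps between consecutive events count towards the total only if they are shorter than a threshold, so a 30-minute coffee break is dropped. The name activeTime and the threshold value are assumptions.

```javascript
// Hypothetical helper: timestamps is one session's event times in milliseconds,
// sorted ascending. Gaps longer than maxGapMs are treated as idle time.
function activeTime(timestamps, maxGapMs) {
  let total = 0;
  for (let i = 1; i < timestamps.length; i++) {
    const gap = timestamps[i] - timestamps[i - 1];
    if (gap <= maxGapMs) total += gap;
  }
  return total;
}
```

With events every 30 seconds for 5 minutes, a 30-minute pause, and then two more events 30 seconds apart, a 60-second threshold gives 330 seconds of active time.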
