I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried different things with PhantomJS but have not been able to gather links beyond the first page. Let's say the element at the bottom which loads content has the class .has-more-items. It is available until the final content is loaded while scrolling and then becomes unavailable in the DOM (display:none). Here are the things I have tried:
Setting viewportSize to a large height right after var page = require('webpage').create();
page.viewportSize = { width: 1600, height: 10000 };
Using page.scrollPosition = { top: 10000, left: 0 } inside page.open, which has no effect:
page.open('http://example.com/?q=houston', function(status) {
if (status == "success") {
page.scrollPosition = { top: 10000, left: 0 };
}
});
Also tried putting it inside page.evaluate function but that gives
Reference error: Can't find variable page
Tried using jQuery and JS code inside page.evaluate and page.open but to no avail-
$("html, body").animate({ scrollTop: $(document).height() }, 10,
function() {
//console.log('check for execution');
});
as it is and also inside document.ready. Similarly for JS code-
window.scrollBy(0,10000)
as it is and also inside window.onload
I have been really stuck on this for 2 days now and am not able to find a way. Any help or hint would be appreciated.
Update
I have found a helpful piece of code at https://groups.google.com/forum/?fromgroups=#!topic/phantomjs/8LrWRW8ZrA0
var hitRockBottom = false; while (!hitRockBottom) {
// Scroll the page (not sure if this is the best way to do so...)
page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
// Check if we've hit the bottom
hitRockBottom = page.evaluate(function() {
return document.querySelector(".has-more-items") === null;
}); }
Where .has-more-items is the element class I want to access which is available at the bottom of the page initially and as we scroll down, it moves further down until all data is loaded and then becomes unavailable.
However, when I tested it, it clearly ran into an infinite loop without scrolling down (I rendered pictures to check). I have also tried replacing page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 }; with each of the lines below (one at a time):
window.document.body.scrollTop = '1000';
location.href = ".has-more-items";
page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
document.location.href=".has-more-items";
But nothing seems to work.
I found a way to do it and tried to adapt it to your situation. I didn't test the best way of finding the bottom of the page because I had a different context, but check the solution below. The point is that you have to wait a little for the page to load, and because JavaScript works asynchronously you have to use setInterval or setTimeout to achieve this.
page.open('http://example.com/?q=houston', function () {
  // Check for the bottom div and scroll down from time to time
  window.setInterval(function() {
    // Check whether an element with class "has-more-items" is still shown
    // (not sure if there's a better way of doing this)
    var hasMore = page.evaluate(function() {
      var el = document.querySelector('.has-more-items');
      // display:none elements report zero width and height
      return el !== null && (el.offsetWidth > 0 || el.offsetHeight > 0);
    });
    if (hasMore) { // Still more content to load
      page.evaluate(function() {
        // Scroll to the bottom of page
        window.document.body.scrollTop = document.body.scrollHeight;
      });
    }
    else { // Marker is gone, so everything is loaded
      // Do what you want
      ...
      phantom.exit();
    }
  }, 500); // Number of milliseconds to wait between scrolls
});
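A side note on the loop from the question's update: page.scrollPosition is an object with top and left fields, so page.scrollPosition + 1000 concatenates strings instead of adding numbers. The sketch below uses a plain object (fakePage, a hypothetical stand-in for the PhantomJS page) only to illustrate the arithmetic; it does not scroll anything.

```javascript
// fakePage is a hypothetical stand-in for a PhantomJS page object;
// only the shape of scrollPosition matters here.
var fakePage = { scrollPosition: { top: 0, left: 0 } };

// What the loop in the question computes: object + number becomes a string.
var broken = fakePage.scrollPosition + 1000;

// Reading the numeric "top" field first gives the intended offset.
function scrollDownBy(page, pixels) {
  page.scrollPosition = {
    top: page.scrollPosition.top + pixels,
    left: page.scrollPosition.left
  };
  return page.scrollPosition.top;
}

var firstStep = scrollDownBy(fakePage, 1000);
var secondStep = scrollDownBy(fakePage, 1000);
console.log(typeof broken, firstStep, secondStep);
```

Even with that fixed, a tight while loop probably gives the page no chance to load new items between iterations, which is why a timer is used above.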
I know that it has been answered a long time ago, but I also found a solution to my specific scenario. The result is a piece of javascript that scrolls to the bottom of the page. It is optimized to reduce waiting time.
It is not written for PhantomJS by default, so that will have to be modified. However, for a beginner or someone who doesn't have root access, an Iframe with injected javascript (run Google Chrome with --disable-javascript parameter) is a good alternative method for scraping a smaller set of ajax pages. The main benefit is that it's easily debuggable, because you have a visual overview of what's going on with your scraper.
function ScrollForAjax (iframeselector) {
scrollintervals = 50;
scrollmaxtime = 1000;
if(typeof(scrolltime)=="undefined"){
scrolltime = 0;
}
scrolldocheight1 = $(iframeselector).contents().find("body").height();
$("body").scrollTop(scrolldocheight1);
setTimeout(function(){
scrolldocheight2 = $("body").height();
if(scrolltime===scrollmaxtime || scrolltime>scrollmaxtime){
scrolltime = 0;
$("body").scrollTop(0);
ScrapeCurrentPage(iframeselector);
}
else if(scrolldocheight2>scrolldocheight1){
scrolltime = 0;
ScrollForAjax (iframeselector);
}
else if(scrolldocheight1>=scrolldocheight2){
ScrollForAjax (iframeselector);
}
},scrollintervals);
scrolltime += scrollintervals;
}
scrollmaxtime is a timeout variable. Hope this is useful to someone :)
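The same "keep scrolling until the height stops growing" idea can be sketched without globals. This is only an illustration under assumptions: scrollUntilStable and its option names are made up for this example, and the page is simulated with an array of heights, so the logic can be run outside a browser.

```javascript
// Polls a height reader and keeps scrolling until the height has stopped
// growing for maxIdleTicks consecutive checks.
// All names here are illustrative, not part of any library.
function scrollUntilStable(readHeight, scrollToBottom, onDone, options) {
  var interval = options.interval;               // ms between checks
  var maxIdleTicks = options.maxIdleTicks;       // checks without growth before stopping
  var schedule = options.schedule || setTimeout; // injectable, handy for testing
  var lastHeight = -1;
  var idleTicks = 0;

  function tick() {
    var height = readHeight();
    if (height > lastHeight) {
      lastHeight = height; // new content arrived, reset the idle counter
      idleTicks = 0;
    } else {
      idleTicks += 1;
    }
    if (idleTicks >= maxIdleTicks) {
      onDone(lastHeight);
      return;
    }
    scrollToBottom();
    schedule(tick, interval);
  }
  tick();
}

// Simulated page: the height grows twice and then stays at 400.
var heights = [100, 250, 400, 400, 400, 400];
var reads = 0;
var scrolls = 0;
var finalHeight = null;
scrollUntilStable(
  function () { var h = heights[Math.min(reads, heights.length - 1)]; reads += 1; return h; },
  function () { scrolls += 1; },
  function (h) { finalHeight = h; },
  { interval: 50, maxIdleTicks: 2, schedule: function (fn) { fn(); } } // run synchronously
);
console.log(finalHeight, scrolls);
```

In a real page, readHeight would return something like the body height and scrollToBottom would set the scroll position; the synchronous schedule is only there for the simulation.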
The "correct" solution didn't work for me. And from what I've read, CasperJS doesn't use window (but I may be wrong on that), which makes me doubt that window works.
The following works for me in the Firefox/Chrome console but doesn't work in CasperJS (within the casper.evaluate function).
$(document).scrollTop($(document).height());
What did work for me in CasperJS was:
casper.scrollToBottom();
casper.wait(1000, function waitCb() {
casper.capture("loadedContent.png");
});
This also worked when moving casper.capture into Casper's then function.
However, the above solution won't work on some sites like Twitter; jQuery seems to break the casper.scrollToBottom() function, and I had to remove the clientScripts reference to jQuery when working within Twitter.
var casper = require('casper').create({
clientScripts: [
// 'jquery.js'
]
});
Some websites (e.g. BoingBoing.net) seem to work fine with jQuery and CasperJS scrollToBottom(). Not sure why some sites work and others don't.
The code snippet below works just fine for Pinterest. I researched a lot to scrape Pinterest without PhantomJS, but it is impossible to find the infinite scroll trigger link. I think the code below will help with scraping other infinite-scroll web pages.
page.open(pageUrl).then(function (status) {
var count = 0;
// Scrolls to the bottom of page
function scroll2btm() {
if (count < 500) {
page.evaluate(function(limit) {
window.scrollTo(0, document.body.scrollHeight || document.documentElement.scrollHeight);
return document.getElementsByClassName('pinWrapper').length; // use desired contents (eg. pin) selector for count presence number
}).then(function(c) {
count = c;
console.log(count); // print no of content found to check
});
setTimeout(scroll2btm,3000);
} else {
// required number of item found
}
}
scroll2btm();
});
What would be the most efficient way to show and hide iframes in a page every x time?
I was thinking of using a setInterval on a function that uses jQuery's hide and show, but this seems inefficient and not very scalable if I needed to show and hide, say, 1 out of 10 iframes on one page.
$(document).ready(function(){
setInterval(function() {
if ($('#basic').is(":visible") && !$('#advanced').is(":visible") ) {
$('#basic').hide();
$('#advanced').show();
}else if(!$('#basic').is(":visible") && $('#advanced').is(":visible")) {
$('#basic').show();
$('#advanced').hide();
}else if($('#basic').is(":visible") && $('#advanced').is(":visible")) {
$('#basic').hide();
$('#advanced').show();
};
}, 30000);
});
Each id refers to 1 iframe so right now I am only dealing with 2 iframes. The reason I have that last if else statement is because both iframes are being displayed when I load the page.
Just a snippet since the OP asked for it. Posting as an answer so I can format the code a bit better. (warning: I haven't 100% tested it myself yet, this isn't meant as a copy/paste implementation.)
This should show all the hidden frames and hide all the visible frames every 30 sec.
You can obviously extend it to only show/hide specific nodes that you reference by id, since frames[theFramesID] gives you the reference and the visible status. If you don't need to access specific frames, you can simplify this and use an array instead of an object.
Just using some form of loop and caching your nodes, instead of re-querying the same node over and over again, will already improve scalability, since you don't need to change the code when you add another frame.
One could probably replace the vanilla functions I used with jQuery-specific ones if needed. jQuery collections don't have a built-in reduce, so the snippet converts the collection to a plain array with toArray() first.
$(document).ready(function(){
var frames = $('iframe').toArray().reduce(function ( accumulator, frame ) {
// wrap the DOM node once so .is(), .hide() and .show() are available
var $frame = $(frame);
accumulator[frame.id] = {
'reference' : $frame,
'visible' : $frame.is(":visible")
};
return accumulator;
}, {});
setInterval(function() {
Object.keys(frames).forEach(function ( id ) {
var frame = frames[id];
if (frame.visible) {
frame.visible = false;
frame.reference.hide();
}
else {
frame.visible = true;
frame.reference.show();
}
});
}, 30000);
});
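If the goal is instead to show exactly one frame out of many at a time, the bookkeeping can be reduced to a rotating index. The following is a minimal sketch under assumptions: createRotation is a made-up helper, the ids are the two from the question, and the browser wiring at the end assumes jQuery is loaded.

```javascript
// Builds a rotation over a list of element ids (hypothetical helper).
function createRotation(ids) {
  var index = 0;
  return {
    current: function () { return ids[index]; },
    advance: function () {
      index = (index + 1) % ids.length; // wrap around after the last id
      return ids[index];
    }
  };
}

// Pure demonstration of the rotation order.
var rotation = createRotation(['basic', 'advanced']);
var shown = [rotation.current(), rotation.advance(), rotation.advance()];
console.log(shown.join(' -> '));

// Browser wiring, assuming jQuery is loaded and each id belongs to an iframe.
if (typeof window !== 'undefined' && window.jQuery) {
  window.jQuery(function ($) {
    var frames = createRotation(['basic', 'advanced']);
    $('iframe').hide();
    $('#' + frames.current()).show();
    setInterval(function () {
      $('#' + frames.current()).hide(); // hide the frame shown so far
      $('#' + frames.advance()).show(); // show the next one in the list
    }, 30000);
  });
}
```

Adding another frame then only means adding its id to the list.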
I am trying to send the user to a different url, if their browser height is smaller than a certain value. I want it to be checking for this all the time so have used the setInterval function. I cannot see what is wrong with it...
This is the code I am using:
<script type="text/javascript">
//if window height is less than 605px, go to google
setInterval(function changeFont() {
if (window.innerHeight < 605) {
window.location = "http://google.com";
}
}, 1);
</script>
and before it I run this code which works great:
<script type="text/javascript">
//if window is wider than 1340px, send to desktop
setInterval(function sendToDesktop() {
if (window.innerWidth >= 1340) {
window.location = "../index.html";
}
}, 1);
</script>
When I use the code for screen height, the redirect doesn't work, even if the screen height is under 605px. Is there anything obvious I'm missing here?
The problem is the interval: you have given 1, which means 1 millisecond, a very small value. The callback then fires so often that the redirect apparently never gets a chance to complete. I tested this with 10000 instead of 1, and it worked! So you have to change the 1, or use the code shown below (it checks the window height on every resize):
<script>
function setWindowHeight(){
var windowHeight = window.innerHeight;
if (windowHeight < 605) {
window.location = "http://google.com";
}
}
window.addEventListener("resize",setWindowHeight,false);
</script>
Using a timer here really makes no sense. Just define a function which checks whether the window is tall enough and call it on resize. And don't forget to also call it on the first load.
function checkHeightredirectToGoogle() {
if (window.innerHeight < 605) {
window.location = "https://google.com";
}
}
window.addEventListener("resize", function(){
checkHeightredirectToGoogle()
});
checkHeightredirectToGoogle(); //first call
This solution will be easier to write and maintain and also faster on the client's computer, not to mention mobile devices, where the improvement is even greater.
What you are doing is forcing the browser to queue a function every 1 ms. And when the browser finally gets to the queue, it sees a few hundred functions waiting to run. It runs the functions - ALL of which ultimately evaluate to
window.location = 'http://google.com'
So your browser makes a few hundred redirects in a matter of milliseconds. And while it is furiously trying to do so, more functions are getting queued in.
Looking at your code, the problem will be even worse if a user has a screen 1440 x 600. Both redirect conditions will be satisfied.
Please detect and redirect on page load, and then listen for any subsequent resize events instead.
window.addEventListener('resize', detectAndRedirect);
detectAndRedirect(); // also check once on page load
function detectAndRedirect() {
if (window.innerHeight < 605) {
window.location = 'whatever url';
}
else if (window.innerWidth >= 1340) {
window.location = 'whatever url';
}
}
If you really must use timers to detect resolution, the better way to do this would be:
var DETECT_PERIOD = 200; // set something sensible.
setTimeout(detectAndRedirect, DETECT_PERIOD);
function detectAndRedirect() {
if (window.innerHeight < 605) {
window.location = 'whatever url';
}
else if (window.innerWidth >= 1340) {
window.location = 'whatever url';
}
else {
setTimeout(detectAndRedirect, DETECT_PERIOD);
}
}
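To make the overlap between the two rules explicit (for example a 1440 x 600 window), the decision can be pulled into one pure function. This is a sketch, not a definitive implementation: pickRedirect is a made-up name, the URLs and thresholds are the ones from the question, and letting the height rule win is an assumption.

```javascript
// Returns the URL to redirect to, or null when the window size is acceptable.
// Assumption: when both rules match, the height rule takes precedence.
function pickRedirect(width, height) {
  if (height < 605) {
    return 'http://google.com';
  }
  if (width >= 1340) {
    return '../index.html';
  }
  return null;
}

var results = [
  pickRedirect(1024, 700), // neither rule matches
  pickRedirect(1440, 600), // both match, height wins
  pickRedirect(1440, 900)  // only the width rule matches
];
console.log(results);

// Browser wiring: check once on load and again on every resize.
if (typeof window !== 'undefined') {
  var check = function () {
    var target = pickRedirect(window.innerWidth, window.innerHeight);
    if (target !== null) {
      window.location = target;
    }
  };
  window.addEventListener('resize', check);
  check();
}
```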
Don't you think that 1 is too small a value in this case? These are milliseconds. That's the first thing, and the second:
I cannot see what is wrong with it...
So if you cannot see anything wrong, how can we help? What is your exact question? Does it not redirect? What is the console showing? Any errors?
I'm working on an infinite load (e.g. lazy load) type of functionality. Here is the function so far:
$(window).scroll(function() {
var ScrollPosition = $(window).scrollTop() + $(window).height();
var LoadMorePosition = $(document).height()-100;
if( ScrollPosition == LoadMorePosition ) {
console.log('loading more');
loadMoreItems();
}
});
It's working for the most part, except that the loadMoreItems function is called 20-30 times once the person scrolls to the threshold.
I was thinking a setTimeout type thing might work but on second thought I realized that it would only work if the Ajax content loaded fast enough (which isn't guaranteed).
What I need is a way to detect if they hit the threshold and then call the function only once until they hit the scroll threshold again.
You could do it like this:
var loadOnce = function(fn) {
var loadedPages = {};
return function(page) {
var key = 'p'+page;
if (typeof loadedPages[key] === 'undefined') {
fn.apply(fn, arguments);
loadedPages[key] = true;
}
};
};
var loader = loadOnce(function(page) {
/* Ajax Request */
});
// Will only fire the ajax request once:
loader(1);
loader(1);
loader(1);
loader(1);
I was thinking a setTimeout type thing might work but on second thought I realized that it would only work if the Ajax content loaded fast enough (which isn't guaranteed).
Then don't rely on a timeout; instead, reset the flag on the success/complete of the AJAX request.
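A minimal sketch of that flag, under assumptions: createGuardedLoader and isPastThreshold are made-up names, the request is simulated, and comparing with >= rather than == is my guess at the intent (an exact match is easy to skip over while scrolling).

```javascript
// True once the bottom of the viewport is within `margin` pixels of the end.
function isPastThreshold(scrollBottom, documentHeight, margin) {
  return scrollBottom >= documentHeight - margin;
}

// Wraps a loader so that it cannot be started again while a load is running.
// startLoad receives a `done` callback and must call it when the request ends.
function createGuardedLoader(startLoad) {
  var loading = false;
  return function trigger() {
    if (loading) {
      return false; // a request is already in flight, ignore this call
    }
    loading = true;
    startLoad(function done() { loading = false; });
    return true;
  };
}

// Simulation: three scroll events arrive before the request finishes.
var requests = 0;
var finishRequest = null;
var trigger = createGuardedLoader(function (done) {
  requests += 1;
  finishRequest = done; // stands in for the AJAX complete callback
});
trigger();
trigger();
trigger();
var requestsWhileLoading = requests;
finishRequest(); // the request completes, the flag is reset
trigger();
var requestsAfterReset = requests;
console.log(requestsWhileLoading, requestsAfterReset);
```

In the scroll handler this would be used as: if isPastThreshold(...) is true, call trigger(), where startLoad runs the AJAX request and calls done from its complete callback.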
The AJAX call's success handler should set a "more content available" flag, so you do not try to load more when there is nothing left.
And for scrolling, set a timeout to detect when the user stops scrolling, then show a loading GIF and do the AJAX call; on success, hide the GIF and load more :D
var scrolled; // timer handle shared between scroll events
$(window).scroll(function(){
clearTimeout(scrolled);
scrolled = setTimeout(function(){
//Your function when user stops scrolling
}, 100);
});
I'm trying to download the HTML of a website that is almost entirely generated by JavaScript. So, I need to simulate browser access and have been playing around with PhantomJS. Problem is, the site uses hashbang URLs and I can't seem to get PhantomJS to process the hashbang -- it just keeps calling up the homepage.
The site is http://www.regulations.gov. The default takes you to #!home. I've tried using the following code (taken from elsewhere) to try and process different hashbangs.
if (phantom.state.length === 0) {
if (phantom.args.length === 0) {
console.log('Usage: loadreg_1.js <some hash>');
phantom.exit();
}
var address = 'http://www.regulations.gov/';
console.log(address);
phantom.state = Date.now().toString();
phantom.open(address);
} else {
var hash = phantom.args[0];
document.location = hash;
console.log(document.location.hash);
var elapsed = Date.now() - new Date().setTime(phantom.state);
if (phantom.loadStatus === 'success') {
if (!first_time) {
var first_time = true;
if (!document.addEventListener) {
console.log('Not SUPPORTED!');
}
phantom.render('result.png');
var markup = document.documentElement.innerHTML;
console.log(markup);
phantom.exit();
}
} else {
console.log('FAIL to load the address');
phantom.exit();
}
}
This code produces the correct hashbang (for instance, I can set the hash to '#!contactus') but it doesn't dynamically generate any different HTML, just the default page. It does, however, correctly output that hash when I call document.location.hash.
I've also tried to set the initial address to the hashbang, but then the script just hangs and doesn't do anything. For example, if I set the url to http://www.regulations.gov/#!searchResults;rpp=10;po=0 the script just hangs after printing the address to the terminal and nothing ever happens.
The issue here is that the content of the page loads asynchronously, but you're expecting it to be available as soon as the page is loaded.
In order to scrape a page that loads content asynchronously, you need to wait to scrape until the content you're interested in has been loaded. Depending on the page, there might be different ways of checking, but the easiest is just to check at regular intervals for something you expect to see, until you find it.
The trick here is figuring out what to look for - you need something that won't be present on the page until your desired content has been loaded. In this case, the easiest option I found for top-level pages is to manually input the H1 tags you expect to see on each page, keying them to the hash:
var titleMap = {
'#!contactUs': 'Contact Us',
'#!aboutUs': 'About Us'
// etc for the other pages
};
Then in your success block, you can set a recurring timeout to look for the title you want in an h1 tag. When it shows up, you know you can render the page:
if (phantom.loadStatus === 'success') {
  // check again every 300 milliseconds
  var timeoutId = window.setInterval(function () {
    // check for title element you expect to see
    var h1s = document.querySelectorAll('h1');
    // querySelectorAll always returns a node list, so test its length
    if (h1s.length === 0) {
      console.log('No H1 tags found.');
      return;
    }
    var found = false;
    // h1s is a node list, not an array, hence the
    // weird syntax here
    Array.prototype.forEach.call(h1s, function(h1) {
      if (!found && h1.textContent.trim() === titleMap[hash]) {
        // we found it!
        found = true;
        console.log('Found H1: ' + h1.textContent.trim());
        phantom.render('result.png');
        console.log("Rendered image.");
        // stop the cycle
        window.clearInterval(timeoutId);
        phantom.exit();
      }
    });
    if (!found) {
      console.log('Found H1 tags, but not ' + titleMap[hash]);
    }
  }, 300);
}
The above code works for me. But it won't work if you need to scrape search results - you'll need to figure out an identifying element or bit of text that you can look for without having to know the title ahead of time.
Edit: Also, it looks like the newest version of PhantomJS now triggers an onResourceReceived event when it gets new data. I haven't looked into this, but you might be able to bind a listener to this event to achieve the same effect.
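The polling idea itself does not depend on H1 tags and can be sketched as a small helper with a retry limit, so the script cannot hang forever. This is an illustration only: waitFor and its option names are made up for this example, and the content check is simulated with a counter.

```javascript
// Calls condition() repeatedly until it returns true or maxTries is reached.
// All names are illustrative; schedule is injectable so the logic can be tested.
function waitFor(condition, onReady, onTimeout, options) {
  var interval = options.interval;               // ms between checks
  var maxTries = options.maxTries;               // give up after this many checks
  var schedule = options.schedule || setTimeout;
  var tries = 0;

  function check() {
    tries += 1;
    if (condition()) {
      onReady(tries);
      return;
    }
    if (tries >= maxTries) {
      onTimeout(tries);
      return;
    }
    schedule(check, interval);
  }
  check();
}

// Simulation: the expected content shows up on the third check.
var checks = 0;
var outcome = null;
waitFor(
  function () { checks += 1; return checks >= 3; },
  function (tries) { outcome = 'ready after ' + tries; },
  function (tries) { outcome = 'gave up after ' + tries; },
  { interval: 300, maxTries: 10, schedule: function (fn) { fn(); } }
);

// Simulation: the content never shows up, so the helper gives up.
var givenUpAfter = null;
waitFor(
  function () { return false; },
  function () { givenUpAfter = 0; },
  function (tries) { givenUpAfter = tries; },
  { interval: 300, maxTries: 5, schedule: function (fn) { fn(); } }
);
console.log(outcome, givenUpAfter);
```

For search results, condition would look for whatever identifying element or text marks the results as loaded.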
I'm looking for a JavaScript snippet that shows a message (something like "Loading, please wait") until the page has loaded all its images.
Important: it mustn't use any JS framework (jQuery, MooTools, etc.); it must be plain JavaScript.
The message must disappear when the page is loaded.
Yeah an old-school question!
This goes back to those days when we used to preload images...
Anyway, here's some code. The magic is the "complete" property on the document.images collection (Image objects).
// setup a timer, adjust the 200 to some other milliseconds if desired
var _timer = setInterval(imgloaded, 200);
function imgloaded() {
// assume they're all loaded
var loaded = true;
// test all images for "complete" property
for(var i = 0, len = document.images.length; i < len; i++) {
if(!document.images[i].complete) { loaded = false; break; }
}
// if loaded is still true, change the HTML
if(loaded) {
document.getElementById("msg").innerHTML = "Done.";
// clear the timer
clearInterval(_timer);
}
};
Of course, this assumes you have some DIV thrown in somewhere:
<div id="msg">Loading...</div>
Just add a static <div> to the page, informing the user that the page is loading. Then add a window.onload handler and remove the div.
BTW, what’s the reason for this? Don’t users already have page load indicators in their browsers?
You should do async ajax requests for the images and add a call back when it's finished.
Here's some code to illustrate it:
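var R = new XMLHttpRequest();
R.onreadystatechange = function() {
  if (R.readyState == 4) {
    // Do something with R.responseXML/Text ...
    stopWaiting(); // your own function that hides the loading message
  }
};
// imageUrl is a placeholder for the image to fetch; true = asynchronous
R.open("GET", imageUrl, true);
R.send();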
Theoretically you could have an onload event on every image object that runs a function that checks whether all images are loaded. This way you don't need a setTimeout(). This would however fail if an image didn't load, so you would have to take onerror into account as well.
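A rough sketch of that counting approach, with onerror treated the same as onload for the purpose of finishing. createLoadTracker is a made-up helper, and the browser wiring assumes a "msg" element like the one in the earlier answer.

```javascript
// Counts settled images (loaded or failed) and fires once all are accounted for.
function createLoadTracker(total, onAllSettled) {
  var loaded = 0;
  var failed = 0;
  function settle(ok) {
    if (ok) { loaded += 1; } else { failed += 1; }
    if (loaded + failed === total) {
      onAllSettled({ loaded: loaded, failed: failed });
    }
  }
  return {
    onLoad: function () { settle(true); },
    onError: function () { settle(false); }
  };
}

// Simulation with three images, one of which fails.
var summary = null;
var tracker = createLoadTracker(3, function (result) { summary = result; });
tracker.onLoad();
tracker.onError();
var beforeLast = summary; // still null, one image is pending
tracker.onLoad();
console.log(summary);

// Browser wiring, assuming an element with id "msg" exists on the page.
if (typeof document !== 'undefined') {
  var images = document.images;
  var finish = function () {
    document.getElementById('msg').innerHTML = 'Done.';
  };
  if (images.length === 0) {
    finish(); // nothing to wait for
  } else {
    var pageTracker = createLoadTracker(images.length, finish);
    for (var i = 0; i < images.length; i++) {
      if (images[i].complete) {
        pageTracker.onLoad(); // already settled before the handlers were attached
      } else {
        images[i].onload = pageTracker.onLoad;
        images[i].onerror = pageTracker.onError;
      }
    }
  }
}
```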