I have a site where I need to take screenshots of certain parts. I was able to isolate the parts I want using XPath and screenshot them. The problem is that when there are more than 3 or 4 parts, the page only renders an element after it enters the viewport, so the XPath only picks up the first 3 or 4 elements. My current approach is to scroll the page to the bottom using this code:
async function autoScroll(page){
await page.evaluate(async () => {
await new Promise((resolve) => {
var totalHeight = 0;
var distance = 100;
var timer = setInterval(() => {
var scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
if(totalHeight >= scrollHeight - window.innerHeight){
clearInterval(timer);
resolve();
}
}, 100);
});
});
}
The problem with this is that some pages contain more than 300 elements, which means it takes about 30 seconds just to scroll. I cannot speed it up, because if I do, the elements load incorrectly.
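To make the timing concrete, here is a rough estimate of how long the interval-based loop above runs. It is only a sketch: `estimateScrollTimeMs` is a made-up helper, not part of Puppeteer, and it assumes the page height does not grow while scrolling (on a lazy-loading page it usually does, so treat the result as a lower bound).

```javascript
// Hypothetical helper (not a Puppeteer API): estimates the run time of the
// setInterval scroll loop, assuming scrollHeight stays constant.
function estimateScrollTimeMs(scrollHeight, viewportHeight, distance, intervalMs) {
  // Distance that actually has to be scrolled.
  var scrollable = Math.max(0, scrollHeight - viewportHeight);
  // The loop scrolls once per tick and stops on the first tick where
  // totalHeight >= scrollable, so it always runs at least one tick.
  var ticks = Math.max(1, Math.ceil(scrollable / distance));
  return ticks * intervalMs;
}

// A 30800px page in an 800px viewport, 100px every 100ms:
console.log(estimateScrollTimeMs(30800, 800, 100, 100)); // 30000
```

With 100px per 100ms the loop covers about 1000px per second, so a 30-second scroll corresponds to a page roughly 30,000px tall.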
waitUntil: networkidle2 / 1 and waitForNavigation do not work, because when the page loads it only requests the resources for the viewport. When I scroll, it requests the next resources, and so on.
Is there a way to load this webpage fully without scrolling? (Scrolling takes too much time, as I said.)
Edit:
The file contains the code responsible for event triggering (I think):
https://codepaste.xyz/posts/zEVum8PmvdSO6PEqMdOQ
Related
I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried different things with PhantomJS but have not been able to gather links beyond the first page. Let's say the element at the bottom which loads content has the class .has-more-items. It is available until the final content is loaded while scrolling and then becomes unavailable in the DOM (display:none). Here are the things I have tried:
Setting viewportSize to a large height right after var page = require('webpage').create();
page.viewportSize = { width: 1600, height: 10000 };
Using page.scrollPosition = { top: 10000, left: 0 } inside page.open, which has no effect:
page.open('http://example.com/?q=houston', function(status) {
if (status == "success") {
page.scrollPosition = { top: 10000, left: 0 };
}
});
I also tried putting it inside the page.evaluate function, but that gives:
Reference error: Can't find variable page
I tried using jQuery and JS code inside page.evaluate and page.open, but to no avail:
$("html, body").animate({ scrollTop: $(document).height() }, 10,
function() {
//console.log('check for execution');
});
both as-is and inside document.ready. Similarly for plain JS code:
window.scrollBy(0,10000)
both as-is and inside window.onload.
I have been stuck on this for 2 days now and have not been able to find a way. Any help or hint would be appreciated.
Update
I have found a helpful piece of code at https://groups.google.com/forum/?fromgroups=#!topic/phantomjs/8LrWRW8ZrA0
var hitRockBottom = false;
while (!hitRockBottom) {
    // Scroll the page (not sure if this is the best way to do so...)
    page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
    // Check if we've hit the bottom
    hitRockBottom = page.evaluate(function() {
        return document.querySelector(".has-more-items") === null;
    });
}
Where .has-more-items is the element class I want to access which is available at the bottom of the page initially and as we scroll down, it moves further down until all data is loaded and then becomes unavailable.
However, when I tested it, it clearly runs into an infinite loop without scrolling down (I rendered pictures to check). I have also tried replacing page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 }; with each of the lines below (one at a time):
window.document.body.scrollTop = '1000';
location.href = ".has-more-items";
page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
document.location.href=".has-more-items";
But nothing seems to work.
I found a way to do it and tried to adapt it to your situation. I didn't test the best way of finding the bottom of the page because I had a different context, but check the solution below. The point is that you have to wait a little for the page to load, and JavaScript works asynchronously, so you have to use setInterval or setTimeout to achieve this.
page.open('http://example.com/?q=houston', function () {
    // Check for the bottom div and scroll down from time to time
    window.setInterval(function() {
        // Check if there is still an element with class "has-more-items"
        // (not sure if there's a better way of doing this)
        var count = page.content.match(/class="[^"]*has-more-items/g);
        if(count !== null) { // Still there: more content to load
            page.evaluate(function() {
                // Scroll to the bottom of page
                window.document.body.scrollTop = document.body.scrollHeight;
            });
        }
        else { // Gone: everything has been loaded
            // Do what you want
            ...
            phantom.exit();
        }
    }, 500); // Number of milliseconds to wait between scrolls
});
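The same "check, scroll, wait, repeat" idea can be pulled out into a small generic helper, which also gives you an upper bound on attempts so the script cannot loop forever. This is only a sketch: `pollUntil` and its parameters are names I made up, not part of PhantomJS.

```javascript
// Hypothetical helper: run step() every intervalMs until isDone() returns
// true or maxTries steps have been made, then call onFinish(reachedEnd).
function pollUntil(isDone, step, onFinish, intervalMs, maxTries) {
  var tries = 0;
  function tick() {
    if (isDone()) {
      onFinish(true);   // condition met
      return;
    }
    if (tries >= maxTries) {
      onFinish(false);  // gave up
      return;
    }
    tries += 1;
    step();
    // Yield to the event loop so the page can load before the next check.
    setTimeout(tick, intervalMs);
  }
  tick();
}

// Possible use inside page.open (assumes the PhantomJS `page` object
// and the .has-more-items marker from the question):
// pollUntil(
//   function () {
//     return page.evaluate(function () {
//       return document.querySelector('.has-more-items') === null;
//     });
//   },
//   function () {
//     page.evaluate(function () {
//       window.document.body.scrollTop = document.body.scrollHeight;
//     });
//   },
//   function (reachedEnd) { phantom.exit(); },
//   500, 100
// );
```

Because each check is scheduled with setTimeout instead of running in a tight while loop, the page gets a chance to fetch and render new content between scrolls.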
I know that it has been answered a long time ago, but I also found a solution to my specific scenario. The result is a piece of javascript that scrolls to the bottom of the page. It is optimized to reduce waiting time.
It is not written for PhantomJS by default, so that will have to be modified. However, for a beginner or someone who doesn't have root access, an Iframe with injected javascript (run Google Chrome with --disable-javascript parameter) is a good alternative method for scraping a smaller set of ajax pages. The main benefit is that it's easily debuggable, because you have a visual overview of what's going on with your scraper.
function ScrollForAjax (iframeselector) {
scrollintervals = 50;
scrollmaxtime = 1000;
if(typeof(scrolltime)=="undefined"){
scrolltime = 0;
}
scrolldocheight1 = $(iframeselector).contents().find("body").height();
$("body").scrollTop(scrolldocheight1);
setTimeout(function(){
scrolldocheight2 = $("body").height();
if(scrolltime===scrollmaxtime || scrolltime>scrollmaxtime){
scrolltime = 0;
$("body").scrollTop(0);
ScrapeCurrentPage(iframeselector);
}
else if(scrolldocheight2>scrolldocheight1){
scrolltime = 0;
ScrollForAjax (iframeselector);
}
else if(scrolldocheight1>=scrolldocheight2){
ScrollForAjax (iframeselector);
}
},scrollintervals);
scrolltime += scrollintervals;
}
scrollmaxtime is a timeout variable. Hope this is useful to someone :)
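The decision made inside the setTimeout callback above boils down to three cases. A minimal sketch of just that logic, separated from the DOM so it is easier to follow (`nextScrollAction` and the action names are mine, not part of the original script):

```javascript
// Hypothetical restatement of the branch logic in ScrollForAjax.
// heightBefore / heightAfter: document height before and after the wait.
// elapsedMs: time spent waiting since the height last grew.
// maxWaitMs: plays the role of scrollmaxtime.
function nextScrollAction(heightBefore, heightAfter, elapsedMs, maxWaitMs) {
  if (elapsedMs >= maxWaitMs) {
    return 'scrape';            // nothing new arrived in time: stop and scrape
  }
  if (heightAfter > heightBefore) {
    return 'reset-and-scroll';  // new content arrived: reset the timer, scroll again
  }
  return 'keep-scrolling';      // no growth yet: keep waiting and scrolling
}

console.log(nextScrollAction(2000, 2600, 150, 1000)); // reset-and-scroll
```

The waiting time only accumulates while the height stays the same, which is why the script does not sit idle for the full timeout after every scroll.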
The "correct" solution didn't work for me. And, from what I've read CasperJS doesn't use window (but I may be wrong on that), which makes me doubt that window works.
The following works for me in the Firefox/Chrome console, but doesn't work in CasperJS (within the casper.evaluate function).
$(document).scrollTop($(document).height());
What did work for me in CasperJS was:
casper.scrollToBottom();
casper.wait(1000, function waitCb() {
casper.capture("loadedContent.png");
});
This also worked when moving casper.capture into Casper's then function.
However, the above solution won't work on some sites like Twitter; jQuery seems to break the casper.scrollToBottom() function, and I had to remove the clientScripts reference to jQuery when working with Twitter.
var casper = require('casper').create({
clientScripts: [
// 'jquery.js'
]
});
Some websites (e.g. BoingBoing.net) seem to work fine with jQuery and CasperJS scrollToBottom(). Not sure why some sites work and others don't.
The code snippet below works just fine for Pinterest. I researched a lot to scrape Pinterest without PhantomJS, but it was impossible to find the infinite scroll trigger link. I think the code below will help with scraping other infinite-scroll web pages.
page.open(pageUrl).then(function (status) {
var count = 0;
// Scrolls to the bottom of page
function scroll2btm() {
if (count < 500) {
page.evaluate(function(limit) {
window.scrollTo(0, document.body.scrollHeight || document.documentElement.scrollHeight);
return document.getElementsByClassName('pinWrapper').length; // use desired contents (eg. pin) selector for count presence number
}).then(function(c) {
count = c;
console.log(count); // print no of content found to check
});
setTimeout(scroll2btm,3000);
} else {
// required number of item found
}
}
scroll2btm();
});
I'm working on Infinite Load (e.g. Lazy Load) type functionality. Here is the function so far:
$(window).scroll(function() {
var ScrollPosition = $(window).scrollTop() + $(window).height();
var LoadMorePosition = $(document).height()-100;
if( ScrollPosition == LoadMorePosition ) {
console.log('loading more');
loadMoreItems();
}
});
It's working for the most part, except that the loadMoreItems function is called 20-30 times once the person scrolls to the threshold.
I was thinking a setTimeout type thing might work but on second thought I realized that it would only work if the Ajax content loaded fast enough (which isn't guaranteed).
What I need is a way to detect if they hit the threshold and then call the function only once until they hit the scroll threshold again.
You could do it like this:
var loadOnce = function(fn) {
var loadedPages = {};
return function(page) {
var key = 'p'+page;
if (typeof loadedPages[key] === 'undefined') {
fn.apply(fn, arguments);
loadedPages[key] = true;
}
};
};
var loader = loadOnce(function(page) {
/* Ajax Request */
});
// Will only fire the ajax request once:
loader(1);
loader(1);
loader(1);
loader(1);
I was thinking a setTimeout type thing might work but on second thought I realized that it would only work if the Ajax content loaded fast enough (which isn't guaranteed).
Then don't rely on a timeout; instead, reset the flag in the success/complete callback of the AJAX request.
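A minimal sketch of that flag, assuming your loader can accept a callback to run when the request finishes (`createLoadGuard` and `done` are names I made up for illustration):

```javascript
// Hypothetical guard: ignores new requests while one is still in flight.
// loadFn receives a done() callback and must call it when the AJAX
// request completes (whether it succeeded or failed).
function createLoadGuard(loadFn) {
  var loading = false;
  return function requestLoad() {
    if (loading) {
      return false;       // a request is already running: skip this one
    }
    loading = true;
    loadFn(function done() {
      loading = false;    // request finished: allow the next load
    });
    return true;
  };
}
```

In the scroll handler you would call the guarded function instead of loadMoreItems directly, and wire `done` to the `complete` option of `$.ajax` so the flag is reset even if the request fails.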
The AJAX call's success handler should set a "load more content" flag, so you do not try to load more when there is nothing more to load.
For scrolling, set a timeout to detect when the user stops scrolling, then show a loading GIF and make the AJAX call; on success, hide the GIF and load more :D
var scrolled; // timer id; must be declared before the first clearTimeout
$(window).scroll(function(){
clearTimeout(scrolled);
scrolled = setTimeout(function(){
//Your function when user stops scrolling
}, 100);
});
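The same pattern can be wrapped in a reusable helper so the timer variable is not global. This is a generic debounce sketch, not taken from any particular library:

```javascript
// Generic debounce: fn runs only after waitMs have passed without
// another call to the returned function.
function debounce(fn, waitMs) {
  var timer = null;
  return function () {
    var context = this;
    var args = arguments;
    clearTimeout(timer);            // cancel the previously scheduled run
    timer = setTimeout(function () {
      timer = null;
      fn.apply(context, args);      // run with the latest arguments
    }, waitMs);
  };
}

// Possible use with jQuery (assumes loadMoreItems from the question):
// $(window).scroll(debounce(function () { loadMoreItems(); }, 100));
```

Each scroll event resets the timer, so the handler fires once, about 100ms after the last scroll event in a burst.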
I'm trying to make Pac-Man move without using jQuery animation because I want more control, so I'm using setInterval, but it only works sometimes. If you refresh enough, it will eventually "click on" and work fine, but if you refresh again, it won't work. It's at http://pacman.townsendwebdd.com if you want to look at it. Thank you.
//earlier in the code
this.moveInterval = setInterval(_this.move, 40, _this);
move: function(_this)
{
if(_this.pause)//set to true for now
return false;
var horz = 0;
var vert = 0;
var dir = _this.dir;
//set horizontal and vertical directions
if(dir % 2 == 0)
horz = dir - 1;
else
vert = dir - 2;
_this.top += vert;
_this.left += horz;
$('#pacman').css('top', _this.top);
$('#pacman').css('left', _this.left);
},
I'm not sure if this will fix the problem, but I would recommend using requestAnimationFrame instead of setInterval.
The other thing I think could be the problem (and I've been stung by this too) is that you're possibly trying to start the animation before the page has fully loaded. Try putting your code into a function and calling it with the onload attribute of the body tag.
Good luck!
Griffork.
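For illustration, here is roughly what both suggestions could look like together. It is a sketch under assumptions: `stepPosition`, `startLoop` and `render` are names I made up, and the direction encoding is copied from the question's move() function.

```javascript
// Same direction encoding as the question's move(): an even dir moves
// horizontally by (dir - 1), an odd dir moves vertically by (dir - 2).
function stepPosition(pos, dir) {
  var horz = 0;
  var vert = 0;
  if (dir % 2 === 0) {
    horz = dir - 1;
  } else {
    vert = dir - 2;
  }
  return { top: pos.top + vert, left: pos.left + horz };
}

// Animation loop driven by requestAnimationFrame (browser only).
// state is expected to have top, left, dir and pause properties.
function startLoop(state, render) {
  function frame() {
    if (!state.pause) {
      var next = stepPosition(state, state.dir);
      state.top = next.top;
      state.left = next.left;
      render(state);
    }
    requestAnimationFrame(frame); // schedule the next frame
  }
  requestAnimationFrame(frame);
}

// Start only once the page has fully loaded, e.g.:
// window.addEventListener('load', function () {
//   startLoop(pacman, function (s) {
//     $('#pacman').css({ top: s.top, left: s.left });
//   });
// });
```

Note that requestAnimationFrame typically fires once per display refresh rather than every 40ms, so the per-frame step size may need adjusting to keep the same speed.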
I am trying to run some code that grabs the width and height of a div after it is loaded and filled with an image from an AJAX call.
The div is 0x0 until the image is placed so checking the dimensions before pretty much breaks everything.
I have tried .load() (Doesn't work because this is an AJAX call). I also tried the imagesLoaded js plugin. Here is what I have right now:
alert('outside');
function startBubbles(){
alert('inside');
var space = new DisplaySpace($('#bubbleWrap').height(), $('#bubbleWrap').width());
var count = 1;
alert(space.getHeight());
var delay = (function(){
var timer = 0;
return function(afterResize, ms){
clearTimeout (timer);
timer = setTimeout(afterResize, ms);
};
})();
var spawnInterval = self.setInterval(function(){
var aBubble = new Bubble(count, space);
count++
aBubble.spawnimate(aBubble);
setTimeout(function() {aBubble.popBubble(aBubble)},Math.floor((Math.random()*30000)+10000));
}, 500);
}
My alerts all fire before the image is visible, then my 'space' object still has height and width of 0.
bubbleWrap is a div inside the loading zone that contains the image in question. I realize I could probably put a manual delay in here to solve the problem MOST of the time; however, that doesn't seem optimal. What am I missing?
I'm now implementing the load of this particular state like this:
History.Adapter.bind(window,'popstate',function(){
var state = History.getState();
state = 'target='+state.data.state;
$.get('contents.php', state, function(data){
if(state == 'target=bubbles'){
$('#loading_zone').html(data).waitForImages(startBubbles());
}
else{
$('#loading_zone').html(data);
}
});
});
Unfortunately, on page reload, the height ends up as only 10. Everything seems great when I just navigate away and come back, though. Any thoughts?
I have a plugin that could do this nicely.
Simply call it after you inject the new HTML.
// Assume this function is the callback to the *success*.
function(html) {
$("#target").html(html).waitForImages(function() {
// All descendent images have now loaded, and the
// containing div will have the correct dimensions.
});
};
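If you are curious what such a wait involves, the general idea can be sketched in plain JavaScript. This is not the plugin's actual implementation, just an illustration; `whenImagesLoaded` is a made-up name, and it expects a list of image elements such as the result of `container.getElementsByTagName('img')`.

```javascript
// Illustrative only: calls callback once every image in the list has
// either finished loading or failed.
function whenImagesLoaded(images, callback) {
  var list = Array.prototype.slice.call(images);
  // Images that are already complete (e.g. cached) need no waiting.
  var remaining = list.filter(function (img) {
    return !img.complete;
  });
  var pending = remaining.length;
  if (pending === 0) {
    callback();
    return;
  }
  remaining.forEach(function (img) {
    var settled = false;
    function settle() {
      if (settled) {
        return;           // count each image only once
      }
      settled = true;
      pending -= 1;
      if (pending === 0) {
        callback();
      }
    }
    img.addEventListener('load', settle);
    img.addEventListener('error', settle);
  });
}
```

Whichever approach you use, pass the callback as a function reference (`startBubbles`), not as a call (`startBubbles()`); the latter runs the function immediately, before any image has loaded.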
I've got some JavaScript to center images and <object>s on a page if they're over a threshold width. It also checks that certain classes haven't already been manually applied.
$('img,object').bind('load', function() {
w = $(this).width();
if (w > 400 && !( $(this).hasClass('inlineimage') | $(this).parent().hasClass('inlineimage') ))
$(this).css('margin', '10px ' + (parseInt((800-w)/2)-30) +'px');
});
It's horrific but the meaning behind this was all originally quite sane. The CMS doesn't make it easy to specify alignment and developing it to allow this would have taken significant time away from other jobs. A client-side hack works.
The only problem with it is that the JS waits until the whole image has loaded. Obviously this means that on slower networks, the page loads, the images start loading and some time later the images snap into position. Ugly.
But the browser seems to know the width of an image as soon as it starts to download it. I would really love to hook into this event and splat this visual bug.
Of course, if there's a CSS way of approaching this, I'm open to that too.
In browsers that support it, you can poll for natural dimensions:
var interval = setInterval( function() {
if( img.naturalWidth ) {
clearInterval(interval);
console.log( "Natural available: "+ (new Date - now) );
console.log( img.naturalWidth, img.naturalHeight );
}
}, 0 );
In the demo here on uncached image I get:
Natural available: 782
62 71
Loaded: 827
So the real dimensions were available about 45 milliseconds before the load event. Unfortunately, in IE the readyState "loading" doesn't guarantee real dimensions.
Change the query string for the image before each test to ensure uncached.
Here's whatwg link on natural dimensions: http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#dom-img-naturalwidth
var span = document.getElementById('span'); // The parent span
var check = function (){
if(span.offsetWidth > 0){
console.log('Width while loading', span.offsetWidth);
}
else{
setTimeout(check, 100);
}
};
check();
Demo. This should show the width in the console while it's loading first, and then the width after it's loaded. That is as long as the image isn't cached. (If the demo doesn't work for someone, try changing the hoo part of the image URL to anything else)
In the interest of this still working on more than the latest browsers, I've cobbled together a best effort brute force. It waits 500ms between attempts and checks images to see if the current run through is the same width as the last time it tried.
As soon as the width for an image is the same in two consecutive passes, we run the centring code.
This uses arrays to keep track of things so we're not constantly hammering the DOM, nor are we querying items that aren't applicable (because they've already been dealt with or ruled out).
attempts = 0;
arr = [];
$.each($('img,object').not('.inlineimage'), function(){
arr.push([this, -2, $(this).width()]);
});
checkImgs = function() {
attempts++;
newarr = []
$.each(arr, function(){
if ($(this[0]).parent().hasClass('inlineimage'))
return;
w = $(this[0]).width();
this[1] = this[2];
this[2] = w;
if (this[1] != this[2])
return newarr.push(this);
// We know this image is loaded enough now - we can do things!
if (w >= 400)
$(this[0]).css('margin', '10px ' + (parseInt((800-w)/2)-30) +'px');
});
arr = newarr;
if (arr.length && attempts < 6)
setTimeout(checkImgs, 500);
}
setTimeout(checkImgs, 500);
It's not beautiful but it seems to work both efficiently (CPU was getting hammered by some of my earlier attempts) and quickly (cached images spring into place within 500ms).
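For anyone puzzling over the margin expression used in the snippets above, here it is pulled out into a function with a worked example. The constants 800 and 30 come straight from the code; reading them as a content width and a fixed offset is my assumption, and `centeringMargin` is just an illustrative name.

```javascript
// The margin expression from the centring code, isolated.
// w is the rendered width of the image in pixels.
function centeringMargin(w) {
  // Half of the leftover horizontal space, minus a fixed 30px offset.
  return '10px ' + (parseInt((800 - w) / 2) - 30) + 'px';
}

// A 500px-wide image: (800 - 500) / 2 = 150, minus 30 = 120.
console.log(centeringMargin(500)); // "10px 120px"
```

parseInt is used here only to drop the fractional part when the leftover width is odd.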