I am trying to crawl the contents of a website. It works most of the time (finishes in minutes), but sometimes it takes up to 8 hours. I've managed to pinpoint the issue: it occurs in the page.evaluate part. Watching the website with headless: false, the page just loads infinitely (after a click). Also, if I manually run document.querySelector on the page that is stuck loading, it works for me.
The code is the following:
console.log("Test");
let value = await page.evaluate((sel) => {
  let element = document.querySelector(sel);
  return element ? element.innerHTML : null;
}, selector);
console.log("Test2");
What can I do to prevent it from running that long? (I would like to set up some kind of timeout for this case.)
Or how could I track how long the code spends in this part? The code immediately after it never runs (only after hours), probably because of the await.
You should await page.waitForSelector(sel, { timeout: 30000 }) first, then evaluate; the timeout option is in milliseconds and the call throws once it is exceeded, so it cannot hang indefinitely.
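If you also want a hard cap on the evaluate call itself, one option (a sketch of my own, untested against your site; the 30-second values are arbitrary) is to race it against a timer so the crawl moves on instead of hanging for hours:
// hypothetical helper: rejects if the wrapped promise takes longer than `ms`
const withTimeout = (promise, ms) =>
  Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms)
    ),
  ]);
let value = null;
try {
  // throws if the selector never shows up within 30 s
  await page.waitForSelector(selector, { timeout: 30000 });
  // throws if the evaluate itself takes longer than 30 s
  value = await withTimeout(
    page.evaluate((sel) => {
      let element = document.querySelector(sel);
      return element ? element.innerHTML : null;
    }, selector),
    30000
  );
} catch (err) {
  console.log("Giving up on this page:", err.message);
}
Wrapping the step in console.time('evaluate') / console.timeEnd('evaluate') also gives you a simple way to see how long it actually spends there.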
So I am trying to make Puppeteer launch a page and then put a token into its local storage.
.setItem is not working; it just crashes my Chromium.
There is a page called Discord, and if you have a user token you can log in to it with a script.
I found out that someone made a script you can paste into the console; where the code says "token here" you paste your token, and then it all happens:
let token = "your token";
function login(token) {
  setInterval(() => {
    document.body.appendChild(document.createElement('iframe')).contentWindow.localStorage.token = `"${token}"`;
  }, 50);
  setTimeout(() => {
    location.reload();
  }, 2500);
}
login(token);
This is the code.
So my idea is to make Puppeteer run this code and then just log in to the page after the refresh.
Is there any option to do it?
If there is no option, I have thought about another solution: maybe make Puppeteer type the whole code into the console.
If you want to execute JavaScript code in the browser context with Puppeteer, you need the evaluate method. For reference: https://github.com/puppeteer/puppeteer/blob/v10.4.0/docs/api.md#pageevaluatepagefunction-args, or Google for similar and more user-friendly examples - there are plenty on the web and SO.
Basically, to execute any JS code in a browser context, you put your code inside an evaluate method and it should look like this:
await page.evaluate(() => new Promise((resolve) => {
  // your browser JS code goes here; call resolve() when it is done
}));
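Applied to your script, a rough sketch (untested; the login URL and the single localStorage write before the reload are assumptions on my part, since the original console script keeps re-writing the token on an interval) could look like this:
const token = "your token";
await page.goto('https://discord.com/login'); // assumed entry URL
await page.evaluate((t) => {
  // same iframe trick as the console script: write the token into localStorage
  const iframe = document.body.appendChild(document.createElement('iframe'));
  iframe.contentWindow.localStorage.token = `"${t}"`;
}, token);
await page.reload({ waitUntil: 'networkidle2' }); // the site should pick up the token after the reload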
As for the cookies part, they should persist even after reload (haven't tested, but check this for reference: Cookies gone after reload Puppeteer => page.setCookie(...cookies)).
Also, maybe unrelated, but be careful that everything is alright legal-wise, because bots and bot-like behavior are frowned upon by many sites and are often in breach of their ToS.
I have an app I'm working with that is behaving like this... You visit a url /refresh, and it loads the page with a loader/spinner/bar showing for like 5 seconds, then it refreshes the page after it's done. It does this so it can load the latest data that was computed during /refresh.
Right now I am just setting a timeout longer than the loader will most likely stay around, but this is brittle because a bad network connection could put it over the line.
How can I instead "watch" for when the refresh happens? What technique would you recommend? It seems to start to get hairy pretty fast.
Into the nitty gritty: when the loader finishes, it is gone for about half a second before the page reloads, so I can't just wait until the loader is gone. It seems like I need to keep some sort of state variable around in the DOM or in localStorage, but I can't pinpoint it. Would love some help.
Well, you could "watch" for the element that displays the data using page.$(selector), or, if there is no such element, you could wait for the specific request's response:
const waitForResponse = (page, url) => {
  return new Promise(resolve => {
    page.on("response", function callback(response) {
      if (response.url() === url) {
        resolve(response);
        page.removeListener("response", callback);
      }
    });
  });
};
const res = await waitForResponse(page,"url of the request you want to wait for");
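Since the page reloads itself once the loader is done, another option (a sketch only; the URL is a placeholder) is to just wait for that navigation instead of a specific response:
await page.goto('https://example.com/refresh'); // placeholder URL
// resolves when the page navigates/reloads on its own, or throws after 60 s
await page.waitForNavigation({ waitUntil: 'networkidle0', timeout: 60000 });
// the post-refresh page with the fresh data should now be loaded
Newer Puppeteer versions also ship a built-in page.waitForResponse(urlOrPredicate) that does essentially what the helper above does.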
Wait for Network request before continuing process
I am working on a scraper. I am using PhantomJS along with Node.js. PhantomJS loads the page with an async function, like: var status = await page.open(url). Sometimes, because of a slow internet connection, the page takes longer to load and after a while the page status is never returned, so there is no way to check whether it loaded or not. page.open() just sleeps, doesn't return anything at all, and all further execution waits on it.
So, my basic question is: is there any way to keep this page.open(url) from hanging forever, given that the execution of the rest of the code waits until the page is loaded?
My Code is
const phantom = require('phantom');
ph_instance = await phantom.create();
ph_page = await ph_instance.createPage();
var status = await ph_page.open("https://www.cscscholarship.org/");
if (status == 'success') {
  console.log("Page is loaded successfully !");
  //do more stuff
}
From your comment, it seems like it might be timing out (because of slow internet sometimes)... you can validate this by adding the onResourceTimeout handler to your code (link: http://phantomjs.org/api/webpage/handler/on-resource-timeout.html)
It would look something like this:
ph_page.onResourceTimeout = (request) => {
  console.log('Timeout caught: ' + JSON.stringify(request));
};
And if that ends up being true, you can increase the default resource timeout settings (link: http://phantomjs.org/api/webpage/property/settings.html) like this:
ph_page.settings.resourceTimeout = 60000; // 60 seconds
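Two caveats. First, your code goes through the phantom npm bridge rather than plain PhantomJS, so direct property assignment like the above may not be forwarded to the real page; the bridge normally exposes setter methods instead (the exact call is an assumption here, so check the package docs), e.g. something like await ph_page.setting('resourceTimeout', 60000). Second, if you simply never want the script to hang forever regardless of what PhantomJS does, you can race the open() call against a plain timer (a sketch; note it does not abort the underlying load, it only lets your script move on):
const openWithTimeout = (page, url, ms) =>
  Promise.race([
    page.open(url),
    new Promise(resolve => setTimeout(() => resolve('timeout'), ms)),
  ]);
var status = await openWithTimeout(ph_page, "https://www.cscscholarship.org/", 60000);
if (status == 'success') {
  console.log("Page is loaded successfully !");
  //do more stuff
} else {
  console.log("Gave up waiting for the page, status: " + status);
}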
Edit: I know the question is about Phantom, but I also wanted to mention another framework I've used for scraping projects before, called Puppeteer (link: https://pptr.dev/). I personally found its API easier to understand and code against, and it's a currently maintained project, unlike PhantomJS, which is not maintained anymore (its last release was two years ago).
What I'm doing
I've been experimenting with Selenium and making a simple program to make my Selenium testing life easier. Part of this is testing WebElements and figuring out which actions (clicking, submitting, etc.) make them reload the page, remain static, or become stale without reloading the page. In this question I'm particularly interested in the third case, as I have already implemented the first two.
The problem I'm having
The problem I have is finding a WebElement that goes stale and doesn't cause a page reload. I can't think of a good way to search for one, I don't have the HTML and JavaScript skills to make one (yet, anyway), and I can't verify my code works unless I actually test it.
What I've done/tried
The first thing I thought to look for was a popup, but those aren't actually part of the webpage and they're also quite unreliable. I want something that's going to behave consistently, because otherwise the test won't work. I think dynamic WebElements, those that change their locators when acted upon, will suit my needs, but I have no good way of finding them. Any Google results for "self deleting webelement example" or "webelement goes stale doesn't cause page reload example" or similar will only give me questions on Stack Overflow like this one, rather than what I want: concrete examples. The code I'm running simply waits for a StaleElementReferenceException and for an onload event in JavaScript. If the StaleElementReferenceException occurs but the onload event does not, then I know I've found a self-deleting / dynamic WebElement (at least that's what I think is the proper way to detect this). Here is the code I'm running:
try {
    //wait until the element goes stale
    wait.until(ExpectedConditions.stalenessOf(webElement));
    //init the async javascript callback script
    String script = "var callback = arguments[arguments.length - 1];" +
        "var classToCall = 'SeleniumTest.isPageReloaded';" +
        "window.addEventListener('onload'," + "callback(classToCall));";
    //execute the script and wait till it returns (unless timeout exceeded)
    JavascriptExecutor js = (JavascriptExecutor) driver;
    //execute the script and return the java classname to call
    //if/when the callback function returns normally
    String classToCall = (String) js.executeAsyncScript(script);
    clazz = Class.forName(classToCall);
    callbackMethod = clazz.getMethod("JavascriptWorking");
    callbackMethod.invoke(null,null);
    //page reloaded
    threadcase = 1;
}
//waiting until webElement becomes stale exceeded timeoutSeconds
catch (TimeoutException e) {
    //page was static
    threadcase = 2;
}
//waiting until webElement Reloaded the page exceeded timeoutSeconds
catch (ScriptTimeoutException e) {
    //the webElement became stale BUT didn't cause a page reload.
    threadcase = 3;
}
As you can see above, there is an int variable named threadcase in this code. The three 'cases', starting from 1 (0 was the starting value, which represented a program flow error), represent the three (non-error) possible results of this test:
the page reloads
the page remains static, webelement doesn't change
the page remains static, webelement changes
And I need a good example with which to test the third case.
Solutions I've considered
I've done some basic research into removing WebElements with JavaScript, but (A) I don't even know if I can act on the page like that from Selenium, and (B) I'd rather get a test case that just uses the webpage as-is, since introducing my own edits makes the validity of my test case reliant on more of my code (which is bad!). So what I need is a good way of finding a WebElement that matches my criteria without having to scour the internet with the F12 window open, hoping to find that one button that does what I need.
Edit 1
I just tried doing this test more manually; it was suggested in an answer that I manually delete a WebElement at the right time and then test my program that way. What I tested was the Google homepage. I tried using the Google Apps button because clicking it doesn't cause the whole page to reload. So my thinking was: I'll click it, halt program execution, manually delete it, run the rest of my code, and since no onload events will occur, my program will pass the test. To my surprise, that's not what happened.
The exact code I ran is below. I had my debug breakpoint on the first line:
1 Method callbackMethod = null;
2 try {
3 //wait until the element goes stale
4 wait.until(ExpectedConditions.stalenessOf(webElement));
5 //init the async javascript callback script
6 String script = "var callback = arguments[arguments.length - 1];" +
7 "var classToCall = 'SeleniumTest.isPageReloaded';" +
8 "window.addEventListener('onload', callback(classToCall));";
9 //execute the script and wait till it returns (unless timeout
10 //exceeded)
11 JavascriptExecutor js = (JavascriptExecutor) driver;
12 //execute the script and return the java classname to call if/when
13 //the callback function returns normally
14 String classToCall = (String) js.executeAsyncScript(script);
15 clazz = Class.forName(classToCall);
16 callbackMethod = clazz.getMethod("JavascriptWorking");
17 callbackMethod.invoke(null,null);
18 //page reloaded
19 threadcase = 1;
20 }
21 //waiting until webElement becomes stale exceeded timeoutSeconds
22 catch (TimeoutException e) {
23 //page was static
24 threadcase = 2;
25 }
26 //waiting until webElement Reloaded the page exceeded
27 //timeoutSeconds
28 catch (ScriptTimeoutException e) {
29 //the webElement became stale BUT didn't cause a page reload.
30 threadcase = 3;
31 //trying to get the class from javascript callback failed.
32 }
What's supposed to happen is that a stale WebElement causes the program to stop waiting on line 4; the program progresses, initializes the JavaScript callback in lines 6-11, and then on line 14 the call to executeAsyncScript is SUPPOSED to wait until an 'onload' event, which should only occur if the page reloads. Right now it's not doing that, or I'm blind. I must be confusing the program flow, because I'm 99% certain that there are no page reloads happening when I manipulate the DOM to delete the WebElement I'm clicking on.
This is the URL I'm trying:
https://www.google.com/webhp?gws_rd=ssl
It is the plain Google homepage; the button I'm deleting is the Google Apps button (the black 9-dot grid in the top right).
some info on that element:
class="gb_8 gb_9c gb_R gb_g"
id="gbwa"
It's the general container element for the button itself and the dropdown it creates. I delete this when my program hits the breakpoint on line 1, then I step through my program in the debugger. Note: you may have to click 'Inspect element' on the button more than once to focus in on it. I'm going to try deleting lower-level elements rather than the whole container and see if that changes anything, but this behavior still baffles me. The goal here is to get the program flow to threadcase 3, because that's the one we are testing for. There should be no page reload, BUT the WebElement should become stale after I manually delete it. I have no clue why the JavaScript callback is running when I can't see a page reload. Let me know if you need more info on what exactly I'm deleting on the Google homepage and I'll try sending a picture (with optional freehand circles, of course).
I would think that you could debug through a test, place a breakpoint at a suitable point, then use the browser's dev tools to manually update the HTML.
Obviously, if you want this to be a repeatable process it is not an option, but if you are just investigating, then a manual intervention could be suitable.
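For the Google-homepage experiment in the edit, the manual intervention can be a single line pasted into the dev-tools console while the debugger is paused (using the id given in the question; it may of course change between Google page versions):
// removes the apps-button container without any navigation,
// so the previously located WebElement should go stale
document.getElementById('gbwa').remove();
If you later want this repeatable, the same snippet can be pushed through executeScript instead of being typed by hand. Incidentally, one possible reason the callback in the edit fires immediately is that addEventListener('onload', callback(classToCall)) calls callback right away instead of passing it as a listener (and the standard event name is 'load', not 'onload').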
I'm using the Python Selenium webdriver with its driver.get_log() functionality to read console errors from the browser.
It works OK, except that I'm interested in errors occurring after the page has completed its load (an ad player keeps loading content).
I tried running execute_async_script() to make the app wait 10 seconds while logging the errors, but it fails when the JavaScript ends, with a "bad response from script" error.
I also tried implicitly_wait() and set_script_timeout(), but no luck.
How can I accomplish this?
I thought about creating an infinite loop to prevent the page from reaching its load-finished event, but I'm not sure how to do that, or whether it would produce another batch of log errors that I'm not interested in.
You can query document.readyState via JavaScript, which returns "complete" when the page is loaded. Obviously the loop below is blocking and needs some kind of timeout.
def wait_for_page_to_load():
    state = ''
    counter = 0
    while state != 'complete':
        counter += 1
        if counter > 60:
            return False
        state = _driver.execute_script('return document.readyState;')
        sleep(5)
    return True
I made it work using execute_async_script() together with set_script_timeout().
I set the async script to:
var start = new Date().getTime(); var end = start; while(end < start + 10000) { end = new Date().getTime(); }
And I run it right after calling driver.get(url).
All I have to do now to play with the timeout is to change x in set_script_timeout(x).
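One caveat with the busy-wait: it blocks the page's own JavaScript while it spins, so the ad player cannot run (or log errors) during those 10 seconds. A gentler variant to try (a sketch, keeping the same 10-second window) hands control back to the page and calls the callback that execute_async_script injects as the last argument:
// non-blocking: the page keeps running (and logging errors) while we wait,
// then we call the webdriver-provided callback so the script returns cleanly
var done = arguments[arguments.length - 1];
setTimeout(done, 10000);
As before, set_script_timeout() has to be set larger than the 10-second wait.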