I have a script made using node.js and puppeteer which downloads a file from a button (which doesn't redirect to a url), so right now i'm using await await page.waitForTimeout(1000); to wait for the download to complete but it has a few flaws, such as:
Depending on the connection, the download might take more than 1000ms to finish, as well as it might take less, which wouldn't make sense to wait more than what took to finish the download.
My question is, is there a way to wait for a download to complete using Node+Puppeteer? I have tried using waitUntil: 'networkidle0 and networkidle2 but both seem to wait forever.
Code below:
const path = require('path');
const puppeteer = require('puppeteer');
(async () => {
/* Initialize some variables */
const browser = await puppeteer.launch();
// Instantiates a new page
const page = await browser.newPage();
// Gets current path
const downloadPath = path.resolve('./');
// Specifies wether it allows downloading multiple files or not
await page._client.send('Page.setDownloadBehavior',
{behavior: 'allow', downloadPath: downloadPath});
// Goes to My Website
await page.goto('http://localhost:8080/mywebsite');
// Exports to CSV
await page.waitForSelector("#W0009EXPORTAXLS > a > i", {visible: true});
await page.tap("#W0009EXPORTAXLS > a > i");
await page.waitForTimeout(1000);
// Log
console.log('File exported.');
// Closes the browser
await browser.close();
})();
Related
When I try to run this:
const puppeteer = require('puppeteer')
async function start() {
const browser = await puppeteer.launch({headless: false, slowMo: 250})
const page = await browser.newPage()
await page.goto('https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte', {waitUntil: "domcontentloaded"})
console.log(await page.content())
await browser.close()
}
start()
The page.content() does not match the html I can see in a browser. I´m assuming the website is JS-based and I am just getting the original (not hydrated) content of the website when it reached chromium.
How do I get the Html content I can see in chromium? I actually do not need the entire page, just some data of it. But page.$$ was not helpful either.
I'm trying to scrape an address from whitepages.com, but my scraper keeps throwing this error every time I run it.
(node:11389) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'getProperty' of undefined
here's my code:
const puppeteer = require('puppeteer')
async function scrapeAddress(url){
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url,{timeout: 0, waitUntil: 'networkidle0'});
const [el]= await page.$x('//*[#id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
// console.log(el)
const txt = await el.getProperty('textContent');
const rawTxt = await txt.jsonValue();
console.log({rawTxt});
browser.close();
}
scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
After investigating a bit, I realized that the el variable is getting returned as undefined and I'm not sure why. I've tried this same code to get elements from other sites but only for this site am I getting this error.
I tried both the full and short XPath as well as other surrounding elements and everything on this site throws this error.
Why would this be happening and is there any way I can fix it?
You can try wrapping everything in a try catch block, otherwise try unwrapping the promise with then().
(async() => {
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(url,{timeout: 0, waitUntil: 'networkidle0'});
const [el]= await page.$x('//*[#id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
// console.log(el)
const txt = await el.getProperty('textContent');
const rawTxt = await txt.jsonValue();
console.log({rawTxt});
} catch (err) {
console.error(err.message);
} finally {
await browser.close();
}
})();
The reason is the website detects puppeteer as an automated bot. Set the headless to false and you can see it never navigates to the website.
I'd suggest using puppeteer-extra-plugin-stealth. Also always make sure to wait for the element to appear in the page.
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());
async function scrapeAddress(url){
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url,{waitUntil: 'networkidle0'});
//wait for xpath
await page.waitForXPath('//*[#id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
const [el]= await page.$x('//*[#id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
// console.log(el)
const txt = await el.getProperty('textContent');
const rawTxt = await txt.jsonValue();
console.log({rawTxt});
browser.close();
}
scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
I recently ran into this error and changing my xpath worked for me. I had one grabbing the Full xpath and it was causing some issues
Most probably because the website is responsive, therefore when the scraper runs, it shows different XPATH.
I would suggest you to debug by using a headless browser:
const browser = await puppeteer.launch({headless: false});
I took the code that #mbit provided and modified it to my needs and also used a headless browser. I was unable to do it using a headless browser. If anyone was able to figure out how to do that please explain. Here is my solution:
first you must install a couple things in console bash so run the following two commands:
npm install puppeteer-extra
npm install puppeteer-extra-plugin-stealth
Installing these will allow you to run the first few lines in #mbit 's code.
Then in this line of code:
const browser = await puppeteer.launch();
as a parameter to puppeteer.launch(); pass in the following:
{headless: false}
which should in turn look like this:
const browser = await puppeteer.launch({headless: false});
I also believe that the Path that #mbit was using may not exist anymore so provide one of your own as well as a site. You can do this using the following 3 lines of code, just replace {XPath} with your own XPath and {address} with your own web address. NOTE: be mindful of your usage of quotes '' or "" as the XPath address may have the same ones that you are used to using which will mess up your path.
await page.waitForXPath({XPath});
const [el]= await page.$x({XPath});
scrapeAddress({address})
After you do this you should be able to run your code and retrieve values
Heres what my code looked like in the end, feel free to copy paste into your own file to confirm that it works on your end at all!
let puppeteer = require('puppeteer-extra');
let pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());
puppeteer = require('puppeteer')
async function scrapeAddress(url){
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto(url,{waitUntil: 'networkidle0'});
//wait for xpath
await page.waitForXPath('//*[#id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
const [el]= await page.$x('//*[#id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
const txt = await el.getProperty('textContent');
const rawTxt = await txt.jsonValue();
console.log({rawTxt});
browser.close();
}
scrapeAddress("https://stockx.com/air-jordan-1-retro-high-unc-leather")
How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), with Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to.
However, looking through the many excellent examples online, there is no obvious method for doing so. The closest I have been able to find is calling
html_contents = await page.content()
and saving the results, but that saves a copy without any non-HTML elements.
Is there way to save webpages for offline use with Puppeteer?
It is currently possible via experimental CDP call 'Page.captureSnapshot' using MHTML format:
'use strict';
const puppeteer = require('puppeteer');
const fs = require('fs');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('https://en.wikipedia.org/wiki/MHTML');
const cdp = await page.target().createCDPSession();
const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
fs.writeFileSync('page.mhtml', data);
await browser.close();
} catch (err) {
console.error(err);
}
})();
Is there a way to record a performance timeline for tests run with Puppeteer?
(source: google.com)
Yes, just use page.tracing methods like in this example:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.tracing.start({ path: 'trace.json' });
await page.goto('https://en.wikipedia.org');
await page.tracing.stop();
await browser.close();
})();
And then load trace.json file in Chrome Performance tab. If you want more details here is an article with a chapter dedicated to analyzing page tracing.
So I am trying to use puppeteer to automate some data entry functions in Oracle Cloud applications.
As of now I am able to launch the cloud app login page, enter username and password credentials and click login button. Once login is successful, Oracle opens a homepage for the user. Once this happens if I take screenshot or execute a page.content the screenshot and the content html is from the login page not of the homepage.
How do I always have a reference to the current page that the user is on?
Here is the basic code so far.
const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
const browser = await puppeteer.launch({headless: false});
let page = await browser.newPage();
await page.goto('oraclecloudloginurl', {waitUntil: 'networkidle2'});
await page.type('#userid', 'USERNAME', {delay : 10});
await page.type('#password', 'PASSWORD', {delay : 10});
await page.waitForSelector('#btnActive', {enabled : true});
page.click('#btnActive', {delay : 1000}).then(() => console.log('Login Button Clicked'));
await page.waitForNavigation();
await page.screenshot({path: 'home.png'});
const html = await page.content();
await fs.writeFileSync('home.html', html);
await page.waitFor(10000);
await browser.close();
})();
With this the user logs in fine and the home page is displayed. But I get an error after that when I try to screenshot the homepage and render the html content. It seems to be the page has changed and I am referring to the old page. How can I refer to the context of the current page?
Below is the error:
(node:14393) UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id undefined
This code looks problematic for two reasons:
page.click('#btnActive', {delay : 1000}).then(() => console.log('Login Button Clicked'));
await page.waitForNavigation();
The first problem is that the page.click().then() spins off a totally separate promise chain:
page.click() --> .then(...)
|
v
page.waitForNavigation()
|
v
page.screenshot(...)
|
v
...
This means the click that triggers the navigation and the navigation are running in parallel and can never be rejoined into the same promise chain. The usual solution here is to tie them into the same promise chain:
// Note: this code is still broken; keep reading!
await page.click('#btnActive', {delay : 1000});
console.log('Login Button Clicked');
await page.waitForNavigation();
This adheres to the principle of not mixing then and await unless you have good reason to.
But the above code is still broken because Puppeteer requires the waitForNavigation() promise to be set before the event that triggers navigation is fired. The fix is:
await Promise.all([
page.waitForNavigation(),
page.click('#btnActive', {delay : 1000}),
]);
or
const navPromise = page.waitForNavigation(); // no await
await page.click('#btnActive', {delay : 1000});
await navPromise;
Following this pattern, Puppeteer should no longer be confused about its context.
Minor notes:
'networkidle2' is slow and probably unnecessary, especially for a page you're soon going to be navigating away from. I'd default to 'domcontentloaded'.
await page.waitFor(10000); is deprecated along with page.waitForTimeout(), although I realize this is an older post.