Puppeteer page.content() not matching the HTML in Chromium - javascript

When I try to run this:
const puppeteer = require('puppeteer')

async function start() {
  const browser = await puppeteer.launch({ headless: false, slowMo: 250 })
  const page = await browser.newPage()
  await page.goto('https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte', { waitUntil: 'domcontentloaded' })
  console.log(await page.content())
  await browser.close()
}

start()
The output of page.content() does not match the HTML I can see in the browser. I'm assuming the website is JavaScript-rendered and I am just getting the original (not yet hydrated) markup as it arrived in Chromium.
How do I get the HTML content I can see in Chromium? I don't actually need the entire page, just some data from it, but page.$$ was not helpful either.
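For client-rendered pages like this, the usual fix is to wait until the data you need is actually in the DOM before reading the content. A minimal sketch, assuming the measurement values are rendered into a table ('table' is a placeholder selector, not taken from the actual page):

const puppeteer = require('puppeteer')

async function start() {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  // networkidle0 resolves once the page has stopped making requests,
  // which usually means client-side rendering has finished
  await page.goto('https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte', { waitUntil: 'networkidle0' })
  // wait for the rendered element you care about before reading anything
  await page.waitForSelector('table')
  // page.content() now reflects the hydrated DOM
  console.log(await page.content())
  // or extract just the data you need from the page context
  const rows = await page.$$eval('table tr', trs => trs.map(tr => tr.textContent.trim()))
  console.log(rows)
  await browser.close()
}

start()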

Related

Puppeteer can't find the absolute path of the page

I'm trying to write a Puppeteer web-scraping script in JavaScript that gets the email addresses of many course sellers on that site, and when I make it click on one, it returns an "Evaluation failed" error. I am using this example: (https://www.tutorialspoint.com/puppeteer/puppeteer_absolute_xpath.htm), and I already tried it on my machine and it works. My guess is that I'm not giving the right path, but in the example they tell you to just copy the absolute path of the element you want to click and it should work. (Sorry for my poor English, it's not my first language.) Anyway, here is my code:
const puppeteer = require('puppeteer');

async function robo() {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: false });
  const page = await browser.newPage();
  await page.goto('https://app-vlc.hotmart.com/market/search?categoryId=25&page=1&userLanguage=PT_BR');
  await page.type('[data-test-id="login-email"]', 'my e-mail');
  await page.type('[data-test-id="login-password"]', 'my password');
  await page.click('[data-test-id="login-submit"]');
  await page.waitForTimeout(7000);
  // this is the line that throws "Evaluation failed"
  await page.click("/html/body/div[2]/div[1]/div[2]/main/div[2]//div/div[2]/div[3]/div[1]")[0];
  await page.waitForTimeout(4000);
  // obtain URL after click
  console.log(await page.url());
  // await browser.close();
}

robo();
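The failing line is the click: page.click() accepts only CSS selectors, not XPath expressions, and its return value is a Promise, so indexing it with [0] does nothing. A sketch of the usual XPath pattern instead, reusing the question's XPath (which may have changed on the live site):

// wait until the XPath resolves, then click the resulting element handle
const xpath = '/html/body/div[2]/div[1]/div[2]/main/div[2]//div/div[2]/div[3]/div[1]';
await page.waitForXPath(xpath);
const [item] = await page.$x(xpath);
await item.click();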

How to get the URL of the video element of an iframe using Puppeteer

Good afternoon, friends.
I am trying to extract the video URL from the iframe. From what I can see, I first have to click the video so that the video element becomes visible in the HTML.
Isn't there a way to automate the process? That is, click automatically when entering the URL and then extract the URL from the video.
At the end, I'm supposed to get the following URL from the video:
https://fvs.io/redirector?token=aVVHRmZNVzZVdldkRXJUZXdrSWRQV2RxQ2RSSjdFNGphTVBVQTVBRTR4TlpFYXdMbzlXaktueW9ETW5ma2QvYjlOZG42Mzg2eGNWSDNjT3BHUC8wMmxyUTcrZyt4ZzRwV0s4UWVLcWQzZExzdUVBN1dIbUVmSVhrbnlIWENwWHhFR09LRVBHcXpLUmg4NFlCaW10SzBGeVU2VXVNL3FvMjpUMXRDKytHYng5S1RTTU1laG0vbFZRPT0
index.js
const puppeteer = require('puppeteer');

const main = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // note: the option name is waitUntil (camelCase); 'waituntil' is silently ignored
  await page.goto('https://feurl.com/v/2625zf2pddy2gew', { waitUntil: 'networkidle0' });
  await page.waitFor(1000);
  // frame.click() requires a selector; 'body' here is a placeholder target
  await page.mainFrame().click('body');
  await browser.close();
};

main();
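One way to automate it (a sketch, not verified against this particular player): perform the click, give the player a moment to build the element, then poll every frame on the page for a <video> and read its src. The click target and the frame handling are assumptions about how the embed behaves:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://feurl.com/v/2625zf2pddy2gew', { waitUntil: 'networkidle0' });

  // simulate the user click that makes the player create the <video> element;
  // clicking the viewport center is an assumption, a real selector would be better
  const { width, height } = page.viewport();
  await page.mouse.click(width / 2, height / 2);
  await page.waitForTimeout(2000);

  // the <video> may live in the main frame or in a child iframe, so check all frames
  let src = null;
  for (const frame of page.frames()) {
    const video = await frame.$('video');
    if (video) {
      src = await frame.$eval('video', v => v.src || v.currentSrc);
      if (src) break;
    }
  }
  console.log(src); // expected: the fvs.io/redirector?token=... URL
  await browser.close();
})();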

Puppeteer error: Cannot read property 'getProperty' of undefined while scraping Whitepages

I'm trying to scrape an address from whitepages.com, but my scraper keeps throwing this error every time I run it.
(node:11389) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'getProperty' of undefined
here's my code:
const puppeteer = require('puppeteer')

async function scrapeAddress(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { timeout: 0, waitUntil: 'networkidle0' });
  const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  // console.log(el)
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  browser.close();
}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
After investigating a bit, I realized that the el variable is coming back undefined, and I'm not sure why. I've used this same code to get elements from other sites; only on this site do I get this error.
I tried both the full and the short XPath, as well as other surrounding elements; everything on this site throws this error.
Why would this be happening, and is there any way I can fix it?
You can try wrapping everything in a try/catch block; otherwise, try unwrapping the promise with then().
(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { timeout: 0, waitUntil: 'networkidle0' });
    const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue();
    console.log({ rawTxt });
  } catch (err) {
    console.error(err.message);
  } finally {
    await browser.close();
  }
})();
The reason is that the website detects Puppeteer as an automated bot. Set headless to false and you can see it never navigates to the website.
I'd suggest using puppeteer-extra-plugin-stealth. Also, always make sure to wait for the element to appear on the page before querying it.
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');

puppeteer.use(pluginStealth());

async function scrapeAddress(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // wait for the XPath to appear before querying it
  await page.waitForXPath('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  // console.log(el)
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  browser.close();
}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
I recently ran into this error, and changing my XPath worked for me. I had been grabbing the full XPath, and it was causing issues.
Most probably the website is responsive, so when the scraper runs it is served a different layout and therefore a different XPath.
I would suggest debugging with a visible (non-headless) browser:
const browser = await puppeteer.launch({headless: false});
I took the code that @mbit provided and modified it to my needs, running it with a visible (non-headless) browser. I was unable to get it working headless; if anyone figures out how to do that, please explain. Here is my solution:
First, you must install a couple of packages, so run the following two commands in your terminal:
npm install puppeteer-extra
npm install puppeteer-extra-plugin-stealth
Installing these will allow you to run the first few lines of @mbit's code.
Then in this line of code:
const browser = await puppeteer.launch();
pass the following as a parameter to puppeteer.launch():
{headless: false}
which should in turn look like this:
const browser = await puppeteer.launch({headless: false});
I also believe that the XPath @mbit was using may not exist anymore, so provide one of your own, as well as a site. You can do this using the following three lines of code; just replace {XPath} with your own XPath and {address} with your own web address. NOTE: be mindful of your quotes ('' vs ""), as the XPath may itself contain the kind you are used to using, which will break the string.
await page.waitForXPath({XPath});
const [el]= await page.$x({XPath});
scrapeAddress({address})
After you do this, you should be able to run your code and retrieve values.
Here's what my code looked like in the end; feel free to copy-paste it into your own file to confirm that it works on your end!
let puppeteer = require('puppeteer-extra');
let pluginStealth = require('puppeteer-extra-plugin-stealth');

puppeteer.use(pluginStealth());
// note: re-requiring plain 'puppeteer' here would overwrite the
// stealth-wrapped instance and disable the plugin, so don't do it

async function scrapeAddress(url) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // wait for the XPath to appear
  await page.waitForXPath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
  const [el] = await page.$x('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  browser.close();
}

scrapeAddress("https://stockx.com/air-jordan-1-retro-high-unc-leather")

How to use Puppeteer to automate Amazon Connect CCP login?

I'm trying to use Puppeteer to automate the login process for our agents in Amazon Connect; however, I can't get Puppeteer to finish loading the CCP login page. See the code below:
const browser = await puppeteer.launch();
const page = await browser.newPage();
const url = 'https://ccalderon-reinvent.awsapps.com/connect/ccp#/';
await page.goto(url, {waitUntil: 'domcontentloaded'});
console.log(await page.content());
// console.log('waiting for username input');
// await page.waitForSelector('#wdc_username');
await browser.close();
I can never see the content of the page; it times out. Am I doing something wrong? If I launch the browser with { headless: false } I can see the page never finishes loading.
Please note the same code works fine with https://www.github.com/login so it must be something specific to the source code of Connect's CCP.
In case you are from the future and having problems with Puppeteer for no apparent reason, try downgrading the Puppeteer version first and see if the issue persists.
This seems to be a bug in Chromium development version 73.0.3679.0. The error log said it could not load a specific script, but the script could still be loaded manually.
The Solution:
Using Puppeteer version 1.11.0 solved this issue. If you want to keep Puppeteer version 1.12.2 but use a different Chromium revision, you can use the executablePath argument.
Here are the Chromium versions bundled with each Puppeteer release (as of this answer):
Chromium 73.0.3679.0 - Puppeteer v1.12.2
Chromium 72.0.3582.0 - Puppeteer v1.11.0
Chromium 71.0.3563.0 - Puppeteer v1.9.0
Chromium 70.0.3508.0 - Puppeteer v1.7.0
Chromium 69.0.3494.0 - Puppeteer v1.6.2
I checked my locally installed Chrome, which was loading the page correctly:
$(which google-chrome) --version
Google Chrome 72.0.3626.119
Note: the Puppeteer team suggests in their docs to use the Chromium bundled with the package (most likely the latest development version) instead of other revisions.
I also edited the code a little so it finishes loading when all network requests are done and the username input is visible.
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({
headless: false,
executablePath: "/usr/bin/google-chrome"
});
const page = await browser.newPage();
const url = "https://ccalderon-reinvent.awsapps.com/connect/ccp#/";
await page.goto(url, { waitUntil: "networkidle0" });
console.log("waiting for username input");
await page.waitForSelector("#wdc_username", { visible: true });
await page.screenshot({ path: "example.png" });
await browser.close();
})();
The specific revision number can be obtained in many ways; one is to check the package.json of the puppeteer package. The URL for 1.11.0 is:
https://github.com/GoogleChrome/puppeteer/blob/v1.11.0/package.json
If you'd like to automate the Chromium revision download, you can use browserFetcher to fetch a specific revision (inside an async function):

const browserFetcher = puppeteer.createBrowserFetcher();
const revisionInfo = await browserFetcher.download('609904'); // Chromium 72 is r609904
const browser = await puppeteer.launch({ executablePath: revisionInfo.executablePath });

Puppeteer performance timeline?

Is there a way to record a performance timeline for tests run with Puppeteer?
Yes, just use page.tracing methods like in this example:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.tracing.start({ path: 'trace.json' });
  await page.goto('https://en.wikipedia.org');
  await page.tracing.stop();
  await browser.close();
})();
Then load the trace.json file in the Chrome DevTools Performance tab. If you want more details, there is an article with a chapter dedicated to analyzing page tracing.
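The trace can also be inspected programmatically. A minimal sketch, assuming the standard Chrome trace format (a JSON object with a traceEvents array; the aggregation below is illustrative, not part of the Puppeteer API):

const fs = require('fs');

// load the trace written by page.tracing.stop()
const trace = JSON.parse(fs.readFileSync('trace.json', 'utf8'));
const events = Array.isArray(trace) ? trace : trace.traceEvents;

// sum durations (microseconds) of complete ('X') events per event name
const totals = {};
for (const e of events) {
  if (e.ph === 'X' && typeof e.dur === 'number') {
    totals[e.name] = (totals[e.name] || 0) + e.dur;
  }
}

// print the ten most expensive event types
Object.entries(totals)
  .sort((a, b) => b[1] - a[1])
  .slice(0, 10)
  .forEach(([name, dur]) => console.log(`${name}: ${(dur / 1000).toFixed(1)} ms`));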
