Puppeteer can't find the absolute path of the page - JavaScript

I'm trying to build a Puppeteer web-scraping script in JavaScript that collects the email addresses of the many course sellers on that site, but when I make it click on one, it returns an "Evaluation failed" error. I'm following this example (https://www.tutorialspoint.com/puppeteer/puppeteer_absolute_xpath.htm), which I already tried on my machine and it works. My guess is that I'm not passing the right path, but the example says to just copy the absolute path of the element you want to click and it should work. Anyway, here is my code:
const puppeteer = require('puppeteer');

async function robo() {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: false });
  const page = await browser.newPage();
  await page.goto('https://app-vlc.hotmart.com/market/search?categoryId=25&page=1&userLanguage=PT_BR');
  await page.type('[data-test-id="login-email"]', 'my e-mail');
  await page.type('[data-test-id="login-password"]', 'my password');
  await page.click('[data-test-id="login-submit"]');
  await page.waitForTimeout(7000);
  // this is the line that fails with "Evaluation failed"
  await page.click("/html/body/div[2]/div[1]/div[2]/main/div[2]//div/div[2]/div[3]/div[1]")[0];
  await page.waitForTimeout(4000);
  // obtain URL after click
  console.log(await page.url());
}
// browser.close();
// await browser.close();
robo();
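For what it's worth, page.click() only accepts CSS selectors, which is why passing an absolute XPath fails evaluation; the tutorial's approach needs the XPath resolved to an element handle first. A minimal sketch of that pattern (the XPath is copied from the question and may itself not match the live page):

const xpath = "/html/body/div[2]/div[1]/div[2]/main/div[2]//div/div[2]/div[3]/div[1]";
await page.waitForXPath(xpath);       // wait until the element exists in the DOM
const [card] = await page.$x(xpath);  // resolve the XPath to an element handle
await card.click();                   // click the handle, not the string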

Related

I am trying to scrape an article from a website by its XPath with JavaScript but I am getting "undefined"

I was following a quick scraping tutorial, but even after following each step I am not able to replicate what was shown.
Basically, I want to scrape an article title by copying the XPath of the element from a website, but I keep getting errors. Have a look at my code below.
const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const [el] = await page.$x('/html/body/div[2]/div[4]/div/div[1]/section[1]/div[2]/section/div/div/div/div[2]/a[2]/span[2]/span');
  const src = await el.getProperty('src');
  const srcTxt = await src.jsonValue();
  console.log(srcTxt);
}

scrapeProduct('https://www.kerkida.net/');
I get "undefined" logged instead of the title.
Node.js and Puppeteer are already installed.
Please advise on possible solutions.
I tried scraping an article title by copying the XPath of the element from the website and expected to receive its text back in JSON format.
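A likely cause, for what it's worth: the target element is a span, and span elements have no src property, so getProperty('src') resolves to undefined. A minimal sketch reading the text instead (same XPath as above):

const [el] = await page.$x('/html/body/div[2]/div[4]/div/div[1]/section[1]/div[2]/section/div/div/div/div[2]/a[2]/span[2]/span');
if (el) {
  const txt = await el.getProperty('textContent'); // spans carry text, not src
  console.log(await txt.jsonValue());
}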

Puppeteer page.content() not matching the HTML in Chromium

When I try to run this:
const puppeteer = require('puppeteer');

async function start() {
  const browser = await puppeteer.launch({ headless: false, slowMo: 250 });
  const page = await browser.newPage();
  await page.goto('https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte', { waitUntil: 'domcontentloaded' });
  console.log(await page.content());
  await browser.close();
}

start();
The page.content() output does not match the HTML I can see in a browser. I'm assuming the website is JS-based and I am just getting the original (not yet hydrated) content of the page as it arrived in Chromium.
How do I get the HTML content I can see in Chromium? I actually do not need the entire page, just some data from it, but page.$$ was not helpful either.
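One approach that usually helps, as a sketch: domcontentloaded fires before client-side rendering finishes, so wait for the rendered content before reading the DOM (the 'table' selector below is a placeholder; substitute whatever wraps the data you need):

// wait until the network goes quiet, so client-side rendering can finish
await page.goto('https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte', { waitUntil: 'networkidle0' });
// or, more precisely, wait for the element that holds the data (placeholder selector)
await page.waitForSelector('table');
const html = await page.content(); // now reflects the rendered DOM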

Puppeteer error "Cannot read property 'getProperty' of undefined" while scraping Whitepages

I'm trying to scrape an address from whitepages.com, but my scraper keeps throwing this error every time I run it.
(node:11389) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'getProperty' of undefined
Here's my code:
const puppeteer = require('puppeteer');

async function scrapeAddress(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { timeout: 0, waitUntil: 'networkidle0' });
  const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  // console.log(el)
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  await browser.close();
}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
After investigating a bit, I realized that the el variable is coming back undefined, and I'm not sure why. I've used this same code to get elements from other sites; only on this site do I get this error.
I tried both the full and short XPath as well as other surrounding elements, and everything on this site throws this error.
Why would this be happening, and is there any way I can fix it?
You can try wrapping everything in a try/catch block; otherwise, try unwrapping the promise with .then():
const puppeteer = require('puppeteer');

(async () => {
  const url = 'https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs';
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { timeout: 0, waitUntil: 'networkidle0' });
    const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue();
    console.log({ rawTxt });
  } catch (err) {
    console.error(err.message);
  } finally {
    await browser.close();
  }
})();
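For completeness, the .then() variant mentioned above might look something like this sketch (same XPath; any rejection along the chain lands in the final catch):

page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]')
  .then(([el]) => el.getProperty('textContent'))
  .then((txt) => txt.jsonValue())
  .then((rawTxt) => console.log({ rawTxt }))
  .catch((err) => console.error(err.message));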
The reason is that the website detects Puppeteer as an automated bot. Set headless to false and you can watch it never actually navigate to the website.
I'd suggest using puppeteer-extra-plugin-stealth. Also, always make sure to wait for the element to appear on the page:
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());

async function scrapeAddress(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // wait for the XPath to appear before querying it
  await page.waitForXPath('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  // console.log(el)
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  await browser.close();
}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
I recently ran into this error, and changing my XPath fixed it. I had been grabbing the full XPath and it was causing issues.
Most probably the website is responsive, so when the scraper runs it sees a different layout and the copied XPath no longer matches.
I would suggest debugging with the browser window visible instead of headless:
const browser = await puppeteer.launch({headless: false});
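To make the XPath survive responsive layout changes, one option is to anchor it on content or stable attributes rather than position. A sketch (the text being matched is illustrative, not taken from the site):

// positional XPaths like //*[@id="left"]/div/div[4]/... break when the layout shifts;
// matching on visible text is more resilient
const [el] = await page.$x("//h3/span[contains(text(), 'Address')]");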
I took the code that @mbit provided, modified it to my needs, and also ran the browser visibly. I was unable to get it working in headless mode; if anyone figures out how to do that, please explain. Here is my solution:
First, install two packages by running the following commands:
npm install puppeteer-extra
npm install puppeteer-extra-plugin-stealth
Installing these will allow you to run the first few lines of @mbit's code.
Then, in this line:
const browser = await puppeteer.launch();
pass {headless: false} as a parameter to puppeteer.launch(), so it looks like this:
const browser = await puppeteer.launch({headless: false});
I also believe the XPath @mbit was using may no longer exist, so provide one of your own, along with your own site. You can do this with the following three lines of code, replacing {XPath} with your own XPath and {address} with your own web address. NOTE: be mindful of your quotes ('' vs ""), since the XPath may contain the same kind you are quoting with, which will break the string.
await page.waitForXPath({XPath});
const [el] = await page.$x({XPath});
scrapeAddress({address})
After you do this, you should be able to run your code and retrieve values.
Here's what my code looked like in the end; feel free to copy-paste it into your own file to confirm that it works on your end:
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());

async function scrapeAddress(url) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // wait for the XPath to appear before querying it
  await page.waitForXPath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
  const [el] = await page.$x('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  await browser.close();
}

scrapeAddress("https://stockx.com/air-jordan-1-retro-high-unc-leather")

Login to Gmail fails for unknown reason

I am trying to log in to my Gmail with Puppeteer to lower the risk of reCAPTCHA.
Here is my code:
await page.goto('https://accounts.google.com/AccountChooser?service=mail&continue=https://mail.google.com/mail/', { timeout: 60000 })
  .catch(function (error) {
    throw new Error('TimeoutBrows');
  });

await page.waitForSelector('#identifierId', { visible: true });
await page.type('#identifierId', 'myemail');

await Promise.all([
  page.click('#identifierNext'),
  page.waitForSelector('.whsOnd', { visible: true })
]);

await page.type('#password .whsOnd', "mypassword");
await page.click('#passwordNext');
await page.waitFor(5000);
but I always end up with an error message. I even tried just opening the login window with Puppeteer and filling in the login form manually myself, but even that failed.
Am I missing something?
When I look into the console, there is a failed AJAX call just after login:
Request URL: https://accounts.google.com/_/signin/challenge?hl=en&TL=APDPHBCG5lPol53JDSKUY2mO1RzSwOE3ZgC39xH0VCaq_WHrJXHS6LHyTJklSkxd&_reqid=464883&rt=j
Request Method: POST
Status Code: 401
Remote Address: 216.58.213.13:443
Referrer Policy: no-referrer-when-downgrade
)]}'
[[["er",null,null,null,null,401,null,null,null,16]
,["e",2,null,null,81]
]]
I've inspected your code and it seems to be correct apart from some selectors. Also, I had to add a couple of timeouts to make it work. However, I failed to reproduce your issue, so I'll just post the code that worked for me.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://accounts.google.com/AccountChooser?service=mail&continue=https://mail.google.com/mail/', { timeout: 60000 })
    .catch(function (error) {
      throw new Error('TimeoutBrows');
    });
  await page.screenshot({ path: './1.png' });
  ...
})();
Please note that I run the browser in normal, not headless, mode. If you take a screenshot at this point, you will see the correct Google login form.
The rest of the code is responsible for entering the password:
const puppeteer = require('puppeteer');

(async () => {
  ...
  await page.waitForSelector('#identifierId', { visible: true });
  await page.type('#identifierId', 'my@email');

  await Promise.all([
    page.click('#identifierNext'),
    page.waitForSelector('.whsOnd', { visible: true })
  ]);

  await page.waitForSelector('input[name=password]', { visible: true });
  await page.type('input[name=password]', "my.password");

  await page.waitForSelector('#passwordNext', { visible: true });
  await page.waitFor(1000);
  await page.click('#passwordNext');
  await page.waitFor(5000);
})();
Please also note a few differences from your code: the selector for the password field is different, and I had to add await page.waitForSelector('#passwordNext', {visible: true}); plus a small timeout after it so the button could be clicked successfully.
I've tested all the code above and it worked. Please let me know if you still need help or run into trouble with my example.
The purpose of the question is to log in to Gmail. I will share another method that does not involve filling the email and password fields in the Puppeteer script, and it works in headless: true mode.
Method
1. Log in to your Gmail using a normal browser (Google Chrome, preferably).
2. Export all cookies for the Gmail tab.
3. Use page.setCookie to import the cookies into your Puppeteer instance.
4. Log in to Gmail.
This should be a no-brainer.
Export all cookies
I will use an extension called EditThisCookie; however, you can use other extensions or manual methods to extract the cookies.
Click the extension's icon and then click the Export button.
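If you'd rather capture the cookies with Puppeteer itself instead of an extension, here is a rough sketch (note this only grabs cookies for the current page's origin; Google's login spans several domains, so the extension route may capture more):

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://mail.google.com/');
  console.log('Log in manually in the opened window, then press Enter here.');
  process.stdin.once('data', async () => {
    const cookies = await page.cookies(); // cookies visible to the current page
    fs.writeFileSync('./cookies.json', JSON.stringify(cookies, null, 2));
    await browser.close();
    process.exit(0);
  });
})();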
Import cookies into the Puppeteer instance
We will save the cookies in a cookies.json file and then import them with the page.setCookie function before navigation. That way, when the Gmail page loads, it will have the login information right away.
The code might look like this.
const puppeteer = require("puppeteer");
const cookies = require('./cookies.json');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set cookies here, right after creating the instance
  await page.setCookie(...cookies);

  // do the navigation
  await page.goto("https://mail.google.com/mail/u/0/#search/stackoverflow+survey", {
    waitUntil: "networkidle2",
    timeout: 60000
  });

  await page.screenshot({ path: "example.png" });
  await browser.close();
})();
Result: (screenshot of the logged-in session omitted)
Notes:
It was not asked, but I should mention the following for future readers.
Cookie expiration: cookies might be short-lived, expire shortly after export, or behave differently on a different device. Logging out on your original device will log the Puppeteer session out as well, since it shares the cookies.
Two-factor: I am not yet sure about 2FA authentication. It did not ask me about 2FA, probably because I logged in from the same device.

How do I reference the current page object in puppeteer once user moves from login to homepage?

So I am trying to use Puppeteer to automate some data-entry functions in Oracle Cloud applications.
As of now, I am able to launch the cloud app login page, enter the username and password, and click the login button. Once login succeeds, Oracle opens a homepage for the user. After that, if I take a screenshot or call page.content(), the screenshot and the HTML content are from the login page, not the homepage.
How do I always have a reference to the current page the user is on?
Here is the basic code so far.
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  let page = await browser.newPage();
  await page.goto('oraclecloudloginurl', { waitUntil: 'networkidle2' });

  await page.type('#userid', 'USERNAME', { delay: 10 });
  await page.type('#password', 'PASSWORD', { delay: 10 });
  await page.waitForSelector('#btnActive', { enabled: true });
  page.click('#btnActive', { delay: 1000 }).then(() => console.log('Login Button Clicked'));

  await page.waitForNavigation();
  await page.screenshot({ path: 'home.png' });
  const html = await page.content();
  await fs.writeFileSync('home.html', html);

  await page.waitFor(10000);
  await browser.close();
})();
With this, the user logs in fine and the homepage is displayed, but I then get the error below when I try to screenshot the homepage and save its HTML content. It seems the page has changed and I am still referring to the old page. How can I refer to the context of the current page?
Below is the error:
(node:14393) UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id undefined
This code looks problematic for two reasons:
page.click('#btnActive', {delay : 1000}).then(() => console.log('Login Button Clicked'));
await page.waitForNavigation();
The first problem is that the page.click().then() spins off a totally separate promise chain:
page.click() --> .then(...)
    |
    v
page.waitForNavigation()
    |
    v
page.screenshot(...)
    |
    v
...
This means the click that triggers the navigation and the navigation are running in parallel and can never be rejoined into the same promise chain. The usual solution here is to tie them into the same promise chain:
// Note: this code is still broken; keep reading!
await page.click('#btnActive', {delay : 1000});
console.log('Login Button Clicked');
await page.waitForNavigation();
This adheres to the principle of not mixing then and await unless you have good reason to.
But the above code is still broken because Puppeteer requires the waitForNavigation() promise to be set before the event that triggers navigation is fired. The fix is:
await Promise.all([
page.waitForNavigation(),
page.click('#btnActive', {delay : 1000}),
]);
or
const navPromise = page.waitForNavigation(); // no await
await page.click('#btnActive', {delay : 1000});
await navPromise;
Following this pattern, Puppeteer should no longer be confused about its context.
Minor notes:
'networkidle2' is slow and probably unnecessary, especially for a page you're soon going to navigate away from. I'd default to 'domcontentloaded'.
await page.waitFor(10000); is deprecated, as is its replacement page.waitForTimeout(), although I realize this is an older post.
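Putting the answer's pieces together, a sketch of the corrected flow (selectors and the placeholder URL come from the question; the trailing fixed delay is dropped since the navigation wait already synchronizes things):

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('oraclecloudloginurl', { waitUntil: 'domcontentloaded' });

  await page.type('#userid', 'USERNAME', { delay: 10 });
  await page.type('#password', 'PASSWORD', { delay: 10 });
  await page.waitForSelector('#btnActive');

  // arm waitForNavigation() before the click that triggers it
  await Promise.all([
    page.waitForNavigation(),
    page.click('#btnActive', { delay: 1000 }),
  ]);

  await page.screenshot({ path: 'home.png' });
  fs.writeFileSync('home.html', await page.content());
  await browser.close();
})();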
