How to get the URL of a video element inside an iframe using Puppeteer - javascript

Good afternoon friend,
I am trying to extract the video URL from the iframe on this page. From what I can see, I first have to click the video so that the video element appears in the HTML.
Isn't there a way to automate the process?
That is: click automatically when the page loads, then extract the video's URL.
At the end I should get the following URL from the video:
https://fvs.io/redirector?token=aVVHRmZNVzZVdldkRXJUZXdrSWRQV2RxQ2RSSjdFNGphTVBVQTVBRTR4TlpFYXdMbzlXaktueW9ETW5ma2QvYjlOZG42Mzg2eGNWSDNjT3BHUC8wMmxyUTcrZyt4ZzRwV0s4UWVLcWQzZExzdUVBN1dIbUVmSVhrbnlIWENwWHhFR09LRVBHcXpLUmg4NFlCaW10SzBGeVU2VXVNL3FvMjpUMXRDKytHYng5S1RTTU1laG0vbFZRPT0
index.js
const puppeteer = require('puppeteer');

const main = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // The option is "waitUntil" (camelCase); "waituntil" is silently ignored
  await page.goto('https://feurl.com/v/2625zf2pddy2gew', { waitUntil: 'networkidle0' });
  // page.waitFor() is deprecated; use page.waitForTimeout()
  await page.waitForTimeout(1000);
  // Frame.click() requires a selector; clicking the frame object itself is invalid.
  // 'body' is a placeholder; the real play overlay lives inside the iframe (see the sketch below)
  await page.mainFrame().click('body');
  await browser.close();
};

main();
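A sketch of one way to automate this, assuming the player sits in a child iframe and that the fvs.io/redirector URL appears as a network request once playback starts (both are assumptions, not confirmed behavior of the site):

const puppeteer = require('puppeteer');

const main = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Print any request that looks like the redirector URL we are after.
  page.on('request', (req) => {
    if (req.url().includes('fvs.io/redirector')) {
      console.log('video url:', req.url());
    }
  });

  await page.goto('https://feurl.com/v/2625zf2pddy2gew', { waitUntil: 'networkidle0' });

  // Grab the first child frame (assumption: the player is the only iframe).
  const frame = page.frames().find((f) => f !== page.mainFrame());
  if (frame) {
    // Clicking the frame's body stands in for the player's real play-overlay selector.
    await frame.click('body');
    // Once playback starts, the <video> element should expose its resolved src.
    await frame.waitForSelector('video', { timeout: 10000 });
    console.log('video element src:', await frame.$eval('video', (v) => v.src));
  }

  await browser.close();
};

main();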

Related

Puppeteer page.content() not matching the HTML in Chromium

When I try to run this:
const puppeteer = require('puppeteer')

async function start() {
  const browser = await puppeteer.launch({ headless: false, slowMo: 250 })
  const page = await browser.newPage()
  await page.goto('https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte', { waitUntil: 'domcontentloaded' })
  console.log(await page.content())
  await browser.close()
}

start()
The page.content() does not match the HTML I can see in a browser. I'm assuming the website is JS-based and I am just getting the original (not yet hydrated) content of the website as it reached Chromium.
How do I get the HTML content I can see in Chromium? I actually do not need the entire page, just some data from it. But page.$$ was not helpful either.
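A sketch of one fix, assuming the data is rendered client-side: wait for the rendering to finish before reading page.content(), either with a stricter waitUntil or by waiting for a node only the hydrated page contains ('table' below is a hypothetical selector for the rendered measurements):

const puppeteer = require('puppeteer')

async function start() {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  // networkidle0 waits until the page has stopped making requests,
  // which usually means client-side rendering is done
  await page.goto('https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte', { waitUntil: 'networkidle0' })
  // Or wait explicitly for an element the client-side JS renders
  await page.waitForSelector('table') // hypothetical selector for the hydrated data
  console.log(await page.content())
  await browser.close()
}

start()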

Puppeteer can't find the absolute path of the page

I'm trying to write a Puppeteer web-scraping script in JavaScript that gets the email address of the many course sellers on that site. When I try to make it click on one, it returns an "Evaluation failed" error. I'm following this example (https://www.tutorialspoint.com/puppeteer/puppeteer_absolute_xpath.htm), which I already tried on my machine and it works. My guess is that I'm not giving the right path, but the example says to just copy the absolute path of the element you want to click and it should work. (Sorry for my poor English, it's not my first language.) Anyway, here is my code:
const puppeteer = require('puppeteer');

async function robo() {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
  const page = await browser.newPage();
  await page.goto('https://app-vlc.hotmart.com/market/search?categoryId=25&page=1&userLanguage=PT_BR');
  await page.type('[data-test-id="login-email"]', 'my e-mail');
  await page.type('[data-test-id="login-password"]', 'my password');
  await page.click('[data-test-id="login-submit"]');
  await page.waitForTimeout(7000);
  // page.click() takes a CSS selector, not an XPath;
  // locate the node with page.$x() and click the returned handle
  const [card] = await page.$x('/html/body/div[2]/div[1]/div[2]/main/div[2]//div/div[2]/div[3]/div[1]');
  await card.click();
  await page.waitForTimeout(4000);
  // obtain URL after click
  console.log(page.url());
  await browser.close();
}

robo();
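"Evaluation failed" here usually means the XPath matched nothing at the moment of the click: the marketplace renders after login, and absolute paths are brittle. A small sketch of a safer pattern (the helper name is mine) that waits for the node before clicking:

// Click an element located by XPath, waiting for it to appear first.
async function clickByXPath(page, xpath) {
  await page.waitForXPath(xpath, { timeout: 15000 });
  const [handle] = await page.$x(xpath);
  await handle.click();
}

Calling clickByXPath(page, '...') instead of clicking immediately after a fixed timeout avoids racing the render.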

How do you scroll down a sub-scrollbar in Puppeteer?

I am new to using Puppeteer and I am trying to scrape store-hour information from a store website. When you open the section that has this information, there is a scrollbar (not the browser's own) that you need to scroll all the way down to load all of the data. Currently, I am extracting all the data on the page using the code from user codetinker in this Stack Overflow answer.
It is:
const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('https://netto.dk/find-butik/?mapData={%22coordinates%22:{%22lat%22:54.41413535660238,%22lng%22:13.985595703125002},%22zoom%22:7,%22input%22:%22%22}', { waitUntil: 'networkidle0' });
    const data = await page.evaluate(() => document.querySelector('*').outerHTML);
    console.log(data);
    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
But this doesn't work, because the sub-scrollbar hasn't been scrolled all the way down when Puppeteer grabs the HTML.
What I want is for Puppeteer to scroll this "sub" scrollbar all the way down so I can scrape all the data I need. Is there any way to accomplish this?
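A sketch of one approach: scroll the inner container itself from page.$eval(), repeating until its scrollHeight stops growing, i.e. no more rows are lazy-loaded. The '.store-list' selector is an assumption; substitute the real scrollable element's selector from DevTools:

// Scroll an inner scrollable container to its bottom until no new
// content is lazy-loaded (scrollHeight stops growing).
async function scrollSubContainer(page, selector) {
  let previousHeight = -1;
  for (;;) {
    const height = await page.$eval(selector, (el) => {
      el.scrollTop = el.scrollHeight; // jump to the current bottom
      return el.scrollHeight;
    });
    if (height === previousHeight) break; // nothing new appeared
    previousHeight = height;
    await new Promise((r) => setTimeout(r, 500)); // let lazy content load
  }
}

// Usage (selector is hypothetical):
// await scrollSubContainer(page, '.store-list');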

Puppeteer goto with query parameters does not work

I am using Puppeteer's page.goto(url) to navigate to a page that ends with .html?page=2
So a page I would call would look something like this:
https://www.billa.at/search/results?category=&searchTerm=brot&page=2
Here is my code:
const Puppeteer = require('puppeteer')

const browser = await Puppeteer.launch()
const page = await browser.newPage()
await page.goto(url)
console.log(page.url())
Unfortunately the query params are ignored. When I log page.url(), the returned value is the URL without the query params, which looks like this:
https://www.billa.at/search/results?category=&searchTerm=brot
Executing page.goto(url, { waitUntil: [anything] }) results in a timeout.
Help would be greatly appreciated.
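To see where the params go, one sketch (assuming the site itself redirects and strips them, which page.goto() would follow silently) is to inspect the navigation's redirect chain:

const puppeteer = require('puppeteer')

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  const url = 'https://www.billa.at/search/results?category=&searchTerm=brot&page=2'
  const response = await page.goto(url)
  // Each entry is a request that was redirected on the way to the final URL.
  for (const req of response.request().redirectChain()) {
    console.log('redirected from:', req.url())
  }
  console.log('final url:', page.url())
  await browser.close()
})()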

Puppeteer error: Cannot read property 'getProperty' of undefined while scraping Whitepages

I'm trying to scrape an address from whitepages.com, but my scraper keeps throwing this error every time I run it.
(node:11389) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'getProperty' of undefined
Here's my code:
const puppeteer = require('puppeteer')

async function scrapeAddress(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { timeout: 0, waitUntil: 'networkidle0' });
  const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  // console.log(el)
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  await browser.close();
}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
After investigating a bit, I realized that the el variable comes back undefined, and I'm not sure why. I've used the same code to get elements from other sites; only on this site do I get this error.
I tried both the full and the short XPath, as well as other surrounding elements, and everything on this site throws this error.
Why is this happening, and is there any way I can fix it?
You can try wrapping everything in a try/catch block; otherwise, try unwrapping the promise with .then().
(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // url: the Whitepages address from the question
    await page.goto(url, { timeout: 0, waitUntil: 'networkidle0' });
    const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue();
    console.log({ rawTxt });
  } catch (err) {
    console.error(err.message);
  } finally {
    await browser.close();
  }
})();
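For the .then() alternative mentioned above, a minimal sketch (same XPath, page already open on the target URL) looks like this:

// Unwrap the promise chain with .then()/.catch() instead of try/catch.
page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]')
  .then(([el]) => el.getProperty('textContent'))
  .then((txt) => txt.jsonValue())
  .then((rawTxt) => console.log({ rawTxt }))
  .catch((err) => console.error(err.message));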
The reason is that the website detects Puppeteer as an automated bot. Set headless to false and you can see it never navigates to the website.
I'd suggest using puppeteer-extra-plugin-stealth. Also, always make sure to wait for the element to appear on the page.
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());

async function scrapeAddress(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // wait for the XPath to appear before querying it
  await page.waitForXPath('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  // console.log(el)
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  await browser.close();
}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
I recently ran into this error, and changing my XPath worked for me. I had one grabbing the full XPath, and it was causing issues.
Most probably this is because the website is responsive, so when the scraper runs it sees a different layout and therefore a different XPath.
I would suggest debugging with a visible (non-headless) browser:
const browser = await puppeteer.launch({headless: false});
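If a responsive layout is the culprit, pinning the viewport to a desktop size also helps; the dimensions below are arbitrary, a sketch:

const page = await browser.newPage();
// Force a desktop layout so responsive breakpoints don't change the DOM or the XPath.
await page.setViewport({ width: 1366, height: 768 });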
I took the code that @mbit provided, modified it to my needs, and ran it with a visible (non-headless) browser. I was unable to get it working in headless mode; if anyone figures out how to do that, please explain. Here is my solution:
First, install two packages by running the following commands:
npm install puppeteer-extra
npm install puppeteer-extra-plugin-stealth
Installing these lets you run the first few lines of @mbit's code.
Then, in this line of code:
const browser = await puppeteer.launch();
pass {headless: false} as a parameter to puppeteer.launch(), so it looks like this:
const browser = await puppeteer.launch({headless: false});
I also believe that the path @mbit was using may no longer exist, so provide your own XPath and site. You can do that with the following three lines of code; just replace {XPath} with your own XPath and {address} with your own web address. NOTE: be mindful of your quotes ('' or "") because the XPath may contain the same kind you are quoting with, which will break the string.
await page.waitForXPath({XPath});
const [el] = await page.$x({XPath});
scrapeAddress({address})
After you do this, you should be able to run the code and retrieve values.
Here's what my code looked like in the end; feel free to copy-paste it into your own file to confirm that it works on your end!
let puppeteer = require('puppeteer-extra');
let pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());
// Note: do NOT re-require plain 'puppeteer' here; that would overwrite the
// stealth-wrapped instance and disable the plugin.

async function scrapeAddress(url) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // wait for the XPath to appear before querying it
  await page.waitForXPath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
  const [el] = await page.$x('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({ rawTxt });
  await browser.close();
}

scrapeAddress("https://stockx.com/air-jordan-1-retro-high-unc-leather")
