Why will puppeteer not click on the video - javascript

I am currently writing a simple program that grabs the name of a song from my Discord bot, finds the video, and passes it to a function to convert it to mp3. My problem is that Puppeteer doesn't click on the video and instead just returns the search page link.
Here is the code that grabs the link and passes it on to be downloaded:
async function findSongName(stringWithName) {
  let stringName = stringWithName.replace(commands.play, '')
  const link = 'https://www.youtube.com';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(link)
  await page.type('ytd-searchbox#search.style-scope.ytd-masthead', stringName);
  page.keyboard.press('Enter');
  await page.click('yt-interaction#extended');
  console.log(page.url())
  await browser.close()
}

It sounds like you want to get the title and URL of the top result for a YT search. For starters, you don't need to start at the YT homepage. Just navigate to https://www.youtube.com/results?search_query=${yourQuery} to speed things up and reduce complexity.
Next, if you view the page source of /results, there's a large (~1 MB) global data structure called ytInitialData that has all of the relevant results in it (along with a lot of other irrelevant stuff, admittedly). Theoretically, you could grab the page with Axios, parse out ytInitialData with Cheerio, grab your data using plain array/object JS and skip Puppeteer entirely.
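For illustration, here's a rough sketch of that Axios + Cheerio route. The exact shape of ytInitialData isn't documented and changes over time, so the regex extraction and the property path below are assumptions you'd want to verify against the live page:

const axios = require("axios");
const cheerio = require("cheerio");

const searchYTData = async searchQuery => {
  const url = `https://www.youtube.com/results?search_query=${encodeURIComponent(searchQuery)}`;
  const {data: html} = await axios.get(url);
  const $ = cheerio.load(html);

  // Find the <script> that assigns ytInitialData and pull the JSON literal out of it.
  // Naive extraction; fine as a sketch, but not bulletproof.
  const script = $("script")
    .toArray()
    .map(el => $(el).html())
    .find(code => code && code.includes("ytInitialData"));
  const data = JSON.parse(script.match(/ytInitialData\s*=\s*(\{.*?\});/s)[1]);

  // Assumed path to the result items; adjust if YT changes its data layout.
  const items = data.contents.twoColumnSearchResultsRenderer.primaryContents
    .sectionListRenderer.contents[0].itemSectionRenderer.contents;
  return items
    .filter(item => item.videoRenderer)
    .map(({videoRenderer: v}) => ({
      title: v.title.runs[0].text,
      href: `https://www.youtube.com/watch?v=${v.videoId}`,
    }));
};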
Of course, using the YT search API is the most reliable and proper way.
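If you go that route, the v3 search endpoint is a single request. Here's a minimal sketch (assumes you have a YouTube Data API v3 key in YT_API_KEY; the field names come from the v3 docs):

const axios = require("axios");

const searchYTApi = async query => {
  const {data} = await axios.get("https://www.googleapis.com/youtube/v3/search", {
    params: {part: "snippet", type: "video", q: query, key: process.env.YT_API_KEY},
  });
  return data.items.map(item => ({
    title: item.snippet.title,
    href: `https://www.youtube.com/watch?v=${item.id.videoId}`,
  }));
};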
Since you're using Puppeteer, though, the data can be pulled out of the "a#video-title" elements as follows:
const puppeteer = require("puppeteer");

const searchYT = async (page, searchQuery) => {
  const encodedQuery = encodeURIComponent(searchQuery);
  const url = `https://www.youtube.com/results?search_query=${encodedQuery}`;
  await page.goto(url);
  const sel = "a#video-title";
  await page.waitForSelector(sel);
  return page.$$eval(sel, els =>
    els.map(e => ({
      title: e.textContent.trim(),
      href: e.href,
    }))
  );
};

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  await page.setRequestInterception(true);
  page.on("request", req => {
    req.resourceType() === "image" ? req.abort() : req.continue();
  });
  const results = await searchYT(page, "stack overflow");
  console.log(results);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;
Output (for the search term "stack overflow"):
[
  {
    title: 'Stack Overflow is full of idiots.',
    href: 'https://www.youtube.com/watch?v=I_ZK0t9-llo'
  },
  {
    title: "How To Use Stack Overflow (no, ForrestKnight, it's not full of idiots)",
    href: 'https://www.youtube.com/watch?v=sMIslcynm0Q'
  },
  {
    title: 'How to use Stack Overflow as a Beginner ?',
    href: 'https://www.youtube.com/watch?v=Vt-Wf7d0CFo'
  },
  {
    title: 'How Microsoft Uses Stack Overflow for Teams',
    href: 'https://www.youtube.com/watch?v=mhh0aK6yJgA'
  },
  // ...
]
Since you only want the first result, it's the first element of this array. If you want more than the initial batch, either work through ytInitialData as described above or scroll the page down with Puppeteer, as in the rough sketch below.
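Here's one way that scrolling could look (the scroll distance and the fixed delay are arbitrary assumptions; a smarter approach would wait for the result count to change):

// Scrolls the results page until at least minResults video links are present.
const loadMoreResults = async (page, minResults) => {
  const sel = "a#video-title";
  while ((await page.$$(sel)).length < minResults) {
    await page.evaluate(() => window.scrollBy(0, 10000));
    await new Promise(resolve => setTimeout(resolve, 1000)); // crude wait for lazy-loaded items
  }
  return page.$$eval(sel, els =>
    els.map(e => ({title: e.textContent.trim(), href: e.href}))
  );
};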
Now that you have a video URL that you want to turn into an mp3, I'd recommend youtube-dl. There are Node wrappers you can install to access its API easily, such as node-youtube-dl, which was the first result when I searched (I haven't used it myself).
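If you'd rather not pull in a wrapper, a minimal sketch that shells out to the youtube-dl CLI works too (assumes youtube-dl and ffmpeg are installed and on the PATH; -x and --audio-format are youtube-dl's own options):

const {execFile} = require("child_process");

// Downloads the audio of a video URL and converts it to mp3 via youtube-dl.
const downloadMp3 = url =>
  new Promise((resolve, reject) => {
    execFile("youtube-dl", ["-x", "--audio-format", "mp3", url], (err, stdout) =>
      err ? reject(err) : resolve(stdout)
    );
  });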

Related

unable to get ElementHandle when filtering puppeteer elements

I am trying to use Puppeteer to scrape data from https://pagespeed.web.dev - I need to be able to take screenshots of the results and while it would be much simpler to use Google's own API to get the raw data, I can't screenshot the actual results that way. The challenge I'm having is filtering out DOM elements while still retaining the ElementHandle nature of the objects. For example, getting the "This URL" / "Origin" buttons:
In a normal JS console, I would run this:
[...document.querySelectorAll('button')].filter(b => b.innerText === 'This URL')
This would give me an Array of DOM elements that I could then run click() on or whatever.
I have tried a number of ways to get Puppeteer to give me a usable ElementHandle object and they have all returned an array of objects, with the sole member of that object being an __incrementalDOMData object:
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto(`https://pagespeed.web.dev/report?url=${url}`)
await page.waitForSelector(homepageWaitSelector, { visible: true })

// here's where the fun starts
const buttons = await page.$$eval('button', buttons => buttons.filter(button => button.innerText === 'This URL'))

const buttons = await page.$$eval('button', buttons => buttons.map(b => b.innerText).filter(t => t === 'This URL'))

// This one seems to run because the list of elements returned has the right length, but I can never get a breakpoint to catch inside the `evaluate` method, nor does a `console.log` statement actually print.
const buttons = await page.evaluate(() => {
  const b = [...document.querySelectorAll('button')].filter(b => b.innerText === 'This URL')
  return b
})
All of those methods end up returning something like this:
[{
__incrementalDOMData: {
j: ['class', 'the-classes-used'],
key: 'a key',
v: true
}]
Because there are so many buttons, and because the class names are all random and reused, I can't just target the ones I want (I mean I suppose I could build a super precise selector), filter them, and then return not just the filtered data but the actual elements themselves. Is what I'm asking for even possible?
Thanks to ggorlen's comment above I simplified my approach and got something working without an overly complex selector or logic:
const buttonContainer = await page.$('div.R3HzDf.ucyZQe')
const thisUrlButton = await buttonContainer.$(':first-child')
await thisUrlButton.evaluate(b => b.click())
await page.screenshot({ path: 'homepage.png' })
const originButton = await buttonContainer.$(':last-child')
await originButton.evaluate(b => b.click())
await page.screenshot({ path: 'origin.png' })
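For reference, the general pattern for filtering by text while keeping usable ElementHandles looks something like the sketch below. page.$$ and ElementHandle.evaluate are standard Puppeteer APIs; the 'This URL' text match just mirrors the question:

// page.$$ returns ElementHandles, which can each be probed with evaluate()
// without losing their handle nature.
const handles = await page.$$('button');
const matches = [];
for (const handle of handles) {
  const text = await handle.evaluate(el => el.innerText.trim());
  if (text === 'This URL') matches.push(handle);
}
if (matches.length) await matches[0].click();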

Is there a way to open multiple tabs simultaneously on Playwright or Puppeteer to complete the same tasks?

I just started coding, and I was wondering if there was a way to open multiple tabs concurrently with one another. Currently, my code goes something like this:
const puppeteer = require("puppeteer");

const rand_url = "https://www.google.com";

async function initBrowser() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(rand_url);
  await page.setViewport({
    width: 1200,
    height: 800,
  });
  return page;
}
async function login(page) {
  await page.goto("https://www.google.com");
  await page.waitFor(100);
  await page.type("input[id='user_login']", "xxx");
  await page.waitFor(100);
  await page.type("input[id='user_password']", "xxx");
}
This is not my exact code (I've swapped in different aliases), but you get the idea. I was wondering if anyone knows how to open this same browser in multiple instances, replacing only the respective login info in each. Of course, it would also be great to prevent my IP from getting banned, so if there is a way to apply a proxy to each respective browser/instance, that would be perfect.
Lastly, I would like to know whether Playwright or Puppeteer is better at handling these multiple instances. I don't even know if this is possible, but please enlighten me. I want to learn more.
You can use multiple browser windows with different logins/cookies.
For simplicity, you can use the puppeteer-cluster module by Thomas Dondorf.
This module launches and queues your Puppeteer tasks one by one, so you can use it to automate your logins and even save the login cookies for subsequent launches.
Feel free to go to the Github: https://github.com/thomasdondorf/puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2, // <= the number of parallel tasks running simultaneously
  }) // You can change this to the number of CPUs,
  const cpuNumber = require('os').cpus().length // for example

  await cluster.task(async ({ page, data: [username, password] }) => {
    await page.goto('https://www.example.com')
    await page.waitForTimeout(100)
    await page.type('input[id="user_login"]', username)
    await page.waitForTimeout(100)
    await page.type('input[id="user_password"]', password)
    const screen = await page.screenshot()
    // Store screenshot, save cookies, do something else
  });

  cluster.queue(['myFirstUsername', 'PassW0Rd1'])
  cluster.queue(['anotherUsername', 'Secr3tAgent!'])
  // cluster.queue([username, password])
  // the username and password array is passed into the cluster task function
  // ...many more pages/accounts

  await cluster.idle()
  await cluster.close()
})()
For Playwright, sadly, the module above doesn't support it yet; you can use a browser pool (cluster) module to automate the Playwright launcher.
And for proxy usage, I recommend the Puppeteer library, as it is the more established one.
If this helps you, don't forget to mark my answer as the accepted one.
There are profiling and proxy options; you could combine them to achieve your goal:
Profile, https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context
import { chromium } from 'playwright'

const userDataDir = '/tmp/' + process.argv[2]
const browserContext = await chromium.launchPersistentContext(userDataDir)
// ...
Proxy, https://playwright.dev/docs/api/class-browsertype#browser-type-launch
import { chromium } from 'playwright'

const proxy = { /* secret */ }
const browser = await chromium.launch({
  proxy: { server: 'pre-context' }
})
const browserContext = await browser.newContext({
  proxy: {
    server: `http://${proxy.ip}:${proxy.port}`,
    username: proxy.username,
    password: proxy.password,
  }
})
// ...
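Combining the two is straightforward, since launchPersistentContext accepts the same proxy option. A sketch (the profile path scheme and the proxy object are placeholders):

import { chromium } from 'playwright'

// One persistent profile per account, each routed through its own proxy.
const proxy = { /* secret */ }
const browserContext = await chromium.launchPersistentContext('/tmp/' + process.argv[2], {
  proxy: {
    server: `http://${proxy.ip}:${proxy.port}`,
    username: proxy.username,
    password: proxy.password,
  }
})
// ...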

Unable to scrape certain elements with Cheerio

I'm trying to scrape a product page using Puppeteer and Cheerio (this page).
I'm using a data id to scrape the title and image. The problem is that the title never gets scraped, while the image does every time.
I've tried scraping the title by class name, but that doesn't work either. Does this have something to do with the specific website I'm trying to scrape? Thank you.
my code:
// Load cheerio
const $ = cheerio.load(data);

/* Scrape Product Page */
const product = [];

// Title
$('[data-testid="product-name"]').each(() => {
  product.push({
    title: $(this).text(),
  });
});

// Image
$('[data-testid="product-detail-image"]').each((index, value) => {
  const imgSrc = $(value).attr('src');
  product.push({
    image: imgSrc,
  });
});
As I mentioned in the comments, there's almost no use case I can think of that makes sense with both Puppeteer and Cheerio at once. If the data is static, use Cheerio alongside a simple request library like Axios, otherwise use Puppeteer and skip Cheerio entirely in favor of native Puppeteer selectors.
One other potential reason to use Puppeteer is if your requests library is being blocked by the server's robot detector, as appears to be the case here.
This script worked for me:
const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const url = "https://stockx.com/nike-air-force-1-low-white-07";
  await page.goto(url);
  const nameSel = '[data-testid="product-name"]';
  await page.waitForSelector(nameSel, {timeout: 60000});
  const name = await page.$eval(nameSel, el => el.textContent);
  const imgSel = '[data-testid="product-detail-image"]';
  await page.waitForSelector(imgSel);
  const src = await page.$eval(imgSel, el => el.src);
  console.log(name, src);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;

Page doesn't see cookies in Puppeteer

EDIT for Mission Clarity: In the end I am pulling inventory data and customer data from Postgres to render and send a bunch of PDFs to customers, once per month.
These PDFs are dynamic in that the cover page will have a varying customer name/address. The next page(s) are also dynamic, as they are lists of a particular customer's expiring inventory with item/expiry date/serial number.
I had made a client-side React page with print CSS to render some print-layout letters that could be printed off/saved as a pretty PDF.
Then the waterfall spec came in that this was to be an automated process on the server. Basically, the PDF needs to be attached to an email alerting customers of expiring product (in the med industry, where everything needs to be audited).
I thought using Puppeteer would be a nice and easy switch. Just add a route that processes all customers, looking up whatever may be expiring, and then passing that into the dynamic react page to be rendered headless to a PDF file (and eventually finish the whole rest of the plan, sending email, etc.). Right now I just grab 10 customers and their expiring stock for PoC, then I have basically: { customer: {}, expiring: [] }.
I've attempted POSTing to the page with request interception, but I suppose it makes sense that I cannot get the POST data in the browser.
So I switched my approach to using cookies. I would expect this to work, but I can never read the cookie(s) in the page.
Here is a simple route, a simple Puppeteer script that writes the cookies out to JSON and takes a screenshot just for proof, and a simple HTML page with a script, all of which I'm using just to try to prove I can pass data along.
server/index.js:
app.get('/testing', async (req, res) => {
  console.log('GET /testing');
  res.sendFile(path.join(__dirname, 'scratch.html'));
});
scratch.js (run at the command line with node ./scratch.js):
const puppeteer = require('puppeteer')
const fs = require('fs');

const myCookies = [
  {name: 'customer', value: 'Frank'},
  {name: 'expiring', value: JSON.stringify([{a: 1, b: 'three'}])}
];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://localhost:1234/testing', { waitUntil: 'networkidle2' });
  await page.setCookie(...myCookies);
  const cookies = await page.cookies();
  const cookieJson = JSON.stringify(cookies);
  // Writes expected cookies to file for sanity check.
  fs.writeFileSync('scratch_cookies.json', cookieJson);
  // FIXME: Cookies never get appended to page.
  await page.screenshot({path: 'scratch_shot.png'});
  await browser.close();
})();
server/scratch.html:
<html>
  <body>
  </body>
  <script type='text/javascript'>
    document.write('Cookie: ' + document.cookie);
  </script>
</html>
The result is just a PNG with the word "Cookie:" on it. Any insight appreciated!
This is the actual route I'm using where makeExpiryLetter is utilizing puppeteer, but I can't seem to get it to actually read the customer and rows data.
app.get('/create-expiry-letter', async (req, res) => {
  // Create PDF file using puppeteer to render React page w/ data.
  // Store in Db.
  // Email file.
  // Send final count of letters sent back for notification in GUI.
  const cc = await dbo.getConsignmentCustomers();
  const result = await Promise.all(cc.rows.map(async x => {
    // Get 0-60 day consignments by customer_id;
    const { rows } = await dbo.getExpiry0to60(x.customer_id);
    if (rows && rows.length > 0) {
      const epiryLetter = await makeExpiryLetter(x, rows); // Uses puppeteer.
      // TODO: Store in Db / Email file.
      return true;
    } else {
      return false;
    }
  }));
  res.json({ emails_sent: result.filter(x => x === true).length });
});
Thanks to the samples from @ggorlen I've made huge headway in using cookies. In my inline script of expiry.html I'm grabbing the values by wrapping my render function in function main() and adding an onload to the body tag: <body onload='main()'>.
Inside the main function we can grab the values I needed:
const customer = JSON.parse(document.cookie.split('; ').find(row => row.startsWith('customer')).split('=')[1]);
const expiring = JSON.parse(document.cookie.split('; ').find(row => row.startsWith('expiring')).split('=')[1]);
FINALLY (and yes, of course this will all be used in an automated worker in the end) I can get my beautifully rendered PDF like so:
(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setCookie(...myCookies);
  await page.goto('http://localhost:1234/testing');
  await page.pdf({ path: `scratch-expiry-letter.pdf`, format: 'letter' });
  await browser.close();
})();
The problem is here:
await page.goto('http://localhost:1234/testing', { waitUntil: 'networkidle2' });
await page.setCookie(...myCookies);
The first line says, go to the page. Going to a page involves parsing the HTML and executing scripts, including your document.write('Cookie: ' + document.cookie); line in scratch.html, at which time there are no cookies on the page (assuming a clear browser cache).
After the page is loaded, await page.goto... returns and the line await page.setCookie(...myCookies); runs. This correctly sets your cookies and the remaining lines execute. const cookies = await page.cookies(); runs and pulls the newly-set cookies out and you write them to disk. await page.screenshot({path: 'scratch_shot.png'}); runs, taking a shot of the page without the DOM updated with the new cookies that were set after the initial document.write call.
You can fix this problem by turning your JS on the scratch.html page into a function that can be called after page load and cookies are set, or injecting such a function dynamically with Puppeteer using evaluate:
const puppeteer = require('puppeteer');

const myCookies = [
  {name: 'customer', value: 'Frank'},
  {name: 'expiring', value: JSON.stringify([{a: 1, b: 'three'}])}
];

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto('http://localhost:1234/testing');
  await page.setCookie(...myCookies);
  // now that the cookies are ready, we can write to the document
  await page.evaluate(() => document.write('Cookie' + document.cookie));
  await page.screenshot({path: 'scratch_shot.png'});
  await browser.close();
})();
A more general approach is to set the cookies before navigation. This way, the cookies will already exist when any scripts that might use them run.
const puppeteer = require('puppeteer');

const myCookies = [
  {
    name: 'expiring',
    value: '[{"a":1,"b":"three"}]',
    domain: 'localhost',
    path: '/',
    expires: -1,
    size: 29,
    httpOnly: false,
    secure: false,
    session: true,
    sameParty: false,
    sourceScheme: 'NonSecure',
    sourcePort: 80
  },
  {
    name: 'customer',
    value: 'Frank',
    domain: 'localhost',
    path: '/',
    expires: -1,
    size: 13,
    httpOnly: false,
    secure: false,
    session: true,
    sameParty: false,
    sourceScheme: 'NonSecure',
    sourcePort: 80
  }
];

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setCookie(...myCookies);
  await page.goto('http://localhost:1234/testing');
  await page.screenshot({path: 'scratch_shot.png'});
  await browser.close();
})();
That said, I'm not sure if cookies are the easiest or best way to do what you're trying to do. Since you're serving HTML, you could pass the data along with it statically, expose a separate API route to collect a customer's data which the front end can use, or pass GET parameters, depending on the nature of the data and what you're ultimately trying to accomplish.
You could even have a file upload form on the React app, then have Puppeteer upload the JSON data into the app programmatically through that form.
In fact, if your final goal is to dynamically generate a PDF, using React and Puppeteer might be overkill, but I'm not sure I have a better solution to offer without some research and additional context about your use case.
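As a quick illustration of the GET-parameter option mentioned above (the data parameter name and route are just placeholders), the Puppeteer side could simply encode the payload into the URL instead of cookies:

// Encode the payload into the query string rather than setting cookies.
const payload = encodeURIComponent(JSON.stringify({
  customer: 'Frank',
  expiring: [{a: 1, b: 'three'}],
}));
await page.goto(`http://localhost:1234/testing?data=${payload}`);

In the page's script, something like JSON.parse(new URLSearchParams(location.search).get('data')) would then recover it before rendering.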

How do I combine puppeteer plugins with puppeteer clusters?

I have a list of urls that need to be scraped from a website that uses React, for this reason I am using Puppeteer.
I do not want to be blocked by anti-bot servers, so I have added puppeteer-extra-plugin-stealth.
I want to prevent ads from loading on the pages, so I am blocking ads using puppeteer-extra-plugin-adblocker.
I also want to prevent my IP address from being blacklisted, so I have used TOR nodes to get different IP addresses.
Below is a simplified version of my code, and the setup works (TOR_port and webUrl are assigned dynamically in reality, but to simplify my question I have assigned them as variables).
There is a problem though:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

var TOR_port = 13931;
var webUrl = 'https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';

const browser = await puppeteer.launch({
  dumpio: false,
  headless: false,
  args: [
    `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
    `--no-sandbox`,
  ],
  ignoreHTTPSErrors: true,
});

try {
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 720 });
  await page.goto(webUrl, {
    waitUntil: 'load',
    timeout: 30000,
  });
  page
    .waitForSelector('.price')
    .then(async () => {
      console.log('The price is available');
      await browser.close();
    })
    .catch(() => {
      // close this since it is clearly not a zillow website
      throw new Error('This is not the zillow website');
    });
} catch (e) {
  await browser.close();
}
The above setup works but is very unreliable, and I recently learned about puppeteer-cluster. I need it to help me manage crawling multiple pages and to track my scraping tasks.
So, my question is how to implement puppeteer-cluster with the above setup. I am aware of an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) offered by the library that shows how you can implement plugins, but it is so bare that I didn't quite understand it.
How do I implement puppeteer-cluster with the above TOR, AdBlocker, and Stealth configurations?
You can just hand over your Puppeteer instance in the Cluster.launch options, like the following:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
const { Cluster } = require('puppeteer-cluster');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

const cluster = await Cluster.launch({
  puppeteer,
});
Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
You can just add the plugins with puppeteer.use()
You have to use puppeteer-extra.
const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");

(async () => {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(RecaptchaPlugin());
  // Do stuff
})();
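Putting it together with the configuration from the question, a rough sketch might look like this. The puppeteer and puppeteerOptions options of Cluster.launch come from the puppeteer-cluster README; the TOR proxy args and Zillow URL are carried over from the question:

const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const AdblockerPlugin = require("puppeteer-extra-plugin-adblocker");
const { Cluster } = require("puppeteer-cluster");

(async () => {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(AdblockerPlugin());

  const TOR_port = 13931;
  const cluster = await Cluster.launch({
    puppeteer, // the plugin-wrapped instance
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    puppeteerOptions: {
      headless: false,
      ignoreHTTPSErrors: true,
      args: [`--proxy-server=socks5://127.0.0.1:${TOR_port}`, "--no-sandbox"],
    },
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: "load", timeout: 30000 });
    await page.waitForSelector(".price");
    console.log("The price is available");
  });

  cluster.queue("https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/");

  await cluster.idle();
  await cluster.close();
})();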
