Infinite loop (on purpose) using puppeteer cluster - javascript

I am very new to puppeteer-cluster. My goal is to scrape a list of 100 sites indefinitely, so once I get to the 100th link, the script would start over again (ideally reusing the same cluster instance). Is there a better or more proper way to do this? I was thinking it might be easier to just have an infinite loop (and rotate elements) on purpose. Any advice would be appreciated.
Here's my code:
(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 20,
        monitor: true
    });

    // Extracts document.title of the crawled pages
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        const pageTitle = await page.evaluate(() => document.title);
        console.log(pageTitle);
    });

    // In case of problems, log them
    cluster.on('taskerror', (err, data) => {
        console.log(` Error crawling ${data}: ${err.message}`);
    });

    while (true) {
        await new Promise(resolve => setTimeout(crawl, 5000));
    }

    async function crawl() {
        for (let i = 0; i < sites.length; i++) {
            const site = sites[i];
            site["product_urls"].forEach(async (url) => {
                await cluster.execute(url);
            });
        }
        await cluster.idle();
    }
})();

for (;;) {}
Will give you an infinite loop without running into any issues from things like ESLint, or warnings about "unreachable code".
That said, it wouldn't hurt to add a fallback condition so that your code can end safely if needed.
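Applied to the cluster in the question, that idea might look something like the sketch below. This is only a rough sketch, not puppeteer-cluster's prescribed pattern: `sites` is assumed to be the same array as in the question, and the `keepRunning` flag (flipped on Ctrl+C here) is the hypothetical fallback exit condition.

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 20,
    });

    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        console.log(await page.title());
    });

    cluster.on('taskerror', (err, data) => {
        console.log(`Error crawling ${data}: ${err.message}`);
    });

    // Fallback exit condition (assumption): flip the flag on SIGINT so the
    // loop can finish its current pass and shut down cleanly.
    let keepRunning = true;
    process.on('SIGINT', () => { keepRunning = false; });

    while (keepRunning) {
        // `sites` is assumed to be the same list as in the question.
        for (const site of sites) {
            for (const url of site.product_urls) {
                cluster.queue(url); // fire-and-forget; failures surface via 'taskerror'
            }
        }
        await cluster.idle();                        // wait for this pass to drain
        await new Promise(r => setTimeout(r, 5000)); // small pause before the next pass
    }

    await cluster.close();
})();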

Related

Running two browser instances in parallel for same list of websites in Puppeteer

I wrote JavaScript code for a web crawler that scrapes data from a list of websites (in a CSV file) in a single browser instance (code below). Now I want to modify the code so that every website in the list runs in parallel in two browser instances at the same time. For example, a website www.a.com in the list should run in parallel in two browser instances, and the same goes for the rest of the websites. If anyone can help me, I would be very thankful.
(async () => {
    require("dotenv").config();
    if (!process.env.PROXY_SPKI_FINGERPRINT) {
        throw new Error("PROXY_SPKI_FINGERPRINT is not defined in environment.");
    }
    const fs = require("fs");
    const fsPromises = fs.promises;
    const pptr = require("puppeteer");
    const browser = await pptr.launch({
        args: [
            "--proxy-server=https://127.0.0.1:8000",
            "--ignore-certificate-errors-spki-list=" + process.env.PROXY_SPKI_FINGERPRINT,
            "--disable-web-security",
        ],
        // headless: false,
    });
    const sites = (await fsPromises.readFile(process.argv[2])) // sites list in csv file
        .toString()
        .split("\n")
        .map(line => line.split(",")[1])
        .filter(s => s);
    for (let i in sites) {
        const site = sites[i];
        console.log(`[${i}] ${site}`);
        try {
            await fsPromises.appendFile("data.txt", JSON.stringify(await crawl(browser, site)) + "\n");
        } catch (e) {
            console.error(e);
        }
    }
    await browser.close();

    async function crawl(browser, site) {
        const page = await browser.newPage();
        try {
            const grepResult = [];
            page.on("request", async request => {
                request.continue();
            });
            page.on("response", async response => {
                try {
                    if (response.request().resourceType() === "script" &&
                        response.headers()["content-type"] &&
                        response.headers()["content-type"].includes("javascript")) {
                        const js = await response.text();
                        const grepPartResult = grepMagicWords(js);
                        grepResult.push([response.request().url(), grepPartResult]);
                    }
                } catch (e) {}
            });
            await page.setRequestInterception(true);
            try {
                await page.goto("http://" + site, {waitUntil: "load", timeout: 60000});
                await new Promise(resolve => { setTimeout(resolve, 10000); });
            } catch (e) { console.error(e); }
            const [flows, url] = await Promise.race([
                page.evaluate(() => [J$.FLOWS, document.URL]),
                new Promise((_, reject) => { setTimeout(() => { reject(); }, 5000); })
            ]);
            return {url: url, grepResult: grepResult, flows: flows};
        } finally {
            await page.close();
        }

        function grepMagicWords(js) {
            var re = /(?:\'|\")(?:g|s)etItem(?:\'|\")/g, match, result = [];
            while (match = re.exec(js)) {
                result.push(js.substring(match.index - 100, match.index + 100));
            }
            return result;
        }
    }
})();
You can launch multiple browsers and run them in parallel. You would have to restructure your app slightly for that: create a wrapper for crawl which launches it with a new browser instance. I created crawlNewInstance, which does that for you. You would also need to run crawlNewInstance() in parallel.
Check out this code:
const sites = (await fsPromises.readFile(process.argv[2])) // sites list in csv file
    .toString()
    .split("\n")
    .map(line => line.split(",")[1])
    .filter(s => s);

const crawlerProms = sites.map(async (site, index) => {
    try {
        console.log(`[${index}] ${site}`);
        await fsPromises.appendFile("data.txt", JSON.stringify(await crawlNewInstance(site)) + "\n");
    } catch (e) {
        console.error(e);
    }
});

// await all the crawlers!
await Promise.all(crawlerProms)

async function crawlNewInstance(site) {
    const browser = await pptr.launch({
        args: [
            "--proxy-server=https://127.0.0.1:8000",
            "--ignore-certificate-errors-spki-list=" + process.env.PROXY_SPKI_FINGERPRINT,
            "--disable-web-security",
        ],
        // headless: false,
    });
    const result = await crawl(browser, site)
    await browser.close()
    return result
}
Optional
The above basically answers the question. But if you want to go further (I was on a roll and had nothing to do :)):
If you have plenty of pages which you want to crawl in parallel and, for example, want to limit the number of parallel requests, you could use a queue:
var { EventEmitter } = require('events')

class AsyncQueue extends EventEmitter {
    limit = 2
    enqueued = []
    running = 0

    constructor(limit) {
        super()
        this.limit = limit
    }

    isEmpty() {
        return this.enqueued.length === 0
    }

    // make sure to only pass `async` functions to this queue!
    enqueue(fn) {
        // add to queue
        this.enqueued.push(fn)
        // start a job. If max instances are already running it does nothing,
        // otherwise it runs a new job!
        this.next()
    }

    // if a job is done try starting a new one!
    done() {
        this.running--
        console.log('job done! remaining:', this.limit - this.running)
        this.next()
    }

    async next() {
        // emit once if the queue is empty.
        if (this.isEmpty()) {
            this.emit('empty')
            return
        }
        // if the limit is reached do nothing
        if (this.running >= this.limit) {
            console.log('queue full.. waiting!')
            return
        }
        this.running++
        console.log('running job! remaining slots:', this.limit - this.running)
        // first in, first out! so take the first element in the array.
        const job = this.enqueued.shift()
        try {
            await job()
        } catch (err) {
            console.log('Job failed! ', err)
            this.emit('error', err)
        }
        // job is done!
        // done() will start the next job if any are available!
        this.done()
    }
}
The queue could be utilised with this code:
// create queue
const limit = 3
const queue = new AsyncQueue(limit)

// listen for any errors..
queue.on('error', err => {
    console.error('error occurred in queue.', err)
})

for (let site of sites) {
    // enqueue all crawler jobs.
    // pass an async function which does whatever you want. In this case it crawls
    // a web page!
    queue.enqueue(async () => {
        await fsPromises.appendFile("data.txt", JSON.stringify(await crawlNewInstance(site)) + "\n");
    })
}

// helper for waiting for the queue!
const waitForQueue = async () => {
    if (queue.isEmpty()) return Promise.resolve()
    return new Promise((res, rej) => {
        queue.once('empty', res)
    })
}

await waitForQueue()
console.log('crawlers done!')
Even further with BrowserPool
It would also be possible to reuse your browser instances, so it would not be necessary to start a new browser instance for every crawling process. This can be done using this BrowserPool helper class:
var pptr = require('puppeteer')

async function launchPuppeteer() {
    return await pptr.launch({
        args: [
            "--proxy-server=https://127.0.0.1:8000",
            "--ignore-certificate-errors-spki-list=" + process.env.PROXY_SPKI_FINGERPRINT,
            "--disable-web-security",
        ],
        // headless: false,
    });
}

// manages browser connections.
// creates a pool on startup and allows getting references to
// the browsers!
class BrowserPool {
    browsers = []

    async get() {
        // return a browser if there is one!
        if (this.browsers.length > 0) {
            return this.browsers.splice(0, 1)[0]
        }
        // no browser available anymore..
        // launch a new one!
        return await launchPuppeteer()
    }

    // used for putting a browser back in the pool!
    handback(browser) {
        this.browsers.push(browser)
    }

    // shuts down all browsers!
    async shutDown() {
        for (let browser of this.browsers) {
            await browser.close()
        }
    }
}
You can then remove crawlNewInstance() and finally adjust the code to look like this:
const sites = (await fsPromises.readFile(process.argv[2])) // sites list in csv file
    .toString()
    .split("\n")
    .map(line => line.split(",")[1])
    .filter(s => s);

// create browser pool
const pool = new BrowserPool()

// create queue
const limit = 3
const queue = new AsyncQueue(limit)

// listen to errors:
queue.on('error', err => {
    console.error('error in the queue detected!', err)
})

// enqueue your jobs
for (let site of sites) {
    // enqueue an async function which takes a browser from the pool
    queue.enqueue(async () => {
        try {
            // get a browser and crawl a page!
            const browser = await pool.get()
            const result = await crawl(browser, site)
            await fsPromises.appendFile("data.txt", JSON.stringify(result) + "\n");
            // return the browser back to the pool so other crawlers can use it!
            pool.handback(browser)
        } catch (err) {
            console.error(err)
        }
    })
}

// helper for waiting for the queue!
const waitForQueue = async () => {
    // jobs may fail within a few milliseconds, so check first if it's already empty..
    if (queue.isEmpty()) return Promise.resolve()
    return new Promise((res, rej) => {
        queue.once('empty', res)
    })
}

// wait for the queue to finish :)
await waitForQueue()

// in the very end, shut down all browsers:
await pool.shutDown()
console.log('done!')
Have fun and feel free to leave a comment.

Puppeteer: scraping multiple urls one by one

I am trying to scrape multiple URLs one by one, then repeat the scrape after one minute.
But I keep getting two errors and was hoping for some help.
I got an error saying:
functions declared within loops referencing an outer scoped variable may lead to confusing semantics
And I get this error when I run the function / code:
TimeoutError: Navigation timeout of 30000 ms exceeded.
My code:
const puppeteer = require("puppeteer");

const urls = [
    'https://www.youtube.com/watch?v=cw9FIeHbdB8',
    'https://www.youtube.com/watch?v=imy1px59abE',
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
];

const scrape = async () => {
    let browser, page;
    try {
        browser = await puppeteer.launch({ headless: true });
        page = await browser.newPage();
        for (let i = 0; i < urls.length; i++) {
            const url = urls[i];
            await page.goto(`${url}`);
            await page.waitForNavigation({ waitUntil: 'networkidle2' });
            await page.waitForSelector('.view-count', { visible: true, timeout: 60000 });
            const data = await page.evaluate(() => { // "functions declared within loops referencing an outer scoped variable" reported on this line
                return [
                    JSON.stringify(document.querySelector('#text > a').innerText),
                    JSON.stringify(document.querySelector('#container > h1').innerText),
                    JSON.stringify(document.querySelector('.view-count').innerText),
                    JSON.stringify(document.querySelector('#owner-sub-count').innerText)
                ];
            });
            const [channel, title, views, subs] = [JSON.parse(data[0]), JSON.parse(data[1]), JSON.parse(data[2]), JSON.parse(data[3])];
            console.log({ channel, title, views, subs });
        }
    } catch (err) {
        console.log(err);
    } finally {
        if (browser) {
            await browser.close();
        }
        await setTimeout(scrape, 60000); // repeat one minute after all urls have been scraped.
    }
};

scrape();
I would really appreciate any help I could get.
I'd suggest a design like this:
const puppeteer = require("puppeteer");

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const scrapeTextSelectors = async (browser, url, textSelectors) => {
    let page;
    try {
        page = await browser.newPage();
        page.setDefaultNavigationTimeout(50 * 1000);
        page.goto(url);
        const dataPromises = textSelectors.map(async ({name, sel}) => {
            await page.waitForSelector(sel);
            return [name, await page.$eval(sel, e => e.innerText)];
        });
        return Object.fromEntries(await Promise.all(dataPromises));
    }
    finally {
        page?.close();
    }
};

const urls = [
    "https://www.youtube.com/watch?v=cw9FIeHbdB8",
    "https://www.youtube.com/watch?v=imy1px59abE",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];

const textSelectors = [
    {name: "channel", sel: "#text > a"},
    {name: "title", sel: "#container > h1"},
    {name: "views", sel: ".view-count"},
    {name: "subs", sel: "#owner-sub-count"},
];

let browser;
(async () => {
    browser = await puppeteer.launch({headless: true});
    for (;; await sleep(60 * 1000)) {
        const data = await Promise.allSettled(urls.map(url =>
            scrapeTextSelectors(browser, url, textSelectors)
        ));
        console.log(data);
    }
})()
    .catch(err => console.error(err))
    .finally(() => browser?.close());
A few remarks:
This runs in parallel on the 3 URLs using Promise.allSettled. If you have more URLs, you'll want a task queue, a cap on concurrency (see the batching sketch after this list), or to run synchronously over the URLs with a for...of loop so you don't outstrip the system's resources. See this answer for elaboration.
I use waitForSelector on each and every selector rather than just '.view-count' so you won't miss anything.
page.setDefaultNavigationTimeout(50 * 1000); gives you an adjustable 50-second timeout on navigation-related operations.
Moving the loops that sleep and step over the URLs into the caller gives cleaner, more flexible code. Generally, if a function can operate on a single element rather than a collection, it should.
Error handling is improved; Promise.allSettled lets the caller control what to do if any requests fail. You might want to filter and/or map the data response to remove the statuses: data.map(({value}) => value).
Generally, return instead of console.log data to keep functions flexible. The caller can console.log in the format they desire, if they desire.
There's no need to do anything special in page.goto(url) because we're awaiting selectors on the very next line. "networkidle2" just slows things down, waiting for network requests that might not impact the selectors we're interested in.
JSON.stringify/JSON.parse is already called by Puppeteer on the return value of evaluate so you can skip it in most cases.
Generally, don't do anything but cleanup in finally blocks. await setTimeout(scrape, 60000) is misplaced.
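For reference, a minimal batching sketch of the first remark, assuming the urls, textSelectors, scrapeTextSelectors and browser from the answer above; batchSize is a made-up knob for this sketch, not a Puppeteer option:

const scrapeInBatches = async (browser, urls, textSelectors, batchSize = 5) => {
    const results = [];
    for (let i = 0; i < urls.length; i += batchSize) {
        // only `batchSize` pages are open at any one time
        const batch = urls.slice(i, i + batchSize);
        const settled = await Promise.allSettled(
            batch.map(url => scrapeTextSelectors(browser, url, textSelectors))
        );
        results.push(...settled);
    }
    return results;
};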
This works. Putting the for loop inside a Promise and passing waitUntil: "networkidle2" as an option to page.goto() resolves your problem. You don't need to generate a new browser each time, so it should be declared outside of the for loop.
const puppeteer = require("puppeteer");

const urls = [
    "https://www.youtube.com/watch?v=cw9FIeHbdB8",
    "https://www.youtube.com/watch?v=imy1px59abE",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];

const scrape = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    new Promise(async (resolve, reject) => {
        for (const url of urls) {
            // your timeout
            await page.waitForTimeout(6 * 1000);
            await page.goto(`${url}`, {
                waitUntil: "networkidle2",
                timeout: 60 * 1000,
            });
            await page.waitForSelector(".view-count", {
                timeout: 60 * 1000,
            });
            const data = await page.evaluate(() => {
                return [
                    JSON.stringify(document.querySelector("#text > a").innerText),
                    JSON.stringify(document.querySelector("#container > h1").innerText),
                    JSON.stringify(document.querySelector(".view-count").innerText),
                    JSON.stringify(document.querySelector("#owner-sub-count").innerText),
                ];
            });
            const [channel, title, views, subs] = [
                JSON.parse(data[0]),
                JSON.parse(data[1]),
                JSON.parse(data[2]),
                JSON.parse(data[3]),
            ];
            console.log({ channel, title, views, subs });
        }
        resolve(true);
    })
        .then(async () => {
            await browser.close();
        })
        .catch((reason) => {
            console.log(reason);
        });
};

scrape();
Update
As per ggorlen's suggestion, the refactored code below should serve your problem. The comments indicate the purpose of each line.
const puppeteer = require("puppeteer");

const urls = [
    "https://www.youtube.com/watch?v=cw9FIeHbdB8",
    "https://www.youtube.com/watch?v=imy1px59abE",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];

const scrape = async () => {
    // generate a headless browser instance
    const browser = await puppeteer.launch({ headless: true });
    // use .entries to get the index and value
    for (const [index, url] of urls.entries()) {
        // generate a new page for each url
        const page = await browser.newPage();
        // your 60-second delay from the 2nd index onwards
        if (index > 0) await page.waitForTimeout(60 * 1000);
        // wait for the page response, with a 60-second timeout (throws on expiry)
        await page.goto(`${url}`, {
            waitUntil: "networkidle2",
            timeout: 60 * 1000,
        });
        // wait for the .view-count section to be available
        await page.waitForSelector(".view-count");
        // no need for JSON.stringify or JSON.parse, puppeteer does that for you
        await page.evaluate(() =>
            ({
                channel: document.querySelector("#text > a").innerText,
                title: document.querySelector("#container > h1").innerText,
                views: document.querySelector(".view-count").innerText,
                subs: document.querySelector("#owner-sub-count").innerText
            })
        ).then(data => {
            // your scraped data on success
            console.log('response', data);
        }).catch(reason => {
            // the reason scraping failed
            console.log('error', reason);
        }).finally(async () => {
            // close the current page
            await page.close();
        })
    }
    // after looping through, finally close the browser
    await browser.close();
};

scrape();

How to reload and wait for an element to appear?

I tried searching for this answer but there doesn't seem to be one on the Internet. What I want to do is use Node.js to reload a page until it finds the element with the query I want. I will be using puppeteer for other parts of the program, if that helps.
Ok, I used functions from both answers and came up with this, probably unoptimized, code:
const puppeteer = require("puppeteer");

(async () => {
    try {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto("http://127.0.0.1:5500/main.html");
        await page.waitForSelector("#buy-button");
        console.log("worked");
    } catch (err) {
        console.log(`ERROR: ${err}`);
    }
})();
But what I don't know how to do is reload the page, and keep reloading until the id I want is there. For example, keep reloading YouTube until the video you want is there (an impractical example, but I think it gets the point across).
Here's how I solved waiting for an element in puppeteer and reloading the page if it wasn't found:
async waitForSelectorWithReload(selector: string) {
    const MAX_TRIES = 5;
    let tries = 0;
    while (tries <= MAX_TRIES) {
        try {
            const element = await this.page.waitForSelector(selector, {
                timeout: 5000,
            });
            return element;
        } catch (error) {
            if (tries === MAX_TRIES) throw error;
            tries += 1;
            void this.page.reload();
            await this.page.waitForNavigation({ waitUntil: 'networkidle0' });
        }
    }
}
And it can be used as:
await waitForSelectorWithReload("input#name")
You can use waitUntil: "networkidle2" to make sure the page is done loading. Obviously change the URL, unless you are actually using evil.com.
const puppeteer = require("puppeteer"); // include library

(async () => {
    const browser = await puppeteer.launch(); // run browser
    const page = await browser.newPage(); // create new tab
    await page.goto(
        `http://www.evil.com`,
        {
            waitUntil: "networkidle2",
        }
    );
    // do your stuff here
    await browser.close();
})();
const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    // wait for the selector to appear before closing the browser,
    // otherwise the browser is gone before the element can ever show up
    await page.waitForSelector('#myId');
    console.log('got it');
    await browser.close();
});

Puppeteer browser and page instant close, page.evaluate not working on bloc 2

I have this script; everything works until the start of block 2.
I do not understand why it does not do the work in block 2. I should see output from page.on('request'), but that is not the case; it exits directly. Would you have an idea of the problem?
Node returns no error to me.
Thanks
async function main() {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    await page.setViewport({width: 1200, height: 720})
    await page.goto('https://site.local', { waitUntil: 'networkidle0' }); // wait until page load
    await page.type("input[name='UserName']", "myusername");
    await page.type("input[name='Password']", "mypassworduser");
    // click and wait for navigation
    await Promise.all([
        page.click("body > main > button"),
        page.waitForNavigation({ waitUntil: 'networkidle0' }),
    ]);
    await page.goto(urlformation);
    await page.setRequestInterception(true);
    await page.on('request', (request) => {
        if (request.resourceType() === 'media') {
            var CurrentRequest = request.url();
            console.log(CurrentRequest);
            fs.appendFileSync(fichiernurlaudio, request.url()+"\r\n");
        }
        request.continue();
    });

    //START BLOCK 1 ------------------IT WORKS
    const Titresaudios = await page.evaluate(() => {
        let names = document.querySelectorAll(
            "td.cursor.audio"
        );
        let arr = Array.prototype.slice.call(names);
        let text_arr = [];
        for (let i = 0; i < arr.length; i += 1) {
            text_arr.push($vartraited+"\r\n");
        }
        return text_arr;
    })
    fs.appendFileSync(fichiernomaudio, Titresaudios);
    //END BLOCK 1------------------IT WORKS - I got data in my file

    //START BLOCK 2-------seems to be ignored-----------NOT WORKING
    await page.evaluate(() => {
        let elements = document.querySelectorAll("td.cursor.audio");
        elements.forEach((element, index) => {
            setTimeout(() => {
                element.click();
            }, index * 1000);
        })
    })
    //END BLOCK 2---------seems to be ignored---------NOT WORKING
    //I should see some console.log from page.on('request', (request) => { ... } but it closes instantly after block 1 finishes

    await page.close();
    await browser.close();
}

main();
I have no clue what exactly you are trying to achieve there, but that block could be rewritten like this:
// ...
const els = await page.$$( 'td.cursor.audio' );
for( const el of els ) {
    // basically your timeout, but from outside the browser
    await page.waitFor( 1000 );
    // perform the action
    await el.click();
}
// ...
In your script the only thing you did in the evaluate() call was to schedule a few timeout-actions. As soon as those were scheduled (but not executed!) the callback of evaluate() exits and your script proceeds with closing the browser. So likely your clicks were never executed.
In my experience it is usually advisable to do as much as you can in NodeJS and not within the browser. Usually also makes for easier debugging.
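If the point of block 2 was to trigger the media requests that page.on('request') logs, a rough variation of the loop above could pair each click with page.waitForRequest. This is only a sketch and assumes each click fires exactly one 'media' request; fichiernurlaudio is the file variable from the question:

const els = await page.$$('td.cursor.audio');
for (const el of els) {
    // start waiting for the media request before clicking, so it isn't missed
    const [mediaRequest] = await Promise.all([
        page.waitForRequest(req => req.resourceType() === 'media', { timeout: 15000 }),
        el.click(),
    ]);
    fs.appendFileSync(fichiernurlaudio, mediaRequest.url() + "\r\n");
}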

Async puppeteer browser disconnect after few iterations

I tested iterations with puppeteer in a small case. I have already read that the common reason for puppeteer disconnections is that the Node script doesn't wait for the puppeteer actions to finish, so I converted all functions in my snippet into async functions, but it didn't help.
If the small case with six iterations works, I will implement it in my current project with around 50 iterations.
'use strict';

const puppeteer = require('puppeteer');

const arrIDs = [8322072, 1016816, 9312604, 1727088, 9312599, 8477729];

const call = async () => {
    await puppeteer.launch().then(async (browser) => {
        arrIDs.forEach(async (id, index, arr) => {
            await browser.newPage().then(async (page) => {
                await page.goto(`http://somelink.com/${id}`).then(async () => {
                    await page.$eval('div.info > table > tbody', async (heading) => {
                        return heading.innerText;
                    }).then(async (result) => {
                        await browser.close();
                        console.log(result);
                    });
                });
            });
        });
    });
};

call();
forEach executes its callbacks without awaiting them. Replace forEach with a simple for loop:
const arrIDs = [8322072, 1016816, 9312604, 1727088, 9312599, 8477729];

const page = await browser.newPage();

for (let id of arrIDs) {
    await page.goto(`http://somelink.com/${id}`);
    let result = await page.$eval('div.info > table > tbody', heading => heading.innerText).catch(e => void e);
    console.log(result);
}

await browser.close()
The way you've formatted and nested everything looks like some incarnation of callback hell.
Here's my suggestion. It's not fully working as-is, but the structure is going to work better for async/await:
const puppeteer = require("puppeteer");

const chromium_path_706915 = "706915/chrome.exe";

async function Run() {
    // arrIDs is assumed to be the same array as in the question
    for (const id of arrIDs) {
        await Navigate(`http://somelink.com/${id}`);
    }

    async function Navigate(url) {
        const browser = await puppeteer.launch({
            executablePath: chromium_path_706915,
            args: ["--auto-open-devtools-for-tabs"],
            headless: false
        });
        const page = await browser.newPage();
        const response = await page.goto(url);
        const result = await page.$$eval("div.info > table > tbody", rows =>
            rows.map(ele2 => ({
                etd: ele2.innerText.trim()
            }))
        );
        await browser.close();
        console.log(result);
    }
}

Run();
On top of the other answers, I want to point out that async callbacks and forEach loops don't exactly play together as expected. One possible solution is a custom implementation that supports this:
Utility function:
async function asyncForEach(array: Array<any>, callback: any) {
    for (let index = 0; index < array.length; index++) {
        await callback(array[index], index, array);
    }
}
Example usage:
// `waitFor` is a simple delay helper, assumed to look something like this:
const waitFor = ms => new Promise(resolve => setTimeout(resolve, ms));

const start = async () => {
    await asyncForEach([1, 2, 3], async (num) => {
        await waitFor(50);
        console.log(num);
    });
    console.log('Done');
}

start();
Going through this article by Sebastien Chopin can help make it a bit more clear as to why async/await and forEach act unexpectedly. Here it is as a gist.
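For a quick illustration of the underlying issue (a standalone sketch, not taken from the article): forEach fires all of its async callbacks immediately and does not wait for them, whereas a plain for...of loop awaits each step.

const wait = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
    // forEach: all three callbacks start at once, and this log prints
    // before any of the numbers do.
    [1, 2, 3].forEach(async n => { await wait(50); console.log('forEach', n); });
    console.log('forEach "finished" immediately');

    // for...of: each iteration is awaited, so the numbers print in order
    // before the final log.
    for (const n of [1, 2, 3]) {
        await wait(50);
        console.log('for...of', n);
    }
    console.log('for...of finished after all awaits');
})();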
