I tested iterations with Puppeteer in a small test case. I have already read that a common reason for Puppeteer disconnections is that the Node script doesn't wait for the Puppeteer actions to finish, so I converted all functions in my snippet into async functions, but it didn't help.
If the small case with six iterations works, I will implement it in my current project with around 50 iterations.
'use strict';

const puppeteer = require('puppeteer');

const arrIDs = [8322072, 1016816, 9312604, 1727088, 9312599, 8477729];

const call = async () => {
    await puppeteer.launch().then(async (browser) => {
        arrIDs.forEach(async (id, index, arr) => {
            await browser.newPage().then(async (page) => {
                await page.goto(`http://somelink.com/${id}`).then(async () => {
                    await page.$eval('div.info > table > tbody', async (heading) => {
                        return heading.innerText;
                    }).then(async (result) => {
                        await browser.close();
                        console.log(result);
                    });
                });
            });
        });
    });
};

call();
forEach invokes its callbacks without awaiting them, so the script moves on before the Puppeteer work finishes. Replace forEach with a simple for loop:
const arrIDs = [8322072, 1016816, 9312604, 1727088, 9312599, 8477729];
const page = await browser.newPage();

for (let id of arrIDs) {
    await page.goto(`http://somelink.com/${id}`);
    let result = await page.$eval('div.info > table > tbody', heading => heading.innerText).catch(e => void e);
    console.log(result);
}

await browser.close();
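The difference is easy to see without Puppeteer at all. Here's a minimal, hedged sketch (plain Node, hypothetical `sleep` helper) showing that `forEach` fires all async callbacks at once and returns immediately, while `for...of` awaits each iteration:

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
const order = [];

// forEach fires all callbacks and returns without waiting for them
async function runForEach() {
    [1, 2].forEach(async n => {
        await sleep(n * 20);
        order.push(`forEach:${n}`);
    });
    order.push('forEach loop done'); // runs before any callback finishes
}

// for...of awaits every iteration before moving on
async function runForOf() {
    for (const n of [1, 2]) {
        await sleep(n * 20);
        order.push(`forOf:${n}`);
    }
    order.push('forOf loop done');
}

(async () => {
    await runForEach(); // returns immediately; its pushes land later
    await sleep(100);   // give the stray forEach callbacks time to finish
    await runForOf();
    console.log(order.join(', '));
    // → forEach loop done, forEach:1, forEach:2, forOf:1, forOf:2, forOf loop done
})();
```

Note how `'forEach loop done'` is recorded first, which is exactly why `browser.close()` ran before the pages were scraped in the original snippet.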
The way you've formatted and nested everything looks like some incarnation of callback hell.
Here's my suggestion. It isn't working yet, but the structure is going to work better with async/await:
const puppeteer = require("puppeteer");

const chromium_path_706915 = "706915/chrome.exe";
const arrIDs = [8322072, 1016816, 9312604, 1727088, 9312599, 8477729];

async function run() {
    for (const id of arrIDs) {
        await navigate(`http://somelink.com/${id}`);
    }
}

async function navigate(url) {
    const browser = await puppeteer.launch({
        executablePath: chromium_path_706915,
        args: ["--auto-open-devtools-for-tabs"],
        headless: false
    });
    const page = await browser.newPage();
    await page.goto(url);
    const result = await page.$$eval("div.info > table > tbody", eles =>
        eles.map(ele2 => ({
            etd: ele2.innerText.trim()
        }))
    );
    await browser.close();
    console.log(result);
}

run();
On top of the other answers, I want to point out that async and forEach loops don't exactly play as expected. One possible solution is having a custom implementation that supports this:
Utility function:
async function asyncForEach(array: Array<any>, callback: any) {
    for (let index = 0; index < array.length; index++) {
        await callback(array[index], index, array);
    }
}
Example usage:
const waitFor = (ms) => new Promise(resolve => setTimeout(resolve, ms)); // simple delay helper

const start = async () => {
    await asyncForEach([1, 2, 3], async (num) => {
        await waitFor(50);
        console.log(num);
    });
    console.log('Done');
}

start();
Going through this article by Sebastien Chopin can help make it a bit more clear as to why async/await and forEach act unexpectedly. Here it is as a gist.
Related
I am trying to scrape multiple URLs one by one, then repeat the scrape one minute later.
But I keep getting two errors and was hoping for some help.
I got an error saying:
functions declared within loops referencing an outer scoped variable may lead to confusing semantics
And I get this error when I run the function / code:
TimeoutError: Navigation timeout of 30000 ms exceeded.
My code:
const puppeteer = require("puppeteer");

const urls = [
    'https://www.youtube.com/watch?v=cw9FIeHbdB8',
    'https://www.youtube.com/watch?v=imy1px59abE',
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
];

const scrape = async () => {
    let browser, page;
    try {
        browser = await puppeteer.launch({ headless: true });
        page = await browser.newPage();
        for (let i = 0; i < urls.length; i++) {
            const url = urls[i];
            await page.goto(`${url}`);
            await page.waitForNavigation({ waitUntil: 'networkidle2' });
            await page.waitForSelector('.view-count', { visible: true, timeout: 60000 });
            const data = await page.evaluate(() => { // "functions declared within loops referencing an outer scoped variable" is reported on this line
                return [
                    JSON.stringify(document.querySelector('#text > a').innerText),
                    JSON.stringify(document.querySelector('#container > h1').innerText),
                    JSON.stringify(document.querySelector('.view-count').innerText),
                    JSON.stringify(document.querySelector('#owner-sub-count').innerText)
                ];
            });
            const [channel, title, views, subs] = [JSON.parse(data[0]), JSON.parse(data[1]), JSON.parse(data[2]), JSON.parse(data[3])];
            console.log({ channel, title, views, subs });
        }
    } catch (err) {
        console.log(err);
    } finally {
        if (browser) {
            await browser.close();
        }
        await setTimeout(scrape, 60000); // repeat one minute after all urls have been scraped
    }
};

scrape();
I would really appreciate any help I could get.
I'd suggest a design like this:
const puppeteer = require("puppeteer");

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const scrapeTextSelectors = async (browser, url, textSelectors) => {
    let page;
    try {
        page = await browser.newPage();
        page.setDefaultNavigationTimeout(50 * 1000);
        page.goto(url);
        const dataPromises = textSelectors.map(async ({name, sel}) => {
            await page.waitForSelector(sel);
            return [name, await page.$eval(sel, e => e.innerText)];
        });
        return Object.fromEntries(await Promise.all(dataPromises));
    }
    finally {
        page?.close();
    }
};
const urls = [
    "https://www.youtube.com/watch?v=cw9FIeHbdB8",
    "https://www.youtube.com/watch?v=imy1px59abE",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];

const textSelectors = [
    {name: "channel", sel: "#text > a"},
    {name: "title", sel: "#container > h1"},
    {name: "views", sel: ".view-count"},
    {name: "subs", sel: "#owner-sub-count"},
];

let browser;
(async () => {
    browser = await puppeteer.launch({headless: true});
    for (;; await sleep(60 * 1000)) {
        const data = await Promise.allSettled(urls.map(url =>
            scrapeTextSelectors(browser, url, textSelectors)
        ));
        console.log(data);
    }
})()
    .catch(err => console.error(err))
    .finally(() => browser?.close());
A few remarks:
- This runs in parallel on the 3 URLs using Promise.allSettled. If you have more URLs, you'll want a task queue, or run synchronously over the URLs with a for .. of loop so you don't outstrip the system's resources. See this answer for elaboration.
- I use waitForSelector on each and every selector rather than just '.view-count' so you won't miss anything.
- page.setDefaultNavigationTimeout(50 * 1000); gives you an adjustable 50-second timeout on navigation operations.
- Moving the loops that sleep and step over the URLs into the caller gives cleaner, more flexible code. Generally, if a function can operate on a single element rather than a collection, it should.
- Error handling is improved; Promise.allSettled lets the caller control what to do if any requests fail. You might want to filter and/or map the data response to remove the statuses: data.map(({value}) => value).
- Generally, return data instead of console.logging it, to keep functions flexible. The caller can console.log in the format they desire, if they desire.
- There's no need to do anything special in page.goto(url) because we're awaiting selectors on the very next line. "networkidle2" just slows things down, waiting for network requests that might not impact the selectors we're interested in.
- JSON.stringify/JSON.parse is already called by Puppeteer on the return value of evaluate, so you can skip it in most cases.
- Generally, don't do anything but cleanup in finally blocks; await setTimeout(scrape, 60000) is misplaced.
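To make the Promise.allSettled unwrapping concrete, here's a hedged, Puppeteer-free sketch: fakeScrape is a hypothetical stand-in for scrapeTextSelectors in which one URL fails, showing how the caller can separate successful values from failures:

```javascript
// hypothetical stand-in for scrapeTextSelectors: one url fails, the rest succeed
const fakeScrape = url =>
    url.includes('bad')
        ? Promise.reject(new Error(`failed: ${url}`))
        : Promise.resolve({url, title: 'ok'});

const urls = ['https://example.com/a', 'https://example.com/bad', 'https://example.com/b'];

async function collect() {
    const settled = await Promise.allSettled(urls.map(fakeScrape));
    return {
        // drop the {status, value} wrappers, keeping only successful results
        data: settled.filter(s => s.status === 'fulfilled').map(s => s.value),
        // keep the failures around for logging
        errors: settled.filter(s => s.status === 'rejected').map(s => s.reason.message),
    };
}

collect().then(({data, errors}) => console.log(data.length, errors));
// → 2 [ 'failed: https://example.com/bad' ]
```

Because one rejection doesn't reject the whole batch, a single broken page no longer aborts the scrape of the others.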
This works. Wrapping the for loop in a Promise and passing waitUntil: "networkidle2" as an option to page.goto() resolves your problem. You don't need to launch a new browser each time, so it should be declared outside the for loop.
const puppeteer = require("puppeteer");

const urls = [
    "https://www.youtube.com/watch?v=cw9FIeHbdB8",
    "https://www.youtube.com/watch?v=imy1px59abE",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];

const scrape = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    new Promise(async (resolve, reject) => {
        for (const url of urls) {
            // your timeout
            await page.waitForTimeout(6 * 1000);
            await page.goto(`${url}`, {
                waitUntil: "networkidle2",
                timeout: 60 * 1000,
            });
            await page.waitForSelector(".view-count", {
                timeout: 60 * 1000,
            });
            const data = await page.evaluate(() => {
                return [
                    JSON.stringify(document.querySelector("#text > a").innerText),
                    JSON.stringify(document.querySelector("#container > h1").innerText),
                    JSON.stringify(document.querySelector(".view-count").innerText),
                    JSON.stringify(document.querySelector("#owner-sub-count").innerText),
                ];
            });
            const [channel, title, views, subs] = [
                JSON.parse(data[0]),
                JSON.parse(data[1]),
                JSON.parse(data[2]),
                JSON.parse(data[3]),
            ];
            console.log({ channel, title, views, subs });
        }
        resolve(true);
    })
        .then(async () => {
            await browser.close();
        })
        .catch((reason) => {
            console.log(reason);
        });
};

scrape();
scrape();
Update
As per ggorlen's suggestion, the refactored code below should solve your problem. The comments indicate the purpose of each line.
const puppeteer = require("puppeteer");

const urls = [
    "https://www.youtube.com/watch?v=cw9FIeHbdB8",
    "https://www.youtube.com/watch?v=imy1px59abE",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
];

const scrape = async () => {
    // generate a headless browser instance
    const browser = await puppeteer.launch({ headless: true });
    // use .entries to get both the index and the value
    for (const [index, url] of urls.entries()) {
        // generate a new page for each url
        const page = await browser.newPage();
        // your 60-second timeout, from the 2nd url onward
        if (index > 0) await page.waitForTimeout(60 * 1000);
        // wait for the page response to be available, with a 60-second timeout (error thrown)
        await page.goto(`${url}`, {
            waitUntil: "networkidle2",
            timeout: 60 * 1000,
        });
        // wait for the .view-count section to be available
        await page.waitForSelector(".view-count");
        // no need for JSON.stringify or JSON.parse, as Puppeteer does that for you
        await page.evaluate(() => ({
            channel: document.querySelector("#text > a").innerText,
            title: document.querySelector("#container > h1").innerText,
            views: document.querySelector(".view-count").innerText,
            subs: document.querySelector("#owner-sub-count").innerText
        })).then(data => {
            // your successfully scraped data
            console.log('response', data);
        }).catch(reason => {
            // your scraping error reason
            console.log('error', reason);
        }).finally(async () => {
            // close the current page
            await page.close();
        })
    }
    // after looping through, finally close the browser
    await browser.close();
};

scrape();
Let's say we have an async generator:
exports.asyncGen = async function* (items) {
    for (const item of items) {
        const result = await someAsyncFunc(item)
        yield result;
    }
}
is it possible to map over this generator? Essentially I want to do this:
const { asyncGen } = require('./asyncGen.js')

exports.process = async function (items) {
    return asyncGen(items).map(item => {
        //... do something
    })
}
As of now, .map fails to recognize the async iterator.
The alternative is to use for await ... of, but that's nowhere near as elegant as .map.
The iterator methods proposal that would provide this method is still at stage 2 only. You can use some polyfill, or write your own map helper function though:
async function* map(asyncIterable, callback) {
    let i = 0;
    for await (const val of asyncIterable)
        yield callback(val, i++);
}

exports.process = function(items) {
    return map(asyncGen(items), item => {
        //... do something
    });
};
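A quick self-contained demo of that map helper, with a toy async generator (a hypothetical stand-in for asyncGen) and a small collect helper to drain the result into an array:

```javascript
// the map helper from the answer above
async function* map(asyncIterable, callback) {
    let i = 0;
    for await (const val of asyncIterable)
        yield callback(val, i++);
}

// a toy async generator standing in for asyncGen
async function* numbers() {
    for (const n of [1, 2, 3]) {
        await new Promise(resolve => setTimeout(resolve, 10));
        yield n;
    }
}

// drain any async iterable into an array
async function collect(iter) {
    const out = [];
    for await (const v of iter) out.push(v);
    return out;
}

collect(map(numbers(), x => x * 2)).then(res => console.log(res)); // → [ 2, 4, 6 ]
```

Since map returns another async generator, values are still produced lazily; nothing runs until the result is iterated.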
The alternative is to use for await ... of, but that's nowhere near as elegant as .map.
For an elegant and efficient solution, here's one using iter-ops library:
import {pipe, map} from 'iter-ops';
const i = pipe(
    asyncGen(), // your async generator result
    map(value => value /* map logic here */)
); //=> AsyncIterable
It is elegant, because the syntax is clean, simple, and applicable to any iterable or iterators, not just asynchronous generators.
It is more flexible and reusable, as you can add lots of other operators to the same pipeline.
Since it produces a standard JavaScript AsyncIterable, you can do:
for await (const a of i) {
    console.log(a); //=> print values
}
P.S. I'm the author of iter-ops.
TL;DR - If the mapping function is async:
To make asyncIter not wait for each mapping before producing the next value, do
async function asyncIterMap(asyncIter, asyncFunc) {
    const promises = [];
    for await (const value of asyncIter) {
        promises.push(asyncFunc(value))
    }
    return await Promise.all(promises)
}
// example - how to use:
const results = await asyncIterMap(myAsyncIter(), async (str) => {
    await sleep(3000)
    return str.toUpperCase()
});
More Demoing:
// dummy asyncIter for demonstration
const sleep = (ms) => new Promise(res => setTimeout(res, ms))

async function* myAsyncIter() {
    await sleep(1000)
    yield 'first thing'
    await sleep(1000)
    yield 'second thing'
    await sleep(1000)
    yield 'third thing'
}
Then
// THIS IS BAD! our asyncIter waits for each mapping.
for await (const thing of myAsyncIter()) {
    console.log('starting with', thing)
    await sleep(3000)
    console.log('finished with', thing)
}
// total run time: ~12 seconds
Better version:
// this is better.
const promises = [];

for await (const thing of myAsyncIter()) {
    const task = async () => {
        console.log('starting with', thing)
        await sleep(3000)
        console.log('finished with', thing)
    };
    promises.push(task())
}

await Promise.all(promises)
// total run time: ~6 seconds
const puppeteer = require('puppeteer');

const init = async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // login
    let login = async () => {
        console.log('login init');
        await page.goto(HOME_PAGE);
        await page.type($clientID, CLIENT_ID);
        await page.type($userName, USER_NAME);
        await page.type($password, PASSWORD);
        await page.click($submitBtn);
        await page.waitFor(WAIT_SEC);
        await page.goto(SCHEDULE_PAGE);
        console.log('login end');
    }

    // look for schedule
    let setStartDate = async () => {
        console.log('start init');
        await page.waitFor(3000);
        await page.click('#selfsched_startDate_dtInput', { clickCount: 3 });
        await page.keyboard.press('Backspace');
        await page.type($startDate, START_DATE);
        console.log('start end');
    }

    let setEndDate = async () => {
        console.log('end init');
        await page.click($endDate, { clickCount: 3 });
        await page.keyboard.press('Backspace');
        await page.type($endDate, END_DATE);
        await page.keyboard.press('Enter');
        console.log('end end');
    }

    let confirmSchedule = async () => {
        console.log('confirm init');
        await page.waitFor(WAIT_SEC);
        await page.click($confirmBtn);
        console.log('confirm end');
    }

    let steps = [
        login(),
        setStartDate(),
        setEndDate(),
        confirmSchedule()
    ];

    await Promise.all(steps);
    console.log('im finishing');
    browser.close();
}

init()
    .then(values => {
        console.log('success');
    })
    .catch(err => {
    });
Whenever my code gets to the setStartDate function, nothing happens. I've added console.log messages, but they're not coming in the sequential order I thought they would. I thought Promise.all() waits for everything in order... also, my knowledge of async / promises / await is not the greatest :) Thanks for the help.
Order of console logs I'm getting:
login init
start init
end init
confirm init
login end
I thought Promise.all() waits for everything in order
This is basically the opposite of what Promise.all does:
There is no implied ordering in the execution of the array of Promises given. On some computers, they may be executed in parallel, or in some sense concurrently, while on others they may be executed serially. For this reason, there must be no dependency in any Promise on the order of execution of the Promises.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/all
You should just await your functions in order:
await login()
await setStartDate()
await setEndDate()
await confirmSchedule()
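The contrast is easy to demonstrate without Puppeteer. A minimal sketch (hypothetical two-phase steps standing in for login, setStartDate, etc.): with Promise.all every step has already started before any finishes, while sequential awaits run one step to completion before the next begins:

```javascript
const log = [];
// factory for a named two-phase async step (hypothetical stand-in for login etc.)
const step = name => async () => {
    log.push(`${name} start`);
    await new Promise(resolve => setTimeout(resolve, 20));
    log.push(`${name} end`);
};
const [a, b] = [step('a'), step('b')];

(async () => {
    // Promise.all: both steps start before either finishes
    await Promise.all([a(), b()]);
    // sequential awaits: a fully finishes before b starts
    await a();
    await b();
    console.log(log.join(' | '));
    // → a start | b start | a end | b end | a start | a end | b start | b end
})();
```

This interleaving is exactly the "login init / start init / end init / confirm init" pattern in the question's output: Promise.all kicked off all four functions at once.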
I have two files that are arrays, and I want to load them from a fetch. I have an async function that fetches the files:
async function getData(file) {
    const data = await fetch(`./assets/resources/${file}.json`);
    return await data.json()
}
Then here is where I assign the variables to the return of this fetch:
let notes = getData("notes").then(res => res)
let pentagrama = getData("pentagrama").then(res => res)
But with this, all I get is a pending Promise (screenshot from the Google Chrome console).
How can I actually get the value?
The result of getData is always a Promise that resolves to your data. To access the values, you can use async/await:
(async () => {
    let notes = await getData("notes");
    let pentagrama = await getData("pentagrama");
    // use them here
})();
Alternatively, you can use Promise.all to wait for both promises to resolve, and then access the received data:
let notesP = getData("notes");
let pentagramaP = getData("pentagrama");

Promise.all([notesP, pentagramaP]).then(res => {
    let notes = res[0];
    let pentagrama = res[1];
    // use them here
});
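The Promise.all version gets even shorter with array destructuring. A hedged sketch, with getData stubbed out so it runs without the JSON files:

```javascript
// stub standing in for the fetch-based getData, so this runs without the JSON files
const getData = async file => ({file, loaded: true});

(async () => {
    // destructure both resolved values in one await
    const [notes, pentagrama] = await Promise.all([
        getData("notes"),
        getData("pentagrama"),
    ]);
    console.log(notes.file, pentagrama.file); // → notes pentagrama
})();
```

Both requests start immediately and run concurrently, whereas two back-to-back awaits would run them one after the other.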
This will work for you if you just want to check the response in your Google Chrome console, because the DevTools console lets you use await without an async function (it supports top-level await).
ONLY WORKS IN CONSOLE:
const getData = (file) => (
    fetch(`./assets/resources/${file}.json`).then(data => data.json())
)
let notes = await getData("notes")
let pentagrama = await getData("pentagrama")
But if you want to get this working in an application, remember that you ALWAYS need to wrap an await inside an async function:
TO GET IT WORKING IN AN APPLICATION:
const getData = async (file) => (
    await fetch(`./assets/resources/${file}.json`).then(data => data.json())
)

const wrapperFunc = async () => {
    let notes = await getData("notes")
    let pentagrama = await getData("pentagrama")
}
I'm currently learning how to use ES8's fetch, async and await. I currently have this code that works:
const url = "https://api.icndb.com/jokes/random";
async function tellJoke() {
    let data = await (await fetch(url)).json();
    return data.value.joke;
}
tellJoke().then(data => console.log(data));
Console:
"Chuck Norris can dereference NULL."
but I found a snippet using an arrow function; the problem is that I don't know how to return my value the way I do in my current example.
SNIPPET:
const fetchAsync = async () =>
    await (await fetch(url)).json()
If this is not a best practice let me know, also any further reading is welcomed.
You can again use the same approach that you used to shorten the usual
async function tellJoke() {
    let response = await fetch(url);
    let data = await response.json();
    return data.value.joke;
}
to your implementation. As a one-liner it can look like this:
const tellJoke = async () => (await (await fetch(url)).json()).value.joke;
Arrow functions work the same way here. If your arrow function has no block body (no {}), it implicitly returns the result of its expression — in this case, the result of (await (await fetch(url)).json()).value.joke:
const fetchAsync = async () => (await (await fetch(url)).json()).value.joke;
or with a multi-line body. With a block body {} you need an explicit return, as in a regular function:
const fetchAsync = async () => {
    const result = await fetch(url);
    const data = await result.json();
    return data.value.joke;
}
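To see that the one-liner and the multi-line form return the same thing without hitting the network, here's a sketch with a stubbed fetch (the response shape {value: {joke}} is assumed from the question's API):

```javascript
// stub fetch: resolves to an object with a .json() method, mimicking the joke API shape
const fetch = async () => ({
    json: async () => ({value: {joke: "Chuck Norris can dereference NULL."}}),
});
const url = "https://api.icndb.com/jokes/random";

// concise body: the expression's result is returned implicitly
const oneLiner = async () => (await (await fetch(url)).json()).value.joke;

// block body: needs an explicit return
const multiLine = async () => {
    const result = await fetch(url);
    const data = await result.json();
    return data.value.joke;
};

(async () => {
    console.log(await oneLiner() === await multiLine()); // → true
})();
```

The two are interchangeable; the multi-line form is usually easier to step through in a debugger, while the one-liner reads well once you're comfortable with nested awaits.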