slow looping over pages and extracting data using puppeteer - javascript

I have a table that looks like this. All the names in the NAME column are links that navigate to the next page.
| NAME   | ID |
|--------|----|
| Name 1 | 1  |
| Name 2 | 2  |
| Name 3 | 3  |
I am trying to follow each link, extract data from the linked page, and then return to the table. However, there are over 4,000 records in the table and everything is processed very slowly (around 1000 ms per record).
Here is my code:
// Grabs all table rows.
const items = await page.$$(domObjects.itemPageLink);

for (let i = 0; i < items.length; i++) {
    await page.goto(url);
    await page.waitForSelector(domObjects.itemPageLink);

    let items = await page.$$(domObjects.itemPageLink);
    const item = items[i];

    let id = await item.$eval("td:last-of-type", node => node.innerText.split(",").map(item => item.trim()));
    let link = await item.$eval("td:first-of-type a", node => node.click());

    await page.waitForSelector(domObjects.itemPageWrapper);
    let itemDetailsPage = await page.$(domObjects.itemPageWrapper);
    let title = await itemDetailsPage.$eval(".page-header__title", title => title.innerText);

    console.log(title);
    console.log(id);
}
Is there a way to speed up this so I can get all the results at once much quicker? I would like to use this for my API.

There are some minor code improvements and one major improvement which can be applied here.
Minor improvements: Use fewer puppeteer functions
The minor improvements boil down to using as few puppeteer functions as possible. Most of the puppeteer functions you use send data from the Node.js environment to the browser environment via a WebSocket. While this only takes a few milliseconds, those milliseconds add up over 4,000 records. For more information on this, you can check out this question about the difference between using page.evaluate and using multiple puppeteer functions.
This means that, to optimize your code, you can for example use querySelector inside the page (in a single evaluate call) instead of running item.$eval multiple times. Another optimization is to use the result of page.waitForSelector directly: the function returns the node once it appears, so you do not need to query it again via page.$ afterwards.
These are only minor improvements, which might slightly improve the crawling speed.
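As an illustration, here is a rough sketch of both suggestions applied to the body of your loop. The selectors (domObjects.itemPageWrapper, the td cells) are taken from your question; the exact row markup is an assumption:
// Use the element handle returned by waitForSelector instead of querying again.
const itemDetailsPage = await page.waitForSelector(domObjects.itemPageWrapper);
const title = await itemDetailsPage.$eval(".page-header__title", el => el.innerText);

// Read the id and the link of a row in one round trip instead of two $eval calls.
const { id, href } = await item.evaluate(row => ({
    id: row.querySelector("td:last-of-type").innerText.split(",").map(s => s.trim()),
    href: row.querySelector("td:first-of-type a").href,
}));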
Major improvement: Use a puppeteer pool
Right now, you are using one browser with a single page to crawl multiple URLs. You can improve the speed of your script by using a pool of puppeteer resources to crawl multiple URLs in parallel. puppeteer-cluster allows you to do exactly that (disclaimer: I'm the author). The library takes a task and applies it to a number of URLs in parallel.
The number of parallel instances you can use depends on your CPU, memory, and throughput. The more you can use, the better your crawling speed will be.
Code Sample
Below is a minimal example, adapting your code to extract the same data. The code first sets up a cluster with one browser and four pages. After that, a task function is defined which will be executed for each of the queued objects.
After that, one page of the cluster is used to extract the IDs and URLs from the initial table page. The function given to cluster.queue reads the rows and calls cluster.queue again with objects of the form { id: ..., url: ... }. For each of those queued objects, the cluster.task function is executed, which extracts the title and prints it next to the passed ID.
const { Cluster } = require('puppeteer-cluster');

(async () => {
    // Setup your cluster with 4 pages
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 4,
    });

    // Define the task for the pages (go to the URL, and extract the title)
    await cluster.task(async ({ page, data: { id, url } }) => {
        await page.goto(url);
        const itemDetailsPage = await page.waitForSelector(domObjects.itemPageWrapper);
        const title = await itemDetailsPage.$eval('.page-header__title', title => title.innerText);
        console.log(id, title);
    });

    // Use one page of the cluster to extract the links (ignoring the task function above)
    cluster.queue(async ({ page }) => {
        await page.goto(url); // URL is given from outside
        // Extract the links and ids from the initial page by using a page of the cluster
        const itemData = await page.$$eval(domObjects.itemPageLink, items => items.map(item => ({
            id: item.querySelector('td:last-of-type').innerText.split(',').map(str => str.trim()),
            url: item.querySelector('td:first-of-type a').href,
        })));
        // Queue the data: { id: ..., url: ... } to start the process
        itemData.forEach(data => cluster.queue(data));
    });

    // Wait until all queued tasks are done, then shut the cluster down
    await cluster.idle();
    await cluster.close();
})();

Related

Puppeteer Page.$$eval() method returning empty arrays

I'm building a web scraping application with puppeteer. I'm trying to get an array of links to scrape from but it returns an empty array.
const scraperObject = {
    url: 'https://www.walgreens.com/',
    async scraper(browser){
        let page = await browser.newPage();
        console.log(`Navigating to ${this.url}...`);
        await page.goto(this.url);
        // Wait for the required DOM to be rendered
        await page.waitForSelector('.CA__Featured-categories__full-width');
        // Get the link to all the required links in the featured categories
        let urls = await page.$$eval('.list__contain > ul#at-hp-rp-featured-ul > li', links => {
            // Extract the links from the data
            links = links.map(el => el.querySelector('li > a').href)
            return links;
        });
Whenever I ran this, the console would give me the needed array of links (example below).
Navigating to https://www.walgreens.com/...
[
  'https://www.walgreens.com/seasonal/holiday-gift-shop?ban=dl_dl_FeatCategory_HolidayShop_TEST',
  'https://www.walgreens.com/store/c/cough-cold-and-flu/ID=20002873-tier1?ban=dl_dl_FeatCategory_CoughColdFlu_TEST',
  'https://www.walgreens.com/store/c/contact-lenses/ID=359432-tier2clense?ban=dl_dl_FeatCategory_ContactLenses_TEST'
]
So, from here I had to navigate to one of those URLs through the code block below and rerun the same code, to go through an array of categories and eventually navigate to the product listings page.
//Navigate to Household Essentials
let hEssentials = await browser.newPage();
await hEssentials.goto(urls[11]);
// Wait for the required DOM to be rendered
await page.waitForSelector('.content');
// Get the link to all the required links in the featured categories
let shopByNeedUrls = await page.$$eval('div.wag-row > div.wag-col-3 wag-col-md-6 wag-col-sm-6 CA__MT30', links1 => {
    // Extract the links from the data
    links1 = links1.map(el => el.querySelector('div > a').href)
    return links1;
});
console.log(shopByNeedUrls);
}
However, whenever I run this code, I see the same navigating message in the console but then get an empty array back (as shown in the example below):
Navigating to https://www.walgreens.com/...
[]
If anyone is able to explain why I'm outputting an empty array, that'd be great. Thank you.
I've attempted to change the parameters of the page.waitForSelector and page.$$eval methods, but none of those attempts worked; they all output the same result. In fact, I sometimes receive a timeout error from the page.waitForSelector method.

Should I be storing poke-api data locally?

I'm building a complete pokedex app using react-native/expo with all 900+ Pokémon.
I've tried what seems like countless ways of fetching the data from the API, but it's really slow.
Not sure if it's my code or the sheer amount of data:
export const getAllPokemon = async (offset: number) => {
    const data = await fetch(
        `https://pokeapi.co/api/v2/pokemon?limit=10&offset=${offset}`
    );
    const json = await data.json();
    const pokemonURLS: PokemonURLS[] = json.results;
    const monData: PokemonType[] = await Promise.all(
        pokemonURLS.map(async (p, index: number) => {
            const response = await fetch(p.url);
            const data: PokemonDetails = await response.json();
            const speciesResponse = await fetch(data.species.url);
            const speciesData: SpeciesInfo = await speciesResponse.json();
            return {
                name: data.name,
                id: data.id,
                image: `https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/other/official-artwork/${data.id}.png`,
                description: speciesData.flavor_text_entries,
                varieties: speciesData.varieties,
                color: speciesData.color.name,
                types: data.types,
                abilities: data.abilities,
            };
        })
    );
    return monData;
};
Then I'm using it with a useEffect that increases offset by 10 each time and concats the arrays, until offset > 900.
However like I said, it's really slow.
Should I be saving this data locally to speed things up a little?
And how would I go about it? Should I use local storage or save an actual file somewhere in my project folder with the data?
The biggest performance issue I can see is the multiple fetches you perform as you loop through each Pokémon.
I'm guessing that the data returned by the two nested fetches (response and speciesResponse) is reference data, which is potentially the same for multiple Pokémon. If this is the case, and you can't change the API, then two options come to mind:
Load the reference data only when needed ie. when a user clicks on a pokemon to view details.
or
Get ALL the reference data before the Pokémon data and either combine it with your Pokémon fetch results or store it locally and reference it as needed. The first way can be achieved using local state: just keep it long enough to merge the relevant data with the Pokémon data. The second will need application state like Redux or browser storage (see localStorage or IndexedDB). A sketch of this idea follows below.
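For illustration, here is a minimal sketch of the second option using a simple in-memory cache keyed by URL, so the same reference data is only fetched once. fetchJson and referenceCache are hypothetical helpers, not part of the question's code:
// Hypothetical helper: cache responses by URL so the same reference data
// (details, species) is only fetched from the API once.
const referenceCache = new Map();

async function fetchJson(url) {
    if (!referenceCache.has(url)) {
        // Store the promise itself so concurrent callers share one request.
        referenceCache.set(url, fetch(url).then((res) => res.json()));
    }
    return referenceCache.get(url);
}

export const getAllPokemon = async (offset) => {
    const json = await fetchJson(`https://pokeapi.co/api/v2/pokemon?limit=10&offset=${offset}`);
    return Promise.all(
        json.results.map(async (p) => {
            const data = await fetchJson(p.url);
            const speciesData = await fetchJson(data.species.url);
            // ...build the same object as in the question (name, id, image, types, ...)
            return { name: data.name, id: data.id, color: speciesData.color.name };
        })
    );
};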

Node JS Auto Reload

I want to make a program with JavaScript or Node.js. What I want to achieve is this: whenever there is a new item in the RSS feed I am reading, it gets logged to the terminal. In the future I will put the code on Firebase Hosting, so it needs to run by itself; I may later change the log output into a text file or store it in a database.
So, like this: I run the program and it gets all the items in the RSS feed, but when a new item appears I shouldn't have to run node app.js again; every new item in the feed should be logged automatically.
So far I have built it with Node.js and rss-parser, and the code I use looks like this:
let Parser = require('rss-parser');
let parser = new Parser();

(async () => {
    let feed = await parser.parseURL('https://rss-checker.blogspot.com/feeds/posts/default?alt=rss');
    feed.items.forEach(items => {
        console.log(items);
    });
})();
There are three common ways to achieve this:
Polling
Stream push
Webhook
Based on your code sample I assume that the RSS feed is plain request/response. This lends itself well to polling.
A poll-based program makes a request to a resource on an interval. The interval should be informed by resource limits and the performance the end user expects. Ideally the API would accept an offset or a page so you could request only items newer than some ID; this makes the program stateful.
setInterval can be used to drive the polling loop. Below is an example of the polling loop with no state management; it polls at 5-second intervals:
let Parser = require('rss-parser');
let parser = new Parser();

setInterval(async () => {
    let feed = await parser.parseURL('https://rss-checker.blogspot.com/feeds/posts/default?alt=rss');
    feed.items.forEach(items => {
        console.log(items);
    });
}, 5000);
This is incomplete because it needs to keep track of already seen posts. Creating a poll loop means you have a stateful process that needs to stay running.
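As an illustration, here is a minimal sketch of adding that state in memory. It assumes the feed items expose guid or link (which rss-parser usually provides); a file or database would be needed to survive restarts:
let Parser = require('rss-parser');
let parser = new Parser();

// Remember which items were already logged (in memory only).
const seen = new Set();

setInterval(async () => {
    const feed = await parser.parseURL('https://rss-checker.blogspot.com/feeds/posts/default?alt=rss');
    feed.items.forEach(item => {
        const key = item.guid || item.link;
        if (!seen.has(key)) {
            seen.add(key);
            console.log(item); // only new items are logged
        }
    });
}, 5000);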

Synchronize critical section in API for each user in JavaScript

I want to swap a user's profile picture. To do this, I have to check the database to see whether a picture has already been saved; if so, it should be deleted. Then the new one should be saved and entered into the database.
Here is a simplified (pseudo) code of that:
async function changePic(user, file) {
    // remove old pic
    if (await database.hasPic(user)) {
        let oldPath = await database.getPicOfUser(user);
        filesystem.remove(oldPath);
    }

    // save new pic
    let path = "some/new/generated/path.png";
    file = await Image.modify(file);
    await Promise.all([
        filesystem.save(path, file),
        database.saveThatUserHasNewPic(user, path)
    ]);

    return "I'm done!";
}
I ran into the following problem with it:
If the user calls the API twice in a short time, serious errors occur. The database queries and the functions in between are asynchronous, so the changes made by the first API call have not yet been applied when the second call checks for a profile picture to delete. I'm then left with a filesystem.remove request for a file that no longer exists and an orphaned image in the filesystem.
I would like to safely handle that situation by synchronizing this critical section of code. I don't want to reject requests only because the server hasn't finished the previous one and I also want to synchronize it for each user, so users aren't bothered by the actions of other users.
Is there a clean way to achieve this in JavaScript? Some sort of monitor, like the one you know from Java, would be nice.
You could use a library like p-limit to control your concurrency. Use a map to track the active/pending requests for each user. Use their ID (which I assume exists) as the key and the limit instance as the value:
const pLimit = require('p-limit');

const limits = new Map();

function changePic(user, file) {
    async function impl(user, file) {
        // your implementation from above
    }

    const { id } = user; // or similar to distinguish them
    if (!limits.has(id)) {
        limits.set(id, pLimit(1)); // only one active request per user
    }
    const limit = limits.get(id);
    return limit(impl, user, file); // schedule impl for execution
}
// TODO clean up limits to prevent memory leak?
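One possible way to handle that TODO is a rough sketch like the following, relying on p-limit's activeCount and pendingCount counters: drop a user's limiter once it has no running or queued work left. Replace the return line above with something like this (the exact moment the counters reach zero is worth verifying against the p-limit version you use):
return limit(impl, user, file).finally(() => {
    // Remove the limiter when this user has no more active or queued requests.
    if (limit.activeCount === 0 && limit.pendingCount === 0) {
        limits.delete(id);
    }
});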

React-Redux Javascript app - trying to get lots of API calls to fire in order

I'm having an issue with successive API calls (using jQuery's AJAX) to two different APIs in order to build objects with certain attributes. Here's a summary of my app and what I'm trying to do:
The user enters in the name of an actor or director, and the app is meant to return a total of five movies, each of which has certain attributes like title, overview, year, budget, revenue, and a link to a YouTube preview. I'm using The Movie Database API, plus the YouTube API for the YouTube link.
Here's the order in which things currently work, with all of this happening in the action creator of the Redux app:
Actor name gets sent to the TMDB API -- returns ActorID number
ActorID number gets sent to the TMDB API -- returns 20 movies with: title, overview, year, poster link, and MovieID number
For EACH movie in that list, the MovieID number gets sent to the API -- returns more attributes: budget, revenue, and IMDB-ID (to use in a link later)
Also for EACH movie in step 2, the title gets sent to the YouTube API -- returns a link to the preview.
Once all of this information is assembled for each movie, I want to return the first five movies and dispatch them as the action payload to the Redux store.
I'm using some promises, and I've tried everything I could think of in terms of rearranging the flow of functions, but I can't get all the information I need with one click of the submit button. The funny thing is, it works with TWO clicks of the submit button, I think because by then all the async AJAX calls are finally done. But after the first click, I have an empty array where the movie objects should be.
Here's some code that should summarize what things look like:
var personId
var movies = []
function actorByRating(UserInput) {
Step 1: get actor ID number:
function searchActors() {
    return $.ajax({
        method: "GET",
        url: `https://api.themoviedb.org/3/search/person?query=${UserInput}&api_key=<key>`
    }).done(function(response){
        personId = response.results[0].id
    })
}
Step 2: Use Actor ID to get list of movies, start assigning them attributes:
function getMovies() {
    $.ajax({
        method: "GET",
        url: `https://api.themoviedb.org/3/discover/movie?with_cast=${personId}&vote_count.gte=20&sort_by=vote_average.asc&budget.desc&api_key=<key>&include_image_language=en`
    }).done(function(response) {
        response.results.forEach((m) => {
            var movie = {}
            movie.title = m.title
            movie.year = m.release_date.split("-")[0]
            movie.movieId = m.id
            movie.overview = m.overview
            movie.poster = "http://image.tmdb.org/t/p/w500" + m.poster_path
            getMovieInfo(movie) //step 3
            getYouTube(movie) //step 4
            saveMovie(movie)
        })
    })
}

function saveMovie(movie){
    movies.push(movie)
}
Step 3 function, takes in a movie object as an argument:
function getMovieInfo(m){
    return $.ajax({
        method: "GET",
        url: `https://api.themoviedb.org/3/movie/${m.movieId}?&api_key=<key>&append_to_results=imdb_id`
    }).done(function(response) {
        m.revenue = response.revenue
        m.budget = response.budget
        m.imdbId = response.imdb_id
    })
}
Step 4 function, to get Youtube link. Also takes a movie object:
function getYouTube(movie){
    $.ajax({
        method: "GET",
        url: `https://www.googleapis.com/youtube/v3/search?part=snippet&q=${movie.title.split(" ").join("+")}+trailer&key=<key>`
    }).done(function(yt){
        movie.youtubeLink = `http://www.youtube.com/embed/${yt.items[0].id.videoId}`
    })
}
After this, the filtering functions work fine when they have an array of movies to work with. The problem, I think, is that all these successive API calls keep firing before the previous ones are done, and the later ones need info from the earlier ones to search with. Thus, when I click submit the first time, the final movies array is empty, so the dispatched payload is an empty array. THEN the movie objects get filled in, so when you click submit again, the movies are already there to work with, and the rest of the app works fine.
I've tried everything I can think of to slow the process down, chain promises together (which doesn't work because Step 2 has to run for several movies, i.e. the return values of each function keep changing, so I can't ".then" them), reorganizing the information that comes in...but I can't get it to give me movie objects with all the attributes I need by the time the filtering functions actually run to create the proper payload.
Any help or suggestions would be greatly appreciated!
(Note: the "key" stuff above is just placeholder text)
UPDATE:
I changed the code to basically the following:
searchActors()
    .then(function(response){
        const actorId = response.data.results[0].id
        return actorId
    })
    .then((personID) => {
        return getMoviesFromPersonID(personID)
    })
    .then(function(response) {
        const movieList = []
        response.data.results.forEach((m) => {
            var movie = {}
            movie.title = m.title
            movie.year = m.release_date.split("-")[0]
            movie.movieId = m.id
            movie.overview = m.overview
            movie.poster = "http://image.tmdb.org/t/p/w500" + m.poster_path
            movieList.push(movie)
            // saveMovie(movie)
        })
        return Promise.all(movieList)
    })
    .then((movieList) => {
        const deepMovieList = []
        movieList.forEach((movie) => {
            getMovieInfo(movie)
                .then(function(response) {
                    movie.revenue = response.data.revenue
                    movie.budget = response.data.budget
                    movie.imdbId = response.data.imdb_id
                    deepMovieList.push(movie)
                })
        })
        return Promise.all(deepMovieList)
    })
    .then((deepMovieList) => {
        const finalMovies = []
        deepMovieList.forEach((movie) => {
            finalMovies.push(getYouTube(movie))
        })
        return Promise.all(finalMovies)
    })
Everything works fine right up until the first mention of "deepMovieList". I can't seem to figure out how to get that particular step to work properly, as it essentially involves making 20 API calls, one for each movie in the movieList. I can't figure out how to 1) get the info back from the API, 2) assign the attributes to the movie object that is passed into getMovieInfo, and then 3) push that movie object (with the new attributes) to an array that I can use Promise.all on, all without interrupting the promise chain.
Either it moves on to the next "then" function too early (while deepMovieList is still an empty array), or, with other random stuff I've tried, the array ends up being undefined.
How can I have the next "then" function wait until 20 API calls have been made and each movie object has its updated attributes? This will also run into the same problem in the next step, for the YouTube link.
Thanks!
TL;DR: use fetch and promises instead of jQuery, group promises using Promise.all.
The Longer Version
Ok, I'm not going to repeat all your code here. I'm going to abbreviate some stuff to keep it simple.
Basically, you have a bunch of tasks to perform. I'm going to pretend each of them is a function that returns a promise which is resolved with the data you want.
searchActor() - returns a promise resolved with some ID number
getMoviesFromActorID(actorId) - returns a promise that resolves with an array of movie IDs
getMovie(movieId) - returns a promise that resolves with the details for the given movie ID
getYoutube(movie) - returns a promise that resolves with the Youtube embed code.
Given this basic setup (and I admit I'm leaving out a lot of stuff), the code looks like this:
// search for an actor
searchActor('Brad Pitt')
    // then get the movie IDs for that actor
    .then((actorId) => getMoviesFromActorID(actorId))
    // then iterate over the list of movie IDs & build an array of
    // promises. Use Promise.all to create a new Promise which is
    // resolved when all are resolved
    .then((movieIdList) => {
        const promiseList = [];
        movieIdList.forEach((id) => promiseList.push(getMovie(id)));
        return Promise.all(promiseList);
    })
    // then get Youtube links for each of the movies
    .then((movieDetailsList) => {
        const youtubeList = [];
        movieDetailsList.forEach((movie) => youtubeList.push(getYoutube(movie)));
        return Promise.all(youtubeList);
    })
    // then do something with all the information you've collected
    .then((finalResults) => {
        // do something interesting...
    });
The key to this is Promise.all (documentation here), which will take an array of Promises (or any other iterable object containing promises) and create a new Promise which will resolve when all of the original promises have resolved. By using Promise.all, you can create a step in your promise chain which can include a variable number of parallel actions which must complete before the next step.
You could do something like this with jQuery and callbacks, but it would be pretty darn ugly. One of the great benefits of promises is the ability to lay out a series of steps like the above.
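For reference, here is a rough sketch of how the helper functions above could be implemented with fetch, using the TMDB endpoints and the <key> placeholder from your question; the exact query parameters are assumptions:
// Sketch only: each helper returns a promise, matching the outline above.
function searchActor(name) {
    return fetch(`https://api.themoviedb.org/3/search/person?query=${encodeURIComponent(name)}&api_key=<key>`)
        .then((res) => res.json())
        .then((json) => json.results[0].id);
}

function getMoviesFromActorID(actorId) {
    return fetch(`https://api.themoviedb.org/3/discover/movie?with_cast=${actorId}&api_key=<key>`)
        .then((res) => res.json())
        .then((json) => json.results.map((m) => m.id));
}

function getMovie(movieId) {
    return fetch(`https://api.themoviedb.org/3/movie/${movieId}?api_key=<key>`)
        .then((res) => res.json());
}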
