How to pull data from Paginated JSON - javascript

I have, say, 300 items with 10 shown per page. The page loads the JSON data and is limited to 10 items per request (this cannot be changed).
I want to scrub through the 30-odd pages, pulling each item and listing it.
url.com/api/some-name?page=1 etc.
Ideally the script would use the above URL as a template and increment the page number by 1 until all 10 items from every page have been collected.
Can this be done? How would I go about it? Any advice or assistance would help me greatly in learning and in looking at the methods people suggest.
const getInfo = async function(pageNo) {
  const jsonUrl = "https://website.com/api/some-title";
  let actualUrl = jsonUrl + `?page=${pageNo}`;
  let jsonResults = await fetch(actualUrl).then(response => {
    return response.json();
  });
  return jsonResults;
};

const getEntireList = async function(pageNo) {
  const results = await getInfo(pageNo);
  console.log("Retreiving data from API for page:" + pageNo);
  if (results.length > 0) {
    return results.concat(await getEntireList(pageNo));
  } else {
    return results;
  }
};

(async () => {
  const entireList = await getEntireList();
  console.log(entireList);
})();

I can see some issues in your code.
The initial call to getEntireList() should be initialised with the index of the first page, e.g. const entireList = await getEntireList(1);
The page number will need to be incremented at some point, otherwise the recursion keeps requesting the same page.
results.concat() probably won't have the desired effect. json() returns an object, an array, or a primitive value (depending on the server), and results will be one of those types; unless each page really is an array, concat() won't merge the pages the way you expect.
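Putting those fixes together, a minimal sketch (assuming the placeholder endpoint from the question, and that each page returns a plain JSON array with an empty array past the last page):

const getInfo = async function (pageNo) {
  // Endpoint is the placeholder from the question; adjust to the real API.
  const response = await fetch(`https://website.com/api/some-title?page=${pageNo}`);
  return response.json();
};

const getEntireList = async function (pageNo = 1) {
  console.log("Retrieving data from API for page: " + pageNo);
  const results = await getInfo(pageNo);
  if (Array.isArray(results) && results.length > 0) {
    // Recurse with the *next* page number and merge the arrays.
    return results.concat(await getEntireList(pageNo + 1));
  }
  return [];
};

(async () => {
  const entireList = await getEntireList(1);
  console.log(entireList);
})();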


Node fetch loop too slow

I have an API js file which I call with a POST method, passing in as the body an array of objects, each containing a site url (about 26 objects or urls). With the code below I loop through this array (sites), check whether each object's url returns JSON by appending "/items.json" to the url, and if so push the JSON content into a final array, siteLists, which I send back as the response.
The problem is that for just 26 urls this API call takes more than 5 seconds to complete. Am I doing it the wrong way, or is that just how fetch works in Node.js?
The content of sites looks like:
[{label: "JonLabel", name: "Jon", url: "jonurl.com"}, {...}, {...}]
The code is:
export default async (req, res) => {
  if (req.method === 'POST') {
    const body = JSON.parse(req.body)
    const sites = body.list // this content shown above
    var siteLists = []
    if (sites?.length > 0){
      var b = 0, idd = 0
      while (b < sites.length){
        let url = sites?.[b]?.url
        if (url){
          let jurl = `${url}/items.json`
          try {
            let fUrl = await fetch(jurl)
            let siteData = await fUrl.json()
            if (siteData){
              let items = []
              let label = sites?.[b]?.label || ""
              let name = sites?.[b]?.name || ""
              let base = siteData?.items
              if (base){
                var c = 0
                while (c < base.length){
                  let img = base[c].images[0].url
                  let titl = base[c].title
                  let obj = {
                    url: url,
                    img: img,
                    title: titl
                  }
                  items.push(obj)
                  c++
                }
                let object = {
                  id: idd,
                  name: name,
                  label: label,
                  items: items
                }
                siteLists.push(object)
                idd++
              }
            }
          } catch(err){
            //console.log(err)
          }
        }
        b++
      }
      res.send({ sites: siteLists })
    }
    res.end()
  }
}
EDIT: (solution?)
So it seems the code with promises, as suggested below and marked as the solution, works in the sense that it is faster. The funny thing, though, is that it still takes more than 5 seconds to load and still throws a "Failed to load resource: the server responded with a status of 504 (Gateway Time-out)" error, since Vercel, where the app is hosted, enforces a max timeout of 5 seconds for serverless functions, so the content never makes it into the response. Locally, where I have no timeout limits, it is visibly faster to load, but it surprises me that such a query takes so long to complete when it should be a matter of milliseconds.
The biggest problem I see here is that you appear to be awaiting one fetch to complete before starting the next fetch request, effectively running them serially. If you rewrote your script to run them all in parallel, you could collect the requests into a Promise.all and then process the results when they return.
Think of it like this: if each request takes a second to complete, you have 26 requests, and you wait for one to finish before starting the next, the whole thing takes 26 seconds. However, if you run them all together and each still takes only one second, the whole thing takes just one second.
An example in pseudocode:
You want to change this:
const urls = ['url1', 'url2', 'url3'];
for (let url of urls) {
  const result = await fetch(url);
  process(result)
}
...into this:
const urls = ['url1', 'url2', 'url3'];
const requests = [];
for (let url of urls) {
  requests.push(fetch(url));
}

Promise.all(requests)
  .then(
    (results) => results.forEach(
      (result) => process(result)
    )
  );
While await is great sugar, sometimes it's better to stick with then:
export default async (req, res) => {
  if (req.method === 'POST') {
    const body = JSON.parse(req.body)
    const sites = body.list // this content shown above
    const siteListsPromises = []
    if (sites?.length > 0){
      var b = 0
      while (b < sites.length){
        let url = sites?.[b]?.url
        if (url) {
          let jurl = `${url}/items.json`
          // #1
          const promise = fetch(jurl)
            // #2
            .then(async (fUrl) => {
              let siteData = await fUrl.json()
              if (siteData){
                ...
                return {
                  // #3
                  id: -1,
                  name: name,
                  label: label,
                  items: items
                }
              }
            })
            // #4
            .catch(err => {
              // console.log(err)
            })
          siteListsPromises.push(promise)
        }
        b++
      }
    }
    // #5
    const siteLists = (await Promise.all(siteListsPromises))
      // #6
      .filter(el => el !== undefined)
      // #7
      .map((el, i) => ({ id: i, ...el }))
    res.send({ sites: siteLists })
  }
  res.end()
}
Look for the // #N comments in the snippet; a more compact alternative is sketched after this list.
1. Don't await each request to complete. Instead, iterate over sites and send all the requests at once.
2. Chain json() and the siteData processing after the fetch with then. The more computationally heavy your processing of siteData is, the more sense this makes, rather than performing all of it only after every promise has resolved.
3. If you (or someone on your team) have trouble with closures, don't bother setting the id of the siteData elements inside the loop. I won't dive into that here; it is addressed further down.
4. Use .catch() instead of try{}catch(){}, because without await the latter won't work.
5. Await the results of all the requests with Promise.all().
6. Filter out the entries where siteData was falsy.
7. Finally, set the id field.
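The same idea can also be written more compactly with map() and Promise.allSettled(), which never rejects and therefore removes the need for the per-request catch(). A rough sketch under the same assumptions as the question (each site exposes url/items.json with an items array):

export default async (req, res) => {
  if (req.method !== 'POST') return res.end()

  const sites = JSON.parse(req.body).list ?? []

  // Start every request immediately; allSettled waits for all of them
  // and reports each one as fulfilled or rejected instead of throwing.
  const settled = await Promise.allSettled(
    sites
      .filter(site => site?.url)
      .map(async (site) => {
        const response = await fetch(`${site.url}/items.json`)
        const siteData = await response.json()
        return {
          name: site.name || "",
          label: site.label || "",
          items: (siteData?.items ?? []).map(item => ({
            url: site.url,
            img: item.images?.[0]?.url,
            title: item.title
          }))
        }
      })
  )

  // Keep only the successful lookups and assign ids afterwards.
  const siteLists = settled
    .filter(result => result.status === 'fulfilled')
    .map((result, i) => ({ id: i, ...result.value }))

  res.send({ sites: siteLists })
}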

How to take two fetches and compare their data as variables whilst not getting any errors?

I want to take two different fetches, put them into a variable form so their important data can be used for something else other than just logging the data.
I'm trying to do this by overriding window.Response.prototype.json with an async function; however, I am currently at a dead end, because although what I'm doing works on one strand of data, it doesn't work on two because of the "JSON body stream already read" error.
let RESPONSE = window.Response.prototype.json;
window.Response.prototype.json = async function () {
  if (!('https://a/')) return RESPONSE.call(this)
  let x = await RESPONSE.call(this);
  if (!('https://b/')) return RESPONSE.call(this)
  let y = await RESPONSE.call(this);
  for (let detect in x) {
    if (x[detect] !== y[detect]) {
      console.log(x[detect]);
      console.log(y[detect]);
    }
  }
  return x;
  return y;
};
How can I keep the data in a variable form that can be used for something like this:
for (let detect in x) {
  if (x[detect] !== y[detect]) {
    console.log(x[detect]);
    console.log(y[detect]);
  }
}
but whilst being able to have both variables defined at the same time? This would mean I would need to get past the body stream error while also keeping that core code. How can I do that?
Does this help you?
async function doTwoRequestsAndCompareStatus() {
  const res1 = await fetch("https://fakejsonapi.com/fake-api/employee/api/v1/employees");
  const res2 = await fetch("https://fakejsonapi.com/fake-api/employee/api/v1/employees");
  const data1 = await res1.json();
  const data2 = await res2.json();
  console.log('both were equal?', data1.status === data2.status);
}

// Don't forget to actually call the function 😉
doTwoRequestsAndCompareStatus();
I would recommend this next one, though, both because it's cleaner and because it's faster: the fetches are executed at the same time rather than sequentially.
async function doTwoRequestsAndCompareStatus() {
  const [data1, data2] = await Promise.all([
    fetch("https://fakejsonapi.com/fake-api/employee/api/v1/employees").then(r => r.json()),
    fetch("https://fakejsonapi.com/fake-api/employee/api/v1/employees").then(r => r.json()),
  ]);
  console.log('both were equal?', data1.status === data2.status);
}

// Don't forget to actually call the function 😉
doTwoRequestsAndCompareStatus();
If you find the first one easier to understand though, I would recommend using it πŸ‘.
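If you do want to keep the json() override from the question, the "body stream already read" error comes from calling the original json() more than once on the same Response; a body can only be consumed once. A minimal sketch of one way around that, assuming https://a/ and https://b/ are the placeholder URLs from the question and that caching the parsed bodies by response URL is acceptable:

const ORIGINAL_JSON = window.Response.prototype.json;
const seen = {}; // parsed bodies keyed by response URL (illustrative cache)

window.Response.prototype.json = async function () {
  const data = await ORIGINAL_JSON.call(this); // consume this body exactly once
  seen[this.url] = data;

  const x = seen['https://a/'];
  const y = seen['https://b/'];
  if (x && y) {
    // Both responses have arrived: compare them field by field.
    for (let detect in x) {
      if (x[detect] !== y[detect]) {
        console.log(x[detect]);
        console.log(y[detect]);
      }
    }
  }
  return data; // hand the parsed body back to whatever called json()
};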

Cannot get querySelectorAll to work with puppeteer (returns undefined)

I'm trying to practice some web scraping with prices from a supermarket. It's with node.js and puppeteer. I can navigate through the website at the beginning by accepting cookies and clicking a "load more" button. But when I then try to read the divs containing the products with querySelectorAll, I get stuck. It returns undefined even though I wait for a specific div to be present. What am I missing?
The problem is at the end of the code block.
const { product } = require("puppeteer");

const scraperObjectAll = {
  url: 'https://www.bilkatogo.dk/s/?query=',
  async scraper(browser) {
    let page = await browser.newPage();
    console.log(`Navigating to ${this.url}`);
    await page.goto(this.url);

    // accept cookies
    await page.evaluate(_ => {
      CookieInformation.submitAllCategories();
    });

    var productsRead = 0;
    var productsTotal = Number.MAX_VALUE;

    while (productsRead < 100) {
      // Wait for the required DOM to be rendered
      await page.waitForSelector('button.btn.btn-dark.border-radius.my-3');

      // Click button to read more products
      await page.evaluate(_ => {
        document.querySelector("button.btn.btn-dark.border-radius.my-3").click()
      });

      // Wait for it to load the new products
      await page.waitForSelector('div.col-10.col-sm-4.col-lg-2.text-center.mt-4.text-secondary');

      // Get number of products read and total
      const loadProducts = await page.evaluate(_ => {
        let p = document.querySelector("div.col-10.col-sm-4.col-lg-2").innerText.replace("INDLÆS FLERE", "").replace("Du har set ", "").replace(" ", "").replace(/(\r\n|\n|\r)/gm, "").split("af ");
        return p;
      });
      console.log("Products (read/total): " + loadProducts);
      productsRead = loadProducts[0];
      productsTotal = loadProducts[1];

      // Now waiting for a div element
      await page.waitForSelector('div[data-productid]');

      const getProducts = await page.evaluate(_ => {
        return document.querySelectorAll('div');
      });

      // PROBLEM HERE!
      // Cannot convert undefined or null to object
      console.log("LENGTH: " + Array.from(getProducts).length);
    }
The callback passed to page.evaluate runs in the emulated page context, not in the standard scope of the Node script. Expressions can't be passed between the page and the Node script without careful considerations: most importantly, if something isn't serializable (converted into plain JSON), it can't be transferred.
querySelectorAll returns a NodeList, and NodeLists only exist on the front-end, not the backend. Similarly, NodeLists contain HTMLElements, which also only exist on the front-end.
Put all the logic that requires using the data that exists only on the front-end inside the .evaluate callback, for example:
const numberOfDivs = await page.evaluate(_ => {
  return document.querySelectorAll('div').length;
});
or
const firstDivText = await page.evaluate(_ => {
  return document.querySelector('div').textContent;
});
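Applied to the question's page, that means mapping the NodeList to plain, serializable values inside the evaluate callback and returning those. A rough sketch, assuming the div[data-productid] elements the script already waits for carry the product id as an attribute (the innerText lookup is an assumption):

const products = await page.evaluate(_ => {
  // Convert the NodeList into an array of plain objects so it can be
  // serialized back to the Node script.
  return Array.from(document.querySelectorAll('div[data-productid]')).map(div => ({
    id: div.getAttribute('data-productid'),
    text: div.innerText
  }));
});
console.log("LENGTH: " + products.length);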

How do I loop through multiple pages in an API?

I am using the Star Wars API (https://swapi.co/). I need to pull in starship information; the results for starships span 4 pages, but a GET call returns only 10 results per page. How can I iterate over multiple pages and get the info that I need?
I have used the Fetch API to GET the first page of starships and added that array of 10 to my totalResults array, then created a while loop that checks whether next !== null (next is the next-page property in the data; if we were viewing the last page, i.e. page 4, then next would be null: "next": null). So as long as next does not equal null, my while loop should fetch the data and add it to my totalResults array. I update the value of next at the end, but it seems to loop forever and crash.
function getData() {
  let totalResults = [];
  fetch('https://swapi.co/api/starships/')
    .then(res => res.json())
    .then(function (json) {
      let starships = json;
      totalResults.push(starships.results);
      let next = starships.next;
      while (next !== null) {
        fetch(next)
          .then(res => res.json())
          .then(function (nextData) {
            totalResults.push(nextData.results);
            next = nextData.next;
          })
      }
    });
}
The code keeps looping, meaning my next = nextData.next; update does not seem to take effect.
You have to await the response inside the while loop; otherwise the loop runs synchronously while the results arrive asynchronously. In other words, the while loop runs forever:
async function getData() {
  const results = [];
  let url = 'https://swapi.co/api/starships/';
  do {
    const res = await fetch(url);
    const data = await res.json();
    url = data.next;
    results.push(...data.results);
  } while (url)
  return results;
}
You can do it with async/await functions more easily:
async function fetchAllPages(url) {
  const data = [];
  do {
    // Await the fetch and parse the JSON body before reading next/results.
    const response = await fetch(url);
    const page = await response.json();
    url = page.next;
    data.push(...page.results);
  } while (url);
  return data;
}
This way you can reuse the function for other API calls.
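For example, with the starships endpoint from the question:

fetchAllPages('https://swapi.co/api/starships/')
  .then(starships => console.log(`Fetched ${starships.length} starships`));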

NodeJS - Map forEach Async/Await

I am working with Node. I have an array of ids, and I want to filter them based on the response of a call to another API. So I want to look up each id and know whether or not it passes the filter I am applying based on that API.
I am using async/await. I found that the best approach is to use Promise.all, but this is not working as expected. What am I doing wrong?
static async processCSGOUsers (groupId, parsedData) {
  let steamIdsArr = [];
  const usersSteamIds = parsedData.memberList.members.steamID64;
  const filteredUsers = await Promise.all(usersSteamIds.map(async (userId) => {
    return csGoBackpack(userId).then((response) => {
      return response.value > 40;
    })
    .catch((err) => {
      return err;
    });
  }));
  Object.keys(usersSteamIds).forEach(key => {
    steamIdsArr.push({
      steam_group_id_64: groupId,
      steam_id_64: usersSteamIds[key]
    });
  });
  return UsersDao.saveUsers(steamIdsArr);
}
Apart from that, something weird is happening. When I was debugging this, the data parameters for this method come through fine, but when I reach the line with the Promise.all I get a "reference error" on each parameter. I do not know why.
Wait for all responses, then filter based on the results:
const responses = await Promise.all(usersSteamIds.map(csGoBackpack));
// responses now contains the array of responses for each user ID
// filter the user IDs based on the corresponding result
const filteredUsers = usersSteamIds.filter((_, index) => responses[index].value > 40);
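Plugged back into the original method, it might look roughly like this (a sketch using the names from the question: csGoBackpack, UsersDao, and the steamID64 array; the per-request catch is added so one failed lookup doesn't reject the whole Promise.all):

static async processCSGOUsers (groupId, parsedData) {
  const usersSteamIds = parsedData.memberList.members.steamID64;

  // Run all backpack lookups in parallel; a failed lookup becomes null.
  const responses = await Promise.all(
    usersSteamIds.map(userId => csGoBackpack(userId).catch(() => null))
  );

  // Keep only the ids whose backpack value passes the threshold.
  const steamIdsArr = usersSteamIds
    .filter((_, index) => responses[index] && responses[index].value > 40)
    .map(steamId => ({
      steam_group_id_64: groupId,
      steam_id_64: steamId
    }));

  return UsersDao.saveUsers(steamIdsArr);
}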
If you don't mind using a module, you can do this kind of stuff in a straightforward way using these utilities
