Insert a new key-value pair inside an array of objects, where the value comes from axios.get - javascript

So I've been working on a scraper. Everything was going well until I tried scraping data for each individual link.
To explain: I've got a scraper that collects data about apartments. The first URL is the page where the articles are located (approx. 29-30 should be fetched). That page doesn't contain information about square meters, so for each scraped link I need to run another scraper and get the square meters from there.
Here is the code that I have:
const axios = require('axios');
const cheerio = require('cheerio');

const url = `https://www.olx.ba/pretraga?vrsta=samoprodaja&kategorija=23&sort_order=desc&kanton=9&sacijenom=sacijenom&stranica=2`;

axios.get(url).then((response) => {
  const articles = [];
  const $ = cheerio.load(response.data);
  $('div[id="rezultatipretrage"] > div')
    .not('div[class="listitem artikal obicniArtikal i index"]')
    .not('div[class="obicniArtikal"]')
    .each((index, element) => {
      $('span[class="prekrizenacijena"]').remove();
      const getLink = $(element).find('div[class="naslov"] > a').attr('href');
      const getDescription = $(element)
        .find('div[class="naslov"] > a > p')
        .text();
      const getPrice = $(element)
        .find('div[class="datum"] > span')
        .text()
        .replace(/\.| ?KM$/g, '')
        .replace(' ', '');
      const getPicture = $(element)
        .find('div[class="slika"] > img')
        .attr('src');
      articles[index] = {
        id: getLink.substring(27, 35),
        link: getLink,
        description: getDescription,
        price: getPrice,
        picture: getPicture,
      };
    });
  articles.map((item, index) => {
    axios.get(item.link).then((response) => {
      const $ = cheerio.load(response.data);
      const sqa = $('div[class="df2 "]').first().text();
    });
  });
  console.log(articles);
});
The first part of the code works as it should; I've been struggling with the second part.
I'm mapping over articles because for each link I need to pass it to an axios call and get the data about square meters.
My desired output would be the updated articles array: the same objects with their old key-value pairs, plus a key sqm whose value is the scraped square meters.
Any ideas on how to achieve this?
Thanks!

You could simply add the information about the square meters to the current article/item, something like:
const articlePromises = Promise.all(articles.map((item) => {
  return axios.get(item.link).then((response) => {
    const $ = cheerio.load(response.data);
    const sqa = $('div[class="df2 "]').first().text();
    item.sqm = sqa;
  });
}));

articlePromises.then(() => {
  console.log(articles);
});
Note that you need to wait for all mapped promises to resolve before you log the resulting articles.
Also note that using async/await you could rewrite your code to be a bit cleaner, see https://javascript.info/async-await.
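With async/await the same idea reads more linearly. Here is a minimal sketch; `getSqm` is a hypothetical helper (not from the question) that would wrap the axios.get + cheerio lookup for one article link, injected so the enrichment logic stands on its own:

```javascript
// Sketch only: getSqm(link) is an assumed helper that resolves to the
// scraped square meters for one article page.
async function withSquareMeters(articles, getSqm) {
  // Run all lookups in parallel and attach the result to each article.
  return Promise.all(
    articles.map(async (item) => ({ ...item, sqm: await getSqm(item.link) }))
  );
}
```

Called as `withSquareMeters(articles, getSqm).then(console.log)`, this logs the enriched array only after every page has been fetched.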

Related

How to use Cheerio in NodeJS to scrape img srcs

This is the code
const getHappyMovies = async () => {
  try {
    const movieData = [];
    let title;
    let description;
    let imageUrl;
    const response = await axios.get(happyUrl); // https://www.imdb.com/list/ls008985796/
    const $ = load(response.data);
    const movies = $(".lister-item");
    movies.each(function () {
      title = $(this).find("h3 a").text();
      description = $(this).find("p").eq(1).text();
      imageUrl = $(this).find("a img").attr("src");
      movieData.push({ title, description, imageUrl });
    });
    console.log(movieData);
  } catch (e) {
    console.error(e);
  }
};
Here's the output I'm receiving, and this is the website I'm scraping from.
I need to get the src of that image, but it's returning something else, as shown in the output image.
The golden rule of Cheerio is "it doesn't run JS". As a result, devtools is often inaccurate since it shows the state of the page after JS runs.
Instead, either look at view-source:, disable JS or look at the HTML response printed from your terminal to get a more accurate sense of what's actually on the page (or not).
Looking at the source:
<img alt="Tokyo Story"
     class="loadlate"
     loadlate="https://m.media-amazon.com/images/M/MV5BYWQ4ZTRiODktNjAzZC00Nzg1LTk1YWQtNDFmNDI0NmZiNGIwXkEyXkFqcGdeQXVyNzkwMjQ5NzM#._V1_UY209_CR2,0,140,209_AL_.jpg"
     data-tconst="tt0046438"
     height="209"
     src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png"
     width="140" />
You can see src= is a placeholder image but loadlate is the actual URL. When the image is scrolled into view, JS kicks in and lazily loads the loadlate URL into the src attribute, leading to your observed devtools state.
The solution is to use .attr("loadlate"):
const axios = require("axios");
const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "<Your URL>";

const getHappyMovies = () =>
  axios.get(url).then(({ data: html }) => {
    const $ = cheerio.load(html);
    return [...$(".lister-item")].map(e => ({
      title: $(e).find(".lister-item-header a").text(),
      description: $(e).find(".lister-item-content p").eq(1).text().trim(),
      imageUrl: $(e).find(".lister-item-image img").attr("loadlate"),
    }));
  });

getHappyMovies().then(movies => console.log(movies));
Note that I'm using class selectors which are more specific than plain tags.
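If you want the scraper to keep working on pages where the image is not lazy-loaded, a small fallback helps. A sketch, assuming the image attributes have already been read into a plain object (the helper name `pickImageUrl` is made up):

```javascript
// Prefer the lazy-load attribute when present, otherwise fall back to src.
// `attrs` is a plain object holding an <img> element's attributes.
function pickImageUrl(attrs) {
  return attrs.loadlate || attrs.src || null;
}
```

With cheerio this could be called as `pickImageUrl({ loadlate: $(e).find('img').attr('loadlate'), src: $(e).find('img').attr('src') })`.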

How to merge arrays that are passed into the function?

So I've been working on my project that involves a scraper.
The workflow is as follows: there are two scrapers right now. Each scraper parses data, pushes it into an array, and passes that array to the merge component.
The merge component looks like this:
let mergedApartments = []; // Creating merged list of apartments

exports.mergeData = (apartments) => {
  // Fetching all apartments that are passed from scraper(s)
  mergedApartments.push(...apartments); // Pushing apartments into the list
  console.log(mergedApartments.length);
};
Right now the output of mergedApartments.length is 9, then 39. The first function that calls mergeData() passes an array with 9 objects, and the other scraper passes an array with 30 objects to mergeData.
This is not what I expected: I expected one array with all merged objects from the scrapers. Right now scraper no. 1 sends its apartments and they are added to mergedApartments, then scraper no. 2 sends its apartments and the array is updated again with the new apartment objects.
I want a different output: just one list with all merged objects from the arrays. This data will be passed to the storing component, and I don't want to query the DB multiple times, because for each new mergedApartments list the data would be inserted again, creating duplicate values and throwing an error.
What I've tried: creating a counter that counts the number of times mergeData is called, and then doing the merge logic, but with no success.
I just want a single output of mergedApartments.length, in this case 39.
Thanks!
EDIT
Here is how one of the scrapers looks:
const merge = require('../data-functions/mergeData');
const axios = require('axios');
const cheerio = require('cheerio');
//function for olx.ba scraper. Fetching raw html data and pushing it into array of objects. Passing data to merge function
exports.santScraper = (count) => {
  const url = `https://www.sant.ba/pretraga/prodaja-1/tip-2/cijena_min-20000/stranica-${count}`;
  const santScrapedData = [];
  const getRawData = async () => {
    try {
      await axios.get(url).then((response) => {
        const $ = cheerio.load(response.data);
        $('div[class="col-xxs-12 col-xss-6 col-xs-6 col-sm-6 col-lg-4"]').each(
          (index, element) => {
            const getLink = $(element).find('a[class="re-image"]').attr('href');
            const getDescription = $(element).find('a[class="title"]').text();
            const getPrice = $(element)
              .find('div[class="prices"] > h3[class="price"]')
              .text()
              .replace(/\.| ?KM$/g, '')
              .replace(',', '.');
            const getPicture = $(element).find('img').attr('data-original');
            const getSquaremeters = $(element)
              .find('span[class="infoCount"]')
              .first()
              .text()
              .replace(',', '.')
              .split('m')[0];
            const pricepersquaremeter =
              parseFloat(getPrice) / parseFloat(getSquaremeters);
            santScrapedData[index] = {
              id: getLink.substring(42, 46),
              link: getLink,
              descr: getDescription,
              price: Math.round(getPrice),
              pictures: getPicture,
              sqm: Math.round(getSquaremeters),
              ppm2: Math.round(pricepersquaremeter),
            };
          }
        );
        merge.mergeData(santScrapedData); // here I'm calling the function and passing the array to it
      });
    } catch (error) {
      console.log(error);
    }
  };
  getRawData();
};
The other scraper looks the same (it calls the function the same way).
For this, you need to use the concat function from the Array prototype:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/concat
exports.mergeData = (apartments) => {
  mergedApartments = mergedApartments.concat(apartments);
};

exports.sendData = () => {
  console.log(mergedApartments.length);
};
and in your main script
getRawData().then(merge.sendData);

Running a request inside of a request (Node & Cheerio)

I am building a web scraper that uses node, request, and cheerio.
The function I have written takes a number as a parameter that is used as the upper bound of a for loop; each number in the loop corresponds to a URL on a website that contains an index of blog posts. The function returns the title of every blog post on each URL and the href contained in each post title. Then another request is run that returns every href contained on each individual post's page, using the href of the corresponding post title as an input value. So my output in the terminal should be formatted like this:
Title: Some Blog Post Title 1
Link: Some Blog Post Link 1
Blog Post Links: List of Links on Blogs Page 1
Title: Some Blog Post Title 2
Link: Some Blog Post Link 2
Blog Post Links: List of Links on Blogs Page 2
But instead it is coming out like this:
Title: Some Blog Post Title 1
Link: Some Blog Post Link 1
Title: Some Blog Post Title 2
Link: Some Blog Post Link 2
Giant list of blog post links
So my code is functional in that it retrieves all of the correct information for me, but it is not in the right format. The current output isn't helpful for me because I need to be able to tell which links correspond to each page rather than a giant list of links.
I have researched my problem and I'm pretty sure that this is happening because of the asynchronous nature of the code. My function is very similar to the question posed here but mine is different in the sense that there is a second request being ran using the output from the first request as an input in addition to the loop.
So my question is how to reformat my code to get my output to return in the desired order?
function scrapeUrls(num) {
  for (var i = 1; i <= num; i++) {
    request(`https://www.website.com/blog?page=${i}`, (error, response, html) => {
      if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        $('.group-right').each((i, el) => {
          const articleTitle = $(el)
            .find('h2')
            .text();
          const articleLink = $(el)
            .find('a')
            .attr('href');
          console.log(`Title: ${articleTitle}\nLink: ${articleLink}\nBlog Post Links:`);
          request(`https://www.website.com/${articleLink}`, (error, response, html) => {
            if (!error && response.statusCode == 200) {
              const $ = cheerio.load(html);
              $('.main-container').each((i) => {
                links = $('a');
                $(links).each(function (i, link) {
                  console.log($(link).text() + ':\n  ' + $(link).attr('href'));
                });
              });
            }
          });
        });
      }
    });
  }
}
As #Dshiz pointed out, you need to await the promises if you want to keep the order. I suggest using node-fetch instead of request, since it returns an actual promise that can be awaited:
let cheerio = require('cheerio');
let fetch = require('node-fetch');

function getArticlesLinks(html) {
  const $ = cheerio.load(html);
  let articles = [];
  $(".group-right").each((i, el) => {
    const articleTitle = $(el).find("h2").text();
    const articleLink = $(el).find("a").attr("href");
    console.log(`Title: ${articleTitle}\nLink: ${articleLink}\nBlog Post Links:`);
    articles.push(articleLink);
  });
  return articles;
}

function getLinks(html) {
  const $ = cheerio.load(html);
  $(".main-container").each((i) => {
    const links = $("a");
    $(links).each(function (i, link) {
      console.log(
        $(link).text() + ":\n  " + $(link).attr("href")
      );
    });
  });
}

async function scrapeUrls(num) {
  for (var i = 1; i <= num; i++) {
    // Fetch the page
    let pageResponse = await fetch(`https://www.website.com/blog?page=${i}`);
    if (pageResponse.ok) {
      let pageHtml = await pageResponse.text();
      // Extract articles' links
      let articles = getArticlesLinks(pageHtml);
      // For each article, fetch it and extract its links
      for (let a of articles) {
        let articleResponse = await fetch(`https://www.website.com/${a}`);
        if (articleResponse.ok) {
          let articleHtml = await articleResponse.text();
          getLinks(articleHtml);
        }
      }
    }
  }
}

scrapeUrls(4)
  .then(() => console.log('done'))
  .catch(console.error);
Here I turned the scrapeUrls function into an async function so I can await inside the loops.
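If strict log ordering matters less than speed, the page fetches can also run concurrently while the results stay in page order, because Promise.all preserves the order of its input array. A sketch with an injectable `fetchText` (an assumed stand-in for node-fetch plus `.text()`), so the ordering behavior is visible without the network:

```javascript
// Fetch pages 1..num concurrently; results come back in page order
// because Promise.all resolves to an array in input order.
async function scrapePages(num, fetchText) {
  const pageNumbers = Array.from({ length: num }, (_, i) => i + 1);
  return Promise.all(
    pageNumbers.map((i) => fetchText(`https://www.website.com/blog?page=${i}`))
  );
}
```

Note this only parallelizes the outer page fetches; the per-article requests could be handled the same way inside each page's result.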

Web Scrape with Puppeteer within a table

I am trying to scrape this page.
https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214
I want to be able to find the grade count for PSA 9 and 10. If we look at the HTML of the page, you will notice that PSA does a very bad job (IMO) at displaying the data. Every TR is a player. And the first TD is a card number. Let's just say I want to get Card Number 1 which in this case is Kevin Garnett.
There are a total of four cards, so those are the only four cards I want to display.
Here is the code I have.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
  const tr = await page.evaluate(() => {
    const tds = Array.from(document.querySelectorAll('table tr'));
    return tds.map(td => td.innerHTML);
  });
  const getName = tr.map(name => {
    // const thename = Array.from(name.querySelectorAll('td.card-num'))
    console.log("\n\n" + name + "\n\n");
  });
  await browser.close();
})();
I get each TR printed, but I can't seem to dive into those TRs. You can see I have a line commented out; I tried that but got an error. Right now I am not selecting a specific player dynamically. The easiest way, I think, would be a function that selects the TR whose TD.card-num equals 1 for Kevin.
Any help with this would be amazing.
Thanks
Short answer: you can just copy and paste that table into Excel and it pastes perfectly.
Long answer: if I'm understanding this correctly, you'll need to map over all of the tr elements and then, within each tr, map each td. I use cheerio as a helper. To complete it with puppeteer, just do html = await page.content() and then pass html into the cleaner I've written below:
const cheerio = require("cheerio");
const fs = require("fs");

const test = (html) => {
  // const data = fs.readFileSync("./test.html");
  // const html = data.toString();
  const $ = cheerio.load(html);
  const array = $("tr").map((index, element) => {
    const card_num = $(element).find(".card-num").text().trim();
    const player = $(element).find("strong").text();
    const mini_array = $(element).find("td").map((ind, elem) => {
      const hello = $(elem).find("span").text().trim();
      return hello;
    });
    return {
      card_num,
      player,
      column_nine: mini_array[13],
      column_ten: mini_array[14],
      total: mini_array[15],
    };
  });
  console.log(array[2]);
};

// test(html); // call with the html from puppeteer's page.content()
The code above will output the following:
{
  card_num: '1',
  player: 'Kevin Garnett',
  column_nine: '1-0',
  column_ten: '0--',
  total: '100'
}
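To pick out one card dynamically, as the question asks, you can filter the mapped rows by card number. A small sketch over row objects shaped like the answer's output (the helper name `findByCardNum` is made up):

```javascript
// rows: array of objects like { card_num, player, ... } from the cleaner above
function findByCardNum(rows, cardNum) {
  // card_num is scraped as a string, so normalize the lookup value
  return rows.find((row) => row.card_num === String(cardNum));
}
```

For example, `findByCardNum(rows, 1)` would return the Kevin Garnett row.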

Get the ChildrenCount of a child in Firebase using JavaScript

I have been at this for an hour. I simply want to get the number of children of the "Success" node in the database below. The answers in similar Stack Overflow questions are not working. I am new to JavaScript programming.
So far I have tried this
var children = firebase.database().ref('Success/').onWrite(event => {
  return event.data.ref.parent.once("value", (snapshot) => {
    const count = snapshot.numChildren();
    console.log(count);
  });
});
Where might I be going wrong?
As explained in the doc, you have to use the numChildren() method, as follows:
var ref = firebase.database().ref("Success");
ref.once("value")
  .then(function(snapshot) {
    console.log(snapshot.numChildren());
  });
If you want to use this method in a Cloud Function, you can do as follows:
exports.children = functions.database
  .ref('/Success')
  .onWrite((change, context) => {
    console.log(change.after.numChildren());
    return null;
  });
Note that:
The new syntax for Cloud Functions version > 1.0 is used, see https://firebase.google.com/docs/functions/beta-v1-diff?authuser=0
You should not forget to return a promise or a value to indicate to the platform that the Cloud Function execution is completed (for more details on this point, you may watch the 3 videos about "JavaScript Promises" from the Firebase video series: https://firebase.google.com/docs/functions/video-series/).
const db = getDatabase(app);
const questionsRef = ref(db, 'questions');
const mathematicalLiteracy = child(questionsRef, 'mathematicalLiteracy');

onValue(mathematicalLiteracy, (snapshot) => {
  const data = snapshot.val();
  const lenML = data.length - 1;
  console.log(lenML);
});
This method worked for me. I wanted to get the child count of the mathematicalLiteracy node in my database tree. Getting its value with .val() returns an array that contains that node's children plus an extra empty item, so I subtracted that one empty item to arrive at the child count I needed.
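Note that subtracting 1 only works when .val() happens to return a sparse array (which Firebase produces when child keys look like numeric indices starting at 1). Counting keys is more robust, since Object.keys skips array holes and also handles the plain-object case. A sketch (the helper name is made up):

```javascript
// Count the children of a snapshot value, whether Firebase returned an
// object keyed by child name or a (possibly sparse) array.
function childCount(value) {
  if (value == null) return 0;     // node missing or empty
  return Object.keys(value).length; // array holes are not enumerated
}
```

With a sparse array like `[ , 'q1', 'q2' ]` this returns 2 without any manual adjustment, and it returns the same answer for `{ q1: ..., q2: ... }`.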
