Web scraping and promises - javascript

I am using cheerio and node to do web scraping, but I have a problem with promises. I can scrape an article list from a page but in that list, we have more links for single pages. I need to scrape single pages as well for each item on the list.
Here is my code:
import rp from 'request-promise'
import cheerio from 'cheerio'
import conn from './connection'
const flexJob = `https://www.flexjobs.com`
const flexJobCategory = ['account-management', 'bilingual']
class WebScraping {
//list of article e.g for page 2
results = [] // [[title], [link for page],...]
contentPage = [] //content for each page
scrapeWeb(link) {
let fullLink = `${link}/jobs/${flexJobCategory[1]}?page=2`
const options = {
uri: fullLink,
transform(body) {
return cheerio.load(body)
}
}
rp(options)
.then(($) => {
console.log(fullLink)
$('.featured-job').each((index, value) => {
//html nodes
let shortDescription = value.children[1].children[1].children[3].children[1].children[1].children[0].data
let link = value.children[1].children[1].children[1].children[1].children[1].children[0].attribs.href
let pageLink = flexJob + '' + link
let title = value.children[1].children[1].children[1].children[1].children[1].children[0].children[0].data
let place = value.children[1].children[1].children[1].children[1].children[3].children[1].data
let jobType = value.children[1].children[1].children[1].children[1].children[3].children[0].children[0].data
this.results.push([title, '', pageLink.replace(/\s/g, ''), '', shortDescription.replace(/\n/g, ''), place, jobType, 'PageContent::: '])
})
})
.then(() => {
this.results.forEach(element => {
console.log('link: ', element[2])
this.scrapePage(element[2])
});
})
.then(() => {
console.log('print content page', this.contentPage)
})
.then(() => {
//this.insertIntoDB()
console.log('insert into db')
})
.catch((err) => {
console.log(err)
})
}
/**
* Scrapes a single job page linked from the list of jobs.
* @param {string} pageLink
*/
scrapePage(pageLink) {
let $this = this
//console.log('We are in ScrapePage' + pageLink + ': number' + count)
//this.results[count].push('Hello' + count)
let content = ''
const options = {
uri: pageLink,
transform(body) {
return cheerio.load(body)
}
}
rp(options)
.then(($) => {
//this.contentPage.push('Hello' + ' : ');
console.log('Heloo')
})
.catch((err) => {
console.log(err)
})
}
/**
* This method is going to insert data into Database
*/
insertIntoDB() {
conn.connect((err) => {
var sql = "INSERT INTO contact (title, department, link, salary, short_description, location, job_type, page_detail) VALUES ?"
var values = this.results
conn.query(sql, [values], function (err) {
if (err) throw err
conn.end()
})
})
}
}
let webScraping = new WebScraping()
let scrapeList = webScraping.scrapeWeb(flexJob)
So, in the 'scrapeWeb' method, the second '.then' calls the 'scrapePage' method. However, the third '.then' runs before the promise inside 'scrapePage' has resolved.

You need a little more control flow at that stage. You do not want that .then()'s promise to resolve until all the calls are resolved.
You could use a Promise library like bluebird to do a Promise.each or a Promise.map for all the results you want to run.
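Or use async/await: make the callback async, as in .then(async () => {}), and replace .forEach with a for...of loop. This only works if scrapePage returns its promise (return rp(options)); otherwise await has nothing to wait for.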
for(let element of this.results){
console.log('link: ', element[2])
await this.scrapePage(element[2])
}

You have a race condition problem.
The first tweak you'll need is to have scrapePage return a Promise.
scrapePage(pageLink) {
let $this = this
let content = ''
const options = {
uri: pageLink,
transform(body) {
return cheerio.load(body)
}
}
return rp(options);
}
In the second then, you need to start the scraping of all child pages and return the combined promise, e.g.:
.then(() => {
// element[2] holds the page link
return Promise.all(this.results.map(element => this.scrapePage(element[2])));
})
This starts a scrape for every child page, and the chain only continues once all of them have resolved.
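Putting the two answers together, here is a minimal self-contained sketch. `fakeScrapePage` is a hypothetical stand-in for `scrapePage` (it resolves after a short timer instead of calling `rp(options)`), and the sample rows and URLs are made up. It shows the parallel `Promise.all` variant next to the sequential `for...of` variant; in both, the caller only continues once every child page has resolved.

```javascript
// Hypothetical stand-in for scrapePage(): returns a promise, like `return rp(options)` would.
function fakeScrapePage(pageLink) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`content of ${pageLink}`), 10);
  });
}

// Parallel: start every request at once and wait for all of them.
function scrapeAllParallel(results) {
  // element[2] is the page link in the question's result rows.
  return Promise.all(results.map((element) => fakeScrapePage(element[2])));
}

// Sequential: one request at a time, in list order.
async function scrapeAllSequential(results) {
  const contentPage = [];
  for (const element of results) {
    contentPage.push(await fakeScrapePage(element[2]));
  }
  return contentPage;
}

// Example rows shaped like the question's `results` (values are made up).
const sampleResults = [
  ['Job A', '', 'https://example.com/a'],
  ['Job B', '', 'https://example.com/b'],
];

scrapeAllParallel(sampleResults).then((contentPage) => {
  console.log('print content page', contentPage);
});
```

The parallel version is usually faster, while the sequential one is gentler on the target site; which is appropriate depends on how many pages are in the list.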

Related

How to filter two arrays of splitting?

I'm a bit confused
I am sending emails with nodemailer, and every time I send one I run some validations to enforce the size limit for attachments. If the total attachment size exceeds the configured limit, the service splits the email and sends several emails with the same subject and body, each carrying part of the attachments.
Whenever this happens, _.chunk splits the array of PDFs into smaller arrays. Before that, a method prepares the attachments: it takes the information obtained from the API, including the PDF buffer, and builds the attachment content for the emails.
What I want to do now is take the array produced by the preparation step (before the files are split), find the entries that are equal to the attachments in the current chunk, and run an action when they match.
In pseudocode:
If getAmount.pdfBuffer === attachmentMap
// doAction console.log('Equals')
I tried to do this but couldn't get it to work. I don't know whether it's because a getAmount array is generated for each chunk of attachments. What am I doing wrong?
async sendEmail(
{
para: to,
asunto: subject,
plantilla: template,
contexto: context,
}: CorreoInfoDto,
attachments: EmailAttachment[],
driveConfig: OAuthGoogleConfig
) {
const totalSize: number = this.getSizeFromAttachments(attachments);
const chunkSplit = Math.floor(isNaN(totalSize) ? 1 : totalSize / this.LIMIT_ATTACHMENTS) + 1;
const attachmentsChunk: any[][] = _.chunk(attachments, chunkSplit);
if ((totalSize > this.LIMIT_ATTACHMENTS) && attachmentsChunk?.length >= 1) {
await Promise.all(
attachmentsChunk?.map(async (attachment: EmailAttachment[], index) => {
console.log('attachment', attachment)
if (this.getSizeFromAttachments(attachment) > this.LIMIT_ATTACHMENTS) {
const result: GenerateDriveLinkResponse[] = await Promise.all(
attachment?.map(item => {
const file = new GoogleDriveUploadFile({
name: item?.filename,
mimeType: MimeTypesEnum.PDF,
body: item?.content
});
return this.uploadFilesService.uploadToDrive(driveConfig, file) as any;
})
)
const texto = result?.map((item, index) => {
console.log('item', item?.webViewLink);
console.log('index', index);
return new SolicitudXLinkDrive({
texto: attachment[index].filename,
link: item?.webViewLink
})
});
context.links = texto;
const link = `(${index + 1}/${attachmentsChunk?.length - 1})`;
const newContext = {
getCurrent: link,
...context
}
const prepareEmail = this.prepareEmail({
para: to,
asunto: ` ${subject} (${index + 1}/${attachmentsChunk?.length})`,
plantilla: template,
contexto: newContext,
}, []);
return prepareEmail
} else {
// this.getCantidad = `(${index + 1}/${attachmentsChunk?.length - 1})`;
console.log('getCantidad', this.getAmount );
const attachmentMap = attachment.map(element => element.content);
this.getAmount.forEach(element => {
if (element.pdfBuffer === attachmentMap) {
console.log('do action');
}
})
const link = ` (${index + 1}/${attachmentsChunk?.length - 1})`;
const newContext = {
getCurrent: link,
...context
}
return this.prepareEmail({
para: to,
asunto: ` ${subject} (Correo ${index + 1}/${attachmentsChunk?.length - 1})`,
plantilla: template,
contexto: newContext,
}, attachment);
}
})
);
} else {
await this.prepareEmail(
{
para: to,
asunto: ` ${subject}`,
plantilla: template,
contexto: context,
},
attachments,
);
}
}
async prepareEmail(
{
para: to,
asunto: subject,
plantilla: template,
contexto: context,
}: CorreoInfoDto,
attachments: EmailAttachment[],
) {
return await this.mailerService.sendMail({
to,
from: `${process.env.SENDER_NAME} <${process.env.EMAIL_USER}>`,
subject,
template,
attachments: attachments,
context: context,
});
}
async sendEmails(correos: EnvioMultiplesCorreosDto) {
let pdf = null;
let info: ConfiguracionDocument = null;
let GDriveConfig: ConfiguracionDocument = null;
let logo: ConfiguracionDocument = null;
let forContext = {};
const documents = Array.isArray(correos.documento_id) ? correos.documento_id : [correos.documento_id];
const solicitudes = await this.solicitudesService.findByIds(documents);
const nombresPacientes = solicitudes.reduce((acc, cv) => {
acc[cv.correlativo_solicitud] = cv['info_paciente']?.nombre_paciente;
return acc;
}, {});
await Promise.all([
await this.getPdf(correos.tipo_reporte, correos.documento_id, correos?.dividir_archivos).then(data => { pdf = data; }),
await this.configuracionesService.findByCodes([
ConfigKeys.TEXTO_CORREO_MUESTRA,
ConfigKeys[process.env.DRIVE_CONFIG_API],
ConfigKeys.LOGO_FIRMA_PATMED
]).then(data => {
info = data[0];
GDriveConfig = data[1];
logo = data[2];
})
]);
forContext = this.configuracionesService.castValorObjectToObject(info?.valor_object)
const attachmentPrepare = this.prepareAttachments(pdf as any, nombresPacientes);
await this.sendEmail(
{
para: correos.para,
asunto: correos.asunto,
plantilla: 'muestras',
contexto: {
cuerpo: correos.cuerpo,
titulo: forContext[EnvioCorreoMuestraEnum.titulo],
direccion: forContext[EnvioCorreoMuestraEnum.direccion],
movil: forContext[EnvioCorreoMuestraEnum.movil],
pbx: forContext[EnvioCorreoMuestraEnum.pbx],
email: forContext[EnvioCorreoMuestraEnum.email],
logo: logo?.valor,
},
},
attachmentPrepare,
this.configuracionesService.castValorObjectToObject(GDriveConfig?.valor_object) as any,
);
const usuario = new UsuarioBitacoraSolicitudTDTO();
usuario.createFromUserRequest(this.sharedService.getUserFromRequest());
solicitudes.forEach((solicitud) => {
const actual = new BitacoraSolicitudDTO();
actual.createFromSolicitudDocument(solicitud);
const newBitacora = new CrearBitacoraSolicitudDTO();
newBitacora.createNewItem(null, actual, actual, usuario, AccionesBitacora.EmailEnviado);
this.bitacoraSolicitudesService.create(newBitacora);
});
}
prepareAttachments(item: BufferCorrelativosDTO | BufferXSolicitudDTO[], nombresPacientes: { [key: string]: string }) {
if (this.sharedService.isAnArray(item)) {
const castItem: BufferXSolicitudDTO[] = item as any;
this.getAmount = castItem;
return castItem?.map((s) => {
const namePatient = nombresPacientes[s.correlativo_solicitud];
return new EmailAttachment().setFromBufferXSolicitudDTO(s, namePatient, 'pdf');
});
} else {
return [new EmailAttachment().setFromBufferCorrelativosDTO(item as any, 'pdf')];
}
}
Thank you very much for your attention, I appreciate it. Cheers
You could try lodash: its _.intersectionBy and _.intersectionWith functions let you compare two arrays and keep the values they have in common.
There are some good examples in the question "How to get intersection with lodash?".
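One reason the comparison in the question never matches is that `element.pdfBuffer === attachmentMap` compares a single value against a whole array, and `===` on objects checks identity rather than contents. Below is a small sketch of a content-based comparison in plain JavaScript. It assumes that `pdfBuffer` and `content` are Node.js Buffers (the question does not show their types); `findMatchingBuffers` and the sample data are hypothetical.

```javascript
// Assumption: pdfBuffer and content are Node.js Buffers (not shown in the question).
// Returns the entries of `prepared` whose pdfBuffer has the same bytes
// as the content of at least one attachment in the chunk.
function findMatchingBuffers(prepared, attachments) {
  return prepared.filter((item) =>
    attachments.some((attachment) => item.pdfBuffer.equals(attachment.content))
  );
}

// Made-up sample data for illustration.
const prepared = [
  { correlativo_solicitud: 'A-1', pdfBuffer: Buffer.from('pdf one') },
  { correlativo_solicitud: 'A-2', pdfBuffer: Buffer.from('pdf two') },
];
const chunk = [{ filename: 'one.pdf', content: Buffer.from('pdf one') }];

const matches = findMatchingBuffers(prepared, chunk);
matches.forEach((match) => console.log('do action for', match.correlativo_solicitud));
```

With lodash, `_.intersectionWith(prepared, chunk, (a, b) => a.pdfBuffer.equals(b.content))` should express the same idea.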

Wait for server response with axios from different file React

I have a loop. On each round I need to add Question data into MongoDB database. This works fine. However, I want to get _id of the new inserted Question before the loop goes into the next round. This is where I have a problem. It takes certain amount of time before the server returns _id and loop goes to the next round by that time. Therefore, I need a way to wait for the server response and only after that move to the next round of the loop.
Here is my back-end code:
router.post("/createQuestion", (req, res) => {
const newQuestion = new Question({
description: req.body.description,
type: req.body.type,
model: req.body.model
});
newQuestion.save().then(question => res.json(question._id))
.catch(err => console.log(err));
});
Here is my axios function, which is in a separate file and imported into the class:
export const createQuestion = (questionData) => dispatch => {
axios.post("/api/scorecard/createQuestion", questionData)
.then(res => {
return res.data;
}).catch(err =>
console.log("Error adding a question")
);
};
Here is my code inside my class:
JSON.parse(localStorage.getItem(i)).map(question => {
const newQuestion = {
description: question.description,
type: question.questionType,
model: this.props.model
}
const question_id = this.props.createQuestion(newQuestion);
console.log(question_id);
});
Console shows undefined.
I faced the same issue. I solved it by sending the whole array of questions to the Node server, which reads the questions one by one and gives each one the next question ID.
router.post("/createQuestion", async (req, res) => {
let d =[questionarray];
let i = 0;
let length = d.length;
var result = [];
try {
const timeoutPromise = (timeout) => new Promise((resolve) => setTimeout(resolve, timeout));
for (i = 0; i < length; i++) {
await timeoutPromise(1000); // 1000 = 1 second
let CAT_ID = parseInt(d[i].CAT_ID);
let TOPIC_ID = parseInt(d[i].TOPIC_ID);
let Q_DESC = (d[i].Q_DESC);
let OPT_1 = (d[i].OPT_1);
let OPT_2 = (d[i].OPT_2);
let OPT_3 = (d[i].OPT_3);
let OPT_4 = (d[i].OPT_4);
let ANS_ID = (d[i].ANS_ID);
let TAGS = (d[i].TAGS);
let HINT = (d[i].HINT);
let LEVEL = d[i].LEVEL;
let SRNO = d[i].SrNo;
let qid;
const savemyData = async (data) => {
return await data.save()
}
var myResult = await Question.find({ TOPIC_ID: TOPIC_ID }).countDocuments(async function (err, count) {
if (err) {
console.log(err);
}
else {
if (count === 0) {
qid = TOPIC_ID + '' + 10001;
const newQuestion = new Question({
Q_ID: qid,
CAT_ID: CAT_ID,
TOPIC_ID: TOPIC_ID,
Q_ID: qid,
Q_DESC: Q_DESC,
OPT_1: OPT_1,
OPT_2: OPT_2,
OPT_3: OPT_3,
OPT_4: OPT_4,
ANS_ID: ANS_ID,
HINT: HINT,
TAGS: TAGS,
LEVEL: LEVEL,
Q_IMAGE: ''
})
await savemyData(newQuestion)
.then(result => { return true })
.catch(err => { return false });
//`${SRNO} is added successfully`
//`${SRNO} is Failed`
}
else if (count > 0) {
// console.log(count)
Question.find({ TOPIC_ID: TOPIC_ID }).sort({ Q_ID: -1 }).limit(1)
.then(async question => {
qid = question[0].Q_ID + 1;
const newQuestion = new Question({
Q_ID: qid,
CAT_ID: CAT_ID,
TOPIC_ID: TOPIC_ID,
Q_ID: qid,
Q_DESC: Q_DESC,
OPT_1: OPT_1,
OPT_2: OPT_2,
OPT_3: OPT_3,
OPT_4: OPT_4,
ANS_ID: ANS_ID,
HINT: HINT,
TAGS: TAGS,
LEVEL: LEVEL,
Q_IMAGE: ''
})
await savemyData(newQuestion)
.then(result => { return true })
.catch(err => { return false });
})
.catch(err => console.log(err));
}
}
});
if (myResult)
result.push(`${SRNO} is added successfully`);
else
result.push(`${SRNO} is Failed`);
}
// console.log(result)
return res.json(result);
}
catch (err) {
//res.status(404).json({ success: false })
console.log(err)
}
});
First, your createQuestion function doesn't return a value, so question_id will always be undefined. Since createQuestion takes dispatch, I assume you are using Redux, so I would suggest using redux-thunk: move the request into a thunk action and read the question ID from the Redux state rather than returning it from createQuestion. In your class you can then listen for a change of the question ID and, when it changes, dispatch the saving of the next question.
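The first point can be illustrated without Redux. In the sketch below, `fakePost` is a hypothetical stand-in for `axios.post` (it resolves with an object shaped like an axios response, and the id format is made up). The two changes that matter are that `createQuestion` returns its promise, and that the loop uses `for...of` with `await` so each request finishes before the next one starts.

```javascript
// Hypothetical stand-in for axios.post(); resolves with { data: <new _id> } after a short delay.
function fakePost(url, questionData) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ data: `id-for-${questionData.description}` }), 10);
  });
}

// The key change: return the promise chain so the caller can wait for the _id.
const createQuestion = (questionData) =>
  fakePost('/api/scorecard/createQuestion', questionData).then((res) => res.data);

// for...of with await finishes each request before starting the next one;
// .map() or .forEach() would fire them all without waiting.
async function saveQuestions(questions) {
  const ids = [];
  for (const question of questions) {
    const questionId = await createQuestion(question);
    ids.push(questionId);
  }
  return ids;
}

saveQuestions([{ description: 'Q1' }, { description: 'Q2' }]).then((ids) => {
  console.log(ids);
});
```

With redux-thunk, `dispatch` returns whatever the inner function returns, so returning the axios promise from the thunk should allow the same `await this.props.createQuestion(newQuestion)` pattern in the component.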

How to use promise and loop over mongoose collection

I'm making chat inside my website. To store data I use Chat, User, Messages collections.
I want results to be in Array containing:
[{
username (another one, not me)
last update
last message
}]
In Chat model I have only chatid and array of two members, so I need to loop through User collection to get user name using user id from it. I want to save in array all names (in future I would also like to loop through messages to get latest messages for each chatid). Issue is that when I return chatsList it is empty. I think I need somehow to use Promise, but I'm not completely sure how it should work.
Chat.find({ members: userId })
.then(chats => {
let chatsList = [];
chats.forEach((chat, i) => {
let guestId = chat.members[1 - chat.members.indexOf(userId)];
User.findOne({ _id: guestId })
.then(guest => {
let chatObj = {};
name = guest.name;
chatsList.push(name);
console.log("chatsList", chatsList)
})
.catch(err => console.log("guest err =>", err))
})
return res.json(chatsList)
})
.catch(err => {
errors.books = "There are no chats for this user";
res.status(400).json(errors);
})
Indeed, Promise.all is what you are looking for:
Chat.find({ members: userId })
.then(chats => {
let userPromises = [];
chats.forEach((chat, i) => {
let guestId = chat.members[1 - chat.members.indexOf(userId)];
userPromises.push(User.findOne({ _id: guestId }));
});
return Promise.all(userPromises).then(guests => {
let chatsList = [];
guests.forEach(guest => {
chatsList.push(guest.name);
});
return res.json(chatsList);
});
});
That said, it would probably be better to make a single call to the DB with a list of ids (an $in query). Something like this:
Chat.find({ members: userId })
.then(chats => {
let ids = [];
chats.forEach((chat, i) => {
let guestId = chat.members[1 - chat.members.indexOf(userId)];
ids.push(guestId);
});
return User.find({_id: {$in: ids}}).then(guests => {
let chatsList = [];
guests.forEach(guest => {
chatsList.push(guest.name);
});
return res.json(chatsList);
});
});
You may want to additionally validate if every id had a corresponding guest.
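A sketch of that validation step, using plain data instead of a live query. An $in query returns each matching document once and is not guaranteed to return them in the order of the id list, so the helper below (hypothetical name `matchGuests`) looks guests up by id, keeps the requested order, and collects ids that had no match. The sample ids and names are made up.

```javascript
// Given the ids that were requested and the documents that came back,
// return names in request order and report ids that had no match.
function matchGuests(ids, guests) {
  // String() makes the lookup work for both plain strings and ObjectId-like values.
  const byId = new Map(guests.map((guest) => [String(guest._id), guest]));
  const names = [];
  const missing = [];
  for (const id of ids) {
    const guest = byId.get(String(id));
    if (guest) {
      names.push(guest.name);
    } else {
      missing.push(id);
    }
  }
  return { names, missing };
}

// Made-up data: the documents came back in a different order and one id had no user.
const requestedIds = ['u1', 'u2', 'u3'];
const foundGuests = [
  { _id: 'u3', name: 'Carol' },
  { _id: 'u1', name: 'Alice' },
];

console.log(matchGuests(requestedIds, foundGuests));
```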
You are running into concurrency issues. chats.forEach starts a User.findOne().then for each chat, but the return statement executes before any of those User.findOne() promises has resolved. That's why your list is empty.
You could get more readable and working code by using async/await:
async function getChatList() {
const chats = await Chat.find({members: userId});
const chatsList = [];
for (const chat of chats) {
let guestId = chat.members[1 - chat.members.indexOf(userId)];
const guest = await User.findOne({_id: guestId});
chatsList.push(guest.name);
}
return chatsList;
}
Then the code to actually send the chat list back to the user:
try {
return res.json(await getChatList());
} catch (err) {
// handle errors;
}
You can try this:
Chat.find({ members: userId }).then(chats => {
let guestHashMap = {};
chats.forEach(chat => {
let guestId = chat.members.filter(id => id != userId)[0];
// depending on if your ID is of type ObjectId('asdada')
// change it to guestHashMap[guestId.toString()] = true;
guestHashMap[guestId] = true;
})
return Promise.all(
// it is going to return unique guests
Object.keys(guestHashMap)
.map(guestId => {
// Object.keys() yields strings; Mongoose casts them back to ObjectId
// for the _id field, so the query below works either way
return User.findOne({ _id: guestId })
}))
})
.then(guests => {
console.log(guests.map(guest => guest.name))
res.json(guests.map(guest => guest.name))
})
.catch(err => {
errors.books = "There are no chats for this user";
res.status(400).json(errors);
})

store each api response in an array in React

I am using the metaweather.com API to build a web application. I want to show 6 cities on the home page; I guess I have to call the API 6 times and push the data into an array like allCitiesDetails. How should I do that? If there is a better way, please tell me. Here is my code:
state = {
city: {
cityname: this.props.value
},
woeid: '',
todayWeather: [],
weatherDetails: [],
allCitiesDetails: []
};
getCity = (cityName) => {
var self = this;
axios
.get('https://www.metaweather.com/api/location/search/?query=' + cityName)
.then(response => {
self.setState({
woeid: response.data[0].woeid
});
self.getWeather(response.data[0].woeid);
})
.catch(function(error) {
alert('No results were found. Try changing the keyword!');
});
}
getWeather = async (woeid) => {
const { data: weatherDetails } = await axios.get(
'https://www.metaweather.com/api/location/' + woeid
);
this.setState({
weatherDetails,
todayWeather: weatherDetails.consolidated_weather[0]
});
}
You should make 6 different promises and use Promise.all to get the weather of all 6 cities in parallel. You can do it like this:
const getWeatherFromWoeid = woeid => axios.get(`https://www.metaweather.com/api/location/${woeid}`);
....
const p1 = getWeatherFromWoeid(woeid1);
const p2 = getWeatherFromWoeid(woeid2);
const p3 = getWeatherFromWoeid(woeid3);
const p4 = getWeatherFromWoeid(woeid4);
const p5 = getWeatherFromWoeid(woeid5);
const p6 = getWeatherFromWoeid(woeid6);
Promise.all([p1,p2,p3,p4,p5,p6])
.then(([result1, result2, result3, result4, result5, result6]) => {
// ...set result in the state
})
.catch((err) => {
// ...handle error
})
Also, always handle errors: use .catch with promises, or try/catch with async/await.
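The six separate variables can also be replaced by mapping over an array of city names. The sketch below is self-contained: `fakeGetCityWeather` is a hypothetical stand-in for the two axios calls (search by name, then fetch by woeid), and the response shape and temperature are made up. It resolves after a short timer rather than calling the real API.

```javascript
// Hypothetical stand-in for the two axios calls (search by name, then fetch by woeid).
function fakeGetCityWeather(cityName) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (!cityName) {
        reject(new Error('No results were found'));
      } else {
        resolve({ title: cityName, consolidated_weather: [{ the_temp: 20 }] });
      }
    }, 10);
  });
}

// Build one promise per city and wait for all of them; results keep the input order.
async function getAllCitiesDetails(cityNames) {
  try {
    return await Promise.all(cityNames.map((name) => fakeGetCityWeather(name)));
  } catch (err) {
    // Promise.all rejects as soon as any single request fails.
    console.log('Could not load weather:', err.message);
    return [];
  }
}

getAllCitiesDetails(['London', 'Paris', 'Berlin']).then((allCitiesDetails) => {
  // In the component, a single setState({ allCitiesDetails }) would go here.
  console.log(allCitiesDetails.map((city) => city.title));
});
```

Because `Promise.all` fails as a whole when one request fails, `Promise.allSettled` may be a better fit if the page should still show the cities that did load.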
Instead of setting state inside the API call...
self.setState({
woeid: response.data[0].woeid
});
you can push the values into a temporary array and then set the state once, outside the API call.

Using only one API if another one does not have data

const getUser = async (user) => {
const body = await snekfetch.get('https://www.website.com/api/public/users?name=' + user);
const userInfo = JSON.parse(body.text);
const r = await snekfetch.get('https://www.website.com/api/public/users/' + userInfo.uniqueId + '/profile');
const extraUserInfo = JSON.parse(r.text);
const _message = await client.users.get('437502925019807744').send({ files: ['https://www.website.nl/avatar-imaging/avatarimage?figure=h' + userInfo.figureString + '.png'] });
const avatarImage = _message.attachments.first().url;
return { userInfo, extraUserInfo, avatarImage };
};
getUser(args[0]).then((result) => {
message.channel.send(`${result.userInfo.name}`);
}).catch(function(result) {
console.log(result.userInfo.name);
});
Here I am trying to use 3 APIs, but it always ends up in the catch, even when the user exists in one API and not in the others. I tried using only result.userInfo.name so that only the first API is needed, and I use the first one in the catch as well. When I try a name that exists in the first API but not in the second, I still get: TypeError: Cannot read property 'name' of undefined, because the second API is queried as well. What else can I do to handle this situation? So basically, how can I catch errors only for the first API?
edit: I also tried:
if (extraUserInfo.user.name) {return { userInfo, extraUserInfo, avatarImage };}
else {return { userInfo, avatarImage };}
Fixed it with a try/catch around the second request, inside getUser, after userInfo and avatarImage have been obtained:
try {
const r = await snekfetch.get('https://www.website.com/api/public/users/' + userInfo.uniqueId + '/profile');
const extraUserInfo = JSON.parse(r.text);
return {
userInfo,
extraUserInfo,
avatarImage
};
} catch (error) {
const extraUserInfo = {
'error': 'not-found'
};
return {
userInfo,
extraUserInfo,
avatarImage
};
}
};
getUser(args[0]).then((result) => {
console.log(result.extraUserInfo.error === 'not-found' ? result.userInfo.name : result.extraUserInfo.user.name);
}).catch((error) => {
console.log(error);
});
