Cheerio web scraping Twitter loading different data - javascript

I'm new to web scraping. I'm using Axios to fetch the URL and then accessing the data with Cheerio.
I want to scrape Twitter to get my account's number of followers. I inspected the element that holds the follower count and tried to select it, but it doesn't return anything.
So I tried logging every span tag on the page instead, and it returns the string "Something went wrong, but don’t fret — let’s give it another shot."
When I inspect the page, I can see the elements, but when I click on "view page source" it shows something completely different.
I found that the string "Something went wrong, but don’t fret — let’s give it another shot." is located in the page source here:
The element I want, when inspecting my Twitter page, is:
This is my JS code:
const cheerio = require('cheerio');
const axios = require('axios');

axios('https://twitter.com/SaudAlghamdi97')
    .then(response => {
        run();

        async function run() {
            const html = await response.data;
            const $ = cheerio.load(html);
            $('span').each((i, el) => {
                console.log($(el).text());
            });
        }
    });
This is what I get in the terminal:
Am I missing something here? I'm struggling to scrape the number of followers.

The data you request seems to be rendered by JavaScript. You'll need another library, for example Puppeteer, which is able to view the rendered page the same way you see it in your browser.
"Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol"

Related

Javascript, fetch a url instead of a direct file path

This works fine:
fetch("./xml/Stories.xml").then((response) => {
response.text().then((xml) => {
xmlContent = xml;
But I would like to get my data from a website that has a link which only displays the XML. How would I go about retrieving the data through a link instead of a direct file path?
I.e.:
fetch("https://Example.com/").then((response) => {
response.text().then((xml) => {
xmlContent = xml;
What you are trying to do is called web scraping. You have to scrape the link you need out of the web page before you try to fetch its XML content.
While scraping is generally a bad idea, it is certainly possible. Find a pattern or an element ID / class name present on the element containing the link, and use an HTTP request to first fetch the web page's HTML content:
const request = require('request');

request('http://stackabuse.com', function(err, res, body) {
    console.log(body); // The HTML content
});
Then you use an HTML parser library like cheerio to turn the raw HTML string into traversable objects and get the link you need; fetch that link and you have your XML content.
The downside to web scraping is that if the owner of the web page decides to edit their HTML content, it will probably break your scraper, because the pattern you matched for the link will no longer be valid.
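As a rough sketch of those two steps combined (the URL and the .xml-link class are placeholders; swap in whatever selector actually matches the link on the real page):
const request = require('request');
const cheerio = require('cheerio');

// Placeholder URL and selector: replace them with the real page and a
// selector that matches the element containing the XML link.
request('https://Example.com/', (err, res, body) => {
    if (err) return console.error(err);

    const $ = cheerio.load(body);
    const xmlUrl = $('a.xml-link').attr('href'); // hypothetical class name

    // Fetch the extracted link to get the raw XML content
    request(xmlUrl, (err2, res2, xml) => {
        if (err2) return console.error(err2);
        console.log(xml);
    });
});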

Can I access elements from a web page with JavaScript?

I'm making a Discord bot in JavaScript and implementing a feature where, when you ask a coding question, it gives you a snippet. I'm using Grepper and returning the URL with the search results. For example:
Hello World in JavaScript Search Results. I would like to access the div containing the snippet. Is this possible? And how would I do it?
Here's my code:
if (message.startsWith('programming')) {
    // Command = programming
    message = message.replace('programming ', ''); // Remove programming from the string
    message = encodeURIComponent(message); // Encode the string for a url
    msg.channel.send(`https://www.codegrepper.com/search.php?answer_removed=1&q=${message}`); // Place formatted string into url and send it to the discord server
    // Here the program should access the element containing the snippet instead of sending the url:
}
I'm new to JavaScript so sorry if this is a stupid question.
As far as I know, the API you are using returns HTML/text data, not JSON. Grepper has a lot more APIs if you look into them; you can instead use this API, which returns JSON data. If you need more information you can check this Unofficial List of Grepper APIs:
https://www.codegrepper.com/api/get_answers_1.php?v=2&s=${SearchQuery}&u=98467
How do I access the div containing the snippet?
To access the div you would need to do web scraping (e.g. with Python) to scrape the innerHTML of the div, but I think it's easier to use the other API.
Or
You can put /api/ in the URL, like:
https://www.codegrepper.com/api/search.php?answer_removed=1&q=js%20loop
The easiest way for this is to send a GET request to the underlying API:
https://www.codegrepper.com/api/search.php?q=hello%20world%20javascript&search_options=search_titles
This will return the answers in JSON format. Obviously you'd have to adjust the parameters.
How did I find out about this?
Simply look at the network tab of your browser's dev tools while loading the page. You'll see a GET request being sent out to the endpoint, returning the answers mentioned above as JSON.
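As a quick sanity check of that endpoint from Node (the exact shape of the JSON is an assumption based on the answer below, which reads an answers array):
const fetch = require('node-fetch');

fetch('https://www.codegrepper.com/api/search.php?q=hello%20world%20javascript&search_options=search_titles')
    .then(res => res.json())
    .then(json => console.log(json.answers)); // assumed to be an array of answer objects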
The best way is to use the Grepper API.
Install node-fetch
npm i node-fetch. You need this package for making requests to the API.
To import it in your code, just type:
const fetch = require('node-fetch');
Write the code
Modify your code like this:
if (message.startsWith('programming')) {
    message = message.replace('programming ', '');
    message = encodeURIComponent(message);
    // Making the request
    fetch(`https://www.codegrepper.com/api/search.php?answer_removed=1&q=${message}`)
        .then(res => res.json())
        .then(response => {
            // response is a JSON object containing all the data you need,
            // now you need to parse this data
            const answers = response.answers; // this is an array of objects
            const answers_limit = 3; // set a limit for the answers
            // checking if there is any answer
            if (answers.length == 0) {
                return msg.channel.send("No answers were found!");
            }
            // creating the embed
            const embed = new Discord.MessageEmbed()
                .setTitle("Here are the answers to your question!")
                .setDescription("");
            // parsing
            for (let i = 0; i < answers_limit; i++) {
                if (answers[i]) {
                    embed.description += `**${i + 1}° answer**:\n\`\`\`js\n${answers[i].answer}\`\`\`\n`;
                }
            }
            console.log(embed);
            msg.channel.send(embed);
        });
}

Data VS Async Data in Nuxt

I'm using Vue.js with Nuxt.js, and I'm still confused about when to use data vs. asyncData. Why would I need asyncData when I just have data that is simply displayed on the page?
I have a data object of FAQs and just want to display the data without doing anything with it. What are the benefits of using asyncData? What are the cases where it's the best choice?
Should I load list data like this with asyncData by default when using it inside my component?
Data
data: () => ({
    faqs: [
        { "title": "faq1" },
        { "title": "faq2" },
        { "title": "faq3" },
    ]
}),
asyncData
asyncData(context) {
    return new Promise((resolve, reject) => {
        resolve({
            colocationFaqs: [
                { "title": "faq1" },
                { "title": "faq2" },
                { "title": "faq3" },
            ]
        });
    })
        .then(data => {
            return data;
        })
        .catch(e => {
            context.error(e);
        });
},
asyncData happens on the server side. You can't access browser things like localStorage or fetch(), for example, but on the other hand you can access server-side things.
So why should you use asyncData instead of Vue lifecycle hooks like created?
The benefits of using asyncData are SEO and speed. There is this special context argument. It contains things like your store with context.store. It's special because asyncData happens on the server side, but the store is usually on the client side. That means you can fetch some data, populate your store with it, and display it somewhere else. The benefit of this is that it all happens server-side, which improves your SEO, so for example the Google crawler doesn't see a blank page.
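A minimal sketch of that store-population pattern (the /api/faqs endpoint and the setFaqs mutation are made up, and context.$axios assumes the @nuxtjs/axios module is installed) could look like:
// pages/faq.vue — asyncData runs on the server for the first request
export default {
    async asyncData(context) {
        // Hypothetical endpoint; any HTTP client that works server-side will do
        const faqs = await context.$axios.$get('/api/faqs');

        // Populate the store so other components can display the same data
        context.store.commit('setFaqs', faqs); // assumes a 'setFaqs' mutation exists

        // Whatever is returned here gets merged into the component's data
        return { faqs };
    }
};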
"Why would I need to pre-render it when it is going to be displayed anyway?"
Yes, for us it doesn't matter whether I send one file to the client and it renders all the data like in SPAs, or whether it's pre-rendered. But it does matter for the Google crawler. If you use SPA mode, the crawler just sees a blank page. You can see this yourself: go to any SPA website, right-click and view the page source, and you will see that there is only one div tag and a few <script> tags. (Don't press F12 and use the inspector, that's not what I mean.)

Recursive Facebook Page Webscraper with Selenium & Node.js

What I'm trying to do is loop through an array of Facebook page IDs and return the source code of each event page. Unfortunately, I only get the code of the last page ID in the array, but as many times as there are elements in the array. E.g. when I have 3 IDs in the array, I get the code of the last page ID 3 times.
I already experimented with async/await, but I had no success.
The expected outcome would be the code of each page.
Thank you for any help and examples.
// Looping through pages
pages.forEach(
    function(page) {
        // Creating URL
        let url = "https://mbasic.facebook.com/" + page + "?v=events";
        // Getting URL
        driver.get(url).then(
            function() {
                // Page loaded
                driver.getPageSource().then(function(result) {
                    console.log(result);
                });
            }
        );
    }
);
You faced the same issue I did when I created a scraper using Python and Selenium. Facebook has countermeasures against manual URL changes, so you cannot change the URL like that; I received the same data again and again even though it was automated. In order to get a good result, you need access to Facebook's Graph API, which provides a complete object for a Facebook page along with its pagination URL.
Or, the second way I got it to work was to use Selenium's click automation to scroll down to the next page instead of changing the URL. It won't work the way you are writing it; I prefer using the Graph API.

Is there a way to export and download your Angular front-end web page to pdf using back-end nodeJs's pdfmake?

I'm trying to export a certain page from my Angular/Node.js application using "pdfmake" and have it show up as a download option on the same page, after having heard that the best way to export PDFs is through the back end. After reading the guide and following a tutorial, however, the code writes data to my header but doesn't appear to do anything else.
In the past I've tried following the tutorial below and have read through the method documentation of pdfmake.
https://www.youtube.com/watch?v=0910p09D0Sg
https://pdfmake.github.io/docs/getting-started/client-side/methods/
I'm uncertain whether pdfmake is only supposed to be used with 'headless Chrome' (of which I don't have much knowledge) and wonder if my approach can work.
I've also tried using the .download() and .open() functions with pdfMake.createPdf() which resulted in errors.
Node.js code
router.post('/pdf', (req, res, next) => {
    var documentDefinition = {
        content: [
            'First paragraph',
            'Another paragraph, this time a little bit longer to make sure, this line will be divided into at least two lines'
        ]
    };

    const pdfDoc = pdfMake.createPdf(documentDefinition);
    pdfDoc.getBase64((data) => {
        res.writeHead(200, {
            'Content-Type': 'application/pdf',
            'Content-Disposition': 'attachment;filename="filename.pdf"'
        });
        const download = Buffer.from(data, 'base64'); // data is already a base64 string
        res.end(download);
    });
});
Angular code
savePDF() {
    this.api.post('/bookings/pdf')
        .then(res => {
            console.log(res);
        });
}
In this case the savePDF() function is called when the user clicks on a button on the web page.
Because nothing was happening upon clicking the button, I decided to console.log the result, which showed up as a very long string of data.
The PDF document data only contains test data for now, as I was trying to get a download link working before trying to download the web page itself.
I can also assure you that there is nothing wrong with the routing and that the functions are called properly.
I expected the savePDF() function to start a download of a PDF containing the test data shown in the Node.js "documentDefinition" content, but the actual result was seemingly nothing.
