I'm trying to archive a webpage by its URL. The idea is to archive a website into a single HTML file with all the assets in it. When the user provides a URL, I use fetch to get the HTML content of the page. However, I want to rewrite relative asset paths (CSS file paths, hrefs, file/URL paths) against the user-provided URL, so that when I open the archived HTML file the page renders properly with all the images, links, etc.
Here's what I'm trying:
const fs = require('fs');

const response = await fetch(url);
const html = await response.text();
// replace root-relative hrefs with absolute urls
const newHTML = html.replaceAll(/href="\//g, 'href="https://example.com/');
fs.writeFile('output2.html', newHTML, (err) => {
  if (err) throw err;
  console.log('The file has been saved!');
});
I need help finding the proper regexp to make this work, or any other way around it.
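For what it's worth, a regexp on href="/ alone will miss src, srcset, and protocol-relative links. Below is a minimal sketch of doing the rewrite with cheerio and the URL constructor instead; the function name and the attribute list are my own illustrative assumptions, and it assumes cheerio is installed:
const cheerio = require('cheerio');

function absolutify(html, baseUrl) {
  const $ = cheerio.load(html);
  // rewrite the common URL-bearing attributes against the page's base URL
  $('[href]').each((i, el) => {
    $(el).attr('href', new URL($(el).attr('href'), baseUrl).href);
  });
  $('[src]').each((i, el) => {
    $(el).attr('src', new URL($(el).attr('src'), baseUrl).href);
  });
  return $.html();
}
Using the URL constructor handles relative, root-relative, and already-absolute paths uniformly, which is hard to get right with a single regexp.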
I'm new to web scraping. I'm using Axios to fetch the URL and then accessing the data with Cheerio.
I want to scrape Twitter for my account's number of followers. I inspected the element that holds the follower count, then tried to select it, but it doesn't return anything.
So I tried printing every span tag on the page, and it returns the string "Something went wrong, but don’t fret — let’s give it another shot."
When I inspect the page, I can see the tag elements, but when I click on "view page source", it shows something completely different.
I found that the string "Something went wrong, but don’t fret — let’s give it another shot." is located in the page source here:
The element I want when inspecting my twitter page is:
This is my JS code:
const cheerio = require('cheerio');
const axios = require('axios');

axios('https://twitter.com/SaudAlghamdi97')
  .then(response => {
    run();

    async function run() {
      const html = await response.data;
      const $ = cheerio.load(html);
      // log the text of every <span> on the page
      $('span').each((i, el) => {
        console.log($(el).text());
      });
    }
  });
This is what I get in the terminal:
Am I missing something here? I'm struggling to scrape the number of followers.
The data you request seems to be rendered by JavaScript. You'll need another library, for example Puppeteer, which is able to view the rendered page just as you see it in your browser.
"Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol"
This works fine:
fetch("./xml/Stories.xml").then((response) => {
response.text().then((xml) => {
xmlContent = xml;
But I would like to get my data from a website that has a link which only displays the XML. How would I go about retrieving the data through a link instead of a direct file path?
I.e.:
fetch("https://Example.com/").then((response) => {
response.text().then((xml) => {
xmlContent = xml;
What you are trying to do is called web scraping. You have to scrape the link you need out of the webpage before you try to fetch its XML content.
While scraping is generally a bad idea, it is certainly possible. Find a pattern, element ID, or class name on the element containing the link, and use an HTTP request to first fetch the web page's HTML content:
const request = require('request');

request('http://stackabuse.com', function(err, res, body) {
  console.log(body); // The HTML content
});
Then you use an HTML parser library like cheerio to turn the raw HTML string into traversable objects and get the link you need; fetch that link and you have your XML content.
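Putting the two steps together, a minimal sketch might look like this; the selector a[href$=".xml"] is an assumption about how the link is marked up, and it assumes a runtime with a global fetch (Node 18+):
const cheerio = require('cheerio');

async function getXmlFromPage(pageUrl) {
  // step 1: fetch the page's HTML and parse it
  const pageHtml = await (await fetch(pageUrl)).text();
  const $ = cheerio.load(pageHtml);
  // step 2: pull the href off the link and resolve it against the page URL
  const href = $('a[href$=".xml"]').attr('href');
  const xmlUrl = new URL(href, pageUrl).href;
  return (await fetch(xmlUrl)).text();
}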
The downside to web scraping is that if the owner of the webpage decides to edit their HTML content, it will probably break your scraper, because the pattern you matched for the link will no longer be valid.
I'm trying to export a certain page from my Angular/Node.js application using pdfmake and have it show up as a download on the same page, after hearing that the best way to export PDFs is through the back end. After reading the guide and following a tutorial, however, the code writes data to my headers but doesn't appear to do anything else.
In the past I've tried following the tutorial below and have read through the method documentation of pdfmake.
https://www.youtube.com/watch?v=0910p09D0Sg
https://pdfmake.github.io/docs/getting-started/client-side/methods/
I'm uncertain whether pdfmake is only meant to be used with headless Chrome (of which I don't possess much knowledge) and wonder whether my method can work.
I've also tried using the .download() and .open() functions with pdfMake.createPdf(), which resulted in errors.
Node.js code
router.post('/pdf', (req, res, next) => {
  const documentDefinition = {
    content: [
      'First paragraph',
      'Another paragraph, this time a little bit longer to make sure, this line will be divided into at least two lines'
    ]
  };

  const pdfDoc = pdfMake.createPdf(documentDefinition);
  pdfDoc.getBase64((data) => {
    res.writeHead(200, {
      'Content-Type': 'application/pdf',
      'Content-Disposition': 'attachment;filename="filename.pdf"'
    });
    // getBase64 already yields a base64 string, so decode it directly
    const download = Buffer.from(data, 'base64');
    res.end(download);
  });
});
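Side note: when running on Node, pdfmake also documents a server-side entry point, PdfPrinter, which streams the PDF straight into the response instead of round-tripping through base64. A minimal sketch, assuming Roboto .ttf files exist at the paths shown:
const PdfPrinter = require('pdfmake');

const printer = new PdfPrinter({
  Roboto: {
    normal: 'fonts/Roboto-Regular.ttf',
    bold: 'fonts/Roboto-Medium.ttf'
  }
});

router.post('/pdf', (req, res) => {
  const pdfDoc = printer.createPdfKitDocument({
    content: ['First paragraph']
  });
  res.writeHead(200, {
    'Content-Type': 'application/pdf',
    'Content-Disposition': 'attachment;filename="filename.pdf"'
  });
  pdfDoc.pipe(res); // stream the document straight into the response
  pdfDoc.end();
});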
Angular code
savePDF() {
  this.api.post('/bookings/pdf')
    .then(res => {
      console.log(res);
    });
}
In this case the savePDF() function is called when the user clicks a button on the web page.
Because nothing was happening when I clicked the button, I decided to console.log the result, which showed up as a very long string of data.
The PDF document only contains test data for now, as I was trying to get a download link to work before trying to download the webpage itself.
I can also assure you that there is nothing wrong with the routing and that the functions are called properly.
I expected the savePDF() function to start a download of a PDF containing the test data from the Node.js "documentDefinition" content, but seemingly nothing happened.
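One thing worth checking: the POST succeeds and the data comes back, but nothing ever hands the bytes to the browser as a file. A minimal plain-fetch sketch of that missing step; the endpoint path is from the question, while the Blob handling and temporary link are illustrative assumptions:
async function savePDF() {
  const res = await fetch('/bookings/pdf', { method: 'POST' });
  const blob = await res.blob();
  // wrap the binary response in an object URL and click a temporary link
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = 'filename.pdf';
  a.click();
  URL.revokeObjectURL(url);
}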
Essentially, I'm trying to retrieve the contents of a website I created to display time-based one-time passcodes (TOTPs) as part of a project I'm doing for fun. The site in question is here. As you can see, all it does is display a TOTP. However, I'm having trouble actually getting the data from the site.
I've created a small script that is meant to get the TOTP from the web page, but upon running a fetch request like this (from another server):
const getResponseFromTOTP = () =>
  new Promise((resolve, reject) => {
    fetch("https://jpegzilla.com/totp/?secret=secrettextgoeshere&length=6")
      .then(res => res.text())
      .then(html => resolve(html))
      // a try/catch can't catch an async rejection here, so chain .catch instead
      .catch(e => reject(e));
  });
From this request, I get the entirety of the document at the URL, but without the content that would be rendered there if viewed in a browser.
The idea is to somehow get JavaScript to render the content of the webpage as it would be displayed in a browser, and then extract the TOTP from the document. The site hosting the TOTP is completely static; how might this be achieved using only JavaScript and HTML?
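Since the TOTP is computed by the page's own scripts, one option that stays in plain JavaScript is jsdom with script execution enabled. A hedged sketch; the '#totp' selector and the one-second wait are assumptions about the page, not taken from it:
const { JSDOM } = require('jsdom');

async function getTOTP(url) {
  const dom = await JSDOM.fromURL(url, {
    runScripts: 'dangerously', // execute the page's own scripts
    resources: 'usable'        // also load external scripts it references
  });
  // give the scripts a moment to render the code into the DOM
  await new Promise(resolve => setTimeout(resolve, 1000));
  return dom.window.document.querySelector('#totp').textContent;
}
A headless browser such as Puppeteer would work here too, at the cost of a heavier dependency.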
I'm pulling a base64-encoded image from a server to a website and want to use the image automatically. The image comes inside an object, and I'm not sure how to extract just the image from the object.
The server sends the object using (Node.js server):
res.send(result[0]);
The website's GET request and display code:
getImage() {
  this.httpClient.get('http://ip_address/route/imagename')
    .subscribe(data => {
      this.getImg = data;
      console.log(data);
    });
}
The result is an object containing the image, but I want the image to display on its own, outside of the object. I don't know whether that needs to happen on the server or on the website.
(Screenshot: the object as logged in the website's console.)
I needed to save the response to one variable, then save just the part I wanted into a second variable. Like:
this.httpClient.get('ip_address/images/' + imgName)
  .subscribe(data => {
    this.imageData = data;
    this.base64image = this.imageData.image;
  });
this.base64image can then be used as the base64 code.
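To actually render it, the base64 string can be dropped into a data URL. A minimal plain-JavaScript sketch; the image/png MIME type is an assumption about the stored format:
// prefix the raw base64 string with a data-URL header before using it as a src
const img = document.createElement('img');
img.src = 'data:image/png;base64,' + base64image;
document.body.appendChild(img);
In an Angular template, the equivalent is binding that same data-URL string to the img element's [src].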