Download website locally without JavaScript using Puppeteer

I am trying to download a website as a static copy, meaning without JS, only HTML & CSS.
I've tried many approaches, yet some issues remain with the CSS and images.
A snippet:
const puppeteer = require('puppeteer');
const { URL } = require('url');
const fse = require('fs-extra');
const path = require('path');

(async (urlToFetch) => {
    const browser = await puppeteer.launch({
        headless: true,
        slowMo: 100
    });

    const page = await browser.newPage();
    await page.setRequestInterception(true);

    page.on("request", request => {
        if (request.resourceType() === "script") {
            request.abort()
        } else {
            request.continue()
        }
    })

    page.on('response', async (response) => {
        const url = new URL(response.url());
        let filePath = path.resolve(`./output${url.pathname}`);
        if (path.extname(url.pathname).trim() === '') {
            filePath = `${filePath}/index.html`;
        }
        await fse.outputFile(filePath, await response.buffer());
        console.log(`File ${filePath} is written successfully`);
    });

    await page.goto(urlToFetch, {
        waitUntil: 'networkidle2'
    })

    setTimeout(async () => {
        await browser.close();
    }, 60000 * 4)
})('https://stackoverflow.com/');
I've also tried using:
const content = await page.content();
fs.writeFileSync('index.html', content, { encoding: 'utf-8' });
as well as downloading it using a CDPSession, and using website-scraper.
So what is the best approach to a solution where I provide a website link and it gets downloaded as a static website?

Try using this: https://www.npmjs.com/package/website-scraper
It will save the website to a local directory.
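A minimal sketch of its use, assuming website-scraper v4 (later versions are ESM-only); the urls and directory option names are from its README:

const scrape = require('website-scraper');

// Download the page plus its resources (CSS, images, fonts) into ./output.
scrape({
    urls: ['https://stackoverflow.com/'],
    directory: './output'
})
    .then(() => console.log('Done'))
    .catch((err) => console.error(err));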

Have you tried something like wget or curl?
wget -p https://stackoverflow.com/questions/67559777/download-website-locally-without-javascript-using-puppeteer
Should do the trick
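If the saved copy should also browse correctly offline, wget can additionally rewrite links and file extensions; a sketch using standard wget flags (-p page requisites, -k convert links, -E adjust extensions):
wget -p -k -E https://stackoverflow.com/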

Related

Puppeteer not actually downloading ZIP despite Clicking Link

I've been making incremental progress, but I'm fairly stumped at this point.
This is the site I'm trying to download from: https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
The reason I'm using Puppeteer is that I can't find a supported API to get this data (if there is one, I'm happy to try it).
The link is "Download Raw Data".
My script runs to the end but doesn't seem to actually download any files. I tried installing puppeteer-extra and setting the download path:
const puppeteer = require("puppeteer-extra");
const { executablePath } = require('puppeteer')
...
var dir = "/home/ubuntu/AirlineStatsFetcher/downloads";
console.log('dir to set for downloads', dir);

puppeteer.use(require('puppeteer-extra-plugin-user-preferences')({
    userPrefs: {
        download: {
            prompt_for_download: false,
            open_pdf_in_system_reader: true,
            default_directory: dir,
        },
        plugins: {
            always_open_pdf_externally: true
        },
    }
}));

const browser = await puppeteer.launch({
    headless: true, slowMo: 100, executablePath: executablePath()
});
...
// Doesn't seem to work
await page.waitForSelector('table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)');
console.log('Clicking on link to download CSV');
await page.click('table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)');
After a while I figured, why not try to build the full URL and then do a GET request? But then I ran into other problems (UNABLE_TO_VERIFY_LEAF_SIGNATURE). Before going further down this route (which feels a little hacky), I wanted to ask for advice here.
Is there something I'm missing in terms of configuration to get it to download?
Downloading files using Puppeteer seems to be a moving target, by the way, and is not well supported today. For now (Puppeteer 19.2.2) I would go with https.get instead.
"use strict";
const fs = require("fs");
const https = require("https");
// Not sure why puppeteer-extra is used... maybe https://stackoverflow.com/a/73869616/1258111 solves the need in future.
const puppeteer = require("puppeteer-extra");
const { executablePath } = require("puppeteer");
(async () => {
puppeteer.use(
require("puppeteer-extra-plugin-user-preferences")({
userPrefs: {
download: {
prompt_for_download: false,
open_pdf_in_system_reader: false,
},
plugins: {
always_open_pdf_externally: false,
},
},
})
);
const browser = await puppeteer.launch({
headless: true,
slowMo: 100,
executablePath: executablePath(),
});
const page = await browser.newPage();
await page.goto(
"https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp ",
{
waitUntil: "networkidle2",
}
);
const handle = await page.$(
"table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)"
);
const relativeZipUrl = await page.evaluate(
(anchor) => anchor.getAttribute("href"),
handle
);
const url = "https://www.transtats.bts.gov/OT_Delay/".concat(relativeZipUrl);
const encodedUrl = encodeURI(url);
//Don't use in production
https.globalAgent.options.rejectUnauthorized = false;
https.get(encodedUrl, (res) => {
const path = `${__dirname}/download.zip`;
const filePath = fs.createWriteStream(path);
res.pipe(filePath);
filePath.on("finish", () => {
filePath.close();
console.log("Download Completed");
});
});
await browser.close();
})();

How do I combine puppeteer plugins with puppeteer clusters?

I have a list of URLs that need to be scraped from a website that uses React; for this reason I am using Puppeteer.
I do not want to be blocked by anti-bot servers, so I have added puppeteer-extra-plugin-stealth.
I want to prevent ads from loading on the pages, so I am blocking them with puppeteer-extra-plugin-adblocker.
I also want to prevent my IP address from being blacklisted, so I have used TOR nodes to get different IP addresses.
Below is a simplified version of my code, and the setup works (TOR_port and webUrl are assigned dynamically, but to simplify my question I have assigned them as variables).
There is a problem, though:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

var TOR_port = 13931;
var webUrl = 'https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';

const browser = await puppeteer.launch({
    dumpio: false,
    headless: false,
    args: [
        `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
        `--no-sandbox`,
    ],
    ignoreHTTPSErrors: true,
});

try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto(webUrl, {
        waitUntil: 'load',
        timeout: 30000,
    });
    page
        .waitForSelector('.price')
        .then(async () => { // the callback must be async for await to be valid here
            console.log('The price is available');
            await browser.close();
        })
        .catch(() => {
            // close this since it is clearly not a zillow website
            throw new Error('This is not the zillow website');
        });
} catch (e) {
    await browser.close();
}
The above setup works but is very unreliable, and I recently learnt about Puppeteer-Cluster, which I need to help me manage crawling multiple pages and track my scraping tasks.
So, my question is how to implement Puppeteer-Cluster with the above set-up. I am aware of an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) offered by the library to show how you can use plugins, but it is so bare that I didn't quite understand it.
How do I implement Puppeteer-Cluster with the above TOR, AdBlocker, and Stealth configurations?
You can just hand over your puppeteer instance like the following (note that the instance goes into Cluster.launch, not puppeteer.launch):

const { Cluster } = require('puppeteer-cluster');
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

const cluster = await Cluster.launch({
    puppeteer,
});

Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
You can just add the plugins with puppeteer.use()
You have to use puppeteer-extra.
const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");
(async () => {
    const puppeteer = addExtra(vanillaPuppeteer);
    puppeteer.use(StealthPlugin());
    puppeteer.use(RecaptchaPlugin());
    // Do stuff
})();
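To tie this back to the question's set-up, here is a minimal sketch combining addExtra with the cluster; the concurrency settings are illustrative assumptions, and the proxy argument is taken from the question's own launch options:

const { addExtra } = require('puppeteer-extra');
const vanillaPuppeteer = require('puppeteer');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
const { Cluster } = require('puppeteer-cluster');

(async () => {
    const puppeteer = addExtra(vanillaPuppeteer);
    puppeteer.use(StealthPlugin());
    puppeteer.use(AdblockerPlugin());

    const TOR_port = 13931; // assigned dynamically in the real code, as in the question

    const cluster = await Cluster.launch({
        puppeteer, // the plugin-enabled instance
        concurrency: Cluster.CONCURRENCY_CONTEXT, // assumed: one context per task
        maxConcurrency: 2, // assumed concurrency level
        puppeteerOptions: {
            headless: false,
            args: [
                `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
                '--no-sandbox',
            ],
            ignoreHTTPSErrors: true,
        },
    });

    // Each queued URL is handled by this task on a pooled page.
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url, { waitUntil: 'load', timeout: 30000 });
        await page.waitForSelector('.price');
        console.log('The price is available');
    });

    cluster.queue('https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/');

    await cluster.idle();
    await cluster.close();
})();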

Chrome download error when downloading file with Puppeteer

I have an application that shows a page where the user clicks a button and downloads a CSV file. I want to run this with Puppeteer.
The problem is that the CSV is downloaded empty and with an error. This happens with both headless true and false. The page finishes loading, and I increased the timeout, but it still fails. What could be the issue?
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.goto('http://localhost:4400/login', { waitUntil: 'networkidle2' });
    await page._client.send('Page.setDownloadBehavior', {
        behavior: 'allow',
        downloadPath: './',
    });
    await page.waitForSelector('#run-and-export');
    await page.click('#run-and-export');
    // #file-downloaded appears once the file has finished downloading (so the window isn't closed too early)
    await page.waitForSelector('#file-downloaded', { timeout: 120000 });
    await browser.close();
})();
The code in the application that generates the file to download is an Angular service:
@Injectable({
    providedIn: 'root'
})
export class DownloadService {
    downloadFile(content: any, fileName: string, mimeType: string) {
        var blob = new Blob([content], { type: mimeType });
        var a = document.createElement('a');
        a.href = window.URL.createObjectURL(blob);
        a.download = fileName;
        a.click();
    }
}
This is what made this work:
const downloadPath = path.resolve('/my/path');

await page._client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: downloadPath
});
I got the same problem: the download failed, and in the download dir I got filename.pdf.crdownload and no other file.
The download dir is up two levels from the app dir (../../download_dir).
The solution (as suggested by ps0604) was:

const path = require('path');
const download_path = path.resolve('../../download_dir/');

await page._client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: download_path,
});
Hopefully this helps anyone searching for a .crdownload file and a download error.
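Note that page._client is a private API and was removed in newer Puppeteer versions; a sketch of the same call through an explicit CDP session (the downloads directory here is an assumption):

const path = require('path');

// Create a dedicated CDP session instead of relying on the private page._client.
const client = await page.target().createCDPSession();
await client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: path.resolve('./downloads'), // hypothetical download directory
});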

How to use Apify on Google Cloud Functions

I'm deploying some code using Apify as Google Cloud Functions. When triggered, the Cloud Function terminates silently. What am I doing wrong?
I have some working code using Apify 0.15.1. It runs fine locally. Once deployed as a Google Cloud Function, it fails silently without any clear error. The equivalent code using Puppeteer 1.18.1 works fine.
I've reproduced the issue using the simpler code below. While this example doesn't strictly require Apify, I would like to be able to use the extra functionality provided by Apify.
Code using Apify:
const Apify = require("apify");

exports.screenshotApify = async (req, res) => {
    let imageBuffer;

    Apify.main(async () => {
        const browser = await Apify.launchPuppeteer({ headless: true });
        const page = await browser.newPage();
        await page.goto("https://xenaccounting.com");
        imageBuffer = await page.screenshot({ fullPage: true });
        await browser.close();
    });

    if (res) {
        res.set("Content-Type", "image/png");
        res.send(imageBuffer);
    }
    return imageBuffer;
};
Code using Puppeteer:
const puppeteer = require("puppeteer");

exports.screenshotPup = async (req, res) => {
    const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
    const page = await browser.newPage();
    await page.goto("https://xenaccounting.com");
    const imageBuffer = await page.screenshot({ fullPage: true });
    await browser.close();

    if (res) {
        res.set("Content-Type", "image/png");
        res.send(imageBuffer);
    }
    return imageBuffer;
};
Once deployed as a Google Cloud Function (with --trigger-http and --memory=2048), the Puppeteer variant works fine, while the Apify variant terminates silently without result (apart from an 'ok' / HTTP 200 return value).
Get rid of the Apify.main() function; it schedules the call for a later time, after your function has already returned the result.
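A sketch of the question's handler with Apify.main() removed (the same calls as the original, just awaited directly):

const Apify = require("apify");

exports.screenshotApify = async (req, res) => {
    // Launch directly; wrapping this in Apify.main() would defer the work
    // until after the Cloud Function has already sent its response.
    const browser = await Apify.launchPuppeteer({ headless: true });
    const page = await browser.newPage();
    await page.goto("https://xenaccounting.com");
    const imageBuffer = await page.screenshot({ fullPage: true });
    await browser.close();

    if (res) {
        res.set("Content-Type", "image/png");
        res.send(imageBuffer);
    }
    return imageBuffer;
};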

Set localstorage items before page loads in puppeteer?

We have some routing logic that kicks you to the homepage if you don't have a JWT_TOKEN set... I want to set this before the page loads/before the JS is invoked.
How do I do this?
You have to register the localStorage item like this:
await page.evaluate(() => {
    localStorage.setItem('token', 'example-token');
});
You should do it after page.goto: the browser must have a URL in order to register a local storage item on it. After this, enter the same page once again; this time the token should be there before the page is loaded.
Here is a fully working example:
const puppeteer = require('puppeteer');
const http = require('http');

const html = `
<html>
    <body>
        <div id="element"></div>
        <script>
            document.getElementById('element').innerHTML =
                localStorage.getItem('token') ? 'signed' : 'not signed';
        </script>
    </body>
</html>`;

http
    .createServer((req, res) => {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.write(html);
        res.end();
    })
    .listen(8080);

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto('http://localhost:8080/');
    await page.evaluate(() => {
        localStorage.setItem('token', 'example-token');
    });

    // Second navigation: the token is already set before the page's scripts run.
    await page.goto('http://localhost:8080/');
    const text = await page.evaluate(
        () => document.querySelector('#element').textContent
    );
    console.log(text); // prints "signed"

    await browser.close();
    process.exit(0);
})();
There's some discussion about this in Puppeteer's GitHub issues.
You can load a page on the domain, set your localStorage, then go to the actual page you want to load with localStorage ready. You can also intercept the first URL load to return instantly instead of actually loading the page, potentially saving a lot of time.
const doSomePuppeteerThings = async () => {
    const url = 'http://example.com/';
    const browser = await puppeteer.launch();
    const localStorage = { storageKey: 'storageValue' };

    await setDomainLocalStorage(browser, url, localStorage);

    const page = await browser.newPage();
    // do your actual puppeteer things now
};

const setDomainLocalStorage = async (browser, url, values) => {
    const page = await browser.newPage();
    await page.setRequestInterception(true);
    // Answer every request with a stub so the navigation returns instantly.
    page.on('request', r => {
        r.respond({
            status: 200,
            contentType: 'text/plain',
            body: 'tweak me.',
        });
    });
    await page.goto(url);
    await page.evaluate(values => {
        for (const key in values) {
            localStorage.setItem(key, values[key]);
        }
    }, values);
    await page.close();
};
In 2021 it works with the following code:

// store the token in localStorage
await page.evaluateOnNewDocument(token => {
    localStorage.clear();
    localStorage.setItem('token', token);
}, 'eyJh...9_8cw');

// open the url
await page.goto('http://localhost:3000/Admin', { waitUntil: 'load' });

The snippet from the first answer unfortunately does not work:

await page.evaluate(() => {
    localStorage.setItem('token', 'example-token'); // does not work, produces errors :(
});
Without requiring a double goto, this would work:

const browser = await puppeteer.launch();

browser.on('targetchanged', async (target) => {
    const targetPage = await target.page();
    const client = await targetPage.target().createCDPSession();
    await client.send('Runtime.evaluate', {
        expression: `localStorage.setItem('hello', 'world')`,
    });
});

// newPage, goTo, etc...
Adapted from the Lighthouse docs for Puppeteer, which do something similar: https://github.com/GoogleChrome/lighthouse/blob/master/docs/puppeteer.md
Try an additional script tag. Example:
Say you have a main.js script that houses your routing logic, and a setJWT.js script that houses your token logic.
Then, within the HTML that loads these scripts, order them this way:
<script src='setJWT.js'></script>
<script src='main.js'></script>
This is only good for the initial load of the page.
Most routing libraries, however, usually have an event hook system that you can hook into before a route renders; a sketch of that idea follows below. I would store the setJWT logic somewhere in that callback.
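For instance, a hypothetical sketch using a Vue-Router-style beforeEach guard (the router, the JWT_TOKEN key, and fetchTokenSomehow are all illustrative, not from the question):

// Hypothetical router guard: make sure the token exists before any route renders.
router.beforeEach((to, from, next) => {
    if (!localStorage.getItem('JWT_TOKEN')) {
        localStorage.setItem('JWT_TOKEN', fetchTokenSomehow()); // hypothetical helper
    }
    next();
});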
