Chrome download error when downloading file with Puppeteer - javascript

I have an application that shows a page; the user clicks a button and downloads a CSV file. I want to run this with Puppeteer.
The problem is that the CSV is downloaded empty and with an error. This happens with both headless true and false. The page finishes loading, and I increased the timeout, but it still fails. What could be the issue?
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto('http://localhost:4400/login', { waitUntil: 'networkidle2' });
  await page._client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: './',
  });
  await page.waitForSelector('#run-and-export');
  await page.click('#run-and-export');
  // #file-downloaded appears once the file has finished downloading
  // (so the browser isn't closed too early)
  await page.waitForSelector('#file-downloaded', { timeout: 120000 });
  await browser.close();
})();
The code in the application that generates the file to download is an Angular service:
@Injectable({
  providedIn: 'root'
})
export class DownloadService {
  downloadFile(content: any, fileName: string, mimeType: string) {
    const blob = new Blob([content], { type: mimeType });
    const a = document.createElement('a');
    a.href = window.URL.createObjectURL(blob);
    a.download = fileName;
    a.click();
  }
}

This is what made this work:
const downloadPath = path.resolve('/my/path');

await page._client.send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: downloadPath
});

I had the same problem: the download failed, and the download directory contained only filename.pdf.crdownload and no other file.
The download directory is two levels above the app directory (../../download_dir).
The solution was (as suggested by ps0604):
const path = require('path');
const download_path = path.resolve('../../download_dir/');

await page._client.send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: download_path,
});
Hopefully this helps anyone who is searching for .crdownload files and download errors.

Puppeteer not actually downloading ZIP despite Clicking Link

I've been making incremental progress, but I'm fairly stumped at this point.
This is the site I'm trying to download from https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
The reason I'm using Puppeteer is that I can't find a supported API to get this data (if there is one, I'm happy to try it).
The link is "Download Raw Data"
My script runs to the end, but doesn't seem to actually download any files. I tried installing puppeteer-extra and setting the downloads path:
const puppeteer = require("puppeteer-extra");
const { executablePath } = require('puppeteer');
...
const dir = "/home/ubuntu/AirlineStatsFetcher/downloads";
console.log('dir to set for downloads', dir);

puppeteer.use(require('puppeteer-extra-plugin-user-preferences')({
  userPrefs: {
    download: {
      prompt_for_download: false,
      open_pdf_in_system_reader: true,
      default_directory: dir,
    },
    plugins: {
      always_open_pdf_externally: true
    },
  }
}));

const browser = await puppeteer.launch({
  headless: true, slowMo: 100, executablePath: executablePath()
});
...
// Doesn't seem to work
await page.waitForSelector('table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)');
console.log('Clicking on link to download CSV');
await page.click('table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)');
After a while I figured, why not try to build the full URL and do a GET request? But then I ran into other problems (UNABLE_TO_VERIFY_LEAF_SIGNATURE). Before going further down this route (which feels a little hacky), I wanted to ask for advice here.
Is there something I'm missing in terms of configuration to get it to download?
Downloading files with Puppeteer seems to be a moving target, by the way, and is not well supported today. For now (Puppeteer 19.2.2) I would go with https.get instead.
"use strict";

const fs = require("fs");
const https = require("https");
// Not sure why puppeteer-extra is used... maybe https://stackoverflow.com/a/73869616/1258111 solves the need in future.
const puppeteer = require("puppeteer-extra");
const { executablePath } = require("puppeteer");

(async () => {
  puppeteer.use(
    require("puppeteer-extra-plugin-user-preferences")({
      userPrefs: {
        download: {
          prompt_for_download: false,
          open_pdf_in_system_reader: false,
        },
        plugins: {
          always_open_pdf_externally: false,
        },
      },
    })
  );
  const browser = await puppeteer.launch({
    headless: true,
    slowMo: 100,
    executablePath: executablePath(),
  });
  const page = await browser.newPage();
  await page.goto("https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp", {
    waitUntil: "networkidle2",
  });
  const handle = await page.$(
    "table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)"
  );
  const relativeZipUrl = await page.evaluate(
    (anchor) => anchor.getAttribute("href"),
    handle
  );
  const url = "https://www.transtats.bts.gov/OT_Delay/".concat(relativeZipUrl);
  const encodedUrl = encodeURI(url);
  // Don't use in production
  https.globalAgent.options.rejectUnauthorized = false;
  https.get(encodedUrl, (res) => {
    const path = `${__dirname}/download.zip`;
    const filePath = fs.createWriteStream(path);
    res.pipe(filePath);
    filePath.on("finish", () => {
      filePath.close();
      console.log("Download Completed");
    });
  });
  await browser.close();
})();

How to use Puppeteer to download PDF files from a website?

I've been trying to use Puppeteer to download PDF files from a specific website, but how do I get it to download all the files? For example:
A file on the website is like example.com/Contents/xxx-1.pdf
A second file on the website is like example.com/Contents/xxx-2.pdf
How can I use Puppeteer to download the file contents automatically, trying each number in turn?
I've made a function that, given a function mapping an index to the URL of the PDF to download and a count that limits the downloads, tries to download each PDF.
const puppeteer = require('puppeteer');

downloadFiles((i) => `example.com/Contents/xxx-${i}.pdf`, 20);

async function downloadFiles(url, count) {
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();
  for (let i = 0; i < count; i++) {
    const pageUrl = url(i);
    try {
      await page.goto(pageUrl);
      await page.pdf({
        path: `pdf-${i}.pdf`,
        format: 'A4',
        printBackground: true
      });
    } catch (e) {
      console.log(`Error loading ${pageUrl}`);
    }
  }
  await browser.close();
}

Download website locally without Javascript using puppeteer

I am trying to download a website as a static copy: no JS, only HTML and CSS.
I've tried many approaches, yet some issues remain with CSS and images.
A snippet:
const puppeteer = require('puppeteer');
const { URL } = require('url');
const fse = require('fs-extra');
const path = require('path');

(async (urlToFetch) => {
  const browser = await puppeteer.launch({
    headless: true,
    slowMo: 100
  });
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on("request", request => {
    if (request.resourceType() === "script") {
      request.abort();
    } else {
      request.continue();
    }
  });
  page.on('response', async (response) => {
    const url = new URL(response.url());
    let filePath = path.resolve(`./output${url.pathname}`);
    if (path.extname(url.pathname).trim() === '') {
      filePath = `${filePath}/index.html`;
    }
    await fse.outputFile(filePath, await response.buffer());
    console.log(`File ${filePath} is written successfully`);
  });
  await page.goto(urlToFetch, {
    waitUntil: 'networkidle2'
  });
  // Keep the browser open long enough for all responses to be written
  setTimeout(async () => {
    await browser.close();
  }, 60000 * 4);
})('https://stackoverflow.com/');
I've tried using
content = await page.content();
fs.writeFileSync('index.html', content, { encoding: 'utf-8' });
I also downloaded it using a CDPSession, and I've tried website-scraper.
So what is the best approach: given a website link, how do I download it as a static website?
Try using this https://www.npmjs.com/package/website-scraper
It will save the website into a local directory
Have you tried something like wget or curl?
wget -p https://stackoverflow.com/questions/67559777/download-website-locally-without-javascript-using-puppeteer
Should do the trick

Taking a screenshot with puppeteer and storing it in google cloud storage

I'll describe my end goal right away: I want to be able to take screenshots of my website with Puppeteer and upload them straight to Google Cloud Storage (with Cloud Functions, for example).
However, I've been running into an issue with actually uploading the file if I do not give a path to locally store it. This is the code I have:
(async () => {
  const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  const filename = await page.screenshot();
  await storage.bucket(bucketName).upload(filename, {
    gzip: true,
  });
  console.log(`${filename} uploaded to ${bucketName}.`);
  await browser.close();
})();
I have tried a variety of things, like encoding the image differently and converting it from a buffer to a string, but I keep running into the same two errors, either:
The "path" argument must be of type string. Received an instance of Buffer, or
The argument 'path' must be a string or Uint8Array without null bytes.
I appreciate all the help I can get :D
Kind regards
There are three ways to do that.
First, add the path option to page.screenshot() to store the screenshot on the local machine, then pass that path to storage.bucket(bucketName).upload() to upload the image to Google Cloud.
Second, get the screenshot as a base64-encoded string and upload it to Google Cloud under a specific name.
Third, get the screenshot as binary data and upload it to Google Cloud under a specific name.
First way
(async () => {
  const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await page.screenshot({
    path: './screenshot.png',
  });
  const bucket = storage.bucket('bucket_name');
  const options = {
    destination: 'puppeteer_screenshots/screenshot_XXX.png',
    gzip: true,
  };
  await bucket.upload('./screenshot.png', options);
  console.log("Created object gs://bucket_name/puppeteer_screenshots/screenshot_XXX.png");
  await browser.close();
})();
Second way
(async () => {
  const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  const screenshotBase64 = await page.screenshot({
    encoding: 'base64',
  });
  const bucket = storage.bucket('bucket_name');
  const file = bucket.file('puppeteer_screenshots/screenshot_XXX.png');
  // Decode the base64 string back to bytes before saving, otherwise the
  // stored object contains base64 text rather than a valid PNG
  await file.save(Buffer.from(screenshotBase64, 'base64'), {
    metadata: { contentType: 'image/png' },
  });
  console.log("Created object gs://bucket_name/puppeteer_screenshots/screenshot_XXX.png");
  await browser.close();
})();
Third way
(async () => {
  const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  const screenshotBinary = await page.screenshot({ encoding: 'binary' });
  const bucket = storage.bucket('bucket_name');
  const file = bucket.file('puppeteer_screenshots/screenshot_XXX.png');
  await file.save(screenshotBinary, {
    metadata: { contentType: 'image/png' },
  });
  console.log("Created object gs://bucket_name/puppeteer_screenshots/screenshot_XXX.png");
  await browser.close();
})();
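One detail worth spelling out for the base64 variant: file.save() stores exactly the bytes it is given, so a base64 string must be decoded back to a Buffer before saving, or the object will contain base64 text rather than a valid PNG. The round-trip, in plain Node:

```javascript
// Round-trip: binary -> base64 string -> binary again.
const original = Buffer.from([0x89, 0x50, 0x4e, 0x47]); // PNG magic bytes as stand-in data
const asBase64 = original.toString('base64');
const decoded = Buffer.from(asBase64, 'base64');

console.log(decoded.equals(original)); // true
```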
References
puppeteer screenshot.
google cloud storage (.upload)
google cloud storage (.save)

How to find number of pages in a single pdf created via puppeteer

I am currently trying to find the number of pages in a single PDF, i.e. the total size of the PDF file created by Puppeteer's page.pdf(), as per a requirement.
Here's what I did:
try {
  const generatedPdfFilePath = `${directory}/feedback-${requestId}.pdf`;
  const htmlFilePath = `${directory}/report-${requestId}.html`;
  const htmlTemplate =
    fs.readFileSync(path.join(process.cwd(), '/data/feedback-template.hbs'), 'utf-8');
  const template = handlebars.compile(htmlTemplate);
  const htmlFile = minify(template(data), {
    collapseWhitespace: true,
  });
  fs.writeFileSync(htmlFilePath, htmlFile);
  const options = {
    format: 'A4',
    printBackground: true,
    path: generatedPdfFilePath,
  };
  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto(`file://${htmlFilePath}`, {
    waitUntil: 'networkidle0',
    timeout: 300000,
  });
  await page.pdf(options);
  // Do something here to find number of pages in this pdf
  await browser.close();
  resolve({ file: generatedPdfFilePath });
} catch (error) {
  console.log(error);
  reject(error);
}
So far, what I have done is create an HTML template for the PDF, then use Puppeteer (headless Chrome for Node.js) to generate the required PDF of the page. But now I'm stuck, because I want to know how many pages are actually in this PDF file, or in other words what the size of the PDF is, which I need for further calculations. I have only included the relevant code here for brevity.
Also, I'm pretty new to Puppeteer. Can someone explain how I can get details of this PDF? I have been searching for quite some time now with no luck. Puppeteer's documentation isn't much help either; it only covers the PDF options.
docs
Any help would be much appreciated.
You can use the pdf-parse node module, like this:
const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');
pdf(dataBuffer).then(function(data) {
  // number of pages
  console.log(data.numpages);
});
Your code would become something like:
const pdf = require('pdf-parse');

try {
  const generatedPdfFilePath = `${directory}/feedback-${requestId}.pdf`;
  const htmlFilePath = `${directory}/report-${requestId}.html`;
  const htmlTemplate =
    fs.readFileSync(path.join(process.cwd(), '/data/feedback-template.hbs'), 'utf-8');
  const template = handlebars.compile(htmlTemplate);
  const htmlFile = minify(template(data), {
    collapseWhitespace: true,
  });
  fs.writeFileSync(htmlFilePath, htmlFile);
  const options = {
    format: 'A4',
    printBackground: true,
    path: generatedPdfFilePath,
  };
  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto(`file://${htmlFilePath}`, {
    waitUntil: 'networkidle0',
    timeout: 300000,
  });
  await page.pdf(options);
  // Read the generated PDF (not the HTML source) and count its pages
  const dataBuffer = fs.readFileSync(generatedPdfFilePath);
  const pdfInfo = await pdf(dataBuffer);
  const numPages = pdfInfo.numpages;
  await browser.close();
  resolve({ file: generatedPdfFilePath });
} catch (error) {
  console.log(error);
  reject(error);
}
