I'm building a PDF report with Handlebars and Puppeteer to save into my database. The PDF generates fine, except that I can't get the images I have stored in an assets directory to load. I'm console-logging the filePath the helper resolves for each image, and the paths come through correctly. I'm just not sure why the images aren't loading; any help is appreciated.
Here is my code so far.
const puppeteer = require("puppeteer");
const hbs = require("handlebars");
const fs = require("fs-extra");
const path = require("path");
// compiles the handlebars docs
const compile = async (templateName, data) => {
  const filePath = path.join(__dirname, "templates", `${templateName}.hbs`);
  if (!filePath) {
    throw new Error(`Could not find ${templateName}.hbs in generatePDF`);
  }
  const html = await fs.readFile(filePath, "utf-8");
  return hbs.compile(html)(data);
};
// helper for getting the images
hbs.registerHelper("getIntro", (context, idx) => {
const filePath = path.join(
__dirname,
"assets",
`${context}_intro_${idx}.png`
);
return filePath;
});
// use puppeteer to take in compiled hbs doc and create a pdf
const generatePDF = async (fileName, data) => {
  const preparedData = prepareDataForPDF(data);
  const browser = await puppeteer.launch({
    args: ["--no-sandbox"],
    headless: true
  });
  const page = await browser.newPage();
  const content = await compile(fileName, preparedData);
  await page.goto(`data: text/html;charset=UTF-8, ${content}`, {
    waitUntil: "networkidle0"
  });
  await page.setContent(content);
  await page.emulateMedia("screen");
  await page.waitFor(100);
  const pdf = await page.pdf({
    format: "A4",
    printBackground: true
  });
  return pdf;
};
module.exports = generatePDF;
<!-- handlebars partial that handles image loading -->
{{#loop 14}}
  <div class="page">
    <img src="{{getIntro assessmentType this}}" class="intro-page-content">
  </div>
{{/loop}}
<!-- used like this in my main file -->
{{> intro assessmentType="Leadership"}}
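Note on a likely cause: page.setContent() (and data: URLs) leave the document without a file:// base, so a bare filesystem path in an img src has nothing to resolve against. One common workaround, sketched here only as an illustration and assuming the assets are local PNGs with the naming scheme above, is to have the helper inline each image as a base64 data URI so the compiled HTML is self-contained:

// Sketch: inline each asset as a data URI so the HTML needs no base URL.
hbs.registerHelper("getIntro", (context, idx) => {
  const filePath = path.join(__dirname, "assets", `${context}_intro_${idx}.png`);
  const base64 = fs.readFileSync(filePath, "base64");
  // SafeString stops Handlebars from HTML-escaping the "=" padding characters.
  return new hbs.SafeString(`data:image/png;base64,${base64}`);
});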
I have created a Handlebars template and compiled it with my custom data. Everything else works fine; only the image is not displayed in the template.
I have done the following:
Made the images folder static using express
app.use(express.static("assets"));
used the image in the template
<img src="sign.png"/>
Here is my code where I'm using the template
const compile = async function (templateName, data) {
  const filePath = path.join(process.cwd(), 'Template', `${templateName}.hbs`);
  const html = await fs.readFile(filePath, 'utf-8');
  return hbs.compile(html)(data);
};

const browser = await puppeteer.launch();
const page = await browser.newPage();
const content = await compile('Passes', jsonData);
await page.setContent(content);
await page.emulateMediaType('screen');
await page.pdf({
  path: 'PASSES/' + filenameWE + '.pdf',
  format: 'A4',
  printBackground: true
});
console.log("Done ---- " + filenameWE);
await browser.close();
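A related note on why the static middleware alone may not be enough: with page.setContent() the page's URL is about:blank, so a relative src like "sign.png" never reaches the Express server. One option, sketched below with an assumed port of 3000, is to prepend a <base> tag pointing at the running server so relative URLs resolve against it:

// Sketch; the port and base URL are assumptions for illustration.
const content = await compile('Passes', jsonData);
await page.setContent(
  `<base href="http://localhost:3000/">${content}`,
  { waitUntil: 'networkidle0' }
);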
I am trying to download a website as a static copy, i.e. without JS, only HTML, CSS and images.
I've tried several approaches, but some issues remain with the CSS and images.
Here is a snippet:
const puppeteer = require('puppeteer');
const {URL} = require('url');
const fse = require('fs-extra');
const path = require('path');
(async (urlToFetch) => {
  const browser = await puppeteer.launch({
    headless: true,
    slowMo: 100
  });
  const page = await browser.newPage();

  await page.setRequestInterception(true);
  page.on("request", request => {
    if (request.resourceType() === "script") {
      request.abort();
    } else {
      request.continue();
    }
  });

  page.on('response', async (response) => {
    const url = new URL(response.url());
    let filePath = path.resolve(`./output${url.pathname}`);
    if (path.extname(url.pathname).trim() === '') {
      filePath = `${filePath}/index.html`;
    }
    await fse.outputFile(filePath, await response.buffer());
    console.log(`File ${filePath} is written successfully`);
  });

  await page.goto(urlToFetch, {
    waitUntil: 'networkidle2'
  });

  setTimeout(async () => {
    await browser.close();
  }, 60000 * 4);
})('https://stackoverflow.com/');
I've also tried:
const content = await page.content();
fs.writeFileSync('index.html', content, { encoding: 'utf-8' });
as well as downloading it with a CDPSession, and I've tried the website-scraper package.
So what is the best approach for a solution where I provide a website link and it gets downloaded as a static website?
Try using this https://www.npmjs.com/package/website-scraper
It will save the website into a local directory
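Basic usage looks roughly like the sketch below; urls and directory are the package's core options, but check the README of the version you install, since the import style has changed in newer releases:

// Rough sketch of website-scraper usage (verify against the installed version's docs).
const scrape = require('website-scraper');

scrape({
  urls: ['https://stackoverflow.com/'],
  directory: './static-copy'   // target folder for the saved copy
}).then(() => console.log('Saved to ./static-copy'))
  .catch(err => console.error(err));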
Have you tried something like wget or curl?
wget -p https://stackoverflow.com/questions/67559777/download-website-locally-without-javascript-using-puppeteer
The -p flag downloads the page together with its requisites (images, CSS); adding -k as well converts the links so the copy works offline. That should do the trick.
I need to download an image with Puppeteer. The problem is the buffer returned by the goto method: I think it returns a sequence of buffers as the image loads, so the writeFile call only gets the last buffer. Is there another promise-based method to handle the sequence of buffers?
const puppeteer = require('puppeteer-core');
const fs = require('fs').promises;
(async () => {
  const options = {
    product: 'chrome',
    headless: true,
    pipe: true,
    executablePath: 'chrome.exe'
  };
  const browser = await puppeteer.launch(options);
  const page = await browser.newPage();
  const response = await page.goto('https://static.wikia.nocookie.net/naruto/images/d/dd/Naruto_Uzumaki%21%21.png/revision/latest?cb=20161013233552');
  // save buffer to file
  await fs.writeFile('file.jpg', await response.buffer());
  browser.close();
})();
Downloading with Puppeteer is hell...
For simply downloading into a folder on your PC:

await page._client.send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: 'C:\\folder...'  // this sets the destination of files downloaded by Puppeteer
});
// then page.goto ...

Here is a full example: you can use it to download any kind of resource, even if a login is required :-)
const puppeteer = require('puppeteer');
const fs = require('fs').promises;
(async () => {
  let browser = await puppeteer.launch();
  let page = await browser.newPage();
  await page.goto('https://static.wikia.nocookie.net');
  // ...log in here first if the resource requires it

  let fn = async (URI) => {
    // fetch the resource from inside the page, reusing the page's cookies
    const res = await fetch(URI, { 'credentials': 'same-origin' });
    // Content-Disposition: attachment; filename="...."; size="1234"
    let str = res.headers.get('Content-Disposition');
    const regex = /filename="([^"]*)".*size="([^"]*)"/gm;
    let m = regex.exec(str);
    let filename = m ? m[1] : null;
    let size = m ? m[2] : null;
    let blob = await res.blob();
    let bufferArray = await blob.arrayBuffer();
    var base64String = btoa([].reduce.call(new Uint8Array(bufferArray), function (p, c) { return p + String.fromCharCode(c); }, ''));
    return { base64String, size, filename };
  };

  let x = await page.evaluate(fn, 'https://static.wikia.nocookie.net/naruto/images/d/dd/Naruto_Uzumaki%21%21.png/revision/latest?cb=20161013233552');
  // x.base64String <- the file contents, base64-encoded, ready to write
  // x.filename     <- name of the downloaded file (if the header provided one)
  // x.size         <- size (if the header provided one)

  await fs.writeFile('message1.png', x.base64String, { encoding: 'base64' });
  console.log('ok!');
})().then();
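If the resource is simply an image reachable by URL, a lighter alternative (not part of the answer above, just a sketch using standard Puppeteer APIs) is to capture the network response during navigation and write its buffer; the URL predicate below is only an illustration:

// Sketch: save the image straight from its network response.
const puppeteer = require('puppeteer');
const fs = require('fs').promises;

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Start waiting for the matching response before navigation triggers it.
  const [imgResponse] = await Promise.all([
    page.waitForResponse(res => res.url().includes('Naruto_Uzumaki') && res.ok()),
    page.goto('https://static.wikia.nocookie.net/naruto/images/d/dd/Naruto_Uzumaki%21%21.png/revision/latest?cb=20161013233552')
  ]);
  await fs.writeFile('naruto.png', await imgResponse.buffer());
  await browser.close();
})();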
I'm trying to build a small scraper. To reuse functionality, I thought 'Page Object Models' would come in handy.
In main.js I require multiple small scripts; in the example below there is only one model (GooglePage).
The scripts work, but I would like to know how to pass a value from the Google.js script back to the main script.
I want to use the value of the 'pageCountClean' variable in the main.js script in the rest of the application.
I have been searching for information about passing values and functions between scripts: exporting values from page constructors, exporting an async function and awaiting its result, and so on.
But I am lost. Do I have to use Promises? Is the current way of requiring/importing and exporting enough to create the relationship between the scripts?
Any pointers are welcome.
//////////// main.js
const { chromium } = require('playwright');
const { GooglePage } = require('./models/Google');

(async () => {
  const browser = await chromium.launch({ headless: true, slowMo: 250 });
  const context = await browser.newContext();
  const GoogleUrl80 = 'https://www.google.nl/search?q=site%3Anu.nl';

  // Cookie consent:
  console.log('Cookie consent - start');
  const page80 = await browser.newPage();
  await page80.goto('https://google.nl');
  await page80.waitForTimeout(1000);
  await page80.keyboard.press('Tab');
  await page80.keyboard.press('Tab');
  await page80.keyboard.press('Enter');
  console.log('Cookie Consent - done');

  // Number of urls in google.nl (using Google.js)
  await page80.goto(GoogleUrl80, { waitUntil: 'networkidle' });
  const googlePage80 = new GooglePage(page80);
  await googlePage80.scrapeGoogle();
  // Want to console.log 'pageCountClean' here.

  await browser.close();
})();

//////////// Google.js
class GooglePage {
  constructor(page) {
    this.page = page;
  }

  async scrapeGoogle() {
    const GoogleXpath = '//div[@id="result-stats"]';
    const pageCount = await this.page.$eval(GoogleXpath, (el) => el.innerText);
    const pageCountClean = pageCount.split(" ")[1];
    console.log(pageCountClean);
  }
}

module.exports = { GooglePage };
You can just return pageCountClean from your async function and await it in your main.js file:
in Google.js:
async scrapeGoogle() {
  const GoogleXpath = '//div[@id="result-stats"]';
  const pageCount = await this.page.$eval(GoogleXpath, (el) => el.innerText);
  const pageCountClean = pageCount.split(" ")[1];
  console.log(pageCountClean);
  return pageCountClean;
}
in main.js:
const googlePage80 = new GooglePage(page80);
const result = await googlePage80.scrapeGoogle();
console.log(result);
I am currently trying to find the number of pages in a single PDF (or, put differently, the total size of the PDF file) created by puppeteer.page, as per a requirement.
Here's what I did:
try {
  const generatedPdfFilePath = `${directory}/feedback-${requestId}.pdf`;
  const htmlFilePath = `${directory}/report-${requestId}.html`;
  const htmlTemplate =
    fs.readFileSync(path.join(process.cwd(), '/data/feedback-template.hbs'), 'utf-8');
  const template = handlebars.compile(htmlTemplate);
  const htmlFile = minify(template(data), {
    collapseWhitespace: true,
  });
  fs.writeFileSync(htmlFilePath, htmlFile);

  const options = {
    format: 'A4',
    printBackground: true,
    path: generatedPdfFilePath,
  };
  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto(`file://${htmlFilePath}`, {
    waitUntil: 'networkidle0',
    timeout: 300000,
  });
  await page.pdf(options);
  // Do something here to find number of pages in this pdf
  await browser.close();
  resolve({ file: generatedPdfFilePath });
} catch (error) {
  console.log(error);
  reject(error);
}
So far I have created an HTML template for the PDF, then used Puppeteer (headless Chrome for Node.js) to generate the required PDF of the page. But now I'm stuck, because I want to know how many pages are actually in this PDF file, or in other words what the size of the PDF is, which I need for further calculations. I have only included the relevant code here for brevity.
Also, I'm pretty new to Puppeteer. Can someone explain how I can get these details about the PDF? I have been searching for quite some time now with no luck. Puppeteer's documentation doesn't help here; all it gives are the details of the PDF options (docs).
Any help would be much appreciated.
You can use the pdf-parse node module, like this:
const fs = require('fs');
const pdf = require('pdf-parse');
let dataBuffer = fs.readFileSync('path to PDF file...');
pdf(dataBuffer).then(function (data) {
  // number of pages
  console.log(data.numpages);
});
Your code would become something like:
const pdf = require('pdf-parse');
try {
  const generatedPdfFilePath = `${directory}/feedback-${requestId}.pdf`;
  const htmlFilePath = `${directory}/report-${requestId}.html`;
  const htmlTemplate =
    fs.readFileSync(path.join(process.cwd(), '/data/feedback-template.hbs'), 'utf-8');
  const template = handlebars.compile(htmlTemplate);
  const htmlFile = minify(template(data), {
    collapseWhitespace: true,
  });
  fs.writeFileSync(htmlFilePath, htmlFile);

  const options = {
    format: 'A4',
    printBackground: true,
    path: generatedPdfFilePath,
  };
  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto(`file://${htmlFilePath}`, {
    waitUntil: 'networkidle0',
    timeout: 300000,
  });
  await page.pdf(options);
  // Parse the generated PDF (not the HTML file) to count its pages
  let dataBuffer = fs.readFileSync(generatedPdfFilePath);
  const pdfInfo = await pdf(dataBuffer);
  const numPages = pdfInfo.numpages;
  await browser.close();
  resolve({ file: generatedPdfFilePath, numPages });
} catch (error) {
  console.log(error);
  reject(error);
}