Downloading PDF files from a direct download link using node/puppeteer/js - javascript

I need to download some 300 files from a direct download link. When the link is opened in the browser, a PDF download is triggered automatically; the file downloads and the browser doesn't navigate anywhere. The links are as follows:
www.link.com/store/item/123
In the link, the 123 part would be changed on every loop.
I was thinking of using Puppeteer (with goto), but I guess it fails because visiting the link automatically triggers the PDF download instead of actually navigating to a page.
This is what I tried, but it's not working at all:
const puppeteer = require('puppeteer');

const links = ['123', '456'];
const linkBeginning = 'https://www.link.com/store/item/';

(async () => {
  const browser = await puppeteer.launch({
    headless: false // preferably would run with true
  });
  links.forEach(async link => {
    const page = await browser.newPage();
    await page.goto(linkBeginning + link);
    await browser.close();
  });
})();
I searched around but could not really find this specific case; other cases are more focused on the user side or have the target file in the actual link (like xx/store/doc.pdf). I'm not sure whether that makes a difference, though. I just need a script that will get me the PDF files in a one-time run.
A solution in PHP or Python would work just as well, as this is a one-off thing.
Edit: I ended up doing it in HTML:
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
  <script src="sku.js"></script>
  <script>
    const linkStart = 'https://www.sols-europe.com/gb/pdfpublication/pdf/product/sku/';
    sku.forEach(element => {
      document.write('<a target="_blank" class="click" href="' + linkStart + element.id + '">' + element.id + '</a><br>');
    });
  </script>
</head>
<body></body>
</html>
<script>
  const clickInterval = setInterval(function () {
    const el = document.querySelector('.click:not(.clicked)');
    if (el) {
      el.classList.add('clicked');
      el.click();
    } else {
      clearInterval(clickInterval);
    }
  }, 2000);
</script>

You don't need Puppeteer to do this; you can achieve it fairly easily in Node.js:
import https from "https";
import fs from "fs";

(async () => {
  const skus = ["00548", "03575"];
  const filesPromiseArray = skus.map(
    (sku) =>
      new Promise((resolve, reject) => {
        const file = fs.createWriteStream(`${sku}.pdf`);
        https
          .get(`https://www.sols-europe.com/gb/pdfpublication/pdf/product/sku/${sku}`, (response) =>
            response.pipe(file)
          )
          .on("error", reject); // surface network errors too
        file.on("finish", resolve);
        file.on("error", reject);
      })
  );
  try {
    await Promise.all(filesPromiseArray);
  } catch {
    console.log("There was an error downloading one of the files");
  }
})();
What is this code doing?
Taking your array of skus, we are using .map() to transform them into an array of requests.
Inside the .map() we're creating a Promise which will be successful (resolve) when the file finishes downloading, or unsuccessful (reject) if the download errors.
We then await all of the requests we have just created. If any of them fails, the error message is logged.
Note:
If you are using CommonJS ("type":"commonjs", in your package.json), replace the two imports with:
const https = require('https');
const fs = require('fs');
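With roughly 300 files, firing every request at once may run into rate limiting on the server. A minimal batched variation (a sketch: the batch size of 10 is an arbitrary assumption, and downloadOne simply wraps the Promise logic from above):

import https from "https";
import fs from "fs";

// Hypothetical helper wrapping a single download in a Promise,
// mirroring the logic of the answer above.
const downloadOne = (sku) =>
  new Promise((resolve, reject) => {
    const file = fs.createWriteStream(`${sku}.pdf`);
    https
      .get(`https://www.sols-europe.com/gb/pdfpublication/pdf/product/sku/${sku}`, (response) =>
        response.pipe(file)
      )
      .on("error", reject);
    file.on("finish", resolve);
    file.on("error", reject);
  });

(async () => {
  const skus = ["00548", "03575"]; // ...plus the rest of the ~300 ids
  const batchSize = 10; // assumption: a polite concurrency limit
  for (let i = 0; i < skus.length; i += batchSize) {
    // Wait for each batch to finish before starting the next.
    await Promise.all(skus.slice(i, i + batchSize).map(downloadOne));
  }
})();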

Placing browser.close() inside the loop is the problem: the first iteration closes the browser before the other pages can open. Move it outside the loop and close each page with page.close() instead. Note that forEach does not await async callbacks, so a for...of loop is used below to make the iterations actually run in sequence:
const puppeteer = require('puppeteer')

const links = ['123', '456']
const linkBeginning = 'https://www.link.com/store/item/'

;(async () => {
  const browser = await puppeteer.launch({
    headless: false // preferably would run with true
  })
  for (const link of links) {
    const page = await browser.newPage()
    const session = await page.target().createCDPSession()
    await session.send('Page.setDownloadBehavior', {
      behavior: 'allow',
      downloadPath: './pdf/'
    })
    // Navigating straight to a download link can throw net::ERR_ABORTED,
    // so swallow that error -- the download itself still starts.
    await page.goto(linkBeginning + link).catch(() => {})
    await page.close() // Don't use browser.close() inside the loop
  }
  await browser.close() // Use it here instead
})()
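One caveat: Page.setDownloadBehavior doesn't report when a download finishes, so closing the page too early can truncate the file. A simple workaround (a sketch, assuming Chromium's .crdownload naming for in-progress files) is to poll the download directory before closing:

const fs = require('fs')

// Resolve once no *.crdownload partial files remain in dir (simple polling sketch).
// Give the download a moment to start before calling this, or the check may
// pass before the partial file has even been created.
const waitForDownloads = (dir, timeoutMs = 30000) =>
  new Promise((resolve, reject) => {
    const started = Date.now()
    const timer = setInterval(() => {
      const pending = fs.readdirSync(dir).filter(f => f.endsWith('.crdownload'))
      if (pending.length === 0) {
        clearInterval(timer)
        resolve()
      } else if (Date.now() - started > timeoutMs) {
        clearInterval(timer)
        reject(new Error('Download timed out'))
      }
    }, 500)
  })

Calling await waitForDownloads('./pdf/') just before page.close() gives each file time to land on disk.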

Related

Puppeteer Chromium not fully functional when spawned from Electron app

I'm new to this, so please bear with me. I made a simple app for scraping some review data I needed for work; it uses puppeteer and chalk and is fully functional in this form. The code for the working Puppeteer version is as follows:
const puppeteer = require('puppeteer')
const chalk = require('chalk')

async function scrapeIt() {
  const browser = await puppeteer.launch({
    headless: true
  });
  const page = await browser.newPage();
  await page.goto(process.argv[2]);
  await page.hover('#reviews');
  await page.hover('#amenities');
  page.on('console', consoleObj => {
    if (consoleObj.type() === 'log') {
      console.log(chalk.green(consoleObj.text()));
    }
  })
  await page.evaluate(_ => {
    setTimeout(() => {
      everything();
      function everything() {
        let theseReviews = document.querySelectorAll('.review__content')
        let uglyTitle = theseReviews[0].textContent.split('Stayed')[0]
        let regex = new RegExp(".{1}([\\\/]).");
        let regexMatch = RegExp("\\\d{2,6}");
        let cleanTitle = uglyTitle.split(regex)[0];
        theseReviews.forEach(function (el) {
          uglyTitle = el.textContent.split('Stayed')[0]
          cleanTitle = uglyTitle.split(regex)[0];
          if (cleanTitle.match(regexMatch)) {
            console.log(cleanTitle + " MATCH")
          }
        })
      }
      let nextButton = document.querySelector('#reviews > div > div > div > div > div > div.review-list > div.pagination > button.btn.btn-icon.ButtonIcon.btn-default.btn-sm.pagination__next.btn-icon-circle > span.SVGIcon.SVGIcon--16px.flex-center');
      nextButton.onclick = function () {
        setTimeout(() => {
          everything();
        }, 1000);
      }
    }, 1000);
  });
  await page.waitForTimeout(3000)
  await page.click('#reviews')
  const hrefElement = await page.$$('div.review-list > div.pagination .SVGIcon');
  const reviewCount = await page.$('.reviews-summary__reviews_count_small');
  let value = await reviewCount.evaluate(el => el.textContent)
  await page.waitForTimeout(2000)
  console.log(`Acting on ${value}:`)
  await hrefElement[1].click();
  while (hrefElement[1]) {
    await page.waitForTimeout(1000)
    hrefElement[1].click()
  }
  {
    await page.close();
    browser.close();
  }
}
scrapeIt();
My next thought was to make a simple front-end for this code. I didn't want to use endpoints, because running a server for this task seemed like overkill. I figured Electron would be a good choice; my code is as follows (I know inline scripts, nodeIntegration, and generally the way this is implemented are unsafe; it's not intended to ever go beyond personal use):
main.js
const { app, BrowserWindow } = require('electron')

function createWindow() {
  const mainWindow = new BrowserWindow({
    width: 400,
    height: 100,
    webPreferences: {
      nodeIntegration: true,
      contextIsolation: false
    }
  })
  mainWindow.setAlwaysOnTop(true, 'screen');
  mainWindow.loadFile('index.html')
}

app.whenReady().then(() => {
  createWindow()
  app.on('activate', function () {
    if (BrowserWindow.getAllWindows().length === 0) createWindow()
  })
})

app.on('window-all-closed', function () {
  if (process.platform !== 'darwin') app.quit()
})
index.html
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <!-- https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP -->
  <meta http-equiv="Content-Security-Policy" content="default-src 'self'; script-src 'self' 'unsafe-inline'">
  <meta http-equiv="X-Content-Security-Policy" content="default-src 'self'; script-src 'self' 'unsafe-inline'">
  <title>Review Scraper</title>
  <script src="./reviewListing.js"></script>
</head>
<body>
  URL: <input id="title"/>
  <button id="btn" onclick="reviewListing()" type="button">Set</button>
</body>
</html>
reviewListing.js
const puppeteer = require('puppeteer');

async function reviewListing() {
  let title = document.getElementById('title').value
  console.log(title)
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto(title);
  await page.hover('#reviews'); // the reason I've only gone this far is because this is where it stops working.
  await page.hover('#amenities');
}
Puppeteer stops working at this point. It can launch the browser, go to the link provided in the text field, take screenshots, and close the browser, but it seems to have no control over scrolling, clicking, or any of the other baked-in functions that work perfectly when it's not launched from an Electron app.
Error thrown in the app's mainWindow (not Puppeteer):
Passed function is not well-serializable!
I've tried everything I can think of, including different require chains, packaging the application, pointing to a Chrome directory on my machine instead of Chromium, and some ipcMain and renderer solutions, to no avail. I've also tried the puppeteer-in-electron package, which at the time of writing seems to do nothing, not even launch the browser.
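(For reference, a minimal sketch of the ipcMain/renderer split mentioned above; the 'run-scrape' channel name and handler body are illustrative. The idea is to drive Puppeteer entirely from the main process and only pass plain data across IPC:)

// main.js -- run Puppeteer in Electron's main process, not the renderer
const { ipcMain } = require('electron')
const puppeteer = require('puppeteer')

ipcMain.handle('run-scrape', async (event, url) => {
  const browser = await puppeteer.launch({ headless: true })
  const page = await browser.newPage()
  await page.goto(url)
  await page.hover('#reviews') // calls that fail in the renderer work here
  await browser.close()
  return 'done'
})

// reviewListing.js (renderer) -- only the URL crosses the process boundary
const { ipcRenderer } = require('electron')

async function reviewListing() {
  const title = document.getElementById('title').value
  console.log(await ipcRenderer.invoke('run-scrape', title))
}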
Any time or insight is greatly appreciated.

Download website locally without Javascript using puppeteer

I am trying to download a website as a static copy, meaning without JS, only HTML & CSS.
I've tried many approaches, yet some issues remain with CSS and images.
A snippet:
const puppeteer = require('puppeteer');
const { URL } = require('url');
const fse = require('fs-extra');
const path = require('path');

(async (urlToFetch) => {
  const browser = await puppeteer.launch({
    headless: true,
    slowMo: 100
  });
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on("request", request => {
    if (request.resourceType() === "script") {
      request.abort()
    } else {
      request.continue()
    }
  })
  page.on('response', async (response) => {
    const url = new URL(response.url());
    let filePath = path.resolve(`./output${url.pathname}`);
    if (path.extname(url.pathname).trim() === '') {
      filePath = `${filePath}/index.html`;
    }
    await fse.outputFile(filePath, await response.buffer());
    console.log(`File ${filePath} is written successfully`);
  });
  await page.goto(urlToFetch, {
    waitUntil: 'networkidle2'
  })
  setTimeout(async () => {
    await browser.close();
  }, 60000 * 4)
})('https://stackoverflow.com/');
I've tried using
content = await page.content();
fs.writeFileSync('index.html', content, { encoding: 'utf-8' });
as well as downloading it via a CDPSession, and I've also tried the website-scraper package.
So what is the best approach for a solution where I provide a website link and it gets downloaded as a static website?
Try using https://www.npmjs.com/package/website-scraper
It will save the website into a local directory.
Have you tried something like wget or curl?
wget -p https://stackoverflow.com/questions/67559777/download-website-locally-without-javascript-using-puppeteer
That should do the trick.
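If you also want links rewritten so the saved copy browses correctly offline, wget's -k (convert links) and -E (add .html extensions) flags can be combined with -p:
wget -p -k -E https://stackoverflow.com/questions/67559777/download-website-locally-without-javascript-using-puppeteer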

How do I combine puppeteer plugins with puppeteer clusters?

I have a list of URLs that need to be scraped from a website that uses React, which is why I am using Puppeteer.
I do not want to be blocked by anti-bot servers, so I have added puppeteer-extra-plugin-stealth.
I want to prevent ads from loading on the pages, so I am blocking them with puppeteer-extra-plugin-adblocker.
I also want to prevent my IP address from being blacklisted, so I use TOR nodes to rotate through different IP addresses.
Below is a simplified version of my code, and the setup works (TOR_port and webUrl are assigned dynamically, but to simplify my question I have assigned them as variables).
There is a problem though:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

var TOR_port = 13931;
var webUrl = 'https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';

(async () => {
  const browser = await puppeteer.launch({
    dumpio: false,
    headless: false,
    args: [
      `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
      `--no-sandbox`,
    ],
    ignoreHTTPSErrors: true,
  });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto(webUrl, {
      waitUntil: 'load',
      timeout: 30000,
    });
    page
      .waitForSelector('.price')
      .then(async () => { // callback must be async to allow await inside
        console.log('The price is available');
        await browser.close();
      })
      .catch(() => {
        // close this since it is clearly not a zillow website
        throw new Error('This is not the zillow website');
      });
  } catch (e) {
    await browser.close();
  }
})();
The above setup works but is very unreliable, and I recently learnt about Puppeteer-Cluster. I need it to help me manage crawling multiple pages and to track my scraping tasks.
So, my question is how to implement Puppeteer-Cluster with the above set-up. I am aware of an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) offered by the library showing how to use plugins, but it is so bare that I didn't quite understand it.
How do I implement Puppeteer-Cluster with the above TOR, AdBlocker, and Stealth configurations?
You can just hand over your puppeteer instance to Cluster.launch like the following:

const { Cluster } = require('puppeteer-cluster');
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

const cluster = await Cluster.launch({
  puppeteer,
});
Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
You can just add the plugins with puppeteer.use(); you have to use puppeteer-extra.

const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");

(async () => {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(RecaptchaPlugin());
  // Do stuff
})();
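Putting the pieces together, a sketch of the TOR/stealth/adblocker setup from the question on top of puppeteer-cluster could look like the following (the concurrency numbers and task body are assumptions; the puppeteer and puppeteerOptions launch options are documented in the puppeteer-cluster README):

const { addExtra } = require('puppeteer-extra');
const vanillaPuppeteer = require('puppeteer');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(AdblockerPlugin());

  const TOR_port = 13931;
  const cluster = await Cluster.launch({
    puppeteer, // hand the plugin-wrapped instance to the cluster
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2, // assumption: tune to your machine and TOR circuit
    puppeteerOptions: {
      headless: false,
      args: [`--proxy-server=socks5://127.0.0.1:${TOR_port}`, '--no-sandbox'],
      ignoreHTTPSErrors: true,
    },
  });

  // Each queued URL is handed to this task with its own page.
  await cluster.task(async ({ page, data: url }) => {
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto(url, { waitUntil: 'load', timeout: 30000 });
    await page.waitForSelector('.price');
    console.log('The price is available on', url);
  });

  cluster.queue('https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/');

  await cluster.idle();  // wait for the queue to drain
  await cluster.close();
})();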

How to securely preload content in the main process before creating the main window?

I'm building an app (Electron-based) where I need to get information from a third-party website before the main window is created, but I'm a little confused about the security measures. I'm using axios to make the HTTP request inside the main process because it is promise-based, so I can create the window after the website has been fetched. My concerns are:
Enabling nodeIntegration is not good when messing with the renderer process, because of cross-site-scripting attacks. Should I instead put all the Node.js code in a preload.js like the following, for example?
index.html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="Content-Security-Policy" content="script-src 'self';">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Viewer</title>
</head>
<body>
  <div id="box">
    <form id='fo'>
      <input type="text" id="num">
      <button type="button" id="bttn">Random</button>
    </form>
  </div>
  <script src="renderer.js"></script>
</body>
</html>
main.js
const electron = require('electron');
const cheerio = require('cheerio');
const axios = require('axios').default;
const path = require('path');
const { app, BrowserWindow, ipcMain, Menu, MenuItem, session } = electron;

let win;
let url = 'sampletext';

function createWindow() {
  win = new BrowserWindow({
    width: 400,
    height: 250,
    webPreferences: {
      nodeIntegration: false,
      contextIsolation: true,
      preload: path.join(app.getAppPath(), 'preload.js')
    },
    show: false,
  });
  win.loadFile('index.html');
  win.once('ready-to-show', () => {
    win.show();
  });
  win.on('closed', () => {
    win = null;
  });
}

// Note: whenReady().then() expects a function, so wrap the request in a callback.
app.whenReady().then(() => getRequest().then(res => {
  const $ = cheerio.load(res);
  if ($('infoNeeded')) {
    random = get_numbers($('infoNeeded').attr('href'));
  }
  createWindow();
}));

app.on('window-all-closed', () => {
  app.quit();
});

function getRequest() {
  return axios.get(url).then(res => res.data).catch(err => console.log(err));
}
preload.js
// Instead of using getRequest() in main.js, use this file
const electron = require('electron');
const remote = require('electron').remote;
const cheerio = require('cheerio');
const axios = require('axios').default;

let url = 'sampletext';

// So I can use it in renderer.js
window.getReq = function () {
  return axios.get(url).then(res => res.data).catch(err => console.log(err));
};

window.parseInfo = function (data) {
  const $ = cheerio.load(data);
  if ($('infoNeeded')) {
    return random = get_numbers($('infoNeeded').attr('href'));
  }
  return;
};

// Preload the first request
window.getReq().then(doStuffHere);
renderer.js
let info;

// Keep updating the info
setInterval(() => {
  window.getReq().then(data => {
    info = window.parseInfo(data);
  });
}, 10000);
1) Is it OK to use Node.js require inside the main process? If not, what's the secure way of doing it?
2) May I make HTTP requests inside the main process? If yes, should I send a CSP header when doing so?
3) Instead of making the request inside main.js, should I use the webPreferences.preload property and make the first HTTP request inside preload.js (just like the example above)? (I need to get the info before sending it to renderer.js.)
I've already read https://www.electronjs.org/docs/tutorial/security, but I couldn't grasp its teachings. If you could explain how and when to use preload.js and the CSP header, I'd be very grateful.
Yes, it is OK to use Node.js require in the main process (wrap any library call in error handling, because an uncaught error may crash the app).
You can make HTTP requests from the main process.
You can use preload.js if you need the result of the code execution in the renderer process (you can also use IPC); see the sketch below.
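To make the last point concrete, here is a minimal sketch of the IPC route under contextIsolation (the 'fetch-info' channel name is made up; contextBridge, ipcRenderer.invoke, and ipcMain.handle are standard Electron APIs):

// main.js -- do the HTTP request in the main process
const { ipcMain } = require('electron');
const axios = require('axios').default;

const url = 'sampletext'; // placeholder, as in the question

ipcMain.handle('fetch-info', async () => {
  const res = await axios.get(url);
  return res.data;
});

// preload.js -- expose only a narrow, safe API to the renderer
const { contextBridge, ipcRenderer } = require('electron');

contextBridge.exposeInMainWorld('api', {
  fetchInfo: () => ipcRenderer.invoke('fetch-info'),
});

// renderer.js -- no Node access needed here
window.api.fetchInfo().then(data => {
  console.log(data);
});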

Set localstorage items before page loads in puppeteer?

We have some routing logic that kicks you to the homepage if you don't have a JWT_TOKEN set. I want to set this token before the page loads, i.e. before any JS is invoked.
How do I do this?
You have to set the localStorage item like this:
await page.evaluate(() => {
  localStorage.setItem('token', 'example-token');
});
You have to do it after page.goto: the browser must have a URL to register the localStorage item against. After that, navigate to the same page again; this time the token will be there before the page loads.
Here is a fully working example:
const puppeteer = require('puppeteer');
const http = require('http');

const html = `
<html>
  <body>
    <div id="element"></div>
    <script>
      document.getElementById('element').innerHTML =
        localStorage.getItem('token') ? 'signed' : 'not signed';
    </script>
  </body>
</html>`;

http
  .createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.write(html);
    res.end();
  })
  .listen(8080);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://localhost:8080/');
  await page.evaluate(() => {
    localStorage.setItem('token', 'example-token');
  });
  await page.goto('http://localhost:8080/');
  const text = await page.evaluate(
    () => document.querySelector('#element').textContent
  );
  console.log(text);
  await browser.close();
  process.exit(0);
})();
There's some discussion about this in Puppeteer's GitHub issues.
You can load a page on the domain, set your localStorage, then go to the actual page you want to load with localStorage ready. You can also intercept the first URL load to return instantly instead of actually loading the page, potentially saving a lot of time.
const doSomePuppeteerThings = async () => {
  const url = 'http://example.com/';
  const browser = await puppeteer.launch();
  const localStorage = { storageKey: 'storageValue' };
  await setDomainLocalStorage(browser, url, localStorage);
  const page = await browser.newPage();
  // do your actual puppeteer things now
};

const setDomainLocalStorage = async (browser, url, values) => {
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', r => {
    r.respond({
      status: 200,
      contentType: 'text/plain',
      body: 'tweak me.',
    });
  });
  await page.goto(url);
  await page.evaluate(values => {
    for (const key in values) {
      localStorage.setItem(key, values[key]);
    }
  }, values);
  await page.close();
};
In 2021 it works with the following code:
// store the token in localStorage before any page script runs
await page.evaluateOnNewDocument(
  token => {
    localStorage.clear();
    localStorage.setItem('token', token);
  }, 'eyJh...9_8cw');
// open the url
await page.goto('http://localhost:3000/Admin', { waitUntil: 'load' });

Unfortunately, the following snippet from the first answer did not work for me:
await page.evaluate(() => {
  localStorage.setItem('token', 'example-token'); // does not work, produces errors :(
});
Without requiring a double goto, this would work:
const browser = await puppeteer.launch();

browser.on('targetchanged', async (target) => {
  const targetPage = await target.page();
  const client = await targetPage.target().createCDPSession();
  await client.send('Runtime.evaluate', {
    expression: `localStorage.setItem('hello', 'world')`,
  });
});

// newPage, goto, etc...
Adapted from the lighthouse doc for puppeteer that do something similar: https://github.com/GoogleChrome/lighthouse/blob/master/docs/puppeteer.md
Try an additional script tag. For example:
Say you have a main.js script that houses your routing logic,
and a setJWT.js script that houses your token logic.
Then, within the HTML that loads these scripts, order them this way:
<script src='setJWT.js'></script>
<script src='main.js'></script>
This is only good for the initial load of the page.
Most routing libraries, however, have an event-hook system you can tap into before a route renders. I would put the setJWT logic somewhere in that callback; see the sketch below.
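For instance, with a Vue Router-style beforeEach guard (a sketch; the route list, token value, and storage key are made up):

import { createRouter, createWebHistory } from 'vue-router';

const router = createRouter({
  history: createWebHistory(),
  routes: [/* ... */],
});

// Runs before every navigation, i.e. before the next route renders.
router.beforeEach((to, from, next) => {
  if (!localStorage.getItem('JWT_TOKEN')) {
    localStorage.setItem('JWT_TOKEN', 'example-token'); // made-up token
  }
  next(); // continue the navigation
});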
