Puppeteer: how to download entire web page for offline use - javascript

How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), with Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to.
However, looking through the many excellent examples online, there is no obvious method for doing so. The closest I have been able to find is calling
html_contents = await page.content()
and saving the results, but that saves a copy without any non-HTML elements.
Is there a way to save webpages for offline use with Puppeteer?

This is currently possible via the experimental CDP call 'Page.captureSnapshot', using the MHTML format:
'use strict';
const puppeteer = require('puppeteer');
const fs = require('fs');
(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('https://en.wikipedia.org/wiki/MHTML');
    // Open a raw Chrome DevTools Protocol session for this page.
    const cdp = await page.target().createCDPSession();
    // Capture the page together with its resources as a single MHTML document.
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);
    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
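As a hedged extension of the snippet above, the following sketch saves several pages in one browser session and waits for network activity to settle before each snapshot, so that lazily loaded resources have a chance to be included. The names snapshotFileName and savePages, the file-naming scheme, and the command-line usage are illustrative assumptions and not part of Puppeteer; only the 'Page.captureSnapshot' call with format 'mhtml' comes from the answer.

```javascript
const fs = require('fs');

// Hypothetical helper: derive a file-system-safe name from a URL,
// e.g. https://en.wikipedia.org/wiki/MHTML -> en.wikipedia.org_wiki_MHTML.mhtml
function snapshotFileName(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/[^a-zA-Z0-9.-]+/g, '_') // collapse unsafe characters
    .replace(/^_+|_+$/g, '');         // trim leading/trailing underscores
  return `${slug}.mhtml`;
}

// Hypothetical wrapper around the CDP call shown in the answer.
async function savePages(urls) {
  // Required lazily so the helper above works without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    for (const url of urls) {
      // networkidle0: wait until there are no open network connections for 500 ms.
      await page.goto(url, { waitUntil: 'networkidle0' });
      const cdp = await page.target().createCDPSession();
      const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
      await cdp.detach();
      fs.writeFileSync(snapshotFileName(url), data);
    }
  } finally {
    await browser.close();
  }
}

// Usage: node save-pages.js https://en.wikipedia.org/wiki/MHTML [more URLs...]
const urls = process.argv.slice(2);
if (urls.length > 0) {
  savePages(urls).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

The resulting .mhtml files can be opened directly in Chromium-based browsers. The try/finally ensures the browser is closed even if one of the pages fails to load.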

Related

I am trying to scrape an article title from a website by its XPath with JavaScript, but I am getting "undefined"

I was following a quick scraping tutorial, but after following each step I am not able to replicate what was shown.
Basically, I want to scrape an article title by copying the XPath of the element from a website, but I keep getting errors. Have a look at my code below.
const puppeteer = require('puppeteer');
async function scrapeProduct(url){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const [el] = await page.$x('/html/body/div[2]/div[4]/div/div[1]/section[1]/div[2]/section/div/div/div/div[2]/a[2]/span[2]/span');
  const src = await el.getProperty('src');
  const srcTxt = await src.jsonValue();
  console.log(srcTxt);
}
scrapeProduct('https://www.kerkida.net/');
I receive the output "undefined".
Node.js and Puppeteer are already installed.
Please advise on possible solutions.
I tried scraping an article title by copying the XPath of the element from the website and expected to receive it in JSON format.

Puppeteer page.content() not matching the html in chromium

When I try to run this:
const puppeteer = require('puppeteer')
async function start() {
  const browser = await puppeteer.launch({headless: false, slowMo: 250})
  const page = await browser.newPage()
  await page.goto('https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte', {waitUntil: "domcontentloaded"})
  console.log(await page.content())
  await browser.close()
}
start()
The output of page.content() does not match the HTML I can see in a browser. I'm assuming the website is JS-based and I am just getting the original (not hydrated) content of the website as it was when it reached Chromium.
How do I get the HTML content I can see in Chromium? I actually do not need the entire page, just some data from it. But page.$$ was not helpful either.
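One possible cause, stated here as an assumption rather than a diagnosis of this particular site, is that waitUntil: "domcontentloaded" resolves before the page's scripts have rendered the data. The sketch below waits for network activity to settle and for a selector to appear before reading the DOM. The helper name renderedHtml and the 'table' selector are hypothetical placeholders; page.goto, page.waitForSelector and page.content are the only Puppeteer calls used.

```javascript
// Hypothetical helper: navigate, wait for script-rendered content, then read the DOM.
// `selector` should match an element that only exists after the page's scripts have run.
async function renderedHtml(page, url, selector) {
  await page.goto(url, { waitUntil: 'networkidle0' });
  await page.waitForSelector(selector);
  return page.content();
}

async function main() {
  const puppeteer = require('puppeteer'); // loaded only when actually scraping
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const html = await renderedHtml(
      page,
      'https://www.lanuv.nrw.de/umwelt/luft/immissionen/messorte-und-werte',
      'table' // placeholder: replace with a selector for the data you need
    );
    console.log(html.length);
  } finally {
    await browser.close();
  }
}

// Run only when asked to, e.g.: node rendered-html.js run
if (process.argv[2] === 'run') {
  main().catch(console.error);
}
```

If the data lives inside an iframe or is loaded only after user interaction, waiting for a selector on the top-level page will not be enough, and this sketch would need to be adapted.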

How do you scroll down in a sub scrollbar in Puppeteer?

I am new to using Puppeteer and I am trying to scrape some store hour information from a store website. When you open the section that has this information, there is a scroll bar (not the actual browser one) that you need to scroll all the way down to the bottom to load all of the data. Currently, I am extracting all the data on the page using the code from user codetinker from this Stack Overflow answer.
It is:
const puppeteer = require('puppeteer');
(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('https://netto.dk/find-butik/?mapData={%22coordinates%22:{%22lat%22:54.41413535660238,%22lng%22:13.985595703125002},%22zoom%22:7,%22input%22:%22%22}', { waitUntil: 'networkidle0' });
    const data = await page.evaluate(() => document.querySelector('*').outerHTML);
    console.log(data);
    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
But this doesn't work, as the sub scrollbar hasn't been scrolled all the way down when Puppeteer does its magic.
What I want to accomplish is have Puppeteer scroll all the way down in this "sub" scrollbar so I can scrape all the data I need. Is there any way to accomplish this?
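A minimal sketch of one way to approach this, assuming the inner scrollbar belongs to an ordinary scrollable element: run a function inside the page that repeatedly sets the element's scrollTop to its scrollHeight until the height stops growing. The function name scrollContainerToEnd, the '.store-list' selector, the 500 ms pause and the limit of 50 rounds are all hypothetical. The real selector has to be looked up in the page's DOM.

```javascript
// Runs inside the page via page.evaluate, so it must not reference outer variables.
// Returns the number of scroll steps taken, or -1 if the selector matched nothing.
async function scrollContainerToEnd(selector, pauseMs, maxRounds) {
  const el = document.querySelector(selector);
  if (!el) return -1;
  let rounds = 0;
  let previousHeight = -1;
  while (rounds < maxRounds && el.scrollHeight !== previousHeight) {
    previousHeight = el.scrollHeight;
    el.scrollTop = el.scrollHeight; // jump to the current bottom
    await new Promise((resolve) => setTimeout(resolve, pauseMs)); // let more items load
    rounds += 1;
  }
  return rounds;
}

async function main() {
  const puppeteer = require('puppeteer'); // loaded only when actually scraping
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://netto.dk/find-butik/?mapData={%22coordinates%22:{%22lat%22:54.41413535660238,%22lng%22:13.985595703125002},%22zoom%22:7,%22input%22:%22%22}', { waitUntil: 'networkidle0' });
    // '.store-list' is a placeholder selector, not taken from the real site.
    const rounds = await page.evaluate(scrollContainerToEnd, '.store-list', 500, 50);
    console.log(`scroll steps: ${rounds}`);
    console.log(await page.content());
  } finally {
    await browser.close();
  }
}

// Run only when asked to, e.g.: node scroll.js run
if (process.argv[2] === 'run') {
  main().catch(console.error);
}
```

The loop stops once a scroll step no longer increases scrollHeight, which is only a heuristic: if the site loads items more slowly than the pause, the pause would need to be raised.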

Puppeteer Error, Cannot read property 'getProperty' of undefined while scraping white pages

I'm trying to scrape an address from whitepages.com, but my scraper keeps throwing this error every time I run it.
(node:11389) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'getProperty' of undefined
Here's my code:
const puppeteer = require('puppeteer')
async function scrapeAddress(url){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url,{timeout: 0, waitUntil: 'networkidle0'});
  const [el]= await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  // console.log(el)
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({rawTxt});
  browser.close();
}
scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
After investigating a bit, I realized that the el variable is getting returned as undefined and I'm not sure why. I've tried this same code to get elements from other sites but only for this site am I getting this error.
I tried both the full and short XPath as well as other surrounding elements and everything on this site throws this error.
Why would this be happening and is there any way I can fix it?
You can try wrapping everything in a try/catch block; otherwise, try unwrapping the promise with then().
const puppeteer = require('puppeteer');
const url = 'https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs';
(async() => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url,{timeout: 0, waitUntil: 'networkidle0'});
    const [el]= await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue();
    console.log({rawTxt});
  } catch (err) {
    // Reports the TypeError instead of leaving an unhandled rejection.
    console.error(err.message);
  } finally {
    // Always close the browser, even when scraping fails.
    await browser.close();
  }
})();
The reason is that the website detects Puppeteer as an automated bot. Set headless to false and you can see that it never navigates to the website.
I'd suggest using puppeteer-extra-plugin-stealth. Also, always make sure to wait for the element to appear on the page.
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());
async function scrapeAddress(url){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url,{waitUntil: 'networkidle0'});
  //wait for xpath
  await page.waitForXPath('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  const [el]= await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
  // console.log(el)
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({rawTxt});
  await browser.close();
}
scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
I recently ran into this error, and changing my XPath worked for me. I had been using the full XPath, and it was causing some issues.
Most probably this is because the website is responsive, so when the scraper runs, the page is laid out differently and the element has a different XPath.
I would suggest debugging by running the browser in non-headless mode:
const browser = await puppeteer.launch({headless: false});
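If the element really does sit at different XPaths depending on the layout, one hedged way to cope is to try several candidate XPaths in order and to return null instead of throwing when none match. The helper name firstTextByXPath and the example XPaths in the usage comment are hypothetical; the sketch assumes a Puppeteer version that provides page.$x, as used throughout this thread.

```javascript
// Hypothetical helper: return the trimmed text of the first element matched by
// any of the candidate XPaths, or null if none of them match.
async function firstTextByXPath(page, xpaths) {
  for (const xpath of xpaths) {
    const [el] = await page.$x(xpath);
    if (!el) continue; // this layout variant does not contain the element
    const handle = await el.getProperty('textContent');
    const text = await handle.jsonValue();
    if (typeof text === 'string') return text.trim();
  }
  return null;
}

// Usage inside an async function, with illustrative XPaths only:
//   const text = await firstTextByXPath(page, [
//     '//*[@id="left"]//h3/span[1]',
//     '//h3/span[1]',
//   ]);
//   if (text === null) console.error('element not found');
```

Checking for a missing match explicitly avoids the "Cannot read property 'getProperty' of undefined" error from the question, and short, attribute-anchored XPaths tend to survive layout changes better than a full absolute path.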
I took the code that @mbit provided and modified it to my needs, and also ran the browser in non-headless mode. I was unable to get it working in headless mode. If anyone was able to figure out how to do that, please explain. Here is my solution:
First you must install a couple of things, so run the following two commands in your shell:
npm install puppeteer-extra
npm install puppeteer-extra-plugin-stealth
Installing these will allow you to run the first few lines in @mbit's code.
Then in this line of code:
const browser = await puppeteer.launch();
as a parameter to puppeteer.launch(); pass in the following:
{headless: false}
which should in turn look like this:
const browser = await puppeteer.launch({headless: false});
I also believe that the XPath that @mbit was using may not exist anymore, so provide one of your own, as well as a site. You can do this using the following 3 lines of code; just replace {XPath} with your own XPath and {address} with your own web address. NOTE: be mindful of your quotes ('' or ""), as the XPath may contain the same kind of quotes that you use to delimit the string, which will break your path.
await page.waitForXPath({XPath});
const [el]= await page.$x({XPath});
scrapeAddress({address})
After you do this, you should be able to run your code and retrieve values.
Here's what my code looked like in the end; feel free to copy and paste it into your own file to confirm that it works on your end:
const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
// Register the stealth plugin. Do not re-require plain 'puppeteer' afterwards, or the plugin is bypassed.
puppeteer.use(pluginStealth());
async function scrapeAddress(url){
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto(url,{waitUntil: 'networkidle0'});
  //wait for xpath
  await page.waitForXPath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
  const [el]= await page.$x('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  console.log({rawTxt});
  await browser.close();
}
scrapeAddress("https://stockx.com/air-jordan-1-retro-high-unc-leather")

Puppeteer performance timeline?

Is there a way to record a performance timeline for tests run with Puppeteer?
Yes, just use page.tracing methods like in this example:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.tracing.start({ path: 'trace.json' });
  await page.goto('https://en.wikipedia.org');
  await page.tracing.stop();
  await browser.close();
})();
Then load the trace.json file in the Performance tab of Chrome DevTools. If you want more details, here is an article with a chapter dedicated to analyzing page tracing.
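For a quick look without opening DevTools, the trace file can also be summarized from Node. The sketch below is an assumption-laden example rather than an official tool: it relies on the Chrome trace event format, where the file is either an array of events or an object with a traceEvents array, and where complete events have "ph": "X" and a dur field in microseconds. The function name summarizeTrace and the script name are hypothetical.

```javascript
const fs = require('fs');

// Sum the duration (in milliseconds) of complete events ("ph": "X") per event
// name and return the `topN` largest totals, sorted in descending order.
function summarizeTrace(trace, topN = 5) {
  const events = Array.isArray(trace) ? trace : trace.traceEvents || [];
  const totals = new Map();
  for (const ev of events) {
    if (ev.ph !== 'X' || typeof ev.dur !== 'number') continue;
    totals.set(ev.name, (totals.get(ev.name) || 0) + ev.dur / 1000); // dur is in microseconds
  }
  return [...totals.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([name, ms]) => ({ name, ms }));
}

// Usage: node summarize-trace.js trace.json
const tracePath = process.argv[2];
if (tracePath) {
  const trace = JSON.parse(fs.readFileSync(tracePath, 'utf8'));
  console.table(summarizeTrace(trace));
}
```

Nested events are counted in full for each name, so the totals overlap and should be read as a rough ranking, not as an exact breakdown of where time was spent.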
