How to use Apify on Google Cloud Functions - javascript

I'm deploying some code using Apify as Google Cloud Functions. When triggered, the Cloud Function terminates silently. What am I doing wrong?
I have some working code using Apify 0.15.1. It runs fine locally. Once deployed as a Google Cloud Function, it fails silently without any clear error. The equivalent code using Puppeteer 1.18.1 works fine.
I've reproduced the issue using more simple code below. While this example doesn't strictly require Apify, I would like to be able to use the extra functionality provided by Apify.
Code using Apify:
const Apify = require("apify");
exports.screenshotApify = async (req, res) => {
let imageBuffer;
Apify.main(async () => {
const browser = await Apify.launchPuppeteer({ headless: true });
const page = await browser.newPage();
await page.goto("https://xenaccounting.com");
imageBuffer = await page.screenshot({ fullPage: true });
await browser.close();
});
if (res) {
res.set("Content-Type", "image/png");
res.send(imageBuffer);
}
return imageBuffer;
};
Code using Puppeteer:
const puppeteer = require("puppeteer");
exports.screenshotPup = async (req, res) => {
const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
const page = await browser.newPage();
await page.goto("https://xenaccounting.com");
const imageBuffer = await page.screenshot({ fullpage: true });
await browser.close();
if (res) {
res.set("Content-Type", "image/png");
res.send(imageBuffer);
}
return imageBuffer;
};
Once deployed as a Google Cloud Function (with --trigger-http and --memory=2048), the Puppeteer variant works fine, while the Apify variant terminates silently without result (apart from an 'ok' / HTTP 200 return value).

Get rid of the Apify.main() function, it schedules the call to a later time, after your function already returned the result.

Related

Puppeteer "Failed to launch the browser process!" when launching multiple browsers [duplicate]

So what I am trying to do is to open puppeteer window with my google profile, but what I want is to do it multiple times, what I mean is 2-4 windows but with the same profile - is that possible? I am getting this error when I do it:
(node:17460) UnhandledPromiseRejectionWarning: Error: Failed to launch the browser process!
[45844:13176:0410/181437.893:ERROR:cache_util_win.cc(20)] Unable to move the cache: Access is denied. (0x5)
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
headless:false,
'--user-data-dir=C:\\Users\\USER\\AppData\\Local\\Google\\Chrome\\User Data',
);
const page = await browser.newPage();
await page.goto('https://example.com');
await page.screenshot({ path: 'example.png' });
await browser.close();
})();
Note: It is already pointed in the comments but there is a syntax error in the example. The launch should look like this:
const browser = await puppeteer.launch({
headless: false,
args: ['--user-data-dir=C:\\Users\\USER\\AppData\\Local\\Google\\Chrome\\User Data']
});
The error is coming from the fact that you are launching multiple browser instances at the very same time hence the profile directory will be locked and cannot be moved to reuse by puppeteer.
You should avoid starting chromium instances with the very same user data dir at the same time.
Possible solutions
Make the opened windows sequential, can be useful if you have only a few. E.g.:
const firstFn = async () => await puppeteer.launch() ...
const secondFn = async () => await puppeteer.launch() ...
(async () => {
await firstFn()
await secondFn()
})();
Creating copies of the user-data-dir as User Data1, User Data2 User Data3 etc. to avoid conflict while puppeteer copies them. This could be done on the fly with Node's fs module or even manually (if you don't need a lot of instances).
Consider reusing Chromium instances (if your use case allows it), with browser.wsEndpoint and puppeteer.connect, this can be a solution if you would need to open thousands of pages with the same user data dir.
Note: this one is the best for performance as only one browser will be launched, then you can open as many pages in a for..of or regular for loop as you want (using forEach by itself can cause side effects), E.g.:
const puppeteer = require('puppeteer')
const urlArray = ['https://example.com', 'https://google.com']
async function fn() {
const browser = await puppeteer.launch({
headless: false,
args: ['--user-data-dir=C:\\Users\\USER\\AppData\\Local\\Google\\Chrome\\User Data']
})
const browserWSEndpoint = await browser.wsEndpoint()
for (const url of urlArray) {
try {
const browser2 = await puppeteer.connect({ browserWSEndpoint })
const page = await browser2.newPage()
await page.goto(url) // it can be wrapped in a retry function to handle flakyness
// doing cool things with the DOM
await page.screenshot({ path: `${url.replace('https://', '')}.png` })
await page.goto('about:blank') // because of you: https://github.com/puppeteer/puppeteer/issues/1490
await page.close()
await browser2.disconnect()
} catch (e) {
console.error(e)
}
}
await browser.close()
}
fn()

Download website locally without Javascript using puppeteer

I am trying to download a website as static, I mean without JS, only HTML & CSS.
I've tried many approaches yet some issues still present regarding CSS and Images.
A snippet
const puppeteer = require('puppeteer');
const {URL} = require('url');
const fse = require('fs-extra');
const path = require('path');
(async (urlToFetch) => {
const browser = await puppeteer.launch({
headless: true,
slowMo: 100
});
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on("request", request => {
if (request.resourceType() === "script") {
request.abort()
} else {
request.continue()
}
})
page.on('response', async (response) => {
const url = new URL(response.url());
let filePath = path.resolve(`./output${url.pathname}`);
if(path.extname(url.pathname).trim() === '') {
filePath = `${filePath}/index.html`;
}
await fse.outputFile(filePath, await response.buffer());
console.log(`File ${filePath} is written successfully`);
});
await page.goto(urlToFetch, {
waitUntil: 'networkidle2'
})
setTimeout(async () => {
await browser.close();
}, 60000 * 4)
})('https://stackoverflow.com/');
I've tried using
content = await page.content();
fs.writeFileSync('index.html', content, { encoding: 'utf-8' });
As well as, I download it using CDPSession.
I've tried it using website-scraper
So what is the best approach to come to a solution where I provide a website link, then It downloads it as static website.
Try using this https://www.npmjs.com/package/website-scraper
It will save the website into a local directory
Have you tried something like wget or curl?
wget -p https://stackoverflow.com/questions/67559777/download-website-locally-without-javascript-using-puppeteer
Should do the trick

How do I combine puppeteer plugins with puppeteer clusters?

I have a list of urls that need to be scraped from a website that uses React, for this reason I am using Puppeteer.
I do not want to be blocked by anti-bot servers, for this reason I have added puppeteer-extra-plugin-stealth
I want to prevent ads from loading on the pages, so I am blocking ads by using puppeteer-extra-plugin-adblocker
I also want to prevent my IP address from being blacklisted, so I have used TOR nodes to have different IP addresses.
Below is a simplified version of my code and the setup works (TOR_port and webUrl are assigned dynamically though but for simplifying my question I have assigned it as a variable) .
There is a problem though:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());
var TOR_port = 13931;
var webUrl ='https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';
const browser = await puppeteer.launch({
dumpio: false,
headless: false,
args: [
`--proxy-server=socks5://127.0.0.1:${TOR_port}`,
`--no-sandbox`,
],
ignoreHTTPSErrors: true,
});
try {
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 720 });
await page.goto(webUrl, {
waitUntil: 'load',
timeout: 30000,
});
page
.waitForSelector('.price')
.then(() => {
console.log('The price is available');
await browser.close();
})
.catch(() => {
// close this since it is clearly not a zillow website
throw new Error('This is not the zillow website');
});
} catch (e) {
await browser.close();
}
The above setup works but is very unreliable and I recently learnt about Puppeteer-Cluster. I need it to help me manage crawling multiple pages, to track my scraping tasks.
So, my question is how do I implement Puppeteer-Cluster with the above set-up. I am aware of an example(https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) offered by the library to show how you can implement plugins, but is so bare that I didn't quite understand it.
How do I implement Puppeteer-Cluster with the above TOR, AdBlocker, and Stealth configurations?
You can just hand over your puppeteer Instance like following:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());
const browser = await puppeteer.launch({
puppeteer,
});
Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
You can just add the plugins with puppeteer.use()
You have to use puppeteer-extra.
const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");
(async () => {
const puppeteer = addExtra(vanillaPuppeteer);
puppeteer.use(StealthPlugin());
puppeteer.use(RecaptchaPlugin());
// Do stuff
})();

puppeteer can't get page source after using evaluate()

I'm using puppeteer to interact with a website using the evaluate() function to maniupulate page front (i.e to click on certain items etc...), click through works fine but I can't return the page source after clicking using evaluate.
I have recreated the error in this simplified script below it loads google.com, clicks on 'I feel lucky' and should then return the page source of the loaded page:
const puppeteer = require('puppeteer');
async function main() {
const browser = await puppeteer.launch({
headless: false,
args: ['--no-sandbox']
});
const page = await browser.newPage();
await page.goto('https://www.google.com/', {waitUntil: 'networkidle2'});
response = await page.evaluate(() => {
document.getElementsByClassName('RNmpXc')[1].click()
});
await page.waitForNavigation({waitUntil: 'load'});
console.log(response.text());
}
main();
I get the following error:
TypeError: Cannot read property 'text' of undefined
UPDATE New code following suggestion to use page.content()
const puppeteer = require('puppeteer');
async function main() {
const browser = await puppeteer.launch({
headless: false,
args: ['--no-sandbox']
});
const page = await browser.newPage();
await page.goto('https://www.google.com/', {waitUntil: 'networkidle2'});
await page.evaluate(() => {
document.getElementsByClassName('RNmpXc')[1].click()
});
const source = await page.content()
console.log(source);
}
main();
I am now getting the following error:
Error: Execution context was destroyed, most likely because of a navigation.
My question is: How can I return page source using the .text() method after manipulating the webpage using the evaluate() method?
All suggestions / insight / proposals would be very much appreciated thanks.
Since you're asking for page source after javascript modification, I'd assume you want DOM and not the original HTML content. your evaluate function doesn't return anything which results in undefined response. You can use
const source = await page.evaluate(() => new XMLSerializer().serializeToString(document.doctype) + document.documentElement.outerHTML);
or
const source = await page.content();

Puppeteer Crawler - Error: net::ERR_TUNNEL_CONNECTION_FAILED

Currently I have my Puppeteer running with a Proxy on Heroku. Locally the proxy relay works totally fine. I however get the error Error: net::ERR_TUNNEL_CONNECTION_FAILED. I've set all .env info in the Heroku config vars so they are all available.
Any idea how I can fix this error and resolve the issue?
I currently have
const browser = await puppeteer.launch({
args: [
"--proxy-server=https=myproxy:myproxyport",
"--no-sandbox",
'--disable-gpu',
"--disable-setuid-sandbox",
],
timeout: 0,
headless: true,
});
page.authentication
The correct format for proxy-server argument is,
--proxy-server=HOSTNAME:PORT
If it's HTTPS proxy, you can pass the username and password using page.authenticate before even doing a navigation,
page.authenticate({username:'user', password:'password'});
Complete code would look like this,
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
headless:false,
ignoreHTTPSErrors:true,
args: ['--no-sandbox','--proxy-server=HOSTNAME:PORT']
});
const page = await browser.newPage();
// Authenticate Here
await page.authenticate({username:user, password:password});
await page.goto('https://www.example.com/');
})();
Proxy Chain
If somehow the authentication does not work using above method, you might want to handle the authentication somewhere else.
There are multiple packages to do that, one is proxy-chain, with this, you can take one proxy, and use it as new proxy server.
The proxyChain.anonymizeProxy(proxyUrl) will take one proxy with username and password, create one new proxy which you can use on your script.
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');
(async() => {
const oldProxyUrl = 'http://username:password#hostname:8000';
const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
// Prints something like "http://127.0.0.1:12345"
console.log(newProxyUrl);
const browser = await puppeteer.launch({
args: [`--proxy-server=${newProxyUrl}`],
});
// Do your magic here...
const page = await browser.newPage();
await page.goto('https://www.example.com');
})();

Categories

Resources