How to use proxy servers with AWS Lambda + Puppeteer? - javascript

I am trying to run Puppeteer with the proxy-chain package on AWS Lambda, but I am getting this error message:
"errorType": "Error",
"errorMessage": "Protocol error (Target.createTarget): Target closed.",
Code:
const chromium = require('chrome-aws-lambda');
const { addExtra } = require("puppeteer-extra");
const puppeteerExtra = addExtra(chromium.puppeteer);
const proxyChain = require('proxy-chain');
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteerExtra.use(StealthPlugin());

exports.handler = async (event, context, callback) => {
  let finalResult = [];
  const url = ``;
  let browser;
  const oldProxyUrl = ''; // --> bright data proxy
  const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
  console.log("newProxyUrl", newProxyUrl);
  try {
    browser = await puppeteerExtra.launch({
      args: ['--no-sandbox', '--disable-setuid-sandbox', `--proxy-server=${newProxyUrl}`],
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless
    });
    const page = await browser.newPage();
    await page.goto(url);
    finalResult = await extractElements(page);
  } catch (error) {
    return callback(error);
  } finally {
    if (browser) await browser.close(); // guard: launch may have failed, leaving browser undefined
  }
  return callback(null, finalResult);
};
The above code works fine on AWS Lambda without the proxy-server URL. I also tested the same code without a proxy server URL on serverless platforms like Vercel and Netlify, and it worked. The only issue is that when I add the proxy server URL, it throws the protocol error.

Here are a few things you can try to troubleshoot this issue:
- Make sure that the url variable has a value. It is currently an empty string, which means that the page.goto() method will not have a valid URL to navigate to.
- Make sure that the oldProxyUrl variable has a value. It is currently an empty string, which means that the proxyChain.anonymizeProxy() method will not have a valid proxy to anonymize.
- Make sure that the extractElements() function is defined and can be called. This function is not present in the code you provided, so you may need to include it or remove the call to it.
- Check the logs of your AWS Lambda function to see if there are any additional error messages that might provide more information about the issue.
- Check the documentation for the puppeteer-extra-plugin-stealth and proxy-chain packages to see if there are any known issues or compatibility problems with AWS Lambda.
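One more thing worth trying: skip proxy-chain entirely and pass the proxy credentials with page.authenticate() instead, so no local forwarding server has to run inside the Lambda sandbox. Here is a minimal sketch; the host, port and credentials are hypothetical placeholders, not your real Bright Data values:
const chromium = require('chrome-aws-lambda');

exports.handler = async (event) => {
  // Hypothetical placeholders -- substitute your real proxy host, port and credentials.
  const browser = await chromium.puppeteer.launch({
    args: [...chromium.args, '--proxy-server=http://proxy.example.com:22225'],
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless
  });
  try {
    const page = await browser.newPage();
    // Chrome prompts for proxy credentials; page.authenticate() answers that prompt.
    await page.authenticate({ username: 'proxy-user', password: 'proxy-pass' });
    await page.goto('https://example.com');
    return await page.title();
  } finally {
    await browser.close();
  }
};
If this variant works, the problem is likely the proxy-chain local server rather than the proxy itself.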

Related

Is there a way to open multiple tabs simultaneously on Playwright or Puppeteer to complete the same tasks?

I just started coding, and I was wondering if there was a way to open multiple tabs concurrently. Currently, my code goes something like this:
const puppeteer = require("puppeteer");

const rand_url = "https://www.google.com";

async function initBrowser() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(rand_url);
  await page.setViewport({
    width: 1200,
    height: 800,
  });
  return page;
}
async function login(page) {
  await page.goto("https://www.google.com");
  await page.waitForTimeout(100); // page.waitFor() is deprecated
  await page.type("input[id='user_login']", "xxx"); // note the closing ] in the selector
  await page.waitForTimeout(100);
  await page.type("input[id='user_password']", "xxx");
}
This is not my exact code (I have replaced things with different aliases), but you get the idea. I was wondering whether anyone knows how to open this same browser setup in multiple instances, replacing only the respective login info in each. Of course, it would be great to prevent my IP from getting banned too, so if there was a way to apply proxies to each respective "browser"/instance, that would be perfect.
Lastly, I would like to know whether Playwright or Puppeteer is superior in the way they handle these multiple instances. I don't even know if this is a possibility, but please enlighten me. I want to learn more.
You can use multiple browser windows with different logins/cookies.
For simplicity, you can use the puppeteer-cluster module by Thomas Dondorf.
This module launches and queues your Puppeteer tasks one by one, so you can use it to automate your logins and even save the login cookies for subsequent launches.
Feel free to go to the GitHub repo: https://github.com/thomasdondorf/puppeteer-cluster
const { Cluster } = require('puppeteer-cluster'); // semicolon required before the IIFE below

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2, // <= number of parallel tasks running simultaneously
  }); // You can set this to the number of CPUs instead:
  const cpuNumber = require('os').cpus().length; // for example

  await cluster.task(async ({ page, data: [username, password] }) => {
    await page.goto('https://www.example.com');
    await page.waitForTimeout(100);
    await page.type('input[id="user_login"]', username);
    await page.waitForTimeout(100);
    await page.type('input[id="user_password"]', password);
    const screen = await page.screenshot();
    // Store screenshot, save cookies, do something else
  });

  cluster.queue(['myFirstUsername', 'PassW0Rd1']);
  cluster.queue(['anotherUsername', 'Secr3tAgent!']);
  // cluster.queue([username, password])
  // username and password arrays are passed into the cluster task function
  // many more pages/accounts

  await cluster.idle();
  await cluster.close();
})();
For Playwright, sadly still unsupported by the module above, you can use a browser pool (cluster) module to automate the Playwright launcher.
And for proxy usage, I recommend the Puppeteer library as the legendary one.
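If you only need one proxy per browser instance, you can also do it without a cluster: every puppeteer.launch() call accepts its own --proxy-server argument. Here is a minimal sketch, where the proxy addresses, login URL and selectors are hypothetical placeholders:
const puppeteer = require('puppeteer');

// Hypothetical accounts and proxies for illustration.
const accounts = [
  { username: 'myFirstUsername', password: 'PassW0Rd1', proxy: 'proxy1.example.com:8000' },
  { username: 'anotherUsername', password: 'Secr3tAgent!', proxy: 'proxy2.example.com:8000' },
];

(async () => {
  await Promise.all(accounts.map(async ({ username, password, proxy }) => {
    // One browser per account, each routed through its own proxy.
    const browser = await puppeteer.launch({
      args: ['--no-sandbox', `--proxy-server=${proxy}`],
    });
    const page = await browser.newPage();
    await page.goto('https://www.example.com/login'); // placeholder login page
    await page.type('input[id="user_login"]', username);
    await page.type('input[id="user_password"]', password);
    await browser.close();
  }));
})();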
Don't forget to choose my answer as the right one if this helps you.
There are profiling and proxy options; you could combine them to achieve your goal:
Profile, https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context
import { chromium } from 'playwright'

const userDataDir = '/tmp/' + process.argv[2] // quote the path; a bare /tmp/ is a RegExp literal
const browserContext = await chromium.launchPersistentContext(userDataDir)
// ...
Proxy, https://playwright.dev/docs/api/class-browsertype#browser-type-launch
import { chromium } from 'playwright'

const proxy = { /* secret */ }
const browser = await chromium.launch({
  // A top-level proxy is required to enable per-context proxies;
  // it can be a dummy value such as 'http://per-context'.
  proxy: { server: 'http://per-context' }
})
const browserContext = await browser.newContext({
  proxy: {
    server: `http://${proxy.ip}:${proxy.port}`,
    username: proxy.username,
    password: proxy.password,
  }
})
// ...
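To combine both, launchPersistentContext() accepts the same proxy option directly, so each profile directory can be paired with its own proxy. A minimal sketch, with hypothetical profile names and proxy details:
import { chromium } from 'playwright'

// Hypothetical values for illustration.
const userDataDir = '/tmp/profile-' + process.argv[2]
const browserContext = await chromium.launchPersistentContext(userDataDir, {
  proxy: {
    server: 'http://proxy.example.com:8000',
    username: 'user',
    password: 'secret',
  },
})
// Pages in this context share the profile's cookies and use its proxy.
const page = await browserContext.newPage()
await page.goto('https://www.example.com')
await browserContext.close()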

The npm puppeteer package is returning an error regarding node_package space

Here is the error I'm getting when I run npm install puppeteer (the screenshot is not reproduced here).
I found some stuff online about permissions, but this is about node_package space. It is not disk space; I've checked my disk storage availability and there's plenty. I'm working with the Apify SDK and following the documentation, but the console returns a whole bunch of error messages.
Can someone please help?
const Apify = require('apify');

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  await requestQueue.addRequest({ url: 'https://www.iana.org/' });
  const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    handlePageFunction: async ({ request, page }) => {
      const title = await page.title();
      console.log(`Title of ${request.url}: ${title}`);
      await Apify.utils.enqueueLinks({
        requestQueue,
        page,
        pseudoUrls: ['https://www.iana.org/[.*]'],
      });
    },
  });
  await crawler.run();
});

How to use Apify on Google Cloud Functions

I'm deploying some code using Apify as Google Cloud Functions. When triggered, the Cloud Function terminates silently. What am I doing wrong?
I have some working code using Apify 0.15.1. It runs fine locally. Once deployed as a Google Cloud Function, it fails silently without any clear error. The equivalent code using Puppeteer 1.18.1 works fine.
I've reproduced the issue with the simpler code below. While this example doesn't strictly require Apify, I would like to be able to use the extra functionality it provides.
Code using Apify:
const Apify = require("apify");

exports.screenshotApify = async (req, res) => {
  let imageBuffer;
  Apify.main(async () => {
    const browser = await Apify.launchPuppeteer({ headless: true });
    const page = await browser.newPage();
    await page.goto("https://xenaccounting.com");
    imageBuffer = await page.screenshot({ fullPage: true });
    await browser.close();
  });
  if (res) {
    res.set("Content-Type", "image/png");
    res.send(imageBuffer);
  }
  return imageBuffer;
};
Code using Puppeteer:
const puppeteer = require("puppeteer");

exports.screenshotPup = async (req, res) => {
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
  const page = await browser.newPage();
  await page.goto("https://xenaccounting.com");
  const imageBuffer = await page.screenshot({ fullPage: true }); // note: fullPage, not fullpage
  await browser.close();
  if (res) {
    res.set("Content-Type", "image/png");
    res.send(imageBuffer);
  }
  return imageBuffer;
};
Once deployed as a Google Cloud Function (with --trigger-http and --memory=2048), the Puppeteer variant works fine, while the Apify variant terminates silently without result (apart from an 'ok' / HTTP 200 return value).
Get rid of the Apify.main() wrapper; it schedules the call for a later time, after your function has already returned its result.
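In other words, await the work directly inside the handler. A minimal sketch of the Apify variant without Apify.main(), assuming Apify 0.15.x as in the question and leaving the rest of the setup unchanged:
const Apify = require("apify");

exports.screenshotApify = async (req, res) => {
  // Await the browser work directly instead of wrapping it in Apify.main(),
  // so the function does not return before the screenshot is taken.
  const browser = await Apify.launchPuppeteer({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://xenaccounting.com");
  const imageBuffer = await page.screenshot({ fullPage: true });
  await browser.close();
  if (res) {
    res.set("Content-Type", "image/png");
    res.send(imageBuffer);
  }
  return imageBuffer;
};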

Using TestCafe 'RequestLogger' to retrieve Chrome performance logs fails

This is a continuation of this thread: Is there a way in TestCafe to validate Chrome network calls?
Here is my TestCafe attempt to retrieve all the network logs (i.e. the Network tab in developer tools) from Chrome. I am having issues getting anything printed to the console.
import { RequestLogger } from 'testcafe';

const logger = RequestLogger('https://studenthome.com', {
  logRequestHeaders: true,
  logRequestBody: true,
  logResponseHeaders: true,
  logResponseBody: true
});

test('My test - Demo', async t => {
  await t.navigateTo('https://appURL.com/app/home/students'); // navigate to app launch
  await page_students.click_studentNameLink(); // click on student name
  await t
    .expect(await page_students.exists_ListPageHeader()).ok('do something async', { allowUnawaitedPromise: true }); // validate list header
  await t
    .addRequestHooks(logger); // start tracking requests
  let url = await page_studentList.click_tab(); // click on the tab for which requests need to be validated
  let c = await logger.count; // check count of requests. Should be 66
  console.log(c);
  console.log(logger.requests[2]); // get the url for the 2nd request
});
I see this in console:
[Function: count]
undefined
Here is a picture from Google as an illustration of what I am trying to achieve (screenshot not reproduced here). I navigate to google.com and open developer tools > Network tab. Then I click on the store link and capture the logs. The request URLs I am trying to collect are highlighted; I can get all the URLs and then filter down to the one I require.
I have already tried the following:
await console.log(logger.requests); // undefined
await console.log(logger.requests[*]); // undefined
await console.log(logger.requests[0].response.headers); //undefined
await logger.count();//count not a function
I would appreciate it if someone could point me in the right direction.
You are using different URLs in your test page ('https://appURL.com/app/home/students') and your logger ('https://studenthome.com'). This is probably the cause.
Your RequestLogger records only requests to 'https://studenthome.com'.
In your screenshot I see the URL 'http://store.google.com', which differs from the logger URL, so the logger does not process it.
You can pass a RegExp as the first argument of the RequestLogger constructor to log all requests that match it.
I have created a sample:
import { RequestLogger } from 'testcafe';

const logger = RequestLogger(/google/, {
  logRequestHeaders: true,
  logRequestBody: true,
  logResponseHeaders: true,
  logResponseBody: true
});

fixture `test`
  .page('http://google.com');

test('test', async t => {
  await t.addRequestHooks(logger);
  await t.typeText('input[name="q"]', 'test');
  await t.typeText('input[name="q"]', '1');
  await t.typeText('input[name="q"]', '2');
  await t.pressKey('enter');
  const logRecord = logger.requests.length;
  console.log(logger.requests.map(r => r.request.url));
});
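As for the [Function: count] output in the question: logger.count is a method that takes a predicate and returns a promise, so it must be called (and awaited, or passed to an assertion) rather than read as a property. A small sketch, assuming it runs inside the test body above:
// Inside the test body, after the requests have been made:
const total = await logger.count(() => true); // count every recorded request
console.log(total);

// Or assert on it directly and let TestCafe retry the assertion:
await t.expect(logger.count(r => r.response.statusCode === 200)).gte(1);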

Puppeteer Crawler - Error: net::ERR_TUNNEL_CONNECTION_FAILED

Currently I have my Puppeteer running with a proxy on Heroku. Locally the proxy relay works totally fine; on Heroku, however, I get the error Error: net::ERR_TUNNEL_CONNECTION_FAILED. I've set all the .env info in the Heroku config vars, so it is all available.
Any idea how I can fix this error and resolve the issue?
I currently have
const browser = await puppeteer.launch({
  args: [
    "--proxy-server=https=myproxy:myproxyport",
    "--no-sandbox",
    "--disable-gpu",
    "--disable-setuid-sandbox",
  ],
  timeout: 0,
  headless: true,
});
The correct format for the proxy-server argument is:
--proxy-server=HOSTNAME:PORT
If it's an HTTPS proxy, you can pass the username and password using page.authenticate before doing any navigation:
await page.authenticate({ username: 'user', password: 'password' });
The complete code would look like this:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
    args: ['--no-sandbox', '--proxy-server=HOSTNAME:PORT']
  });
  const page = await browser.newPage();
  // Authenticate here (placeholders for your real credentials)
  await page.authenticate({ username: 'user', password: 'password' });
  await page.goto('https://www.example.com/');
})();
Proxy Chain
If the authentication somehow does not work with the method above, you might want to handle it elsewhere.
There are multiple packages for that; one is proxy-chain. With it, you can take one proxy and expose it as a new local proxy server.
proxyChain.anonymizeProxy(proxyUrl) takes one proxy with username and password and creates a new, credential-free proxy that you can use in your script.
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
  const oldProxyUrl = 'http://username:password@hostname:8000'; // note the @ separating credentials from host
  const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);

  // Prints something like "http://127.0.0.1:12345"
  console.log(newProxyUrl);

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${newProxyUrl}`],
  });

  // Do your magic here...
  const page = await browser.newPage();
  await page.goto('https://www.example.com');
})();
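When you are done, proxy-chain can also shut the local forwarding server down. A short usage note, assuming the newProxyUrl from the snippet above:
// Close the browser, then the local anonymized proxy (true force-closes open connections).
await browser.close();
await proxyChain.closeAnonymizedProxy(newProxyUrl, true);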
