page.setCookie has no effect when using Puppeteer on Ubuntu - javascript

I am having a strange issue with Puppeteer. My current cookie-handling code is as follows:
(Save cookies)
const cookies = await page.cookies();
await checkMongoConnection();
account.cookies = JSON.stringify(cookies);
await account.save();
await closeMongoConnection();
(Load cookies)
const options = {
  headless: true,
  defaultViewport: { width: 1366, height: 768 },
  ignoreHTTPSErrors: true,
  args: [
    '--disable-sync',
    '--ignore-certificate-errors'
  ],
  ignoreDefaultArgs: ['--enable-automation']
};
const browser = await puppeteer.launch(options);
const page = await browser.newPage();
// Cookies
if (account.cookies) {
  // I have checked this with console.log; it does contain cookies
  const cookies = JSON.parse(account.cookies);
  await page.setCookie(...cookies);
}
await page.goto('https://www.some-website.com');
This works without any issues on macOS (with headless set to both false and true); note that I am using Chromium.
However, when I run this setup on my Ubuntu Linux server, setting the cookies simply has no effect. Has anyone else come across this issue before? Any ideas what I might be doing wrong here?
When I log the cookies I get from the database I get something like:
[
  {
    name: 'personalization_id',
    value: '"v1_FijCjdT7iRj3K+cbhPiPIg=="',
    domain: '.somedomain.com',
    path: '/',
    expires: 1664967308.337118,
    size: 47,
    httpOnly: false,
    secure: true,
    session: false,
    sameSite: 'None'
  },
  ...
]
I have tried logging the cookies after they are set:
// Cookies
if (account.cookies) {
  const cookies = JSON.parse(account.cookies);
  await page.setCookie(...cookies);
}
console.log('check cookies');
const newCookies = await page.cookies();
console.log(newCookies);
This just results in an empty array ([]), so it seems the cookies are being silently rejected.
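One thing worth ruling out, as a sketch rather than a confirmed fix: page.cookies() returns read-only fields (size, session) that page.setCookie() does not necessarily accept, and any cookie whose expires timestamp is already in the past is silently dropped. If the Ubuntu server's clock differs from the Mac's, saved cookies can look expired there. sanitizeCookies below is a hypothetical helper, not part of Puppeteer:

```javascript
// Sketch: drop already-expired cookies and strip the read-only fields
// that page.cookies() returns but page.setCookie() may reject.
// Assumption: cookie `expires` is in seconds, -1 for session cookies.
function sanitizeCookies(cookies, nowSeconds = Date.now() / 1000) {
  return cookies
    // keep session cookies and cookies that have not expired yet
    .filter((c) => c.session || c.expires === -1 || c.expires > nowSeconds)
    // strip fields setCookie does not take
    .map(({ size, session, ...rest }) => rest);
}

// Usage (names as in the question):
// const cookies = sanitizeCookies(JSON.parse(account.cookies));
// await page.setCookie(...cookies);
```

Logging the length before and after the filter would show immediately whether expiry is the culprit on the Ubuntu box.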


Puppeteer not actually downloading ZIP despite Clicking Link

I've been making incremental progress, but I'm fairly stumped at this point.
This is the site I'm trying to download from: https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
The reason I'm using Puppeteer is that I can't find a supported API to get this data (if there is one, I'm happy to try it).
The link is "Download Raw Data".
My script runs to the end but doesn't seem to actually download any files. I tried installing puppeteer-extra and setting the downloads path:
const puppeteer = require("puppeteer-extra");
const { executablePath } = require('puppeteer');
...
var dir = "/home/ubuntu/AirlineStatsFetcher/downloads";
console.log('dir to set for downloads', dir);
puppeteer.use(require('puppeteer-extra-plugin-user-preferences')({
  userPrefs: {
    download: {
      prompt_for_download: false,
      open_pdf_in_system_reader: true,
      default_directory: dir,
    },
    plugins: {
      always_open_pdf_externally: true
    },
  }
}));
const browser = await puppeteer.launch({
  headless: true, slowMo: 100, executablePath: executablePath()
});
...
// Doesn't seem to work
await page.waitForSelector('table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)');
console.log('Clicking on link to download CSV');
await page.click('table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)');
After a while I figured: why not build the full URL and do a GET request? But then I ran into other problems (UNABLE_TO_VERIFY_LEAF_SIGNATURE). Before going down this route further (which feels a little hacky), I wanted to ask for advice here.
Is there something I'm missing in terms of configuration to get it to download?
Downloading files using Puppeteer seems to be a moving target, by the way, and is not well supported today. For now (Puppeteer 19.2.2) I would go with https.get instead.
"use strict";
const fs = require("fs");
const https = require("https");
// Not sure why puppeteer-extra is used... maybe https://stackoverflow.com/a/73869616/1258111 solves the need in future.
const puppeteer = require("puppeteer-extra");
const { executablePath } = require("puppeteer");

(async () => {
  puppeteer.use(
    require("puppeteer-extra-plugin-user-preferences")({
      userPrefs: {
        download: {
          prompt_for_download: false,
          open_pdf_in_system_reader: false,
        },
        plugins: {
          always_open_pdf_externally: false,
        },
      },
    })
  );
  const browser = await puppeteer.launch({
    headless: true,
    slowMo: 100,
    executablePath: executablePath(),
  });
  const page = await browser.newPage();
  await page.goto("https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp", {
    waitUntil: "networkidle2",
  });
  const handle = await page.$(
    "table > tbody > tr > .finePrint:nth-child(3) > a:nth-child(2)"
  );
  const relativeZipUrl = await page.evaluate(
    (anchor) => anchor.getAttribute("href"),
    handle
  );
  const url = "https://www.transtats.bts.gov/OT_Delay/".concat(relativeZipUrl);
  const encodedUrl = encodeURI(url);
  // Don't use in production
  https.globalAgent.options.rejectUnauthorized = false;
  https.get(encodedUrl, (res) => {
    const path = `${__dirname}/download.zip`;
    const filePath = fs.createWriteStream(path);
    res.pipe(filePath);
    filePath.on("finish", () => {
      filePath.close();
      console.log("Download Completed");
    });
  });
  await browser.close();
})();
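As an aside, the concat-plus-encodeURI step can be replaced by the WHATWG URL constructor, which also copes with absolute hrefs and "../" segments and percent-encodes spaces in the path. A small sketch:

```javascript
// Sketch: resolve the anchor's href against the page it came from.
// new URL() handles absolute hrefs and relative segments, and its
// serializer percent-encodes spaces, so encodeURI is no longer needed.
function resolveDownloadUrl(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// const url = resolveDownloadUrl(relativeZipUrl, page.url());
```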

Puppeteer - Cannot read property 'executablePath' of undefined

Regardless of what I try, I always get executablePath is undefined. Unfortunately there's not much info on this on Google. It would be great if anyone could point me to where to dig deeper to solve this error. revisionInfo is returning undefined.
Error
BrowserFetcher {
  _product: 'chrome',
  _downloadsFolder: '/var/www/node_modules/puppeteer/.local-chromium',
  _downloadHost: 'https://storage.googleapis.com',
  _platform: 'linux'
}
TypeError: Cannot read property 'executablePath' of undefined
    at demo1 (/var/www/filename.js:10:36)
Source Code
const puppeteer = require('puppeteer');

const demo1 = async () => {
  try {
    const browserFetcher = puppeteer.createBrowserFetcher();
    console.log(browserFetcher);
    const revisionInfo = await browserFetcher.download('970485');
    const browser = await puppeteer.launch({
      headless: false,
      executablePath: revisionInfo.executablePath,
      args: ['--window-size=1920,1080', '--disable-notifications'],
    });
    const page = await browser.newPage();
    await page.setViewport({
      width: 1080,
      height: 1080,
    });
    await page.goto('https://example.com', {
      waitUntil: 'networkidle0',
    });
    await page.close();
    await browser.close();
  } catch (e) {
    console.error(e);
  }
};
demo1();
Based on your error message, the problem is with this line
executablePath: revisionInfo.executablePath,
where revisionInfo is undefined, meaning this does not give you the data you want:
const revisionInfo = await browserFetcher.download('970485');
If you really need a specific executablePath, you need to make sure that revisionInfo gets the value you want.
Otherwise, you can just remove the line executablePath: revisionInfo.executablePath, and let puppeteer use its default chromium browser.
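A defensive variant of that advice can be sketched as follows, assuming BrowserFetcher's RevisionInfo shape ({ executablePath, local, ... }); pickExecutablePath is a hypothetical helper that treats a failed download as "use the bundled Chromium" instead of crashing:

```javascript
// Sketch: only use revisionInfo.executablePath when the download actually
// produced a local revision; otherwise return undefined so puppeteer.launch()
// falls back to its bundled Chromium instead of throwing on undefined.
function pickExecutablePath(revisionInfo) {
  if (revisionInfo && revisionInfo.local && revisionInfo.executablePath) {
    return revisionInfo.executablePath;
  }
  return undefined;
}

// const revisionInfo = await browserFetcher.download('970485');
// const browser = await puppeteer.launch({
//   headless: false,
//   executablePath: pickExecutablePath(revisionInfo), // undefined => default
// });
```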
Look into two things:
1. If you installed Chromium with apt install chromium-browser, remove it.
2. Try running and installing on an x86 server instead of an ARM-based server (e.g. an AWS t4g instance).
Either one of those solved my issue. The code stayed the same.

Problem overwriting navigator.languages in puppeteer

I have a problem that I can't solve in making Puppeteer fully anonymous.
So far I have passed all the anti-bot tests, but I can't configure the language. Let me explain:
By overwriting the user agent, I manage to change navigator.language from "en-US,en" to "es-ES,es".
But no matter what I try, I am not able to overwrite navigator.languages; it always remains "en-US,en".
I hope there is someone who can help me change the languages.
I attach screenshots and links to the plugin I use.
https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth
https://github.com/berstend/puppeteer-extra/blob/master/packages/puppeteer-extra-plugin-stealth/evasions/user-agent-override/index.js
const puppeteer = require("puppeteer-extra");
// add stealth plugin and use defaults (all evasion techniques)
const stealth_plugin = require("puppeteer-extra-plugin-stealth");
const stealth = stealth_plugin();
puppeteer.use(stealth);
const UserAgentOverride = require("puppeteer-extra-plugin-stealth/evasions/user-agent-override");
const ua = UserAgentOverride({locale: "es-ES,es;q=0.9", userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36", platform: "MacIntel"});
const path = require('path')
const websites = require('./websites.json')

async function run() {
  puppeteer.use(ua);
  const browser = await puppeteer.launch({
    headless: false,
    userDataDir: "./cache",
    ignoreHTTPSErrors: true,
    ignoreDefaultArgs: [
      "--disable-extensions",
      "--enable-automation",
    ],
    args: [
      "--lang=es-ES,es;q=0.9",
      "--no-sandbox",
      "--disable-dev-shm-usage",
      "--disable-gpu"
    ]
  })
  console.log(await browser.userAgent());
  const page = await browser.newPage()
  const pathRequire = path.join(__dirname, 'src/scripts/index.js')
  for (const website of websites) {
    require(pathRequire)(page, website)
  }
}
run().catch(error => { console.error("Something bad happened...", error); });
Image of anti bot test results:
Hi there, thanks for the answer. After testing the edited code, I have noticed the following:
When I launch the browser, the configuration disappears as soon as any URL is entered.
However, if I don't enter any URL, it passes the test perfectly, and even without a URL it is configured correctly.
I attach two images, one with a URL and one without. I don't understand what else I can do; I have tried everything.
Object.getOwnPropertyDescriptors(navigator.languages)
It's writable using the languages evasion:
  [value] => en-US
  [writable] => 1
  [enumerable] => 1
  [configurable] => 1
while it should be:
  configurable: false
  enumerable: true
  value: "es-ES"
  writable: false
Image of anti bot test results
Image of anti bot test results
I have managed to keep the specified languages every time a new page is launched, but the property descriptor still does not match the defaults of a regular Chrome browser:
Object.getOwnPropertyDescriptors(navigator.languages)
while it should be:
  configurable: false
  enumerable: true
  value: "es-ES"
  writable: false
If anyone knows how to solve this I would appreciate it.
const websites = require('./websites.json')

async function run() {
  puppeteer.use(ua);
  const optionslaunch = require("./src/scripts/options/optionslaunch");
  const browser = await puppeteer.launch(optionslaunch)
  const page = await browser.newPage()
  // Set the language forcefully in JavaScript
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, "language", {
      get: function () {
        return "es-ES";
      }
    });
    Object.defineProperty(navigator, "languages", {
      get: function () {
        return ["es-ES", "es"];
      }
    });
  });
  const pathRequire = path.join(__dirname, 'src/scripts/app.js')
  for (const website of websites) {
    // require(pathRequire)(page, pageEmail, website)
    require(pathRequire)(page, website)
  }
}
run().catch(error => { console.error("Something bad happened...", error); });
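One way to get the descriptor shape the test expects is to define the properties as data properties (value, writable: false, configurable: false) rather than getters, since a getter reports as configurable/writable. The sketch below takes the navigator object as a parameter so it can be tested outside a browser; note that real Chrome exposes these as accessors on Navigator.prototype, so this only mimics the descriptor the quoted test checks for:

```javascript
// Sketch: redefine language/languages with data descriptors matching the
// expected output (writable: false, configurable: false, enumerable: true).
function hardenLanguages(nav, languages) {
  Object.defineProperty(nav, 'languages', {
    value: Object.freeze(languages.slice()),
    writable: false,
    enumerable: true,
    configurable: false,
  });
  Object.defineProperty(nav, 'language', {
    value: languages[0],
    writable: false,
    enumerable: true,
    configurable: false,
  });
  return nav;
}

// Injected before any page script runs, e.g.:
// await page.evaluateOnNewDocument(
//   `(${hardenLanguages})(navigator, ['es-ES', 'es'])`
// );
```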

Clicking a selector with Puppeteer

So I am having trouble clicking a login button on the Nike website.
I am not sure why it keeps crashing; presumably because it can't find the selector, but I am not sure what I am doing wrong.
I would also like to mention that I get some sort of memory leak before Puppeteer crashes, and sometimes it will even crash my MacBook completely if I don't cancel the process in the console in time.
EDIT: This code also causes a memory leak whenever it crashes, forcing me to hard-reset my Mac if I don't cancel the application fast enough.
Node version: 14.4.0
Puppeteer version: 5.2.1
Current code:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
    args: ['--start-maximized']
  })
  const page = await browser.newPage()
  await page.goto('https://www.nike.com/')
  const winner = await Promise.race([
    page.waitForSelector('[data-path="join or login"]'),
    page.waitForSelector('[data-path="sign in"]')
  ])
  await page.click(winner._remoteObject.description)
})()
I have also tried:
await page.click('button[data-var]="loginBtn"');
Try this instead; the attribute value belongs inside the brackets:
await page.click('button[data-var="loginBtn"]');
They are A/B testing their website, so you may land on a page with selectors very different from the ones you retrieved while visiting the site in your own Chrome browser.
In such cases you can try to grab the elements by their text content (unfortunately, in this specific case that also changes across the designs) using XPath and its contains() method, e.g. $x('//span[contains(text(), "Sign In")]')[0].
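That XPath approach can be wrapped in a small helper (hypothetical, not part of Puppeteer) that also lower-cases both sides with translate(), so A/B variants like "Sign In" vs "SIGN IN" both match:

```javascript
// Sketch: build a case-insensitive "contains text" XPath. translate() is
// XPath 1.0's only case-folding mechanism, so this covers ASCII letters only.
const UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
const LOWER = 'abcdefghijklmnopqrstuvwxyz';

function textXPath(tag, text) {
  return `//${tag}[contains(translate(text(), '${UPPER}', '${LOWER}'), '${text.toLowerCase()}')]`;
}

// Usage with Puppeteer:
// const [el] = await page.$x(textXPath('span', 'Sign In'));
// if (el) await el.click();
```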
So I suggest detecting both button versions and getting their most stable selectors; these can be based on data attributes as well:
A: $('[data-path="sign in"]')
B: $('[data-path="join or login"]')
Then with a Promise.race you can detect which button is present and extract its selector from the JSHandle#node via ._remoteObject.description:
{
  type: 'object',
  subtype: 'node',
  className: 'HTMLButtonElement',
  description: 'button.nav-btn.p0-sm.body-3.u-bold.ml2-sm.mr2-sm',
  objectId: '{"injectedScriptId":3,"id":1}'
}
=>
button.nav-btn.p0-sm.prl3-sm.pt2-sm.pb2-sm.fs12-nav-sm.d-sm-b.nav-color-grey.hover-color-black
Example:
const browser = await puppeteer.launch({
  headless: false,
  defaultViewport: null,
  args: ['--start-maximized']
})
const page = await browser.newPage()
await page.goto('https://www.nike.com/')
const winner = await Promise.race([
  page.waitForSelector('[data-path="join or login"]'),
  page.waitForSelector('[data-path="sign in"]')
])
await page.click(winner._remoteObject.description)
FYI: maximize the browser window as well, to make sure the element has the same selector name:
defaultViewport: null, args: ['--start-maximized']
Chromium starts in a somewhat smaller window with Puppeteer by default.
You need to use { waitUntil: 'networkidle0' } with page.goto.
This tells Puppeteer to wait until the network has been idle (no requests) for 500 ms.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
    args: ['--start-maximized']
  })
  const page = await browser.newPage()
  // load the nike.com page and wait for it to fully load (inc. A/B scripts)
  await page.goto('https://www.nike.com/', { waitUntil: 'networkidle0' })
  // select whichever element appears first
  const el = await page.waitForSelector('[data-path="join or login"], [data-path="sign in"]', { timeout: 1000 })
  // execute click
  await page.click(el._remoteObject.description)
})()

Puppeteer Crawler - Error: net::ERR_TUNNEL_CONNECTION_FAILED

Currently I have Puppeteer running with a proxy on Heroku. Locally the proxy relay works totally fine; on Heroku, however, I get the error Error: net::ERR_TUNNEL_CONNECTION_FAILED. I've set all the .env info in the Heroku config vars, so it is all available.
Any idea how I can fix this error and resolve the issue?
I currently have
const browser = await puppeteer.launch({
  args: [
    "--proxy-server=https=myproxy:myproxyport",
    "--no-sandbox",
    "--disable-gpu",
    "--disable-setuid-sandbox",
  ],
  timeout: 0,
  headless: true,
});
page.authenticate
The correct format for the proxy-server argument is:
--proxy-server=HOSTNAME:PORT
If it is an HTTPS proxy, you can pass the username and password using page.authenticate before navigating:
await page.authenticate({ username: 'user', password: 'password' });
The complete code would look like this:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
    args: ['--no-sandbox', '--proxy-server=HOSTNAME:PORT']
  });
  const page = await browser.newPage();
  // Authenticate here
  await page.authenticate({ username: user, password: password });
  await page.goto('https://www.example.com/');
})();
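Since Chromium ignores credentials embedded in --proxy-server, the proxy URL has to be split: host:port goes into the launch flag and the username/password go to page.authenticate(). A sketch with a hypothetical splitProxy helper:

```javascript
// Sketch: split "http://user:pass@host:port" into the launch flag and the
// page.authenticate() payload. Credentials may be percent-encoded in URLs,
// hence the decodeURIComponent calls.
function splitProxy(proxyUrl) {
  const u = new URL(proxyUrl);
  return {
    server: `--proxy-server=${u.hostname}:${u.port}`,
    credentials: {
      username: decodeURIComponent(u.username),
      password: decodeURIComponent(u.password),
    },
  };
}

// const { server, credentials } = splitProxy(process.env.PROXY_URL);
// const browser = await puppeteer.launch({ args: ['--no-sandbox', server] });
// const page = await browser.newPage();
// await page.authenticate(credentials);
```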
Proxy Chain
If the authentication somehow does not work using the method above, you might want to handle the authentication elsewhere.
There are multiple packages for that; one is proxy-chain. With it, you can take one proxy and use it as a new proxy server.
proxyChain.anonymizeProxy(proxyUrl) takes one proxy URL with username and password and creates a new, credential-free proxy that you can use in your script.
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
  const oldProxyUrl = 'http://username:password@hostname:8000';
  const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
  // Prints something like "http://127.0.0.1:12345"
  console.log(newProxyUrl);
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${newProxyUrl}`],
  });
  // Do your magic here...
  const page = await browser.newPage();
  await page.goto('https://www.example.com');
})();
