I have a problem I can't solve that is keeping me from finishing an anonymous Puppeteer setup.
So far I have passed all the anti-bot tests, but I can't configure the language. Let me explain:
By overriding the user agent, I manage to change "navigator.language" from "en-US, en" to "es-ES, es".
But I have tried everything and I am not able to override "navigator.languages"; it always remains "en-US, en".
I hope someone can help me change the languages.
I attach screenshots and links to the plugin I use.
https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth
https://github.com/berstend/puppeteer-extra/blob/master/packages/puppeteer-extra-plugin-stealth/evasions/user-agent-override/index.js
const puppeteer = require("puppeteer-extra");
// add stealth plugin and use defaults (all evasion techniques)
const stealth_plugin = require("puppeteer-extra-plugin-stealth");
const stealth = stealth_plugin();
puppeteer.use(stealth);
const UserAgentOverride = require("puppeteer-extra-plugin-stealth/evasions/user-agent-override");
const ua = UserAgentOverride({locale: "es-ES,es;q=0.9", userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36", platform: "MacIntel"});
const path = require('path')
const websites = require('./websites.json')
async function run() {
puppeteer.use(ua);
const browser = await puppeteer.launch({
headless: false,
userDataDir: "./cache",
ignoreHTTPSErrors: true,
ignoreDefaultArgs: [
"--disable-extensions",
"--enable-automation",
],
args: [
"--lang=es-ES,es;q=0.9",
"--no-sandbox",
"--disable-dev-shm-usage",
"--disable-gpu"
]
})
console.log(await browser.userAgent());
const page = await browser.newPage()
const pathRequire = path.join(__dirname, 'src/scripts/index.js')
for (const website of websites) {
require(pathRequire)(page, website)
}
}
run().catch(error => { console.error("Something bad happened...", error); });
Image of anti bot test results:
Hi there
Thanks for the answer. After testing the edited code, I have noticed the following:
When I launch the browser, the configuration disappears as soon as any URL is entered.
However, if I don't enter a URL, it passes the test perfectly.
Without a URL it is configured correctly. I attach two images, one with a URL and one without. I don't understand what else I can do, and I have tried everything.
Object.getOwnPropertyDescriptors (navigator.languages)
it's writable using the languages evasion:
[value] => en-US
[writable] => 1
[enumerable] => 1
[configurable] => 1
while it should be
configurable: false
enumerable: true
value: "es-ES"
writable: false
Image of anti bot test results
Image of anti bot test results
I have managed to keep the specified languages every time a new page is launched, but I still cannot reproduce the default property descriptors of a Chrome browser:
Object.getOwnPropertyDescriptors (navigator.languages)
while it should be
configurable: false
enumerable: true
value: "es-ES"
writable: false
If anyone knows how to solve this I would appreciate it.
const websites = require('./websites.json')
async function run() {
puppeteer.use(ua);
const optionslaunch = require("./src/scripts/options/optionslaunch");
const browser = await puppeteer.launch(optionslaunch)
const page = await browser.newPage()
// Set the language forcefully on javascript
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, "language", {
get: function () {
return "es-ES";
}
});
Object.defineProperty(navigator, "languages", {
get: function () {
return ["es-ES", "es"];
}
});
});
const pathRequire = path.join(__dirname, 'src/scripts/app.js')
for (const website of websites) {
// require(pathRequire)(page, pageEmail, website)
require(pathRequire)(page, website)
}
}
run().catch(error => { console.error("Something bad happened...", error); });
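The expected descriptors listed above (writable: false, configurable: false on each entry) are what a frozen array reports, and the HTML specification defines navigator.languages as returning a frozen array. A minimal sketch, assuming this is the difference the test detects: have the getter return one frozen array instead of a fresh array literal. Here `fakeNavigator` is a hypothetical stand-in object so the snippet runs in plain Node.js; inside page.evaluateOnNewDocument the same Object.defineProperty call would target navigator.

```javascript
// Stand-in for the page's navigator object (assumption: used only so that
// this sketch runs outside a browser).
const fakeNavigator = {};

// Freeze the array once so every read returns the same immutable list.
const languages = Object.freeze(["es-ES", "es"]);

Object.defineProperty(fakeNavigator, "languages", {
  get: () => languages,
  enumerable: true,
  configurable: true,
});

// Each index of a frozen array is non-writable and non-configurable.
const descriptors = Object.getOwnPropertyDescriptors(fakeNavigator.languages);
console.log(descriptors[0]);
```

Whether this alone satisfies a given anti-bot test is not something the thread confirms; it only addresses the descriptor mismatch described above.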
I just started coding, and I was wondering if there was a way to open multiple tabs concurrently with one another. Currently, my code goes something like this:
const puppeteer = require("puppeteer");
const rand_url = "https://www.google.com";
async function initBrowser() {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto(rand_url);
await page.setViewport({
width: 1200,
height: 800,
});
return page;
}
async function login(page) {
await page.goto("https://www.google.com");
await page.waitFor(100);
await page.type("input[id='user_login']", "xxx");
await page.waitFor(100);
await page.type("input[id='user_password']", "xxx");
}
This is not my exact code (I replaced the real names with aliases), but you get the idea. Does anyone know how to open this same browser flow in multiple instances, changing only the respective login info? It would also be great to keep my IP from getting banned, so a way to apply a proxy to each respective browser instance would be perfect.
Lastly, I would like to know whether Playwright or Puppeteer is better at handling multiple instances like this. I don't even know if this is possible, but please enlighten me. I want to learn more.
You can use multiple browser windows, each with its own login and cookies.
For simplicity, you can use the puppeteer-cluster module by Thomas Dondorf.
This module launches and queues your Puppeteer tasks so that you can automate your logins, and even save the login cookies for later launches.
See the GitHub repository: https://github.com/thomasdondorf/puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2, // number of tasks running in parallel
})
// maxConcurrency could be raised to the number of CPUs, for example:
const cpuNumber = require('os').cpus().length
await cluster.task(async ({ page, data: [username, password] }) => {
await page.goto('https://www.example.com')
await page.waitForTimeout(100)
await page.type('input[id="user_login"]', username)
await page.waitForTimeout(100)
await page.type('input[id="user_password"]', password)
const screen = await page.screenshot()
// Store screenshot, Save Cookies, do something else
});
cluster.queue(['myFirstUsername', 'PassW0Rd1'])
cluster.queue(['anotherUsername', 'Secr3tAgent!'])
// cluster.queue([username, password])
// username and password array passed into cluster task function
// many more pages/account
await cluster.idle()
await cluster.close()
})()
Playwright is sadly not supported by the module above, but you can use a browser pool (cluster) module to automate the Playwright launcher.
For proxy usage, I recommend the Puppeteer library, which is the well-known option.
Don't forget to mark my answer as accepted if it helps you.
There are profiling and proxy options; you could combine them to achieve your goal:
Profile, https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context
import { chromium } from 'playwright'
const userDataDir = '/tmp/' + process.argv[2]
const browserContext = await chromium.launchPersistentContext(userDataDir)
// ...
Proxy, https://playwright.dev/docs/api/class-browsertype#browser-type-launch
import { chromium } from 'playwright'
const proxy = { /* secret */ }
const browser = await chromium.launch({
proxy: { server: 'per-context' }
})
const browserContext = await browser.newContext({
proxy: {
server: `http://${proxy.ip}:${proxy.port}`,
username: proxy.username,
password: proxy.password,
}
})
// ...
I have a list of URLs that need to be scraped from a website that uses React, which is why I am using Puppeteer.
I do not want to be blocked by anti-bot servers, so I have added puppeteer-extra-plugin-stealth.
I want to prevent ads from loading on the pages, so I am blocking them with puppeteer-extra-plugin-adblocker.
I also want to prevent my IP address from being blacklisted, so I use TOR nodes to get different IP addresses.
Below is a simplified version of my code, and the setup works (TOR_port and webUrl are assigned dynamically, but to simplify the question I have hard-coded them as variables).
There is a problem though:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());
var TOR_port = 13931;
var webUrl ='https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';
const browser = await puppeteer.launch({
dumpio: false,
headless: false,
args: [
`--proxy-server=socks5://127.0.0.1:${TOR_port}`,
`--no-sandbox`,
],
ignoreHTTPSErrors: true,
});
try {
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 720 });
await page.goto(webUrl, {
waitUntil: 'load',
timeout: 30000,
});
page
.waitForSelector('.price')
.then(async () => {
console.log('The price is available');
await browser.close();
})
.catch(() => {
// close this since it is clearly not a zillow website
throw new Error('This is not the zillow website');
});
} catch (e) {
await browser.close();
}
The above setup works but is very unreliable, and I recently learned about Puppeteer-Cluster. I need it to help me manage crawling multiple pages and to track my scraping tasks.
So my question is how to implement Puppeteer-Cluster with the above setup. I am aware of an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) offered by the library that shows how to use plugins, but it is so bare that I didn't quite understand it.
How do I implement Puppeteer-Cluster with the above TOR, AdBlocker, and Stealth configurations?
You can just hand over your puppeteer instance to the cluster, like this:
const puppeteer = require('puppeteer-extra');
const { Cluster } = require('puppeteer-cluster');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());
// Pass the plugin-enabled instance through the `puppeteer` launch option
const cluster = await Cluster.launch({
puppeteer,
});
Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
You can just add the plugins with puppeteer.use()
You have to use puppeteer-extra.
const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");
(async () => {
const puppeteer = addExtra(vanillaPuppeteer);
puppeteer.use(StealthPlugin());
puppeteer.use(RecaptchaPlugin());
// Do stuff
})();
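To connect the answers above with the TOR setup from the question, here is a minimal sketch of the options object that could be handed to Cluster.launch. The helper name `buildClusterOptions` and its parameters are hypothetical; `puppeteer`, `maxConcurrency`, and `puppeteerOptions` are options documented by puppeteer-cluster, and the proxy argument is copied from the question. The helper only builds a plain object, so it runs without any packages installed.

```javascript
// Hypothetical helper: assembles launch options for puppeteer-cluster.
// `puppeteer` is expected to be the puppeteer-extra instance with the
// stealth and adblocker plugins already registered via puppeteer.use().
function buildClusterOptions(puppeteer, torPort, maxConcurrency = 2) {
  return {
    puppeteer, // custom Puppeteer library for the cluster to use
    maxConcurrency, // number of tasks running in parallel
    puppeteerOptions: {
      // Same launch options as in the question's puppeteer.launch() call
      headless: false,
      ignoreHTTPSErrors: true,
      args: [`--proxy-server=socks5://127.0.0.1:${torPort}`, "--no-sandbox"],
    },
  };
}

// Usage (assumes puppeteer-extra and puppeteer-cluster are installed):
// const cluster = await Cluster.launch(buildClusterOptions(puppeteer, 13931));
```

Note that with a single set of puppeteerOptions every worker goes through the same TOR port; rotating ports per task is outside this sketch.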
I am having a strange issue with Puppeteer. Current cookie setting code as follows:
(Save cookie)
const cookies = await page.cookies();
await checkMongoConnection();
account.cookies = JSON.stringify(cookies, null, null);
await account.save();
await closeMongoConnection();
(Load Cookie)
const options = {
headless: true,
defaultViewport: { width: 1366, height: 768 },
ignoreHTTPSErrors: true,
args: [
'--disable-sync',
'--ignore-certificate-errors'
],
ignoreDefaultArgs: ['--enable-automation']
};
const browser = await puppeteer.launch(options);
const page = await browser.newPage();
// Cookies
if (account.cookies) {
// I have checked this with console.log it does contain cookies
const cookies = JSON.parse(account.cookies);
await page.setCookie(...cookies);
}
await page.goto('https://www.some-website.com');
This works without any issues on macOS (with headless set to either false or true). Note that I am using Chromium.
However, when I run this setup on my Ubuntu Linux server, setting the cookies has no effect. Has anyone else come across this issue before? Any ideas what I might be doing wrong here?
When I log the cookies I get from the database I get something like:
[
{
name: 'personalization_id',
value: '"v1_FijCjdT7iRj3K+cbhPiPIg=="',
domain: '.somedomain.com',
path: '/',
expires: 1664967308.337118,
size: 47,
httpOnly: false,
secure: true,
session: false,
sameSite: 'None'
}, ....
I have tried logging the cookies after they are set:
// Cookies
if (account.cookies) {
const cookies = JSON.parse(account.cookies);
await page.setCookie(...cookies);
}
console.log('check cookies');
const newCookies = await page.cookies();
console.log(newCookies);
This just results in an empty array [], so it seems they are refusing to set.
Currently I have Puppeteer running with a proxy on Heroku. Locally the proxy relay works totally fine. On Heroku, however, I get the error Error: net::ERR_TUNNEL_CONNECTION_FAILED. I've set all the .env values in the Heroku config vars, so they are all available.
Any idea how I can fix this error and resolve the issue?
I currently have
const browser = await puppeteer.launch({
args: [
"--proxy-server=https=myproxy:myproxyport",
"--no-sandbox",
'--disable-gpu',
"--disable-setuid-sandbox",
],
timeout: 0,
headless: true,
});
page.authentication
The correct format for the proxy-server argument is:
--proxy-server=HOSTNAME:PORT
If it's an HTTPS proxy, you can pass the username and password with page.authenticate before doing any navigation:
page.authenticate({username:'user', password:'password'});
The complete code would look like this:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
headless:false,
ignoreHTTPSErrors:true,
args: ['--no-sandbox','--proxy-server=HOSTNAME:PORT']
});
const page = await browser.newPage();
// Authenticate Here
await page.authenticate({username:'user', password:'password'});
await page.goto('https://www.example.com/');
})();
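If the proxy is stored as a single URL with embedded credentials, the two pieces used above (the --proxy-server argument and the object for page.authenticate) can be derived from it. This is a minimal sketch using the standard URL class; the function name `splitProxyUrl` and the example URL are hypothetical.

```javascript
// Hypothetical helper: splits "scheme://user:pass@host:port" into the
// Chromium proxy argument and the credentials for page.authenticate().
function splitProxyUrl(proxyUrl) {
  const { protocol, host, username, password } = new URL(proxyUrl);
  return {
    // The credentials are deliberately left out of the argument.
    serverArg: `--proxy-server=${protocol}//${host}`,
    credentials: {
      // URL keeps these percent-encoded, so decode them before use.
      username: decodeURIComponent(username),
      password: decodeURIComponent(password),
    },
  };
}

const parts = splitProxyUrl("http://username:password@hostname:8000");
console.log(parts.serverArg);
```

The resulting `parts.serverArg` would go into the `args` array of puppeteer.launch, and `parts.credentials` into page.authenticate.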
Proxy Chain
If the authentication somehow does not work with the method above, you might want to handle it somewhere else.
There are multiple packages for that. One is proxy-chain, which lets you take one proxy and expose it as a new proxy server.
proxyChain.anonymizeProxy(proxyUrl) takes a proxy URL with a username and password and creates a new proxy that you can use in your script.
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');
(async() => {
const oldProxyUrl = 'http://username:password@hostname:8000';
const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
// Prints something like "http://127.0.0.1:12345"
console.log(newProxyUrl);
const browser = await puppeteer.launch({
args: [`--proxy-server=${newProxyUrl}`],
});
// Do your magic here...
const page = await browser.newPage();
await page.goto('https://www.example.com');
})();
I just started working on Protractor tests and I'm facing the following problem: my tests fail inconsistently. A test may pass on one run and fail on the next. The reasons for failure vary widely: it may fail to find an element on the page, or report that an element has no text in it (even though it does).
I'm running on Ubuntu 14.04, and the problem occurs in both Chrome 71.0.3578.80 and Firefox 60.0.2, with AngularJS 1.7.2 and Protractor 5.4.0. I believe the problem is somewhere in my code, so below I have provided an example from the existing code base.
Here is my Protractor config:
exports.config = {
rootElement: '[ng-app="myapp"]',
framework: 'jasmine',
seleniumAddress: 'http://localhost:4444/wd/hub',
specs: ['./e2e/**/*protractor.js'],
SELENIUM_PROMISE_MANAGER: false,
baseUrl: 'https://localhost/',
allScriptsTimeout: 20000,
jasmineNodeOpts: {
defaultTimeoutInterval: 100000,
},
capabilities: {
browserName: 'firefox',
marionette: true,
acceptInsecureCerts: true,
'moz:firefoxOptions': {
args: ['--headless'],
},
},
}
And here are the capabilities for the Chrome browser:
capabilities: {
browserName: 'chrome',
chromeOptions: {
args: [ "--headless", "--disable-gpu", "--window-size=1920,1080" ]
}
},
And finally, my test suite that failed a few times:
const InsurerViewDriver = require('./insurer-view.driver');
const InsurerRefundDriver = require('./insurer-refund.driver');
const { PageDriver } = require('#utils/page');
const { NotificationsDriver } = require('#utils/service');
const moment = require('moment');
describe(InsurerViewDriver.pageUrl, () => {
beforeAll(async () => {
await InsurerViewDriver.goToPage();
});
it('- should test "Delete" button', async () => {
await InsurerViewDriver.clickDelete();
await NotificationsDriver.toBeShown('success');
await PageDriver.userToBeNavigated('#/setup/insurers');
await InsurerViewDriver.goToPage();
});
describe('Should test Refunds section', () => {
it('- should test refund list content', async () => {
expect(await InsurerRefundDriver.getTitle()).toEqual('REFUNDS');
const refunds = InsurerRefundDriver.getRefunds();
expect(await refunds.count()).toBe(1);
const firstRow = refunds.get(0);
expect(await firstRow.element(by.binding('item.name')).getText()).toEqual('Direct');
expect(await firstRow.element(by.binding('item.amount')).getText()).toEqual('$ 50.00');
expect(await firstRow.element(by.binding('item.number')).getText()).toEqual('');
expect(await firstRow.element(by.binding('item.date')).getText()).toEqual(moment().format('MMMM DD YYYY'));
});
it('- should test add refund action', async () => {
await InsurerRefundDriver.openNewRefundForm();
const NewRefundFormDriver = InsurerRefundDriver.getNewRefundForm();
await NewRefundFormDriver.setPayment(`#555555, ${moment().format('MMMM DD YYYY')} (amount: $2,000, rest: $1,500)`);
await NewRefundFormDriver.setPaymentMethod('Credit Card');
expect(await NewRefundFormDriver.getAmount()).toEqual('0');
await NewRefundFormDriver.setAmount(200.05);
await NewRefundFormDriver.setAuthorization('qwerty');
await NewRefundFormDriver.submit();
await NotificationsDriver.toBeShown('success');
const interactions = InsurerRefundDriver.getRefunds();
expect(await interactions.count()).toBe(2);
expect(await InsurerViewDriver.getInsurerTitleValue('Balance:')).toEqual('Balance: $ 2,200.05');
expect(await InsurerViewDriver.getInsurerTitleValue('Wallet:')).toEqual('Wallet: $ 4,799.95');
});
});
});
And here are some functions from the drivers that I reference in the test above:
// PageDriver.userToBeNavigated
this.userToBeNavigated = async function(url) {
return await browser.wait(
protractor.ExpectedConditions.urlContains(url),
5000,
`Expectation failed - user to be navigated to "${url}"`
);
};
this.pageUrl = '#/insurer/33';
// InsurerViewDriver.goToPage
this.goToPage = async () => {
await browser.get(this.pageUrl);
};
// InsurerViewDriver.clickDelete()
this.clickDelete = async () => {
await $('[ng-click="$ctrl.removeInsurer()"]').click();
await DialogDriver.toBeShown('Are you sure you want to remove this entry?');
await DialogDriver.confirm();
};
// NotificationsDriver.toBeShown
this.toBeShown = async (type, text) => {
const awaitSeconds = 6;
return await browser.wait(
protractor.ExpectedConditions.presenceOf(
text ? element(by.cssContainingText('.toast-message', text)) : $(`.toast-${type}`)
),
awaitSeconds * 1000,
`${type} notification should be shown within ${awaitSeconds} sec`
);
}
// InsurerRefundDriver.getRefunds()
this.getRefunds = () => $('list-refunds-component').all(by.repeater('item in $data'));
// InsurerViewDriver.getInsurerTitleValue
this.getInsurerTitleValue = async (text) => {
return await element(by.cssContainingText('header-content p', text)).getText();
};
I can't upload the whole code here to give you a better understanding because there is a lot of it by now, but the code above is an exact sample of the approach I use everywhere. Does anyone see a problem in my code? Thanks.
First of all, add this block before exporting your config:
process.on("unhandledRejection", ({message}) => {
console.log("\x1b[36m%s\x1b[0m", `Unhandled rejection: ${message}`);
});
This colorfully logs to the console whenever a promise rejection goes unhandled, for example if you missed async/await somewhere, which gives you confidence that you didn't miss anything.
Second, I would install the "protractor-console" plugin to make sure there are no errors/rejections in the browser console (i.e. to exclude the possibility of issues on your app's side) and add this to your config:
plugins: [{
package: "protractor-console",
logLevels: [ "severe" ]
}]
The next problem I would expect with these symptoms is incorrect waiting functions. Ideally you test them separately as you develop your e2e project, but since it's all written already, I'll tell you how I debugged mine. Note that this approach probably won't help if your actions take less than a second (i.e. you can't notice them). Otherwise, follow this chain:
1) Create a run configuration in WebStorm, as described in my comment here (find mine): How to debug angular protractor tests in WebStorm
2) Set a breakpoint on the first line of the test you want to debug.
3) Execute your test line by line, using the created run config.
When you start the debugging process, WebStorm opens a panel with three sections: frames, console, and variables. When the variables section shows the message "connected to localhost" and no variables are listed, your step is still being executed. Once loading has completed, you can see all your variables and execute the next command. So the main principle is: click the Step Over button and watch the variables section. IF VARIABLES APPEAR BEFORE THE APP HAS FINISHED LOADING (the waiting method has returned, but the app is still loading, which is wrong), then you need to work on that method. Going this way, I identified a lot of gaps in my custom waiting methods.
And finally, if this doesn't work, please attach the stack trace of your errors and ping me.
I'm concerned about this code snippet
describe(InsurerViewDriver.pageUrl, () => {
beforeAll(async () => {
await InsurerViewDriver.goToPage();
});
it('- should test "Delete" button', async () => {
await InsurerViewDriver.clickDelete();
await NotificationsDriver.toBeShown('success');
await PageDriver.userToBeNavigated('#/setup/insurers');
await InsurerViewDriver.goToPage(); // WHY IS THIS HERE?
});
describe('Should test Refunds section', () => {
it('- should test refund list content', async () => {
// DOESN'T THIS NEED SOME SETUP?
expect(await InsurerRefundDriver.getTitle()).toEqual('REFUNDS');
// <truncated>
You should not depend on the first it clause to set up the suite below it. You didn't post the code for InsurerRefundDriver.getTitle() but if that code does not send the browser to the correct URL and then wait for the page to finish loading, it is a problem. You should probably have await InsurerViewDriver.goToPage(); in a beforeEach clause.
After some research I found the problem. The cause was the way I navigate through the app.
this.goToPage = async () => {
await browser.get(this.pageUrl);
};
It turns out that the browser.get method resolves when the URL has changed, not when AngularJS has finished compiling. I used the same approach in every test suite, which is why my tests were failing inconsistently: sometimes the page was not fully loaded before the test started.
So here is the approach that did the trick:
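this.goToPage = async () => {
await browser.get(this.pageUrl);
// Placeholder selector: pick an element that only exists once the route has rendered
const importantElement = $('some important element');
await browser.wait(protractor.ExpectedConditions.presenceOf(importantElement), 5000, 'Element did not appear after route change');
};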
You should ensure that the page has finished all its compiling work before moving on.
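The idea behind such a wait is to poll a condition until it holds or a timeout expires. The following is a stand-alone sketch of that polling pattern, not Protractor's implementation of browser.wait; the function name `waitUntil` and its options are hypothetical.

```javascript
// Polls `condition` until it returns a truthy value, or rejects once
// `timeoutMs` has elapsed. `condition` may be sync or async.
async function waitUntil(
  condition,
  { timeoutMs = 5000, intervalMs = 100, message = "Condition not met" } = {}
) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await condition();
    if (result) return result;
    if (Date.now() >= deadline) {
      throw new Error(`${message} (waited ${timeoutMs} ms)`);
    }
    // Pause between checks so the loop does not spin.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test, the condition would check for something that only exists after the route has rendered, which is the role the presenceOf check plays in goToPage above.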
It seems this could be due to asynchronous JavaScript.
browser.ignoreSynchronization = true; has a global effect on all your tests. You may have to set it back to false so that Protractor waits for Angular to finish rendering the page, e.g. in or before your second beforeEach function.