Is there a way to proxy/redirect specific URLs to others?
For example, when a Puppeteer page goes to "mydomain.com", I'd like all calls to "mydomain.com/styles/*.css" to be proxied to "localhost:8080/styles/*.css".
I don't want all requests to be redirected through a proxy, but something similar to what the https://chrome.google.com/webstore/detail/resource-override/pkoacgokdfckfpndoffpifphamojphii?hl=en Chrome extension does.
As #Hellonearthis linked, the solution seems to be:
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', (request) => {
  if (request.url().indexOf('mydomain.com') !== -1) {
    // request.continue() takes an overrides object, not a URL string.
    // The url override changes where the request is sent; the page does
    // not see a redirect. Replace the whole origin so the path is kept.
    request.continue({
      url: request.url().replace('https://mydomain.com', 'http://localhost:8080'),
    });
  } else {
    request.continue();
  }
});
For PuppeteerSharp, it took me a while to figure it out:
page = await ChromeDriver.NewPageAsync();
await page.SetRequestInterceptionAsync(true);
page.Request += new EventHandler<RequestEventArgs>(async delegate (object o, RequestEventArgs a)
{
    // Contains() rather than StartsWith(): the request URL includes the
    // scheme, so it never starts with the bare domain name.
    if (a.Request.Url.Contains("mydomain.com"))
    {
        await a.Request.ContinueAsync(new Payload
        {
            Headers = a.Request.Headers,
            Method = a.Request.Method,
            PostData = a.Request.PostData != null ? a.Request.PostData.ToString() : "",
            Url = a.Request.Url.Replace("https://mydomain.com", "http://localhost:8080")
        });
    }
    else
    {
        await a.Request.ContinueAsync();
    }
});
In the Next.js server layer (SSR), I have a query that sometimes fails on the first try because the backend isn't ready.
const PREVIEW = {
  query: MY_PREVIEW_QUERY,
  variables: { id, since },
  fetchPolicy: "no-cache",
  notifyOnNetworkStatusChange: true,
  context: {
    headers: {
      cookie,
    },
  },
};

const { data, errors } = await apiApolloFetch(
  isPreviewRequest ? PREVIEW : STANDARD
);
if (errors) console.warn(errors);
The error only happens when it's a preview request. I would like to retry the fetch using ApolloClient. I looked at polling, but that does not help me because I only want to poll or retry when there is an error. I also looked at RetryLink, and it appears to be the answer, but I cannot figure out how to use it.
I was able to get something working like the snippet below, but I am still looking for an "Apollo" way to do the same thing.
const fetchPreview = async () => {
  let resp = await apiApolloFetch(PREVIEW);
  if (resp.data.preview == null) {
    await delay(2000);
    resp = await apiApolloFetch(PREVIEW);
  }
  return resp;
};

const { data, errors } = isPreviewRequest
  ? await fetchPreview()
  : await apiApolloFetch(STANDARD);
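For reference, RetryLink is configured on the client's link chain rather than per query. A minimal sketch of that wiring, assuming a standard @apollo/client setup (the operation name and endpoint are placeholders), with the caveat that RetryLink by default only retries network errors, not GraphQL errors or null data:

import { ApolloClient, InMemoryCache, HttpLink, from } from "@apollo/client";
import { RetryLink } from "@apollo/client/link/retry";

const retryLink = new RetryLink({
  delay: { initial: 2000, max: 10000, jitter: true },
  attempts: {
    max: 3,
    // Only retry the preview operation (operation name is a placeholder).
    retryIf: (error, operation) =>
      !!error && operation.operationName === "MyPreviewQuery",
  },
});

const client = new ApolloClient({
  link: from([retryLink, new HttpLink({ uri: "/graphql" })]),
  cache: new InMemoryCache(),
});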
I am currently having a problem in a Puppeteer project where I think cookies are not enabled. But I also don't know how to enable them if they are not already enabled from the beginning.
Since every website nowadays has a captcha, we can skip the auto-login part.
I'm new too; I got an idea from here for this.
First, check if there is a saved cookies.json file. If not, log in manually: click the submit button yourself and solve the captcha puzzle (in non-headless mode), and the page should be redirected to the destination page.
Once the destination page is loaded, save the cookies to a JSON file for next time.
Example:
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // launch in non-headless mode so you can see graphics
    defaultViewport: null
  });
  let [page] = await browser.pages();
  await page.setRequestInterception(true);

  const getCookies = async (page) => {
    // Get page cookies
    const cookies = await page.cookies();
    // Save cookies to file
    fs.writeFile('./cookies.json', JSON.stringify(cookies, null, 4), (err) => {
      if (err) console.log(err);
    });
  };

  const setCookies = async (page) => {
    // Get cookies from file as a string
    let cookiesString = fs.readFileSync('./cookies.json', 'utf8');
    // Parse string
    let cookies = JSON.parse(cookiesString);
    // Set page cookies (setCookie takes the cookies as separate arguments)
    await page.setCookie(...cookies);
  };

  page.on('request', async (req) => {
    // If this URL loads, we are already logged in
    if (req.url() === 'https://example.com/LoggedUserCP') {
      console.log('logged in, get the cookies');
      await getCookies(page);
    // If we land on the login page, try to set the saved cookies
    } else if (req.url() === 'https://example.com/Login?next=LoggedUserCP') {
      await setCookies(page);
      console.log('set the saved cookies');
    }
    // otherwise go to the login page and log in yourself
    req.continue();
  });

  await page.goto('https://example.com/Login?next=LoggedUserCP');
})();
I am having issues implementing generic-pool with Puppeteer. Below is the relevant part of my code.
UPDATE
Thanks #Jacob for the help; I am clearer now about the concept and how it works, and the code is also more readable. I am still having an issue where a generic pool is created on every request. How do I ensure that the same generic pool is used every time instead of creating a new one?
browser-pool.js
const genericPool = require('generic-pool');
const puppeteer = require('puppeteer');

class BrowserPool {
  static async getPool() {
    const browserParams = process.env.NODE_ENV == 'development' ? {
      headless: false,
      devtools: false,
      executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
    } : {
      headless: true,
      devtools: false,
      executablePath: 'google-chrome-unstable',
      args: ['--no-sandbox', '--disable-dev-shm-usage']
    };
    const factory = {
      create: function() {
        return puppeteer.launch(browserParams);
      },
      destroy: function(instance) {
        console.log('closing browser in hrere.....');
        instance.close();
      }
    };
    const opts = {
      max: 5
    };
    this.myBrowserPool = genericPool.createPool(factory, opts);
  }

  static async returnPool() {
    if (this.myBrowserPool == "") {
      getPool();
    }
    return this.myBrowserPool.acquire();
  }
}
BrowserPool.myBrowserPool = null;

module.exports = BrowserPool;
process-export.js
const BrowserPool = require('./browser-pool');

async function performExport(params){
  const myPool = BrowserPool.getPool();
  const resp = BrowserPool.myBrowserPool.acquire().then(async function(client){
    try {
      const url = config.get('url');
      const page = await client.newPage();
      await page.goto(url, {waitUntil: ['networkidle2', 'domcontentloaded']});
      let gotoUrl = `${url}/dashboards/${exportParams.dashboardId}?csv_export_id=${exportParams.csvExportId}`;
      //more processing
      await page.goto(gotoUrl, {waitUntil: 'networkidle2' })
      await myPool().myBrowserPool.release(client);
      return Data;
    } catch(err) {
      try {
        const l = await BrowserPool.myBrowserPool.destroy(client);
      } catch(e) {
      }
      return err;
    }
  }).catch(function(err) {
    return err;
  });
  return resp;
}

module.exports.performExport = performExport;
My understanding is that:
1) When the application starts, I can spin up, for example, 2 Chromium instances, and then whenever I want to visit a page I can use either of the two connections, so the browsers are essentially kept open and we improve performance since browser startup can take time. Is this correct?
2) Where do I place the acquire() code? I understand this should be in app.js, so we acquire the instances right when the app boots, but my Puppeteer code is in a different file. How do I pass the browser reference to the file which has my Puppeteer code?
When I use the above code, a new browser instance spins up every time; the max property is not respected, and it opens as many instances as are requested.
My apologies if it's something very trivial and I might not have understood the concept fully. Any help in clarifying this would be really helpful.
When using a pool, you'll need to use .acquire() to obtain an object, and then .release() when you're done so the object is returned to the pool and made available to something else. Without using .release(), you might as well have no pool at all. I like to use this helper pattern with pools:
class BrowserPool {
  // ...
  static async withBrowser(fn) {
    const pool = BrowserPool.myBrowserPool;
    const browser = await pool.acquire();
    try {
      // return the callback's result so callers can use it
      return await fn(browser);
    } finally {
      pool.release(browser);
    }
  }
}
This can be used like this anywhere in your code:
await BrowserPool.withBrowser(async browser => {
  await browser.doSomeThing();
  await browser.doSomeThingElse();
});
The key is that the finally clause ensures that, whether your task completes or throws an error, you cleanly release the browser back to the pool every time.
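As for the pool being created on every request: build it lazily exactly once, and have every caller go through the same getter. A minimal sketch, reusing your names and assuming the factory object is moved out to module scope:

class BrowserPool {
  // ...
  static getPool() {
    // Create the pool only on first use; every later call reuses it.
    if (!BrowserPool.myBrowserPool) {
      BrowserPool.myBrowserPool = genericPool.createPool(factory, { max: 5 });
    }
    return BrowserPool.myBrowserPool;
  }
}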
It sounds like you might have the concept of the max option backwards as well, expecting browser instances to be spawned up to max. Rather, max means "only create up to max resources." If you try to acquire a sixth resource without anything having been released, for example, the acquire(...) call will block until one item is returned to the pool.
The min option, on the other hand, means "keep at least this many items on hand at all times", which you can use to pre-allocate resources. If you want 5 items to be created in advance, set min to 5. If you want 5 items and only five items to be created, set both min and max to 5.
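For instance, a small sketch using generic-pool's option names, with the factory from above:

const opts = {
  min: 2, // pre-allocate two browsers at startup
  max: 5  // never create more than five
};
const pool = genericPool.createPool(factory, opts);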
Update:
I notice in your original code that you destroy in case of error and release when there isn't an error. I'd still prefer a wrapper function like mine to centralize all resource acquisition/release logic (the single-responsibility approach). Here's how it could be updated to automatically destroy on errors instead:
class BrowserPool {
  // ...
  static async withBrowser(fn) {
    const pool = BrowserPool.myBrowserPool;
    const browser = await pool.acquire();
    try {
      const result = await fn(browser);
      pool.release(browser);
      return result;
    } catch (err) {
      await pool.destroy(browser);
      throw err;
    }
  }
}
Addendum
Figuring out what's going on in your code will be easier if you embrace async functions instead of mixing async/await with Promise callbacks. Here's how it can be rewritten:
async function performExport(params){
  const myPool = BrowserPool.myBrowserPool;
  const client = await myPool.acquire();
  try {
    const url = config.get('url');
    const page = await client.newPage();
    await page.goto(url, { waitUntil: ['networkidle2', 'domcontentloaded'] });
    let gotoUrl = `${url}/dashboards/${params.dashboardId}?csv_export_id=${params.csvExportId}`;
    // more processing
    await page.goto(gotoUrl, { waitUntil: 'networkidle2' });
    await myPool.release(client);
    return Data;
  } catch (err) {
    try {
      await myPool.destroy(client);
    } catch (e) {
    }
    return err; // Are you sure you want to do this? Would suggest throw err.
  }
}
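For completeness, here's how performExport could lean on the withBrowser helper above so that all acquire/release/destroy logic lives in one place. A sketch reusing the names from your code (config, Data and the "more processing" step are assumed from the original):

async function performExport(params) {
  return BrowserPool.withBrowser(async (client) => {
    const url = config.get('url');
    const page = await client.newPage();
    await page.goto(url, { waitUntil: ['networkidle2', 'domcontentloaded'] });
    const gotoUrl = `${url}/dashboards/${params.dashboardId}?csv_export_id=${params.csvExportId}`;
    await page.goto(gotoUrl, { waitUntil: 'networkidle2' });
    // more processing, then return the result
    return Data;
  });
}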
We have some routing logic that kicks you to the homepage if you don't have a JWT_TOKEN set. I want to set this before the page loads/before the JS is invoked.
How do I do this?
You have to set the localStorage item like this:
await page.evaluate(() => {
  localStorage.setItem('token', 'example-token');
});
You should do it after page.goto: the browser must have a URL to register the localStorage item on. After this, enter the same page once again; this time the token should be there before the page is loaded.
Here is a fully working example:
const puppeteer = require('puppeteer');
const http = require('http');

const html = `
<html>
  <body>
    <div id="element"></div>
    <script>
      document.getElementById('element').innerHTML =
        localStorage.getItem('token') ? 'signed' : 'not signed';
    </script>
  </body>
</html>`;

http
  .createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.write(html);
    res.end();
  })
  .listen(8080);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://localhost:8080/');
  await page.evaluate(() => {
    localStorage.setItem('token', 'example-token');
  });
  await page.goto('http://localhost:8080/');
  const text = await page.evaluate(
    () => document.querySelector('#element').textContent
  );
  console.log(text);
  await browser.close();
  process.exit(0);
})();
There's some discussion about this in Puppeteer's GitHub issues.
You can load a page on the domain, set your localStorage, then go to the actual page you want to load with localStorage ready. You can also intercept the first URL load to return instantly instead of actually loading the page, potentially saving a lot of time.
const doSomePuppeteerThings = async () => {
  const url = 'http://example.com/';
  const browser = await puppeteer.launch();
  const localStorage = { storageKey: 'storageValue' };

  await setDomainLocalStorage(browser, url, localStorage);

  const page = await browser.newPage();
  // do your actual puppeteer things now
};

const setDomainLocalStorage = async (browser, url, values) => {
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', r => {
    r.respond({
      status: 200,
      contentType: 'text/plain',
      body: 'tweak me.',
    });
  });
  await page.goto(url);
  await page.evaluate(values => {
    for (const key in values) {
      localStorage.setItem(key, values[key]);
    }
  }, values);
  await page.close();
};
In 2021 it works with the following code:
// store the token in localStorage
await page.evaluateOnNewDocument(token => {
  localStorage.clear();
  localStorage.setItem('token', token);
}, 'eyJh...9_8cw');

// open the url
await page.goto('http://localhost:3000/Admin', { waitUntil: 'load' });
Unfortunately, the following snippet from the first answer did not work for me:
await page.evaluate(() => {
  localStorage.setItem('token', 'example-token'); // doesn't work, produces errors :(
});
Without requiring a double goto, this would work:
const browser = await puppeteer.launch();

browser.on('targetchanged', async (target) => {
  const targetPage = await target.page();
  if (!targetPage) return; // some targets (e.g. service workers) have no page
  const client = await targetPage.target().createCDPSession();
  await client.send('Runtime.evaluate', {
    expression: `localStorage.setItem('hello', 'world')`,
  });
});

// newPage, goTo, etc...
Adapted from the Lighthouse doc for Puppeteer, which does something similar: https://github.com/GoogleChrome/lighthouse/blob/master/docs/puppeteer.md
Try an additional script tag. Example:
Say you have a main.js script that houses your routing logic.
Then a setJWT.js script that houses your token logic.
Then, within the HTML that loads these scripts, order them this way:
<script src='setJWT.js'></script>
<script src='main.js'></script>
This would only be good for the initial load of the page.
Most routing libraries, however, usually have an event hook system that you can hook into before a route renders. I would put the setJWT logic in that callback, as sketched below.
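A minimal sketch, assuming a vue-router-style beforeEach guard and a hypothetical readToken() helper:

// Names here are assumptions; adapt to your router's hook API.
router.beforeEach((to, from, next) => {
  if (!localStorage.getItem('JWT_TOKEN')) {
    localStorage.setItem('JWT_TOKEN', readToken()); // hypothetical token source
  }
  next(); // continue rendering the route
});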
I am using Chrome stable 60 (https://chromedevtools.github.io/devtools-protocol/1-2/Page/) for headless. I need to be able to do this:
Navigate to page 1
Take screenshot1
Navigate to page 2 (after page 1 is done)
Take screenshot2
However, I can't seem to call Page.navigate twice, because Page.loadEventFired will only pick up the latest one.
I don't want to use Canary because it's so unstable (screenshots don't even work right). So I think Target isn't an option (even if it could be).
What is the best way to do URL navigation in a serial fashion like that?
I looked at https://github.com/LucianoGanga/simple-headless-chrome to see how they do it (await mainTab.goTo) but can't seem to figure it out yet.
The link here https://github.com/cyrus-and/chrome-remote-interface/issues/92 gave me an idea:
const fs = require('fs');
const CDP = require('chrome-remote-interface');

// Opens a new tab, navigates to the URL, and resolves with the client/tab
// pair once the page's load event has fired.
async function loadForScrot(url) {
  const tab = await CDP.New();
  const client = await CDP({ tab });
  const { Page } = client;
  await Page.enable();
  const loaded = Page.loadEventFired(); // promise resolving on the load event
  await Page.navigate({ url });
  await loaded;
  return { client, tab };
}
// Named processUrls to avoid shadowing Node's global `process` object.
async function processUrls(urls) {
  try {
    const handlers = await Promise.all(urls.map(loadForScrot));
    for (const { client, tab } of handlers) {
      const { Page } = client;
      await CDP.Activate({ id: tab.id });
      const filename = `/tmp/scrot_${tab.id}.png`;
      const result = await Page.captureScreenshot();
      const image = Buffer.from(result.data, 'base64');
      fs.writeFileSync(filename, image);
      console.log(filename);
      await client.close();
    }
  } catch (err) {
    console.error(err);
  }
}

processUrls([
  'http://example.com',
  'http://example.com',
  'http://example.com',
  'http://example.com',
  'http://example.com',
  'http://example.com',
  'http://example.com',
  'http://example.com'
]);
Check out the new library from the Google team: Puppeteer.
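With Puppeteer, the same serial navigate-then-screenshot flow becomes straightforward. A minimal sketch (the URLs are placeholders):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to page 1, then take screenshot 1
  await page.goto('http://example.com/page1', { waitUntil: 'load' });
  await page.screenshot({ path: 'screenshot1.png' });

  // Navigate to page 2 only after page 1 finished, then take screenshot 2
  await page.goto('http://example.com/page2', { waitUntil: 'load' });
  await page.screenshot({ path: 'screenshot2.png' });

  await browser.close();
})();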