The npm puppeter package is returning an error regarding node_package space - javascript

Here is a picture of the error I'm getting when I run npm install puppeter
I found some advice online about permissions, but this error is about node_package space. It is not disk space: I checked my available storage and there is plenty. I'm working with the Apify SDK and following its documentation, but the console returns a long list of error messages.
Can someone please help?
const Apify = require('apify');
Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  await requestQueue.addRequest({ url: 'https://www.iana.org/' });
  const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    handlePageFunction: async ({ request, page }) => {
      const title = await page.title();
      console.log(`Title of ${request.url}: ${title}`);
      await Apify.utils.enqueueLinks({
        requestQueue,
        page,
        pseudoUrls: ['https://www.iana.org/[.*]'],
      });
    },
  });
  await crawler.run();
});

Related

How to use proxy servers with aws lambda + Puppeteer?

I am trying to run Puppeteer with the proxy-chain package on AWS Lambda, but I get this error message:
"errorType": "Error",
"errorMessage": "Protocol error (Target.createTarget): Target closed.",
Code:
const chromium = require('chrome-aws-lambda');
const { addExtra } = require("puppeteer-extra");
const puppeteerExtra = addExtra(chromium.puppeteer);
const proxyChain = require('proxy-chain');
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteerExtra.use(StealthPlugin());
exports.handler = async (event, context, callback) => {
  let finalResult = [];
  const url = ``;
  let browser;
  const oldProxyUrl = ''; // --> bright data proxy
  const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
  console.log("newProxyUrl", newProxyUrl)
  try {
    browser = await puppeteerExtra.launch({
      args: ['--no-sandbox', '--disable-setuid-sandbox', `--proxy-server=${newProxyUrl}`],
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless
    });
    const page = await browser.newPage();
    await page.goto(url);
    finalResult = await extractElements(page);
  } catch (error) {
    return callback(error);
  } finally {
    await browser.close();
  }
  return callback(null, finalResult);
};
The code above works fine on AWS Lambda without the proxy server URL. I also tested the same code without the proxy server URL on serverless platforms such as Vercel and Netlify, and it worked. The protocol error appears only when I add the proxy server URL.
Here are a few things you can try to troubleshoot this issue:
Make sure that the url variable has a value. This is currently an empty string, which means that the page.goto() method will not have a valid URL to navigate to.
Make sure that the oldProxyUrl variable has a value. This is currently an empty string, which means that the proxyChain.anonymizeProxy() method will not have a valid proxy to anonymize.
Make sure that the extractElements() function is defined and can be called. This function is not present in the code you provided, so you may need to include it or modify the code to remove the call to this function.
Check the logs of your AWS Lambda function to see if there are any additional error messages that might provide more information about the issue.
Check the documentation for the puppeteer-extra-plugin-stealth and proxy-chain packages to see if there are any known issues or compatibility issues with AWS Lambda.
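The first two checks in the list above can be automated so that an empty value fails with a clear message before Chromium starts. A minimal sketch, assuming a hypothetical helper named assertValidUrl (it is not part of puppeteer, proxy-chain, or chrome-aws-lambda) that relies only on Node's built-in URL class:

```javascript
// Hypothetical helper: throws a descriptive error when a required URL is
// empty or cannot be parsed, and returns the parsed URL otherwise.
function assertValidUrl(name, value) {
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`${name} is empty`);
  }
  try {
    return new URL(value); // built-in WHATWG URL parser
  } catch (err) {
    throw new Error(`${name} is not a valid URL: ${value}`);
  }
}
```

In the handler, calling assertValidUrl('url', url) and assertValidUrl('oldProxyUrl', oldProxyUrl) before proxyChain.anonymizeProxy() would surface the empty strings directly. Note also that if the launch itself fails, browser is still undefined when the finally block runs, so guarding the close with if (browser) avoids a second error that hides the first.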

Is there a way to open multiple tabs simultaneously on Playwright or Puppeteer to complete the same tasks?

I just started coding, and I was wondering if there was a way to open multiple tabs concurrently with one another. Currently, my code goes something like this:
const puppeteer = require("puppeteer");
const rand_url = "https://www.google.com";
async function initBrowser() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(rand_url);
  await page.setViewport({
    width: 1200,
    height: 800,
  });
  return page;
}
async function login(page) {
  await page.goto("https://www.google.com");
  await page.waitFor(100);
  await page.type("input[id ='user_login'", "xxx");
  await page.waitFor(100);
  await page.type("input[id ='user_password'", "xxx");
}
This is not my exact code (I replaced parts with different aliases), but you get the idea. Does anyone know how to open this same browser in multiple instances, changing only the respective login info? It would also be great to keep my IP from getting banned, so a way to apply a proxy to each respective "browser"/instance would be perfect.
Lastly, I would like to know whether Playwright or Puppeteer is better at handling these multiple instances. I don't even know if this is possible, but please enlighten me. I want to learn more.
You can use multiple browser windows, each with its own login and cookies.
For simplicity, you can use the puppeteer-cluster module by Thomas Dondorf.
This module queues your Puppeteer tasks and runs them for you, so you can use it to automate your logins and even save login cookies for later launches.
See the GitHub repository: https://github.com/thomasdondorf/puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    // Number of tasks that run in parallel. You can change this, for
    // example to the number of CPUs: require('os').cpus().length
    maxConcurrency: 2,
  });

  // Each queued [username, password] array is passed to this task as `data`.
  await cluster.task(async ({ page, data: [username, password] }) => {
    await page.goto('https://www.example.com');
    await page.waitForTimeout(100);
    await page.type('input[id="user_login"]', username);
    await page.waitForTimeout(100);
    await page.type('input[id="user_password"]', password);
    const screen = await page.screenshot();
    // Store screenshot, save cookies, do something else
  });

  cluster.queue(['myFirstUsername', 'PassW0Rd1']);
  cluster.queue(['anotherUsername', 'Secr3tAgent!']);
  // cluster.queue([username, password]) for many more pages/accounts

  await cluster.idle();
  await cluster.close();
})();
Playwright is sadly still unsupported by the module above; for it, you can use a browser pool (cluster) module to automate launching Playwright.
For proxy usage, I recommend the Puppeteer library.
Don't forget to choose my answer as the right one, if this helps you.
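To illustrate what a maxConcurrency setting means in practice, here is a minimal sketch of a bounded task pool in plain JavaScript. It is not how puppeteer-cluster is implemented; runWithLimit and its parameters are hypothetical names used only for this illustration.

```javascript
// Runs worker(item, index) for every item with at most `limit` calls in
// flight at once. Results come back in the same order as the input items.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each runner keeps claiming the next unclaimed item until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With a list of [username, password] pairs as items, each worker call would perform one login in its own page or context, and no more than limit logins would run at the same time.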
There are profiling and proxy options; you could combine them to achieve your goal:
Profile, https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context
import { chromium } from 'playwright'
const userDataDir = '/tmp/' + process.argv[2] // one profile directory per account
const browserContext = await chromium.launchPersistentContext(userDataDir)
// ...
Proxy, https://playwright.dev/docs/api/class-browsertype#browser-type-launch
import { chromium } from 'playwright'
const proxy = { /* secret */ }
// The launch-level proxy is a placeholder; each context overrides it below.
const browser = await chromium.launch({
  proxy: { server: 'per-context' }
})
const browserContext = await browser.newContext({
  proxy: {
    server: `http://${proxy.ip}:${proxy.port}`,
    username: proxy.username,
    password: proxy.password,
  }
})
// ...
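When every account has its own proxy, the per-context proxy object from the snippet above can be built by a small function. A minimal sketch, assuming each proxy record has the shape { ip, port, username, password } used above; toContextProxy is an illustrative helper, not a Playwright API.

```javascript
// Converts one proxy record into the object passed as the `proxy` option.
function toContextProxy(proxy) {
  if (!proxy || !proxy.ip || !proxy.port) {
    throw new Error('proxy.ip and proxy.port are required');
  }
  const options = { server: `http://${proxy.ip}:${proxy.port}` };
  // Credentials are optional; include them only when both are present.
  if (proxy.username && proxy.password) {
    options.username = proxy.username;
    options.password = proxy.password;
  }
  return options;
}
```

Each account would then get its own context via browser.newContext({ proxy: toContextProxy(account.proxy) }), where account is a hypothetical record from your own list.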

API calls served over HTTP instead of HTTPs results in error in React/Axios

I'm using an API to fetch movie data with axios in my React app. This works on localhost, but I recently deployed the app to GitHub Pages, where it no longer works and produces this error:
"Mixed content: load all resources via HTTPS to improve the security of your site"
My code is shown below:
const fetchItems = async () => {
  const result = await axios(
    `http://www.omdbapi.com/?s=${searchTitle}&type=movie&page=${searchPage}&y=${searchYear}&apikey=myapikeyhere`
  );
  if (result.data.totalResults) {
    console.log("fetching data:", result.data.Search);
    setQueryLength(result.data.Search.length);
    setMovieQuery(result.data.Search);
  } else {
    setMovieQuery([]);
    setQueryLength(0);
  }
  setLoading(false);
};
This code runs inside a useEffect hook: when a user enters a title into a text field, the matching movies are supposed to appear. Nothing is rendered on my GitHub Pages site, and I get the error detailed above. I've never encountered this error before, and I look forward to getting some feedback.
You have to write https, not http:
const result = await axios(
  `https://www.omdbapi.com/?s=${searchTitle}&type=movie&page=${searchPage}&y=${searchYear}&apikey=myapikeyhere`
);
You need to use https in your endpoint link:
const fetchItems = async () => {
  const result = await axios(
    `https://www.omdbapi.com/?s=${searchTitle}&type=movie&page=${searchPage}&y=${searchYear}&apikey=myapikeyhere`
  );
  if (result.data.totalResults) {
    console.log("fetching data:", result.data.Search);
    setQueryLength(result.data.Search.length);
    setMovieQuery(result.data.Search);
  } else {
    setMovieQuery([]);
    setQueryLength(0);
  }
  setLoading(false);
};
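A page served over HTTPS, as GitHub Pages sites are, is blocked from making plain http requests, which is why the same code works on localhost. One way to keep the scheme fixed in a single place is to build the URL in a helper. A minimal sketch, assuming a hypothetical function named buildOmdbSearchUrl; the query parameters (s, type, page, y, apikey) are the ones used in the question, and URLSearchParams also takes care of encoding titles that contain spaces.

```javascript
// Builds the OMDb search URL with the https scheme and encoded parameters.
function buildOmdbSearchUrl({ title, page, year, apiKey }) {
  const params = new URLSearchParams({
    s: title,
    type: 'movie',
    page: String(page),
  });
  // The year filter is optional; skip it when the field is empty.
  if (year) {
    params.set('y', String(year));
  }
  params.set('apikey', apiKey);
  return `https://www.omdbapi.com/?${params.toString()}`;
}
```

Inside fetchItems, the call would become axios(buildOmdbSearchUrl({ title: searchTitle, page: searchPage, year: searchYear, apiKey: 'myapikeyhere' })).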

GCloud Function with Puppeteer - 'Process exited with code 16' error

I'm having trouble finding any documentation or cause for this sort of issue. I'm trying to run a headless Chrome script that pulls the current song playing from kexp.org and returns it as a JSON object. Testing with the npm package @google-cloud/functions-framework returns the correct response; however, when deployed to GCloud, I receive the following error when hitting the API trigger:
Error: could not handle the request
Error: Process exited with code 16
at process.on.code (invoker.js:396)
at process.emit (events.js:198)
at process.EventEmitter.emit (domain.js:448)
at process.exit (per_thread.js:168)
at logAndSendError (/workspace/node_modules/@google-cloud/functions-framework/build/src/invoker.js:184)
at process.on.err (invoker.js:393)
at process.emit (events.js:198)
at process.EventEmitter.emit (domain.js:448)
at emitPromiseRejectionWarnings (internal/process/promises.js:140)
at process._tickCallback (next_tick.js:69)
Full Script:
const puppeteer = require('puppeteer');
let browserPromise = puppeteer.launch({
  args: [
    '--no-sandbox'
  ]
})
exports.getkexp = async (req, res) => {
  const browser = await browserPromise
  const context = await browser.createIncognitoBrowserContext()
  const page = await context.newPage()
  try {
    const url = 'https://www.kexp.org/'
    await page.goto(url)
    await page.waitFor('.Player-meta')
    let content = await page.evaluate(() => {
      // finds elements by data type and maps to array
      // note: needs map because puppeteer needs a serialized element
      let player = [...document.querySelectorAll('[data-player-meta]')].map((player) =>
        // cleans up each entry by trimming whitespace
        player.innerHTML.trim());
      // creates object and removes empty strings
      player = {...player.filter(n => n)}
      let songList = {
        "show":player[0],
        "artist":player[1],
        "song":player[2].substring(2),
        "album":player[3]
      }
      return songList
    });
    context.close()
    res.set('Content-Type', 'application/json')
    res.status(200).send(content)
  } catch (e) {
    console.log('error occurred: '+e)
    context.close()
    res.set('Content-Type', 'application/json')
    res.status(200).send({
      "error":"occurred"
    })
  }
}
Is there documentation for this error type? The function was deployed to GCloud via the CLI with the following parameters:
gcloud functions deploy getkexp --trigger-http --runtime=nodejs10 --memory=1024mb
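The stack trace passes through emitPromiseRejectionWarnings, which points at a promise rejection that nothing handled. In the script above, both puppeteer.launch() at module level and await browserPromise sit outside the try block, so a failed launch would reject the handler itself. One possible way to contain that, shown as a minimal sketch and not as a confirmed fix, is to launch lazily and reset the cached promise on failure. createLazyLauncher is a hypothetical name, and the launch function is passed in so the sketch does not depend on Puppeteer.

```javascript
// Returns a getBrowser() function that calls `launch` on first use, reuses
// the result afterwards, and forgets a failed attempt so it can be retried.
function createLazyLauncher(launch) {
  let cached = null;
  return function getBrowser() {
    if (!cached) {
      cached = Promise.resolve()
        .then(() => launch())
        .catch((err) => {
          cached = null; // let the next request try again
          throw err;
        });
    }
    return cached;
  };
}
```

In the function, this would look like const getBrowser = createLazyLauncher(() => puppeteer.launch({ args: ['--no-sandbox'] })) at module level, with const browser = await getBrowser() moved inside the try block so the catch branch can log the real launch error.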

Web scraping with Nightmare Cloud function works locally, but not when deployed

I am trying to scrape a JavaScript calendar and return a JSON array of its events in Google Cloud Functions (Blaze plan). The function below works, but only when run locally through the Firebase emulators. It deploys successfully, but every call times out. No errors appear in the logs. Both the local and deployed functions are running on Node.js 10. (EDIT: I found an article mentioning that xvfb is required to use Nightmare without a display, but I'm not sure how I would add that to Firebase, or even install it.)
const functions = require('firebase-functions');
const Nightmare = require('nightmare'); //latest version
const retrieveEventsOpts = { memory: "2GB", timeoutSeconds: 60 };
exports.retrieveEventsArray = functions.runWith(retrieveEventsOpts).https.onRequest(async (request, response) => {
  nightmare = Nightmare({show: false})
  try {
    await nightmare
      .goto('https://www.csbcsaints.org/calendar')
      .evaluate(() => document.querySelector('body').innerHTML )
      .then(firstResponse => {
        let responseJSON = parseHTMLForEvents(firstResponse) //Just a function that synchronously parses the HTML string to a JSON array
        return response.status(200).json(responseJSON)
      }).catch(error => {
        return response.status(500).json(error)
      })
  } catch(error) {
    return response.status(500).json(error)
  }
})
