Custom Function Not Defined Puppeteer - javascript

I wrote this custom function and declared it globally, which would normally work. I also tried moving it inside the main async Puppeteer function, but that doesn't work either. It's a simple function: in each page.evaluate callback I call it and pass the selector. However, I get a "not defined" error and an unhandled promise rejection, which is odd because the function isn't a promise. Please help.
const grabDomConvertNodlistToArray = (grabDomHtmlPath) => {
// grabbing node list from html selector all
const nList = document.querySelectorAll(grabDomHtmlPath);
// converting nodelist to array to be returned
const array = Array.from(nList);
return array;
};
I tried turning the function into an async function with an extra page parameter. I then made my evaluate callback async and passed the Puppeteer page as an argument, but it still errors.
const grabDomConvertNodlistToArray = async (page, grabDomHtmlPath) => {
try {
// grabbing node list from html selector all
const nList = await page.document.querySelectorAll(grabDomHtmlPath);
// converting nodelist to array to be returned
const array = Array.from(nList);
return array;
} catch (error) {
console.log(error);
}
};
So I have the typical Puppeteer setup where you await browser.newPage() and then goto(url). Then I added this:
await page.exposeFunction("grabDomConvertNodlistToArray", grabDomConvertNodlistToArray);
I also made my evaluate callback async, i.e. async () => {}. But calling my custom function inside that evaluate callback still doesn't work for some reason.
I found a solution, but it doesn't work for me. I'm getting "array.forEach is not a function", which suggests that my grabDomConvertNodlistToArray function is not grabbing the NodeList or converting it into an array. If it were, forEach would be a function.
Solution 3
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(someURL);
var functionToInject = function(){
return 1+1;
}
var otherFunctionToInject = function(input){
return 6
}
await page.exposeFunction("functionToInject", functionToInject)
await page.exposeFunction("otherFunctionToInject", otherFunctionToInject)
var data = await page.evaluate(async function(){
console.log('woo I run inside a browser')
return await functionToInject() + await otherFunctionToInject();
});
return data
So I removed the two functions above and swapped in my function below.
const grabDomConvertNodlistToArray = (grabDomHtmlPath) => {
// grabbing node list from html selector all
const nList = document.querySelectorAll(grabDomHtmlPath);
// converting nodelist to array to be returned
const array = Array.from(nList);
return array;
};
Running my JS file results in the error "array.forEach is not a function", which is weird: if the function worked as intended, the const array inside my evaluate callback would be an array, because it is assigned the return value of the function above. So I don't know what's going on; I think it has something to do with the document.querySelectorAll() line.
const rlData = async () => {
const browser = await puppeteer.launch(
{
headless: true,
},
{
args: ["--flag-switches-begin", "--disable-features=OutOfBlinkCors", "--flag-switches-end"],
}
);
const pageBodies = await browser.newPage();
await pageBodies.goto("https://test.com/bodies", {
waitUntil: "load",
});
const grabDomConvertNodlistToArray = (grabDomHtmlPath) => {
// grabbing node list from html selector all
const nList = document.querySelectorAll(grabDomHtmlPath);
// converting nodelist to array to be returned
const array = Array.from(nList);
return array;
};
await pageBodies.exposeFunction("grabDomConvertNodlistToArray", grabDomConvertNodlistToArray);
const rlBodyNames = await pageBodies.evaluate(async () => {
// grabs all elements in html to make nodelist & converts it to an array
const array = grabDomConvertNodlistToArray(".testbodies > div > h1");
// push the data collected from array into data array and returned
const data = [];
array.forEach((element) => {
data.push(element.textContent);
});
return data;
});
}
rlData();
I guess I'm going to have to move the document.querySelectorAll call out of the custom function and back into the evaluate callback. However, the whole reason for making that custom function was to avoid repeating the same code, since my overall crawler is 238 lines long with a lot of repetition. Not being able to call custom functions like mine makes it hard to refactor repeated code.
I gave up trying to get this to work and decided to do it this way. Yes, it makes the code repetitive if you have more pages to scrape, which is what I was trying to avoid. Puppeteer makes this kind of refactoring awkward; maybe down the line its developers will make it easier to use custom functions the way I was trying to.
const testNames = await pageBodies.evaluate(() => {
const nodeList = document.querySelectorAll(".test > div h2");
const array = Array.from(nodeList);
const data = [];
array.forEach((element) => {
data.push(element.textContent);
});
return data;
});

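exposeFunction() is not suitable for your case. An exposed function runs in the Node.js context, where document does not exist, and calling it from the page returns a Promise, so without await your array variable is a Promise and has no forEach. exposeFunction() is intended for transferring data between the browser and Node.js contexts: arguments and return values are serialized and deserialized under the hood, so unserializable data (such as DOM elements) is lost. Try this instead: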
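const puppeteer = require("puppeteer");
const rlData = async () => {
// launch() takes a single options object, so args belongs next to headless
const browser = await puppeteer.launch({
headless: true,
args: ["--flag-switches-begin", "--disable-features=OutOfBlinkCors", "--flag-switches-end"],
});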
const pageBodies = await browser.newPage();
await pageBodies.evaluateOnNewDocument(() => {
window.grabDomConvertNodlistToArray = function grabDomConvertNodlistToArray(grabDomHtmlPath) {
// grabbing node list from html selector all
const nList = document.querySelectorAll(grabDomHtmlPath);
// converting nodelist to array to be returned
const array = Array.from(nList);
return array;
}
});
await pageBodies.goto("https://test.com/bodies", {
waitUntil: "load",
});
const rlBodyNames = await pageBodies.evaluate(() => {
// grabs all elements in html to make nodelist & converts it to an array
const array = grabDomConvertNodlistToArray(".testbodies > div > h1");
// push the data collected from array into data array and returned
const data = [];
array.forEach((element) => {
data.push(element.textContent);
});
return data;
});
await browser.close();
return rlBodyNames;
}
rlData();
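If the goal is mainly to stop repeating the querySelectorAll boilerplate, another option is to keep the helper on the Node.js side and let page.$$eval do the DOM work. The sketch below is only an illustration: grabTexts is a hypothetical name, not something from the question, and page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical Node-side helper (the name is illustrative, not from the question).
// `page` is assumed to be a Puppeteer Page, or anything exposing $$eval.
const grabTexts = (page, selector) =>
  // $$eval runs the callback inside the page, passing it an array of the
  // matched elements. Only the returned strings are sent back to Node.js.
  page.$$eval(selector, (elements) => elements.map((el) => el.textContent));
```

Inside rlData this could be called as const rlBodyNames = await grabTexts(pageBodies, ".testbodies > div > h1"); and reused for every other page and selector.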

Related

How to print javascript return value without clicking a button?

I want to print the followers in my webpage, it shows up in the console, but not the html document.
the code:
async function getFollowers(user) {
const response = await fetch(`https://scratchdb.lefty.one/v3/user/info/${user}`);
let responseJson = await response.json();
const count = document.getElementById.innerHTML("123");
count = responseJson.statistics.followers;
return count;}
function pfollows(user) {
const element = document.getElementById.innerHTML("123");
const USER = user;
getFollowers(USER).then(count => {
element.textContent = `${USER} has ${count} followers right now.`;
});
}
document.getElementById.innerHTML("123") seems wrong.
You should be passing an id as a string to document.getElementById like document.getElementById("someIdHere"). innerHTML is not a function and shouldn't have parentheses or an argument after it.
count looks like it should be extracted from responseJson and returned.
pfollows looks like it might be responsible for updating the actual DOM.
Redefining user as USER is somewhat redundant.
async function getFollowers(user) {
const response = await fetch(`https://scratchdb.lefty.one/v3/user/info/${user}`);
let responseJson = await response.json();
const count = responseJson.statistics.followers;
return count;
}
function pfollows(user) {
const element = document.getElementById("123");
getFollowers(user).then(count => {
element.textContent = `${user} has ${count} followers right now.`;
});
}
When pfollows is called, then the element with id="123" should have its content set to the desired string. If pfollows() is not called, then nothing will happen.
There are a few more possible touch-ups:
responseJson can probably be a const.
You can return the value directly without saving it to a variable: return responseJson.statistics.followers
I kept the changes to the minimum needed to fix the problems, though.
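With both touch-ups applied, getFollowers might look like the sketch below. It assumes, as the code above does, that the response JSON has a statistics.followers field.

```javascript
// Sketch of getFollowers with the optional touch-ups applied.
async function getFollowers(user) {
  const response = await fetch(`https://scratchdb.lefty.one/v3/user/info/${user}`);
  // const instead of let: the variable is never reassigned
  const responseJson = await response.json();
  // Return the value directly instead of storing it in `count` first
  return responseJson.statistics.followers;
}
```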

Cannot get querySelectorAll to work with puppeteer (returns undefined)

I'm trying to practice some web scraping with prices from a supermarket, using Node.js and Puppeteer. I can navigate through the website at the beginning by accepting cookies and clicking a "load more" button. But when I try to read the divs containing the products with querySelectorAll, I get stuck. It returns undefined even though I wait for a specific div to be present. What am I missing?
Problem is at the end of the code block.
const { product } = require("puppeteer");
const scraperObjectAll = {
url: 'https://www.bilkatogo.dk/s/?query=',
async scraper(browser) {
let page = await browser.newPage();
console.log(`Navigating to ${this.url}`);
await page.goto(this.url);
// accept cookies
await page.evaluate(_ => {
CookieInformation.submitAllCategories();
});
var productsRead = 0;
var productsTotal = Number.MAX_VALUE;
while (productsRead < 100) {
// Wait for the required DOM to be rendered
await page.waitForSelector('button.btn.btn-dark.border-radius.my-3');
// Click button to read more products
await page.evaluate(_ => {
document.querySelector("button.btn.btn-dark.border-radius.my-3").click()
});
// Wait for it to load the new products
await page.waitForSelector('div.col-10.col-sm-4.col-lg-2.text-center.mt-4.text-secondary');
// Get number of products read and total
const loadProducts = await page.evaluate(_ => {
let p = document.querySelector("div.col-10.col-sm-4.col-lg-2").innerText.replace("INDLÆS FLERE", "").replace("Du har set ","").replace(" ", "").replace(/(\r\n|\n|\r)/gm,"").split("af ");
return p;
});
console.log("Products (read/total): " + loadProducts);
productsRead = loadProducts[0];
productsTotal = loadProducts[1];
// Now waiting for a div element
await page.waitForSelector('div[data-productid]');
const getProducts = await page.evaluate(_ => {
return document.querySelectorAll('div');
});
// PROBLEM HERE!
// Cannot convert undefined or null to object
console.log("LENGTH: " + Array.from(getProducts).length);
}
The callback passed to page.evaluate runs in the page context, not in the scope of the Node script. Values can't be passed between the page and the Node script without care: most importantly, if something isn't serializable (convertible to plain JSON), it can't be transferred.
querySelectorAll returns a NodeList, and NodeLists only exist on the front-end, not the backend. Similarly, NodeLists contain HTMLElements, which also only exist on the front-end.
Put all the logic that requires using the data that exists only on the front-end inside the .evaluate callback, for example:
const numberOfDivs = await page.evaluate(_ => {
return document.querySelectorAll('div').length;
});
or
const firstDivText = await page.evaluate(_ => {
return document.querySelector('div').textContent;
});
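To collect the product elements themselves, one option is to map each element to a plain object inside the page and return only that. The following is a sketch: toProductRecords is a hypothetical helper, and the fields it extracts (the data-productid attribute from the question's selector plus the element's text) are assumptions about what is needed.

```javascript
// Hypothetical page function: turns matched elements into plain,
// JSON-serializable records. It must not reference variables from the
// Node.js scope, because it is executed inside the page, not in Node.js.
const toProductRecords = (elements) =>
  elements.map((el) => ({
    id: el.getAttribute("data-productid"),
    text: (el.textContent || "").trim(),
  }));

// Assumed usage with a Puppeteer Page named `page`:
//   const products = await page.$$eval("div[data-productid]", toProductRecords);
//   console.log("LENGTH: " + products.length);
```

Because the records contain only strings, they survive the trip from the page back to the Node script.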

How to push elements to array inside async function?

How do I push elements to an array inside an async function (Puppeteer)?
The page is structured into several levels. A level only appears after clicking a link inside a cell element with a specific ID.
I have already managed to select the currently shown IDs and push them into an array, then loop through it and click the link inside each element with that ID to open the next hierarchy level.
After this I repeat the process by looping through the difference between the newly selected array of IDs (old IDs + new IDs) and the array of IDs from the previous loop (old IDs).
The problem appears when executing the second loop.
It seems that I made a mistake when pushing elements into the array inside the async function. Some links don't get clicked, in no apparent pattern, so I assume the problem is caused by the async code.
Thank you! Sorry if this is not a proper description; this is my first question and I am new to this async world.
(async function main() {
try {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('---url---');
await page.waitForSelector('.cotable');
const arrIDMaster = [];
await levelLoop(page, arrIDMaster);
await levelLoop(page, arrIDMaster);
} catch (e) {
}
})();
async function levelLoop(page, arrIDMaster) {
const rows = await page.$$('.coRow.hi.coTableR');
const arrIDLocalGet = [];
//Get all ID's
for (const row of rows) {
const rowID = await page.evaluate(el => el.id, row);
await arrIDLocalGet.push(rowID);
}
// First one needs to be removed
arrIDLocalGet.shift();
//Create Local ID-Array - difference
const arrIDLocal = await differenceOf2Arrays(arrIDMaster,arrIDLocalGet);
// Loop through local ID-Array and Click
for (const id of arrIDLocal) {
const rows = await page.$('#' + id);
const link = await rows.$('a.KnotenLink');
await link.click();
}
//push only new ID's to global array
for (const id of arrIDLocal) {
await arrIDMaster.push(id);
}
};
function differenceOf2Arrays (array1, array2) {
return new Promise(resolve => {
let arrdiff = array2.filter(x => !array1.includes(x));
resolve(arrdiff);
});
};
Hopefully someone can help me; perhaps it's just a general mistake, because I am pretty new to this. I am also sure this is not the most beautiful solution, so I am happy to receive suggestions on that too!
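One observation that can be made from the code alone: Array.prototype.push is synchronous, so putting await in front of it has no effect, and differenceOf2Arrays does not need to return a Promise. A minimal synchronous sketch of that helper follows. Using a Set is a choice made here, not part of the question, the example IDs are made up, and whether any of this relates to the skipped clicks is not established.

```javascript
// Synchronous version of the question's helper: returns the items of
// array2 that are not present in array1.
function differenceOf2Arrays(array1, array2) {
  const seen = new Set(array1); // Set lookup instead of repeated includes()
  return array2.filter((x) => !seen.has(x));
}

const arrIDMaster = ["row1", "row2"]; // example IDs, made up for illustration
const arrIDLocalGet = ["row1", "row2", "row3", "row4"];
const arrIDLocal = differenceOf2Arrays(arrIDMaster, arrIDLocalGet);
arrIDMaster.push(...arrIDLocal); // push is synchronous, no await needed
console.log(arrIDLocal); // [ 'row3', 'row4' ]
```

An existing call such as await differenceOf2Arrays(arrIDMaster, arrIDLocalGet) keeps working, since awaiting a non-Promise value simply yields that value.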

Puppeteer: How to get the contents of each element of a nodelist?

I'm trying to achieve something very trivial: Get a list of elements, and then do something with the innerText of each element.
const tweets = await page.$$('.tweet');
From what I can tell, this returns a nodelist, just like the document.querySelectorAll() method in the browser.
How do I just loop over it and get what I need? I tried various stuff, like:
[...tweets].forEach(tweet => {
console.log(tweet.innerText)
});
page.$$():
You can use a combination of elementHandle.getProperty() and jsHandle.jsonValue() to obtain the innerText from an ElementHandle obtained with page.$$():
const tweets = await page.$$('.tweet');
for (let i = 0; i < tweets.length; i++) {
const tweet = await (await tweets[i].getProperty('innerText')).jsonValue();
console.log(tweet);
}
If you are set on using the forEach() method, you can wrap the loop in a promise:
const tweets = await page.$$('.tweet');
await new Promise((resolve, reject) => {
tweets.forEach(async (tweet, i) => {
tweet = await (await tweet.getProperty('innerText')).jsonValue();
console.log(tweet);
if (i === tweets.length - 1) {
resolve();
}
});
});
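A caveat with the forEach version: the promise resolves as soon as the callback for the last index finishes, which is not guaranteed to be after all the others. One alternative, sketched here, is Promise.all over map, which waits for every element and keeps the results in order. innerTexts is a hypothetical name, and the handles are assumed to be ElementHandles obtained from page.$$().

```javascript
// Hypothetical helper: resolves to the innerText of every handle, in order.
// Each handle is assumed to behave like a Puppeteer ElementHandle.
async function innerTexts(handles) {
  return Promise.all(
    handles.map(async (handle) => {
      const property = await handle.getProperty("innerText");
      return property.jsonValue();
    })
  );
}

// Assumed usage:
//   const tweets = await innerTexts(await page.$$(".tweet"));
//   tweets.forEach((tweet) => console.log(tweet));
```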
page.evaluate():
Alternatively, you can skip using page.$$() entirely, and use page.evaluate():
const tweets = await page.evaluate(() => Array.from(document.getElementsByClassName('tweet'), e => e.innerText));
tweets.forEach(tweet => {
console.log(tweet);
});
According to the Puppeteer docs, $$ does not return a NodeList; it returns a Promise that resolves to an array of ElementHandle objects. That is very different from a NodeList.
There are several ways to solve the problem.
1. Using built-in function for loops called page.$$eval
This method runs Array.from(document.querySelectorAll(selector)) within the page and passes it as the first argument to pageFunction.
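So getting the innerText of each element looks like this:
// Find all .tweet elements and return the innerText of each one, in an array.
// The callback receives the whole array of elements, not a single element.
const tweets = await page.$$eval('.tweet', elements => elements.map(element => element.innerText));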
2. Pass the elementHandle to the page.evaluate
Whatever you get from await page.$$('.tweet') is an array of ElementHandle objects. If you log one, it will show as JSHandle or ElementHandle depending on the type.
Forget the hard explanation, it's easier to demonstrate.
// let's just call them tweetHandle
const tweetHandles = await page.$$('.tweet');
// loop thru all handles
for(const tweethandle of tweetHandles){
// pass the single handle below
const singleTweet = await page.evaluate(el => el.innerText, tweethandle)
// do whatever you want with the data
console.log(singleTweet)
}
Of course there are multiple ways to solve this problem; Grant Miller also covered a few of them in the other answer.

different representation of arrays in chrome console

Here is my code
let loadInitialImages = ($) => {
let html = "";
let images = new Array();
const APIURL = "https://api.shutterstock.com/v2/images/licenses";
const request = async() => {
const response = await fetch(APIURL, { headers: auth_header() } );
const json = await response.json();
json.data.map((v) => images.push(v.image.id)); //this is where the problem is
}
request();
// I can see the contents of the array when I log it.
console.log(images);
// But I can't see any elements when logging this way:
images.map((id) => console.log(id));
}
Everything is working fine here, but when I push elements into the array, they show up outside the array brackets [] in the console.
I'm not able to loop through the array here.
This is how a usual array looks in the console: the elements appear inside the brackets, e.g. [1, 2, 3].
Since your request function is async you need to treat its result as a Promise.
This is also the reason why you see it represented differently in the chrome console. An empty array gets printed, but the references in the console are updated dynamically, so you can still expand it and see the contents.
If you want to log the contents of the array statically, you could use something like JSON.stringify to print it. This will print a string representation of the exact state of the array at the time of logging.
// You will need to check the output in the browser console.
// Your code could be reduced to this:
const a = [];
setTimeout(() => a.push(1, 2), 100);
console.log('a:', a);
// A filled array logs differently:
const b = [1, 2];
console.log('b:', b);
// Stringify gives you a fixed state:
const c = [];
setTimeout(() => c.push(1, 2), 100);
console.log('c:', JSON.stringify(c));
Regarding your code, on top of waiting for request(), if you are using map you should take advantage of how it works. You can use it to generate your entire array without using push for example. If you still want to use your array and push() to it, you should use json.data.forEach instead of json.data.map since it doesn't duplicate the array.
// Making your function `async` so you can `await` for the `request()`
let loadInitialImages = async ($) => {
let html = "";
const APIURL = "https://api.shutterstock.com/v2/images/licenses";
const request = async () => {
const response = await fetch(APIURL, { headers: auth_header() } );
const json = await response.json();
// Array.map will return a new array with the results of applying
// the given function to the original array, you can use that as
// an easy way to return your desired array.
return json.data.map((v) => v.image.id);
}
// Since request() is async, you need to wait for it to complete.
const images = await request();
// Array.forEach lets you iterate over an array without generating a
// copy. If you use map here, you would be making an unneeded copy
// of your images array.
images.forEach(i => console.log(i));
}
Snippet below demonstrates your issue (your case is arr1, you want arr2).
In case loadInitialImages can't be async use arr3 scenario.
async function main(){
let arr1 = [], arr2 = [], arr3 = [];
const getArray = ()=> (new Promise(resolve=>setTimeout(()=>{resolve([1,2,3])},1000)))
async function request(arr, number){
var result = await getArray();
result.forEach((el)=>(arr.push(el)))
console.log(`inner${number}`, arr)
return result;
}
request(arr1, 1);
console.log("outer1", arr1)
await request(arr2, 2);
console.log("outer2", arr2)
request(arr3, 3).then(()=>{
console.log("then3",arr3)
})
console.log("outer3", arr3)
}
main();
I think the problem is that console.log() fires before the array is populated, and because the console holds a reference to the array, it shows both states (empty when logged, and populated after the .map call).
Can you test this code?
Here the console.log comes directly after the loop:
let loadInitialImages = ($) => {
let html = "";
let images = new Array();
const APIURL = "https://api.shutterstock.com/v2/images/licenses";
const request = async() => {
const response = await fetch(APIURL, { headers: auth_header() } );
const json = await response.json();
json.data.map((v) => images.push(v.image.id)); //this is where the problem is
console.log(images);
}
request();
}
loadInitialImages();
