I'm working on a web scraper in Javascript using puppeteer and whenever I try to log the text content of an element it says "Promise { Pending }". I've looked at other answers and none of them worked
const element = await page.$("#ctl00_ContentPlaceHolder1_NameLinkButton");
const text = await page.evaluate(element => element.textContent, element);
console.log(text);
Your answer is correct. but I think you forget to add await before page.evaluate().
There three ways to do that.
First way. just like what are you do. but I don't prefer it because
you don't need to call page.evaluate() to get .textContent
const puppeteer = require('puppeteer');
puppeteer.launch().then(async browser => {
const elementId = 'container';
const page = await browser.newPage();
await page.goto('https://metwally.me');
const element = await page.$(`#${elementId}`);
if (element) {
const text = await page.evaluate(element => element.textContent, element);
console.log(text);
} else {
// handle not exists id
console.log('Not Found');
}
});
Second way. you will call page.evaluate() and use JavaScript Dom to get textContent. like document.getElementById(elementId).textContent.
const puppeteer = require('puppeteer');
puppeteer.launch().then(async browser => {
const elementId = 'container';
const page = await browser.newPage();
await page.goto('https://metwally.me');
const text = await page.evaluate(
elementId => {
const element = document.getElementById(elementId);
return element ? element.textContent : null;
}, elementId);
if (text !== null) {
console.log(text);
} else {
// handle not exists id
console.log('Not Found');
}
});
Third way. you will select element by puppeteer selector then get textContent property using await element.getProperty('textContent') then get value from textContent._remoteObject.value.
const puppeteer = require('puppeteer');
puppeteer.launch().then(async browser => {
const elementId = 'container';
const page = await browser.newPage();
await page.goto('https://metwally.me');
const element = await page.$(`#${elementId}`);
if (element) {
const textContent = await element.getProperty('textContent');
const text = textContent._remoteObject.value;
console.log(text);
} else {
// handle not exists id
console.log('Not Found');
}
});
NOTE: All these examples working successfully in my machine.
os ubuntu 20.04
nodejs v10.19.0
puppeteer v1.19.0
References
Puppeteer page.$
Document.getElementById()
Node.textContent
Related
This question already has answers here:
Puppeteer page.evaluate querySelectorAll return empty objects
(3 answers)
Closed 3 days ago.
I'm working with Node.js and Puppeteer for the first time and can't find a way to output values from page.evaluate to the outer scope.
My algorithm:
Login
Open URL
Get ul
Loop over each li and click on it
Wait for innetHTML to be set and add it's src content to an array.
How can I return data from page.evaluate()?
const puppeteer = require('puppeteer');
const CREDENTIALS = require(`./env.js`).credentials;
const SELECTORS = require(`./env.js`).selectors;
const URLS = require(`./env.js`).urls;
async function run() {
try {
const urls = [];
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto(URLS.login, {waitUntil: 'networkidle0'});
await page.type(SELECTORS.username, CREDENTIALS.username);
await page.type(SELECTORS.password, CREDENTIALS.password);
await page.click(SELECTORS.submit);
await page.waitForNavigation({waitUntil: 'networkidle0'});
await page.goto(URLS.course, {waitUntil: 'networkidle0'});
const nodes = await page.evaluate(selector => {
let elements = document.querySelector(selector).childNodes;
console.log('elements', elements);
return Promise.resolve(elements ? elements : null);
}, SELECTORS.list);
const links = await page.evaluate((urls, nodes, VIDEO) => {
return Array.from(nodes).forEach((node) => {
node.click();
return Promise.resolve(urls.push(document.querySelector(VIDEO).getAttribute('src')));
})
}, urls, nodes, SELECTORS.video);
const output = await links;
} catch (err) {
console.error('err:', err);
}
}
run();
The function page.evaluate() can only return a serializable value, so it is not possible to return an element or NodeList back from the page environment using this method.
You can use page.$$() instead to obtain an ElementHandle array:
const nodes = await page.$$(`${selector} > *`); // selector children
If the length of the constant nodes is 0, then make sure you are waiting for the element specified by the selector to be added to the DOM with page.waitForSelector():
await page.waitForSelector(selector);
let elementsHendles = await page.evaluateHandle(() => document.querySelectorAll('a'));
let elements = await elementsHendles.getProperties();
let elements_arr = Array.from(elements.values());
Use page.evaluateHandle() to return a DOM node as a Puppeteer ElementHandle that you can manipulate in Node.
I'm trying to scrape the link to the next page from this webpage. I know how to scrape that using css selector. However, things go wrong when I attempt to parse the same using xpath. This is what I get instead of the next page link.
const puppeteer = require("puppeteer");
let url = "https://stackoverflow.com/questions/tagged/web-scraping";
(async () => {
const browser = await puppeteer.launch({headless:false});
const [page] = await browser.pages();
await page.goto(url,{waitUntil: 'networkidle2'});
let nextPageLink = await page.$x("//a[#rel='next']", item => item.getAttribute("href"));
// let nextPageLink = await page.$eval("a[rel='next']", elm => elm.href);
console.log("next page:",nextPageLink);
await browser.close();
})();
How can I scrape the link to the next page using xpath?
page.$x(expression) returns an array of element handles. You need either destructuring or index acces to get the first element from the array.
To get a DOM element property from this element handle, you need either evaluating with element handle parameter or element handle API.
const [nextPageLink] = await page.$x("//a[#rel='next']");
const nextPageURL = await nextPageLink.evaluate(link => link.href);
Or:
const [nextPageLink] = await page.$x("//a[#rel='next']");
const nextPageURL = await (await nextPageURL.getProperty('href')).jsonValue();
I am trying to get all input element in this website:
http://rwis.mdt.mt.gov/scanweb/swframe.asp?Pageid=SfHistoryTable&Units=English&Groupid=269000&Siteid=269003&Senid=0&DisplayClass=NonJava&SenType=All&CD=7%2F1%2F2020+10%3A41%3A50+AM
Here is element source page looks like.
here is my code:
const puppeteer = require("puppeteer");
function run() {
return new Promise(async (resolve, reject) => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(
"http://rwis.mdt.mt.gov/scanweb/swframe.asp?Pageid=SfHistoryTable&Units=English&Groupid=269000&Siteid=269003&Senid=0&DisplayClass=NonJava&SenType=All&CD=7%2F1%2F2020+10%3A41%3A50+AM"
);
let urls = await page.evaluate(() => {
let results = [];
let items = document.querySelectorAll("input").length;
return items;
});
browser.close();
return resolve(urls);
} catch (e) {
return reject(e);
}
});
}
run().then(console.log).catch(console.error);
Right now my output have 0, when i run document.querySelectorAll("input").length in the console, it give me 8 .
It seems like everything is loaded in the frameset tag, this might be the issue, could anyone have any idea how to solve this issue?
You have to get the frame element, from there you can get the frame itself so you can call evaluate inside that frame:
const elementHandle = await page.$('frame[name=SWContent]');
const frame = await elementHandle.contentFrame();
let urls = await frame.evaluate(() => {
let results = [];
let items = document.querySelectorAll("input").length;
return items;
});
I am running an automated test through puppeteer that fills up a form and checks for captcha as well. If the captcha is incorrect, it refreshes to a new image but then I need to process the whole image again and reach the function which was used earlier to process it.
(async function example() {
const browser = await puppeteer.launch({headless: false})
const page = await browser.newPage()
/*-----------NEED TO COME BACK HERE-----------*/
const tessProcess = utils.promisify(tesseract.process);
await page.setViewport(viewPort)
await page.goto('http://www.example.com')
await page.screenshot(options)
const text = await tessProcess('new.png');
console.log(text.trim());
await page.$eval('input[id=userEnteredCaptcha]', (el, value) => el.value = value, text.trim())
await page.$eval('input[id=companyID]', el => el.value = 'val');
const submitBtn = await page.$('[id="data"]');
await submitBtn.click();
try {
var x = await page.waitFor("#msgboxclose");
console.log("Captcha error")
}
catch (e) {
console.error('No Error');
}
if(x){
await page.keyboard.press('Escape');
/*---------GO FROM HERE--------*/
}
})()
I want to sort of create a loop so that the image can be processed again whenever the captcha is wrong
Declare a boolean variable that indicates whether you need to try again or not, and put the repeated functionality inside a while loop that checks that variable. If the x condition at the end of the loop is not fulfilled, set tryAgain to false, so that no further iterations occur:
(async function example() {
const browser = await puppeteer.launch({headless: false})
const page = await browser.newPage()
let tryAgain = true; // <--------------------------
while (tryAgain) { // <--------------------------
/*-----------NEED TO COME BACK HERE-----------*/
const tessProcess = utils.promisify(tesseract.process);
await page.setViewport(viewPort)
await page.goto('http://www.example.com')
await page.screenshot(options)
const text = await tessProcess('new.png');
console.log(text.trim());
await page.$eval('input[id=userEnteredCaptcha]', (el, value) => el.value = value, text.trim())
await page.$eval('input[id=companyID]', el => el.value = 'val');
const submitBtn = await page.$('[id="data"]');
await submitBtn.click();
try {
var x = await page.waitFor("#msgboxclose");
console.log("Captcha error")
}
catch (e) {
console.error('No Error');
}
if(x){
await page.keyboard.press('Escape');
/*---------GO FROM HERE--------*/
} else {
tryAgain = false; // <--------------------------
}
}
})()
I have a function which takes in a tag name and text as input and returns all the elements made of the given tag containing the given text as output (I get an array of all the elements having the matching text).
I will be using this function across multiple functions so I thought I could save it in another file and import the function into all the other files that I may need but I am unable to transfer the element. I am using puppeteer to open the browser and get my required documents.
The code I am importing:
commonFunctions.js:
module.exports = {
matchTagAndTextContents: async function matchTagAndTextContents(page, selector, text) {
const ele = await page.evaluate((selector,text) => {
function matchTagAndText(sel, txt) {
var elements = document.querySelectorAll(selector);
return Array.prototype.filter.call(elements, function(element){
return RegExp(text).test(element.textContent);
});
}
const matchedElements = matchTagAndText(selector,text);
return matchedElements;
},selector,text);
return ele;
}
}
Another file where I try to use the imported function:
foo.js:
const commonFunctions = require('./commonFunctions');
const puppeteer = require('puppeteer');
let browser = null;
browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
(async () => {
let page = await browser.newPage();
await page.goto("https://www.google.com");
let elem = null;
await commonFunctions.matchTagAndTextContents(page,'h1','Google').then( res => {
elem = res;
});
await page.evaluate((elem) => {
elem.forEach( el => {
el.click();
})
},elem);
})();
Here inside foo.js I keep getting el.click() is not a function, but if I implement the forEach inside the commonFunctions.js like:
matchedElements.forEach( el => {
el.click();
});
It works and the element gets clicked. What am I doing wrong?
Thats beacause elem is null in your execution and res its assigned to elem in the evaluated scope.
try changing
let elem = null;
await commonFunctions.matchTagAndTextContents(page,'h1','Google').then( res => {
elem = res;
});
await page.evaluate((elem) => {
elem.forEach( el => {
el.click();
})
},elem);
whith
var elem = null;
elem = await commonFunctions.matchTagAndTextContents(page,'h1','Google');
await page.evaluate((elem) => {
elem.forEach( el => {
el.click();
})
},elem);