In Puppeteer, how do I get the innerHTML of a selector? - javascript

At the moment, I'm trying the following:
const element = await page.$("#myElement");
const html = element.innerHTML;
I'm expecting the HTML to be printed, but instead I'm getting undefined.
What am I doing wrong?

page.evaluate():
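page.$() resolves to an ElementHandle, which is a reference to the element inside the page rather than the DOM element itself, so it has no innerHTML property. You can use page.evaluate() to read the innerHTML of an element inside the page: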
const inner_html = await page.evaluate(() => document.querySelector('#myElement').innerHTML);
console.log(inner_html);
elementHandle.getProperty() / .jsonValue():
Alternatively, if you must use page.$(), you can access the innerHTML using a combination of elementHandle.getProperty() and elementHandle.jsonValue():
const inner_html = await (await (await page.$('#myElement')).getProperty('innerHTML')).jsonValue();
console.log(inner_html);

You can use page.$eval to access the innerHTML of a specified DOM element.
Snippet using page.$eval:
const myContent = await page.$eval('#myDiv', (e) => e.innerHTML);
console.log(myContent);
Works great with jest-puppeteer.

You have to use the asynchronous evaluate function.
In my case, I was using puppeteer/jest and all the await* solutions were throwing
error TS2531: Object is possibly 'null'
Using innerHTML with an if statement worked for me (this is part of the example).
I was searching the members table for the first unchecked element:
while (n) {
let elementI = await page.$('div >' + ':nth-child(' + i + ')' + '> div > div div.v-input__slot > div');
let nameTextI = await page.evaluate(el => el.innerHTML , elementI);
console.log('Mistery' + nameTextI);
{...}
i++;
{...}
}
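The TS2531 error mentioned above comes from the fact that page.$() resolves to null when nothing matches the selector. A minimal sketch of an explicit null check, assuming page is a Puppeteer Page; the helper name innerHtmlOrNull is hypothetical and not part of Puppeteer:

```javascript
// Hypothetical helper (not a Puppeteer API): returns the element's innerHTML,
// or null when the selector matches nothing.
// Assumption: `page` is a Puppeteer Page (anything with $ and evaluate works).
async function innerHtmlOrNull(page, selector) {
  const handle = await page.$(selector); // resolves to null if nothing matches
  if (handle === null) {
    return null;
  }
  // The callback runs in the browser, where `el` is a real DOM element.
  return page.evaluate((el) => el.innerHTML, handle);
}

// Possible usage inside an async function:
//   const html = await innerHtmlOrNull(page, '#myElement');
//   if (html !== null) console.log(html);
```

Checking for null before calling page.evaluate() also lets the TypeScript compiler narrow the type, which should avoid the "possibly 'null'" complaint.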

const html = await page.$eval("#myElement", e => e.innerHTML);

Related

TypeError: Cannot read property 'getProperty' of undefined? Node.js, Puppeteer

I'm trying to get the text of the 'name' element for my scraper. When I grab it with the full XPath, I get the error 'TypeError: Cannot read property 'getProperty' of undefined'. I also tried the regular XPath, but that returned name: 'skip navigation'. Why is getProperty coming back as undefined? It only happens when getting the channel title; it works when getting the profile image.
scaper.js
const puppeteer = require('puppeteer');
async function scrapeChannel(url) {
const browser = await puppeteer.launch()
const page = await browser.newPage();
await page.goto(url);
// const xpath_expression = '/html/body/ytd-app/div/ytd-page-manager/ytd-browse[2]/div[3]/ytd-c4-tabbed-header-renderer/tp-yt-app-header-layout/div/tp-yt-app-header/div[2]/div[2]/div/div[1]/div/div[1]/ytd-channel-name/div/div/yt-formatted-string';
// await page.waitForXPath(xpath_expression);
const [el] = await page.$x('/html/body/ytd-app/div/ytd-page-manager/ytd-browse[2]/div[3]/ytd-c4-tabbed-header-renderer/tp-yt-app-header-layout/div/tp-yt-app-header/div[2]/div[2]/div/div[1]/div/div[1]/ytd-channel-name/div/div/yt-formatted-string');
const text = await el.getProperty('textContent');
const name = await text.jsonValue();
const [el2] = await page.$x('//*[@id="img"]');
const src = await el2.getProperty('src');
const avatarURL = await src.jsonValue();
browser.close();
console.log({name, avatarURL});
return { name, avatarURL}
}
scrapeChannel('https://www.youtube.com/channel/UC8butISFwT-Wl7EV0hUK0BQ')
index.html
function newEl(type, attrs = {}) {
const el = document.createElement(type);
for (let attr in attrs) {
const value = attrs[attr];
if (attr == "innerText") el.innerText = value;
else el.setAttribute(attr, value);
}
return el;
}
It seems you may have a typo in XPath. When I try your XPath in the browser console, it returns no elements. However, with this one change, it returns an element:
$x('/html/body/ytd-app/div/ytd-page-manager/ytd-browse[1]/div[3]/ytd-c4-tabbed-header-renderer/tp-yt-app-header-layout/div/tp-yt-app-header/div[2]/div[2]/div/div[1]/div/div[1]/ytd-channel-name/div/div/yt-formatted-string')
The change is in the ytd-browse step: ytd-browse[1] instead of ytd-browse[2].
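The error itself appears because destructuring an empty array leaves el undefined, so el.getProperty throws. A small sketch of a guard, written against any array of element handles such as the one page.$x returned in the question; the helper name firstTextContent is hypothetical:

```javascript
// Hypothetical helper (not a Puppeteer API): reads textContent from the first
// handle in `handles`, or returns null when the array is empty.
// Assumption: each handle exposes getProperty() and the result exposes
// jsonValue(), as Puppeteer's ElementHandle and JSHandle do.
async function firstTextContent(handles) {
  const [el] = handles; // undefined when the XPath matched nothing
  if (el === undefined) {
    return null;
  }
  const property = await el.getProperty('textContent');
  return property.jsonValue();
}

// Possible usage inside scrapeChannel:
//   const name = await firstTextContent(await page.$x(xpathExpression));
//   if (name === null) console.warn('XPath matched no elements');
```

With a guard like this, a wrong XPath shows up as a null name instead of a TypeError.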

Cannot filter empty element from an array

I have a problem with this piece of code.
I import input data from a file formatted like so and store it in const input:
aabcccccaaa
aaccb
shsudbud
There are no spaces or any other whitespace characters except for the '\n' newline.
I get inputs in this way: (LiveServer inside VS Code)
const getData = async () => {
const resp = await fetch("./inputs.txt");
const data = await resp.text();
return data;
};
Then I call:
const myFunc = async () => {
const input = await getData();
const rows = input.split("\n").map(row => row);
rows.forEach(row => {
const charArr = [...row];
console.log(charArr);
});
};
After logging the first and second rows to the console, it seems there is a "" (empty string) attached to the end of each of them. The third row is fine, so I guess it's somehow connected with the newline character.
I have also tried creating charArr by doing:
const charArr = Array.from(row);
Or
const charArr = row.split("");
But the outcome was the same.
Later I found this topic: Remove empty elements from an array in Javascript
So I tried:
const charArr = [...row].filter(Boolean);
But the "" is still at the end of charArr created from 1st and 2nd row.
const input = `aabcccccaaa
aaccb
shsudbud`;
const rows = input.split("\n").map(row => row);
rows.forEach(row => {
const charArr = [...row];
console.log(charArr);
});
In this snippet everything works fine. So here is where my questions start:
Why does .filter() method not work properly in this case?
Could this problem be browser specific?
Thanks in advance.
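The question does not show the raw bytes of inputs.txt, so the following is only a guess: if the file was saved with Windows line endings ("\r\n"), splitting on "\n" leaves a carriage return at the end of every row except the last. "\r" is a non-empty string, so filter(Boolean) keeps it, and the inline template literal in the working snippet contains no "\r" at all. A sketch under that assumption, splitting on either kind of line ending:

```javascript
// Assumption: the file uses "\r\n" line endings, which the question does not confirm.
const input = 'aabcccccaaa\r\naaccb\r\nshsudbud';

// Splitting on "\n" alone keeps the trailing "\r" on all rows except the last.
const naiveRows = input.split('\n');
console.log(JSON.stringify([...naiveRows[1]]));
// prints ["a","a","c","c","b","\r"]

// "\r" is truthy, so filter(Boolean) does not remove it.
const stillThere = [...naiveRows[1]].filter(Boolean);

// Splitting on an optional "\r" followed by "\n" drops the carriage return.
const rows = input.split(/\r?\n/);
const charArrays = rows.map((row) => [...row]);
console.log(charArrays);
```

Logging JSON.stringify(row) for each row is a quick way to check whether a hidden "\r" is really present before changing the split.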

Iterate over elements in Nodejs

I have a webpage where I want to hover over all anchor tags and get the computed styles for each tag. The function I wrote doesn't seem to work: it gives me the original style of the anchor, not the hover style.
Please help.
let data = await page.evaluate(() => {
let elements = document.getElementsByTagName('a');
properties = []
for (var element of elements){
element.focus();
properties.push(JSON.parse(JSON.stringify(window.getComputedStyle(element, null)["backgroundColor"])));
}
return properties;
});
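See https://developer.mozilla.org/en-US/docs/Web/API/Window/getComputedStyle
Note that getComputedStyle is a method of window, not document, and its optional second argument selects a pseudo-element such as '::before', not a pseudo-class such as ':hover'. To read hover styles, the element has to actually be in the hover state when getComputedStyle is called.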
First of all, you should convert the result of document.getElementsByTagName to a normal array:
const elements = [...document.getElementsByTagName('textarea')];
Next, to get an element's property, use this syntax:
window.getComputedStyle(element).getPropertyValue("background-color")
Finally, this is a fully working example:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://css-tricks.com/almanac/selectors/f/focus/');
const data = await page.evaluate(() => {
const elements = document.getElementsByTagName('textarea');
return [...elements].map(element => {
element.focus();
return window.getComputedStyle(element).getPropertyValue("background-color");
});
});
console.log(data);
await browser.close();
})();
You can use page.$$() to obtain an ElementHandle array of textarea elements.
Then you can use elementHandle.hover() to hover over each element, and page.evaluate() to obtain the computed background color to push to your data array:
const elements = await page.$$( 'textarea' );
const data = [];
for ( let i = 0; i < elements.length; i++ )
{
await elements[i].hover();
data.push( await page.evaluate( element => window.getComputedStyle( element ).backgroundColor, elements[i] ) );
}
console.log( data );

Waiting for an iframe to be opened and scraped is too slow to scrape js

I'm trying to scrape an old website built with tr, br and iframe. Everything was going well until I needed to extract data from an iframe (see the iFrameScraping setTimeout in the code), but the clicking happens too fast for me to get the data. Would anyone have an idea of how to click, wait for the content to show and be scraped, then continue?
const newResult = await page.evaluate(async(resultLength) => {
const elements = document.getElementsByClassName('class');
for(i = 0; i < resultLength; i++) {
const companyArray = elements[i].innerHTML.split('<br>');
let companyStreet,
companyPostalCode;
// Get company name
const memberNumber = elements[i].getElementsByTagName('a')[0].getAttribute('href').match(/[0-9]{1,5}/)[0];
const companyName = await companyArray[0].replace(/<a[^>]*><span[^>]*><\/span>/, '').replace(/<\/a>/, '');
const companyNumber = await companyArray[0].match(/[0-9]{6,8}/) ? companyArray[0].match(/[0-9]{6,8}/)[0] : '';
// Get town name
const companyTown = await companyArray[1].replace('"', '');
// Get region name
const companyRegion = await companyArray[2].replace(/<span[^>]*>Some text:<\/span>/, '');
// Get phone number
const telNumber = await elements[i].innerHTML.substring(elements[i].innerHTML.lastIndexOf('</span>')).replace('</span>', '').replace('<br>', '');
const iFrameScraping = await setTimeout(async({elements, i}) => {
elements[i].getElementsByTagName('a')[0].click();
const iFrameContent = await document.getElementById('some-id').contentWindow.document.getElementById('lblAdresse').innerHTML.split('<br>');
companyStreet = iFrameContent[0].replace('"', '');
companyPostalCode = iFrameContent[2].replace('"', '');
}, 2000, {elements, i});
console.log(companyStreet, companyPostalCode)
};
}, pageSearchResults.length);
I fixed my issue after a while, so I'll share my solution.
I had to stop collecting all the data in a loop inside evaluate, because the loop ran too fast and created a race condition. Instead, I used page.$$ combined with a for...of loop. Note that forEach with an async callback causes the same race condition, since forEach does not wait for the promises its callback returns before execution continues.
Here is the example from my updated code:
const companies = await page.$$('.repmbr_result_item');
const companiesLinks = await page.$$('.repmbr_result_item a');
for(company of companies) {
const companyEl = await page.evaluate(el => el.innerHTML, company)
const companyElArray = companyEl.split('<br>');
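The difference between forEach and for...of described in this answer can be seen without Puppeteer. In the sketch below, scrape is a hypothetical stand-in for any asynchronous step such as page.evaluate(); the function names are made up for illustration:

```javascript
// Hypothetical stand-in for an asynchronous step such as page.evaluate().
const scrape = (item) =>
  new Promise((resolve) => setTimeout(() => resolve(item.toUpperCase()), 10));

async function withForEach(items) {
  const results = [];
  // forEach ignores the promises returned by the async callback,
  // so this function returns before any result has been pushed.
  items.forEach(async (item) => {
    results.push(await scrape(item));
  });
  return results;
}

async function withForOf(items) {
  const results = [];
  // Each iteration waits for the awaited step before moving on.
  for (const item of items) {
    results.push(await scrape(item));
  }
  return results;
}
```

Awaiting withForOf(['a', 'b']) yields both results in order, while the array returned by withForEach is still empty at the moment it is returned.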

how to select innerHTML from an elementHandle in puppeteer

Using the node puppeteer module, how do I continue with this code to get the innerContent here?
const els = Promise.all(await page.$$(selector)).then(results => {
results.map(async el => {
const tr = await el.$('tr')
//How do I convert this element handle to get its innerText content?
})
})
Like this
const textValue = await (await tr.getProperty('innerText')).jsonValue();
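Both getProperty() and jsonValue() return promises, so each call has to be awaited. One way the loop from the question could be completed, as a sketch that assumes page is a Puppeteer Page; the helper name rowTexts is hypothetical:

```javascript
// Hypothetical helper (not a Puppeteer API): collects the innerText of the
// first <tr> inside every element matching `selector`.
// Assumption: `page` is a Puppeteer Page.
async function rowTexts(page, selector) {
  const handles = await page.$$(selector);
  // Promise.all waits for every async callback started by map.
  return Promise.all(
    handles.map(async (el) => {
      const tr = await el.$('tr');
      if (tr === null) {
        return null; // no <tr> under this element
      }
      const property = await tr.getProperty('innerText');
      return property.jsonValue();
    })
  );
}

// Possible usage inside an async function:
//   const texts = await rowTexts(page, selector);
```

Wrapping the mapped promises in Promise.all, rather than the array of handles, is what makes the caller wait for the text values.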
