I'm scraping a website that has a lot of nested HTML elements, but I'm only interested in the abbr elements. In my case each abbr element has a data-utime attribute, so they look like <abbr data-utime="someValue">some other nested HTML</abbr>. I want to get the data-utime attribute value of the last abbr element on the page.
I tried to do something like this:
const SELECTOR = 'abbr:last-child';
const result = await page.evaluate((selector) => {
return document.querySelector(selector);
}, SELECTOR);
console.log(result);
console.log(typeof result);
console.log(result.getAttribute('data-utime'));
The problem is that result comes back as an empty object ({}), so typeof result is object, and of course it has no getAttribute function. I also believed that the :last-child selector was the proper way to get the last abbr element on the page. Any ideas how to achieve what I want?
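evaluate runs in the page's context, and its result is serialized before being returned to Node. A DOM element can't be serialized, which is why you get an empty object. Note also that abbr:last-child matches an abbr that is the last child of its parent, and querySelector returns the first such match, so it won't necessarily give you the last abbr on the page. Use $$eval instead and pick the last match yourself: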
const SELECTOR = "abbr";
const result =
await page.$$eval(SELECTOR,
(elements) => elements[elements.length - 1].dataset.utime);
console.log(result);
You can also use evaluate and call document.querySelectorAll inside it, but I prefer to keep the selectors in my Puppeteer code so I can reuse them.
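To illustrate why the original attempt printed {}, here is a minimal sketch that runs in plain Node without a browser. FakeAbbr and roundTrip are hypothetical names, and the JSON round trip is only an approximation of what happens when a value crosses from the page back to Node.

```javascript
// Hypothetical stand-in for a DOM element: like real elements, its data is
// exposed through prototype getters, not own enumerable properties.
class FakeAbbr {
  get dataset() {
    return { utime: '1500000000' };
  }
  getAttribute(name) {
    return name === 'data-utime' ? this.dataset.utime : null;
  }
}

// Approximates what crossing the page/Node boundary does to a return value.
function roundTrip(value) {
  return JSON.parse(JSON.stringify(value));
}

const el = new FakeAbbr();
const copied = roundTrip(el);                   // plain object, methods are gone
const extracted = roundTrip(el.dataset.utime);  // a string survives intact

console.log(copied, typeof copied.getAttribute, extracted);
```

The takeaway is to read the value you need inside the page function and return only that, which is what the $$eval answer does.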
So in essence I get the value of an input, then try to split it into separate tags on the commas with this
var noteTags = document.getElementsByClassName("noteTag").value;
Tag = noteTags.split(",");
But in console, the split(",") is undefined
Edit: Sorry I forgot to mention that the noteTag element is input, does this change how the code works in any way?
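There are two things to check:
getElementsByClassName returns an array-like collection of elements (an HTMLCollection), not a single element, so you need to index into it, e.g. with [0].
The property to read depends on the element: value is correct for an input, while a non-form element such as a div needs innerText instead.
Try it like below (keep value in place of innerText if noteTag is an input):
var noteTags = document.getElementsByClassName("noteTag")[0].innerText;
Tag = noteTags.split(",");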
You are reading value from the collection itself, which is undefined, so calling split() on it fails. Select a single element first, then use value for an input or innerText (or innerHTML) for other elements.
You can use querySelector instead; it returns the first matching element, so you don't need [0] to select it.
const noteTags = document.querySelector("#noteTag");
console.log(noteTags)
Tag = noteTags.innerHTML.split(",");
console.log(Tag)
<div id="noteTag">en,jp,fr</div>
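Once the raw string is in hand, the split itself can be made a little more forgiving. The sketch below is not from the answers above; parseTags is a hypothetical helper that assumes tags may have stray spaces or empty entries.

```javascript
// Hypothetical helper: turns "en, jp,,fr " into a clean list of tags.
function parseTags(raw) {
  return String(raw ?? '')     // tolerate null/undefined input
    .split(',')                // one entry per comma-separated piece
    .map((tag) => tag.trim())  // drop surrounding whitespace
    .filter(Boolean);          // drop empty entries
}

const tags = parseTags('en, jp,,fr ');
console.log(tags);
```

In the browser the argument would be the value of the input (or the text of the element) selected as shown above.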
EDIT: the document.querySelectorAll solution works, and is easier to read and understand. My own solution (in the answers, below) also works, and is slightly faster. The getElementsByClassName + getElementsByTagName solution is the fastest, so I've marked it as the accepted solution.
ORIGINAL POST: I need to find child elements of any element with a particular class, e.g.,
<li class="myclass"><a>This is the link I need to find</a></li>
so that I can set and remove some attributes from the anchor.
I can easily find all of the list items with getElementsByClassName, but getElementsByTagName fails because it only works on a single declared element (not on a collection). Therefore, this does not work:
const noLinks = document.getElementsByClassName('myclass');
for (let noLink of noLinks) {
const matches = noLinks.getElementsByTagName('a');
matches.setAttribute('role', 'link');
matches.setAttribute('aria-disabled', 'true');
matches.removeAttribute('href');
matches.removeAttribute('rel');
};
How can I iterate through the returned elements and get the tags inside of them?
There are two problems. First, the loop calls getElementsByTagName on noLinks (the whole collection) instead of noLink (the current element). Second, getElementsByTagName returns a live HTMLCollection, not a single element, so matches has no setAttribute or removeAttribute methods. To fix it, either access the first element of the collection, or use querySelector, which returns the first matching element if one exists.
const noLinks = document.getElementsByClassName('myclass');
for (let noLink of noLinks) {
//v-- access to noLink not noLinks
const matches = noLink.getElementsByTagName('a')[0]; //<-- or noLink.querySelector('a')
matches.setAttribute('role', 'link');
matches.setAttribute('aria-disabled', 'true');
matches.removeAttribute('href');
matches.removeAttribute('rel');
};
The OP's code could be switched to something more expressive (based on e.g. querySelectorAll) like ...
document
.querySelectorAll('.myclass a')
.forEach(elmNode => {
elmNode.setAttribute('role', 'link');
elmNode.setAttribute('aria-disabled', 'true');
elmNode.removeAttribute('href');
elmNode.removeAttribute('rel');
});
The following solution works. I'll probably test the other 2 offered solutions & upvote them if they work, but I'm posting this answer so others can see different ways of solving this.
// Modify the attributes on the <a> inside the <li> with class "nolink".
const noLinks = document.getElementsByClassName('nolink');
Array.prototype.forEach.call(noLinks, function(noLink) {
const matches = noLink.getElementsByTagName('a')[0];
matches.setAttribute('role', 'link');
matches.setAttribute('aria-disabled', 'true');
matches.removeAttribute('href');
matches.removeAttribute('rel');
});
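The Array.prototype.forEach.call trick works because the collection is array-like. The following sketch shows the idea in plain Node with a stand-in object; it assumes nothing about the real page, and the variable names are illustrative only.

```javascript
// Stand-in for an HTMLCollection: indexed entries plus `length`, but no
// array methods such as forEach or map.
const collection = { 0: 'first', 1: 'second', length: 2 };

// Borrow forEach from Array.prototype, as the answer above does.
const visited = [];
Array.prototype.forEach.call(collection, (item) => visited.push(item));

// Or copy into a real array first, which unlocks every array method.
const upper = Array.from(collection).map((item) => item.toUpperCase());

console.log(visited, upper);
```

Array.from is usually the more readable choice when the loop needs map or filter and not just forEach.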
I am trying to get the text from the following Xpath as a string:
//*[contains(text(), 'mission')]/following-sibling::text()[1]
I have tried
let elHandle = await page.$x("//*[contains(text(), 'mission')]/following-sibling::text()[1]")
which returns an ElementHandle<Element>[]. How can I navigate from here to get to the text string?
I am assuming your XPath is correct. page.$x returns an array of matched element handles (Promise<Array<ElementHandle>>), and you need the first one, so add [0] after the awaited page.$x(...) call.
It can be combined with page.evaluate to retrieve the text as a string. Because the XPath ends in text(), the match is a text node, which has no innerText, so read textContent instead.
const elHandleText = await page.evaluate(el => el.textContent, (await page.$x("//*[contains(text(), 'mission')]/following-sibling::text()[1]"))[0])
console.log(elHandleText)
As for whether this can be done with CSS selectors: it cannot. CSS has no way to select an element by its text content, so XPath's contains() function is the solution if you need to find an element with specific text content.
I'm using Puppeteer and am trying to use document.querySelectorAll to get a list of elements that I can loop over. However, something is wrong in my code: it returns nothing, undefined, or an empty {} even though my elements are on the page. My JS:
let elements = await page.evaluate(() => document.querySelectorAll("div[class^='my-class--']"))
for (let el of Array.from(elements)) {
// do something
}
what's wrong with my elements and page.evaluate here?
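page.evaluate runs your function inside the browser, so DOM selectors do work there, but the return value is serialized before it is handed back to Node. A NodeList of DOM elements can't be serialized, which is why you get undefined or an empty {} on the Node side.
One way around this is to fetch the page's HTML as a string (for example with page.content()) and load it into the Cheerio.js module, which lets you grab elements with a jQuery-like API as if it were a parsed DOM.
With the HTML as a string, you could also use DOMParser as in the example below. Note that DOMParser is a browser API, so this runs in a browser context rather than in plain Node.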
let doc = new DOMParser().parseFromString('<template class="myClass"><span class="target">check it out</span></template>', 'text/html');
let templateContent = doc.querySelector("template");
let template = new DOMParser().parseFromString(templateContent.innerHTML, 'text/html');
let target = template.querySelector("span");
console.log([templateContent,target]);
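Another option, in line with the $$eval answer earlier on this page, is to reduce the elements to plain data inside the page and return only that. The sketch below assumes an open Puppeteer page for the commented usage line; extractTexts and fakeElements are hypothetical names, and the stand-in objects exist only so the helper can run without a browser.

```javascript
// Hypothetical helper: reduces elements to plain, serializable data.
// It uses no outer variables, so it can be handed to page.$$eval as-is.
const extractTexts = (elements) =>
  elements.map((el) => ({
    className: el.className,
    text: el.textContent.trim(),
  }));

// Intended Puppeteer usage (assumes an open `page`):
//   const items = await page.$$eval("div[class^='my-class--']", extractTexts);

// Stand-in objects so the helper can be exercised without a browser.
const fakeElements = [
  { className: 'my-class--a', textContent: '  first ' },
  { className: 'my-class--b', textContent: 'second' },
];
const items = extractTexts(fakeElements);
console.log(items);
```

The loop over the results then happens in Node on ordinary objects, so nothing non-serializable has to cross the boundary.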
I wonder if there's a similar way as in Selenium to wait for text to appear for a particular element. I've tried something like this, but it doesn't seem to wait:
await page.waitForSelector('.count', {visible: true});
You can use waitForFunction. See https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagewaitforfunctionpagefunction-options-args
Including @elena's solution for completeness of the answer:
await page.waitForFunction('document.querySelector(".count").innerText.length == 7');
Apart from the method presented in the answer from nilobarp, there are two more ways to do this:
page.waitForSelector
Using the pseudo selector :empty it is possible to find elements that contain no child nodes or text. Combining this with the :not selector, we can use page.waitForSelector to query for a selector which is not empty:
await page.waitForSelector('.count:not(:empty)');
XPath expression
If you not only want to make sure that the element is not empty, but also want to check for the text it contains, you can use an XPath expression using page.waitForXPath:
await page.waitForXPath("//*[@class='count' and contains(., 'Expected text')]");
This line will only resolve after there is an element on the page which has the attribute class="count" and contains the text Expected text.
The best approach is to use waitForFunction() with a real function (this avoids passing the function as an awkward string):
const selector = '.count';
await page.waitForFunction(
selector => document.querySelector(selector).value.length > 0,
{},
selector
);
Depending on the type of element, replace value with innerText.
Check the Puppeteer API.
page.waitFor()
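You can also simply use page.waitFor() and pass it a function or CSS selector to wait for. Note that page.waitFor() is deprecated in newer Puppeteer versions in favor of the more specific methods such as page.waitForFunction() and page.waitForSelector().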
Wait for Function
If the element is an input field, we can check that the .count element exists before checking that a value is present to avoid potential errors:
await page.waitFor(() => {
const count = document.querySelector('.count');
return count && count.value.length;
});
If the element is not an input field, we can check that the .count element exists before checking that innerText is present to avoid potential errors:
await page.waitFor(() => {
const count = document.querySelector('.count');
return count && count.innerText.length;
});
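The two predicates above differ only in which property they read, so they can be folded into one. This is a sketch under that assumption, not part of the original answer; hasText is a hypothetical name, and the plain objects stand in for DOM elements so the logic can be checked without a browser.

```javascript
// Hypothetical predicate: true once the element exists and carries text.
// `value` covers inputs; `innerText` covers other elements.
function hasText(el) {
  if (!el) return false; // element is not in the DOM yet
  const text = typeof el.value === 'string' ? el.value : el.innerText;
  return typeof text === 'string' && text.trim().length > 0;
}

// Stand-ins for document.querySelector('.count') at different moments.
console.log(hasText(null));                 // element missing
console.log(hasText({ value: '' }));        // empty input
console.log(hasText({ value: '7' }));       // filled input
console.log(hasText({ innerText: 'abc' })); // non-input element with text
```

Inside the page, the same body would be applied to the result of document.querySelector('.count'); a function passed to Puppeteer cannot see Node-side helpers, so the logic has to be inlined there.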
Wait for CSS Selector
If the element is an input field that contains a placeholder, and you want to check if a value currently exists, you can use :not(:placeholder-shown):
await page.waitFor('.count:not(:placeholder-shown)');
If the element is an input field that does not contain a placeholder, and you want to check if the value attribute contains a string, you can use :not([value=""]):
await page.waitFor('.count:not([value=""])');
If the element is not an input field and has no child element nodes, we can use :not(:empty) to wait for the element to contain text:
await page.waitFor('.count:not(:empty)');
page.waitForXPath()
Wait for XPath
Otherwise, you can use page.waitForXPath() to wait for an XPath expression to locate element(s) on the page.
The following XPath expressions will work even if there are additional classes present on the element other than count. In other words, it will work like .count, rather than [class="count"].
If the element is an input field, you can use the following expression to wait for the value attribute to contain a string:
await page.waitForXPath('//*[contains(concat(" ", normalize-space(@class), " "), " count ") and string-length(@value) > 0]')
If the element is not an input field, you can use the following expression to wait for the element to contain text:
await page.waitForXPath('//*[contains(concat(" ", normalize-space(@class), " "), " count ") and string-length(text()) > 0]');
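The class-matching part of these expressions is easy to mistype, so it may help to generate it. The sketch below is only a string builder; hasClass and countWithText are hypothetical names, and the class name is assumed to contain no quote characters.

```javascript
// Hypothetical helper: builds an XPath predicate that matches one class token,
// mirroring how the CSS selector `.count` behaves.
function hasClass(className) {
  return `contains(concat(" ", normalize-space(@class), " "), " ${className} ")`;
}

// Expression for a non-input element with class `count` and non-empty text.
const countWithText = `//*[${hasClass('count')} and string-length(text()) > 0]`;
console.log(countWithText);
```

Padding both the class attribute and the token with spaces is what keeps `count` from matching a class such as `discount`.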
await page.waitFor((name) => {
return document.querySelector('.top .name')?.textContent == name;
}, {timeout: 60000}, test_client2.name);
waitForXPath is simple and works well for finding an element with specific text.
const el = await page.waitForXPath('//*[contains(text(), "Text to check")]');
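If the text to look for comes from a variable, quoting needs some care because XPath 1.0 string literals have no escape character. The following is a minimal sketch of one common workaround; xpathLiteral and containsText are hypothetical helpers, not part of Puppeteer.

```javascript
// Hypothetical helper: wraps text in an XPath 1.0 string literal.
function xpathLiteral(text) {
  if (!text.includes('"')) return `"${text}"`;
  if (!text.includes("'")) return `'${text}'`;
  // Both quote kinds present: join double-quote-free pieces with concat().
  const parts = text.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(`, '"', `)})`;
}

// Hypothetical helper: XPath for any element whose text contains `text`.
function containsText(text) {
  return `//*[contains(text(), ${xpathLiteral(text)})]`;
}

console.log(containsText('Text to check'));
console.log(containsText(`say "hi" to 'all'`));
```

The resulting string can be passed wherever the hard-coded expression above is used.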