I have a project to scrape the products purchased by certain customers from an internal CRM. This CRM uses a lot of dynamically loaded tiles, so there are not many consistent class names (many have an ID randomly appended at each page load), and there are also many different reports/elements on a page with the same class name, so I can't query the whole page for an element selector.
I have identified the "parent" element that I want via XPath. I then want to drill down and get the innerText of only the children that match the query selector (most threads I see have people running the query selector on the whole page, which would pick up results from menus I don't want).
I can do this in regular JavaScript in the browser console; I just can't figure out how to do it in Node/Puppeteer. Here's what I have so far:
//Getting xpath of the "box" that contains all of the product tiles that a customer has
const productsBox = await page.$x("/html/body/blah/blah/blah");
This is where it breaks down. I'm not super familiar with some of the syntax or understanding Puppeteer's documentation, but I've tried a few different methods (I'm also not comfortable enough with functions to use the => format. The Puppeteer documentation has an example of what I'm trying to do, but I tried with the same structure and it also returned nothing):
//Tried using the elementHandle.$$eval approach on the zero index of my xpath results,
//but doesn't return anything when I console.log(productsList)
const productsList = await productsBox[0].$$eval('.title-heading', function parseAndText (products) {
productsList=[];
for (i=0; i<products.length; i++) {
productsList.push(products[i].innerText.trim());
}
return productsList;
}
);
//Tried doing the page.$$eval approach with selector, passing in the zero index of my xpath
const productsList = await page.$$eval('.title-heading', function parseAndText (products) {
productsList=[];
for (i=0; i<products.length; i++) {
productsList.push(products[i].innerText.trim());
}
return productsList;
}, productsBox[0]);
//Tried the page.evaluate and then page.evaluateHandle approach on the zero index of my xpath,
//doing the query selection inside the evaluation and then doing something with that.
let productsList= await page.evaluateHandle(function parseAndText(productsBoxZero) {
productsInnerList = productsBoxZero.querySelectorAll(".title-heading");
productsList=[];
for (i=0; i<productsInnerList.length; i++) {
productsList.push(productsInnerList[i].innerText.trim());
//Threw a console log here to see if it does anything,
//But nothing is logged
console.log("Pushed product " + i + " into the product list");
}
return productsList;
}, productsBox[0]);
In terms of output, I've console logged some of the variables and I get this:
productsBox is JSHandle#node
productsBox[0] is JSHandle#node
productList is
For comparison, I was doing this in parallel via Javascript in the console to make sure I'm stepping through the logic correctly and I get what I expect:
>productsBox=$x("/html/body/blah/blah/blah");
>productsInnerList=productsBox[0].querySelectorAll(".title-heading");
>productsInnerList.length;
//2, and this customer has 2 products
>productsList=[];
>for (i=0; i<productsInnerList.length; i++) {
productsList.push(productsInnerList[i].innerText.trim());
};
>console.log(productsList)
>["Product 1", "Product 2"]
Thanks for reading this far and I appreciate your help!
[Edit]
For some additional research, I have tried to use page.evaluateHandle and tried to log my variables so far:
productsBox is JSHandle#node
productsBox[0] is JSHandle#node
productList is JSHandle#array
Which is progress. I tried to do:
let productsText=await productsList.jsonValue();
But when I try to output I get nothing:
await console.log("productsText is " + productsText);
productsBox is JSHandle#node
productsBox[0] is JSHandle#node
productList is JSHandle#array
productsText is
I'd suggest reading the docs carefully before trying every function.
page.$$eval runs the selector against the whole page, so passing the element as an extra argument is pointless in this case. evaluateHandle is for returning in-page objects as handles; since you're returning an array of text, which is serializable, you don't need it. All you need is to pass the element to page.evaluate, or do everything in the Puppeteer (Node) context.
To be able to see in-page console.log you need to:
page.on('console', msg => console.log(msg.text()));
Using page.evaluate
let productsList= await page.evaluate((element) => {
const productsInnerList = element.querySelectorAll(".title-heading");
const productsList=[];
for (const el of productsInnerList) {
productsList.push(el.innerText.trim());
console.log("Pushed product " + el.innerText.trim() + " into the product list");
}
return productsList;
}, productsBox[0]);
Using elementHandle.$$
const productList = [];
const productsInnerList = await productsBox[0].$$('.title-heading');
for (const element of productsInnerList){
const innerText = await (await element.getProperty('innerText')).jsonValue();
productList.push(innerText);
}
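A third option, offered only as a sketch and not tested against the asker's CRM, is to call $$eval on the element handle itself, so the selector only matches descendants of the XPath result. The names extractTitles and getScopedTitles are made up for illustration; elementHandle.$$eval is the Puppeteer call being relied on.

```javascript
// Runs inside the page: $$eval passes the matched elements as a real array.
function extractTitles(elements) {
  return elements.map(function (el) {
    return el.innerText.trim();
  });
}

// Runs in Node: the selector is resolved relative to parentHandle,
// not against the whole document.
async function getScopedTitles(parentHandle, selector) {
  return parentHandle.$$eval(selector, extractTitles);
}
```

With the variables from the question this would be called as getScopedTitles(productsBox[0], '.title-heading').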
Based on @mbit's answer I was able to get it to work. I first tested on another site that was similar in structure to mine. I then copied the code over to my original site and it still wasn't working; I only got a null output. It turns out that while I had an await page.$x(full/xpath) for the parent element, the child elements that contained the innerText still hadn't loaded. So I did two things:
1) Added another await page.$x(full/xpath) for the first element in the list that was one of my targets
2) Implemented the page.evaluate approach provided by mbit.
2a) Explicitly wrote out the function (still wrapping head around the => structure)
Final code below (some variable names changed as a result of testing):
let productsTextList = await page.evaluate(function list(parent) {
  // Runs in the page; declare locals so nothing leaks into the page's globals
  const productsInnerList = parent.querySelectorAll(".title-heading");
  const productsTextList = [];
  for (let n = 0; n < productsInnerList.length; n++) {
    const product = productsInnerList[n].innerText.trim();
    productsTextList.push(product);
  }
  return productsTextList;
}, productsBox[0]);
console.log(productsTextList);
I chose the page.evaluate approach because it more closely matched what I was doing in the browser console, so it was easy to test. The trick with the elementHandle.$$ approach was, as mbit mentioned, using await element.getProperty('innerText') rather than .innerText. Throughout troubleshooting and learning, I also stumbled across this thread on GitHub, which also talks about how to extract it (same as mbit's approach above). For anyone running into similar issues, you aren't alone!
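The "children had not loaded yet" problem can also be handled with an explicit wait. The sketch below swaps in page.waitForXPath for the second page.$x call described above; that is my substitution, not what the asker ran. It assumes a Puppeteer version that still ships page.$x and page.waitForXPath (both were removed in later releases). waitAndCollect, parentXPath and firstChildXPath are hypothetical names.

```javascript
// Runs inside the page: collects trimmed innerText of matching descendants.
function collectInnerText(parent, selector) {
  const texts = [];
  for (const el of parent.querySelectorAll(selector)) {
    texts.push(el.innerText.trim());
  }
  return texts;
}

// Runs in Node. waitAndCollect, parentXPath and firstChildXPath are
// hypothetical names; pass the real XPaths of your own page.
async function waitAndCollect(page, parentXPath, firstChildXPath, selector) {
  // Wait for the container and for at least one child before querying.
  await page.waitForXPath(parentXPath);
  await page.waitForXPath(firstChildXPath);
  const [parentHandle] = await page.$x(parentXPath);
  return page.evaluate(collectInnerText, parentHandle, selector);
}
```

Keeping the in-page function separate from the Node-side wrapper means the extraction logic can still be pasted into the browser console for testing.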
Related
I have the following structure I want to query for some e2e angular test with protractor:
<div id='parentId'>
<div>First</div>
<div>Second</div>
<div>Third</div>
<div>Fourth</div>
</div>
Currently I want to retrieve a list of the texts of the child div elements :
['First','Second', 'Third', 'Fourth']
I remember being able to do so with protractor in Angular some time ago but I can't get it to work now.
Current code I have is:
mytest.e2e-spec.ts:
import { browser, by, element, ElementArrayFinder, WebElement } from 'protractor';
describe('A test here ', () => {
  it('should find the texts in all four divs', async () => {
    const elements: WebElement[] = await element.all(by.css('#parentId > div')).getWebElements();
    // Read the text of every matched element in parallel
    let array = await Promise.all(elements.map(async (item)=>{
      let text = await item.getText();
      // let text = await item.getAttribute('innerText'); = null
      // let text = await item.getAttribute('innerHTML'); = null
      // let text = await item.getAttribute('value'); = null
      // let text = await item.getAttribute('textContent'); = null
      return text;
    }));
    console.log(array);
  });
});
The above code returns ['null', 'null', 'null', 'null']
I can get the tagName and length of the WebElement returned array from the query properly. However, so far the text query keeps returning 'null'.
I have tried without even looping and:
await elements[0].getText();
will also produce null.
Also tried following queries instead:
const elms0: WebElement[] = await (await browser.driver.findElement(by.id('parentId'))).findElements(by.tagName('div'));
const elms: WebElement[] = await browser.driver.findElements(by.css('#parentId > div'));
They both return the correct number of elements; however, the text is always null or empty. I suspect this has to do with my async operations somehow racing, but I have tried all the remedies out there and nothing seems to work so far. :/
Currently using protractor 7.0.0 , although the same version has been working before. Already went through related questions, I don't think there is anything I haven't tried yet.
Using chrome driver on a MacOS.
EDIT: The parent div had display:none as its styling which made it impossible to fetch the text.
If this doesn't work:
describe('A test here ', () => {
it('should find the texts in all four divs', async () => {
const elements = element.all(by.css('#parentId > div'));
let array = await elements.getText();
console.log(array);
})
});
then you likely have one of these issues:
The elements are not visible. But in that case .getText() would return '' unless a recent update changed it.
This is a Chrome-version-specific problem that has been reported: element.getAttribute('value') returns null in Protractor
PS. I used javascript syntax
As @Sergey assumed, it was indeed another issue on my side.
The parent div had display:none as its style. Selenium WebDriver was somehow detecting the whole DOM structure but not the texts. Further posts here, here and here led to more and more confusion. It seems it is not possible to get the text of a div element if it is hidden.
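getText() only reports rendered text, which is why a display:none parent produces empty results. One possible workaround, not taken from this thread and not verified against the asker's Protractor and Chrome versions, is to read textContent in the page through browser.executeScript. readTextContent is a hypothetical helper name.

```javascript
// Reads textContent for each element by running a small script in the page,
// which does not depend on the element being visible.
async function readTextContent(browser, webElements) {
  const texts = [];
  for (const el of webElements) {
    const text = await browser.executeScript('return arguments[0].textContent;', el);
    // Normalise missing values to an empty string.
    texts.push(text === null || text === undefined ? '' : String(text).trim());
  }
  return texts;
}
```

In the spec above this would be called as readTextContent(browser, elements). If the test is meant to assert what a user actually sees, asserting on hidden text may not be what you want.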
I'm trying to scrape URLs from https://en.wikipedia.org/wiki/List_of_hedge_funds
Specifically, I'm trying to use Apify to scrape that page and return a list of URLs from anchor tags present in the HTML. In my console, I expect to see the value of the href attribute of one or more anchor tags that exist on the target page in a property called myValue. I also expect to see the page title in a property called title. Instead, I just see the following URL property and its value.
My Apify actor uses the Puppeteer platform. So I'm using a pageFunction similar to the way Puppeteer uses it.
Below is a screen shot of the Apify UI just before I run it.
Page function
function pageFunction( context ) {
// called on every page the crawler visits, use it to extract data from it
var $ = context.jQuery;
var result = {
title: $('.wikitable').text,
myValue: $('a[href]').text,
};
return result;
}
What am I doing wrong?
You have a typo in your code, text is a function so you need to add parentheses:
var result = {
title: $('.wikitable').text(),
myValue: $('a[href]').text(),
};
But note that this will probably not do what you expect anyway - it will return text of all matched elements. You probably need to use jQuery's each() function (https://api.jquery.com/jquery.each/) to iterate the found elements, push some values from them to an array and return the array from your page function.
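A minimal sketch of that iteration, assuming context.jQuery behaves like regular jQuery as in the question. Using $('title') for the page title is my assumption (the question used '.wikitable'), and links is an illustrative variable name.

```javascript
function pageFunction(context) {
  var $ = context.jQuery;
  var links = [];
  // Visit every anchor that has an href and keep the attribute value.
  $('a[href]').each(function (index, element) {
    links.push($(element).attr('href'));
  });
  return {
    title: $('title').text(), // assumption: the <title> element is what is wanted
    myValue: links,
  };
}
```

myValue is then an array with one entry per anchor instead of the concatenated text of all matches.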
The page seems to be loaded by JavaScript so actually I have to use asynchronous code.
All right, so I had this function written a while ago and it was working well.
Basically I'm downloading a file and then checking that there are n items in chrome://downloads/ and that the filename matches:
this.checkDownload = async function checkDownload(fileNameRegEx) {
var regex = new RegExp(fileNameRegEx);
if ((await browser.getCapabilities()).get('browserName') === 'chrome') {
await browser.get('chrome://downloads/');
const items = await browser.executeScript('return downloads.Manager.get().items_');
expect(items.length).toBe(1);
expect(items[0].file_name).toMatch(regex);
}
};
And today I had to reuse it and it throws an error :
Cannot read property 'get' of undefined
I think the issue is that downloads.Manager is undefined.
Has anything changed in the Chrome API? Does something have a new name? I couldn't find any documentation on this in the official Chrome patch notes.
I tried looking through the downloads object but I could not find any property/method that lists downloaded items.
If you want to check whether there are downloads and they are done, this works:
var items = document.querySelector('downloads-manager').shadowRoot.querySelectorAll('#downloadsList downloads-item');
if (Array.from(items).every(i => i.shadowRoot.querySelector('#progress') == null || i.shadowRoot.querySelector('#progress').value === 100))
return Array.from(items).reduce((acc, curr) => [...acc, curr.shadowRoot.querySelector('div#content #file-link').href], []);
My code that was based on downloads.Manager broke too... It'd be great to have some information on why it was removed.
edit: See here, someone had the same issue and there's a fix: https://support.google.com/chrome/thread/28267973?hl=en
You can get the first element (or change the selector to get any other element) via a selector:
const element = await browser.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('downloads-item').shadowRoot.querySelector('a');");
Or get the text of the element by adding .innerText at the end:
const elementWithText = await browser.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('downloads-item').shadowRoot.querySelector('a').innerText;");
Look at the following answer https://stackoverflow.com/a/51346897/9332160
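Putting the shadow DOM selectors from the answers above together, the original check could be rebuilt roughly as follows. This is a sketch: the markup of chrome://downloads is not a stable API, the selectors may differ between Chrome versions, and it assumes the text of #file-link is the file name. listDownloadedFileNames and isSingleMatchingDownload are hypothetical helper names.

```javascript
// Script body executed inside chrome://downloads. Selectors come from the
// answers above and are assumptions about the current page markup.
const LIST_DOWNLOADS_SCRIPT = [
  "var manager = document.querySelector('downloads-manager');",
  "var items = manager.shadowRoot.querySelectorAll('#downloadsList downloads-item');",
  "return Array.from(items, function (item) {",
  "  var link = item.shadowRoot.querySelector('#file-link');",
  "  return link ? link.textContent.trim() : '';",
  "});",
].join('\n');

// Opens the downloads page and returns the listed file names.
async function listDownloadedFileNames(browser) {
  await browser.get('chrome://downloads/');
  return browser.executeScript(LIST_DOWNLOADS_SCRIPT);
}

// Mirrors the original expectations: exactly one item, and its name matches.
function isSingleMatchingDownload(fileNames, fileNameRegEx) {
  return fileNames.length === 1 && new RegExp(fileNameRegEx).test(fileNames[0]);
}
```

Inside checkDownload, the two expect calls could then be replaced by expect(isSingleMatchingDownload(await listDownloadedFileNames(browser), fileNameRegEx)).toBe(true).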
I can't speak to whether it has changed recently or not, but I noticed that getCapabilities returns a Map as a child of the main object. It could be the get() associated with this part that is generating the error, and not the one inside the executed script. Can you try adding ['map_'] like below:
if ((await browser.getCapabilities())['map_'].get('browserName') === 'chrome') {
I'm running a Protractor test against an Angular web application.
Test Case:
Find elements in the list.
Loop through every element in the list.
If an element contains the required name.
Click at the element.
Code:
let projectsList = await element.all(by.css(Selectors.projectList));
for (item of projectsList) {
item.getText().then((text) => {
if (text.includes("50_projects_to_tests")) {
console.log(text)
item.click()
}
}, (err) => console.log(err));
}
Problem:
The test case is straightforward except for one thing.
A request that updates the project information is sent every few seconds.
When the response comes back from the server, I lose the list of projects I selected before.
This means I can't click the element I found, because it no longer exists.
I receive:
StaleElementReferenceError: stale element reference: element is not attached to the page document
Question:
Is it possible to block/freeze the DOM while the test is executing?
Any ideas for handling the issue would be appreciated.
Getting stale element references while looping is a common problem.
First, you should avoid using .then() to manage promises if you are already using async/await; it just makes things more difficult to read.
Secondly, I would caution against disabling the refresh if that's not how the application works when an end user interacts with it.
The following answer assumes that after the page refreshes, the same elements will be found by element.all(by.css(Selectors.projectList));. The whole element array is recaptured on each iteration, but the loop keeps the index of the element it needs, so it can proceed:
let projectsList = await element.all(by.css(Selectors.projectList));
for (let loopCount = 0; loopCount < projectsList.length; loopCount++) {
  // Re-query on every pass so the reference is fresh even if the DOM was re-rendered
  projectsList = await element.all(by.css(Selectors.projectList));
  const item = projectsList[loopCount];
  const itemText = await item.getText();
  if (itemText.includes("50_projects_to_tests")) {
    console.log(itemText);
    await item.click();
  }
}
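Re-querying narrows the window for a stale reference but does not close it, since the refresh can still land between getText() and click(). A complementary approach, which is my addition and not part of the answer, is to retry the whole lookup when a StaleElementReferenceError is thrown. clickFirstMatching, findAll and maxAttempts are hypothetical names.

```javascript
// Finds the first element whose text contains `needle` and clicks it.
// If the DOM is re-rendered mid-loop, the lookup is retried from scratch.
async function clickFirstMatching(findAll, needle, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const items = await findAll();
      for (const item of items) {
        const text = await item.getText();
        if (text.includes(needle)) {
          await item.click();
          return true;
        }
      }
      return false; // no element matched
    } catch (err) {
      // Only swallow stale references, and only while attempts remain.
      if (err.name !== 'StaleElementReferenceError' || attempt === maxAttempts) {
        throw err;
      }
    }
  }
  return false;
}
```

With the names from the question, usage would look like await clickFirstMatching(() => element.all(by.css(Selectors.projectList)), "50_projects_to_tests").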
I've got a PhoneGap application in development where I'm trying to use framework7's search bar to filter out my virtual list of products.
Current functionality is that the list works fine, but the search bar only searches through the rendered elements rather than the whole virtual list.
I've gone through framework7's documentation on getting their virtual list and searchbar to work together, but as far as I can tell the searchbar in my code completely ignores the virtual list's searchAll function which I put in. I can have searchAll() return anything and it makes no difference to the current functionality.
var listObject = {
items: selectProd,
template: '<li class="item-content"><div class="item-inner"><div data-value="{{model_id}}" class="item-title list-title">{{internal_descriptn}}</div></div></li>',
searchAll: function (query, items) {
var foundItems = [];
for (var i = 0; i < items.length; i++) {
// Check if title contains query string
if (items[i].title.indexOf(query.trim()) >= 0) foundItems.push(i);
}
// Return array with indexes of matched items
return foundItems;
}
};
console.log(listObject);
var virtualList = myApp.virtualList('#product-list', listObject);
var mySearchbar = myApp.searchbar('.searchbar', {
searchList: '.list-block-search',
searchIn: '.list-title'
});
I feel like the only thing I could be missing is some way to put the virtualList into the searchbar as an attribute or similar to link them, it seems strange for me to expect them to just work together like magic. Yet that seems to be what the documentation suggests it does (not in my case apparently or it would work). Thanks for any help.
Solved it by looking at an example on their GitHub. At first glance everything was the same, so I copied it over and slowly changed it back to include my data to see where the problem occurred. For some goddamn reason, you need to use the class virtual-list to identify the parent containing your list. Then specify that class for your virtual list and for your search bar. Using a different name instead won't work. Very frustrating that this isn't mentioned anywhere at all in the documentation.
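Once the class name is fixed and searchAll is actually called, one more thing may be worth checking: the searchAll in the question reads items[i].title, while the template renders internal_descriptn. The sketch below assumes the items carry internal_descriptn, as the template suggests. Making the match case-insensitive is my choice, not something from the thread, and the wiring in the comments follows the answer's description rather than a tested setup.

```javascript
// Returns the indexes of items whose description contains the query.
// Assumption: each item has an `internal_descriptn` string field.
function searchAll(query, items) {
  var needle = query.trim().toLowerCase();
  var foundItems = [];
  for (var i = 0; i < items.length; i++) {
    var label = String(items[i].internal_descriptn || '').toLowerCase();
    if (label.indexOf(needle) >= 0) foundItems.push(i);
  }
  return foundItems;
}

// Wiring as described in the answer (untested sketch, Framework7 v1 style):
// var virtualList = myApp.virtualList('.virtual-list', {
//   items: selectProd, template: '...', searchAll: searchAll
// });
// var mySearchbar = myApp.searchbar('.searchbar', {
//   searchList: '.virtual-list', searchIn: '.list-title'
// });
```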