I am trying to scrape the following Javascript frontend website to practise my Javascript scraping skills:
https://www.oplaadpalen.nl/laadpaal/112618
I am trying to find two different elements by their XPath. The first one is the title, which it does find. The second one is the actual text itself, which it somehow fails to find. It's strange, since I just copied the XPaths from the Chrome browser.
from selenium import webdriver
link = 'https://www.oplaadpalen.nl/laadpaal/112618'
driver = webdriver.PhantomJS()
driver.get(link)
#It could find the right element
xpath_attribute_title = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[' + str(3) + ']/label'
next_page_elem_title = driver.find_element_by_xpath(xpath_attribute_title)
print(next_page_elem_title.text)
#It fails to find the right element
xpath_attribute_value = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[' + str(3) + ']/text()'
next_page_elem_value = driver.find_element_by_xpath(xpath_attribute_value)
print(next_page_elem_value.text)
I have tried a couple of things: change "text()" into "text", "(text)", but none of them seem to work.
I have two questions:
Why doesn't it find the correct element?
What can we do to make it find the correct element?
Selenium's find_element_by_xpath() method returns the first element node matching the given XPath query, if any. However, XPath's text() function returns a text node, not the element node that contains it.
To extract the text using Selenium's finder methods, you'll need to find the containing element, then extract the text from the returned object.
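As a rough illustration of the element-versus-text-node distinction, here is a small sketch that uses Python's standard xml.etree.ElementTree rather than Selenium. The markup is a made-up row shaped like the page's label/value pairs (an assumption, not the real page source). The finder can hand back the label element, while the value is only a bare text node that has to be read off an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like one label/value row on the page.
row = ET.fromstring("<div><label>Paalcode:</label>NewMotion 04001157</div>")

label = row.find("label")            # an element node: something a finder can return
title = label.text                   # text inside the <label> element
value = (label.tail or "").strip()   # the bare text node that follows the label

print(title, value)
```

With Selenium the same idea applies: locate the enclosing div element, read its text, and separate the label from the value afterwards.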
Keeping your own logic intact, you can extract the labels and the associated values as follows:
for x in range(3, 8):
    label = driver.find_element_by_xpath("//div[@class='labels']//following::div[%s]/label" %x).get_attribute("innerHTML")
    value = driver.find_element_by_xpath("//div[@class='labels']//following::div[%s]" %x).get_attribute("innerHTML").split(">")[2]
    print("Label is %s and value is %s" % (label, value))
Console Output :
Label is Paalcode: and value is NewMotion 04001157
Label is Adres: and value is Deventerstraat 130
Label is pc/plaats: and value is 7321cd Apeldoorn
I would suggest a slightly different approach. I would grab the entire text and then split one time on :. That will get you the title and the value. The code below will get Paalcode through openingstijden labels.
# find_elements (plural) returns a list that can be indexed
rows = driver.find_elements_by_css_selector("div.leftblock > div.labels > div")
for x in range(2, 8):
    s = rows[x].text
    t = s.split(":", 1)
    print(t[0]) # title
    print(t[1]) # value
You don't want to split more than once because Status contains more colons.
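A minimal sketch of why the split is limited to one. The row texts below are invented for illustration (in particular the Status value), so treat them as assumptions about what the page returns.

```python
# Hypothetical row texts; the Status value is invented to show extra colons.
rows = ["Paalcode: NewMotion 04001157", "Status: Beschikbaar: 2 van 2"]

pairs = []
for s in rows:
    title, value = s.split(":", 1)   # maxsplit=1 keeps later colons inside the value
    pairs.append((title.strip(), value.strip()))

print(pairs)
```

Without the second argument, the Status row would split into three parts and the two-name unpacking would fail.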
Going with @JeffC's approach, if you want to first select all those elements using xpath instead of css selector, you may use this code:
xpath_title_value = "//div[@class='labels']//div[label[contains(text(),':')] and not(div) and not(contains(@class,'toolbox'))]"
title_and_value_elements = driver.find_elements_by_xpath(xpath_title_value)
Notice the plural "elements" in the find_elements_by_xpath method. The XPath above selects div elements that are descendants of a div element that has a class attribute of "labels". The nested label of each selected div must contain a colon. Furthermore, the div itself may not have a class of "toolbox" (something that certain other divs on the page have), nor may it contain any nested divs.
Following which, you can extract the text within the individual div elements (which also contain the text from the nested label elements) and then split them using ":\n" which separates the title and value in the raw text string.
for element in title_and_value_elements:
    element = element.text
    title, value = element.split(":\n")
    print(title)
    print(value, "\n")
Since you want to practise your JS skills, you can also do this in JS. The divs actually contain more data, as you can see if you paste this into the browser console:
labels = document.querySelectorAll(".labels");
divs = labels[0].querySelectorAll("div");
for (div of divs) console.log(div.firstChild, div.textContent);
You can push the results to an array, keeping only the divs whose first child is a label, and return the resulting array into a Python variable:
labels_value_pair = driver.execute_script('''
scrap = [];
labels = document.querySelectorAll(".labels");
divs = labels[0].querySelectorAll("div");
for (div of divs) if (div.firstChild.tagName==="LABEL") scrap.push(div.firstChild.textContent, div.textContent);
return scrap;
''')
I am new to JS.
Can you tell me why I am getting empty values for sports-title and third,
even though we have one div with content in it?
sports-title---->{"0":{}}
third---->{}
My code is provided below.
findStringInsideDiv() {
    /*
    var str = document.getElementsByClassName("sports-title").innerHTML;
    */
    var sportsTitle = document.getElementsByClassName("sports-title");
    var third = sportsTitle[0];
    var thirdHTML = third.innerHTML;
    //str = str.split(" ")[4];
    console.log("sports-title---->" + JSON.stringify(sportsTitle));
    console.log("third---->" + JSON.stringify(third));
    console.log("thirdHTML---->" + JSON.stringify(thirdHTML));
    if (thirdHTML === " basketball football swimming ") {
        console.log("matching basketball---->");
        var menu = document.querySelector('.sports');
        menu.classList.add('sports-with-basketball');
        // how to add this class name directly to the first div after body.
        // but we are not rendering that div in accordion
        //is it possible
    }
    else {
        console.log("not matching");
    }
}
When you look up an object in the Document Object Model (DOM) using any of the getElement* methods, you get back an object that represents the HTML element itself. This object includes much more than just the text of the element, and because its properties are not own enumerable data properties, JSON.stringify() serializes it as {}. To access the text of that element, use the .textContent property.
In addition, an HTML class can be assigned to several elements, so getElementsByClassName returns an array-like collection (an HTMLCollection) rather than a single element. You therefore have to index into it, for example:
console.log("sports-title---->" + JSON.stringify(sportsTitle[0].textContent));
You can find a brief introduction to the DOM on the W3Schools Website. https://www.w3schools.com/js/js_htmldom.asp If you follow along it gives an overview of different aspects of the DOM including elements.
Maybe this will be helpful.
As you can see, sportsTitle[0].textContent returns the full heading. The 0 is the index, which is why you get "0" as a key when you stringify (serialize) sportsTitle. Why 0? Because you have only one <h1> element. See this fiddle: http://jsfiddle.net/cqj6g7f0/3/
I added a second h1; look at the console.log output and you will see two indexes, 0 and 1.
If you want to get a single word from the element, take a substring with the substr() method: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr
One way is to change the <h1> class attribute to an id, fetch the element with document.getElementById(), and read sportsTitle.textContent,
then use substr() on this string
or
the 2nd way is to keep the class attribute and read sportsTitle[0].textContent,
then use substr() on this string.
The 2nd is the better way.
I am trying to access child element of an ng-repeat element but I am having troubles doing that.
I have searched around about the problem and the solutions that I have found did not work for me. One of those solutions was to do something like this:
var parent = element(by.repeater(''));
var child = parent.element(by.....);
When I try the child line, I can't see the element function on the parent element.
http://prikachi.com/images/11/8338011u.png
If you see the screenshot above you will see the structure of the code of the page that I am trying to test.
I need to access the alt attribute of the avatar image and get its value (that's the username of the user).
One thing that came to my mind is to use .getInnerHTML() on the ng-repeat row, which will return a string with all that code. From there I can find the alt attribute and its value with string manipulation, but this seems too brute-force and I am sure that there has to be a better way.
Simply put, I want to be able to get row 4 from the repeater and get the username of the user at row 4. That's all I want to do.
Try this,
var parent = element(by.repeater('f in feed'));
var child = parent.all(by.xpath('.//img[@alt="Pundeep"]')).first()
(or)
var parent = element(by.repeater('f in feed'));
var child = parent.all(by.xpath('.//img[@alt="Pundeep"]')).get(0)
You can get it directly using element.all() and get() locator in protractor. Here's how -
var child = element.all(by.repeater('parent_locator')).get(3); // gets the 4th element in the repeater; the index is 0-based.
child.getAttribute('alt').then(function(user){
var username = user; //username contains the alt text
});
Hope this helps.
The Protractor element documentation gives an example like this for finding child elements, which is the same as chaining element finders:
// Chain 2 element calls.
let child = element(by.css('.parent')).
$('.child');
expect(child.getText()).toBe('Child text\n555-123-4567');
// Chain 3 element calls.
let triple = element(by.css('.parent')).
$('.child').
element(by.binding('person.phone'));
expect(triple.getText()).toBe('555-123-4567');
// Or using the shortcut $() notation instead of element(by.css()):
// Chain 2 element calls.
let child = $('.parent').$('.child');
expect(child.getText()).toBe('Child text\n555-123-4567');
// Chain 3 element calls.
let triple = $('.parent').$('.child').
element(by.binding('person.phone'));
expect(triple.getText()).toBe('555-123-4567');
https://www.protractortest.org/#/api?view=ElementFinder.prototype.$
This example could help:
return element(by.css('select.custom-select:nth-child(1) option[value="12"]'));
You can use the nth-child() selector to access a child element.
In my example I used a plugin with two selects that share the same classes, and I wanted to click a specific option in the first select and another one in the second.
I need to find the length (i.e. the number of characters) of the text within a specified div (#post_div), EXCLUDING HTML formatting AND the content of any non-specified span. So any embedded span that is NOT #span1 or #span2 needs to be excluded from the count.
So far I have the following solution which works, but it adds/removes from the DOM which I would prefer not to do.
var post = $("#post_div");
var post2 = post.html(); //duplicating for later
post.find("span:not(#span1):not(#span2)").remove(); //removing unwanted (only for character count) spans from DOM - YUCK!
post = $.trim(post.text());
console.log(post.length); // The correct length is here.
$("#post_div").html(post2); //replacing butchered DIV with original duplicate in DOM - YUCK!
I would prefer to achieve the same result, but without butchering the DOM/adding/replacing things from it for a simple character count.
Hope that makes sense
Instead of duplicating the HTML then working on the original node, duplicate the node and work on it outside of the main DOM tree.
var post = $("#post_div").clone();
post.find("span:not(.post_tag):not(.post_mentioned)").remove();
post = $.trim(post.text());
console.log(post.length); // The correct length is here.
Actually, the simple
var t = $.trim($("#post_div span.post_tag, #post_div span.post_mentioned").text());
console.log(t.length);
should suffice.
However, if you have textual content outside of span elements, you would have to use
var t = $.trim($("#post_div").text());
var t_inner = $("#post_div span:not(.post_tag):not(.post_mentioned)").text();
console.log(t.length - t_inner.length);
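The same idea, counting the text while skipping unwanted spans and leaving the tree untouched, can be sketched outside the browser. The example below uses Python's xml.etree.ElementTree on made-up markup; the ids span1 and span2 come from the question, while the helper name visible_text and the sample content are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

KEEP_IDS = {"span1", "span2"}  # spans whose text still counts

def visible_text(node):
    """Collect the text under node, skipping <span> elements not in KEEP_IDS."""
    parts = [node.text or ""]
    for child in node:
        skip = child.tag == "span" and child.get("id") not in KEEP_IDS
        if not skip:
            parts.append(visible_text(child))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical post markup.
post = ET.fromstring(
    '<div id="post_div">Hi <b>all</b>, <span id="span1">keep</span>'
    '<span class="x">drop</span> bye</div>'
)
text = visible_text(post).strip()
print(len(text))
```

Because the function only reads the tree, nothing has to be removed and restored afterwards.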
I am creating a templating system which can be interpreted at client side with Javascript to construct a fill in the blanks form e.g. for a letter to a customer etc.
I have the template constructed and the logic set out in pseudocode; however, given my unfamiliarity with jQuery, I could use some direction to get me started.
The basic idea is that markup in my text node denotes a field, e.g. ${prologue}. Each field is added to an array called "fields", which is then used to search for the corresponding node names in the XML.
XML
<?xml version="1.0" encoding="UTF-8"?>
<message>
<text>${Prologue} - Dear ${Title} ${Surname}. This is a message from FUBAR. An engineer called but was unable to gain access, a new appointment has been made for ${ProductName} with order number ${VOLNumber}, on ${AppointmentDate} between ${AppointmentSlot}.
Please ensure you are available at your premises for the engineer. If this is not convenient, go to fubar.com or call 124125121515 before 12:00 noon the day before your appointment. Please refer to your order confirmation for details on what will happen on the day. ${Epilogue} - Free text field for advisor input
</text>
<inputTypes>
<textBox type="text" fixed="n" size="100" alt="Enter a value">
<Prologue size="200" value="BT ENG Appt Reschedule 254159" alt="Prologue field"></Prologue>
<Surname value="Hoskins"></Surname>
<ProductName value=""></ProductName>
<VOLNumber size="8" value="" ></VOLNumber>
<Epilogue value=""></Epilogue>
</textBox>
<date type="datePicker" fixed="n" size="8" alt="Select a suitable appointment date">
<AppointmentDate></AppointmentDate>
</date>
<select type="select" >
<Title alt="Select the customers title">
<values>
<Mr selected="true">Mr</Mr>
<Miss>Miss</Miss>
<Mrs>Mrs</Mrs>
<Dr>Dr</Dr>
<Sir>Sir</Sir>
</values>
</Title>
<AppointmentSlot alt="Select the appointment slot">
<values>
<Morning>9:30am - 12:00pm</Morning>
<Afternoon>1:00pm - 5:00pm</Afternoon>
<Evening>6:00pm - 9:00pm</Evening>
</values>
</AppointmentSlot>
</select>
</inputTypes>
</message>
Pseudocode
Get list of tags from text node and build array called "fields"
For each item in "fields" array:
Find node in xml that equals array item's name
Get attributes of that node
Jump to parent node
Get attributes of parent node
If attributes of parent node != child node then ignore
Else add the parent attributes to the result
Build html for field using all the data gathered from above
Addendums
Is this logic OK? Is it possible to start at the parent of the node and navigate downwards instead?
Also, with regard to inheritance, could we get the parent attributes and, if the child attributes are different, add them to the result? What if the number of attributes in the parent does not equal the number in the child?
Please do not provide fully coded solutions, just a few teasers to get me started.
Here is what I have so far which is extracting the tags from text node
// get the value of the "text" node in the xml
var theLetter = $(xml).find("text").text();
var start = theLetter.indexOf('${');
var end = theLetter.indexOf('}', start);
var tag = "";
var inputType;
var tagArray = [];        // field names, e.g. "Prologue"
var tagReplaceArray = []; // full tokens, e.g. "${Prologue}"
// find all tags and add them to the tag arrays
while (start >= 0 && end >= 0)
{
    //console.log("Reach In Loop " + start)
    tag = theLetter.slice(start + 2, end);
    tagArray.push(tag);
    tagReplaceArray.push(theLetter.slice(start, end + 1));
    // continue searching after the token that was just consumed
    start = theLetter.indexOf('${', end + 1);
    end = theLetter.indexOf('}', start);
}
Any other recommendations or links to similar problems would be welcome.
Thank you!
I am using a similar technique to do html templating.
Instead of working with elements, I find it easier to work with a string and then convert it to html. In your case with jQuery, you could do something similar:
Have your xml as a string:
var xmlString='<?xml version="1.0" encoding="UTF-8"?><message><text>${Prologue} - Dear ${Title} ${Surname}... ';
Iterate through the string to do the replacements with a regex ($1 is the captured placeholder, for example Surname):
xmlString = xmlString.replace(/\$\{([^}]+)\}/g, function($0, $1) { ... });
Convert to nodes if needed:
var xml=$(xmlString);
The benefits of the regex:
faster (just a string, you're not walking the DOM)
global replace (for example if Surname appears several times), just loop through your object properties once
simple regex /\$\{([^}]+)\}/ to target the placeholder
Get list of tags from text node and build array called "fields"
To create the array I would rather use a regular expression. This is one of the best uses for it (in my opinion), because we are indeed searching for a pattern:
var reg = /\$\{(\w+)\}/gm;
var fields = [];
var m;
// txt is assumed to hold the content of the text node
while ((m = reg.exec(txt)) !== null)
{
    fields.push(m[1]);
}
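For comparison, the same placeholder pattern can be tried out in a few lines of Python. This is only a sketch: the letter text is shortened, and the values dictionary is a hypothetical stand-in for whatever the form would supply.

```python
import re

# Shortened, hypothetical version of the letter text from the question.
txt = "${Prologue} - Dear ${Title} ${Surname}. Your order ${VOLNumber} is booked."

PLACEHOLDER = re.compile(r"\$\{(\w+)\}")

# Field names in order of appearance.
fields = PLACEHOLDER.findall(txt)

# Replace each placeholder through a callback; unknown names are left untouched.
values = {"Title": "Mr", "Surname": "Hoskins"}
filled = PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), txt)

print(fields)
print(filled)
```

Note that the dollar sign and the braces are escaped; an unescaped $ is an end-of-input anchor and would never match the literal character.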
For each item in "fields" array
jQuery offers some utility functions :
To iterate through your fields you could do this : $.each(fields, function(index, value){});
Navigating through the nodes and retrieving the values
Just use the jQuery function like you are already doing.
Building the HTML
I would create template objects for each type you want to support (in this example: Text, Select).
Then using said templates you could replace the tokens with the HTML of your templates.
Displaying the HTML
Last step would be to parse the result string and append it at the right place:
var ResultForm = $.parseHTML(txt);
$("#DisplayDiv").append(ResultForm);
Conclusion
Like you asked, I did not prepare anything that works right out of the box, I hope it will help you prepare your own answer. (And then I hope you will share it with the community)
This is just a framework to get you going, like you asked.
The first concept is using a regex to find all matches of ${ }. It returns an array like ["${one}","${t w 0 }","${ three}"].
The second concept is an htmlGenerators object mapping each "inputTypes" child name to a function responsible for producing the HTML.
The third is not to forget about native JavaScript. .localName will give you the XML element's name, and node.attributes should give you a NamedNodeMap back (remember not to read native DOM properties from the jQuery object; make sure you're referencing the node element jQuery found for you).
The actual flow is simple.
Find all the '${}' tokens and store the result in an array.
Find all the tokens in the XML document and, using their parents' info, store the HTML in a map of {"${one}":"<input type='text' .../>","${two}":"<select><option value='hello'>world!</option></select>" ...}
Iterate through the map and replace every token in the source text with the HTML you want.
javascript
var $xmlDoc = $(xml); //store the xml document
var tokenSource = $xmlDoc.find("message text").text();
var tokenizer = /\$\{[^}]+\}/g; //used to find replacement locations
//node.attributes is a NamedNodeMap, so copy it into a plain object before merging
function attributesToObject(node){
    var result = {};
    if(node && node.attributes){
        $.each(node.attributes, function(i, attr){ result[attr.name] = attr.value; });
    }
    return result;
}
var htmlGenerators = {
    "textBox":function(name,$elementParent){
        //this may be not enough null check work, but you get the idea
        var parentAttributes = attributesToObject($elementParent[0]);
        var specificAttributes = attributesToObject($elementParent.find(name)[0]);
        //extend or overwrite the contents of the first obj with contents from 2nd, then 3rd, ... then nth, see $.extend(): http://api.jquery.com/jQuery.extend/
        var combinedAttributes = $.extend({}, parentAttributes, specificAttributes);
        //return a string so it can be substituted into the source text
        return $("<input>",combinedAttributes)[0].outerHTML;
    },
    "date":function(name,$elementParent){
        //whatever you want to do for a 'date' text input
    },
    "select":function(name,$elementParent){
        //put in a default select box implementation, obviously you'll need to copy options attributes too in addition to their value / visible value.
    }
};
var html = {};
var tokens = tokenSource.match(tokenizer) || []; //pull out each ${elementKey}
for(var index = 0; index < tokens.length; index++){
    var elementKey = tokens[index].replace("${","").replace("}","");//chomp ${ and }
    var $elementParent = $xmlDoc.find(elementKey).parent();//we need parent attributes
    //native .localName holds the element name of the xml node, in this case "textBox", "date" or "select"
    var parentName = $elementParent[0] ? $elementParent[0].localName : null;
    var elementFunction = parentName ? htmlGenerators[parentName] : null; //lookup the html generator function
    if(elementFunction){ //make sure we found one
        html[tokens[index]] = elementFunction(elementKey,$elementParent);//store the result
    }
}
for(var token in html){
    //for every html result, replace its token
    tokenSource = tokenSource.replace(token,html[token]);
}
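The parent/child attribute merge at the heart of this flow can be sketched in isolation. The example below uses Python's xml.etree.ElementTree on a trimmed copy of the question's inputTypes section; the helper name field_attributes and the rule that child values override parent defaults are assumptions about the intended inheritance, not something the question settles.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the <inputTypes> section from the question.
xml = """
<inputTypes>
  <textBox type="text" fixed="n" size="100" alt="Enter a value">
    <Prologue size="200" alt="Prologue field"/>
    <Surname value="Hoskins"/>
  </textBox>
</inputTypes>
"""
root = ET.fromstring(xml)

# ElementTree keeps no parent pointer, so build a child -> parent lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}

def field_attributes(name):
    """Return (input type tag, merged attributes) for a field, or None if absent."""
    node = root.find(f".//{name}")
    if node is None:
        return None
    parent = parent_of[node]
    # Assumed rule: child values override the parent's defaults.
    merged = {**parent.attrib, **node.attrib}
    return parent.tag, merged

print(field_attributes("Prologue"))
```

Merging into a fresh dictionary also sidesteps the question of differing attribute counts: whatever the parent defines is inherited, and whatever the child defines wins.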
I have been banging my head on this for the better part of a day; I need to count the number of childNodes in a parent div. It basically is acting like a list and each childNode is a row I want to count. The html looks like:
div<@class="list ">
div<@id="list-item-01">
div<@id="list-item-02">
div<@id="list-item-03">
div<@id="list-item-04">
div<@id="list-item-05">
...
</div>
My primary approach has been to use the getEval() function in Selenium using some javascript.
examples that have failed:
String locator = "xpath=//div[contains(@class,'list')]";
String jscript = "var element = this.browserbot.findElement('"+locator+"');";
jscript += "element.childNodes.length;";
String locator = "xpath=//div[@class='list']";
String jscript = "var element = this.browserbot.findElement('"+locator+"');";
jscript += "element.childNodes.length;";
Now I know the element is there and my xpath is correct because I have tried using .isElementPresent and that returns true. So something is funky with Selenium and divs.
I also poked around with document.evaluate() as my javascript command but that proved equally fruitless.
Why not use getXPathCount? Something like
getXPathCount("//div[contains(@class, 'list ')]/div[contains(@id, 'list-item-')]")
should do the trick.
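To see what counting by XPath amounts to, here is a small sketch with Python's xml.etree.ElementTree on made-up markup that mirrors the list in the question (three rows are assumed for brevity). It counts element children only, whereas childNodes.length in the browser also counts whitespace text nodes between the divs.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the list structure in the question.
page = ET.fromstring(
    '<body><div class="list ">'
    '<div id="list-item-01"/><div id="list-item-02"/><div id="list-item-03"/>'
    '</div></body>'
)

# Direct <div> children of the list container.
rows = page.findall(".//div[@class='list ']/div")
print(len(rows))
```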
.Net documentation
Java documentation
So these divs are created dynamically.
When you create them, you will be using a loop counter such as $i, incremented with $i++.
After printing the divs, write the value of $i to a hidden field.
If there are 3 divs, $i = 3, so the hidden field's value is 3.
Then just read the value of that field using JavaScript.
Or
Try these
http://api.jquery.com/parent/
http://api.jquery.com/children/