Get XPath of a Facebook page post using HtmlUnit

I want to get the XPath of a Facebook post using HtmlUnit. You can refer to these two questions for more background on what I want to do:
Supernatural behaviour with a facebook page
HtmlUnit commenting out lines of facebook page
To reproduce what I did, you can follow the first question. The HTML of the Facebook page is available at http://pastebin.com/MfXsYSJQ.
Alternatively, you can simply go to https://www.facebook.com/bhramakarserver.
I just want to get the XPath of the span containing the post with the text: "Hi! this is the first post of this page."
What I tried was this:
public class ForStackOverflow {
    public static void main(String[] args) throws IOException {
        WebClient client = new WebClient(BrowserVersion.FIREFOX_17);
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setRedirectEnabled(true);
        client.getOptions().setThrowExceptionOnScriptError(true);
        client.getOptions().setCssEnabled(true);
        client.getOptions().setUseInsecureSSL(true);
        client.getOptions().setThrowExceptionOnFailingStatusCode(false);
        client.setAjaxController(new NicelyResynchronizingAjaxController());
        HtmlPage page1 = client.getPage("https://www.facebook.com/bhramakarserver");
        System.out.println(page1.asXml());
        // getting the xpath of span of class="userContent"
        HtmlInput input = (HtmlInput) page1.getByXPath("/html/body//input[@type='submit']").get(0);
        System.out.println(input.asXml());
        // This line gives error as the xpath evaluates to null
        HtmlSpan span = (HtmlSpan) page1.getByXPath("/html/body//span[@class='userContent']").get(0);
    }
}
The problem seems to be that page1 holds only the static HTML, in which the span element:
<span data-ft="{"tn":"K"}" class="userContent">Hi! this is the first post of this page.</span>
is generated dynamically, so it appears commented out in the HTML of page1. When inspected with the browser's Inspect Element tool, however, it appears as a normal element, so it must be uncommented dynamically. Is there a way to get page1's HTML into the state it is in after all dynamic content has loaded, so that I can evaluate the XPath correctly? Can it be done using Selenium WebDriver?

Given that information, it seems fair to assume that some AJAX call is not being fired or that you're not properly waiting for the AJAX to execute. I haven't gotten the best results using that AJAX controller. Sadly, a loop is usually the best way to go.
I've explained how to do that in this question: Get the changed HTML content after it's updated by Javascript? (htmlunit)
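The polling loop mentioned above can be sketched generically. This is a minimal Python sketch, not the code from the linked answer: `wait_until`, its parameters, and the default timings are hypothetical names and values chosen for illustration. It repeatedly evaluates a condition until the condition yields a truthy value or a timeout expires.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value on success, or None if the timeout expired.
    (Hypothetical helper for illustration; not part of HtmlUnit or Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        # Give the page's scripts time to run before checking again.
        time.sleep(interval)
```

In the HtmlUnit case the condition would re-run the XPath query against the page and succeed once the result list is no longer empty.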
If this doesn't do the trick, then probably you're getting a JavaScript exception. I've written some possible workarounds to that situation in this other question: How to overcome an HTMLUnit ScriptException?
If none of these work, then I'd recommend using something other than HtmlUnit. Any real browser driver would do the trick, or an alternative such as PhantomJS or ZombieJS.

Related

Sending keys via Selenium to Google Auth fails (using Python and Firefox)

I am following a Django Tutorial by Marina Mele, which is pretty good but a bit outdated since it was last updated in 2016, I believe. I am now trying the Selenium testing and ran into the problem that I can send my e-mail address via Selenium but not the password. My code is
self.get_element_by_id("identifierId").send_keys(credentials["Email"])
self.get_button_by_id("identifierNext").click()
self.get_element_by_tag('input').send_keys(credentials["Passwd"])
self.get_button_by_id("passwordNext").click()
with these functions being defined as:
def get_element_by_id(self, element_id):
    return self.browser.wait.until(EC.presence_of_element_located(
        (By.ID, element_id)))

def get_element_by_tag(self, element_tag):
    return self.browser.wait.until(EC.presence_of_element_located(
        (By.TAG_NAME, element_tag)))

def get_button_by_id(self, element_id):
    return self.browser.wait.until(EC.element_to_be_clickable(
        (By.ID, element_id)))
Most of the advice I have read on this issue revolves around waiting until the element appears. However, that is already covered by these functions. I am using by_tag because the current version of Google Authentication uses an input for the password field that has no ID but is a div/div/div child of the div with the "passwordIdentifier" id. I have also tried using XPath, but it does not seem to make a difference.
Also, it seems like Selenium is capable of finding the elements...at least when I check with print commands. So, locating the element seems not to be the problem. However, Selenium fails to send the keys from what I can see when I look at what happens in the Firefox browser, while Selenium is testing. What could be the issue? Why is Selenium struggling to send the password keys to the Authentication form?
Thanks to everyone in advance!

When you search for an input element on the Google registration page, you will find 8 WebElements. I think that is the origin of your problem.
I would use another locator, such as the XPath //input[@name='password'] or a By on the name instead of the tag name, as implemented below:
def get_element_by_name(self, element_name):
    return self.browser.wait.until(EC.presence_of_element_located(
        (By.NAME, element_name)))
and:
self.get_element_by_id("identifierId").send_keys(credentials["Email"])
self.get_button_by_id("identifierNext").click()
self.get_element_by_name('password').send_keys(credentials["Passwd"])
self.get_button_by_id("passwordNext").click()

Android Headless Browsing through WebView?

I am trying to create an Android app for a website that is not mine. It is a search engine for restaurants, and it has no API to work with. I want to browse the website headlessly, put the search query into the HTML form, and click the submit button, then parse the results and use them in my application code. I am finally asking after doing loads of research here: Question 1, Question 2, Question 3 and many more that I have looked at so far. All I know so far is that if I wanted to do the same on Google.com I would write:
myWebView.getSettings().setJavaScriptEnabled(true);
myWebView.loadUrl("http://www.google.com/");
myWebView.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        //Load HTML
        myWebView.loadUrl("javascript:document.getElementById('q') =" + "StackOverFlow" + "; document.getElementByName('btnK').click();");
    }
});
In the above code I am trying to enter the search term "StackOverFlow" and click the search button, but it's not working. Kindly help me fix this code or point me in the right direction.

It's been a while, but for the sake of letting others know: WebViews no longer use loadUrl to run JavaScript. Try using evaluateJavascript.
Since you've also mentioned headless browsing, I would recommend overriding shouldInterceptRequest in your client to redirect all unnecessary files (such as CSS, images, and perhaps JS, depending on the site) to a blank InputStream.

Call
myWebView.loadUrl("http://www.google.com/");
after setting the WebViewClient that overrides onPageFinished, not before.

Scraping advice on Crawling and info from Javascript onclick() function

I've finally found a thread for newbies on this subject, but I am no further forward in resolving this issue, partly because I'm a newbie at programming :)
The thread is:
Newbie: How to overcome Javascript "onclick" button to scrape web page?
I have a similar issue. The site I would like to scrape has lots of information on a lot of parts, but I would like to scrape only certain information for each part (company, part number, etc.). I have two issues:
1. How can I grab such information from this site without having to enter search terms? Should I use a crawler?
2. Most of the information for a part number is on its page, but the page also has a JavaScript 'onclick()' function that, when clicked, opens a small window with additional information I would also like to scrape. How can I scrape the information in this additional window?
I'm using import.io but have been advised to switch to Selenium and PhantomJS. I would welcome suggestions of other tools, as long as they are not too complicated (or come with instructions, which would be awesome!). I would really appreciate it if someone could help me overcome this issue or provide instructions. Thank you.

If you are a newbie and want to build a web crawler for data extraction, I would recommend Selenium. Note, however, that Selenium WebDriver is slower than Scrapy (a Python framework for writing web crawlers).
As you have been advised to use Selenium, I will focus only on Selenium with Python.
For your first issue: "How to grab such information from this site"
Suppose the website you want to extract data from is www.fundsupermart.co.in (chosen to show how to handle new-window pop-ups).
Using Selenium, you can start crawling by writing:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.fundsupermart.co.in/main/fundinfo/mutualfund-AXIS-BANKING-DEBT-FUND--GROWTH-AXS0077.html')
This opens a Firefox browser controlled by WebDriver and loads the page at the URL passed to get().
Now suppose you want to extract a table. You can locate it by tag name, XPath, or class name using the functions Selenium provides. For example, to extract the table under "Investment Objective":
For this I will:
right click -> inspect element -> find the appropriate tag in the inspector -> right click -> copy xpath
Here I found that the <tbody> tag was the one containing the table, so I right-clicked it and chose "copy xpath", which gave me the XPath of that tag:
xpath = "/html/body/table/tbody/tr[2]/td/table/tbody/tr[3]/td/table[2]/tbody/tr/td/table/tbody/tr[1]/td/font/table/tbody/tr[1]/td/table/tbody/tr[5]/td/table/tbody"
then, in the code add this line:
# find_element_by_xpath was removed in Selenium 4; this form works in Selenium 3 and 4
driver.find_element("xpath", xpath).text
You can extract other data from any website in the same way. Also see Selenium's documentation here.
For your second issue: "How can I scrape the information in this additional window?"
To click the link you can use the click() function provided by Selenium. Suppose I want to click the link "Click here for price history": I get its XPath (as before) and add the line:
driver.find_element("xpath", xpath).click()
This will open a new window.
To extract data from the new window you have to switch to it, which you can do by adding these lines:
windows = driver.window_handles
driver.switch_to.window(windows[1])
The WebDriver is now attached to the new window, so you can extract data as before. To close this window and switch back to the original one, add:
driver.close()
driver.switch_to.window(windows[0])
This is a very basic and naive approach to web crawling with Selenium. The tutorial given here is really good and will help you a lot.

Scrape sub elements in tables which cannot found in html but only in Chrome>F12>Element

I have tried to scrape the scoring/event times and the player names from http://en.gooooal.com/soccer/analysis/8401/events_840182.html. However, it does not work.
require(RCurl);
require(XML);
lnk = "http://en.gooooal.com/soccer/analysis/8401/events_840182.html";
doc = htmlTreeParse(lnk,useInternalNodes=TRUE);
x = unlist(xpathApply(doc, "//table/tr/td"));
The normal HTML source of the page doesn't show the details of the table contents.
The nodes can only be seen via
>>> open Chrome >>> press F12 >>> click Elements
Can someone help? Thanks a lot.

If you reload the page while Chrome developer tools are active, you can see that real data is fetched via XHR from http://en.gooooal.com/soccer/analysis/8401/goal_840182.js?GmFEjC8MND. This URL contains event id 840182 which you can scrape from the page. The part after ? seems to be just a way to circumvent browser caching. 8401, again, seems to be just first digits of the id.
So, you can load the original page, construct the second URL, and get real data from there.
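The URL construction described above can be sketched as follows. This is a minimal Python sketch (the question itself uses R); `data_url_for` is a hypothetical helper name, and the URL pattern is inferred only from the single example in this answer, so it may not hold for other events. The cache-busting part after `?` is left out on the assumption that the server ignores it.

```python
import re


def data_url_for(page_url):
    """Derive the XHR data URL from an events page URL.

    Assumes the pattern seen in the one example above:
    .../analysis/<prefix>/events_<id>.html -> .../analysis/<prefix>/goal_<id>.js
    """
    match = re.search(r"/analysis/(\d+)/events_(\d+)\.html$", page_url)
    if match is None:
        raise ValueError("unexpected page URL: " + page_url)
    prefix, event_id = match.groups()
    # Keep everything before "/analysis/..." and swap in the data file name.
    base = page_url[:match.start()]
    return f"{base}/analysis/{prefix}/goal_{event_id}.js"
```

The resulting URL can then be fetched with any HTTP client, and the response parsed according to whatever format the .js file turns out to contain.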
Anyway... In most cases it's a morally questionable practice to scrape data from web sites. I hope you know what you're doing :)

It sounds as if the content is inserted asynchronously using JavaScript, so using Curl won't help you there.
You'll need a headless browser that can actually parse and execute JavaScript (if you know Ruby, you could start by looking at the cucumber-selenium-chromedriver combo), or maybe just use your own browser with Greasemonkey/Tampermonkey to mimic a real user browsing the page and scrape the scores.

The contents are probably generated (by Javascript, like from an ajax call) after loading the (HTML) page. You can check that by loading the page in Chrome after disabling Javascript.
I don't think you can instruct RCurl to execute Javascript...

Python Selenium get javascript document

I have a webpage that contains some information that I am interested in. However, that information is generated by JavaScript.
If you do something like the following:
browser = webdriver.Chrome()
browser.set_window_size(1000, 1000)
browser.get('https://www.xxx.com') # cannot make the web public, sorry
print browser.page_source
It only prints out a few JavaScript functions and some headers, which don't contain the information I want (description of suppliers, etc.). So when I try to collect that information using Selenium, browser.find_element_by_class_name does not find the element I want either.
I tried the code below, assuming it would have the same effect as typing document in the JavaScript console, but evidently it does not.
result = browser.execute_script("document")
print result
and it returns NULL...
However, if I open the page in Chrome, right-click the element, and choose Inspect Element, I can see the populated source code. See the attached picture.
Also, I was inspired by this comment, which helps a lot.
I can open the JavaScript console in Chrome, and if I type in
document
I can see the complete HTML sitting there, which is exactly what I want. Is there a way to store the JS-populated source code using Selenium?
I've read some posts saying that it requires some security work to store the populated document on the client side.
I hope I have made myself clear, and I would appreciate any suggestion or correction.
(Note: I have zero experience with JS, so a detailed explanation would be greatly appreciated!)
