Scraping advice on Crawling and info from Javascript onclick() function


I've finally found a thread offering newbie help on this subject, but I am no further forward in resolving this issue, partly because I'm a newbie at programming :)
The thread is:
Newbie: How to overcome Javascript "onclick" button to scrape web page?
I have a similar issue. The site I would like to scrape has a lot of information on a lot of parts, but I only want to scrape certain information for each part (company, part number, etc.). I have two issues:
How can I grab this information from the site without having to enter search terms? Should I use a crawler?
Each part number has a page with most of the information, but the page also has a JavaScript 'onclick()' function that, when clicked, opens a small window displaying additional information I would also like to scrape. How can I scrape the information in this additional window?
I'm using import.io but have been advised to switch to Selenium and PhantomJS. I would welcome suggestions of other tools, as long as they are not too complicated (or come with instructions, which would be awesome!). I would really appreciate it if someone could help me overcome this issue or provide instructions. Thank you.

If you are a newbie and want to create a web crawler for data extraction, I would recommend Selenium. Note, however, that Selenium WebDriver is slower than Scrapy (a Python framework for writing web crawlers).
As you have been advised to use Selenium, I will focus only on Selenium with Python.
For your first issue: "How to grab such information from this site"
Suppose the website you want to extract data from is www.fundsupermart.co.in (chosen because it shows how to handle new pop-up windows).
Using Selenium, you can start crawling by writing:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.fundsupermart.co.in/main/fundinfo/mutualfund-AXIS-BANKING-DEBT-FUND--GROWTH-AXS0077.html')
This opens a Firefox browser controlled by WebDriver and loads the page at the link passed to the get() method.
Now suppose you want to extract a table. You can locate it by its tag name, XPath or class name using the functions Selenium provides. For example, to extract the table under "Investment Objective", I would do the following:
right click -> inspect element -> find the appropriate tag in the inspector panel -> right click -> copy xpath
Here I found that the <tbody> tag was the one containing the table, so I right-clicked it and chose copy xpath, which gave me the XPath of that tag:
xpath = '/html/body/table/tbody/tr[2]/td/table/tbody/tr[3]/td/table[2]/tbody/tr/td/table/tbody/tr[1]/td/font/table/tbody/tr[1]/td/table/tbody/tr[5]/td/table/tbody'
Then add these lines to the code:
from selenium.webdriver.common.by import By
# Recent Selenium 4 releases no longer provide find_element_by_xpath(); find_element(By.XPATH, ...) replaces it
driver.find_element(By.XPATH, xpath).text
You can extract other data from any website in the same way; also see Selenium's documentation here.
For your second issue: "How can I scrape the information in this additional window?"
To click the link, use the click() function provided by Selenium. Suppose I want to click the link "Click here for price history". I get its XPath (as before) and add the line:
driver.find_element(By.XPATH, xpath).click()
This opens a new window.
To extract data from the new window you have to switch to it, which you can do by adding these lines:
windows = driver.window_handles
# switch_to_window() is the old spelling; current Selenium uses switch_to.window()
driver.switch_to.window(windows[1])
The webdriver is now switched to the new window and I can extract data as before. To close this window and switch back to the original one, add:
driver.close()
driver.switch_to.window(windows[0])
This was a very basic and naive approach to web crawling using Selenium. The tutorial given here is really good and will help you a lot.
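The new-window step above indexes window_handles directly (windows[1]). As far as I know, the WebDriver specification does not promise any particular order for the handles, so a slightly more defensive sketch is to compare the handles before and after the click. The helper name new_window_handle is hypothetical and not part of Selenium; it only works on the plain lists that driver.window_handles returns.

```python
def new_window_handle(handles_before, handles_after):
    """Return the single handle that appeared after a click (hypothetical helper)."""
    opened = [h for h in handles_after if h not in handles_before]
    if len(opened) != 1:
        raise RuntimeError("expected exactly one new window, found %d" % len(opened))
    return opened[0]


# Intended use with a live Selenium session (needs a browser, so left as comments;
# assumes `from selenium.webdriver.common.by import By` and an existing `driver`):
#   before = driver.window_handles
#   driver.find_element(By.XPATH, xpath).click()
#   driver.switch_to.window(new_window_handle(before, driver.window_handles))
```

Raising when zero or several new windows appear is a design choice: it fails loudly instead of silently scraping the wrong window.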

Related

How to extract information from web page

I'm looking for a way to automatically extract information from a web page, more specifically an online game (https://www.virtualregatta.com/fr/offshore-jeu/).
In the game, I want to extract/copy the position of the boat. With Mozilla and its debug tools, I used the network debugger and I saw an HTML POST request containing what I want.
It seems that we receive as a response a json containing a structure with latitude/longitude.
This is perfect for me, but I want a more user-friendly way to get it and I need some advice. The problem is that I'm really a beginner in web development haha.
Is it possible to do this using a script ? (But I suppose it will be complicated to first log into the game)
Is it possible to create a basic Mozilla plugin which would be able to catch the request/response and copy the position to clipboard for me ?
anything else ?
EDIT:
I've tried using a Mozilla plugin, and I managed to add a listener on the POST request. I can see the request that fetches the boat information, but I can't find a way to get the JSON response in JS.
function logURL(responseDetails) {
  console.log(responseDetails);
}
browser.webRequest.onResponseStarted.addListener(
  logURL,
  {urls: ["*://*.virtualregatta.com/getboatinfos"]}
);

In Chrome I use Broomo for this purpose. It lets you add scripts to web pages; you can console.log the POST you found, and of course you can create functions and use the web page's backend.
For Firefox I found js-injector, but I haven't used it.
Update:
There is now a new extension for both browsers:
Chrome: ABC JS-CSS Injector
Firefox: ABC JS-CSS Injector

Sending keys via Selenium to Google Auth fails (using Python and Firefox)

I am following a Django Tutorial by Marina Mele, which is pretty good but a bit outdated since it was last updated in 2016, I believe. I am now trying the Selenium testing and ran into the problem that I can send my e-mail address via Selenium but not the password. My code is
self.get_element_by_id("identifierId").send_keys(credentials["Email"])
self.get_button_by_id("identifierNext").click()
self.get_element_by_tag('input').send_keys(credentials["Passwd"])
self.get_button_by_id("passwordNext").click()
with these functions being defined as:
def get_element_by_id(self, element_id):
    return self.browser.wait.until(EC.presence_of_element_located(
        (By.ID, element_id)))
def get_element_by_tag(self, element_tag):
    return self.browser.wait.until(EC.presence_of_element_located(
        (By.TAG_NAME, element_tag)))
def get_button_by_id(self, element_id):
    return self.browser.wait.until(EC.element_to_be_clickable(
        (By.ID, element_id)))
Most of the advice I have read on this issue revolves around waiting until the element appears. However, these functions already cover that. I am using the tag name because the current version of Google Authentication uses an input for the password field that has no ID but is a div/div/div child of the div with the "passwordIdentifier" id. I have also tried using XPath, but it does not seem to make a difference.
Also, Selenium seems to be capable of finding the elements, at least when I check with print commands, so locating the element does not appear to be the problem. However, when I watch the Firefox browser while Selenium is testing, it fails to send the keys. What could be the issue? Why is Selenium struggling to send the password keys to the authentication form?
Thanks to everyone in advance!

When you search for an input on the Google registration page, you will find 8 WebElements. I think this is the source of your problem.
I would use another locator, such as the XPath //input[@name='password'], or locate by name instead of by tag name, as implemented below:
def get_element_by_name(self, element_name):
    return self.browser.wait.until(EC.presence_of_element_located(
        (By.NAME, element_name)))
and:
self.get_element_by_id("identifierId").send_keys(credentials["Email"])
self.get_button_by_id("identifierNext").click()
self.get_element_by_name('password').send_keys(credentials["Passwd"])
self.get_button_by_id("passwordNext").click()
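To check the claim that the tag name is ambiguous, it can help to list every input the page contains before settling on a locator. The following is a minimal sketch, assuming elements is the list Selenium returns when all 'input' tags are looked up. describe_inputs and pick_by_name are hypothetical helper names; they rely only on WebElement.get_attribute() and WebElement.is_displayed().

```python
def describe_inputs(elements):
    """Summarise candidate <input> elements as (name, type, visible) tuples."""
    return [
        (el.get_attribute("name"), el.get_attribute("type"), el.is_displayed())
        for el in elements
    ]


def pick_by_name(elements, wanted_name):
    """Return the first element whose name attribute matches, or None."""
    for el in elements:
        if el.get_attribute("name") == wanted_name:
            return el
    return None
```

With a live browser this would be called on something like self.browser.find_elements(By.TAG_NAME, 'input'). Printing the result shows which of the inputs is the visible password field.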

C# script that detects an already opened website, clicks buttons and inserts data

I would like to write a simple script which, once I have already opened a site (I don't want the script to open it), presses two buttons and inserts data in the comment section after I press a key, e.g. 'g'. I am completely new to this kind of programming, so any help would be nice (including links to good tutorials).
webBrowser1.Document.GetElementById("User").SetAttribute("value", textBox1.Text);
webBrowser1.Document.GetElementById("but").InvokeMember("click");
I am aware of those 2 functions, which I will use, but how do I invoke them on an already opened page by pressing a key? (If that's important, the default browser is Opera.)

You should use something like Selenium (http://www.seleniumhq.org/), which is a browser automation framework.
Selenium scripts can be written in many languages (including C#) and can be run on a variety of browsers. There are even browser plugins for creating scripts by recording a macro, with no code required!
This is much more robust than using a browser control embedded in an app, as that is only a cut-down version of Internet Explorer, I believe.
This is a rough sample of Selenium in C#:
using OpenQA.Selenium;
using OpenQA.Selenium.IE;
using OpenQA.Selenium.Support.UI;
// Rough sketch: start the IE driver with protected-mode checks relaxed
var options = new InternetExplorerOptions();
options.IntroduceInstabilityByIgnoringProtectedModeSettings = true;
var driver = new InternetExplorerDriver(options);
driver.Navigate().GoToUrl("yourURL");
// Type into the "User" field, then click the "but" button
driver.FindElement(By.Id("User")).SendKeys("<your text>");
driver.FindElement(By.Id("but")).Click();

Scrape sub elements in tables which cannot be found in the HTML but only in Chrome > F12 > Elements

I have tried to scrape the scoring/event times and the player names from http://en.gooooal.com/soccer/analysis/8401/events_840182.html. However, it does not work.
require(RCurl);
require(XML);
lnk = "http://en.gooooal.com/soccer/analysis/8401/events_840182.html";
doc = htmlTreeParse(lnk,useInternalNodes=TRUE);
x = unlist(xpathApply(doc, "//table/tr/td"));
The normal HTML page doesn't show the details of the table contents.
The nodes can only be seen via
>>> open Chrome >>> press F12 >>> click Elements
Can someone help? Thanks a lot.

If you reload the page while Chrome developer tools are active, you can see that the real data is fetched via XHR from http://en.gooooal.com/soccer/analysis/8401/goal_840182.js?GmFEjC8MND. This URL contains the event id 840182, which you can scrape from the page. The part after ? seems to be just a way to circumvent browser caching. 8401, again, seems to be just the first digits of the id.
So, you can load the original page, construct the second URL, and get real data from there.
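The URL construction described above can be sketched in a few lines. This is only an illustration in Python (the question itself uses R), and it assumes the page and data URLs keep the pattern seen in this single example. goal_data_url is a hypothetical name. The cache-busting part after the ? is left out, on the assumption that the server ignores it.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def goal_data_url(page_url):
    """Derive the goal_<id>.js data URL from an events_<id>.html page URL."""
    parts = urlsplit(page_url)
    # Capture the directory part and the numeric event id from the path.
    match = re.search(r"^(.*/)events_(\d+)\.html$", parts.path)
    if match is None:
        raise ValueError("unexpected page URL: %r" % page_url)
    directory, event_id = match.groups()
    data_path = directory + "goal_" + event_id + ".js"
    return urlunsplit((parts.scheme, parts.netloc, data_path, "", ""))
```

The returned URL can then be fetched with any HTTP client; how to parse the .js payload depends on its contents, which are not shown here.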
Anyway... In most cases it's a morally questionable practice to scrape data from web sites. I hope you know what you're doing :)

It sounds as if the content is inserted asynchronously using JavaScript, so using curl won't help you there.
You'll need a headless browser which can actually parse and execute JavaScript (if you know Ruby you could start by looking at the cucumber-selenium-chromedriver combo), or maybe just use your browser with Greasemonkey/Tampermonkey to mimic a real user browsing while you scrape the scores.

The contents are probably generated (by Javascript, like from an ajax call) after loading the (HTML) page. You can check that by loading the page in Chrome after disabling Javascript.
I don't think you can instruct RCurl to execute Javascript...

Python Selenium get javascript document

I have a webpage that contains some information I am interested in. However, that information is generated by JavaScript.
If you do something like the following:
browser = webdriver.Chrome()
browser.set_window_size(1000, 1000)
browser.get('https://www.xxx.com') # cannot make the web public, sorry
print browser.page_source
It only prints out a few JavaScript functions and some headers, which don't contain the information I want (description of suppliers, etc.). So, when I try to collect that information using Selenium, browser.find_element_by_class_name does not find the element I want either.
I tried the code below, assuming it would have the same effect as typing document in the JavaScript console, but obviously it does not.
result = browser.execute_script("document")
print result
and it returns NULL...
However, if I open the page in Chrome, right-click the element and choose inspect element, I can see the populated source code. See the attached picture.
Also, I was inspired by this comment, which helped a lot.
I could open up the javascript console in Chrome, and if I type in
document
I could see the complete HTML sitting there, which is exactly what I want. I am wondering whether there is a way to store the JS-populated source code using Selenium.
I've read some posts saying that it requires some security work to store the populated document on the client side.
I hope I have made myself clear, and I would appreciate any suggestion or correction.
(Note: I have zero experience with JS, so a detailed explanation would be gratefully appreciated!)
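One likely reason the execute_script("document") attempt above returns nothing: Selenium runs the script as a function body, so a value only comes back if the script contains an explicit return. A minimal sketch, assuming browser is the webdriver.Chrome() instance from the question; rendered_html is a hypothetical helper name. If the content is loaded late, an explicit wait before calling it may also be needed.

```python
def rendered_html(browser):
    """Return the DOM as currently rendered, serialised to an HTML string."""
    # Without the leading `return`, execute_script() gives back None.
    return browser.execute_script("return document.documentElement.outerHTML;")
```

The string this returns reflects the page after JavaScript has modified it, so it can be saved to a file or passed to an HTML parser.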
