Python Selenium: get the JavaScript-generated document

I have a webpage that contains some information I am interested in; however, that information is generated by JavaScript.
If you do something like the following:
from selenium import webdriver

browser = webdriver.Chrome()
browser.set_window_size(1000, 1000)
browser.get('https://www.xxx.com') # cannot make the web public, sorry
print(browser.page_source)
it only prints out a few JavaScript functions and some headers, which don't contain the information I want (Description of Suppliers, etc.). So when I try to collect that information using Selenium, browser.find_element_by_class_name doesn't find the element I want either.
I tried the code below, assuming it would have the same effect as typing document into the JavaScript console, but obviously it doesn't:
result = browser.execute_script("document")
print(result)
and it returns None...
However, if I open the page in Chrome, right-click the element, and choose Inspect Element, I can see the populated source code.
Also, I was inspired by a comment that helped a lot:
I can open the JavaScript console in Chrome, and if I type in
document
I can see the complete HTML sitting there, which is exactly what I want. Is there a way to store the JS-populated source code using Selenium?
I've read some posts saying that storing the populated document on the client side requires some security work.
I hope I have made myself clear, and I appreciate any suggestions or corrections.
(Note: I have zero experience with JS, so a detailed explanation would be gratefully appreciated!)
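For what it's worth, a common pattern for this (a minimal sketch; the class name below is a placeholder, not from the original site): execute_script only returns a value to Python when the injected script contains a return statement, and JS-rendered content usually needs an explicit wait before it exists in the DOM.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.xxx.com')  # placeholder URL from the question

# Wait until a JS-populated element exists (the class name is a placeholder)
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'supplier-description'))
)

# execute_script returns None unless the injected script itself returns a value
html = browser.execute_script('return document.documentElement.outerHTML')
print(html)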

Related

How to trigger consecutive events in a single browser console in one go?

For example, I would like to control multiple "online" webpages, such as google.com, with a single continuous sequence of code in the browser console (Ctrl+Shift+J on Windows), for consecutive actions such as clicking button A on page A, jumping to page B automatically, then clicking button B on page B.
(As it is, I must type further code into the fresh console once I jump to another webpage, because the HTML changes.)
For example:
// in console 1
document.getElementById("id1").click()
// id1 is inside web1
window.open("link_of_new_webpage", "_self")
// I understand that this will just be a new page with a new console,
// but this is the kind of thing I was hoping for.
// Below, id2 is inside web2, in console 2:
document.getElementById("id2").click()
In conclusion, I want a single console like the one shown below:
document.getElementById("id1").click()
window.open("link_of_new_webpage", "_self")
document.getElementById("id2").click()
// the code above runs as one continuous sequence!
// not like the code below:
document.getElementById("id1").click() // in console 1, in web1
////////////////////////////////////////////////////////////////
document.getElementById("id2").click() // in console 2, in web2
Please help me, or tell me that this cannot be done.
(In fact, I'm a beginner in JavaScript and HTML.)
I promise I will use these tools the right way. Thanks beyond description!!
If you're trying to do this on a client-facing site, I'm afraid what you're asking to do is impossible for a myriad of security reasons. If you're just looking to run these executions locally, though, you're in luck. Headless browsers are commonly used in testing as well as web scraping, and they allow you to program commands as if a user were interacting directly with the browser. If you're looking to stay within the confines of JS, a popular option is Puppeteer.
Best of luck
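If JS isn't a hard requirement, the same flow can also be scripted from Python with Selenium instead of Puppeteer. A minimal sketch (the URLs and the id1/id2 values are the placeholders from the question, not a real site):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.example.com/web1')  # placeholder for web1

# Click button A on the first page (id1 is the placeholder from the question)
driver.find_element_by_id('id1').click()

# Navigate to the second page in the same driver session
driver.get('https://www.example.com/web2')  # placeholder for web2

# Click button B on the second page (id2 is the placeholder)
driver.find_element_by_id('id2').click()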

How to extract information from a web page

I'm looking for a way to automatically extract information from a web page, more specifically an online game (https://www.virtualregatta.com/fr/offshore-jeu/).
In the game, I want to extract/copy the position of the boat. Using Firefox's debug tools, I watched the network debugger and saw an HTTP POST request containing what I want.
The response seems to be JSON containing a structure with latitude/longitude.
This is perfect for me, but I want a more user-friendly way to get it, and I need advice. The problem is that I'm really a beginner in web development.
Is it possible to do this using a script? (Though I suppose it will be complicated to log into the game first.)
Is it possible to create a basic Firefox extension that catches the request/response and copies the position to the clipboard for me?
Anything else?
EDIT:
I've tried a Firefox extension and managed to add a listener on the POST request. I can see the request that fetches the boat information, but I can't find a way to get the JSON response in JS.
function logURL(responseDetails) {
  console.log(responseDetails);
}

browser.webRequest.onResponseStarted.addListener(
  logURL,
  {urls: ["*://*.virtualregatta.com/getboatinfos"]}
);
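For what it's worth, the "script" route from the question can also work by replaying the POST outside the browser. A minimal Python sketch (the endpoint path, payload, and cookie names below are assumptions, not the game's real API):
import requests

# All values below are placeholders: copy the real URL, payload, and
# session cookie from the request seen in the network debugger.
url = 'https://www.virtualregatta.com/getboatinfos'
payload = {'boat_id': '12345'}                 # assumed parameter name
cookies = {'session': 'YOUR-SESSION-COOKIE'}   # assumed cookie name

response = requests.post(url, data=payload, cookies=cookies)
data = response.json()

# Key names are guesses based on the observed latitude/longitude structure
print(data.get('latitude'), data.get('longitude'))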
In Chrome I use Broomo for this purpose. It helps you add scripts to web pages: you can console.log the POST you found, and of course you can create functions and use the webpage's backend.
In Firefox I found js-injector, but I haven't used it before.
Update:
There is now a new extension for both browsers:
Chrome: ABC JS-CSS Injector
Firefox: ABC JS-CSS Injector

Scraping advice: crawling and getting info from a JavaScript onclick() function

I've finally found a thread with newbie help on this subject, but I have made no progress resolving this issue, partly because I'm a newbie at programming :)
The thread is:
Newbie: How to overcome Javascript "onclick" button to scrape web page?
I have a similar issue. The site I would like to scrape has information on a lot of parts, but I would like to scrape only certain part information (company, part number, etc.). I have two issues:
How do I grab this information from the site without having to enter search terms? Use a crawler?
A part's page has most of the information, but there is an on-page JavaScript onclick() function that, when clicked, opens a small window displaying additional information that I would also like to scrape. How can I scrape the information in this additional window?
I'm using import.io but have been advised to switch to Selenium and PhantomJS. I would welcome suggestions for other tools that are not too complicated (with instructions provided, which would be awesome!). I would really appreciate it if someone could help me overcome this issue or provide instructions. Thank you.
If you are a newbie and want to create a web crawler for data extraction, I would recommend Selenium; note, however, that Selenium WebDriver is slower than Scrapy (a Python framework for writing web crawlers).
Since you have been advised to use Selenium, I will focus only on Selenium with Python.
For your first issue, "How to grab such information from this site":
Suppose the website you want to extract data from is www.fundsupermart.co.in (chosen to show how to handle new-window pop-ups). Using Selenium, you can crawl it by writing:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.fundsupermart.co.in/main/fundinfo/mutualfund-AXIS-BANKING-DEBT-FUND--GROWTH-AXS0077.html')
This will open a Firefox browser under WebDriver control and load the page at the link passed to the get() method.
Now suppose you want to extract a table: you can do that by tag name, XPath, or class name, using the functions Selenium provides. For example, to extract the table under "Investment Objective", I would:
right click -> Inspect Element -> find the appropriate tag in the console -> right click -> Copy XPath
Here I found that the <tbody> tag was the one from which I could extract the table, so I right-clicked it and chose Copy XPath, which gave me the XPath of that tag:
xpath = "/html/body/table/tbody/tr[2]/td/table/tbody/tr[3]/td/table[2]/tbody/tr/td/table/tbody/tr[1]/td/font/table/tbody/tr[1]/td/table/tbody/tr[5]/td/table/tbody"
Then, in the code, add this line:
driver.find_element_by_xpath(xpath).text
Similarly, you can extract other data from any website; see also Selenium's documentation.
For your second issue, "How can I scrape the information in this additional window?":
To click the link, you can use the click() function Selenium provides. Suppose I want to click the link "Click here for price history"; I get its XPath (as done previously) and add the line:
driver.find_element_by_xpath(xpath).click()
It will open a new window.
Now, to extract data from the new window, you have to switch to it, which you can do by adding these lines:
windows = driver.window_handles
driver.switch_to_window(windows[1])
By doing this I have switched the WebDriver to the new window, and now I can extract data as I did earlier. To close this window and switch back to the original one, just add:
driver.close()
driver.switch_to_window(windows[0])
This was a very basic and naive approach to web crawling with Selenium. The tutorial linked above is really good and will help you a lot.
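Putting those steps together, here is a self-contained sketch of the whole flow (the XPaths below are placeholders; substitute the ones you copied from the inspector):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.fundsupermart.co.in/main/fundinfo/mutualfund-AXIS-BANKING-DEBT-FUND--GROWTH-AXS0077.html')

# Extract the table text (placeholder XPath)
table_xpath = '//table'
print(driver.find_element_by_xpath(table_xpath).text)

# Click the link that opens the price-history pop-up (placeholder XPath)
driver.find_element_by_xpath('//a[contains(text(), "price history")]').click()

# Switch to the pop-up, scrape it, then switch back and clean up
windows = driver.window_handles
driver.switch_to_window(windows[1])
print(driver.page_source)
driver.close()
driver.switch_to_window(windows[0])
driver.quit()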

Scrape sub-elements in tables which cannot be found in the HTML, only via Chrome > F12 > Elements

I have tried to scrape the scoring/event times and the player names from http://en.gooooal.com/soccer/analysis/8401/events_840182.html, but it doesn't work:
require(RCurl);
require(XML);
lnk = "http://en.gooooal.com/soccer/analysis/8401/events_840182.html";
doc = htmlTreeParse(lnk,useInternalNodes=TRUE);
x = unlist(xpathApply(doc, "//table/tr/td"));
The plain HTML page doesn't show the details of the table contents; the nodes can only be reached via: open Chrome > press F12 > click Elements.
Can someone help? Thanks a lot.
If you reload the page while the Chrome developer tools are active, you can see that the real data is fetched via XHR from http://en.gooooal.com/soccer/analysis/8401/goal_840182.js?GmFEjC8MND. This URL contains the event id 840182, which you can scrape from the page. The part after ? seems to be just a way to circumvent browser caching, and 8401, again, seems to be just the first digits of the id.
So, you can load the original page, construct the second URL, and get real data from there.
Anyway... in most cases it's a morally questionable practice to scrape data from websites. I hope you know what you're doing :)
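Expressed as a quick sketch (in Python rather than the question's R; treating the ?... suffix as an ignorable cache-buster is an assumption):
import requests

event_id = '840182'  # scraped from the events page URL

# The data URL observed in the network tab; the directory appears to be
# the first four digits of the event id, and the ?... suffix is assumed
# to be only a cache-buster, so it is omitted here.
url = 'http://en.gooooal.com/soccer/analysis/%s/goal_%s.js' % (event_id[:4], event_id)

response = requests.get(url)
# The payload is a .js file, so it may need custom parsing rather than
# a plain JSON parser.
print(response.text)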
It sounds as if the content is inserted asynchronously using JavaScript, so curl won't help you there.
You'll need a headless browser that can actually parse and execute JavaScript (if you know Ruby, you could start looking at the cucumber-selenium-chromedriver combo), or maybe just use your browser with Greasemonkey/Tampermonkey to mimic a real user browsing the site and scrape the scores.
The contents are probably generated (by JavaScript, e.g. from an AJAX call) after the (HTML) page loads. You can check that by loading the page in Chrome with JavaScript disabled.
I don't think you can instruct RCurl to execute JavaScript...

FlashFirebug: getting data from the ActionScript 3 console

I need to capture data (text) from Flash in a web page.
The data is always changing (weather data) and should be exported to a text file so I can manipulate it.
My first approach was to use a web sniffer like Fiddler or Wireshark. I couldn't get the data from either, because it is embedded in Flash. I used Fiddler as a man-in-the-middle with Wireshark deciphering the data (with the private key from the site's certificate), but it didn't work.
After that I tried FlashFirebug Pro (the Pro version allows running AS3 commands in the console). This add-on loads the DOM tree and refreshes it. After selecting the desired element on the page with the inspector (the left panel shows the instance and its position in the DOM), I have access to the instance's properties (the only one I need is "html-text" in the right panel).
My problem with this last approach is that it cannot communicate with the local file system: if I run trace(this.text); in the console, it shows the text value, but only in the console. The only way I could think of to write to a file on the hard drive was to throw an error into the log file, but I couldn't manage that either.
Does anyone have any idea how to work with FlashFirebug, or another approach to this?
Regards,
If you want to work with the local filesystem, use Adobe AIR.
If you can't, try to work around the browser's sandbox by using JavaScript as a bridge to some browser plugin/add-on that gives you access to local processes and the filesystem. To call JavaScript from Flash, the ExternalInterface class is your friend.
