Access the table hidden inside an iframe using Python and Selenium

I'm trying to access the announcements table on this web page. The table is in an iframe, and its contents are not visible in the source when the page loads. The table only shows up in the source if I use Inspect Element twice. Once the table is visible, I can execute the JavaScript code below via the Chrome console to access the table.
document.getElementsByTagName('html')[0].getElementsByTagName('body')[0].getElementsByTagName('section')[4].getElementsByTagName('article')[0].getElementsByTagName('div')[2].getElementsByTagName('announcement_data')[0].getElementsByTagName('table')[0].getElementsByTagName('tbody')[0].getElementsByTagName('tr')
However, I'm struggling to find a way to make the elements inside the iframe visible programmatically using python and selenium. I've tried to switch to the iframe but that has not helped.
seq = driver.find_elements_by_tag_name('iframe')
print("No of frames present in the web page are: ", len(seq))
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.frame(iframe)
table = driver.execute_script("return document.getElementsByTagName('html')[0].getElementsByTagName('body')[0].getElementsByTagName('section')[4].getElementsByTagName('article')[0].getElementsByTagName('div')[2].getElementsByTagName('announcement_data')[0].getElementsByTagName('table')[0].getElementsByTagName('tbody')[0].getElementsByTagName('tr');")
I get the following error when I run the above code in my Jupyter notebook:
No of frames present in the web page are: 4
Error getting the length of the table: list index out of range
Any pointers to access the length and content of the table would be appreciated.
Thank you.

You should be able to scrape the iframe url programmatically and then load that up as a new page in selenium.
https://www.asx.com.au/asx/v2/statistics/todayAnns.do is the url for the iframe.
To do it programmatically try something like this:
Note that find_element_by_class_name does not accept compound class names (the space in 'external-form__iframe default' makes it two classes), so use a CSS selector instead:
url = driver.find_element_by_css_selector('iframe.external-form__iframe.default').get_attribute("src")
driver.get(url)
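The same src extraction can also be done without a browser at all, by parsing the outer page's HTML directly. A minimal sketch with the stdlib's html.parser, assuming the iframe carries the 'external-form__iframe' class mentioned above (the real page's markup may differ):

```python
# Sketch: pull an iframe's src out of raw page HTML without a browser.
# The class name 'external-form__iframe' is taken from the answer above
# and may not match the live page exactly; treat this as illustrative.
from html.parser import HTMLParser


class IframeSrcFinder(HTMLParser):
    """Collects the src of every <iframe> whose class list contains a token."""

    def __init__(self, class_token):
        super().__init__()
        self.class_token = class_token
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attrs = dict(attrs)
        if self.class_token in attrs.get("class", "").split():
            self.sources.append(attrs.get("src"))


html = """
<html><body>
  <iframe class="external-form__iframe default"
          src="https://www.asx.com.au/asx/v2/statistics/todayAnns.do"></iframe>
</body></html>
"""

finder = IframeSrcFinder("external-form__iframe")
finder.feed(html)
print(finder.sources[0])
```

Once you have the src, driver.get(url) loads the table page directly and no frame switching is needed.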


Table showing on web page but nowhere to be found in HTML. Python, Selenium

This is a private website, so I will try my best to explain it.
I'm scraping with Selenium, and the table is not in the HTML at all, although it is visible to the user and interactive.
The HTML code on the page is accessed through a navbar click, and then the HTML is pretty simple. It has three or so divs and some scripts.
<head></head>
<body>
<div></div>
<div></div>
</body>
But the actual webpage has an entire dynamic table that opens reports in new windows when clicked on as well as a bunch of other stuff.
Why can I see everything on the webpage, but not in Selenium or with Inspect Element (unless I inspect the search bar)?
The missing HTML showed up when I clicked inspect element on the search bar, and from there I was able to view the HTML, but that isn't a solution for Selenium.
I don't know if this helps, but the IDs of the hidden HTML contain 'crmGrid_visualizationCompositeControl'.
Thank you!
Makes sense: Selenium does not look inside an iframe by default. You must tell Selenium to switch into the iframe so it will be able to "see" the elements inside:
iframe = driver.find_element_by_xpath("//iframe[@name='Name_Of_Iframe']")
driver.switch_to.frame(iframe)
If the iframe has an id, find it by id. You can also switch to other iframes by index:
iframe = driver.find_elements_by_tag_name('iframe')[2]
driver.switch_to.frame(iframe)
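As a side note, XPath attribute filters use `@` (as in `[@name='...']`). The stdlib's ElementTree supports a small XPath subset, so the syntax can be sanity-checked without a browser; `Name_Of_Iframe` below is a placeholder, as above:

```python
# Sanity-check the [@name='...'] XPath syntax with the stdlib, no browser needed.
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body>"
    "<iframe name='Name_Of_Iframe' src='/inner.html'/>"
    "<iframe name='other' src='/other.html'/>"
    "</body></html>"
)

# ElementTree supports a limited XPath subset, including [@attr='value']
iframe = page.find(".//iframe[@name='Name_Of_Iframe']")
print(iframe.get("src"))
```

Selenium's XPath engine is full XPath 1.0, so anything that works here works there too.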

Extracting data from a web page and inserting into html code of another webpage

I have multiple webpages that only contain a single number in the HTML code, nothing else. These get updated from an IoT device I made. I also have a main data webpage that is meant to display all of the data on one page. However, I cannot figure out how to extract the numbers from the data webpages into the HTML code of the main data webpage.
The only workaround I have found is by using iframes, but this seems very clumsy, and not "pleasing for the eyes". Is there any way to do this using HTML, or maybe javascript? I am very new to web development. The only restriction is that I cannot change how the data is uploaded from the IoT device, so it has to be extracted from the individual data webpages that contain only numbers.
Thanks in advance :)
Once you find the JavaScript selector for the innerHTML of your iframe, you can get the value out of the iframe for use within the parent window:
//Getting a value from the first iframe
//Assumes the page in the iframe only contains a single plaintext value
myValue = document.querySelectorAll('iframe')[0].contentWindow.document.body.innerText;
To make it more appealing you could easily hide the iFrame:
//Hiding the embedded iframes
var frames = document.querySelectorAll('iframe');
for (var i = 0; i < frames.length; i++) {
    frames[i].style.visibility = "hidden";
}
What you need to do is:
1. Fetch the HTML pages
2. Parse the data from each page
3. Render it to your current page
This is not a trivial task, mostly because of the parsing, and because handling multiple fetches that may take varying amounts of time to complete is fiddly. There are also browser security restrictions in case the files are on different domains.
It would be better if the individual files were .json files rather than HTML files.
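The parse step above is simple when each page's body is literally just a number. A sketch of that extraction, shown in Python in case you would rather aggregate the pages server-side instead of in the browser (the sample HTML strings are made up):

```python
# Sketch: parse a single numeric value out of a page whose body contains
# nothing but that number. The sample pages below are hypothetical.
import re


def extract_number(html):
    """Strip tags, then parse whatever text remains as a float."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude, but fine for number-only pages
    return float(text.strip())


pages = [
    "<html><body>42</body></html>",
    "<html><body> 3.5 </body></html>",
]

values = [extract_number(p) for p in pages]
print(values)
```

The same strip-tags-then-convert idea is what the browser's innerText access gives you for free.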
Here is a codesandbox with a simple example: https://codesandbox.io/s/sleepy-borg-kuye6?file=/src/index.js
Here is one tutorial:
https://gomakethings.com/getting-html-with-fetch-in-vanilla-js/

Making AJAX calls with Python

I am trying to get the value of the href attribute of an anchor element from a web page using a self-made Python script. However, all of the contents of the div element inside which the anchor element sits are loaded by the web page via AJAX (jQuery) calls when the page initially loads. The div element contains about 90% of the web page's content. How can I get the contents of the div element and then the value of the href attribute of the anchor element?
Later, after I get the value of the 'href' attribute, I want to get the contents of the web page that the link points to. But unfortunately, that call is also made with AJAX (jQuery). When I click on this in the web browser, the address of the web page does not change in the address bar, which means that the contents of the web page that is received is loaded into the same web page (inside the above mentioned div) element.
After I get this, I will be using BeautifulSoup to parse the web page. So, how will I be able to do this with Python? What sort of modules do I need to use? And what is the general pseudo-code required?
By the way, the anchor element has an onclick event handler that triggers the corresponding jQuery function that loads the contents into the div element inside the web page.
Moreover, the anchor element is not associated with an id, if it's needed for the solution.
You'd want to use a headless web browser. Take a look at Ghost.py or phantompy.
I just realized that phantompy is no longer being actively developed, so here's an example with Ghost.py.
I created an HTML page which is blank. Some JavaScript adds a couple links to a div.
<html>
<body>
<div id="links">
<!-- Links go here -->
</div>
</body>
<script type="text/javascript">
var div = document.getElementById('links');
var link = document.createElement('a');
link.innerHTML = 'DuckDuckGo';
link.setAttribute('href', 'http://duckduckgo.com');
div.appendChild(link);
</script>
</html>
So if you were to scrape the page right now with Beautiful Soup using something like soup.find_all('a'), you wouldn't get any links, because there aren't any in the static HTML.
But we can use a headless browser to render the content for us.
>>> from ghost import Ghost
>>> from bs4 import BeautifulSoup
>>>
>>> ghost = Ghost()
>>>
>>> ghost.open('http://localhost:8000')
>>>
>>> soup = BeautifulSoup(ghost.content)
>>> soup.find_all('a')
[<a href="http://duckduckgo.com">DuckDuckGo</a>]
If you have to do something like clicking a link to change the content on the page, you could also do this. Check out the Sample use case on the project's website.
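Once the headless browser has rendered the page, pulling out the href values is plain HTML parsing. A stdlib-only sketch of what soup.find_all('a') is doing here, useful if BeautifulSoup isn't available (the rendered HTML string below stands in for ghost.content):

```python
# Sketch: collect every anchor's href from already-rendered HTML using
# only the stdlib. The 'rendered' string stands in for the headless
# browser's output.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


rendered = '<div id="links"><a href="http://duckduckgo.com">DuckDuckGo</a></div>'
collector = LinkCollector()
collector.feed(rendered)
print(collector.hrefs)
```

The key point is unchanged: the headless browser does the rendering, and any parser can take it from there.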

Scraping dynamic page content phantomjs

My company is using a website that hosts all of our FAQ and customer questions. We have plans to go through and wipe out all of the old data and input new data, and the service does not have a backup or archive option for questions we don't want to appear anymore.
I've gone through and tried to scrape the site using Perl and Mechanize, but I'm missing the customer comments on the page as they are loaded through AJAX. I have looked at PhantomJS and can get the pages to save to an image using an example page; however, I'd like to get a full-page HTML dump of the page, but can't figure out how. I used this example code on our site:
var page = new WebPage();
page.open('http://espn.go.com/nfl/', function (status) {
    //once page loaded, include jQuery from cdn
    page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        //once jQuery loaded, run some code
        //inserts our custom text into the page
        page.evaluate(function(){ $("h2").html('Many NFL Players Scared that Chad Moon Will Enter League'); });
        //take screenshot and exit
        page.render('espn.png');
        phantom.exit();
    });
});
Is there a way using PhantomJS that I can just get a full page dump of the data, similar to doing View Source in Chrome? I can do this with Perl + Mechanize, but don't see how to do it using PhantomJS.
You can use page.content to get the full HTML DOM
I would recommend pjscrape http://nrabinowitz.github.com/pjscrape/ if you want to scrape using PhantomJS

jQuery Copy text inside iframe to main document?

Hopefully someone here can help me with this challenge!
I have a parent page which is the checkout page for an e-commerce site. It runs on ZenCart, and for every order placed a table row is generated. I've set up an .each() function which generates an iframe for an artwork uploader for each TR (order) found. The uploader works and the correct number of instances are being generated.
I wanted to avoid an iFrame, but the uploader script I purchased will not permit me to load it directly into the zencart page template, or via AJAX (tried both). There's some kind of major resource/path situation going on when doing it this way... so I've resorted to iframes.
I'm able to call JS on file-upload-complete. At that point I'm trying to capture the name of the file that was just uploaded and place it inside the TR. The problem I'm running into is a permission error when trying to access the iframe contents.
I've tried everything I've come across on this site and many others, so I believe it isn't a problem with the selectors/frame selection. Firebug is telling me that I'm getting permission errors when trying to access the iframe, yet they're both on the same domain and the src is being set by a relative path.
I'm completely lost, any suggestions would be appreciated! Thanks!
www.prothings.com/store
Add items to the cart and go to checkout.....
When you want to access the main window or window.document from inside an iframe, you should change the context by using window.parent.
For example, when you want to append some text to a div, you should do something like this:
window.parent.$('#theDiv').text('the text');
Note that there is a bug in IE when you run the code from inside the iframe and remove the iframe in the meantime: IE can't run the code on the fly.
