Making AJAX calls with Python - javascript

I am trying to get the value of the href attribute of an anchor element from a web page using a self-made Python script. However, all of the contents of the div element inside which the anchor element sits are received by the web page by using AJAX jQuery calls when the web page initially loads. The div element contains about 90% of the web page's content. How can I get the contents of the div element and then the value of the href attribute of the anchor element?
Later, after I get the value of the 'href' attribute, I want to get the contents of the web page that the link points to. Unfortunately, that call is also made with AJAX (jQuery). When I click the link in the web browser, the address in the address bar does not change, which means the received content is loaded into the same web page (inside the div element mentioned above).
After I get this, I will be using BeautifulSoup to parse the web page. So, how will I be able to do this with Python? What sort of modules do I need to use? And what is the general pseudo-code required?
By the way, the anchor element has an onclick event handler that triggers the corresponding jQuery function that loads the contents into the div element inside the web page.
Moreover, the anchor element has no id, in case that is needed for the solution.

You'd want to use a headless web browser. Take a look at Ghost.py or phantompy.
I just realized that phantompy is no longer being actively developed, so here's an example with Ghost.py.
I created an HTML page that is initially blank. Some JavaScript then adds a link to a div.
<html>
<body>
<div id="links">
<!-- Links go here -->
</div>
</body>
<script type="text/javascript">
var div = document.getElementById('links');
var link = document.createElement('a');
link.innerHTML = 'DuckDuckGo';
link.setAttribute('href', 'http://duckduckgo.com');
div.appendChild(link);
</script>
</html>
So if you were to scrape the page right now with Beautiful Soup using something like soup.find_all('a'), you wouldn't get any links, because there aren't any in the raw HTML.
But we can use a headless browser to render the content for us.
>>> from ghost import Ghost
>>> from bs4 import BeautifulSoup
>>>
>>> ghost = Ghost()
>>>
>>> ghost.open('http://localhost:8000')
>>>
>>> soup = BeautifulSoup(ghost.content)
>>> soup.find_all('a')
[<a href="http://duckduckgo.com">DuckDuckGo</a>]
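The difference between the raw markup and the rendered markup can be reproduced without a browser. The following is only a sketch: it uses the standard library's html.parser instead of Beautiful Soup, and it simulates the effect of the script with a plain string replacement (an assumption for illustration, not what a headless browser does internally). The names AnchorCollector and collect_hrefs are made up for this example.

```python
from html.parser import HTMLParser

# The test page from above, exactly as a plain HTTP client would receive it.
RAW_HTML = """<html>
<body>
<div id="links">
<!-- Links go here -->
</div>
</body>
<script type="text/javascript">
var div = document.getElementById('links');
var link = document.createElement('a');
link.innerHTML = 'DuckDuckGo';
link.setAttribute('href', 'http://duckduckgo.com');
div.appendChild(link);
</script>
</html>"""


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def collect_hrefs(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Raw markup: the script has not run, so there is no <a> element to find.
static_hrefs = collect_hrefs(RAW_HTML)

# Simulated rendering (assumption): stand-in for what the DOM looks like
# after the script has appended the link to the div.
rendered_html = RAW_HTML.replace(
    "<!-- Links go here -->",
    '<a href="http://duckduckgo.com">DuckDuckGo</a>',
)
rendered_hrefs = collect_hrefs(rendered_html)

print(static_hrefs)
print(rendered_hrefs)
```

The first list is empty and the second contains the link, which is why the content has to be rendered (or the underlying AJAX request replayed) before parsing.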
If you have to do something like clicking a link to change the content on the page, you could also do this. Check out the Sample use case on the project's website.

Related

Table showing on web page but no where to be found in HTML. Python, Selenium

This is a private website, so I will try my best to explain it.
I am scraping with Selenium, and the table is not in the HTML at all, although it is visible to the user and can be interacted with.
The page is reached through a navbar click, and its HTML is pretty simple: three or so divs and some scripts.
<head></head>
<body>
<div></div>
<div></div>
</body>
But the actual webpage has an entire dynamic table that opens reports in new windows when clicked on as well as a bunch of other stuff.
Why can I see everything on the webpage, but not in Selenium OR with Inspect Element? (unless I inspect search bar)
The missing HTML showed up when I clicked inspect element on the search bar, and from there I was able to view the HTML, but that isn't a solution for Selenium.
I don't know if this helps, but the IDs of the hidden HTML elements contain this: 'crmGrid_visualizationCompositeControl'
Thank you!
That makes sense: Selenium does NOT look for elements inside an iframe on its own. You must tell Selenium to switch into the iframe so that it can "see" the elements inside:
iframe = driver.find_element_by_xpath("//iframe[@name='Name_Of_Iframe']")
driver.switch_to.frame(iframe)
If the iframe has an id, locate it by id. You can also switch to other iframes by index:
iframe = driver.find_elements_by_tag_name('iframe')[2]
driver.switch_to.frame(iframe)
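When it is not known which iframe holds the element, one option is to try each top-level iframe in turn. The helper below is a sketch under assumptions, not part of Selenium: find_in_any_iframe is a made-up name, and it only relies on the driver offering find_elements(by, value), switch_to.frame(...) and switch_to.default_content(). The strings "tag name" and "css selector" are the values behind Selenium's By.TAG_NAME and By.CSS_SELECTOR. Nested iframes are not handled.

```python
def find_in_any_iframe(driver, css_selector):
    """Search each top-level iframe for elements matching css_selector.

    Returns (frame_index, matches) and leaves the driver switched into the
    frame where the matches were found. Returns (None, []) and switches
    back to the top-level document if nothing matches.
    """
    driver.switch_to.default_content()
    frame_count = len(driver.find_elements("tag name", "iframe"))

    for index in range(frame_count):
        # Go back to the top-level document and locate the frames again,
        # because references obtained before a switch may have gone stale.
        driver.switch_to.default_content()
        frames = driver.find_elements("tag name", "iframe")
        if index >= len(frames):
            break

        driver.switch_to.frame(frames[index])
        matches = driver.find_elements("css selector", css_selector)
        if matches:
            return index, matches

    driver.switch_to.default_content()
    return None, []
```

With a real driver this might be called as find_in_any_iframe(driver, "table tr"). If the iframe content is itself loaded late, an explicit wait would still be needed before searching.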

Access the table hidden inside an iframe using python and selenium

I'm trying to access announcements table from this web page. The table is in an iframe, the contents of which are not visible in the source when the page loads. The table only shows up in source if I do inspect element "twice". Once the table is visible I can execute the below javascript code via chrome console to access the table.
document.getElementsByTagName('html')[0].getElementsByTagName('body')[0].getElementsByTagName('section')[4].getElementsByTagName('article')[0].getElementsByTagName('div')[2].getElementsByTagName('announcement_data')[0].getElementsByTagName('table')[0].getElementsByTagName('tbody')[0].getElementsByTagName('tr')
However, I'm struggling to find a way to make the elements inside the iframe visible programmatically using python and selenium. I've tried to switch to the iframe but that has not helped.
seq = driver.find_elements_by_tag_name('iframe')
print("No of frames present in the web page are: ", len(seq))
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.frame(iframe)
table = driver.execute_script("return document.getElementsByTagName('html')[0].getElementsByTagName('body')[0].getElementsByTagName('section')[4].getElementsByTagName('article')[0].getElementsByTagName('div')[2].getElementsByTagName('announcement_data')[0].getElementsByTagName('table')[0].getElementsByTagName('tbody')[0].getElementsByTagName('tr');")
I get the following error if I try to run the above code in my jupyter notebook -
No of frames present in the web page are: 4
Error getting the length of the table: list index out of range
Any pointers to access the length and content of the table would be appreciated.
Thank you.
You should be able to scrape the iframe url programmatically and then load that up as a new page in selenium.
https://www.asx.com.au/asx/v2/statistics/todayAnns.do is the url for the iframe.
To do it programmatically try something like this:
# Two classes need a CSS selector; find_element_by_class_name accepts only a single class name
url = driver.find_element_by_css_selector('.external-form__iframe.default').get_attribute("src")
driver.get(url)
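The same idea can be sketched without locating the iframe element at all, by reading every iframe src from the page source and resolving it against the page URL. This is only an illustration: IframeSrcCollector and iframe_urls are made-up names, and PAGE_URL and PAGE_SOURCE are hypothetical stand-ins for what driver.current_url and driver.page_source would return. Only the iframe path comes from the URL quoted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, page_source):
    """Return absolute URLs for all iframes found in page_source."""
    collector = IframeSrcCollector()
    collector.feed(page_source)
    collector.close()
    # src may be relative, so resolve it against the URL of the parent page.
    return [urljoin(page_url, src) for src in collector.sources]


# Hypothetical parent page (assumption): real values would come from
# driver.current_url and driver.page_source.
PAGE_URL = "https://www.asx.com.au/hypothetical/parent-page.htm"
PAGE_SOURCE = """
<section>
  <iframe class="external-form__iframe default"
          src="/asx/v2/statistics/todayAnns.do"></iframe>
</section>
"""

urls = iframe_urls(PAGE_URL, PAGE_SOURCE)
print(urls)
```

Each resulting URL can then be opened with driver.get(...), as in the answer above. This only finds iframes that are present in the page source at the time it is read.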

cURL returns full HTML via AJAX - how to display to user?

I am building a Wordpress plugin to display a list of jobs to a user pulled from a recruiting platform API. On click of a job, a cURL request is sent to the API that pulls the job details as a full HTML page (the online job advertisement). I have everything working fine in terms of pulling the HTML, but I cannot figure out how to display it to the user.
How can I either:
Open a new tab to display the HTML pulled from the AJAX request
or
Open the full HTML within a div on the same page (i.e. a modal)
I would prefer to open the HTML in a new page, but don't know how to use jQuery to do this... Opening within the page in a modal is also fine, but as far as I understand iFrames (which I would rather not use anyway), you have to pass a url (and I simply have the full markup). Is there a way to display this within a page, perhaps using canvas? It carries its own links to CSS and Javascript that need to apply only within that sub-page.
EDIT:
As a clarification, I know that I can simply place the HTML within the page. My issue is that it is a full page. This means it has a <head> <body>, and its own CSS links. Just putting it in the page messes with the rest of the CSS and produces invalid HTML.
This is what I already have:
$.post(ajaxurl, data, function(response) {
$('.sg-jobad-full').html(response);
});
It places the response within the page perfectly well... but it messes up the page by introducing a <body> within a <body> and competing CSS.
If you put the response in a <div>, it will break the markup, because CSS/JS/meta definitions are not supposed to be placed inside the <body>.
If there is a way to retrieve the data without the surrounding markup, you could parse the data and render it with JavaScript, which is the method I'd prefer.
According to your comment, you should really go with iframes; all other methods will alter your markup so that <html> tags end up inside <html>, which is very bad practice.
Iframes can be styled just like a <div> element, and it is really not dirty to use iframes for the purpose you mentioned (the content is not loaded from a foreign host, it is not hidden, and it does not track).
<iframe class="job-offers-plugin" src=".../wp-content/plugins/yourplugin/getJobs.php">
</iframe>
Give it some styling, such as width, height, padding, margin, and overflow, and place it wherever you like.
This helps you with the database:
Using WPDB in standalone script?
Add permalinks to your plugin script:
http://teachingyou.net/wordpress/wordpress-how-to-create-custom-permalinks-to-use-in-your-plugins-the-easy-way/
If you get the full HTML in a jQuery.ajax(...) call, you can always just show it in a certain div on your page.
$.ajax({
success: function (resp){
// resp should be your html code
$("#div").html(resp);
}
});
You can use the $(selector).html(htmlCode) everywhere you want. You can insert it into modals, divs, new pages...
If you have to inject a whole HTML page you can:
strip the tags you don't need
or
use an iframe and write the content to that iframe: How to set HTML content into an iframe
iframes aren't my favourite thing... but it's a possibility

Can you select tags from another page?

Say I have a page a.html that contains all the jQuery code, and another page b.html that contains only HTML tags. Is it possible to do something like:
alert( $('a').fromhref('b.html').html() );
Basically, I want to select a tag from another page, and I want to avoid using iframes and HTTP requests.
You can access parts of another page with jQuery, provided both pages are on the same domain, using load(), but this can only be done with an http request (though if the page is cached, it might not be necessary), as a brief example:
$('#idOfElementOnPageA').load('http://example.com/pageB.html #idOfElementOnPageB');
This will load the html of the element with an id of idOfElementOnPageB into the element with the id of idOfElementOnPageA.
But please note, this in no way avoids making a call to the server, though it does allow you to retrieve elements from another page without using iframe elements in your page.
References:
load().
The filename should be script.js instead of a.html, then, use the script tag.
Basically, something like this (in b.html):
<script src="script.js"></script>
As long as script.js is in the same folder as b.html.

Get at entire web page contents using Javascript

Is there a way to load the entire contents of a page into a JavaScript variable? (The page is not properly formatted HTML.) That is, I want to store the page contents as a string in a variable. It only needs to work with Firefox.
I have some javascript running in one firefox tab that accesses the content of a page in another tab (the target window). Normally the content of the target is an HTML page so I can get at its content like this...
targetWindowName.document.getElementsByTagName("html")[0].innerHTML;
However, I have come across a page that is not proper HTML, so the above doesn't work.
(The actual content of this awkward page is JSON. I know this would be best loaded up with AJAX or something but I have a framework already setup to process HTML pages and it would be very handy if I can treat this particular (one off) page just like a regular HTML page.)
Thanks
Guess you can use:
win.document.documentElement.innerHTML
Read the file into a variable. Like you would any text file.
So, Page "A" has code that goes out and gets the HTML page contents and loads it into a variable.
