Retrieve the source of a dynamic website using python (bypassing onclick) - javascript

I wish to retrieve the source of a website that is dynamically generated upon clicking a link. The link itself is as below:
<a onclick="function(); return false" href="#">Link</a>
This prevents me from directly requesting, with urllib/urllib2, a URL that would give me the dynamically generated page.
How would one retrieve, via Python, the HTML source of the page that is generated by the above function? Is there a method to bypass the return false" href="#" part, or the onclick entirely, and get the actual URL?
If there is another way to generate the page from the abstract link above, so that one can fetch it with urllib in Python, please refer me to it.
EDIT:
I generalized the code seen above; however, I've been told that one has to reverse engineer the specific JavaScript to be able to use it.
Link to .js - http://a.quizlet.com/j/english/create_setku80j8.js
Link to site with link:
<a onclick="importText(); return false" href="#">Bulk-import data</a>
Actual URL of site: http://quizlet.com/create_set/
Beautified JS of relevant .js above: http://pastie.org/737042

You will probably have to reverse engineer the JavaScript to work out what is going on.
Can you provide the site and the link in question?

I don't immediately see any content-generation or link-following code in that script; all importText does is toggle whether a few divs are shown.
If you want to study the calls the webapp makes to do a particular action, in order to reproduce them from a bot, you're probably best off looking at the HTTP requests (form submissions and AJAX calls) that the browser makes whilst performing that action. You can use Firebug's ‘Net’ panel to study this for Firefox, or Fiddler for IE.
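For example, once you have captured the request in the Net panel, replaying it with urllib2 is usually straightforward. A minimal sketch (the endpoint and form fields here are placeholders; use the exact URL and parameters the panel shows):
import urllib
import urllib2

url = 'http://quizlet.com/create_set/'                 # assumed endpoint; check the Net panel
data = urllib.urlencode({'some_field': 'some_value'})  # hypothetical form fields
req = urllib2.Request(url, data)                       # supplying data makes this a POST
req.add_header('User-Agent', 'Mozilla/5.0')            # some sites reject the default urllib2 UA
html = urllib2.urlopen(req).read()
print html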

Related

Crawl a webpage and grab all dynamic Javascript links

I am using C# to crawl a website. All works fine, except it can't detect dynamic JS links. As an example, a page with over 100 products may span several pages, and the "Next Page" / "Prev Page" links may be dynamic JS URLs that are generated on click. Typical JS code is below:
<a href="javascript:PageURL('cf-233--televisions.aspx','?',2);">></a>
Is there any way of getting the actual link of the above href while collecting URLs on the page?
I am using Html Agility Pack but am open to any other technology. I have googled this many times, but there seems to be no solution yet.
Thanks.
Have you tried evaluating the JavaScript to get the actual hrefs? This might be helpful: Parsing HTML to get script variable value
Or maybe you should check what the PageURL function does (just open the website in a browser and type PageURL, without parentheses, in its console; it will show you the code of the function) and rewrite it in C#.
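The same idea, sketched in Python for brevity (translating it to C# with Html Agility Pack is mechanical): extract the PageURL arguments from the javascript: href with a regex and rebuild the target URL. The reconstruction at the end is an assumption; check the real PageURL body in the console first.
import re

href = "javascript:PageURL('cf-233--televisions.aspx','?',2);"
m = re.search(r"PageURL\('([^']+)','([^']*)',(\d+)\)", href)
if m:
    page, sep, num = m.group(1), m.group(2), int(m.group(3))
    # hypothetical: assumes PageURL appends something like ?page=N
    url = "%s%spage=%d" % (page, sep, num)
    print url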
AbotX allows you to render the JavaScript on the page. It's a powerful web crawler with advanced features.

PHP HttpRequest to create a web page - how to handle long response times?

I am currently using JavaScript and XMLHttpRequest on a static HTML page to create a view of a record in Zotero. This works nicely, except for one thing: the page's HTML title.
I can of course also change the <title>...</title> tag, but if someone wants to post the view to, for example, Facebook, the static title of the web page will be shown there.
I can't think of any way to fix this with just a static page with javascript. I believe I need a dynamically created page from a server that does something similar to XMLHttpRequest.
For PHP there is HTTPRequest. Now to the problem. In the javascript version I can use asynchronous calls. With PHP I think I need synchronous calls. Is that something to worry about?
Is there perhaps some other way to handle this that I am not aware of?
UPDATE: It looks like those trying to answer are not at all familiar with Zotero. I should have been more clear. Zotero is a reference db located at http://zotero.org/. It has an API that can be used through XMLHttpRequest (which is what I said above).
Now I can not use that in my scenario which I described above. So I want to call the Zotero server from my server instead. (Through PHP or something else.)
(If you are not familiar with the concepts it might be hard to understand and answer the question. Of course.)
UPDATE 2: For those interested in how Facebook scrapes a URL you post there, please test here: https://developers.facebook.com/tools/debug
As you can see by testing there, no JavaScript is run.
Sorry, I'm not sure I understand what you are trying to ask; do you just want to change the page's title?
Why not use javascript?
document.title = newTitle
Facebook expects the title (or the OpenGraph title tags) to be present when it fetches the page. It won't execute any JavaScript for you to fill in the blanks.
A cool workaround would be to detect the Facebook scraper with PHP by parsing the User Agent string, and serving a version of the page with the information already filled in by PHP instead of JavaScript.
As far as I know, the Facebook scraper uses this header for User Agent: "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
You can check to see if part of that string is present in the header and load the page accordingly.
if (strpos($_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit') !== false)
{
    // synchronously load the title and opengraph tags here
}
else
{
    // load the page normally
}

Scrape JavaScript download links from ASP website

I am trying to download all the files from this website for backup and mirroring; however, I don't know how to go about parsing the JavaScript links correctly.
I need to organize all the downloads in the same way in named folders. For example on the first one I would have a folder named "DAP-1150" and inside that would be a folder named "DAP-1150 A1 FW v1.10" with the file "DAP1150A1_FW110b04_FOSS.zip" in it and so on for each file. I tried using beautifulsoup in Python but it didn't seem to be able to handle ASP links properly.
When you struggle with JavaScript links, you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3) # Give your Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]
You can use the links and pass them to urllib2 to download them accordingly.
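For instance, a minimal download loop with urllib2 could look like the sketch below (the file naming is naive; adapt it to the folder structure you described):
import os
import urllib2

# 'links' comes from the Selenium snippet above
for url in links:
    if not url:
        continue
    filename = os.path.basename(url)   # naive: use the last path segment as the file name
    data = urllib2.urlopen(url).read()
    with open(filename, 'wb') as f:
        f.write(data)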
If you need more than a script, I can recommend you a combination of Scrapy and Selenium:
selenium with scrapy for dynamic page
Here's what it is doing. I just used the standard Network inspector in Firefox to snapshot the POST operation. Bear in mind, as with the other answer I pointed you to, this is not a particularly well-written website; JS/POST should not have been used at all.
First of all, here's the JS - it's very simple:
function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}
That writes to these fields:
<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>
So, you just need a POST form, targeting this URL:
http://tsd.dlink.com.tw/downloads2008detail.asp
And here's an example set of data from FF's network analyser. There's only two items you need change - grabbed from the JS link - and you can grab those with an ordinary scrape:
Enter=OK
ModelCategory=0
ModelSno=0
ModelCategory_=DAP
ModelSno_=1150
Model_Sno=
ModelVer=
sel_PageNo=1
OS=GPL
You'll probably find by experimentation that not all of them are necessary. I did try using GET for this, in the browser, but it looks like the target page insists upon POST.
Don't forget to leave a decent amount of time inside your scraper between clicks and submits, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible if you are on a good connection - the remote side may assume you are DoSing them, and might block your IP. Remember the motto of scraping: be a good robot!
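Putting it together, an equivalent POST with urllib2 might look like the sketch below, using the captured field values (only ModelCategory_ and ModelSno_ change per link); whether every field is required is something you would confirm by experimentation:
import time
import urllib
import urllib2

post_data = urllib.urlencode({
    'Enter': 'OK',
    'ModelCategory': '0',
    'ModelSno': '0',
    'ModelCategory_': 'DAP',   # scraped from the oMd(...) link
    'ModelSno_': '1150',       # scraped from the oMd(...) link
    'Model_Sno': '',
    'ModelVer': '',
    'sel_PageNo': '1',
    'OS': 'GPL',
})
html = urllib2.urlopen('http://tsd.dlink.com.tw/downloads2008detail.asp', post_data).read()
time.sleep(5)   # emulate a human delay before the next request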

what's an alternative (other than javascript) for making dynamic links so that they are google-crawlable? [duplicate]

This question already has answers here:
How is Github Changing Pages and the URL so smoothly without AJAX?
(2 answers)
Closed 9 years ago.
I have a page in which some links are built when the page loads, using a JavaScript function that makes the link depending on the current URL of the page. On click, an AJAX call loads the new page.
How do I make these links google-crawlable (since google doesn't crawl javascript links)?
As an example I'd like to mention GitHub. When you open, say, https://github.com/pennersr/django-allauth, all the links inside are already rendered with their respective URLs, depending on the current URL. When you view the source, you can see the links there, whereas you would only see a JavaScript function in the source if the links had been built through JavaScript. I don't think these values are being passed from the back-end either.
What's a possible solution to do this?
This is a common issue in Single Page Applications and in applications that use JavaScript and AJAX intensively.
First of all, you need to create unique URLs for these actions in JavaScript, so the crawler can at least "hit" these actions. If you execute a function in JavaScript but your URL doesn't change, Google will never be able to know that there's something happening there.
Normally, AJAX URLs are written like this:
http://www.foo.com/#!jsAction
Google's crawler will be able to crawl this URL, but the page it gets back will probably be blank, since it is the JavaScript code that is responsible for rendering all the content.
This is why the crawler replaces the '#!' with the _escaped_fragment_ query parameter when calling your server. So the previous URL, as requested by the crawler, would be:
http://www.foo.com?_escaped_fragment_=jsAction
With this new keyword in the URL we can determine in the server that the request comes from a crawler, and here is when the magic starts.
Using a headless browser like PhantomJS, we can execute the JavaScript code on the server and return the fully rendered HTML to the crawler's request. This is one of the approaches that Google suggests in its guidelines.
So basically the point is to determine which type of request you get and execute different code depending if the query string contains _escaped_fragment_.
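A rough sketch of that branch in Python/Flask (any server-side stack works the same way); render_snapshot is a hypothetical helper standing in for the PhantomJS rendering step:
from flask import Flask, request

app = Flask(__name__)

def render_snapshot(action):
    # hypothetical helper: run PhantomJS (or a prerender service) here and
    # return the fully rendered HTML for the given action
    return '<html>prerendered content for %s</html>' % action

def serve_spa():
    # the normal page that boots the JavaScript application
    return '<html>normal JavaScript-driven page</html>'

@app.route('/')
def index():
    fragment = request.args.get('_escaped_fragment_')
    if fragment is not None:   # the request came from the crawler
        return render_snapshot(fragment)
    return serve_spa()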
This link from Google might help point you in the right direction: https://developers.google.com/webmasters/ajax-crawling/
Hope it helps!

Ajax page part load and Google

I have a div on the page that is loaded from the server by AJAX, but in this scenario Google and other search engines don't index the content of that div. The only solution I see is to recognize when the page is fetched by a search robot and return the complete page without AJAX.
1) Is there a simpler way?
2) How do I distinguish humans from robots?
You could also provide a link to the non-AJAX version in your sitemap, and when you serve that file (to the robot), make sure to include a canonical link element pointing to the "real" page you want users to see:
<html>
<head>
[...]
<link rel="canonical" href="YOUR_CANONICAL_URL_HERE" />
[...]
</head>
<body>
[...]
YOUR NON_AJAX_CONTENT_HERE
</body>
</html>
edit: if this solution is not appropriate (some comments below point out that this solution is non-standard and only supported by the "big three"), you might have to rethink whether you should make the non-AJAX version the standard solution, and use JavaScript to hide/show the information instead of fetching it via AJAX. If business-critical information is being fetched, you have to realize that not all users have JavaScript enabled, and thus they won't be able to see that information. A progressive enhancement approach might be more appropriate in this case.
Google gets antsy if you are trying to show different things to your users than to crawlers. I suggest simply caching your query, or whatever it is that needs AJAX, and then using AJAX to replace only what you need to change. You still haven't really explained what's in this div that only AJAX can provide. If you can do it without AJAX, then you should, not just for SEO but for braille readers, mobile devices and people without JavaScript.
You can specify a sitemap in your robots.txt. That sitemap should be a list of your static pages. You should not be giving Google a different page at the same URL, so you should have different URLs for static and dynamic content. Typically, the static URL is .../blog/03/09/i-bought-a-puppy and the dynamic URL is something like .../search/puppy.
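For reference, pointing crawlers at that sitemap is a single directive in robots.txt (the URL is a placeholder):
Sitemap: http://www.example.com/sitemap.xml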
