I am using C# to crawl a website. All works fine except that it can't detect dynamic JS links. As an example, a page with over 100 products may span several pages, and the "Next Page" and "Prev Page" links may be dynamic JS URLs, generated on click. Typical JS code is below:
<a href="javascript:PageURL('
cf-233--televisions.aspx','?',2);">></a>
Is there any way of getting the actual link from the above href while collecting URLs on the page?
I am using Html Agility Pack but am open to any other technology. I have tried googling this many times, but there seems to be no solution yet.
Thanks.
Have you tried evaluating the JavaScript to get the actual hrefs? This might be helpful: Parsing HTML to get script variable value
Or maybe you should check what the PageURL function does (just open the website in a browser and type PageURL, without parentheses, at its console; it will show you the code of the function) and rewrite it in C#.
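If you don't want to run a browser at all, here is a minimal sketch of that idea (shown in Python for brevity; the same regex works with C#'s Regex class). How PageURL combines its arguments into a real URL is an assumption here, so check the actual function in the console first:

import re

# The href as collected from the page:
href = "javascript:PageURL('cf-233--televisions.aspx','?',2);"

# Pull the three PageURL arguments out of the javascript: URL
match = re.search(r"PageURL\('([^']*)','([^']*)',(\d+)\)", href)
if match:
    page, separator, number = match.groups()
    # ASSUMPTION: PageURL concatenates its arguments into something like
    # "cf-233--televisions.aspx?page=2" - verify against the real function
    # in the browser console and adjust this line.
    print(page + separator + 'page=' + number)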
AbotX allows you to render the JavaScript on the page. It's a powerful web crawler with advanced features.
I need to make it so that when users visit a web page like example.com/news, it automatically brings them to a different website, like cnn.com, without them having to click anything. Preferably they would not even see the original web page; it would bring them directly to the other site (cnn.com in this case). I think I can use the onload event in HTML, but I have little experience with JavaScript and don't know what code to use to accomplish this task. Thank you! I do not want to use jQuery if possible.
Just one line of code (inside script tags):
<script>
window.location.href = "http://exampleurl.com";
</script>
You would be better off using headers. It depends what server-side scripting language you are using. For PHP you would have the following:
header('Location: http://www.example.com/');
I have a website and want to have links to a page which has a filter function. I would like to create a link in such a way that, when followed, I do not simply get the destination page, but rather the page with a filter already applied.
To be more specific, I am looking at the website for NetCDF CF standard names. From my page I would like to have a link that would already filter, e.g., for 'longitude' on the destination page.
The destination page uses JavaScript to apply the filter function.
Any ideas how to achieve that?
It's impossible to control the JS on a site from an external URL.
But you can do something else: download the data from the external site to your server via a server-side script (like PHP), and recreate the filters on your site in JS. This way, though, you need to be careful about copyright, and you have to maintain your script if they change the table structure they use.
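As a rough sketch of that approach in Python (the URL and the table markup below are assumptions; point them at the real standard-names page and adjust the row selector to match):

import requests
from bs4 import BeautifulSoup

# ASSUMED URL - substitute the actual standard-names table page
html = requests.get('http://cfconventions.org/standard-names.html').text
soup = BeautifulSoup(html, 'html.parser')

# Keep only the table rows that mention the filter term
rows = [row for row in soup.find_all('tr')
        if 'longitude' in row.get_text().lower()]
for row in rows:
    print(row.get_text(' ', strip=True))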
I am trying to download all the files from this website for backup and mirroring; however, I don't know how to go about parsing the JavaScript links correctly.
I need to organize all the downloads in the same way, in named folders. For example, for the first one I would have a folder named "DAP-1150", and inside that would be a folder named "DAP-1150 A1 FW v1.10" with the file "DAP1150A1_FW110b04_FOSS.zip" in it, and so on for each file. I tried using BeautifulSoup in Python but it didn't seem to be able to handle ASP links properly.
When you struggle with JavaScript links, you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3) # Give your Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]
You can use the links and pass them to urllib2 to download them accordingly.
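For example, a minimal download loop over the collected links might look like this (the .zip filter and flat layout are simplifications; building the nested "DAP-1150/..." folder names would additionally need the text scraped around each link):

import os
import urllib2

# Save every collected .zip link into the current directory
for url in links:
    if url and url.endswith('.zip'):
        filename = os.path.basename(url)
        response = urllib2.urlopen(url)
        with open(filename, 'wb') as f:
            f.write(response.read())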
If you need more than a script, I can recommend you a combination of Scrapy and Selenium:
selenium with scrapy for dynamic page
Here's what it is doing. I just used the standard network inspector in Firefox to snapshot the POST operation. Bear in mind, as with the other answer I pointed you to, this is not a particularly well-written website - JS/POST should not have been used at all.
First of all, here's the JS - it's very simple:
function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}
That writes to these fields:
<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>
So, you just need a POST form targeting this URL:
http://tsd.dlink.com.tw/downloads2008detail.asp
And here's an example set of data from Firefox's network analyser. There are only two items you need to change - grabbed from the JS link - and you can grab those with an ordinary scrape:
Enter=OK
ModelCategory=0
ModelSno=0
ModelCategory_=DAP
ModelSno_=1150
Model_Sno=
ModelVer=
sel_PageNo=1
OS=GPL
You'll probably find by experimentation that not all of them are necessary. I did try using GET for this in the browser, but it looks like the target page insists on POST.
Don't forget to leave a decent amount of time inside your scraper between clicks and submits, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible if you are on a good connection - the remote side may assume you are DoSing them, and might block your IP. Remember the motto of scraping: be a good robot!
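Putting it together, a sketch of that POST in Python with urllib2. The field values are exactly the captured set above; only ModelCategory_ and ModelSno_ change per link, and the sleep follows the five-second suggestion:

import time
import urllib
import urllib2

# The two Model fields come from the JS link (oMd('DAP','1150'));
# everything else is copied verbatim from the captured POST.
data = urllib.urlencode({
    'Enter': 'OK',
    'ModelCategory': '0',
    'ModelSno': '0',
    'ModelCategory_': 'DAP',
    'ModelSno_': '1150',
    'Model_Sno': '',
    'ModelVer': '',
    'sel_PageNo': '1',
    'OS': 'GPL',
})
response = urllib2.urlopen('http://tsd.dlink.com.tw/downloads2008detail.asp', data)
html = response.read()
time.sleep(5)  # be a good robot: pause before the next hit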
This question already has answers here:
How is Github Changing Pages and the URL so smoothly without AJAX?
I have a page where some links are built when the page loads, using a JavaScript function that makes the links depending on the current URL of the page. On click, an AJAX call loads the new page.
How do I make these links Google-crawlable (since Google doesn't crawl JavaScript links)?
As an example I'd like to mention GitHub. When you open, say, https://github.com/pennersr/django-allauth all the links inside are already populated with their respective URLs, depending on the current URL. When you view source, you can see the links there, whereas you would just see a JavaScript function in the source if the links had been made through JavaScript. I don't think these values are being passed from the back end either.
What's a possible solution to do this?
This is a common issue in Single Page Applications, or in applications that make intensive use of JavaScript and AJAX.
First of all, you need to create a unique URL for each of these actions in JavaScript, so the crawler can at least "hit" them. If you execute a function in JavaScript but your URL doesn't change, Google will never be able to know that something is happening there.
Normally, AJAX URLs are written like this:
http://www.foo.com/#!jsAction
The Google crawler will be able to crawl this URL, but the page it gets back will probably be blank, since the JavaScript code is responsible for rendering all the content.
This is why the crawler replaces the '#!' with the token _escaped_fragment_ when calling your server. So the previous URL, as requested by the crawler, would be:
http://www.foo.com?_escaped_fragment_=jsAction
With this new keyword in the URL we can determine on the server that the request comes from a crawler, and this is when the magic starts.
Using a headless browser like PhantomJS, we can execute the JavaScript code on the server and return the fully rendered HTML to the crawler. This is one of the approaches that Google suggests in its guidelines.
So basically the point is to determine which type of request you got and execute different code depending on whether the query string contains _escaped_fragment_.
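As a rough sketch of that branch on the server (Flask and the render.js PhantomJS script are my assumptions here, not part of the scheme itself):

import subprocess
from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def index():
    fragment = request.args.get('_escaped_fragment_')
    if fragment is not None:
        # Crawler request: render the JS view server-side and return the
        # finished HTML. render.js is an ASSUMED PhantomJS script that
        # prints the fully rendered page to stdout.
        url = 'http://www.foo.com/#!' + fragment
        return subprocess.check_output(['phantomjs', 'render.js', url])
    # Normal visitor: serve the JS application as usual
    return app.send_static_file('index.html')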
This link from Google might help point you in the right direction: https://developers.google.com/webmasters/ajax-crawling/
Hope it helps!
I wish to retrieve the source of a website that is dynamically generated upon clicking a link. The link itself is as below:
<a onclick="function(); return false" href="#">Link</a>
This stops me from directly querying for a URL that would let me retrieve the dynamically generated page (with urllib/urllib2).
How would one retrieve, via Python, the source of the page that the above function generates? Is there a method to bypass the return false / href="#", or the onclick entirely, and get the actual URL?
If there is another way to generate the website from the abstract link above, so that one can get it from urllib in python, please refer me to it.
EDIT:
I generalized the code seen above; however, I've been told that one has to reverse-engineer the specific JavaScript to be able to use it.
Link to .js - http://a.quizlet.com/j/english/create_setku80j8.js
Link to site with link:
<a onclick="importText(); return false" href="#">Bulk-import data</a>
Actual URL of site: http://quizlet.com/create_set/
Beautified JS of relevant .js above: http://pastie.org/737042
You will probably have to reverse engineer the JavaScript to work out what is going on.
Can you provide the site and the link in question?
I don't immediately see any content-generation or link-following code in that script; all importText does is toggle whether a few divs are shown.
If you want to study the calls the webapp makes to do a particular action, in order to reproduce them from a bot, you're probably best off looking at the HTTP requests (form submissions and AJAX calls) that the browser makes whilst performing that action. You can use Firebug's ‘Net’ panel to study this for Firefox, or Fiddler for IE.
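Once you've captured the request, replaying it from Python might look like this sketch (the endpoint and field names below are placeholders; use exactly what the Net panel shows for the action you are reproducing):

import urllib
import urllib2

# PLACEHOLDER data - substitute the captured form fields
data = urllib.urlencode({'terms': 'imported text here'})
request = urllib2.Request('http://quizlet.com/create_set/', data)
request.add_header('X-Requested-With', 'XMLHttpRequest')  # mimic an AJAX call
response = urllib2.urlopen(request)
print(response.read())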