I am trying to scrape some data from the following web page:
College Board - Georgia Institute of Technology
But the information I need to access is only displayed after pressing the "Applying" tab on the left. Since the URL does not change, how can I simulate pressing the button in order to scrape the HTML?
I am using Python 3.3 and the requests module.
According to the page source, the information you need is embedded in JavaScript code and is calculated and rendered only after the "Applying" link is clicked.
requests simply cannot perform in-browser user actions, and since no additional requests are made after clicking "Applying", you cannot get the data without a real browser running that JS code. Mechanize also wouldn't help, because it cannot handle JS.
Consider using selenium (FYI, you can also use a headless PhantomJS browser).
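A minimal sketch of that approach with the Python Selenium bindings; the URL placeholder and the locator for the "Applying" tab are assumptions you would need to verify against the actual page:

# Sketch: drive a real browser so the page's JavaScript actually runs.
# Assumes the Python "selenium" package plus PhantomJS (or Firefox) is
# installed; the URL placeholder and the link locator are assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://..."  # the College Board page for Georgia Tech

driver = webdriver.PhantomJS()   # or webdriver.Firefox()
driver.get(url)

# Click the tab so the JavaScript renders the hidden content.
driver.find_element(By.LINK_TEXT, "Applying").click()

html = driver.page_source        # now includes the "Applying" section
driver.quit()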
Hope that helps.
Related
I'm having trouble dealing with pages that contain JavaScript links. The page contains a list of cities whose links are JavaScript calls. I have to navigate to each link one by one, scrape some information, then come back to the list, move on to the next city, and continue scraping.
The problem is that after clicking one of the JavaScript links with the Selenium WebDriver and navigating back to the list page, the response is lost and I get an error like:
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"id","selector":"some_id"}'
Is there any way around this?
Usually, the JavaScript only handles POSTs for AJAX requests. You just have to understand what the JavaScript is doing and reimplement the same logic within Scrapy, without any need for Selenium.
For that, just use the Chrome Developer Tools or Firefox's Firebug.
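For example, if the developer tools show that each city link simply fires a request to some endpoint, you can reproduce that request directly in Scrapy. A rough sketch, in which the endpoint URL, CSS selectors, and field names are all made-up assumptions:

# Rough Scrapy sketch: request the AJAX endpoint behind each JS "city"
# link directly instead of clicking it. URLs and selectors are invented.
import scrapy

class CitySpider(scrapy.Spider):
    name = "cities"
    start_urls = ["http://example.com/cities"]

    def parse(self, response):
        # Instead of clicking each JavaScript link, request the URL it calls.
        for city_id in response.css("li.city::attr(data-id)").extract():
            yield scrapy.Request(
                "http://example.com/ajax/city/%s" % city_id,
                callback=self.parse_city,
            )

    def parse_city(self, response):
        # Scrape the detail page; no "back" navigation is needed at all.
        yield {"url": response.url, "name": response.css("h1::text").extract_first()}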
If you need variables that are set in JavaScript, I guess you can use a JavaScript engine like V8 (through a wrapper like PyV8).
Do you have an example that we can help you with?
We are looking for a JavaScript API to screen-scrape a page flow, including button clicks. If it were on the server side, Selenium's WebDriver would have been a great choice, but we want the screen scraping to run in the client browser. The screens to be scraped form a transaction in themselves (log in to a third-party website, transaction step 1, step 2, and then final confirmation). Is any JavaScript API available?
AFAIK, neither Node.js nor PhantomJS has the capability to click a button on the scraped page.
thanks in advance,
abbas
WebDriver is an HTTP-based protocol that browsers can speak (via their driver implementations), so it is possible to control one browser from another. I wrote a tutorial on that topic a few weeks ago here.
I recently ran into DalekJS (http://dalekjs.com/docs/actions.html), which allows taking screenshots of pages and clicking on elements as well. I think it even supports multiple browsers, although they have to be installed (http://dalekjs.com/pages/documentation.html#browsers).
Here is the sample code directly from the link:
test.open('http://doctorwhotv.co.uk/')
    .assert.text('#nav').is('Navigation')
    .assert.visible('#nav')
    .assert.attr('#nav', 'data-nav', 'true')
    .click('#nav')
    .done();
In a traditional web application I generally write JSPs that render HTML to the browser and communicate with the server using form submits or JavaScript. This usually involves transitioning from one page to another, with a browser refresh, many times.
Now with HTML5 I can still use the same approach, but I want to achieve more of a desktop-application look and feel, which means no browser refreshes. I am really confused about how this can be achieved.
Do I need to write one big HTML5 file that contains all of the application's markup, show or hide the divs that are needed at a given point in time using JavaScript, and communicate with the server via JavaScript?
Or should I just have a minimal first HTML5 page where the user lands, and then create all the HTML5 content dynamically with JavaScript, communicating with the server via JavaScript? This looks more difficult.
Or is there a way to move from one page to another without the effect of a page load/refresh?
In general, what should the approach be when using HTML5?
For example, in a shopping cart, the first view the user sees is a list of items to purchase. Then the user moves to the next view, such as the details of an item. The next view can be payment.
If you have a resource or example that explains this, that would be great.
I was wondering if there is a way to prevent a user from saving/downloading a web page. Specifically, I mean not letting them have access, on their own machine, to the data displayed through my web application.
I've heard that this is not possible, since the browser must have access to the source code/data. But at the same time, I've noticed that if I log in to my Gmail account, open an email, and save the page, the saved page doesn't work when I try to open it on my computer. Furthermore, if I click "view source", I can see that even the source does not display the entire email message, even though the email is open in my browser.
How is it possible for Gmail to prevent me from seeing that email data?
That's what is called rendering pages with dynamic data without refreshing the page (AJAX). The entire page source is not downloaded in one go; components within the page request data asynchronously to display their content. Try googling it and you will find more information.
In "view source" you can only see the HTML, CSS, and JavaScript code. No one can copy any server-side code (e.g. PHP) from view source.
You can't stop anyone from seeing your HTML and CSS in the browser, since the "view source" option is always there.
The most you can do is disable right-click on your page. That can be done through JavaScript.
I have a site affiliated with a university and we want to link to another site that has a certain teaching program.
How can we track the number of times this link has been clicked from within our website?
I would use jQuery and/or AJAX to touch a page in the background that counts a hit whenever the link is clicked, and then let the link proceed to do what it does.
You can use web analytics tools such as Google Analytics or commercial software such as WebTrends.
Instead of directly providing the external link, link to a page on your own site that redirects to it, and the click will be logged like every other request.
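A minimal sketch of that idea as a Flask view, purely illustrative; the route name, the target URL, and the way clicks are stored are all assumptions:

# Illustrative "count, then redirect" page using Flask; the route name,
# target URL, and click storage are assumptions, not a specific recipe.
from flask import Flask, redirect

app = Flask(__name__)
TARGET = "http://example.edu/teaching-program"  # the external site you link to

@app.route("/go/teaching-program")
def track_and_redirect():
    # Record the click however you like (log file, database, analytics call).
    with open("clicks.log", "a") as log:
        log.write("click\n")
    return redirect(TARGET, code=302)

if __name__ == "__main__":
    app.run()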
The sky is really the limit when it comes to link tracking. It really depends on your expertise.
You can use a service like bit.ly to track the clicks on the link. Bit.ly is mostly used as a URL-shortening service, but if you sign up for an account you can keep track of the links you generate and how often they are clicked.
If you want to install something on your server to track the links, you can use something like:
http://www.phpjunkyard.com/php-click-counter.php It's a simple redirect script: you give it a link and it gives you back a link that it can track. It keeps track of all the clicks. The script is very simple, does not require a MySQL database, and you don't need much programming knowledge to install it.
The most reliable method is using a redirector to measure the traffic going through (e.g. the phpjunkyard.com script above); no JavaScript is needed, as long as you can rely on your server to redirect without problems.
If a server-side option is not available, using a web-analytics tool simply to count link clicks doesn't make sense, so the JavaScript option could be used, but you still need something to record the clicks in.
If you would like to use a web-analytics tool to, for example, better understand your visitors, it's a different story. The WA tools use a 1x1-pixel GIF to record calls and read the incoming data (not all require it, but all use it if possible). GA is free, but you would have to code the link click yourself (it's simple, though). Piwik hosting is available very cheaply and would do the trick. WebTrends and the other heavyweight tools are overkill for this kind of requirement.