Help with screen scraping/parsing - javascript

I have been attempting to scrape and eventually parse some data (specifically availabilities and price) from hostels.com, for example http://www.hostels.com/hosteldetails.php/HostelNumber.11890. The problem is, once you select the number of nights and click "book now", nothing is passed through the URL string (it's all done through Ajax, I believe), so I can't go directly to a specific date or time frame.
I have attempted browser emulators such as Selenium, IRobotSoft and FakeApp, and although I did get Selenium and Fake to do much of the work of capturing the full source, the result was ugly and still tedious when having to scrape (and parse with other software) multiple pages a day.
I have also tried HTML DOM Parser, PHP Scriptable Web Browser, HTMLUnit, cScrape.php and Crowbar. Either they couldn't handle the Ajax or I had no luck getting them to run at all.
Ideally I would like something that can run from a server, with as few dependencies as possible, but at this point I would just like to get it running.
After spending many hours trying to get this working, I still feel I'm not sure where to begin. Can someone just point me in the right direction? Should I go back and spend more time with HTMLUnit? What would be the best practice for a site like this?
Thanks

I'm really into Node.js atm (server-side JavaScript, in case you're not familiar), so that's what I'm recommending. What's awesome about using it to scrape sites is that you can use jQuery or whatever your favorite JS framework is to do all the work of parsing for the info you want! See the following resources to get started (there is also a short jsdom sketch after the links):
http://blog.dtrejo.com/scraping-made-easy-with-jquery-and-selectorga
https://github.com/tmpvar/jsdom
https://github.com/chriso/node.io/wiki/Scraping
https://github.com/joshfire/node-crawler
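For a feel of how little code that takes, here is a minimal sketch using jsdom with jQuery injected. The jsdom.env() call matches older versions of the library and the '.price' selector is purely a placeholder; check the jsdom README for the API of the version you actually install.

    // Minimal jsdom + jQuery scraping sketch -- selector and jsdom version are assumptions.
    var jsdom = require('jsdom');

    jsdom.env(
      'http://www.hostels.com/hosteldetails.php/HostelNumber.11890',
      ['http://code.jquery.com/jquery.js'],   // inject jQuery into the fetched page
      function (errors, window) {
        if (errors) { return console.error(errors); }
        var $ = window.$;
        // '.price' is a placeholder -- inspect the real markup to find the right selector.
        $('.price').each(function () {
          console.log($(this).text());
        });
      }
    );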

The page you are referring to does not seem to be using AJAX. Instead, what you are calling AJAX is a POST request (as opposed to data passed in the URL, which is a GET request). I suggest you read up on the difference between them. Trying to understand what's going on is more important than relying on some third-party tool that might turn out to be very inflexible.
Install Firebug and watch which variables are sent in the POST request.
Now do the same thing in your favourite programming language, and parse the HTML returned by the POST request for the necessary information.
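A rough sketch of that in Node.js might look like the following; the form field names here are invented, so copy the real ones out of Firebug's Net panel.

    // POST-replication sketch -- field names and values are assumptions.
    var http = require('http');
    var querystring = require('querystring');

    var postData = querystring.stringify({
      HostelNumber: '11890',     // assumed field names -- use what Firebug actually shows
      checkin: '2012-05-01',
      nights: '3'
    });

    var req = http.request({
      host: 'www.hostels.com',
      path: '/hosteldetails.php/HostelNumber.11890',
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(postData)
      }
    }, function (res) {
      var html = '';
      res.on('data', function (chunk) { html += chunk; });
      res.on('end', function () {
        // parse `html` here for availability and price
        console.log(html.length, 'bytes received');
      });
    });

    req.write(postData);
    req.end();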
Also, +1 for the effort of trying so many different solutions and not giving up.

I've found Celerity (http://celerity.rubyforge.org), a JRuby library that uses HTMLUnit under the hood, to be a very robust solution for "data acquisition via the Web".
Celerity, being Ruby, was in my experience much faster to develop with than full-blown Java (HTMLUnit). Also, because Celerity "wraps" HTMLUnit, I was able to drop down to HTMLUnit when I needed to do some heavier lifting.
I've had success with sites that are rich in DHTML, as well as ones that utilize Ajax; and while I may have used some sleep() calls to wait on the Ajax responses, everything worked as expected.
Give it a try!

Related

How can I have a page showing reservations update when a customer adds a reservation from another computer (using Rails)?

I would like to have a page where a restaurant can log in and see all of their current reservations/take-out orders, and I want this page to automatically update when someone (from another computer) makes a reservation or places an order. The idea is that the restaurant would leave this page open at all times to show their current status. What is the best way to do this? Can it be done without refreshing the page?
I wasn't even sure how to refer to a setup like this, so I wasn't really able to find much using Google. Is there a word for this type of setup?
I am using Rails, and I am considering using AngularJS for the front end. Any suggestions?
There are two approaches to solving this.
The first, oldest and simplest is that your webpage contains some JavaScript that polls the server at regular intervals (e.g. every 10-30 seconds) to check if something has changed, and then adds the changed data (e.g. reloads a partial).
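For instance, a bare-bones polling loop with jQuery could look roughly like this; the /reservations.json endpoint and the #reservations container are assumptions, not part of any existing API.

    // Minimal polling sketch -- endpoint and container id are assumptions.
    setInterval(function () {
      $.getJSON('/reservations.json', function (reservations) {
        var items = $.map(reservations, function (r) {
          return '<li>' + r.customer_name + ' at ' + r.time + '</li>';
        });
        $('#reservations').html(items.join(''));
      });
    }, 15000); // poll every 15 seconds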
The second approach is a bit cleaner: it allows the server to push the changed data to the connected clients only when it actually changes.
There are a few available approaches/libraries for this:
use websockets
use pusher
use juggernaut. Note that the author of juggernaut has since deprecated it, in favor of HTML5 SSE (server-sent events). Read more.
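As an illustration of the SSE route just mentioned, the browser side can be as small as this (the /reservations/stream URL is an assumption, and the server must respond with the text/event-stream content type; jQuery is used here only for brevity):

    // Client-side SSE sketch -- the streaming endpoint is an assumption.
    var source = new EventSource('/reservations/stream');

    source.addEventListener('message', function (event) {
      var reservation = JSON.parse(event.data);
      $('#reservations').append(
        '<li>' + reservation.customer_name + ' at ' + reservation.time + '</li>'
      );
    });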
The advantage of polling is that it is easy and works in every browser, but you have to write more code yourself, and you put some load on your server even when the data has not changed (although that load is minimal).
The push technologies are newer and cleaner, and less code is needed. But some only work in newer browsers (most of the time not really an issue), and some require extra support or setup on your server side.
On that note: pusher is really easy to get started with, and if your load is limited, it is free.
There are still a lot of others, but this should get you started in the right direction.

Understanding what AJAX is capable of and its limitations

I have heard about AJAX for years, but I never felt the need or the interest to learn it. I knew it was a mix of JavaScript and XML, but I never took the time to actually try to understand it, until now.
This is what I currently understand about AJAX. Ajax is not a language; it is just a combination of existing technologies, basically JavaScript and XML (and possibly HTML and CSS), and it uses XMLHttpRequest to communicate with the server in the background to update/load only parts of a page instead of reloading the whole page.
Things I don't fully understand.
1- Is there any AJAX documentation or API that I can refer to, to see what functions/options AJAX offers?
2- Why does every book on Amazon seem to be old? Is this because AJAX is not a language and doesn't change?
3- I read the tutorial at www.w3schools.com and I was wondering whether what is shown in this tutorial is basically all AJAX can do: request and respond to a server?
Again, all I'm trying to understand here is basically how much learning I still need to do in order to have a better understanding of AJAX.
Thanks a lot
Long story short: AJAX lets you make calls to the server without submitting a form or navigating the page. That is all it does.
Originally it stood for "Asynchronous Javascript And XML" because the XMLHttpRequest object was designed to receive updates in XML format. Microsoft added the object so that the Outlook Web interface could pop up new mail alerts by polling the server.
Since then, most programmers have eschewed the use of XML as the data exchange protocol and rely on JSON instead. JSON is far easier to parse and work with.
While I could go through some examples of the low level XMLHttpRequest interactions, other sources have that well covered.
Instead, I'm going to give you a bit of advice. Study Javascript and consider learning the jQuery API. JQuery forces functional programming and makes common activities like AJAX calls super-simple to accomplish. You'll learn to be a better Javascript programmer because of it and will hopefully learn to make your sites more interactive thanks to the power that background server requests bring to the table.
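To give a feel for how simple a background request becomes with jQuery, here is a small hedged example; the URL and the response fields are made up.

    // jQuery AJAX sketch -- URL and response shape are assumptions.
    $.ajax({
      url: '/api/messages',
      dataType: 'json',
      success: function (data) {
        // update part of the page without a reload
        $('#inbox-count').text(data.unread);
      },
      error: function (xhr, status) {
        console.log('request failed: ' + status);
      }
    });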
Although the 'X' in AJAX stands for XML, applications today are more likely to use JSON encoding over XML, as the return data can be evaluated directly by the browser's JavaScript interpreter. The core enabling JavaScript object is XMLHttpRequest, which was originally developed as an ActiveX component for IE 5. It has since become a standard object in all web browser implementations. You can read about the core functionality here: https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest.
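At its lowest level, the object described on that page is used roughly like this (a minimal sketch; /data.json is just a placeholder URL):

    // Raw XMLHttpRequest sketch -- '/data.json' is a placeholder URL.
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/data.json', true);   // true = asynchronous
    xhr.onreadystatechange = function () {
      if (xhr.readyState === 4 && xhr.status === 200) {
        var data = JSON.parse(xhr.responseText);
        console.log(data);
      }
    };
    xhr.send();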
Your best bet would be to research modern JavaScript frameworks such as jQuery; see http://www.jquery.com/ for information on how to use AJAX within your web applications.
This is a bit of a vague question, and likely to get some down votes, but I think it's specific enough that it does warrant some information.
In a nutshell, AJAX is a way for JavaScript to request information asynchronously. The XML portion is a bit of a misnomer, since you don't have to explicitly deal with XML at all. Frequently, you'll use AJAX requests to read in JSON information (since it's so easy to parse and use).
AJAX isn't really a language, or even a framework. It's a technique. It is made possible by the XMLHttpRequest class, along with some related technologies. Since it isn't 100% consistent across all browsers, it is usually best to use a third-party library. jQuery and most other larger frameworks usually have it built in. You can also find some small AJAX-only libraries, like this XMLHttpRequest project on Github.
Every book on the technique is probably old because nothing has really changed substantially since the technique started becoming popular. I've been using it for at least the past 3-5 years, and not much has changed (other than a bit more standardization in modern browsers).
Request and response is basically all AJAX can do. However, that enables a whole world of possibilities. Long story short, it's a way to communicate with the server without having to refresh the page, allowing for a much smoother UI and UX.
The simplest way to think about it is that it lets you fetch data without a page reload.
Think about how Google Maps loads in bits of the map as you drag around - it clearly doesn't load the map for the whole world.
In older map sites you clicked a left, right, up or down arrow, the page reloaded and the new data was shown.
AJAX lets you make pages feel much faster and smoother.
Technically JSON is usually used instead of XML as it's more Javascripty than XML.
Most sites likely use it somewhere or other, ranging from loading sidebar widgets after the main content, to the whole app, like Gmail.

Do you have any idea how Google Docs' JavaScript does the interval data autorefresh?

Alright, here it goes:
I'm currently implementing a piece of software which autorefreshes/autopulls/autoreloads the data to keep the screen live, using AJAX.
This is actually working, but I know I've used the simplest approach, which is:
setInterval (JavaScript)
Call the refresh method over and over every n seconds.
Read the JSON data, rebuild the HTML and update it.
This can also be done by just calling setTimeout (JavaScript) at the end of the AJAX request.
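That variant, scheduling the next poll only after the current response has come back, looks roughly like this; the /live-data.json endpoint and the render() function are placeholders.

    // Self-rescheduling refresh sketch -- endpoint and render() are placeholders.
    function refresh() {
      $.ajax({
        url: '/live-data.json',
        dataType: 'json',
        success: function (data) {
          render(data);               // rebuild the HTML from the JSON
        },
        complete: function () {
          setTimeout(refresh, 5000);  // schedule the next poll only after this one finishes
        }
      });
    }

    refresh();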
In the refresh method I internally check that it's not being called simultaneously, etc.
However... this is the simplest approach. It works, but on slow computers, in Firefox and IE, I can see this activity sometimes freezes the browser. I know the cause might not be the AJAX call itself but rather how "intensive" the JavaScript operation is overall; yet after running a profiler, the overall JavaScript (using jQuery, by the way) seems to be fine. Also, if I disable the autorefresh, the browser won't freeze for short moments on slow computers.
I decided to investigate how several of the major AJAX applications out there work.
Facebook, for instance, does a request all the time, every N seconds, interprets the JSON and updates the screen. But Google Docs... I can't seem to find any request. Is this maybe because they are somehow telling the JavaScript debugger that they do not want their requests to be logged, or are they using another approach to the refresh dilemma?
I read in another answer here at Stack Overflow that Google Docs keeps an open connection.
Can this be the answer? http://ajaxpatterns.org/HTTP_Streaming
What do you guys know about this?
Just as a side note, the application I'm developing is meant to be accessed by thousands of users at a time, and I know the JavaScript refresh routine is only a small part of the story, but the server-side application and the database are currently supporting such a load according to the stress tests I did using several thousand virtualized stations. I just want to know what you think about the client browser problem specifically.
Regards, and if you are still reading this... thank you for your time.
I suspect they're using WebSockets. Browser support is flaky, so your mileage may vary with this approach.
You may also want to look at APE (ajax push engine), which is a decent implementation of long polling with a client/server architecture.
You can read up on Long Polling. But then you'll have to handle dropped connections etc.
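For a rough idea, a long-polling client usually looks something like this; the /updates endpoint and the render() function are assumptions, and the server is expected to hold the request open until it has something to report.

    // Long-polling client sketch -- '/updates' and render() are assumptions.
    function poll() {
      $.ajax({
        url: '/updates',
        dataType: 'json',
        timeout: 60000,                 // give the server up to a minute to respond
        success: function (data) {
          render(data);                 // handle the pushed update
          poll();                       // immediately open the next request
        },
        error: function () {
          setTimeout(poll, 5000);       // dropped connection: back off, then retry
        }
      });
    }

    poll();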

How can we find out whether a downloaded jQuery plugin is trying to connect to its developer's site?

I usually download several jQuery plugins.
How can I check whether a script is stealing any information (such as user cookies or session IDs) and sending it to its developer's server?
In PHP, we check for backdoor scripts by looking for certain functions (system, passthru, shell_exec, etc.). Are there similar functions in JavaScript that would let a plugin connect to its developer's site?
Obviously, your first step should be to read the code. There are a number of tell-tale signs you can look for, including URLs embedded in the code and any encrypted code.
Of course, some code may be too complex to make this a realistic suggestion, particularly if it's been minified and obfuscated, but it should be possible to scan through it. If it is doing anything like this, it'll be using the same functions it uses to communicate with your own site (i.e. jQuery's ajax functions), so you won't see specific function calls that raise suspicion, but suspect URLs in the code should be checked out, and you should definitely avoid encrypted code (obfuscated is generally okay, but not encrypted).
Secondly, search the internet for other people commenting about the plugin. If there is anything untoward happening, it's likely that other people will have noticed it. Avoid using plugins that don't have enough users to get any comments one way or the other.
Finally, use a tool like Firebug to watch for HTTP requests that occur while you're using a site containing the plugin. If it's communicating with base, it can't hide from you; the browser's debugging tools will happily show you what you need to know.
Hope that helps.
I don't think you can do anything other than read the whole code and check whether it is stealing anything.
Another thing you could do is to search the code for strings like 'document.cookie' and 'navigator' and other things that are necessary for stealing information.
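A crude Node.js sketch of that kind of scan might look like this. The pattern list is only a starting point, and a match just means that part of the code deserves a closer read by hand, not that the plugin is malicious.

    // Crude static scan sketch -- usage: node scan.js path/to/plugin.js
    var fs = require('fs');

    var suspicious = ['document.cookie', 'navigator', 'eval(', 'XMLHttpRequest', 'http://'];
    var source = fs.readFileSync(process.argv[2], 'utf8');

    suspicious.forEach(function (pattern) {
      if (source.indexOf(pattern) !== -1) {
        console.log('found "' + pattern + '" -- worth reviewing that section by hand');
      }
    });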

Javascript based spell-checkers for web applications

I have just received a requirement to implement spell checking on a web application that we are creating. I know all about FF, Chrome, IESpell, etc. but this one is the client's request.
Given that the only way to implement something like this (real time) is with JavaScript libraries, I want to know has anyone tried any of the open source ones? Were they any good? In general, what types of good/bad things can be said about this approach?
I guess going into this, I am against it, as it is just more work for the end user's machine for little benefit. What I mean by that is that it will be a script constantly doing something, as opposed to a single AJAX request or a quick div update, which could lead to seemingly bad performance for our application, given that the spell checker is checking every input field on the page. It also seems there is lots of room for a JavaScript error to stall the entire site.
Thoughts?
I agree that a spell-checker should be native if it's running at all times. If the client demands an explicit spell checker, though, it should be implemented as a button to be clicked when needed. It might also be worth firing that XHR request after the user has stopped typing for a certain amount of time, like SO does for syntax highlighting when writing a post.
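That "fire after the user stops typing" idea is just a debounced timer; a minimal sketch with jQuery might look like this, where the /spellcheck endpoint and the #editor field are assumptions.

    // Debounced spell-check sketch -- '/spellcheck' endpoint and '#editor' are assumptions.
    var timer;

    $('#editor').keyup(function () {
      clearTimeout(timer);
      timer = setTimeout(function () {
        $.post('/spellcheck', { text: $('#editor').val() }, function (result) {
          // highlight or list the misspellings however the UI calls for
          console.log(result.misspellings);
        }, 'json');
      }, 1000);   // wait until the user has been idle for one second
    });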
I used After the Deadline for my school newspaper's back-end spell-checker, since it is powerful, checks for simple grammar errors as well, and integrates easily with TinyMCE. There's also a jQuery plugin to integrate with the service.
I've done some research on this problem for a web application that I am planning.
Googie Spell is very good; you can use their servers or run your own Python backend.
There's a demo here.
