Scraping a remote URL for bookmarking service without getting blocked - javascript

I'm using a server-side Node.js function to get the text of a URL passed by the browser, so I can auto-index that URL in a bookmarking service. I use jsdom for server-side rendering. But I get blocked by popular sites, despite the requests originating from legitimate users.
Is there a way to implement the URL text extraction on the browser side, such that requests would always seem to be coming from a normal distribution of users? How do I get around the cross-site security limitations in the browser? I only need the final DOM-rendered text.
Is a bookmarklet the best solution? When the user wants to bookmark the page, the bookmarklet would just append a form and submit the DOM-rendered text to my service?
I know SO hates debates, but any guidance on good methods would be much appreciated.

You could certainly do it client-side, but I think that would be overly complex. The client would have to send the HTML to your service, which would require very careful sanitising, and it might be difficult to control the volume of incoming data.
I would probably just track the request domains and ensure that I limited the frequency of calls to any single domain. That should be fairly straightforward with something like Node.js, where you can easily set up any number of background fetch tasks. This would also let you fine-tune the bandwidth used.
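A rough sketch of that approach in Node.js (the 10-second spacing, the queue shape, and the indexPage helper are my assumptions, not a tested implementation):

    // Per-domain fetch queue: at most one request per domain per interval.
    const { JSDOM } = require('jsdom');

    const queues = new Map();        // domain -> URLs waiting to be fetched
    const active = new Set();        // domains with a running worker
    const MIN_DELAY_MS = 10 * 1000;  // assumed spacing; tune to your bandwidth

    function enqueue(url) {
      const domain = new URL(url).hostname;
      if (!queues.has(domain)) queues.set(domain, []);
      queues.get(domain).push(url);
      if (!active.has(domain)) drain(domain); // start a background task
    }

    async function drain(domain) {
      active.add(domain);
      const queue = queues.get(domain);
      while (queue.length > 0) {
        const url = queue.shift();
        try {
          const dom = await JSDOM.fromURL(url);
          indexPage(url, dom.window.document.body.textContent); // hypothetical indexing step
        } catch (err) {
          console.error('fetch failed:', url, err.message);
        }
        // Wait before hitting the same domain again.
        await new Promise((resolve) => setTimeout(resolve, MIN_DELAY_MS));
      }
      active.delete(domain);
    }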

Related

HTTP request to a website to get the content of a specific HTML element

I am building a site to help students schedule their university courses. It will include things like days, times, professor, etc. I want to fetch the "rating" of professors from www.ratemyprofessors.com and have it show on my site. For example, at https://www.ratemyprofessors.com/ShowRatings.jsp?tid=1230754 you can see Michael has a rating of 4.6. I want to request that data and have it show on my site. I can't scrape it beforehand, as ratings change and I want to show the current rating. Am I able to do this with an XMLHttpRequest? How would I do that? I'm hoping to do it in JavaScript.
Browsers won't let HTTP requests to third-party websites leave your webpage unless the target site allows it; the mechanism that governs this is CORS. See https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS. You may be lucky and find that the site allows it (or doesn't disallow it), but that may change in the future, leaving you in a bind (a malfunctioning feature).
Also, what you're planning to do is called web scraping, and webmasters typically don't favor it, so you might eventually get blocked, or stumble upon a change in the content markup, again leaving you in the same bind.
I would ask the owner of that site for permission and, perhaps, API access.
Otherwise, your option #1 is to try making that HTTP request from a browser-level script (yes, you can use Ajax via XMLHttpRequest, the newer fetch API, or a third-party library), which will work only if CORS isn't a problem.
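For example, a client-side attempt might look like this (the '.rating' selector is a guess; this only works if the remote site sends permissive CORS headers, which most don't):

    // Works only if the remote site sends Access-Control-Allow-Origin headers
    // permitting your page.
    fetch('https://www.ratemyprofessors.com/ShowRatings.jsp?tid=1230754')
      .then(function (response) { return response.text(); })
      .then(function (html) {
        var doc = new DOMParser().parseFromString(html, 'text/html');
        // '.rating' is a hypothetical selector -- inspect the real markup first.
        var rating = doc.querySelector('.rating');
        console.log(rating ? rating.textContent : 'rating element not found');
      })
      .catch(function (err) { console.error('blocked or failed:', err); });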
Your option #2 is to make the same request from the server (so, Ajax to your server app, which scrapes the remote site); this is the workaround for the potential CORS problem. Again, CORS is an obstacle only at the browser level, because browsers are coded to intercept cross-origin reads to minimize potential harm to the user's data. However, this option is subject to eventually having your server blocked from accessing the remote site: the site's owner can do that simply by configuring it to refuse connections from IP addresses they detect as belonging to your site. Pretty cool, huh?
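A sketch of option #2, assuming an Express server and jsdom (the route, port, and selector are illustrative, not from the original answer):

    // Server-side proxy: the browser calls your server; your server scrapes
    // the remote site, where CORS does not apply.
    const express = require('express');
    const { JSDOM } = require('jsdom');

    const app = express();

    app.get('/api/rating', async (req, res) => {
      try {
        const dom = await JSDOM.fromURL(
          'https://www.ratemyprofessors.com/ShowRatings.jsp?tid=' +
            encodeURIComponent(req.query.tid)
        );
        // '.rating' is a hypothetical selector for the element holding the score.
        const el = dom.window.document.querySelector('.rating');
        res.json({ rating: el ? el.textContent.trim() : null });
      } catch (err) {
        // Treat the third-party data as a nice-to-have (see below): fail softly.
        res.status(502).json({ rating: null, error: 'upstream fetch failed' });
      }
    });

    app.listen(3000);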
Both of these options are further subject to the problem of dealing with content changes, which would be in the hands of your post-request script, whether it executes in the browser (option #1) or on the server (option #2); that could mean ongoing maintenance. Either way, craft it so that the third-party data is treated as a nice-to-have (don't crash your page when fetching that other data fails).
Edit: I would have to try this to be certain, but it's something to think about: you could embed a hidden iframe in your page targeting that remote webpage (as in your example), then parse the iframe's content once it's available. Note that this endeavor is not trivial AT ALL: it would cost quite a chunk of development time (it isn't a task a beginner could reasonably complete, at least not quickly), and, again, I am not 100% certain it would even be possible, as the hosting page may not have access to an iframe's content when it's served by a third-party website. So, this would potentially be option #3, an at-browser solution (so, lots of JavaScript), however not susceptible to CORS blocking. Phew, a lot of words, I know, but they do make sense, if you can believe me.
Hope that helps you decide. Good luck.

Web services API Keys and Ajax - Securing the Key

This is probably a generic security question, but I thought I'd ask in the realm of what I'm developing.
The scenario is: a web service (WCF Web API) that uses an API key to validate the request and tell me who the user is, and a mix of jQuery and application code on the front end.
On the one hand, the traffic can be HTTPS, so it cannot be sniffed in transit; but if I use the same key per user (say, a GUID) in both places, there's the chance it could be taken and someone could impersonate the user.
If I implement something akin to OAuth, then a per-user and per-app key is generated, and that could work; but for the jQuery side I would still need the app's API key in the JavaScript.
This would only be a problem if someone was on the actual computer and did a view-source.
What should I do?
1. MD5 or encrypt the key somehow?
2. Put the key in a session variable, then retrieve it when making the Ajax call?
3. Get over it; it's not that big a deal/problem.
I'm sure it's a common problem, so any pointers would be welcome.
To make this clearer: this is my own API that I have written and am querying against, not Google's etc., so I can do per-session tokens and the like. I'm just trying to work out the best way to secure the tokens/keys that I would use on the client side.
I'm being a bit overly cautious here, but just using this to learn.
(I suggest tagging this post "security".)
First, you should be clear about what you're protecting against. Can you trust the client at all? A crafty user could stick a Greasemonkey script on your page and call exactly the code that your UI calls to send requests. Hiding everything in a JavaScript closure only means you need a debugger; it doesn't make an attack impossible. Firebug can trace HTTPS requests. Also consider a compromised client: is there a keylogger installed? Is the entire system secretly running virtualized, so that an attacker can inspect any part of memory at any time, at their leisure? Security when you're as exposed as a webapp is genuinely tricky.
Nonetheless, here are a few things for you to consider:
Consider not actually using keys but rather HMAC hashes of, e.g., a token you give out immediately upon authentication (see the sketch after this list).
DOM storage can be a bit harder to poke at than cookies.
Have a look at Google's implementation of OAuth 2 for an example security model. Basically you use tokens that are only valid for a limited time (and perhaps for a single IP address). That way even if the token is intercepted or cloned, it's only valid for a short length of time. Of course you need to be careful about what you do when the token runs out; could an attacker just do the same thing your code does and get a new valid token?
Don't neglect server-side security: even if your client should have checked before submitting the request, check again on the server if the user actually has permission to do what they're asking. In fact, this advice may obviate most of the above.
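To illustrate the HMAC idea from the first point, here is a rough Node.js sketch (the secret handling, field names, and session storage of the token are my assumptions):

    const crypto = require('crypto');

    const SERVER_SECRET = process.env.API_SECRET; // hypothetical secret, kept server-side

    // At login: derive a short-lived token the client can hold instead of the real key.
    function issueToken(userId) {
      return crypto
        .createHmac('sha256', SERVER_SECRET)
        .update(userId + ':' + Date.now())
        .digest('hex');
    }

    // Per request: the client sends HMAC(token, requestBody); the server, which
    // stored the token in the user's session, recomputes and compares it.
    function verifyRequest(token, body, claimedSignature) {
      const expected = crypto.createHmac('sha256', token).update(body).digest('hex');
      if (claimedSignature.length !== expected.length) return false;
      return crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(claimedSignature));
    }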
It depends on how the API key is used. API keys like those provided by Google are tied to the URL of the site originating the request; if you try to use the key on a site with a different URL, the service throws an error, thus removing the need to protect the key on the client side.
Some basic APIs, however, are tied to a client and can be used across multiple domains. In that case I have previously wrapped the API in server-side code and placed restrictions on how the client can communicate with the local service, protecting the service itself.
My overall recommendation, however, would be to apply restrictions on the web API around how keys can be used, which removes the complications and the necessity of trying to protect them on the client.
How about using jQuery to call server-side code that handles communication with the API? If you are using MVC, you can call a controller action that contains the code and API key to hit your service and returns a partial view (or even JSON) to your UX. If you are using Web Forms, you could create an .aspx page that does the API communication in the code-behind and then writes content to the response stream for your UX to consume. Your UX code would then just contain some $.post() or $.load() calls to your server-side code, and both your API key and endpoint would be protected.
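The client-side half of that might be as simple as this (the URLs and fields are illustrative):

    // The browser only ever talks to your own server; the API key never leaves it.
    $.post('/Service/Query', { term: 'example' }, function (data) {
      $('#results').html(data); // data could be a partial view or JSON
    });

    // Or load a server-rendered fragment directly into an element:
    $('#results').load('/Service/QueryPartial?term=example');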
Generally in cases like this, though, you proxy requests through the server using Ajax, and the server verifies that the browser making the requests is authorized to do so. If you want to call the service directly from JavaScript, then you need some kind of token system, like JSON Web Tokens (JWT), and you'll have to work out cross-domain issues if the service is located somewhere other than the current domain.
See http://blogs.msdn.com/b/rjacobs/archive/2010/06/14/how-to-do-api-key-verification-for-rest-services-in-net-4.aspx for more information
(How to do API Key Verification for REST Services in .NET 4)

Is there any way to verify that client side code that is used is the one given by the server?

In a previous question I asked about weaknesses in my own security-layer concept. It relies on JavaScript cryptography functions, and thanks to the answers the striking point is now clear: everything that is done in JavaScript can be manipulated and cannot be trusted...
The problem now is that I still need to use those functions, even if I rely on SSL for transmission...
So I want to ask: is there a way for the server to check that the site is using the "correct" JavaScript it served?
Anything that comes to my mind (like hashing, etc.) can obviously be faked... and the server doesn't seem to have any way of knowing what's going on at the client's side after it has sent some data, except via HTTP headers (cookie exchange and the like).
It is completely impossible for the server to verify this.
All interactions between the Javascript and the server come directly from the Javascript.
Therefore, malicious Javascript can do anything your benign Javascript can do.
By using SSL, you can make it difficult or impossible for malicious Javascript to enter your page in the first place (as long as you trust the browser and its addons), but once it gets a foothold in your page, you're hosed.
Basically, if the attacker has physical (or scripted) access to the browser, you can no longer trust anything.
This problem doesn't really have anything to do with javascript. It's simply not possible for any server application (web or otherwise) to ensure that processing on a client machine was performed by known/trusted code. The use of javascript in web applications makes tampering relatively trivial, but you would have exactly the same problem if you were distributing compiled code.
Everything a server receives from a client is data, and there is no way to ensure that it is your expected client code that is sending that data. Any part of the data that you might use to identify your expected client can be created just as easily by a substitute client.
If your concern is substitution of the client code via a man-in-the-middle attack, loading the JavaScript over HTTPS is pretty much your best bet. However, nothing will protect you against direct substitution of the client code on the client machine itself.
Never assume that clients are using the client software you wrote. It's an impossible problem and any solutions you devise will only slow and not prevent attacks.
You may be able to authenticate users, but you will never be able to reliably authenticate what software they are using. A corollary to this is to never trust data that clients provide. Some attacks, for example Cross-Site Request Forgery (CSRF), require us not to trust even that the authenticated user meant to provide the data.

What's the point of the Anti-Cross-Domain policy?

Why did the creators of the HTML DOM and/or Javascript decide to disallow cross-domain requests?
I can see some very small security benefits of disallowing it, but in the long run it seems to be an attempt at making JavaScript injection attacks less powerful. That is all moot anyway with JSONP; it just means the JavaScript code is a tiny bit more difficult to write, and you have to have server-side cooperation (though it could be your own server).
The actual cross-domain issue is huge. Suppose SuperBank.com internally sends a request to http://www.superbank.com/transfer?amount=100&to=123456 to transfer $100 to account number 123456. If I can get you to my website while you are logged in at SuperBank, all I have to do is have your browser send an AJAX request to SuperBank.com to move money from your account to mine.
The reason JSON-P is acceptable is that it is pretty darn impossible to abuse. A website using JSON-P is pretty much declaring the data to be public information, since that format is too inconvenient to ever be used otherwise. But if it's unclear whether data is public information, the browser must assume that it is not.
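For reference, JSON-P is just a script injection that calls a function you named, which is why serving it amounts to declaring the data public (the endpoint below is hypothetical):

    // The page declares a global callback, then loads the remote data as a script tag:
    function handleData(data) {
      console.log(data.rating);
    }
    var s = document.createElement('script');
    s.src = 'https://example.com/api?callback=handleData'; // hypothetical JSON-P endpoint
    document.head.appendChild(s);

    // The server responds with executable script rather than plain JSON:
    //
    //     handleData({"rating": 4.6});
    //
    // Any page on any domain can pull this in, which is why serving JSON-P
    // amounts to publishing the data.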
When cross-domain scripting is allowed (or hacked by a clever Javascripter), a webpage can access data from another webpage. Example: joeblow.com could access your Gmail while you have mail.google.com open. joeblow.com could read your email, spam your contacts, spoof mail from you, delete your mail, or any number of bad things.
To clarify some of the ideas in the question with a specific use case:
The cross-domain policy is generally not there to protect you from yourself. It's there to protect the users of your website from the other users of your website (XSS).
Imagine you had a website that allowed people to enter any text they want, including JavaScript. A malicious user decides to add some JavaScript to the "about yourself" field. Users of your website navigate to his profile and have this script executed in their browsers. That script, since it's executing on your website's behalf, has access to cookies and such from your website.
If the browser allowed for cross domain communication, this script could theoretically collect your info and then upload it to a server that the malicious user would own.
Here's a distinction for you: cross-domain AJAX allows a malicious site to make your browser do things on its behalf, while JSON-P allows a malicious server to tamper with a single domain's pages (and to make the browser do things to that domain on your behalf), but (crucial bit) only if the page served went out of its way to load the malicious payload.
So yes, JSON-P has some security implications, but they are strictly opt-in on the part of the website using them. Allowing general cross-domain AJAX opens up a much larger can of worms.

Ajax Security

We have a heavily Ajax-dependent application. What are good ways of making sure that requests to the server-side scripts are not coming from standalone programs, but from an actual user sitting at a browser?
There aren't any really.
Any request sent through a browser can be faked up by standalone programs.
At the end of the day, does it really matter? If you're worried, make sure requests are authenticated and authorised and that your authentication process is good (remember, Ajax sends browser cookies, so your "normal" authentication will work just fine). Just remember that, of course, standalone programs can authenticate too.
What are the good ways of making it sure that the request to server side scripts are not coming through standalone programs and are through an actual user sitting on a browser
There are no ways. A browser is indistinguishable from a standalone program; a browser can be automated.
You can't trust any input from the client side. If you are relying on client-side co-operation for any security purpose, you're doomed.
There isn't a way to automatically block "non-browser user" requests hitting your server-side scripts, but there are ways to identify which requests have been triggered by your application and which haven't.
This is usually done using something called "crumbs". The basic idea is that the page making the AJAX request should generate (server-side) a unique token (typically a hash of a Unix timestamp + salt + secret). This token and the timestamp are passed as parameters with the AJAX request. The AJAX handler script first checks the token (and the validity of the timestamp, e.g. that it falls within 5 minutes of the current time). If the token checks out, you can proceed to fulfill the request. This token generation and checking can usually be coded up as an Apache module, so that it is triggered automatically and kept separate from the application logic.
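A minimal sketch of such a crumb scheme, following the description above (the function names and secret handling are assumptions):

    const crypto = require('crypto');

    const SECRET = process.env.CRUMB_SECRET;  // hypothetical server-side secret
    const MAX_AGE_MS = 5 * 60 * 1000;         // the 5-minute validity window

    // Server side, when rendering the page: issue a crumb tied to the timestamp.
    function makeCrumb(timestamp, salt) {
      return crypto
        .createHash('sha256')
        .update(timestamp + salt + SECRET)
        .digest('hex');
    }

    // AJAX handler: recompute and compare before fulfilling the request.
    function checkCrumb(timestamp, salt, crumb) {
      const age = Date.now() - Number(timestamp);
      if (age < 0 || age > MAX_AGE_MS) return false;
      return makeCrumb(timestamp, salt) === crumb;
    }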
Fraudulent scripts won't be able to generate valid tokens (unless they figure out your algorithm) and so you can safely ignore them.
Keep in mind that storing the token in the session is another way, but that won't buy you any more security than your site's authentication system already provides.
I'm not sure what you are worried about. From where I sit, your question could relate to three things:
First, you may want to prevent unauthorized users from making valid requests. This is resolved by using a browser cookie to store a session ID. The session ID needs to be tied to the user, be regenerated every time the user goes through the login process, and have an inactivity timeout. Any request coming in without a valid session ID you simply reject (see the sketch after these three points).
Second, you may want to prevent a third party from running replay attacks against your site (i.e. sniffing an innocent user's traffic and then sending the same calls over). The easy solution is to go over HTTPS: the SSL layer prevents somebody from replaying any part of the traffic. This comes at a cost on the server side, so you want to make sure you really cannot afford that risk.
Third, you may want to prevent somebody from using your API (that's what AJAX calls are, in the end) to implement their own client for your site. For this there is very little you can do. You can always look for the appropriate User-Agent, but that's easy to fake and will probably be the first thing somebody trying to use your API thinks of. You can also gather statistics, for example the average number of AJAX requests per minute per user, and see whether some users are way above that average. It's hard to implement, and it's only useful if you are trying to prevent automated clients that react faster than a human can.
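A bare-bones sketch of the first point using express-session (the route names, timeout, and user lookup are placeholders):

    const express = require('express');
    const session = require('express-session');

    const app = express();
    app.use(session({
      secret: process.env.SESSION_SECRET,  // hypothetical env var
      resave: false,
      saveUninitialized: false,
      rolling: true,                       // refresh expiry on activity
      cookie: { maxAge: 30 * 60 * 1000 },  // assumed 30-minute inactivity timeout
    }));

    // At login: regenerate the session ID so it is fresh for each login.
    app.post('/login', (req, res) => {
      // ...verify credentials here (omitted)...
      req.session.regenerate((err) => {
        if (err) return res.sendStatus(500);
        req.session.userId = 'user-123'; // hypothetical authenticated user
        res.sendStatus(200);
      });
    });

    // Every AJAX endpoint: reject any request without a valid session.
    app.post('/api/action', (req, res) => {
      if (!req.session.userId) return res.sendStatus(401);
      res.json({ ok: true });
    });

    app.listen(3000);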
Is Safari a web browser for you? If it is, note that you get the same engine in many applications, for example those using Qt's QtWebKit libraries. So I would say there is no way to recognize it.
Users can forge any request they want, faking headers like User-Agent however they like...
One question: why would you want to do what you're asking for? What difference does it make to you whether they request from a browser or from anything else?
I can't think of one reason you'd call this "security".
If you still want to do this, for whatever reason, think about shipping your own application with an embedded browser. It could authenticate itself to the server in every request; then you'd only send valid responses to your application's browser.
Users would still be able to reverse-engineer the application, though.
Interesting question.
What about browsers embedded in applications? Would you mind those?
You can probably think of a way of "proving" that a request comes from a browser, but it will ultimately be heuristic. The line between browser and application is blurry (e.g. embedded browser) and you'd always run the risk of rejecting users from unexpected browsers (or unexpected versions thereof).
As has been mentioned before, there is no way of accomplishing this... But there is one thing worth noting, useful for preventing CSRF attacks that target the specific AJAX functionality: set a custom header with the help of the AJAX object, and verify that header on the server side.
And if, in the value of that header, you set a random (one-time-use) token, you can prevent automated attacks.
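A sketch of that header-plus-token approach (the header name, the window.csrfToken plumbing, and the Express-style handler are assumptions):

    // Client: attach the token (rendered into the page by the server) as a
    // custom header. Plain cross-site HTML forms cannot set custom headers.
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/api/update');
    xhr.setRequestHeader('Content-Type', 'application/json');
    xhr.setRequestHeader('X-CSRF-Token', window.csrfToken); // hypothetical token plumbing
    xhr.send(JSON.stringify({ value: 42 }));

    // Server: compare against the token stored in the user's session, and
    // invalidate it after one use.
    app.post('/api/update', (req, res) => {
      if (!req.session.csrfToken || req.get('X-CSRF-Token') !== req.session.csrfToken) {
        return res.sendStatus(403);
      }
      req.session.csrfToken = null; // one-time use, as suggested above
      res.sendStatus(200);
    });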
