How do I get the Document object from a separate URL? - javascript

I'm making a program to parse a bunch of information from several pages of a database website for a JavaScript library. Just by messing around with the console, I've figured out how to isolate the information I need, but I don't know how to access the information from the parsing program. I'm almost exclusively proficient at HTML/JavaScript, so naturally I'm sticking with what I know for what should be a relatively simple parser. Here's the basic idea of what I intend to do:
for (var i = 0; i < 5; i++) {
var outsideHTML = getDocumentByURL("https://www.example-database.com/page-"+i);
//other code that parses information from the variable "outsideHTML"
}
I just need a function to serve as getDocumentByURL(). Thanks in advance.

Due to something called same-origin policy, you cannot access the document or other contents of another webpage using client-side JavaScript unless the other webpage is on the same domain as your webpage, or unless the other website explicitly allows it by supporting JSONP or setting Access-Control-Allow-Origin headers. It sounds like you're trying to retrieve a webpage, so JSONP isn't relevant, and it's fairly uncommon for Access-Control-Allow-Origin headers to be set on a webpage. Thus, this probably isn't possible to accomplish in the way you've described.
To retrieve the data from the other website, there's a couple approaches you can take:
Run server-side code (i.e. PHP, Node.js, Java, etc) that retrieves the other webpage and extracts the information you need. Server-side code is not affected by browser security policies such as same origin policy.
Use a cross-origin proxy (such as crossorigin.me). This proxy will retrieve the data for you and add the Access-Control-Allow-Origin headers that allow you to access the page contents.
Depending on what you're trying to achieve, you might transform your idea from a webpage to a browser extension -- browser extensions can be given the freedom to ignore same-origin policy.
Ask the website owner if they'd be willing to accommodate you by providing the data in a friendlier format.
Note that both of the first two approaches result in the request being made from a server rather than from the client's computer. This means that you can't retrieve any information that requires that they be logged into the website.

Related

Use JavaScript to crawl a website -> Possible and which IP is shown on the crawled site

it is possible to crawl a website within an Angular-App? I am speaking about to call a website from Angular, not crawling an Angular-App. If that so, then I am wondering which IP will be shown on the crawled website. Since JavaScript is client-side, I would suggest, its the IP of the client, not of the server (like probably at nodejs). But all I know, its mostly browser-implemented stuff what we can use in JS, so it is even possible to crawl websites with methods from JavaScript (or Angular)?
Best Regards
Buzz
In theory, you can create an AJAX request to fetch the data with reponse type text/html. That would give you the remote document as a string. The browser wouldn't try to load the JavaScript and CSS in that document, though. That might not be a problem but CORS is. For security reasons, most browsers prevent you from loading data from somewhere else (otherwise, it would be too easy for criminals to put JavaScript into any web page). See here for details: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS
If you have control over the second domain, you can configure the server there to send Access-Control-Allow-Origin headers to the browser to allow access from the Angular App.
Note: You could use an iframe to load the other website but when the domains of the current document and the one in the iframe don't match, then you can't access the contents of the iframe from JavaScript.
One way to work around this is to install a proxy on your server. The browser can then ask your server for the pages in question. In this case, the remote web site will get the IP of your server.

Prevent local PHP/HTML files preview from executing javascript on server

I have some HTML/PHP pages that include javascript calls.
Those calls points on JS/PHP methods included into a library (PIWIK) stored onto a distant server.
They are triggered using an http://www.domainname.com/ prefix to point the correct files.
I cannot modify the source code of the library.
When my own HTML/PHP pages are locally previewed within a browser, I mean using a c:\xxxx kind path, not a localhost://xxxx one, the distant script are called and do their process.
I don't want this to happen, only allowing those scripts to execute if they are called from a www.domainname.com page.
Can you help me to secure this ?
One can for sure directly bypass this security modifying the web pages on-the-fly with some browser add-on while browsing the real web site, but it's a little bit harder to achieve.
I've opened an issue onto the PIWIK issue tracker, but I would like to secure and protect my web site and the according statistics as soon as possible from this issue, waiting for a further Piwik update.
EDIT
The process I'd like to put in place would be :
Someone opens a page from anywhere than www.domainname.com
> this page calls a JS method on a distant server (or not, may be copied locally),
> this script calls a php script on the distant server
> the PHP script says "hey, from where damn do yo call me, go to hell !". Or the PHP script just do not execute....
I've tried to play with .htaccess for that, but as any JS script must be on a client, it blocks also the legitimate calls from www.domainname.com
Untested, but I think you can use php_sapi_name() or the PHP_SAPI constant to detect the interface PHP is using, and do logic accordingly.
Not wanting to sound cheeky, but your situation sounds rather scary and I would advise searching for some PHP configuration best practices regarding security ;)
Edit after the question has been amended twice:
Now the problem is more clear. But you will struggle to secure this if the JavaScript and PHP are not on the same server.
If they are not on the same server, you will be reliant on HTTP headers (like the Referer or Origin header) which are fakeable.
But PIWIK already tracks the referer ("Piwik uses first-party cookies to keep track some information (number of visits, original referrer, and unique visitor ID)" so you can discount hits from invalid referrers.
If that is not enough, the standard way of being sure that the request to a web service comes from a verified source is to use a standard Cross-Site Request Forgery prevention technique -- a CSRF "token", sometimes also called "crumb" or "nonce", and as this is analytics software I would be surprised if PIWIK does not do this already, if it is possible with their architecture. I would ask them.
Most web frameworks these days have CSRF token generators & API's you should be able to make use of, it's not hard to make your own, but if you cannot amend the JS you will have problems passing the token around. Again PIWIK JS API may have methods for passing session ID's & similar data around.
Original answer
This can be accomplished with a Content Security Policy to restrict the domains that scripts can be called from:
CSP defines the Content-Security-Policy HTTP header that allows you to create a whitelist of sources of trusted content, and instructs the browser to only execute or render resources from those sources.
Therefore, you can set the script policy to self to only allow scripts from your current domain (the filing system) to be executed. Any remote ones will not be allowed.
Normally this would only be available from a source where you get set HTTP headers, but as you are running from the local filing system this is not possible. However, you may be able to get around this with the http-equiv <meta> tag:
Authors who are unable to support signaling via HTTP headers can use tags with http-equiv="X-Content-Security-Policy" to define their policies. HTTP header-based policy will take precedence over tag-based policy if both are present.
Answer after question edit
Look into the Referer or Origin HTTP headers. Referer is available for most requests, however it is not sent from HTTPS resources in the browser and if the user has a proxy or privacy plugin installed it may block this header.
Origin is available for XHR requests only made cross domain, or even same domain for some browsers.
You will be able to check that these headers contain your domain where you will want the scripts to be called from. See here for how to do this with htaccess.
At the end of the day this doesn't make it secure, but as in your own words will make it a little bit harder to achieve.

including external URL javascript file

I have a main header page that is included in many different applications across a couple of different languages, including Java and classic ASP. The file (file.js) is going to be obsolete soon. We are going to be going to an "out-of-the-box" solution, a new header created by another group. They gave us a link ("google.com") that we need to use to show this new header. I was wondering if there was a simple solution I could implement in my file.js that would show this content to the users. I know an easy way to do it in jsp is
<c:import url="http://google.com"/>
but this won't work in the js file, nor will it work in the jsp. Is there a way for me to do this?
Thank you,
Explosive_donut
Obviously the URL you are really given isn't Google. I suppose the second team is able to modify their own (document) headers sent to clients.
First way I think of is to use AJAX to retrieve the contents of the URL and create a div or select an existing to set its new content.
Unfortunately AJAX is restricted to the Same Origin Policy which can be circumvented with CORS (Cross Origin Resource Sharing). To allow CORS, your remote server as well as your client maschines need to send respective headers. Check out the link for more information.
If you need any more information and/or tutorials, let me know in the comments.

XSS - Send data to the source server of a script?

I am writing a JavaScript application where I plan on host the code on a CDN. Now I plan to include this code to my clients' sites. However, I have a problem, I want to use AJAX to communicate between the client and the server. Now, from my understanding of XSS, this is not possible.
Ex:
User visits site.com, where a script tag's source is pointing to a file on cdn.somedomain.com
The script on cdn.somedomain.com fires an event.
This event will communicate with a PHP. I know it is possible for the script from cdn.somedomain.com to request documents on site.com. However, is it possible to send data back to a PHP file on cdn.somedomain.com?
Thanks for helping an entrepenuer! :D
The short is I think this is possible, but it depends on a couple of things. The same origin policy is a weird thing in that it won't allow cross domain reads, but will allow cross domain writes.
I think a way you could accomplish your goal is by making a GET request (minimally by creating an iframe, img, or whatever else that pulls a src) or possibly even using AJAX. If your goal is to only send data, then that should be fine. However, if you want to read this data back then I think that'll be a little less straight forward. I can't really answer that right now - especially without knowing more details about your system setup.
Sounds like a weird use of a cdn. Normally cdns serve static assets, so you wouldnt put a php file there. In fact the cdn wouldnt normally run dynamic server side code at all.
You can address the problem in several ways. Newer browsers support CORS and cross domain ajax. The cdn would then have to use the Access-control-* headers. You could also look at something like easyXDM, which works in older browsers.

Retrieve a cross domain RSS(xml) through Javascript

I have seen server side proxy workarounds for retrieving rss (xmls) from cross-domains. In fact this very question addressess my same problem but gives out a different solution.
I have a constraint of do not use a proxy to retrieve rss feeds. And hence the Google AJAX Feed API solution also goes out of picture. Is there a client-only workaround for this problem.
JSONP is the solution for requests that respond with JSON output. But here, I have RSS feeds which can respond with pure xml .
How do I solve the problem.
Use something like Yahoo! Pipes to serve as your proxy and translate the RSS XML into a JSON response.
Here is an article with instructions and code samples that explains how to do it: Yahoo Pipes--RSS without Server Side Scripts.
If you have control over both domains, you can try a cross-domain scripting library like EasyXDM, which wraps cross-browser quirks and provides an easy-to-use API for communicating in client script between different domains using the best available mechanism for that browser (e.g. postMessage if available, other mechanisms if not).
Caveat: you need to have control over both domains in order to make it work (where "control" means you can place static files on both of them). But you don't need any server-side code changes.
Another Caveat: there are security implications here-- make sure you trust the other domain's script!
Right now there really isn't a cross-platform solution for cross-site scripting. Do you have control or access to the RSS feeds? If so, why not simply respond with JSON and use JSONP?
There are other things coming down the pike with HTML5, like cross-site messaging (referred to as Cross-Document Messaging) that may be capable of delivering a payload of XML, but last time I checked, they hadn't even fully decided on a size limit for the messaging.
You can see the spec here: http://dev.w3.org/html5/spec/Overview.html#crossDocumentMessages
A solution for cross-domain calls without a server-side proxy is to use a SWF component.
You can script yourself one or use the readily available FLSend
The component uses ActionScript's URLRequest to call remote domains and ExternalInterface to communicate with the JavaScript methods that render your content.
The only way I can think of would be to embed a signed java applet on the webpage to retrive the xml and use javascript to interface with that. I'm not even 100% certain what the java security model is for that at present though but I think it would work.

Categories

Resources