I'm building an app in AngularJS and I was thinking about using SEO URLs. Currently my states are as follows:
article/page/:page - for a paginated listing of articles
article/:id - to view details of a single article
I thought about adding another state as follows:
article/:id/:seo
I would completely ignore the :seo state parameter, so no matter if the URL was
article/25/some-article-title or article/25/something-different, it would still display exactly the same article.
I would then simply link people to those URLs, while still ignoring the :seo state parameter; it would serve a purely informational purpose.
My question is whether this is acceptable. All I found were high-end JavaScript SEO frameworks on top of an already heavy framework. I don't need that; I just want a simple way to show people readable URLs.
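For reference, here is roughly what the extra state would look like with ui-router (just a sketch; 'ArticleService' and the template names are placeholders, not real code from my app):

// A sketch only: 'ArticleService' and the template names are placeholders.
angular.module('app', ['ui.router']).config(['$stateProvider', function ($stateProvider) {
    $stateProvider
        .state('articleList', {
            url: '/article/page/:page',
            templateUrl: 'article-list.html'
        })
        .state('articleDetail', {
            // The trailing :seo slug is accepted in the URL but never read.
            url: '/article/:id/:seo',
            templateUrl: 'article-detail.html',
            controller: ['$scope', '$stateParams', 'ArticleService',
                function ($scope, $stateParams, ArticleService) {
                    // Only :id is used to load the article; :seo is ignored.
                    $scope.article = ArticleService.get($stateParams.id);
                }]
        });
}]);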
As long as the canonical URL is set, the overall vanity URL matters little, as the canonical one will be indexed.
As a commercial example, ESPN does this all the time. Sometimes the URL conforms, sometimes it redirects to the correct page, and if all else fails it just redirects to a search.
But no matter how many variations of the URL and how malformed it is, it takes away little, if anything, from the indexed content of the page.
However, if the proportion of total links pointing at a URL that 301-redirects is overwhelming, it is indeed possible for the vanity URL to appear in indexes despite the 301 redirect.
Again, with ESPN as an example, searching for site:espn.com/billsimmons will show an indexed page that redirects to http://search.espn.go.com/bill-simmons/.
As long as you set the canonical URL, you should be fine.
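As a rough illustration (a sketch only; the element id and URL scheme are assumptions), you could keep a single canonical link element in sync with whichever article is displayed, so every slug variation points crawlers at one URL:

// A sketch only: the element id and the URL scheme are assumptions.
function setCanonical(articleId) {
    var link = document.getElementById('canonical-link');
    if (!link) {
        link = document.createElement('link');
        link.id = 'canonical-link';
        link.rel = 'canonical';
        document.head.appendChild(link);
    }
    // Every slug variation of the article points crawlers at this one URL.
    link.href = 'https://example.com/article/' + articleId;
}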
I'm getting a Client DOM Open Redirect security issue on a scan for the following piece of code.
The issue shows up where I'm initializing the variables 'myPath' and 'myHost'.
I'm not able to understand how that is open to a phishing attack and how to fix it.
Could anyone please help?
var query = this.value;
var myPath = window.location.href.substring(window.location.href.indexOf("/myapp"), window.location.href.indexOf(".html"));
var myHost = window.location.href.substring(0, window.location.href.indexOf("/myapp"));
query = encodeURIComponent(query);
if (myPath !== null) {
    query = query + "mysite:" + myHost + myPath;
}
The problem is that you are taking user input (values from the URL bar) and redirecting to it. This may or may not be exploitable in any meaningful attack, but static scanners don't understand your application logic - for a static scanner it's just user input that will be used directly in a redirect.
Based on the info in the question, I can't think of a valid attack, because I think this just rebuilds the same URL the user already visited, without the .html at the end and without the # part if there was one. So the user would be redirected to a page they have already visited. However, that doesn't at all mean there is no vulnerability, and it also depends on other related code. For example, what happens when the user can reach a page containing this code without any .html in the URL bar would affect the outcome, as would another vulnerability that allows (partial) control of the URL bar - a plausible scenario for something like an SPA. So there is not enough info in the question to decide whether it's actually secure or not.
As for the fix, make sure you only redirect where you want to, and never to raw user input. For example, the host part (maybe even the path) could be written into the page by the server, though I understand that may not be possible for something like an SPA. You could implement whitelist validation to ensure no rogue redirects happen. Maybe it's already fine, in which case you can mark this finding as mitigated in the scanner you used, but think about the edge cases and how an attacker could abuse this. Can they trick it with a # in the path? Can they load a page containing this code from a URL that doesn't have .html in it, or has it multiple times? What if they register a domain like someattack.html.hisdomain.com and get a valid user to visit it? :)
The URL bar is a tricky thing: it's user input, but the attacker doesn't have full control, because the victim must hit an application page or this JavaScript won't be loaded at all. Still, the reason this is flagged by a static scanner is that an attacker may have some control, and in the case of JavaScript-heavy single-page applications probably a bit more, given all the URL-bar manipulation going on.
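As a rough sketch of the whitelist idea (the function name, the allowed prefix and the fallback page are assumptions, not taken from your code):

// A sketch only: the allowed prefix and the fallback page are assumptions.
var ALLOWED_PREFIX = "/myapp";

function safeRedirect(target) {
    var a = document.createElement("a"); // let the browser parse the URL
    a.href = target;
    // Only follow same-origin targets under the expected path prefix.
    if (a.host === window.location.host && a.pathname.indexOf(ALLOWED_PREFIX) === 0) {
        window.location.href = a.href;
    } else {
        window.location.href = "/"; // fall back to a known-safe page
    }
}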
I've read about escaped fragments, but I don't think that applies here, because what I need to do is route specific URL routes to certain actions on the same page in an SEO-friendly way.
Consider an example: a page has 30 posts in it. The markup is already there, no AJAX magic here. Once a user clicks a URL like example.com/#/test-post, I want to open a popup with the post contents (suppose that test-post is the post slug or any other content identifier).
This applies to posts, image galleries and pretty much any content that I want to show in a popup by matching a URL route to a certain JavaScript action. The technical part is a piece of cake, but how would this perform SEO-wise? I understand that using separate pages for individual posts would probably be best, but is it possible to allow a single-page website to be crawled for individual posts, so that the test-post accessed through example.com/#/test-post via JavaScript ends up as a separate link in Google search results?
Using hash fragments to do different things on the same webpage via JavaScript can be really useful in some situations. However, looking at it from an SEO perspective, I don't think it is a great solution at all.
The reason is that the fragment identifier introduced by a hash mark (#) is the optional last part of a URL and is typically used to identify a portion of that document. As a result, from an SEO perspective, only one page will be indexed.
I would suggest making use of .htaccess rewrite rules and friendly URLs instead. For instance, it might look like this:
SEO friendly URL: `www.example.com/test-post`
window.onload = function () {
    var url = window.location.pathname; // e.g. "/test-post" with friendly URLs
    switch (url) {
        // Perform different actions here, one case per friendly URL
        case "/test-post":
            // open the post popup
            break;
    }
};
I know that I can find out the referrer using:
document.referrer;
However, I have a one-page website, and a redirect set up to send all other pages on that website to the home page. I would like a way of capturing the link that originated the redirect. In this case, document.referrer is always empty.
So I guess, I need to know:
How do I set a referrer parameter before the redirection?
How do I capture that parameter with JavaScript in the home page?
You could pass it along in a URL parameter. For example, Google does something similar when you click a search result; the browser actually goes to google.com/url?url=https%3A%2F%2Fthe-site-you-want.com.
So you could redirect your users to 'http://your-site.com/?referrer='+ encodeURIComponent(document.referrer), and then once it hits the homepage, you can extract that value and run decodeURIComponent.
encodeURIComponent is a method that makes a value safe to put in a URL. decodeURIComponent is the reverse process.
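Putting those two pieces together, a minimal sketch (the parameter name referrer is just an example) might look like this:

// A sketch only: the parameter name "referrer" is just an example.
// On the page that performs the redirect:
window.location.href = 'http://your-site.com/?referrer=' +
    encodeURIComponent(document.referrer);

// On the home page, read the value back out:
function getReferrerParam() {
    var match = window.location.search.match(/[?&]referrer=([^&]*)/);
    return match ? decodeURIComponent(match[1]) : '';
}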
Alternatively, you could put it in a hash rather than the query string, like 'http://your-site.com/#'+ encodeURIComponent(document.referrer). Several client-side routers use this approach, although it may break the back button unless you spend more time learning about pushState. When Twitter first used twitter.com/#!/foo-bar as a URL scheme, it broke many things, but it may still be useful to you.
I'm trying to post links to various articles from my website to Reddit, but they all have the same root URL and are only differentiated by using the hashbang (#) to go to different articles. I wrote my front end with a single-page application framework (Ember.js), which defaults to using hashbangs to designate different pages. Thus, here are some examples of different blog posts:
http://noobjs.org/#/posts/15
http://noobjs.org/#/posts/16
They are different pages and different articles, but Reddit tells me that the link was already submitted since it must not view the hashbang as significant. Does anyone know if there is a way around this? The answer may be that I change my site so that it no longer uses the hashbang, but I'd rather avoid that so I don't break any other links I sent out.
Any ideas?
Reddit recognizes different query strings as being different pages. So, you can add a query string to the end that is the same as the hash.
http://noobjs.org/#/posts/15
http://noobjs.org/#/posts/16
become
http://noobjs.org/#/posts/15?/posts/15
http://noobjs.org/#/posts/16?/posts/16
It's not the prettiest, but it will work fine. Alternatively, you could write a check on page load against the URL to change ? into #.
// Only rewrite when a query string is actually present, to avoid a needless reload
if (window.location.search) window.location = window.location.href.replace("?", "#");
and post query string versions to reddit:
http://noobjs.org/?/posts/15
http://noobjs.org/?/posts/16
EDIT:
Currently, Ember does not have strong support for query parameters, but in this situation, a slight variation worked:
http://noobjs.org/?/#/posts/15
http://noobjs.org/?/#/posts/16
On Facebook, when you add a link to your wall, it gets the title, pictures and part of the text. I've seen this behavior on other websites where you can add links. How does it work? Does it have a name? Is there any JavaScript/jQuery extension that implements it?
And how is it possible that Facebook goes to another website and gets the HTML when it's, supposedly, forbidden to make a cross-site AJAX call?
Thanks.
Basic Methodology
When the fetch event is triggered (for example, on Facebook, pasting a URL in), you can use AJAX to request the URL*, then parse the returned data as you wish.
Parsing the data is the tricky bit, because so many websites have varying standards. Taking the text between the title tags is a good start, along with possibly searching for a META description (but these are being used less and less as search engines evolve into more sophisticated content based searches).
Failing that, you need some way of finding the most important text on the page and taking the first 100 chars or so as well as finding the most prominent picture on the page.
This is not a trivial task; it is extremely complicated to derive semantics from such a fluid and inconsistent set of data (a generic returned web page). For example, you might find the biggest image on the page. That's a good start, but how do you know it's not a background image? How do you know it's the image that best describes that page?
Good luck!
*If you can't directly AJAX third-party URLs, this can be done by requesting a page on your own server which fetches the remote page server-side with some sort of HTTP request.
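As a rough client-side sketch of that proxy approach (the /fetch-preview endpoint is an assumption, not a real API; your own server would fetch the remote page and return its HTML as text):

// A sketch only: '/fetch-preview' is an assumed proxy endpoint on your own
// server that returns the remote page's HTML as text.
$.get('/fetch-preview', { url: 'http://somesite.com.au' }, function (html) {
    // Parse the returned markup without executing any of its scripts.
    var doc = new DOMParser().parseFromString(html, 'text/html');
    var titleEl = doc.querySelector('title');
    var descEl = doc.querySelector('meta[name="description"]');
    var title = titleEl ? titleEl.textContent : '';
    var description = descEl ? descEl.getAttribute('content') : '';
    console.log(title, description);
});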
Some Extra Thoughts
If you grab an image from a remote server and 'hotlink' it on your site, many sites serve 'anti-hotlinking' replacement images when you try to display it, so it might be worth comparing the image you requested from your server-side page with the image that was actually fetched, so you don't show anything nasty by accident.
A lot of title tags in the head will be generic and non-descriptive; it would be better to fetch the title of the article (assuming an article-type site) if one is available, as it will be more descriptive. Finding this is difficult, though!
If you are really smart, you might be able to piggy-back off Google, for example (check their T&Cs though). If a user requests a certain URL, you can run a Google search for it behind the scenes and use the returned descriptive text as your return text. If Google changes their markup significantly, though, this could break very quickly!
You can use a PHP server-side script to fetch the contents of any web page (look up web scraping). What Facebook does is throw out an AJAX call to a PHP server-side script which uses a PHP function such as
file_get_contents('http://somesite.com.au');
Now, once the file or webpage has been pulled into your server-side script, you can filter the contents for anything in particular. E.g. Facebook's link fetcher will look for the title, img and meta property="description" parts of the file or webpage via regular expressions,
e.g. PHP's preg_match() function.
The result can then be collected and returned to your webpage.
You may also want to consider adding extra functions for returning only the data you want, as scraping some pages may take longer than expected to return your desired information. E.g. filter out irrelevant stuff like JavaScript, CSS, irrelevant tags, huge images etc. to make it run faster.
If you get this down pat, you could potentially be on your way to building a web search engine, or better yet, collecting data off sites like Yellow Pages, e.g. phone numbers, mailing addresses, etc.
Also you may want to look further into:
get_meta_tags('http://somesite.com.au');
:-)
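If you happen to be on Node.js rather than PHP, a rough sketch of the same server-side idea (not Facebook's actual implementation) might look like this:

// A sketch only: a rough Node.js equivalent of the PHP approach above,
// pulling the <title> out with a regular expression, the same way
// preg_match() would.
var http = require('http');

function fetchTitle(url, callback) {
    http.get(url, function (res) {
        var html = '';
        res.on('data', function (chunk) { html += chunk; });
        res.on('end', function () {
            var match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
            callback(match ? match[1].trim() : null);
        });
    }).on('error', function () { callback(null); });
}

fetchTitle('http://somesite.com.au', function (title) {
    console.log(title); // page title or null
});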
There are several APIs that can provide this functionality; for example, PageMunch lets you pass in a URL and callback so that you can do this from the client side or feed it through your own server:
http://www.pagemunch.com
An example response for the BBC website looks like:
{
    "inLanguage": "en",
    "schema": "http:\/\/schema.org\/WebPage",
    "type": "WebPage",
    "url": "http:\/\/www.bbc.co.uk\/",
    "name": "BBC - Homepage",
    "description": "Breaking news, sport, TV, radio and a whole lot more. The BBC informs, educates and entertains - wherever you are, whatever your age.",
    "image": "http:\/\/static.bbci.co.uk\/wwhomepage-3.5\/1.0.64\/img\/iphone.png",
    "keywords": [
        "BBC",
        "bbc.co.uk",
        "bbc.com",
        "Search",
        "British Broadcasting Corporation",
        "BBC iPlayer",
        "BBCi"
    ],
    "dateAccessed": "2013-02-11T23:25:40+00:00"
}
You can always just look at what is in the `<title>` tag. If you need this in JavaScript, it shouldn't be that hard. Once you have the data you can do:
var title = $(data).find('title').html();
The problem will be getting the data, since I think most browsers will block you from making cross-site AJAX requests. You can get around this by having a service on your site which acts as a proxy and makes the request for you. However, at that point you might as well parse out the title on the server. Since you didn't specify what your back-end language is, I won't bother to guess.
It's not possible with pure JavaScript due to the cross-domain policy - client-side script can't read the contents of pages on other domains unless that other domain explicitly allows it (for example via a JSONP service).
The trick is to send a server-side request (each server-side language has its own tools), parse the results using regular expressions or some other string-parsing technique, and then use this server-side code as a "proxy" for the AJAX call made on the fly when posting a link.