Does html5mode(true) affect Google search crawlers? - javascript

I'm reading this specification, which is an agreement between web servers and search engine crawlers that allows dynamically created content to be visible to crawlers.
It states that in order for a crawler to index an HTML5 application, one must implement routing using #! in URLs. With Angular's html5Mode(true) we get rid of this hashed part of the URL. I'm wondering whether this is going to prevent crawlers from indexing my website.

Short answer - no, html5Mode will not mess up your indexing, but read on.
Important note: Both Google and Bing can crawl AJAX-based content without HTML snapshots
I know, the documentation you link to says otherwise, but a year or two ago both search engines officially announced that they handle AJAX content without the need for HTML snapshots, as long as you use pushState. A lot of the documentation is old and unfortunately not updated.
SEO using pushState
The requirement for AJAX crawling to work out of the box is that you change your URL using pushState. This is exactly what html5Mode in Angular does (and what a lot of other frameworks do). When pushState is used, the crawlers will wait for AJAX calls to finish and for JavaScript to update the page before indexing it. You can even update things like the page title or meta tags in your router and they will be indexed properly. In essence you don't need to do anything; there is no difference between server-side and client-side rendered sites in this case.
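For reference, a minimal sketch of what that looks like in AngularJS (the route, template and title property are made-up examples; html5Mode(true) also needs a <base href="/"> tag and server-side rewrites of deep links to index.html):

angular.module('app', ['ngRoute'])
  .config(['$locationProvider', '$routeProvider', function($locationProvider, $routeProvider) {
    // Use the HTML5 History API (pushState) instead of #-fragments.
    $locationProvider.html5Mode(true);
    // Hypothetical route; "title" is our own convention, not an ngRoute option.
    $routeProvider.when('/products/:id', {
      templateUrl: 'product.html',
      title: 'Product details'
    });
  }])
  .run(['$rootScope', '$route', function($rootScope, $route) {
    // Update the document title on navigation so crawlers index it.
    $rootScope.$on('$routeChangeSuccess', function() {
      if ($route.current && $route.current.title) {
        document.title = $route.current.title;
      }
    });
  }]);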
To be clear, a lot of SEO-analysis tools (such as Moz) will spit out warnings on pages using pushState. That's because those tools (and their reps, if you talk to them) are at the time of writing not up to date, so ignore them.
Finally, make sure you are not using the fragment meta tag from below when doing this. If you have that tag, the crawlers will think that you want to use the non-pushState method and things might get messed up.
SEO without pushState
There is very little reason not to use pushState with Angular, but if you don't, you need to follow the guidelines linked to in the question. In short, you create snapshots of the HTML on your server and then use the fragment meta tag to change your URL fragment from "#" to "#!":
<meta name="fragment" content="!" />
When a crawler finds a page like this it will remove the fragment part of the URL and instead request the URL with the parameter _escaped_fragment_, so you can serve your snapshotted page in response, giving the crawler a normal static page to index.
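As a concrete example, a crawler that sees http://example.com/#!/products will instead request http://example.com/?_escaped_fragment_=/products. A minimal sketch of serving snapshots for that request, assuming a Node/Express server and a hypothetical snapshots/ directory (the file-naming scheme is an assumption):

var express = require('express');
var path = require('path');
var app = express();

app.get('/', function(req, res) {
  var fragment = req.query._escaped_fragment_;
  if (fragment !== undefined) {
    // Crawler request: serve the pre-rendered snapshot for this route.
    var file = fragment.replace(/\//g, '_') || 'index';
    res.sendFile(path.join(__dirname, 'snapshots', file + '.html'));
  } else {
    // Normal visitors get the JavaScript application.
    res.sendFile(path.join(__dirname, 'index.html'));
  }
});

app.listen(3000);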
Note that the fragment meta tag should only be used if you want to trigger this behaviour. If you are using pushState and want the page indexed that way, don't use this tag.
Also, when using snapshots in Angular you can have html5Mode on. In html5Mode the fragment is hidden, but it still technically exists and will still trigger the same behaviour, assuming the fragment meta tag is set.
A warning - Facebook crawler
While both Google and Bing will crawl your AJAX pages without problems (if you are using pushState), Facebook will not. Facebook does not understand AJAX content and still requires special solutions, such as HTML snapshots served specifically to the Facebook bot (user agent facebookexternalhit/1.1).
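A minimal sketch of that special-casing, continuing the hypothetical Express setup above (serveSnapshot is a made-up helper, not a real API):

app.use(function(req, res, next) {
  var ua = req.get('User-Agent') || '';
  if (ua.indexOf('facebookexternalhit') !== -1) {
    // Serve a pre-rendered snapshot with the Open Graph tags filled in.
    return serveSnapshot(req, res);
  }
  next();
});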
Edit - I should probably mention that I have deployed sites using all of these variants: html5Mode with the fragment meta tag and snapshots, and no snapshots at all, just relying on pushState crawling. It all works fine, except for pushState and Facebook, as noted above.

To allow indexing of your AJAX application, you have to add a special meta tag in the head section of your document:
<meta name="fragment" content="!" />
Source:
https://docs.angularjs.org/guide/$location#crawling-your-app
Towards the bottom, look for "Crawling your app".

Related

Should AJAX use hashtag /#!/ or not?

I've made a webpage that has the URL form http://www.example.com/module/content
It's a very dynamic webpage; actually, it is a web app.
To make it as responsive as possible, I want to use AJAX instead of normal page requests. This also enables me to use JavaScript to add a layer that provides offline capabilities.
My question is only: How should I make the URLs? Should they be http://www.example.com/module/content or http://www.example.com/#!/module/content?
What follows is only my thoughts in both directions. You don't need to read it if you already have a clear opinion about this.
I want to use the first version because I want to support the new HTML5 standard. It is easy to use, and the URLs look pretty. But more importantly, it allows me to do this:
If the user requests a page, it will get a full HTML page back.
If the user then clicks a link, it will insert only the contents into the container div via AJAX.
This will enable users without JavaScript to use my website, since it does not REQUIRE the user to have JavaScript; it will simply use the plain old "click a link, request, get a full HTML page back" approach.
While this is great, the problem is of course Internet Explorer. Only the latest version of IE supports the History API, and to get this working in older versions one needs to use a hashtag. (50% of my users will be using IE, so I need to support it...) So then I have to use /#!/ to get it working.
If one uses both these URL versions, the problem arises that if an IE user posts such a link to a website, Google will send the server an _escaped_fragment_ request (or similar) and will index the page as the IE version with the hashtag, while some pages will be indexed without the hashtag.
And as we remember, a non-hashtag URL is better in many different ways. So, can this search engine problem be circumvented? Is it possible to tell GoogleBot that if it's reaching for the hashtag version of the website, it should be redirected to the non-hashtag version of the page? That way, one could get all the benefits of a non-hashtag URL and still support IE6-IE9.
What do you think is the best approach? Maybe you have tried it in practice yourself?
Thanks for your answer!
If you want Google to index your AJAX content, then you should use the #!key=value scheme, like this. That is what Google prefers for AJAX navigation.
If you really prefer the pretty HTML5 URL without #!, then yes, you can support both without indexing problems! Just add:
<link rel="canonical" href="preferredurl" />
to the <head> section of each page (for the initial load) to help Google know which version of the URL you would prefer it to index. Read more about canonical URLs here.
In that case the solution is very easy: use the first URL scheme, and don't use AJAX enhancements for older IE browsers.
If your precious users don't want to upgrade, that's their problem, but they can't complain about not having these kewl effects and dynamics.
You can throw a "Your browser is severely outdated!" notice for legacy browsers as well.
I would not use /#!/ in the URL. First make sure the page works normally, with full page requests (that should be easy). Once that works, you can check for the window.history object and, if it is present, add AJAX. The AJAX calls can go to the same URL; the main difference is the server-side handling. The check is simple: if the X-Requested-With header (HTTP_X_REQUESTED_WITH on the server) is set, then the request is an AJAX request; if it is not set, you're dealing with a standard request.
You don't need to worry about duplicate content, because GoogleBot does not set the HTTP_X_REQUESTED_WITH request header.
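A minimal sketch of that progressive enhancement on the client ('#content' is an assumed container id; jQuery's AJAX helpers set the header automatically, with a raw XMLHttpRequest you set it yourself):

// Only enhance when the History API is available; older IE falls
// back to normal full-page requests.
if (window.history && window.history.pushState) {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    links[i].onclick = function(e) {
      e.preventDefault();
      var href = this.href;
      var xhr = new XMLHttpRequest();
      xhr.open('GET', href);
      // Lets the server distinguish AJAX from standard requests.
      xhr.setRequestHeader('X-Requested-With', 'XMLHttpRequest');
      xhr.onload = function() {
        document.getElementById('content').innerHTML = xhr.responseText;
        history.pushState(null, '', href);
      };
      xhr.send();
    };
  }
}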

Twitter Cards using Backbone's HTML5 History

I'm working on a web app which uses Backbone's HTML5 History option. In order to avoid having to code everything on the client and on the server, I'm using this method to route every request to index.html.
I was wondering if there is a way to get Twitter Cards to work with this setup, as currently Twitter can't read the page because everything is loaded in dynamically with JavaScript.
I was thinking about using User Agents to detect whether it's the TwitterBot, and if it is, serving a static version of the page with the required meta-tags. Would this work?
Thanks.
Yes.
At one job we did this for all the SEO/search/Facebook stuff, etc.
We would sniff the user-agent, and if it was one of the following sniffers:
Facebook Open Graph
Google
Bing
Twitter
Yandex
(a few others I can't remember)
we would redirect to a special page that was written to dump all the relevant data about the page for SEO purposes into a nicely formatted (but completely unstyled) page.
This allowed us to retain our Google index position and proper Facebook sharing even though our site was a total single-page app in Backbone.
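A minimal sketch of that kind of sniffing, assuming an Express server; the user-agent substrings and the /seo route are illustrative:

var BOT_PATTERNS = ['facebookexternalhit', 'Googlebot', 'bingbot', 'Twitterbot', 'YandexBot'];

app.use(function(req, res, next) {
  var ua = req.get('User-Agent') || '';
  var isBot = BOT_PATTERNS.some(function(p) {
    return ua.indexOf(p) !== -1;
  });
  if (isBot) {
    // Send crawlers to a plain, unstyled page that dumps the same
    // titles, descriptions and meta tags the SPA would render.
    return res.redirect('/seo' + req.path);
  }
  next();
});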
Yes, serving a specific page for Twitterbot with the right metadata markup will work.
You can test your results while developing using the Cards preview tool:
https://dev.twitter.com/docs/cards/preview (with your static URL or just the tags).

#! hashtag and exclamation mark in links as folder?

How can I make my pages show like Grooveshark's pages:
http://grooveshark.com/#!/popular
Is there a tutorial or something that explains how to show pages this way with jQuery or JavaScript?
The hash and exclamation mark in a URL are called a hashbang, and are usually used in web applications where JavaScript is responsible for actually loading the page. Content after the hash is never sent to the server. So, for example, if you have the URL example.com/#!recipes/bread, the page at example.com would be fetched from the server; this could contain a piece of JavaScript. This script can then read location.hash and load the page at /recipes/bread.
Google also recognizes this URL scheme as an AJAX URL and will try to fetch the content from the server as it would be rendered by your JavaScript. If you're planning to make a site using this technique, take a look at Google's AJAX crawling documentation for webmasters. Also keep in mind that you should not rely on JavaScript being enabled, as Gawker learned the hard way.
The hashbang is going out of use on a lot of sites, even if JavaScript does the routing. This is possible because all major browsers support the History API. To do this, they make every path on the site return the same JavaScript, which then looks at the actual URL to load content. When the user clicks a link, JavaScript intercepts the click event, uses the History API to push a new page onto the browser history, and then loads the new content.
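A minimal sketch of both patterns; loadContent is a hypothetical helper that fetches a view by path and renders it:

// Hashbang fallback: example.com/#!recipes/bread -> /recipes/bread
if (location.hash.indexOf('#!') === 0) {
  loadContent('/' + location.hash.slice(2));
}

// History API routing: intercept link clicks and push a real URL.
document.addEventListener('click', function(e) {
  var a = e.target;
  if (a.tagName === 'A' && a.host === location.host) {
    e.preventDefault();
    history.pushState({}, '', a.pathname);
    loadContent(a.pathname);
  }
});

// Handle the back and forward buttons.
window.addEventListener('popstate', function() {
  loadContent(location.pathname);
});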

How can I give users a javascript widget to pull content securely from my site

I need to allow content from our site to be embedded in other users' web sites.
The content will be chargeable, so I need to keep it secure, but one of the requirements is that the subscribing web site only needs to drop some JavaScript into its page.
It looks like the only way to secure our content is to check that the URL of the page hosting our JavaScript matches the subscribing site. Is there any other way to do this, given that we don't know the client browsers that will be hitting the subscribing sites?
Is the best way to do this to supply a JavaScript include file that populates a known page element when the page loads? I'm thinking of using jQuery, so the include file would first pull in jQuery (checking if it's already loaded and using some sort of namespace protection), then on page load populate the given element.
I'd like to include a stylesheet as well if possible to style the element but I'm not sure if I can load this along with the javascript.
Does this sound like a reasonable approach? Is there anything else I should consider?
Thanks in advance,
Mike
It looks like the only way to secure our content is to check the url of the page hosting our javascript matches the subscribing site.
Ah, but in client-side or server-side code?
They both have their disadvantages. Doing it with server-side code is unreliable because some browsers won't pass a Referer header at all, and if you want to stop caches keeping a copy of the script (which would prevent the Referer check from taking place), you have to serve it with no-cache or Vary: Referer headers, which harms performance.
On the other hand, with client-side checks in the script you return, you can't be sure the environment you're running in hasn't been sabotaged. For example, if your inclusion script tag was like:
<script src="http://include.example.com/includescript?myid=123"></script>
and your server-side script looked up 123 as the ID of a customer using the domain customersite.foo, it might respond with the script:
if (location.host.slice(-16) === 'customersite.foo') {
    // main body of script
} else {
    alert('Sorry, this site is not licensed to include content from example.com');
}
Which seems simple enough, except that the including site might have replaced String.prototype.slice with a function that always returns 'customersite.foo'. Or various other functions used in the body of the script might be suspect.
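For illustration, the hostile page only needs one line to run before your script does:

// Now every slice() call "proves" the licensed domain,
// defeating the check above.
String.prototype.slice = function() { return 'customersite.foo'; };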
Including a <script> from another security context cuts both ways: the including site has to trust the source site not to do anything bad in its security context, like stealing end-user passwords or replacing the page with a big goatse; but equally, the source site's code is only a guest in the including site's potentially maliciously customised security context. So a measure of trust must exist between the two parties whenever one site includes script from another; domain checking will never be a 100% foolproof security mechanism.
I'd like to include a stylesheet as well if possible to style the element but I'm not sure if I can load this along with the javascript.
You can certainly add stylesheet elements to the document's head element, but you would need some strong namespacing to ensure they don't interfere with other page styles. You might prefer to use inline styles, for simplicity and to avoid specificity interference from the page's main style sheet.
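For what it's worth, a minimal sketch of injecting a stylesheet from the include script (the URL is hypothetical):

var link = document.createElement('link');
link.rel = 'stylesheet';
link.href = 'http://include.example.com/widget.css'; // made-up stylesheet URL
document.getElementsByTagName('head')[0].appendChild(link);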
It depends really whether you want your generated content to be part of the host page (in which case you might prefer to let the including site deal with what styles they wanted for it themselves), or whether you want it to stand alone, unaffected by context (in which case you would probably be better off putting your content in an <iframe> with its own styles).
I'm thinking of using jquery so the include file would first call in jquery
I would try to avoid pulling jQuery into the host page. Even with noConflict there are ways it can conflict with other scripts that are not expecting it to be present, especially complex scripts like other frameworks. Running two frameworks on the same page is a recipe for weird errors.
(If you took the <iframe> route, on the other hand, you get your own scripting context to play with, so it wouldn't be a problem there.)
You can store the user's domain and a key within your local database. Alternatively, the key can be an encrypted version of the domain, which saves you a database lookup. Either of these can determine whether you should respond to the request or not.
If the request is valid, you can send your data back out to the user. This data can indeed load in jQuery and an additional CSS reference.
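A minimal sketch of the lookup-free variant, using an HMAC of the domain rather than encryption; the Node crypto module is assumed, and the secret and function names are made up:

var crypto = require('crypto');
var SECRET = 'server-side-secret'; // hypothetical; never sent to clients

// Issue this key to the subscriber when they sign up.
function makeKey(domain) {
  return crypto.createHmac('sha256', SECRET).update(domain).digest('hex');
}

// On each widget request, recompute and compare; no database lookup needed.
function isValidRequest(domain, key) {
  return makeKey(domain) === key;
}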
Related:
How to load up CSS files using Javascript?
check if jquery has been loaded, then load it if false

Embedding web controls across sites - using IFrames/Javascript

I'm just looking for clarification on this.
Say I have a small web form, a 'widget' if you will, that gets data, does some client-side verification on it or other AJAX-y nonsense, and on clicking a button would direct to another page.
If I wanted this to be an 'embeddable' component, so other people could stick this on their sites, am I limited to basically encapsulating it within an iframe?
And are there any limitations on what I can and can't do in that iframe?
For example, the button that would take you to another page - would this load the content in the iframe? So would it need to exist outwith the iframe?
And finally, if the button the user clicked were to take them to an https page to verify credit-card details, are there any specific security no-nos that would stop this happening?
EDIT: For an example of what I'm on about, think about embedding either googlemaps or multimap on a page.
EDIT EDIT: Okay, I think I get it.
There are two ways.
One - embed in an IFrame, but this is limited.
Two - create a JavaScript API and ask the consumer to link to it. But this is a lot more complex for both the consumer and the creator.
Have I got that right?
Thanks
Duncan
There are plus points for both methods. I, for one, wouldn't use another person's JavaScript on my page unless I was absolutely certain I could trust the source. It's not hard to make a malicious script that submits the values of all input boxes on a page. If you don't need access to the contents of the page, then using an iframe would be the best option.
Buttons and links can be "told" to navigate the top or parent frame using the target attribute, like so:
<a href="http://some.url/with/a/page" target="_parent">This is a link</a>
<form action="http://some.url/with/a/page" target="_parent"><button type="submit">This is a button</button></form>
In this situation, since you're navigating away from the hosting page, the same-origin policy wouldn't apply.
In similar situations, widgets are generally iframes placed on your page. iGoogle and Windows Live Gadgets are (to my knowledge) hosted in iframes, and for very good reason - security.
If you are using AJAX, I assume you have a server written in C#, Java, or some other OO language.
It doesn't really matter what language; only the syntax will vary.
Either way, I would advise against the iframe method.
It will open up way too many holes or problems; for example, HTTP content within an HTTPS page (or vice versa) in an iframe will show a mixed-content warning.
So what do you do?
Do a server-side call to the remote site
Parse the response appropriately on the server
Return via AJAX what you need
Display returned content to the user
You know how to do the AJAX; just add a server-side call to the remote site.
Java:
import java.net.URL;
import java.net.URLConnection;

URL url = new URL("http://www.WEBSITE.com");
URLConnection conn = url.openConnection();
or
C#:
using System.Net;

HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.WEBSITE.com");
WebResponse res = req.GetResponse();
I think you want to get away from using inline frames if possible. Although they are sometimes useful, they can cause issues with navigation and bookmarking. Generally, if you can do it some other way than an iframe, that is the better method.
Given that you make an AJAX reference, a JavaScript pointer would probably be the best bet, i.e. embed what you need to do in script tags. Note that this is how Google embeds things such as Google Analytics and Google Ads. It also has the benefit of being pullable from a URL hosted by you; thus you can update the code and, voilà, it is active in all the web pages that use it. (Google usually uses version numbers as well, so they don't switch everyone when they make changes.)
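A minimal sketch of that kind of embed snippet, in the style of the Google Analytics loader; the host, file name, and version are hypothetical:

<script>
(function() {
  // Asynchronously pull the versioned widget script from the vendor's host.
  var s = document.createElement('script');
  s.async = true;
  s.src = 'http://widgets.example.com/widget-v1.js';
  var first = document.getElementsByTagName('script')[0];
  first.parentNode.insertBefore(s, first);
})();
</script>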
Regarding the credit card scenario, JavaScript is bound by the same-origin policy. For clarification, see http://en.wikipedia.org/wiki/Same_origin_policy
Added: Google Maps works in the same way, with some caveats such as a user/site key that explicitly identifies who is using the code.
Look into using something like jQuery and create a "plugin" for your component. It's just one way, and just a thought, but if you want to share the component with other folks, this is one of the things that can be done.
