Recently I've seen articles stating that Google now crawls sites and renders CSS and JavaScript. Example article by Google themselves: http://googlewebmastercentral.blogspot.co.uk/2014/05/understanding-web-pages-better.html
I have a single-page application set up in Angular with HTML5 mode on the routing. An ng-view in my index.html is populated based on the URL, like so:
app.config(function($routeProvider, $locationProvider) {
  $locationProvider.html5Mode(true);
  $routeProvider.when("/", {
    templateUrl: "/views/dashboard.html"
  }).when("/portfolio", {
    templateUrl: "/views/portfolio.html"
  });
});
Google should now go to www.example.com/portfolio, execute the JavaScript that brings in the content from views/portfolio.html, and be able to read all that content, right?
That's what should happen according to those articles I've read. This one in particular explains it in detail regarding Angular: https://weluse.de/blog/angularjs-seo-finally-a-piece-of-cake.html
Here's the problem. When I use Google Webmaster Tools and the Fetch (or Fetch and Render) functionality to see how Google sees my pages, it doesn't render the JS and just shows the initial HTML of my index.html.
Is it working? Have I done something wrong? How can I test it?
So, as I mentioned in the comments, hopefully this answer gives more context to what I meant.
When you declare html5Mode, also include the hashPrefix:
$locationProvider
.html5Mode(true)
.hashPrefix('!');
Then, in your <head>, include this tag:
<meta name="fragment" content="!">
What happens here is that you are providing a fallback measure for the History API: all users visiting with compliant browsers (basically everything nowadays) will see this:
http://example.com/home/
And only on dinosaur browsers like IE9 would they see this:
http://example.com/#!/home/
Now, that is in real life with actual people as visitors. You asked specifically about being indexed by Google, which uses bots. They will try to go to example.com/home/ as an actual destination on your server (meaning /home/index.html), which obviously doesn't exist. By providing the <meta> tag above, you have given the bot a hint to instead request an _escaped_fragment_ version of the page (like index.html?_escaped_fragment_=home) and to associate it with the URL /home/ in the actual Google searches.
This all happens on the backend; every visitor to your site still sees the clean URL. It is only necessary because, under the hood, Angular uses location.hash, which is never sent to the server. Bottom line: your actual users are unaffected and won't see the ugly URL, unless they're on a browser that does not support the History API. For those users, all you've done is make the site work at all (before, it would have been broken for them).
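For illustration, the URL rewriting the bot performs can be sketched as a tiny helper (the function name is mine, not part of Angular; the mapping follows the AJAX-crawling convention described above):

```javascript
// Hypothetical sketch of the crawler-side URL rewrite.
// A bot that sees <meta name="fragment" content="!"> converts the pretty
// URL into its _escaped_fragment_ equivalent before requesting it.
function toEscapedFragment(url) {
  var parts = url.split('#!');
  if (parts.length < 2) {
    // html5Mode URL with no hash-bang: append an empty fragment parameter
    return url.replace(/\/?$/, '') + '/?_escaped_fragment_=';
  }
  // hash-bang URL: move the fragment into the query string, percent-encoded
  return parts[0] + '?_escaped_fragment_=' + encodeURIComponent(parts[1]);
}

console.log(toEscapedFragment('http://example.com/#!/home'));
// -> http://example.com/?_escaped_fragment_=%2Fhome
console.log(toEscapedFragment('http://example.com/home/'));
// -> http://example.com/home/?_escaped_fragment_=
```

Your server then answers those `_escaped_fragment_` requests with pre-rendered HTML, while real users keep the clean URLs.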
Hope this helps!
UPDATE
Since you are using a MEAN stack, you can also go a different direction that has been around a long time, which is to use HTML snapshots. There are npm packages that will generate snapshots (meaning static HTML captured post-render) that can be served up from your server at the locations shown. The technique is a little outdated, but it's been around since about 2012 and is proven to work.
Back when I did it, I used grunt-html-snapshot, but there are others out there. You can even use PhantomJS to make the snapshots, although I never did that myself.
Related
I'm setting up A/B testing using Google Analytics Content Experiments Without Redirects (the browser-only implementation), and the tests will be applied to product pages that use a single Django template. The pages are rendered using the standard render(request, template.html).
The problem is that the experiments are apparently per-page, so for each page that will be tested, a new experiment has to be created. Is that correct?
If yes, is there a workaround when using a single template? (It includes the same <script> in all pages, so the experiment code would be replicated in every page that uses the template, possibly causing issues with the analytics tracking.)
There are so many pages using that template, it would be difficult to create an experiment for each one.
I have recently taken an interest in software development. I have gotten pretty good by looking at source code and visiting Stack Overflow regularly. I have since taken a liking to web applications due to their scalability. Because of this, I wanted to look at the source code of Facebook and Google in a web browser by clicking "View Source".
Funnily enough, clicking "View Source" on Google and Facebook does NOT show HTML markup, but a full page of minified JavaScript (at least I think it's JS) instead. I have attached a screenshot to show what I mean. How does this work? From what I have learnt along the way, a browser requires HTML to display content. My assumption is that large companies do this to protect their source. But how does a browser know what to display? And if a browser can display these sites properly, what source code is it reading from?
I have tried to Google this, but search terms such as "can't view Facebook source" or "can't view Google source" show me a bunch of unrelated results.
Is this a framework I have not heard of? Can anyone provide an explanation of this? If these large companies are using these new methods, I would like to incorporate them into my own arsenal.
Screenshot of what is visible when you click "View Source" on the Google search results page:
It sounds like you don't yet understand the fundamentals of HTML and JavaScript and how they work together in a browser; consider this example from Wikipedia's JavaScript article:
<!DOCTYPE html>
<html>
  <head>
    <title>Example</title>
  </head>
  <body>
    <button id="hellobutton">Hello</button>
    <script>
      document.getElementById('hellobutton').onclick = function() {
        alert('Hello world!'); // Show a dialog
        var myTextNode = document.createTextNode('Some new words.');
        document.body.appendChild(myTextNode); // Append "Some new words" to the page
      };
    </script>
  </body>
</html>
This is HTML markup with embedded JavaScript (you can embed CSS as well). What some people do, for various reasons, is minify the JavaScript to the point where almost everything that makes it human-readable is removed; of course the JavaScript runtime (the browser) doesn't care and will execute it just as easily whether it's minified or not.
HTML itself can be minified too, but because HTML has no variables to rename, you can't really take it further than removing line breaks and spaces (collapsing it onto a single line) without breaking its syntax or semantics, the way you can with JavaScript.
Now consider the same example from Wikipedia, but minified (both the JavaScript and the HTML):
<!DOCTYPE html><html> <head> <title>Example</title> </head> <body> <button id="hellobutton">Hello</button> <script>document.getElementById("hellobutton").onclick=function(){alert("Hello world!");var e=document.createTextNode("Some new words.");document.body.appendChild(e)};</script> </body></html>
Both are equally valid for the browser.
Additional Info
All the client-side code needed to show you the website will always be visible to you. But recently, web developers have been inclined to use more and more JavaScript to add interactivity to their sites or to generate the HTML dynamically. In the latter case, you will often find that the 'View Source' page has nothing at all except a script tag. You can use your browser's Developer Tools to inspect this dynamically generated HTML, as well as the various JavaScript files that have been loaded for the site.
Keep in mind that developers can and often will minify their code, making the JavaScript difficult to read regardless of how you choose to inspect it.
The best place to see raw, unminified JavaScript, if you're interested, is in open-source web projects on places like GitHub.
The page's HTML is rendered by compiled/bundled JavaScript (Angular, React, Vue, etc.), so View Source does not help in this case. You can right-click and choose Inspect Element instead to see the DOM after the JavaScript has run.
The mess of JavaScript that you're looking at is the result of minification. Illegibility is indeed a by-product of minification, though its main purpose is to improve loading speed. Because Facebook and Google are two of the most high-traffic websites in the world, they have to employ a number of techniques in order to serve up content faster.
Minification is performed using a task runner like Grunt or Gulp, and essentially does a few things:
Renames variables like useful_name into equally valid short variables like e.
Eliminates all whitespace.
Rewrites many functions into equivalent shorter functions.
For example:
var array = [];
for (var i = 0; i < 20; i++) {
  array[i] = i;
}
Is equivalent to:
for(var a=[i=0];++i<20;a[i]=i);
Which obviously takes up far fewer bytes.
While minification does 'obfuscate' code, it does not improve security in the slightest, as the obfuscation can be completely reversed.
In addition to minification, it's also common practice to combine multiple JavaScript or CSS files into one, using bundlers like Browserify, Brunch or webpack. Because of this, it can be quite difficult to work out what the code is really doing, though this can be aided by pretty-printing the files: click the {} icon in the bottom left of the relevant source panel.
Other common load-speed techniques include using Content Delivery Networks (CDNs), cutting down on HTTP requests, and sprite mapping; both of the sites above do all of these in addition to their minification.
From what I have learnt along the way, a browser requires HTML to display content.
Check out this post about client side and server side rendering.
Google's front page actually is fully server-side rendered (all the HTML is present in "view source"); it's just that there's a lot of inlined JavaScript before the page body, and all of it is also minified.
Facebook uses JavaScript much more heavily; most parts are written in React (their frontend framework), which is why you will see barely any plain HTML when inspecting Facebook's source. As Jun said, you are, however, able to inspect it with your browser's inspector after the JavaScript has rendered all of it.
My Assumption is that large companies do this to protect their source.
Not really; there's no "protecting" frontend code. It's just that client-side rendering has become much more popular and everybody minifies their source for bandwidth savings. At such a large scale (Facebook and Google), every saved byte counts. Minified code might be harder to read, but nothing can actually be hidden or protected since, as you said, browsers need to render and execute the code client-side.
After reading this thread I decided to use the pushState API in my AngularJS application, which is fully API-based (independent frontend and independent backend).
Here is my test site: http://huyaks.com/index.html
I created a sitemap and uploaded it to Google Webmaster Tools.
From what I can see:
Google indexed the main page and the dynamic navigation (cool!), but did not index any of the dynamic URLs.
Please take a look.
I examined the example site given in the related thread:
http://html5.gingerhost.com/london
As far as I can see, when I directly access a particular page, the content that is presumed to be dynamic is returned by the server, and is therefore indexed. But that's impossible in my case, since my application is fully dynamic.
Could you, please, advise, what's the problem in my particular case and how to fix it?
Thanks in advance.
Note: this question is about the pushState way. Please do not advise me to use the escaped fragment or third-party services like prerender.io. I'd like to figure out how to use this approach.
Evidently Quentin didn't read the post you're referring to. The whole point of http://html5.gingerhost.com/london is that it uses pushState, and it proves that static HTML is not required for the benefit of spiders.
"This site uses HTML5 wizrdry [sic] to load the 'actual content' asynchronusly [sic] to the rest of the code: this makes it faster for users, but it's still totally indexable by search engines."
Dodgy orthography aside, this demo shows that asynchronously-loaded content is indexable.
As far as I can see, when I directly access a particular page the content which is presumed to be dynamic is returned by the server
It isn't. You are loading a blank page with some JavaScript in it, and that JavaScript immediately loads the content that should appear for that URL.
You need to have the server produce the HTML you get after running the JavaScript and not depend on the JS.
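For illustration, here is a minimal sketch of what "have the server produce the HTML" could look like (Express-style JavaScript; the route, titles and content below are hypothetical, not taken from the demo site):

```javascript
// Hypothetical server-side rendering sketch: the server returns finished
// markup for each URL, so a crawler gets content without running any JS.
function renderPage(city) {
  // In a real app this content would come from a database or a template engine.
  var content = {
    london: 'Facts about London...',
    paris: 'Facts about Paris...'
  };
  return '<!DOCTYPE html><html><head><title>' + city + '</title></head>' +
         '<body><div id="content">' + (content[city] || 'Not found') + '</div>' +
         '</body></html>';
}

// Wired into an Express app it would look roughly like:
// app.get('/:city', function (req, res) {
//   res.send(renderPage(req.params.city));
// });
```

Client-side JavaScript can still take over after the first load; the point is that the initial response already contains the content.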
Google does interpret Angular pages, as you can see on this quick demo page, where the title and meta description show up correctly in the search result.
It is very likely that if they interpret JS at all, they interpret it well enough for thorough link analysis.
The fact that some pages are not indexed is due to the fact that Google does not index every page they analyze, even if you add it to a sitemap or submit it for indexing in webmaster tools. On the demo page, both the regular and the scope-bound link are currently not being indexed.
Update: to answer the question specifically, there is no issue with pushState on the test site. Those pages simply do not contain value-adding content for Google (see their general guidelines).
Sray, I recently opened up the same question in another thread and was advised that Googlebot and Bingbot do index SPAs that use pushState. I haven't seen an example that secures my confidence, but it's what I'm told. To cover your bases as far as Facebook is concerned, use Open Graph meta tags.
I'm still not confident about pushing forward without sending HTML snippets to bots, but like you, I've found no tutorial telling how to do this while using pushState, or even suggesting it. But here's how I imagine it would work using Symfony2...
Use prerender or another service to generate static snapshots of all your pages. Store them somewhere accessible to your router.
In your Symfony2 routing file, create a route that matches your SPA. I have a test SPA running at localhost.com/ng-test/, so my route would look like this:
# Adding a trailing / to this route breaks it. Not sure why.
NgTestReroute:
    path: /ng-test/{one}/{two}/{three}/{four}
    defaults:
        _controller: DriverSideSiteBundle:NgTest:ngTestReroute
        'one': null
        'two': null
        'three': null
        'four': null
    methods: [GET]
In your Symfony2 controller, check the user agent to see if it's Googlebot or Bingbot. You should be able to do this with the code below, and then use this list to target the bots you're interested in: http://www.searchenginedictionary.com/spider-names.shtml
if (strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot")) {
    // what to do
}
If your controller finds a match with a bot, send it the HTML snippet. Otherwise, as in the case of my AngularJS app, just send the user the index page and Angular will correctly do the rest.
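The same user-agent check can be expressed as a small standalone helper (shown in JavaScript purely for illustration; the token list is an assumption you'd extend from the spider list above):

```javascript
// Hypothetical helper: decide whether a request comes from a search bot
// by looking for known tokens in the User-Agent header.
var BOT_TOKENS = ['googlebot', 'bingbot']; // extend from a current spider list

function isSearchBot(userAgent) {
  var ua = (userAgent || '').toLowerCase();
  return BOT_TOKENS.some(function (token) {
    return ua.indexOf(token) !== -1;
  });
}
```

A route handler would then branch on `isSearchBot(req.headers['user-agent'])`: snippet for bots, the normal index page for everyone else.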
Also, has your question been answered? If it has, please select an answer so I and others can tell what worked for you.
HTML snippets for AngularJS app that uses pushState?
I just did a proof of concept/demo for a web app idea I had, but that idea needs to be embedded on other pages to work properly.
I'm now done with the development of the demo, but now I have to tweak it so it works within a tag on any website.
The question here is:
How do I achieve this without breaking the main website's stylesheets and JavaScript?
It's a node.js/socket.io/angularjs/bootstrap based app for your information.
I basically have a small HTML file, a few css and js files and that's all. Any idea or suggestions?
If all you have is a script tag, and you want to inject UI/HTML/etc. into the host page, that means that an iframe approach may not be what you want (although you could possibly do a hybrid approach). So, there are a number of things that you'd need to do.
For one, I'd suggest you look into the general concept of a bookmarklet. While it's not exactly what you want, it's very similar. The problems of creating a bookmarklet will be very similar:
You'll need to isolate your JavaScript dependencies so that you don't load a library version that breaks the host page. jQuery, for example, can be loaded without it taking over the $ symbol globally, but not all libraries support that.
Any styles you use would also need to be carefully managed so as to not cause issues on the host page. You can load styles dynamically, but loading something like Bootstrap is likely going to cause problems on most pages that aren't using the exact same version you need.
You'll want your core JavaScript file to load quickly and do as much work asynchronously as possible, so as not to affect the overall page-load time (unless your functionality is needed up front). You'll want to review content like this from Steve Souders.
You could load your UI via a web service or you could construct it locally.
If you don't want to use JSONP style requests, you'll need to investigate enabling CORS.
You could use an iframe and PostMessage to show some UI without needing to do complex wrapping/remapping of the various application dependencies that you have. PostMessage would allow you to send messages to tell the listening iFrame "what to do" at any given point, while the code that is running in the host page could move/manipulate the iframe into position. A number of popular embedded APIs have used this technique over the years. I think DropBox was using it for example.
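As a minimal sketch of that postMessage handshake (all names, origins and actions below are hypothetical): the host page posts a message to the iframe, and the iframe routes it to the right behavior.

```javascript
// Host page side (sketch): position the iframe, then tell it what to do.
// var frame = document.getElementById('widget-frame');
// frame.contentWindow.postMessage(
//   JSON.stringify({ action: 'show' }), 'https://widget.example.com');

// Iframe side: a pure routing function, kept separate from the DOM wiring
// so the dispatch logic is easy to follow (it returns a description of the
// action it would take).
function handleWidgetMessage(rawData) {
  var msg = JSON.parse(rawData);
  switch (msg.action) {
    case 'show': return 'showing widget';
    case 'hide': return 'hiding widget';
    default:     return 'ignored';
  }
}

// Inside the iframe you would wire it up like:
// window.addEventListener('message', function (e) {
//   if (e.origin !== 'https://host.example.com') return; // verify the sender
//   handleWidgetMessage(e.data);
// });
```

Checking `e.origin` before acting on a message is the important part; without it any page could drive your widget.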
I am trying to save a couple of web pages by using a web crawler. Usually I prefer doing it with Perl's WWW::Mechanize module. However, as far as I can tell, the site I am trying to crawl has a lot of JavaScript on it, which seems hard to avoid. Therefore I looked into the following Perl modules:
WWW::Mechanize::Firefox
MozRepl
MozRepl::RemoteObject
The Firefox MozRepl extension itself works perfectly. I can use the terminal to navigate the web site just the way it is shown in the developer's tutorial, in theory at least. However, I have no idea about JavaScript and am therefore having a hard time using the modules properly.
So here is the source I'd like to start from: Morgan Stanley
For a couple of the firms listed beneath 'Companies - as of 10/14/2011', I'd like to save their respective pages. E.g. clicking on the first listed company (i.e. '1-800-Flowers.com, Inc'), a JavaScript function gets called with two arguments, dtxt('FLWS.O','2011-10-14'), which produces the desired new page. That page is what I'd like to save locally.
With Perl's MozRepl module I thought of something like this:
use strict;
use warnings;
use MozRepl;
my $repl = MozRepl->new;
$repl->setup;
$repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');
$repl->repl_enter({ source => "content" });
$repl->execute('dtxt("FLWS.O", "2011-10-14")');
Now I'd like to save the produced HTML page.
So again, the desired code should visit the HTML pages of a couple of firms and simply save each page. (Here are e.g. three firms: MMM.N, FLWS.O, SSRX.O)
1. Is it correct that I cannot get around the page's JavaScript functions and therefore cannot use WWW::Mechanize?
2. Following question 1, are the mentioned Perl modules a plausible approach to take?
3. Finally, if the first two questions can be answered with yes, it would be really nice if you could help me out with the actual coding. E.g. in the above code, the essential part that is missing is a 'save' command. (Maybe using Firefox's saveDocument function?)
The web works via HTTP requests and responses. If you can discover the proper request to send, then you will get the proper response.
If the target site uses JS to form the request, then you can either execute the JS, or analyse what it does so that you can do the same in the language that you are using.
An even easier approach is to use a tool that will capture the resulting request for you, whether the request is created by JS or not; then you can craft your scraping code to create the request that you want.
The "Web Scraping Proxy" from AT&T is such a tool. You set it up, then navigate the website as normal to get to the page you want to scrape, and the WSP will log all requests and responses for you. It logs them in the form of Perl code, which you can then modify to suit your needs.
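Once the proxy shows you the actual request behind dtxt('FLWS.O','2011-10-14'), rebuilding it is mechanical. Here is a purely hypothetical sketch (in JavaScript rather than the Perl the WSP emits; the path and parameter names are made up and must be replaced with whatever the proxy actually logged):

```javascript
// Hypothetical request builder: reconstruct the URL that the site's
// dtxt(ticker, date) function would request. The query parameters here
// are invented for illustration; substitute the ones from the proxy log.
function buildDisclosureUrl(ticker, date) {
  return 'http://www.morganstanley.com/eqr/disclosures/webapp/coverage' +
         '?ticker=' + encodeURIComponent(ticker) +
         '&date=' + encodeURIComponent(date);
}
```

With the real URL pattern in hand, a plain HTTP client (WWW::Mechanize included) can fetch and save each firm's page without executing any JavaScript at all.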