Generate Static HTML From Client-Side JavaScript Generated Site

I'm generating an entire site using just an index.html with JS scripts.
The JS creates the HTML content based on JSON data received from the server-side API. This works great client-side and makes the site load and respond very fast, but there is a snag... when a crawler comes to index the page, it will see a blank page.
The obvious solution is to provide an XML sitemap with static versions of all the pages. The problem is... how do you generate static versions of each page when they are only generated client-side and all the logic and templates are client-side?
This is not a new issue... I'm sure anyone generating pages dynamically client-side has hit this issue and solved it but I thought I'd ask the dev community before diving in and trying to solve this.

2019 Update
Tech has moved on significantly. I would encourage anyone looking to create SSR (server-side rendered) and client-side web apps in one isomorphic code base to take a look at the excellent Next.js.
Next.js wraps React with a server-side routing and rendering system built in Node.js, defines a standard interface for getting data for pages on both server and client, and comes with out-of-the-box features that make it one of the best choices (IMHO) for both SSR and CSR web applications.
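For a flavour of that model, here is a minimal sketch of a Next.js page; the isomorphic-unfetch dependency and the /api/posts endpoint are illustrative assumptions, not something from this question:

// pages/index.js - a minimal Next.js page sketch (illustrative data URL).
import fetch from 'isomorphic-unfetch';

function Home({ posts }) {
  return (
    <ul>
      {posts.map(post => <li key={post.id}>{post.title}</li>)}
    </ul>
  );
}

// Runs on the server for the first request (so crawlers receive real HTML)
// and on the client during navigation: one isomorphic code path.
Home.getInitialProps = async () => {
  const res = await fetch('https://example.com/api/posts');
  const posts = await res.json();
  return { posts };
};

export default Home;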
Oh... and they have a great tutorial too!
2013 Answer
I've managed to generate static pages from the client-side output by using PhantomJS and capturing the HTML output after the page and all its JS have finished loading/executing. This method is slower than I would like and unlikely to scale well, but it's the only option I can think of so far.
The site already receives over 10,000 page views a day from over 8,000 unique visitors, so pages get updated regularly as new comments/posts are created; these changes are added to a queue which gets processed on a separate server to generate static pages with Phantom.
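For reference, the capture step is roughly the sketch below; the snapshot.js filename and the fixed 2-second timeout are my own illustrative choices:

// snapshot.js - run as: phantomjs snapshot.js http://example.com/some-page
var page = require('webpage').create();
var system = require('system');
var url = system.args[1];

page.open(url, function (status) {
  if (status !== 'success') {
    console.error('Failed to load ' + url);
    phantom.exit(1);
  }
  // Crude but simple: give the client-side JS time to finish rendering.
  // A production version would poll for a "rendering done" flag instead.
  window.setTimeout(function () {
    console.log(page.content); // the fully rendered DOM, serialized as HTML
    phantom.exit(0);
  }, 2000);
});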
The only other way I can think of doing this is to create a Node.js process that uses the same jsRender library and builds HTML output from the template files based on the same data, but this would be time-consuming to set up and would not generate exactly the same output that the dynamic site creates. Google may frown on me serving it static pages that don't really represent the dynamic version that "normal" visitors can see.
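For completeness, that Node.js route would look something like this; a sketch assuming the jsrender npm package and an illustrative template string:

var jsrender = require('jsrender'); // npm install jsrender

// Compile the same template source the client uses ({{:...}} is JsRender syntax).
var tmpl = jsrender.templates('<h1>{{:title}}</h1><p>{{:body}}</p>');
var html = tmpl.render({ title: 'My post', body: 'Rendered on the server.' });

console.log(html); // <h1>My post</h1><p>Rendered on the server.</p>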
This seems like an unsolvable issue. Either I generate the pages entirely server-side, or crawlers cannot index the pages. :(

Related

Advice for a web application which references images/fonts/files/etc

I have been working on an interactive, static web application project at work, and I've completed everything I want, with a few tiny last hurdles. I've written everything in HTML, CSS and JS with no frameworks. Everything runs smoothly client-side, and I even put it on GitHub Pages and it works well there too. The issue is this: currently, I have some data hard-coded into a string in the JS file, and I would rather the JS reference some kind of spreadsheet (.csv, .xlsx or Google Sheet) or text file with the data.
I know that this violates security policies when the user does not directly select the file (via input), so I was thinking about using Google Sites (which, if I'm not mistaken, can reference an embedded Google Sheet?). However, I'm fairly certain that this doesn't give me the freedom to use everything that I want (i.e. images, custom fonts, as well as CSS and JS).
All of the code is in this repo, but I don't think that showing code is helpful to what I'm asking.
I guess the real question is, we want the program to be continuable for the future, and that people other than me (who don't necessarily know how the code works) can update the data. Bottom line, we don't want that data hard-coded, and we need a solution that encompasses everything in the directory.
Side note, I would like to try to stay away from servers and databases, but if that might be my only solution, then I guess I have no choice.
Security policies are only an issue if your web application is being run over the file:// protocol. So long as you have any kind of minimal web server to run it (the app being static makes this easier), or host the app on GitHub Pages, you'll be able to pull your data files from relative paths via Ajax.
Using CSV is probably the most straightforward choice here: virtually every spreadsheet application can import/export the format, and the overhead of parsing it on the JS side is minimal.
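A sketch of what that looks like, assuming a data.csv sitting next to index.html and no quoted fields with embedded commas (use a real parser such as Papa Parse for messier data):

fetch('data.csv')
  .then(function (response) { return response.text(); })
  .then(function (text) {
    var rows = text.trim().split('\n').map(function (line) {
      return line.split(','); // naive split: fine for simple, unquoted CSV
    });
    var headers = rows[0];
    var records = rows.slice(1).map(function (cells) {
      var record = {};
      headers.forEach(function (h, i) { record[h] = cells[i]; });
      return record;
    });
    console.log(records); // array of objects keyed by the header row
  });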

Are React / JSX Generated HTML Elements Invisible to Google Web Crawlers?

I recently read an article about how HTML elements created with JavaScript are not picked up by the Googlebot / Google crawlers. The reason being, in its simplest form, that the HTML the Googlebot picks up is everything shown when you do View Page Source.
I'm about to start learning React, one of the reasons being that you can create template files and components, so common features such as headers and footers can be duplicated easily to keep your code DRY.
It worries me though that if I was to do this, the React / JSX generated HTML would effectively not be tracked by web crawlers, thus making it essentially invisible, which would create a large number of potential negatives, not least, inferior SEO.
My question therefore is: does HTML generated with React behave in the same way as HTML generated with vanilla JavaScript? I'm assuming it must, but I can't find any proper answers to this when googling.
Many thanks,
Emily.
React is isomorphic (also called universal, or environment-agnostic): you can build client-side applications and also server-side applications with it.
As you are already aware of the client side, check out the server-side implementation in the tutorial below:
https://scotch.io/tutorials/react-on-the-server-for-beginners-build-a-universal-react-and-node-app
You can also check out the following boilerplates, which provide SSR (a minimal render-to-string sketch follows the links):
https://github.com/erikras/react-redux-universal-hot-example
https://github.com/TimoRuetten/react-ssr-boilerplate
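To make the idea concrete, here is a minimal render-to-string sketch using Express and ReactDOMServer; the component, route, and port are illustrative:

const express = require('express');
const React = require('react');
const ReactDOMServer = require('react-dom/server');

// An illustrative component; a real app would render its page tree here.
function App(props) {
  return React.createElement('h1', null, 'Hello, ' + props.name);
}

const app = express();
app.get('/', function (req, res) {
  // The crawler (and every first-time visitor) receives finished HTML.
  const html = ReactDOMServer.renderToString(
    React.createElement(App, { name: 'Googlebot' })
  );
  res.send('<!doctype html><html><body><div id="root">' + html + '</div></body></html>');
});
app.listen(3000);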

PHP template system vs javascript AJAX template

My PHP template looks like this:
$html = file_get_contents("/path/to/file.html");
$replacements = array(
    "{title}" => "Title of my webpage",
    "{other}" => "Other information",
    // ...
);
foreach ($replacements as $search => $value) {
    $html = str_replace($search, $value, $html);
}
echo $html;
I am considering switching to a JavaScript/Ajax template system. The Ajax will fetch the $replacements array in JSON format and then I'll use JavaScript to replace the HTML.
The page would then be a plain .html file and a loading screen would be shown until the ajax was complete.
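The client side of that plan might look like the sketch below; the /template-data.php endpoint is an illustrative name, and the modern fetch API is used for brevity:

fetch('/template-data.php')
  .then(function (response) { return response.json(); })
  .then(function (replacements) {
    // The same idea as the PHP str_replace loop, run in the browser.
    var html = document.body.innerHTML;
    Object.keys(replacements).forEach(function (placeholder) {
      html = html.split(placeholder).join(replacements[placeholder]);
    });
    document.body.innerHTML = html;
  });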
Are there any real advantages to this, or is the transition a waste of time?
A few of the reasons I think this will be beneficial:
The page will still load even if the MySQL or PHP services are down. If the Ajax fails, I can handle it with an error message.
Bot traffic (and anything else that doesn't run JS) will cause very little load on my server, since the Ajax request will never be sent.
Please let me know what your thoughts are.
My two cents: it is better to do the logic on the template side (JavaScript). If you have a high-traffic site you can offload some of the processing to each computer calling the site. Maybe fewer servers.
With JavaScript frameworks like AngularJS, the template stuff is pretty simple and efficient. And the framework will do caching for you.
Yes, SEO can be an issue with certain sites. There are proxy tools you can put in place that will render the site and return the static HTML to the bot. Plus, I think some bots render JavaScript these days.
Lastly, I like to template on the front-end because I like the backend to be a generic data provider (RESTful API). This way I can build a generic backend that drives web / mobile and other platforms in a generic way. The UI logic can be its separate thing in javascript.
But it comes down to the design needs of your application. I build lots of Software as a service applications so a single page application works well for me.
I've worked with a similar design pattern in other projects. There are several ways to do this, and the task would involve managing multiple projects or application modules. I assume you are working with a team of developers and not using either a PHP or JavaScript MVC framework.
PHP Template
For many reasons, I'm against using the "search and replace" method, especially using a server-side scripting language to parse HTML documents as a templating kit.
Why?
As you maintain business rules and the project becomes larger, you will find yourself reading through a long list of regular expressions, parsing HTML into a DOM, and/or writing complicated algorithms for searching nodes to replace with the correct text(s).
If you had a placeholder, such as {title}, that would help the script to have fewer search-and-replace expressions, but the design pattern could lead to messy sharing between multiple developers.
It is OK to parse one or two HTML files to manage the output, but not the entire template. The network response could be slower with multiple and repetitive trips to the server, and that's just for the template. Other scripts could also be making trips to the server for different reasons unrelated to the template.
AJAX/JavaScript
Initially, AJAX with JavaScript might sound like a neat idea but I'm still not convinced.
Why?
You can't assume the web browser is JavaScript-enabled on every mobile or desktop device. You might need to structure the HTML template in a few ways to manage the output for non-JavaScript browsers, and might need to include <noscript> and/or <iframe> tags on every page. Managing an alternative template for non-JavaScript browsers can be tedious.
Every web browser interprets JavaScript differently. Most developers know a few of the differences between IE, Firefox, Chrome, and Safari, to name a few. You might need to create multiple JavaScript files to detect and then load the JavaScript for that specific web browser. Update one feature, and you have to update the script for all web browsers.
JavaScript is visible in the page source. I wouldn't want to expose confidential JavaScript functions that might include credentials, leak sensitive data about web services, and/or contain SQL queries. The idea is to secure your page as much as possible.
I'm not saying both are impossible. You could still do either PHP or JavaScript for templating.
However, my "ideal" web structure would consist of a reliable MVC framework like Zend, Spring, or Magnolia. Those MVC frameworks include many useful features such as web services, data mapping, and templating kits. Granted, it's difficult for beginners, given the configuration requirements, to integrate an MVC framework into a project. But in the end, you can delegate tasks in configuration, MVC concepts, custom SQL queries, and test cases to developers. That's my two cents.
I think the most important aspects you forgot are:
SEO: What about search engine bots? They won't be able to index your content if it is set by JavaScript only.
Execution and Network Latency: When your service is working, the browser will wait until the page is loaded (let's say 800 ms) before making the extra Ajax calls to get your values. This might add an extra 500 ms (depending on network speed and geographic location...). If you had sent all the generated data from your server, you would have spent only ~1 ms more preparing the complete response, and the user would not be left waiting on a blank page.
Caching: You could cache the generated pages in your web app; that way your load will be minimized as well. And if you still want to deliver content while your backend services (MySQL/PHP...) are down, you could even use Apache or Nginx caching.
But I guess it really depends on what you want to do.
For fast and simple pages, which seems to be your case, stick with backend enhancements.
For a dynamic/interactive app which can afford loading times, and doesn't care about SEO, you can delegate most things to the front-end. But then use an advanced framework like Angular, to handle templating, caching, etc...

Handlebars.js and SEO

I have read a great deal of discussion about JavaScript templating and search engine optimization. Still, I haven't found a satisfying answer to the question (what I did find was either poorly documented or outdated).
Currently I am looking into Handlebars.js as a client-side template solution, because I love the possibility of creating helper functions. But what about indexing by search engines? Does the bot index the generated content (as intended) or only the source with the ugly JavaScript pseudo-variables? I know that there are lots of threads going on about this matter, but I feel that nobody exactly knows the answer.
If engines like Google would not index these templates properly, why would one bother using this for public websites?
Another question within this context: is it possible to render Handlebars.js templates on the server side and then present them on the client side? Obviously to avoid all this SEO discussion.
For DOM crunching on the client side, most web bots (e.g. Google and others) don't interpret JS on the fly and parse the newly rendered content for indexing. Instead, Google (and now Bing) support the 'Google Ajax Crawling Scheme' (https://developers.google.com/webmasters/ajax-crawling/docs/getting-started), which basically states that IF you want JS-rendered DOM content to be indexed (e.g. Ajax call results), you must be able to do two things (a minimal sketch follows them):
Trigger the async js rendering via the url using hashbangs #! (i.e. http://www.mysite.com/#!my-state), and
Be able to serve a prerendered dom snapshot of your site AFTER js modification on request.
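A minimal Express sketch of the scheme (the crawler rewrites a #! URL into a ?_escaped_fragment_= request); renderSnapshot() is a hypothetical stand-in for a headless-browser capture:

const express = require('express');
const app = express();

function renderSnapshot(url, callback) {
  // Hypothetical stand-in: a real version would load the #! URL in a
  // headless browser (PhantomJS, capybara-webkit, HtmlUnit, ...) and
  // serialize the post-JS DOM.
  callback(null, '<html><body>Prerendered snapshot of ' + url + '</body></html>');
}

app.use(function (req, res, next) {
  var fragment = req.query._escaped_fragment_;
  if (fragment === undefined) return next(); // normal browser: serve the JS app
  renderSnapshot(req.path + '#!' + fragment, function (err, html) {
    if (err) return next(err);
    res.send(html); // the bot indexes the rendered DOM
  });
});

app.listen(3000);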
If you are using a client-side MVC framework like Backbone.js or Spine, you will need to provide this service if you want your web app indexed.
Generally this means you intercept a request made by the web bot (explained in the link above), and scrape your site server-side using a headless browser (e.g. Qt + capybara-webkit, HtmlUnit, etc.), then deliver the generated DOM back to the requesting bot.
I've started a gem to do this in Ruby (now taking pull requests) at https://github.com/benkitzelman/google-ajax-crawler
It does this as Rack middleware using capybara-webkit (and soon PhantomJS).
I don't know about Handlebars.js specifically, but to my understanding search engines have some problems with content generated in JavaScript. Make sure your content is visible to search engines (use a spider simulator for testing). Avoiding spider traps would generally be the way to go. Hope this helps.
Search engines don't run JavaScript, so if you want to have your content indexed you'll need to render your templates on the server as well. You can use Handlebars in Node (server-side JS) to render your template there when the page request comes from a spider. It's more work, but it's possible. GitHub, Google+, and Twitter all do something similar.
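A minimal sketch of that server-side step, assuming the handlebars npm package and an illustrative template:

var Handlebars = require('handlebars'); // npm install handlebars

var source = '<h1>{{title}}</h1><ul>{{#each posts}}<li>{{this}}</li>{{/each}}</ul>';
var template = Handlebars.compile(source);

// The spider's request gets finished HTML instead of template variables.
var html = template({ title: 'My blog', posts: ['First post', 'Second post'] });
console.log(html);
// <h1>My blog</h1><ul><li>First post</li><li>Second post</li></ul>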
You could use Distal templates, which keep the templates as part of the HTML, for SEO.
See Spiderable for a temporary solution the Meteor project (which uses Handlebars.js) uses for SEO purposes:
http://docs.meteor.com/#spiderable
Does the bot index the generated content (as intended) or only the source with the ugly javascript pseudo-variables?
Neither, because indexer bots don't run JavaScript and you don't serve up templates as HTML documents.
Build a site that works without JavaScript, then build on top of it.

Dual-Side Templating vs. Server-Side DOM Manipulation

I'm making an app that requires dynamic content to be fully rendered on the page for search engine bots - a potential problem should I use JS templating to control the content. Web spiders are supposedly getting better at indexing RIA sites, but I don't want to risk it. Also, as mobile internet is still spotty in most places, it seems like good practice to put as much of the initial load on the server as possible, to ensure that basic functionality/styles/dynamic content show up on your pages even if the client hasn't downloaded any JS libraries.
That's how I stumbled upon dual-side templating:
Problem: How can you allow for dynamic, Ajax-style, rendering in the browser, but at the same time output it from the server upon initial page load?
c. 2010, Dual-Side Templating: A single template is used on both browser and server, to render content wherever it's appropriate – typically the server as the page loads and the browser as the app progresses. For example, blog comments: you output all existing comments from the server, using your server-side template. Then, when the user makes a new comment, you render a preview of it – and the final version – using browser-side templating.
I want to try dual-side templating with Node.js and Eco templates, but I don't know how to proceed. I'm new to JavaScript and all things Node.
Node-Lift is said to help, but I don't understand what it's doing or why.
Can someone provide a high level overview of how you might use dual-templating in the context of a mobile web app?
Where does server-side DOM manipulation with jQuery and JSDOM fit into the equation?
TIA
Dav Glass gave a great talk about this last year: http://www.youtube.com/watch?v=bzCnUXEvF84
And here is a blog article that goes over some of the details: http://www.yuiblog.com/blog/2010/04/09/node-js-yui-3-dom-manipulation-oh-my/
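At a high level, the core trick is a template module that loads in both environments. A hand-written sketch (Eco compiles templates down to functions shaped roughly like this; all names are illustrative, and real code should HTML-escape user input):

// templates.js - usable from Node via require() and from the browser via a
// plain <script> tag, which exposes window.templates.
(function (exports) {
  exports.renderComment = function (comment) {
    return '<li class="comment"><b>' + comment.author + '</b>: ' +
      comment.text + '</li>';
  };
})(typeof module !== 'undefined' ? module.exports : (window.templates = {}));

// Server: render existing comments into the initial page, so crawlers and
// JS-less clients see them:
//   var templates = require('./templates');
//   var listHtml = comments.map(templates.renderComment).join('');
// Browser: render a newly posted comment with the exact same function:
//   list.insertAdjacentHTML('beforeend', window.templates.renderComment(newComment));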
