Are React / JSX Generated HTML Elements Invisible to Google Web Crawlers? - javascript

I recently read an article about how HTML elements created with JavaScript are not picked up by Googlebot / Google crawlers. The reason given, in its simplest form, is that the HTML Googlebot picks up is only what is shown when you do View Page Source.
I'm about to start learning React, one of the reasons being that you can create template files and components, so that common features such as headers and footers can be reused easily to keep your code DRY.
It worries me, though, that if I were to do this, the React / JSX generated HTML would effectively not be seen by web crawlers, making it essentially invisible, which would create a number of potential negatives, not least inferior SEO.
My question therefore is: does HTML generated with React behave in the same way as HTML generated with vanilla JavaScript? I'm assuming it must, but I can't find any proper answers to this when googling.
Many thanks,
Emily.

React is isomorphic (also called "universal" or environment-agnostic): you can build client-side applications and also server-rendered applications.
Since you are already familiar with the client side, check out the server-side implementation in the tutorial below:
https://scotch.io/tutorials/react-on-the-server-for-beginners-build-a-universal-react-and-node-app
You can also check out the following boilerplates, which provide SSR:
https://github.com/erikras/react-redux-universal-hot-example
https://github.com/TimoRuetten/react-ssr-boilerplate
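To make the idea concrete, here is a minimal server-rendering sketch, assuming Express and a hypothetical App component (pre-compiled to plain JS); the /bundle.js path is illustrative:

```js
const express = require('express');
const React = require('react');
const ReactDOMServer = require('react-dom/server');
const App = require('./App'); // hypothetical component module

const app = express();

app.get('/', (req, res) => {
  // Render the component tree to an HTML string on the server,
  // so crawlers receive real markup instead of an empty shell.
  const html = ReactDOMServer.renderToString(React.createElement(App));
  res.send(`<!DOCTYPE html>
<html>
  <head><title>My App</title></head>
  <body>
    <div id="root">${html}</div>
    <script src="/bundle.js"></script>
  </body>
</html>`);
});

app.listen(3000);
```

The client bundle then mounts the same App component onto #root, so the server-rendered markup is picked up rather than replaced.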

Related

Using React (with Redux) as a component in a website

I have a large, globalised web site (not a web app) with 50k+ pages of content, rendered on a cluster of servers using quite straightforward NodeJS + Nunjucks to generate HTML. For 90% of the site this is perfectly acceptable, and necessary to achieve SEO visibility, particularly in non-Google search engines which don't index JS well (Yandex, Baidu, etc.).
The site has become a bit clunky as complexity has increased over time, and I'd like to re-architect some of the functional components, which are built mostly using progressively enhanced jQuery. I've been looking at React for this, with the Redux implementation of the Flux pattern.
Now my question is simply the following: nearly 100% of the tutorials assume I'm building some sort of SPA, which I'm not. I just want to build a set of containerised, reusable components that I can plug in to replace the jQuery components. Oh, and they have to be WCAG AA/508 accessible as well.
Does React play well with being retrofitted into websites and are there any specific considerations around SEO, bootstrapping, accessibility? Examples of implementations or tutorials would be appreciated.
You can mount a React component on any DOM node on your page, which makes it easy to insert components into statically generated content.
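For example, here is a minimal sketch of mounting a component into one node of an otherwise static page (CommentsWidget and the #comments-widget element are hypothetical; a bundler is assumed):

```js
var React = require('react');
var ReactDOM = require('react-dom');
var CommentsWidget = require('./CommentsWidget'); // hypothetical component

// The rest of the page stays as static HTML; React only owns this node.
ReactDOM.render(
  React.createElement(CommentsWidget, { threadId: 42 }),
  document.getElementById('comments-widget')
);
```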
Most search engines, Google included, will wait for JS files to load before indexing the page, so a page with React components will usually be indexed perfectly fine. However, if you want to be 100% sure that your page is rendered correctly for all crawling bots, you should take a look at React server rendering. If you already use NodeJS for the backend, it should not be a big problem.
I've never encountered that kind of problem, but my best guess would be to use ReactDOMServer.renderToString to render the component on the server and then replace a node in your static HTML layout. The implementation would depend on the template language you use. You could use something like Handlebars to dynamically create helpers from React components, so that in your static HTML page you could use them as {{my-component}}. But this is only my speculation on the subject; maybe there is a more elegant solution.
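To make that speculation concrete, a rough sketch of the helper idea might look like this (MyComponent and the template string are hypothetical):

```js
const Handlebars = require('handlebars');
const React = require('react');
const ReactDOMServer = require('react-dom/server');
const MyComponent = require('./MyComponent'); // hypothetical component

// Register a helper that renders the component to an HTML string;
// SafeString keeps Handlebars from escaping the markup.
Handlebars.registerHelper('my-component', function () {
  return new Handlebars.SafeString(
    ReactDOMServer.renderToString(React.createElement(MyComponent))
  );
});

const template = Handlebars.compile('<div class="sidebar">{{my-component}}</div>');
console.log(template({})); // static layout with the component rendered in place
```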
Here is the article that could help.
You'll be happy to know that this is all possible through something called isomorphic JavaScript. Basically, you use React and JSX to render HTML on the server, which is then sent to the browser as a fully built web page. This does not assume your app is an SPA; rather, you'll have multiple endpoints for rendering different pages, much like you probably already have.
The benefit here is that you can use the React/Redux architecture but still allow your site to be indexed by crawlers, as requests to your app will yield static pages, not a single page with lots of JS to make it work. You're also free to refactor gradually by converting your Nunjucks-rendered endpoints to React one at a time, instead of making a big jump to SPA land.
Here's a good tutorial I found on making isomorphic React apps with node:
https://strongloop.com/strongblog/node-js-react-isomorphic-javascript-why-it-matters/
EDIT: I may have misread your actual desire, which is to inject React components into your existing web pages. This is also possible: you'll probably want to use ReactDOMServer to render your components to static markup, and then you can inject that markup string into your Nunjucks templates.
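As a small sketch of that injection approach (assuming Nunjucks, with a hypothetical Widget component and page.html template):

```js
const nunjucks = require('nunjucks');
const React = require('react');
const ReactDOMServer = require('react-dom/server');
const Widget = require('./Widget'); // hypothetical component

nunjucks.configure('views');

// Render the component to plain markup (no React-internal attributes)...
const widgetHtml = ReactDOMServer.renderToStaticMarkup(React.createElement(Widget));

// ...and pass the string into the template, where page.html prints it
// unescaped with {{ widgetHtml | safe }}.
const page = nunjucks.render('page.html', { widgetHtml });
```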

React server-side render or static index.html?

As you can see in the React manual (ReactDOMServer):
If you call ReactDOM.render() on a node that already has this server-rendered markup, React will preserve it and only attach event handlers, allowing you to have a very performant first-load experience.
So does this mean that if I use a static index.html in which I just include my React app JS file, I don't have to use server-side rendering?
Btw, which React app architecture is better for SEO?
Thanks for your answers!
In theory, it's true that you can use a static index.html. React will render the page on the client side and update your HTML. This has become much easier to do with React 15, as you no longer need to maintain data-reactid attributes.
Nonetheless, I'd recommend using SSR (server side rendering) because it makes life easier. Granted, it takes effort to set up but it's beneficial. You also get to make use of server side routing, critical path css, and more.
If you want SEO, universal apps are the way to go. Two excellent architectures are:
React redux universal hot example
React starter kit
Good luck!
So does this mean that if I use a static index.html in which I just include my React app JS file, I don't have to use server-side rendering?
Of course. You can certainly use React purely on the client without any need for server rendering. However, server-side rendering can be beneficial for graceful degradation. It also helps from a usability perspective, as your users won't have to wait for JavaScript to be downloaded and executed before any content can be shown.
Btw, which React app architecture is better for SEO?
Search engines have significantly matured in their ability to crawl dynamic pages, but support for JavaScript-generated content is still a work in progress in most engines. As the Google Webmasters blog explains:
Sometimes the JavaScript may be too complex or arcane for us to execute, in which case we can’t render the page fully and accurately.
Some JavaScript removes content from the page rather than adding, which prevents us from indexing the content.
So from an SEO perspective, it is still better to opt for server-side rendering.

How do you use React.js for SEO?

Articles on React.js like to point out that React.js is great for SEO purposes. Unfortunately, I've never read how you actually do it.
Do you simply implement _escaped_fragment_ as in https://developers.google.com/webmasters/ajax-crawling/docs/getting-started and let React render the page on the server when the URL contains _escaped_fragment_, or is there more to it?
Not having to rely on _escaped_fragment_ would be great, as probably not all potentially crawling sites (e.g. sharing functionality) implement _escaped_fragment_.
I'm pretty sure anything you've seen promoting React as being good for SEO has to do with being able to render the requested page on the server, before sending it to the client. So it will be indexed just like any other static page, as far as search engines are concerned.
Server rendering is made possible via ReactDOMServer.renderToString. The visitor will receive the already-rendered page of markup, which the React application will detect once it has downloaded and run. Instead of replacing the content when ReactDOM.render is called, React will just attach the event bindings. For the rest of the visit, the React application will take over, and further pages will be rendered on the client.
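The client-side half of that flow is tiny; a minimal sketch (assuming a bundler with JSX support and the same hypothetical App component the server rendered into #root):

```js
import React from 'react';
import ReactDOM from 'react-dom';
import App from './App'; // hypothetical component

// #root already contains the server-rendered markup for <App />, so
// React detects the match and only attaches event handlers instead of
// rebuilding the DOM from scratch.
ReactDOM.render(<App />, document.getElementById('root'));
```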
If you are interested in learning more about this, I suggest searching for "Universal JavaScript" or "Universal React" (formerly known as "isomorphic react"), as this is becoming the term for JavaScript applications that use a single code base to render on both the server and client.
As the other responder said, what you are looking for is an isomorphic approach. This allows the page to come from the server with the rendered content that will be parsed by search engines. As another commenter mentioned, this might make it seem like you are stuck using Node.js as your server-side language. While it is true that having JavaScript run on the server is needed to make this work, you do not have to do everything in Node. For example, this article discusses how to achieve an isomorphic page using Scala and React:
Isomorphic Web Design with React and Scala
That article also outlines the UX and SEO benefits of this sort of isomorphic approach.
Two nice example implementations:
https://github.com/erikras/react-redux-universal-hot-example: Uses Redux, my favorite app state management framework
https://github.com/webpack/react-starter: Uses Flux, and has a very elaborate webpack setup.
Try visiting https://react-redux.herokuapp.com/ with JavaScript turned on and off, and watch the network in the browser dev tools to see the difference…
Going to have to disagree with a lot of the answers here, since I managed to get my client-side React app working with Googlebot with absolutely no SSR.
Have a look at the SO answer here. I only managed to get it working recently, but I can confirm that there are no problems so far and Googlebot can actually perform the API calls and index the returned content.
It is also possible via ReactDOMServer.renderToStaticMarkup:
Similar to renderToString, except this doesn't create extra DOM attributes such as data-reactid, that React uses internally. This is useful if you want to use React as a simple static page generator, as stripping away the extra attributes can save lots of bytes.
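As a rough sketch of that static-generator use (Page and the output path are hypothetical):

```js
const fs = require('fs');
const React = require('react');
const ReactDOMServer = require('react-dom/server');
const Page = require('./Page'); // hypothetical component

// No React-internal attributes are emitted, so the output is plain HTML
// that can be written to disk and served to crawlers as-is.
const html = '<!DOCTYPE html>' +
  ReactDOMServer.renderToStaticMarkup(React.createElement(Page, { title: 'Home' }));

fs.writeFileSync('dist/index.html', html);
```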
There is nothing you need to do if you care about your site's rank on Google, because Google's crawler can handle JavaScript very well! You can check your site's SEO results by searching site:your-site-url.
If you also care about your site's rank on engines such as Baidu, and your server side is implemented in PHP, you may need this: react-php-v8js.

Generate Static HTML From Client-Side JavaScript Generated Site

I'm generating an entire site using just an index.html with JS scripts.
The JS creates the HTML content based on JSON data received via the server-side API. This works great client-side and makes the site load and respond very fast, but there is a snag... when a crawler comes to index the page, it will see a blank page.
The obvious solution is to provide an XML sitemap with static versions of all the pages. The problem is... how do you generate static versions of each page when they are only generated client-side and all the logic and templates are client-side?
This is not a new issue... I'm sure anyone generating pages dynamically client-side has hit this issue and solved it, but I thought I'd ask the dev community before diving in and trying to solve it myself.
2019 Update
Tech has moved on significantly. I would encourage anyone looking to create SSR (server-side rendered) and client-side web apps in one isomorphic code base to take a look at the excellent Next.js.
Next.js wraps React with server-side routing and a rendering system built in Node.js, defines a standard interface for getting data for pages on both server and client, and comes with out-of-the-box features that make it one of the best choices (IMHO) for both SSR and CSR web applications.
Oh... and they have a great tutorial too!
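As a taste, a minimal Next.js page of that era might look like the sketch below (the /api/posts endpoint and the isomorphic-unfetch dependency are assumptions):

```jsx
// pages/index.js
import React from 'react';
import fetch from 'isomorphic-unfetch';

const Home = ({ posts }) => (
  <ul>
    {posts.map(post => (
      <li key={post.id}>{post.title}</li>
    ))}
  </ul>
);

// getInitialProps runs on the server for the first request (so crawlers
// see fully rendered content) and on the client for later navigation.
Home.getInitialProps = async () => {
  const res = await fetch('https://example.com/api/posts');
  const posts = await res.json();
  return { posts };
};

export default Home;
```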
2013 Answer
I've managed to generate static pages from the client-side output by using PhantomJS and capturing the HTML output after the page and all JS have finished loading/executing. This method is slower than I would like and unlikely to scale well, but it's the only option I can think of so far.
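The capture script itself can be very small; roughly this sketch (the URL and the fixed delay are illustrative, and a real setup would wait for an explicit "page ready" signal instead):

```js
// Run with: phantomjs snapshot.js
var page = require('webpage').create();

page.open('http://localhost:8080/#!/some-page', function (status) {
  if (status !== 'success') {
    phantom.exit(1);
  }
  // Give the client-side JS time to build the DOM, then dump the
  // fully rendered HTML so it can be saved as a static snapshot.
  setTimeout(function () {
    console.log(page.content);
    phantom.exit();
  }, 2000);
});
```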
The site already receives over 10,000 page views a day with over 8,000 unique visitors, so pages get updated regularly as new comments / posts are created; these changes are added to a queue which gets processed on a separate server to generate static pages with Phantom.
The only other way I can think of doing this is to create a Node.js process that uses the same jsRender library and builds HTML output from the template files based on some data, but this would be time-consuming to set up and would not generate exactly the same output that the dynamic site creates. Google may frown on me serving it static pages that don't really represent the dynamic version that "normal" visitors see.
This seems like an unsolvable issue. Either I generate the pages entirely server-side, or crawlers cannot index the pages. :(

Handlebars.js and SEO

I have read a great deal of discussion about JavaScript templating and search engine optimisation. Still, I haven't found a satisfying answer to the question (answers are either poorly documented or outdated).
Currently I am looking into Handlebars.js as a client-side template solution, because I love the possibility of creating helper functions. But what about indexing by search engines? Does the bot index the generated content (as intended) or only the source with the ugly JavaScript pseudo-variables? I know that there are lots of threads about this matter, but I feel that nobody knows the exact answer.
If engines like Google do not index these templates properly, why would one bother using this for public websites?
Another question in this context: is it possible to render Handlebars.js templates on the server side and then present them on the client side? Obviously to avoid this whole SEO discussion.
For client-side DOM crunching, most web bots (e.g. Google and others) don't interpret JS on the fly and parse the newly rendered content for indexing. Instead, Google (and now Bing) support the 'Google Ajax Crawling Scheme' (https://developers.google.com/webmasters/ajax-crawling/docs/getting-started), which basically states that IF you want JS-rendered DOM content to be indexed (e.g. AJAX call results), you must be able to:
Trigger the async JS rendering via the URL using hashbangs #! (i.e. http://www.mysite.com/#!my-state), and
Serve a prerendered DOM snapshot of your site AFTER JS modification, on request.
If you're using a client-side MVC framework like Backbone.js or Spine, you will need to provide this service if you want your web app indexed.
Generally this means intercepting requests made by the web bot (explained at the link above) and scraping your site server-side using a headless browser (e.g. QT + capybara-webkit, HtmlUnit, etc.), then delivering the generated DOM back to the requesting bot, as in the sketch below.
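In Node/Express terms (rather than the Ruby gem below), the interception step might look like this sketch, where getSnapshotFor is a hypothetical function backed by a headless browser or a snapshot cache:

```js
// Assumes an existing Express app; crawlers rewrite /#!my-state to
// /?_escaped_fragment_=my-state per the Ajax Crawling Scheme.
app.use(function (req, res, next) {
  var fragment = req.query._escaped_fragment_;
  if (fragment === undefined) {
    return next(); // normal visitor: serve the JS-driven page
  }
  // A crawler asked for a state: respond with the pre-rendered snapshot.
  getSnapshotFor(fragment).then(function (html) {
    res.send(html);
  });
});
```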
I've started a gem to do this in ruby (now taking pull requests) at https://github.com/benkitzelman/google-ajax-crawler
It does this as Rack middleware using capybara-webkit (and soon PhantomJS).
I don't know about Handlebars.js, but to my understanding search engines have problems with content generated in JavaScript. Make sure your content is visible to search engines (use a spider simulator for testing). Avoiding spider traps would generally be the way to go. Hope this helps.
Search engines don't run JavaScript, so if you want your content indexed you'll need to render your templates on the server as well. You can use Handlebars in Node (server-side JS) to render your template there when the page request comes from a spider. It's more work, but it's possible. GitHub, Google Plus, and Twitter all do something similar.
You could use Distal templates, which keep the templates as part of the HTML for SEO.
See Spiderable for a temporary solution the Meteor project (which uses Handlebars.js) uses for SEO purposes.
http://docs.meteor.com/#spiderable
Does the bot index the generated content (as intended) or only the source with the ugly JavaScript pseudo-variables?
Neither, because indexer bots don't run JavaScript and you don't serve up templates as HTML documents.
Build a site that works without JavaScript, then build on top of it.
