I'm wondering if it's possible to load a website like the Twitter search results page using Node.js with jsdom and still be able to detect any changes in the DOM after the initial render.
Example:
Initial page
Modified page
Thanks for your help.
If you want to detect what changed in the DOM tree you will need to understand Twitter's minified JavaScript code and properly intercept the events or callbacks. I don't recommend this approach, because JS code that works perfectly in a browser may break when run under jsdom. For instance, when I was trying to do some server-side scraping with jQuery I got stuck with this bug, which in most browsers is definitely not an issue.
For the time being, what I recommend is that you consume the Search API directly from Twitter, which is a more straightforward approach.
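For illustration, here is a minimal Node.js sketch of that idea. It assumes Twitter's v1.1 search/tweets endpoint and an application-only bearer token supplied via an environment variable; the endpoint and authentication requirements have changed over time, so treat this as a starting point rather than a finished integration.

    // Hedged sketch: query Twitter's search API instead of scraping the rendered page.
    // Assumes the v1.1 search/tweets endpoint and a bearer token in TWITTER_BEARER_TOKEN.
    var https = require('https');

    var query = encodeURIComponent('node.js');
    var options = {
      hostname: 'api.twitter.com',
      path: '/1.1/search/tweets.json?q=' + query,
      headers: { Authorization: 'Bearer ' + process.env.TWITTER_BEARER_TOKEN }
    };

    https.get(options, function (res) {
      var body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        var results = JSON.parse(body);
        // Each entry in "statuses" is a tweet object with user and text fields.
        results.statuses.forEach(function (tweet) {
          console.log(tweet.user.screen_name + ': ' + tweet.text);
        });
      });
    }).on('error', console.error);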
Hope it helps!
Certainly the best approach is to leverage the Search API (or whatever relevant API) from Twitter to accomplish this.
But if, for some other project, you want to host a full browser environment (JS + dom + renderer) that remains alive and lets you interact with it, the PhantomJS project might work:
http://code.google.com/p/phantomjs/
It hosts a WebKit instance for automation.
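As a rough illustration of what that looks like, here is a small PhantomJS sketch that snapshots the rendered DOM and compares it again after a delay. The URL and the fixed 5-second wait are assumptions; a real script would poll or hook into page callbacks instead.

    // Sketch: load a page in PhantomJS's headless WebKit, snapshot the DOM,
    // then check again after a delay to see whether client-side code changed it.
    var page = require('webpage').create();

    page.open('https://twitter.com/search?q=javascript', function (status) {
      if (status !== 'success') {
        console.log('Failed to load page');
        phantom.exit(1);
        return;
      }

      var initialHtml = page.evaluate(function () {
        return document.body.innerHTML;
      });

      window.setTimeout(function () {
        var currentHtml = page.evaluate(function () {
          return document.body.innerHTML;
        });
        console.log(currentHtml === initialHtml
          ? 'No DOM changes detected'
          : 'DOM changed after the initial render');
        phantom.exit();
      }, 5000); // arbitrary wait; adjust or poll as needed
    });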
Referring to this ticket here: https://github.com/zeit/next.js/issues/4210
I'm currently wondering why, when you disable JavaScript, most of the content built with Relay Modern and Next.js does not work.
My initial guess is that since Next.js is a React library for server-side rendering, if JavaScript is disabled in Chrome then obviously React does not work on the client. However, Next.js renders on the server, so disabling JavaScript on the client side should not be a problem. Why does this problem still occur?
In a modern SSR scenario, as in an isomorphic application, only the first render is done by the server, which returns plain HTML content together with the JS bundles that will be used for subsequent renders.
If the browser has JavaScript disabled you should see only that first render as a static page, since all the browser does is display the plain HTML content; but you won't be able to interact with the page beyond that (that would require JS to be enabled).
While @Karim's answer is correct, it is worth pointing out that a user can, technically speaking, partially interact with the page if you use "progressive enhancement". In this case, you use native HTML features to perform operations such as navigating to other pages, submitting forms, etc. Those do not require JS to be enabled to work correctly. Depending on your target audience, this might be more pain than it's worth.
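As a small illustration of that idea in a Next.js page (the /api/subscribe route is hypothetical): the form below is plain HTML, so it still submits with JavaScript disabled; with JavaScript enabled you could additionally intercept onSubmit and handle it client-side.

    // Progressive enhancement sketch: a server-rendered page whose form works without JS.
    import React from 'react';

    export default function Subscribe() {
      return (
        <form method="POST" action="/api/subscribe">
          <input type="email" name="email" required />
          <button type="submit">Subscribe</button>
        </form>
      );
    }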
Articles on React.js like to point out that React.js is great for SEO purposes. Unfortunately, I've never read how you actually do it.
Do you simply implement _escaped_fragment_ as in https://developers.google.com/webmasters/ajax-crawling/docs/getting-started and let React render the page on the server when the URL contains _escaped_fragment_, or is there more to it?
Not having to rely on _escaped_fragment_ would be great, as probably not all crawlers (e.g. those behind sharing functionality) implement _escaped_fragment_.
I'm pretty sure anything you've seen promoting React as being good for SEO has to do with being able to render the requested page on the server, before sending it to the client. So it will be indexed just like any other static page, as far as search engines are concerned.
Server rendering is made possible via ReactDOMServer.renderToString. The visitor receives the already rendered markup, which the React application will detect once it has downloaded and run. Instead of replacing the content when ReactDOM.render is called, React will just attach the event bindings. For the rest of the visit, the React application takes over and further pages are rendered on the client.
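A minimal sketch of that flow with Express (the App component, the /bundle.js path, and the root element id are assumptions; a real setup also needs a build step for the client bundle):

    // Server-side rendering sketch: render the app to a string and embed it in the HTML shell.
    var express = require('express');
    var React = require('react');
    var ReactDOMServer = require('react-dom/server');
    var App = require('./App');          // your root component (assumed)

    var server = express();

    server.get('*', function (req, res) {
      var markup = ReactDOMServer.renderToString(React.createElement(App));
      res.send(
        '<!doctype html><html><body>' +
        '<div id="root">' + markup + '</div>' +
        '<script src="/bundle.js"></script>' +   // client bundle that renders into #root
        '</body></html>'
      );
    });

    server.listen(3000);

On the client, the bundle calls ReactDOM.render (or ReactDOM.hydrate in newer React versions) against that same #root element, which attaches the event bindings to the existing markup instead of re-rendering it from scratch.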
If you are interested in learning more about this, I suggest searching for "Universal JavaScript" or "Universal React" (formerly known as "isomorphic react"), as this is becoming the term for JavaScript applications that use a single code base to render on both the server and client.
As the other responder said, what you are looking for is an isomorphic approach. This allows the page to come from the server with the rendered content that will be parsed by search engines. As another commenter mentioned, this might make it seem like you are stuck using node.js as your server-side language. While it is true that having JavaScript run on the server is needed to make this work, you do not have to do everything in Node. For example, this article discusses how to achieve an isomorphic page using Scala and React:
Isomorphic Web Design with React and Scala
That article also outlines the UX and SEO benefits of this sort of isomorphic approach.
Two nice example implementations:
https://github.com/erikras/react-redux-universal-hot-example: Uses Redux, my favorite app state management framework
https://github.com/webpack/react-starter: Uses Flux, and has a very elaborate webpack setup.
Try visiting https://react-redux.herokuapp.com/ with javascript turned on and off, and watch the network in the browser dev tools to see the difference…
I'm going to have to disagree with a lot of the answers here, since I managed to get my client-side React app working with Googlebot with absolutely no SSR.
Have a look at the SO answer here. I only managed to get it working recently, but I can confirm that there have been no problems so far and Googlebot can actually perform the API calls and index the returned content.
It is also possible via ReactDOMServer.renderToStaticMarkup:
Similar to renderToString, except this doesn't create extra DOM attributes such as data-react-id, that React uses internally. This is useful if you want to use React as a simple static page generator, as stripping away the extra attributes can save lots of bytes.
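For example, a rough sketch of using it as a static page generator (the Page component and output file are just placeholders):

    // Static-site sketch: render a component to plain HTML with no React-specific attributes.
    var fs = require('fs');
    var React = require('react');
    var ReactDOMServer = require('react-dom/server');

    function Page(props) {
      return React.createElement('html', null,
        React.createElement('body', null,
          React.createElement('h1', null, props.title)));
    }

    var html = '<!doctype html>' +
      ReactDOMServer.renderToStaticMarkup(React.createElement(Page, { title: 'Hello' }));

    fs.writeFileSync('index.html', html);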
There is nothing you need to do if you only care about your site's rank on Google, because Google's crawler can handle JavaScript very well! You can check how your site is indexed by searching for site:your-site-url.
If you also care about your rank on search engines such as Baidu, and your server side is implemented in PHP, you may need this: react-php-v8js.
I want to analyse some data from a web page, but here's the problem: the site has additional pages which get loaded via a __doPostBack function.
How can I "simulate" navigating one page further and analyse that page, and so on?
At the moment I analyse the data with JSoup in Java, but I'm open to using another language if necessary.
A postback-based system (.NET, Prado/PHP, etc.) works by keeping a complete snapshot of the browser contents on the server side. This is called a page state. Any attempt to manipulate it with a client that is not JavaScript-capable is almost sure to fail.
What you need is a JavaScript-capable browser. The easiest solution I found is to use the framework Firefox is written in, XUL, to create a desktop application with a single browser element in it, which you can then script from the application itself without the restrictions of the security container. Alternatively, you could also use the Greasemonkey plugin to do your bidding. The latter is a bit easier to get started with, but it's fairly limited since it runs on a per-page basis.
With both solutions you then have access to the page's DOM to gather data and you can also fire events (like clicking on a button). Unfortunately you have to learn JavaScript for this to work.
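To make the Greasemonkey route a bit more concrete, here is a hedged sketch of a userscript that scrapes the current page and then triggers the next postback. The table id, the __doPostBack arguments, and the page limit are all hypothetical; you would need to read the real control names out of the page source.

    // Userscript sketch: collect rows, then fire the ASP.NET postback for the next page.
    // The page reloads after the postback, so this script runs again on the new content.
    (function () {
      var rows = document.querySelectorAll('#GridView1 tr');        // hypothetical table id
      var collected = JSON.parse(localStorage.getItem('scraped') || '[]');
      for (var i = 0; i < rows.length; i++) {
        collected.push(rows[i].textContent.trim());
      }
      localStorage.setItem('scraped', JSON.stringify(collected));

      var page = parseInt(localStorage.getItem('page') || '1', 10) + 1;
      localStorage.setItem('page', String(page));

      if (page <= 10) {                                             // arbitrary stop condition
        // __doPostBack is defined by the page itself; Greasemonkey scripts reach it via unsafeWindow.
        unsafeWindow.__doPostBack('ctl00$GridView1', 'Page$' + page); // hypothetical control name
      } else {
        console.log('Finished scraping', collected.length, 'rows');
      }
    })();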
I used an automation library, Selenium, which you can use from a lot of languages (C#, Java, Perl, ...).
For more information on how to get started, this link is very helpful: this.
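For example, with Selenium's JavaScript bindings (the same idea carries over to the Java bindings, which would sit nicely next to JSoup); the URL, the pager's link text, and the page count are assumptions:

    // Selenium sketch: let a real browser execute __doPostBack, then read the rendered HTML.
    const { Builder, By, until } = require('selenium-webdriver');

    (async function scrape() {
      const driver = await new Builder().forBrowser('firefox').build();
      try {
        await driver.get('http://example.com/list.aspx');               // hypothetical target page
        for (let page = 2; page <= 5; page++) {
          const html = await driver.getPageSource();
          console.log('got page of length', html.length);               // hand this off to your parser
          const oldBody = await driver.findElement(By.css('body'));
          await driver.findElement(By.linkText(String(page))).click();  // pager link that calls __doPostBack
          await driver.wait(until.stalenessOf(oldBody), 10000);         // wait for the postback to replace the page
        }
      } finally {
        await driver.quit();
      }
    })();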
As well as Selenium, you can use http://watin.org/
Either my google-fu has failed me or there really aren't too many people doing this yet. As you know, Backbone.js has an Achilles heel: it cannot serve the HTML it renders to page crawlers such as Googlebot, because they do not run JavaScript (although given that it's Google, with their resources, the V8 engine, and the sobering fact that JavaScript applications are on the rise, I expect this to change someday). I'm aware that Google has a hashbang workaround policy, but it's simply a bad idea. Plus, I'm using pushState. This is an extremely important issue for me, and I would expect it to be for others as well: SEO cannot be ignored, so a purely client-side approach cannot even be considered for many applications out there that require or depend on it.
Enter node.js. I'm only just starting to get into this craze, but it seems possible to have the same Backbone.js app that exists on the client also run on the server, hand in hand with node.js. node.js would then be able to serve HTML rendered from the Backbone.js app to page crawlers. It seems feasible, but I'm looking for someone who is more experienced with node.js, or even better someone who has actually done this, to advise me.
What steps do I need to take to allow me to use node.js to serve my Backbone.js app to web crawlers? Also, my Backbone app consumes an API that is written in Rails which I think would make this less of a headache.
EDIT: I failed to mention that I already have a production app written in Backbone.js. I'm looking to apply this technique to that app.
First of all, let me add a disclaimer that I think this use of node.js is a bad idea. Second disclaimer: I've done similar hacks, but just for the purpose of automated testing, not crawlers.
With that out of the way, let's go. If you intend to run your client-side app on the server, you'll need to recreate the browser environment on your server:
Most obviously, you're missing the DOM (Document Object Model) - basically the AST on top of your parsed HTML document. The node.js solution for this is jsdom.
That however will not suffice. Your browser also exposes the BOM (Browser Object Model): access to browser features like, for example, history.pushState. This is where it gets tricky. There are two options. First, you can try to bend PhantomJS or CasperJS to run your app and then scrape the HTML off it. It's fragile, since you're running a huge full WebKit browser with the UI parts sawn off.
The other option is Zombie, which is a lightweight re-implementation of browser features in JavaScript. According to its page it supports pushState, but my experience is that the browser emulation is far from complete; still, give it a try and see how far you get.
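To give an idea of the effort involved, a minimal Zombie sketch looks roughly like this (API details vary between Zombie versions, so double-check against the release you install):

    // Zombie sketch: load the app, let its client-side JavaScript run, then read the result.
    var Browser = require('zombie');
    var browser = new Browser();

    browser.visit('http://localhost:3000/', function (error) {
      if (error) throw error;
      // browser.html() returns the serialized DOM after scripts have executed.
      console.log(browser.html());
    });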
I'm going to leave it to you to decide whether pushing your rendering engine to the server side is a sound decision.
Because Node.js is built on V8 (Chrome's engine) it will run JavaScript, including Backbone.js. Creating your models and so forth would be done in exactly the same way.
The Node.js environment of course lacks a DOM, so this is the part you need to recreate. I believe the most popular module is:
https://github.com/tmpvar/jsdom
Once you have an accessible DOM API in Node.js, you simply build its nodes as you would for a typical browser client (maybe using jQuery) and respond to server requests with the rendered HTML (via $("myDOM").html() or similar).
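A minimal sketch of that idea using the current jsdom API plus jQuery (older jsdom releases exposed jsdom.env instead, so adjust to whatever version you have):

    // Server-side DOM sketch: build the page with jQuery against a jsdom window, then serialize it.
    const http = require('http');
    const { JSDOM } = require('jsdom');

    http.createServer((req, res) => {
      const dom = new JSDOM('<!doctype html><html><body><div id="app"></div></body></html>');
      const $ = require('jquery')(dom.window);   // the jquery package accepts a window to bind to

      // Build nodes exactly as client-side code would.
      $('#app').append('<h1>Hello from the server</h1>');

      res.writeHead(200, { 'Content-Type': 'text/html' });
      res.end(dom.serialize());                  // send the rendered HTML to the client (or crawler)
    }).listen(3000);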
I believe you can take a fallback-strategy type of approach. Consider what happens when a link is clicked with JavaScript turned off versus on. Anything on your page that should be crawlable needs a reasonable fallback when JavaScript is turned off: your links should always have the real server URL as their href, and the default action should be prevented with JavaScript.
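A short sketch of that link pattern (the marker attribute is made up): the href stays a real server URL for crawlers and no-JS users, and the handler only takes over when JavaScript is actually running.

    // Fallback-friendly navigation sketch: real hrefs, intercepted only when JS is available.
    $(document).on('click', 'a[data-internal]', function (event) {        // hypothetical marker attribute
      event.preventDefault();                                             // stop the full page load
      Backbone.history.navigate($(this).attr('href'), { trigger: true }); // client-side router renders it
    });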
I wouldn't say this is necessarily Backbone's responsibility. The only things Backbone can help you with here are modifying your URL when the page changes and letting your models/collections be used on both the client and the server. The views and routers, I believe, would be strictly client-side.
What you can do, though, is make your Jade pages and partials renderable from either the client side or the server side, with or without content injected. That way the same page can be rendered either way: if you replace a big chunk of your page and change the URL, the HTML you are grabbing can come from the same template that would be used if someone went directly to that page.
When your server receives a request, it should take you directly to that page rather than going through the main entry point, loading Backbone, and having it manipulate the page into the state the URL implies.
I think you should be able to achieve this just by rearranging things in your app a bit: no real rewriting, just a good amount of moving things around. You may need to write a controller that serves HTML with the content either injected or not. This gives your Backbone app the HTML it needs to couple with the data from the models. Like I said, those same templates can be used when you hit those links directly through the routes defined in Express/node.js.
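A rough sketch of such a controller in Express with Jade templates (the route, template names, and fetchPost are placeholders): the full page for direct hits and crawlers, just the fragment when the Backbone app asks over XHR.

    // Dual-rendering controller sketch: same data, full layout or bare partial depending on the request.
    var express = require('express');
    var app = express();
    app.set('view engine', 'jade');

    app.get('/posts/:id', function (req, res) {
      // fetchPost is a placeholder for however you load data (e.g. your Rails API).
      fetchPost(req.params.id, function (err, post) {
        if (err) return res.status(500).send(err.message);
        var template = req.xhr ? 'partials/post' : 'post-page'; // hypothetical template names
        res.render(template, { post: post });
      });
    });

    app.listen(3000);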
This is on my to-do list for our app: have Node.js parse the Backbone routes (stored in memory when the app starts) and, at the very least, serve the main page templates as straight HTML. Anything more would probably be too much overhead/processing for the back end when you consider thousands of users hitting your site.
I believe Backbone apps like Airbnb do it this way as well, but only for robots like the Google crawler. You also need this for things like Facebook likes, as Facebook sends out a crawler to read your og: tags.
A working solution is to use Backbone everywhere: https://github.com/Morriz/backbone-everywhere. But it forces you to use Node as your backend.
Another alternative is to use the same templates on the server and front-end.
The front end loads Mustache templates using the require.js text plugin, and the server renders the page using the same Mustache templates.
Another addition is to render bootstrapped module data in a script tag as JSON, to be used immediately by Backbone to populate models and collections.
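As a small sketch of that bootstrapping step (the variable name window.__BOOTSTRAP__ is just a convention): the server inlines the JSON in a script tag, and Backbone consumes it without an extra request.

    // Server-rendered page includes something like:
    //   <script>window.__BOOTSTRAP__ = {"posts": [{"id": 1, "title": "Hello"}]};</script>
    //
    // Client-side Backbone code can then populate its collections immediately:
    var PostCollection = Backbone.Collection.extend({ url: '/api/posts' }); // hypothetical endpoint

    var posts = new PostCollection();
    posts.reset(window.__BOOTSTRAP__.posts);   // no initial fetch needed; the data came with the page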
Basically you need to decide what it is that you're serving: is it a true app (i.e. something that could stand in as a replacement for a dedicated desktop application), or is it a presentation of content (i.e. classical "web page")? If you're concerned about SEO, it's likely that it's actually the latter ("content site") and in that case the "single-page app" model isn't appropriate; you really want the "progressively enhanced website" model instead (look up such phrases as "unobtrusive JavaScript", "progressive enhancement" and "adaptive Web design").
To amplify a little, "server sends only serialized data and client does all rendering" is only appropriate in the "true app" scenario. For the "content site" scenario, the appropriate model is "server does main rendering, client makes it look better and does some small-scale rendering to avoid disruptive page transitions when possible".
And, by the way, the objection that progressive enhancement means "making sure a sighted user doesn't get anything better than a blind user using text-to-speech" is an expression of political resentment, not reality. Progressively enhanced sites can be as fancy as you want them to be from the perspective of a user with a high-end rendering system.
I'm writing an Angular.js application. I want it to be really fast, so I serve it fully generated server-side when it is initially loaded. After that, every change should be handled client-side by Angular, with asynchronous communication with the server.
I have the ng-view attribute on a central <div>. But Angular regenerates the content of this <div> even on first load, before any link is clicked. I don't want this behavior, because then the server-side generation of the page is useless.
How can I achieve that?
Although Gloopy's suggestion will work in some cases, it will fail in others (namely ng-repeat). AngularJS does not currently have the ability to render on the server, but this is something that (as far as I know) no other JavaScript framework does either. I also know that server-side rendering is something that the AngularJS developers are looking into, so you may yet see it in the not-too-distant future. :)
When you say you want the application to be "really fast," you should consider where exactly you want this speed. There are a lot of places to consider speed, such as the time it takes to bootstrap the app, the time it takes to respond, resource intensiveness, etc. (you seem to be focusing on bootstrap time). There are often different trade-offs that must be made to balance performance in an application. I'd recommend reading this response to another question on performance with AngularJS for more on the subject: Angular.js Backbone.js or which has better performance
Are you actually running into performance issues, or is this just something you predict to be a problem? If it's the latter, I'd recommend building a prototype representative of your type of application to see if it really is an issue. If it's the former and it's taking your app too long to bootstrap on the client side, there may be some optimizations that you can make (for instance, inlining some model data to avoid an additional round trip, or using Gloopy's suggestion). You can also use the profiling tools in Chrome, as well as the AngularJS Batarang, to look for slow areas in your application.
btford: You are absolutely right that this sounds like premature optimization; it sounds that way to me too :). Another reason is that the application should work without JS in a very simple way, so I need this basic layout anyway (and Angular does that for me for all other pages), so there will always be rendering on the server.
I found a very hacky, ugly solution: I bootstrap the application after the first click on any internal link. On that click, I unbind this initial callback, update the URL with History.pushState and bootstrap the app; it grabs the new URL and the regeneration is absolutely fine. Well, I'll keep looking toward the not-too-distant future :).
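For reference, a sketch of that workaround (the module name and link selection are assumptions, it targets modern browsers, and it uses the native history API rather than the History.js wrapper): the server-rendered markup is left untouched until the first internal link is clicked, at which point Angular is bootstrapped manually.

    // Deferred-bootstrap sketch: keep the server-rendered page until the user first navigates.
    document.addEventListener('click', function onFirstClick(event) {
      var link = event.target.closest('a[href^="/"]');        // internal links only
      if (!link) return;

      event.preventDefault();
      document.removeEventListener('click', onFirstClick);    // unbind the initial callback

      history.pushState(null, '', link.getAttribute('href')); // update the URL first
      angular.bootstrap(document, ['myApp']);                 // Angular picks up the new URL and renders the view
    });

Note that for angular.bootstrap to work this way, the page must not also carry an ng-app attribute, otherwise Angular bootstraps itself on load anyway.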
I was able to come up with a solution for doing this. It doesn't work perfectly or for everything, but it is OK at least as far as routing and the directive I made that uses ng-repeat are concerned.
https://github.com/ithkuil/angular-on-server/wiki/Running-AngularJS-on-the-server-with-Node.js-and-jsdom