Implementing HTML DOM rewriting logic on the server - javascript

My application does massive rewriting of the DOM on the client side at load time. It traverses the page scanning for special markup (think Markdown) or other patterns, replacing them with sometimes rather complicated DOM structures (using DOM calls such as createElement) to style text but also create diagrams and graphics.
I adopted this architecture in order to avoid any build or preprocessing steps. It works fine in a desktop browser, but is noticeably slow on mobile devices (several seconds, even after relentless optimization). So I would like to rearchitect the system to pre-scan the page and pre-build the DOM. I'm having a bit of a mental block figuring out how to do this. Obviously, I would prefer not to rewrite all the JavaScript in some other server-side language. Also, I would like to preserve the option to do the building at load time as I do now, with the basic rewriting logic sharing the same code.
The most likely-sounding option is to build this as a Node app, although I am a Node beginner, using jsdom both to parse the input and to modify the DOM. Or, since I am a fan of XSLT and intrigued by Saxon-CE, even though it would mean rewriting everything, I have also considered implementing the scanning/rewriting logic in XSLT, to be invoked either from Node (for the pre-building case: do people use Saxon from Node?) or the browser (for the load-time building case).
Can anyone comment on this approach or throw out alternative ideas?

Not sure what specific use cases you are tackling with the massive DOM rewriting, nor am I sure what your throughput requirements are. That said, one alternate approach to the Node/jsdom route could be to run a farm of headless WebKit browsers and run your current JavaScript as-is in that live rendering context. That would allow you to offload processing from those pokey mobile CPUs onto arbitrarily scalable cloud resources (assuming this is affordable for your project/business), and skirt the need to rewrite or tweak your current, working code at all.
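For illustration, a minimal sketch of that idea with PhantomJS (one headless-WebKit option; the URL and the fixed delay are placeholders, and a readiness flag set by your own rewriting code would be more robust than a timeout):

var page = require('webpage').create();
page.open('http://example.com/page.html', function (status) {
  if (status !== 'success') { phantom.exit(1); }
  // crude: give the in-page rewriting code time to finish
  setTimeout(function () {
    var html = page.evaluate(function () {
      return document.documentElement.outerHTML; // the rewritten DOM, serialized
    });
    console.log(html);
    phantom.exit();
  }, 500);
});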

Sounds like you want Node. If you already know JavaScript it really is a cinch to pick up.
I would recommend a tutorial like this one: http://www.nodebeginner.org/
It will take you an hour-ish to get through but gives a solid overview of Node as you build a small but functional app along with the author.
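If you go the jsdom route from the question, a minimal sketch might look like this (assuming a recent jsdom, v10+, which exposes the JSDOM class; older versions used jsdom.env instead, and the shared rewrite.js module here is hypothetical):

const fs = require('fs');
const { JSDOM } = require('jsdom'); // npm install jsdom

const html = fs.readFileSync('input.html', 'utf8');
const dom = new JSDOM(html);

// Shared rewriting module: in the browser it runs against window.document
// at load time; here it runs against jsdom's document at build time.
const rewritePage = require('./rewrite');
rewritePage(dom.window.document);

fs.writeFileSync('output.html', dom.serialize());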


How to download and query HTML pages where JS processing is necessary?

I often compile informal datasets by running some kind of XPath/XQuery on publicly available web pages. Usually the structure of the HTML is regular enough that useful information can be extracted easily.
But today I've come across tunefind.com. This website makes extensive use of the ReactJS framework, and so most of the structure of the page is built client-side by JavaScript. The pages, when initially downloaded, are very basic and missing a lot of information. The pages are populated by a script that uses a hopelessly messy blob of JSON data at the bottom of the page.
The only way I can think of to deal with this would be to use some kind of GUI-based web engine and just not display the GUI part. But that is a preposterous amount of work for these casual little CLI tools that I use to gather information.
Is there any way to perform the JavaScript preprocessing without dealing with unnecessary graphics?
Even if you were to process the page without the graphics, the React JavaScript will be geared towards running in a browser context: at the very least it will expect a functioning DOM to exist, and the application itself may also require clicks/transitions to happen before you can see some data.
Your best bet, then, is to load the page in a real browser. To keep this simple, there are plenty of good browser automation frameworks designed for exactly this.
I've used a fair few libraries over the years, including PhantomJS, and recently I've gotten the most mileage out of Nightmare.
It runs an Electron browser for you and gives you a useful promisified JavaScript API to control it, with common browser functions such as clicking, following links, etc.
You can configure it to hide the browser, which is useful for making a CLI tool; however, it's a bit of a pseudo-headless mode and will still require a windowing/graphical context (e.g. an X server).
Hope this helps.
PS - If you're at all used to docker it's not hard to make this just a running container!
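A minimal sketch of such a CLI tool with Nightmare (the URL and selector are placeholders for whatever you're scraping):

const Nightmare = require('nightmare'); // npm install nightmare

const nightmare = Nightmare({ show: false }); // hide the Electron window

nightmare
  .goto('https://www.tunefind.com/show/some-show') // placeholder URL
  .wait('.some-rendered-selector') // wait until React has rendered the data
  .evaluate(() => document.documentElement.outerHTML) // grab the final DOM
  .end()
  .then((html) => process.stdout.write(html)) // pipe into your XPath/XQuery tools
  .catch((err) => { console.error(err); process.exit(1); });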

Managing JavaScript complexity in a large project

What should I use to manage a growing number of JavaScript files in my application?
We are building a Django application with several apps. Each app has different functionality and has to be rendered in three different modes (PC, tablet, mobile). There is a lot happening in JavaScript: managing data received from the server, handling user events, injecting HTML snippets, and loading sub-components. Some of the functionality is shared between apps and view modes, but often it makes sense to write specific functions (for example, hover and click events may have to be handled differently on a PC layout vs. a tablet layout), so we are grouping this in files based on app/layout/function.
Up to a point we were using a flat file structure with naming to differentiate types of files:
ui.common.js
ui.app1.pc.handlers.js
ui.app1.pc.domManipulators.js
ui.app1.tablet.js
ui.app2.pc.js
...
Right now, however, as the number of apps (and corner cases) grows, this is fast becoming unusable (we're approaching 20+ files and expecting maybe 40+ by the time we're done), so we are putting everything in directories like so:
js/
  common/
    core1.js
    ajax2.js
  app1/
    tablet.js
    pc.js
  app2/
    mobile.js
  ...
I have been looking at JavaScriptMVC to help with this. While it does offer useful tools it doesn't seem to have anything that would specifically make managing our giant JavaScript library better. We are expanding our dev team soon and code maintainability is very important.
Is there something that may make our life easier? Are there any habits/rules of thumb you use in your work that could alleviate this?
Backbone.js is used to organize JavaScript-heavy applications in an MVC-style pattern. It's going to take some learning, but it's definitely something you'll want to look into and learn a bit about even if you don't end up using it.
It's used on quite a few pretty impressive projects.
And here's a site to learn more, with tutorials.
Typically, grouping libraries by commonality (like your second example) would be preferred. More important, however, is making sure you have namespaced them or otherwise made them unique, so that you are unlikely to get naming collisions with other scripts.
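For example, a minimal sketch of one common namespacing pattern (all names are hypothetical, mirroring the directory layout above):

var UI = UI || {};

UI.common = UI.common || {};
UI.common.ajax = (function () {
  // private helpers stay out of the global scope
  function send(url, callback) { /* ... */ }
  return { send: send };
}());

UI.app1 = UI.app1 || {};
UI.app1.tablet = {
  bindHandlers: function () {
    // tablet-specific hover/click handling for app1
  }
};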

Using Selenium-IDE with a rich Javascript application?

Problem
At my workplace, we're trying to find the best way to create automated tests for an almost wholly JavaScript-driven intranet application. Right now we're stuck trying to find a good tradeoff between:
Application code in reusable and nest-able GUI components.
Tests which are easily created by the testing team
Tests which can be recorded once and then automated
Tests which do not break after small cosmetic changes to the site
XPath expressions (or other possible expressions, like jQuery selectors) naively generated from Selenium-IDE are often non-repeatable and very fragile. Conversely, having the JS code generate special unique ID values for every important DOM-element on the page... well, that is its own headache, complicated by re-usable GUI components and IDs needing to be consistent when the test is re-run.
What successes have other people had with this kind of thing? How do you do automated application-level testing of a rich JS interface?
Limitations
We are using JavaScriptMVC 2.0, hopefully 3.0 soon, so that we can upgrade to jQuery 1.4.x.
The test-making folks are mostly trained to use Selenium IDE to directly record things.
The test leads would prefer a page-unique HTML ID on each clickable element on the page...
Training the testers to write or alter special expressions (such as telling them which HTML class-names are important branching points) is a no-go.
We try to make re-usable javascript components, but this means very few GUI components can treat themselves (or what they contain) as unique.
Some of our components already use HTML ID values in their operation. I'd like to avoid doing this anyway, but it complicates the idea of ID-based testing.
It may be possible to add custom facilities (like a locator-builder or new locator method) to the Selenium-IDE installation testers use.
Almost everything that goes on occurs within a single "page load" from a conventional browser perspective, even when items are saved.
Current thoughts
I'm considering a system where a custom locator-builder (javascript code) for Selenium-IDE will talk with our application code as the tester is recording. In this way, our application becomes partially responsible for generating a mostly-flexible expression (XPath or jQuery) for any given DOM element. While this can avoid requiring more training for testers, I worry it may be over-thinking things.
Record and playback will not work for large-scale testing. It may work for smoke tests and small repetitive tasks.
Instead of trying to generate unique IDs, try to solve this with CSS-based selectors. Generating unique IDs is the ideal goal, but I don't think it is achievable in all practical cases.
If you are looking for custom locators, it is better to look into BDD.
Can't you use CSS selectors with Selenium? That seems a little more straightforward than using XPath.
http://saucelabs.com/blog/index.php/2010/01/selenium-totw-css-selectors-in-selenium-demystified/
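On the custom locator-builder idea from the question, a rough sketch of what that could look like, assuming the LocatorBuilders extension point of the old Firefox Selenium-IDE and a hypothetical data-test attribute that your components would emit:

// Prefer a stable, component-emitted attribute over brittle generated XPath.
LocatorBuilders.add('css:data-test', function (element) {
  var value = element.getAttribute('data-test'); // assumed attribute
  if (value) {
    return 'css=[data-test="' + value + '"]';
  }
  return null; // fall through to the default builders
});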

Are GWT wrappers on top of javascript libraries discouraged?

I'm in a process of selecting an API for building a GWT application. The answer to the following questions will help me choose among a set of libraries.
Does third-party code rewritten in GWT run faster than code using a wrapped JavaScript library?
Will code using a wrapped library have the same performance as pure GWT code if the underlying JavaScript framework is well written and tuned?
While JavaScript libraries get a lot of programming eyeballs and attention, GWT has the advantage of being able to do some hideously non-human-readable things to the generated JavaScript code, per browser, for the sake of performance.
In theory, anything the GWT compiler does, the JavaScript writers should be able to do. But in practice the JS library writers have to maintain their code. Look at the jQuery code. It's obviously not optimized per browser. With some effort, I could take jQuery and target it for Safari only, saving a lot of code and speeding up what remains.
It's an ongoing battle. The JavaScript libraries compete against each other, getting faster all the time. GWT gets better and better, and has the advantage of being able to write ugly unmaintainable JavaScript per browser.
For any given task, you'll have to test to see where the arms race currently places us, and it'll likely vary between browsers.
In some cases you don't have another option: you cannot rewrite everything when moving to GWT.
As a first step you could just wrap your existing code, and if it turns out to be a performance bottleneck you can still move that code to Java/GWT.
The code optimization in GWT will certainly be better than what the majority of JS developers can write. And when the browsers change, it is just a matter of updating the GWT optimizer, and your code will be better tuned for the latest advances in JS technology.
Depends on how well the code is written.
I would think so.
Generally, look at the community around a third-party library before using it, unless it is open source (so you can fix bugs yourself), and specifically look for posts concerning bugs: how quickly do the maintainers respond? How long is a release cycle? Etc.

When should I use Inline vs. External Javascript?

I would like to know when I should include external scripts or write them inline with the html code, in terms of performance and ease of maintenance.
What is the general practice for this?
Real-world-scenario - I have several html pages that need client-side form validation. For this I use a jQuery plugin that I include on all these pages. But the question is, do I:
write the bits of code that configure this script inline?
include all bits in one file that's shared among all these html pages?
include each bit in a separate external file, one for each html page?
Thanks.
At the time this answer was originally posted (2008), the rule was simple: All script should be external. Both for maintenance and performance.
(Why performance? Because if the code is separate, it can easier be cached by browsers.)
JavaScript doesn't belong in the HTML code, and if it contains special characters (such as <, >) it can even create problems.
Nowadays, web scalability has changed. Reducing the number of requests has become a valid consideration due to the latency of making multiple HTTP requests. This makes the answer more complex: in most cases, having JavaScript external is still recommended. But for certain cases, especially very small pieces of code, inlining them into the site’s HTML makes sense.
Maintainability is definitely a reason to keep them external, but if the configuration is a one-liner (or, in general, shorter than the HTTP overhead you would incur by making those files external), it's better performance-wise to keep it inline. Always remember that each HTTP request generates some overhead in terms of execution time and traffic.
Naturally this all becomes irrelevant the moment your code is longer than a couple of lines and is not really specific to one single page. The moment you want to be able to reuse that code, make it external. If you don't, look at its size and decide then.
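Applied to the validation scenario from the question, the split could look like this (the plugin's shorthand rules syntax is assumed; the form ID and options are illustrative):

<!-- shared, cacheable logic stays external -->
<script src="/js/jquery.js"></script>
<script src="/js/jquery.validate.js"></script>
<!-- the page-specific one-liner of configuration is inlined -->
<script>
  $(function () { $('#signup-form').validate({ rules: { email: 'required' } }); });
</script>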
If you only care about performance, most of the advice in this thread is flat-out wrong, and it is becoming more and more wrong in the SPA era, where we can assume that the page is useless without the JS code. I've spent countless hours optimizing SPA page load times and verifying these results with different browsers. Across the board, the performance increase from re-orchestrating your HTML can be quite dramatic.
To get the best performance, you have to think of pages as two-stage rockets. These two stages roughly correspond to <head> and <body> phases, but think of them instead as <static> and <dynamic>. The static portion is basically a string constant which you shove down the response pipe as fast as you possibly can. This can be a little tricky if you use a lot of middleware that sets cookies (these need to be set before sending http content), but in principle it's just flushing the response buffer, hopefully before jumping into some templating code (razor, php, etc) on the server. This may sound difficult, but then I'm just explaining it wrong, because it's near trivial. As you may have guessed, this static portion should contain all javascript inlined and minified. It would look something like
<!DOCTYPE html>
<html>
  <head>
    <script>/*...inlined jquery, angular, your code*/</script>
    <style>/* ditto css */</style>
  </head>
  <body>
    <!-- inline all your templates, if applicable -->
    <script type='template-mime' id='1'></script>
    <script type='template-mime' id='2'></script>
    <script type='template-mime' id='3'></script>
Since it costs you next to nothing to send this portion down the wire, you can expect that the client will start receiving this somewhere around 5ms + latency after connecting to your server. Assuming the server is reasonably close this latency could be between 20ms to 60ms. Browsers will start processing this section as soon as they get it, and the processing time will normally dominate transfer time by factor 20 or more, which is now your amortized window for server-side processing of the <dynamic> portion.
It takes about 50ms for the browser (chrome, rest maybe 20% slower) to process inline jquery + signalr + angular + ng animate + ng touch + ng routes + lodash. That's pretty amazing in and of itself. Most web apps have less code than all those popular libraries put together, but let's say you have just as much, so we would win latency+100ms of processing on the client (this latency win comes from the second transfer chunk). By the time the second chunk arrives, we've processed all js code and templates and we can start executing dom transforms.
You may object that this method is orthogonal to the inlining concept, but it isn't. If you, instead of inlining, link to cdns or your own servers the browser would have to open another connection(s) and delay execution. Since this execution is basically free (as the server side is talking to the database) it must be clear that all of these jumps would cost more than doing no jumps at all. If there were a browser quirk that said external js executes faster we could measure which factor dominates. My measurements indicate that extra requests kill performance at this stage.
I work a lot with optimization of SPA apps. It's common for people to think that data volume is a big deal, while in truth latency, and execution often dominate. The minified libraries I listed add up to 300kb of data, and that's just 68 kb gzipped, or 200ms download on a 2mbit 3g/4g phone, which is exactly the latency it would take on the same phone to check IF it had the same data in its cache already, even if it was proxy cached, because the mobile latency tax (phone-to-tower-latency) still applies. Meanwhile, desktop connections that have lower first-hop latency typically have higher bandwidth anyway.
In short, right now (2014), it's best to inline all scripts, styles and templates.
EDIT (MAY 2016)
As JS applications continue to grow, and some of my payloads now stack up to 3+ megabytes of minified code, it's becoming obvious that at the very least common libraries should no longer be inlined.
Externalizing JavaScript is one of Yahoo's performance rules:
http://developer.yahoo.com/performance/rules.html#external
While the hard-and-fast rule that you should always externalize scripts will generally be a good bet, in some cases you may want to inline some of the scripts and styles. You should however only inline things that you know will improve performance (because you've measured this).
I think the short script that is specific to one page is the only defensible case for inline script.
Actually, there's a pretty solid case to use inline JavaScript. If the JS is small enough (a one-liner), I tend to prefer it inline because of two factors:
Locality. There's no need to navigate to an external file to validate the behaviour of some small piece of JavaScript.
AJAX. If you're refreshing some section of the page via AJAX, you may lose all of your DOM handlers (onclick, etc.) for that section, depending on how you bound them. For example, using jQuery you can use the live or delegate methods to circumvent this, but I find that if the JS is small enough it is preferable to just put it inline.
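For reference, a sketch of that delegation approach (the selectors are hypothetical):

// jQuery 1.4.x-era API: the handler survives AJAX replacement of #section
$('#section').delegate('a.delete', 'click', function () {
  // still fires after #section's contents are re-rendered
});

// Equivalent with the later .on() API (jQuery 1.7+):
$(document).on('click', '#section a.delete', function () { /* ... */ });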
Another reason why you should always use external scripts is for easier transition to Content Security Policy (CSP). CSP defaults forbid all inline script, making your site more resistant to XSS attacks.
I would take a look at the required code and divide it into as many separate files as needed. Every js file would only hold one "logical set" of functions etc. eg. one file for all login related functions.
Then, during site development, on each HTML page you only include those that are needed.
When you go live with your site you can optimize by combining every js file a page needs into one file.
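That go-live step can be as simple as a naive Node concatenation script (file names are hypothetical; a real project would use a proper bundler/minifier):

var fs = require('fs');

// concatenate the "logical set" files a given page needs into one bundle
var files = ['js/login.js', 'js/validation.js', 'js/nav.js'];
var bundle = files.map(function (f) {
  return fs.readFileSync(f, 'utf8');
}).join(';\n'); // the semicolon guards against files missing a trailing one

fs.writeFileSync('js/bundle.js', bundle);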
The only defense I can offer for inline JavaScript is that when using strongly typed views with .NET MVC you can refer to C# variables mid-JavaScript, which I've found useful.
On the point of keeping JavaScript external:
ASP.NET 3.5 SP1 recently introduced functionality to create a composite script resource (merging a bunch of JS files into one). Another benefit is that when web-server compression is turned on, downloading one slightly larger file will have a better compression ratio than many smaller files (also less HTTP overhead, fewer round trips, etc.). I guess this saves on the initial page load; then browser caching kicks in, as mentioned above.
ASP.NET aside, this screencast explains the benefits in more detail:
http://www.asp.net/learn/3.5-SP1/video-296.aspx
Three considerations:
How much code do you need (sometimes libraries are a first-class consumer)?
Specificity: is this code only functional in the context of this specific document or element?
Any code inside the document makes it longer and thus slower. Besides that, SEO considerations make it obvious that you should minimize internal scripting.
External scripts are also easier to debug using Firebug. I like to unit test my JavaScript, and having it all external helps. I hate seeing JavaScript mixed into PHP code and HTML; it looks like a big mess to me.
Another hidden benefit of external scripts is that you can easily run them through a syntax checker like JSLint. That can save you from a lot of heartbreaking, hard-to-find IE6 bugs.
In your scenario it sounds like writing the external stuff in one file shared among the pages would be good for you. I agree with everything said above.
During early prototyping keep your code inline for the benefit of fast iteration, but be sure to make it all external by the time you reach production.
I'd even dare to say that if you can't place all your JavaScript externally, then you have a bad design on your hands, and you should refactor your data and scripts.
Google has included load time in its page-ranking measurements. If you inline a lot, it will take longer for the spiders to crawl through your page, which may influence your ranking if you have too much inlined. In any case, different strategies may influence your ranking.
I think you should use inline scripts when making single-page websites, as the scripts will not need to be shared across multiple pages.
Internal JS pros:
It's easier to manage and debug.
You can see what's happening.
Internal JS cons:
People can change it around, which can really annoy you.
External JS pros:
No changing around.
You can look more professional (or at least that's what I think).
External JS cons:
Harder to manage.
It's hard to know what's going on.
Always try to use external JS, as inline JS is difficult to maintain.
Moreover, most developers recommend keeping JS external, and it reads as more professional.
I myself use external JS.
