Sorting Special Characters in Javascript (Æ) - javascript

I'm trying to sort an array of objects based on the objects' name property. Some of the names start with 'Æ', and I'd like for them to be sorted as though they were 'Ae'. My current solution is the following:
myArray.sort(function(a, b) {
var aName = a.name.replace(/Æ/gi, 'Ae'),
bName = b.name.replace(/Æ/gi, 'Ae');
return aName.localeCompare(bName);
});
I feel like there should be a better way of handling this without having to manually replace each and every special character. Is this possible?
I'm doing this in Node.js if it makes any difference.

There is no simpler way. Unfortunately, even the way described in the question is too simple, at least if portability is of any concern.
The localeCompare method is by definition implementation-dependent, and it usually depends on the UI language of the underlying operating system, though it may also differ between browsers (or other JavaScript implementations) in the same computer. It can be hard to find any documentation on it, so even if you aim at writing non-portable code, you might need to do a lot of testing to see which collation order is applied. Cf. to Sorting strings is much harder than you thought!
So to have a controlled and portable comparison, you need to code it yourself, unless you are lucky enough to find someone else’s code that happens to suit your needs. On the positive side, the case conversion methods are one of the few parts of JavaScript that are localization-ready: they apply Unicode case mapping rules, so e.g. 'æ'.toUpperCase() yields Æ in any implementation.
In general, sorting strings requires a complicated function that applies specific sorting rules as defined for a language or by some other rules, such as the Pan-European sorting rules (intended for multilingual content). But if we can limit ourselves to sorting rules that deal with just a handful of letters in addition to Ascii, we can use code like the following simplified sorting for German (extract from by book Going Global with JavaScript and Globalize.js):
String.prototype.removeUmlauts = function () {
return this.replace(/Ä/g,'A').replace(/Ö/g,'O').replace(/Ü/g,'U');
};
function alphabetic(str1, str2) {
var a = str1.toUpperCase().removeUmlauts();
var b = str2.toUpperCase().removeUmlauts();
return a < b ? -1 : a > b ? 1 : 0;
}
You could adds other mappings, like replace(/Æ/gi, 'Ae'), to this, after analyzing the characters that may appear and deciding how to deal with them. Removing diacritic marks (e.g. mapping É to E) is simplistic but often good enough, and surely better than leaving it to implementations to decide whether É is somewhere after Z. And at least you would get consistent results across implementations, and you would see what things go wrong and need fixing, instead of waiting for other users complain that your code sorts all wrong (in their environment).

Related

Why is <= slower than < using this code snippet in V8?

I am reading the slides Breaking the Javascript Speed Limit with V8, and there is an example like the code below. I cannot figure out why <= is slower than < in this case, can anybody explain that? Any comments are appreciated.
Slow:
this.isPrimeDivisible = function(candidate) {
for (var i = 1; i <= this.prime_count; ++i) {
if (candidate % this.primes[i] == 0) return true;
}
return false;
}
(Hint: primes is an array of length prime_count)
Faster:
this.isPrimeDivisible = function(candidate) {
for (var i = 1; i < this.prime_count; ++i) {
if (candidate % this.primes[i] == 0) return true;
}
return false;
}
[More Info] the speed improvement is significant, in my local environment test, the results are as follows:
V8 version 7.3.0 (candidate)
Slow:
time d8 prime.js
287107
12.71 user
0.05 system
0:12.84 elapsed
Faster:
time d8 prime.js
287107
1.82 user
0.01 system
0:01.84 elapsed
Other answers and comments mention that the difference between the two loops is that the first one executes one more iteration than the second one. This is true, but in an array that grows to 25,000 elements, one iteration more or less would only make a miniscule difference. As a ballpark guess, if we assume the average length as it grows is 12,500, then the difference we might expect should be around 1/12,500, or only 0.008%.
The performance difference here is much larger than would be explained by that one extra iteration, and the problem is explained near the end of the presentation.
this.primes is a contiguous array (every element holds a value) and the elements are all numbers.
A JavaScript engine may optimize such an array to be an simple array of actual numbers, instead of an array of objects which happen to contain numbers but could contain other values or no value. The first format is much faster to access: it takes less code, and the array is much smaller so it will fit better in cache. But there are some conditions that may prevent this optimized format from being used.
One condition would be if some of the array elements are missing. For example:
let array = [];
a[0] = 10;
a[2] = 20;
Now what is the value of a[1]? It has no value. (It isn't even correct to say it has the value undefined - an array element containing the undefined value is different from an array element that is missing entirely.)
There isn't a way to represent this with numbers only, so the JavaScript engine is forced to use the less optimized format. If a[1] contained a numeric value like the other two elements, the array could potentially be optimized into an array of numbers only.
Another reason for an array to be forced into the deoptimized format can be if you attempt to access an element outside the bounds of the array, as discussed in the presentation.
The first loop with <= attempts to read an element past the end of the array. The algorithm still works correctly, because in the last extra iteration:
this.primes[i] evaluates to undefined because i is past the array end.
candidate % undefined (for any value of candidate) evaluates to NaN.
NaN == 0 evaluates to false.
Therefore, the return true is not executed.
So it's as if the extra iteration never happened - it has no effect on the rest of the logic. The code produces the same result as it would without the extra iteration.
But to get there, it tried to read a nonexistent element past the end of the array. This forces the array out of optimization - or at least did at the time of this talk.
The second loop with < reads only elements that exist within the array, so it allows an optimized array and code.
The problem is described in pages 90-91 of the talk, with related discussion in the pages before and after that.
I happened to attend this very Google I/O presentation and talked with the speaker (one of the V8 authors) afterward. I had been using a technique in my own code that involved reading past the end of an array as a misguided (in hindsight) attempt to optimize one particular situation. He confirmed that if you tried to even read past the end of an array, it would prevent the simple optimized format from being used.
If what the V8 author said is still true, then reading past the end of the array would prevent it from being optimized and it would have to fall back to the slower format.
Now it's possible that V8 has been improved in the meantime to efficiently handle this case, or that other JavaScript engines handle it differently. I don't know one way or the other on that, but this deoptimization is what the presentation was talking about.
I work on V8 at Google, and wanted to provide some additional insight on top of the existing answers and comments.
For reference, here's the full code example from the slides:
var iterations = 25000;
function Primes() {
this.prime_count = 0;
this.primes = new Array(iterations);
this.getPrimeCount = function() { return this.prime_count; }
this.getPrime = function(i) { return this.primes[i]; }
this.addPrime = function(i) {
this.primes[this.prime_count++] = i;
}
this.isPrimeDivisible = function(candidate) {
for (var i = 1; i <= this.prime_count; ++i) {
if ((candidate % this.primes[i]) == 0) return true;
}
return false;
}
};
function main() {
var p = new Primes();
var c = 1;
while (p.getPrimeCount() < iterations) {
if (!p.isPrimeDivisible(c)) {
p.addPrime(c);
}
c++;
}
console.log(p.getPrime(p.getPrimeCount() - 1));
}
main();
First and foremost, the performance difference has nothing to do with the < and <= operators directly. So please don't jump through hoops just to avoid <= in your code because you read on Stack Overflow that it's slow --- it isn't!
Second, folks pointed out that the array is "holey". This was not clear from the code snippet in OP's post, but it is clear when you look at the code that initializes this.primes:
this.primes = new Array(iterations);
This results in an array with a HOLEY elements kind in V8, even if the array ends up completely filled/packed/contiguous. In general, operations on holey arrays are slower than operations on packed arrays, but in this case the difference is negligible: it amounts to 1 additional Smi (small integer) check (to guard against holes) each time we hit this.primes[i] in the loop within isPrimeDivisible. No big deal!
TL;DR The array being HOLEY is not the problem here.
Others pointed out that the code reads out of bounds. It's generally recommended to avoid reading beyond the length of arrays, and in this case it would indeed have avoided the massive drop in performance. But why though? V8 can handle some of these out-of-bound scenarios with only a minor performance impact. What's so special about this particular case, then?
The out-of-bounds read results in this.primes[i] being undefined on this line:
if ((candidate % this.primes[i]) == 0) return true;
And that brings us to the real issue: the % operator is now being used with non-integer operands!
integer % someOtherInteger can be computed very efficiently; JavaScript engines can produce highly-optimized machine code for this case.
integer % undefined on the other hand amounts to a way less efficient Float64Mod, since undefined is represented as a double.
The code snippet can indeed be improved by changing the <= into < on this line:
for (var i = 1; i <= this.prime_count; ++i) {
...not because <= is somehow a superior operator than <, but just because this avoids the out-of-bounds read in this particular case.
TL;DR The slower loop is due to accessing the Array 'out-of-bounds', which either forces the engine to recompile the function with less or even no optimizations OR to not compile the function with any of these optimizations to begin with (if the (JIT-)Compiler detected/suspected this condition before the first compilation 'version'), read on below why;
Someone just has to say this (utterly amazed nobody already did):
There used to be a time when the OP's snippet would be a de-facto example in a beginners programming book intended to outline/emphasize that 'arrays' in javascript are indexed starting at 0, not 1, and as such be used as an example of a common 'beginners mistake' (don't you love how I avoided the phrase 'programing error' ;)): out-of-bounds Array access.
Example 1:
a Dense Array (being contiguous (means in no gaps between indexes) AND actually an element at each index) of 5 elements using 0-based indexing (always in ES262).
var arr_five_char=['a', 'b', 'c', 'd', 'e']; // arr_five_char.length === 5
// indexes are: 0 , 1 , 2 , 3 , 4 // there is NO index number 5
Thus we are not really talking about performance difference between < vs <= (or 'one extra iteration'), but we are talking:
'why does the correct snippet (b) run faster than erroneous snippet (a)'?
The answer is 2-fold (although from a ES262 language implementer's perspective both are forms of optimization):
Data-Representation: how to represent/store the Array internally in memory (object, hashmap, 'real' numerical array, etc.)
Functional Machine-code: how to compile the code that accesses/handles (read/modify) these 'Arrays'
Item 1 is sufficiently (and correctly IMHO) explained by the accepted answer, but that only spends 2 words ('the code') on Item 2: compilation.
More precisely: JIT-Compilation and even more importantly JIT-RE-Compilation !
The language specification is basically just a description of a set of algorithms ('steps to perform to achieve defined end-result'). Which, as it turns out is a very beautiful way to describe a language.
And it leaves the actual method that an engine uses to achieve specified results open to the implementers, giving ample opportunity to come up with more efficient ways to produce defined results.
A spec conforming engine should give spec conforming results for any defined input.
Now, with javascript code/libraries/usage increasing, and remembering how much resources (time/memory/etc) a 'real' compiler uses, it's clear we can't make users visiting a web-page wait that long (and require them to have that many resources available).
Imagine the following simple function:
function sum(arr){
var r=0, i=0;
for(;i<arr.length;) r+=arr[i++];
return r;
}
Perfectly clear, right? Doesn't require ANY extra clarification, Right? The return-type is Number, right?
Well.. no, no & no... It depends on what argument you pass to named function parameter arr...
sum('abcde'); // String('0abcde')
sum([1,2,3]); // Number(6)
sum([1,,3]); // Number(NaN)
sum(['1',,3]); // String('01undefined3')
sum([1,,'3']); // String('NaN3')
sum([1,2,{valueOf:function(){return this.val}, val:6}]); // Number(9)
var val=5; sum([1,2,{valueOf:function(){return val}}]); // Number(8)
See the problem ? Then consider this is just barely scraping the massive possible permutations...
We don't even know what kind of TYPE the function RETURN until we are done...
Now imagine this same function-code actually being used on different types or even variations of input, both completely literally (in source code) described and dynamically in-program generated 'arrays'..
Thus, if you were to compile function sum JUST ONCE, then the only way that always returns the spec-defined result for any and all types of input then, obviously, only by performing ALL spec-prescribed main AND sub steps can guarantee spec conforming results (like an unnamed pre-y2k browser).
No optimizations (because no assumptions) and dead slow interpreted scripting language remains.
JIT-Compilation (JIT as in Just In Time) is the current popular solution.
So, you start to compile the function using assumptions regarding what it does, returns and accepts.
you come up with checks as simple as possible to detect if the function might start returning non-spec conformant results (like because it receives unexpected input).
Then, toss away the previous compiled result and recompile to something more elaborate, decide what to do with the partial result you already have (is it valid to be trusted or compute again to be sure), tie in the function back into the program and try again. Ultimately falling back to stepwise script-interpretation as in spec.
All of this takes time!
All browsers work on their engines, for each and every sub-version you will see things improve and regress. Strings were at some point in history really immutable strings (hence array.join was faster than string concatenation), now we use ropes (or similar) which alleviate the problem. Both return spec-conforming results and that is what matters!
Long story short: just because javascript's language's semantics often got our back (like with this silent bug in the OP's example) does not mean that 'stupid' mistakes increases our chances of the compiler spitting out fast machine-code. It assumes we wrote the 'usually' correct instructions: the current mantra we 'users' (of the programming language) must have is: help the compiler, describe what we want, favor common idioms (take hints from asm.js for basic understanding what browsers can try to optimize and why).
Because of this, talking about performance is both important BUT ALSO a mine-field (and because of said mine-field I really want to end with pointing to (and quoting) some relevant material:
Access to nonexistent object properties and out of bounds array elements returns the undefined value instead of raising an exception. These dynamic features make programming in JavaScript convenient, but they also make it difficult to compile JavaScript into efficient machine code.
...
An important premise for effective JIT optimization is that programmers use dynamic features of JavaScript in a systematic way. For example, JIT compilers exploit the fact that object properties are often added to an object of a given type in a specific order or that out of bounds array accesses occur rarely. JIT compilers exploit these regularity assumptions to generate efficient machine code at runtime. If a code block satisfies the assumptions, the JavaScript engine executes efficient, generated machine code. Otherwise, the engine must fall back to slower code or to interpreting the program.
Source:
"JITProf: Pinpointing JIT-unfriendly JavaScript Code"
Berkeley publication,2014, by Liang Gong, Michael Pradel, Koushik Sen.
http://software-lab.org/publications/jitprof_tr_aug3_2014.pdf
ASM.JS (also doesn't like out off bound array access):
Ahead-Of-Time Compilation
Because asm.js is a strict subset of JavaScript, this specification only defines the validation logic—the execution semantics is simply that of JavaScript. However, validated asm.js is amenable to ahead-of-time (AOT) compilation. Moreover, the code generated by an AOT compiler can be quite efficient, featuring:
unboxed representations of integers and floating-point numbers;
absence of runtime type checks;
absence of garbage collection; and
efficient heap loads and stores (with implementation strategies varying by platform).
Code that fails to validate must fall back to execution by traditional means, e.g., interpretation and/or just-in-time (JIT) compilation.
http://asmjs.org/spec/latest/
and finally https://blogs.windows.com/msedgedev/2015/05/07/bringing-asm-js-to-chakra-microsoft-edge/
were there is a small subsection about the engine's internal performance improvements when removing bounds-check (whilst just lifting the bounds-check outside the loop already had an improvement of 40%).
EDIT:
note that multiple sources talk about different levels of JIT-Recompilation down to interpretation.
Theoretical example based on above information, regarding the OP's snippet:
Call to isPrimeDivisible
Compile isPrimeDivisible using general assumptions (like no out of bounds access)
Do work
BAM, suddenly array accesses out of bounds (right at the end).
Crap, says engine, let's recompile that isPrimeDivisible using different (less) assumptions, and this example engine doesn't try to figure out if it can reuse current partial result, so
Recompute all work using slower function (hopefully it finishes, otherwise repeat and this time just interpret the code).
Return result
Hence time then was:
First run (failed at end) + doing all work all over again using slower machine-code for each iteration + the recompilation etc.. clearly takes >2 times longer in this theoretical example!
EDIT 2: (disclaimer: conjecture based in facts below)
The more I think of it, the more I think that this answer might actually explain the more dominant reason for this 'penalty' on erroneous snippet a (or performance-bonus on snippet b, depending on how you think of it), precisely why I'm adament in calling it (snippet a) a programming error:
It's pretty tempting to assume that this.primes is a 'dense array' pure numerical which was either
Hard-coded literal in source-code (known excelent candidate to become a 'real' array as everything is already known to the compiler before compile-time) OR
most likely generated using a numerical function filling a pre-sized (new Array(/*size value*/)) in ascending sequential order (another long-time known candidate to become a 'real' array).
We also know that the primes array's length is cached as prime_count ! (indicating it's intent and fixed size).
We also know that most engines initially pass Arrays as copy-on-modify (when needed) which makes handeling them much more fast (if you don't change them).
It is therefore reasonable to assume that Array primes is most likely already an optimized array internally which doesn't get changed after creation (simple to know for the compiler if there is no code modifiying the array after creation) and therefore is already (if applicable to the engine) stored in an optimized way, pretty much as if it was a Typed Array.
As I have tried to make clear with my sum function example, the argument(s) that get passed higly influence what actually needs to happen and as such how that particular code is being compiled to machine-code. Passing a String to the sum function shouldn't change the string but change how the function is JIT-Compiled! Passing an Array to sum should compile a different (perhaps even additional for this type, or 'shape' as they call it, of object that got passed) version of machine-code.
As it seems slightly bonkus to convert the Typed_Array-like primes Array on-the-fly to something_else while the compiler knows this function is not even going to modify it!
Under these assumptions that leaves 2 options:
Compile as number-cruncher assuming no out-of-bounds, run into out-of-bounds problem at the end, recompile and redo work (as outlined in theoretical example in edit 1 above)
Compiler has already detected (or suspected?) out of bound acces up-front and the function was JIT-Compiled as if the argument passed was a sparse object resulting in slower functional machine-code (as it would have more checks/conversions/coercions etc.). In other words: the function was never eligable for certain optimisations, it was compiled as if it received a 'sparse array'(-like) argument.
I now really wonder which of these 2 it is!
To add some scientificness to it, here's a jsperf
https://jsperf.com/ints-values-in-out-of-array-bounds
It tests the control case of an array filled with ints and looping doing modular arithmetic while staying within bounds. It has 5 test cases:
1. Looping out of bounds
2. Holey arrays
3. Modular arithmetic against NaNs
4. Completely undefined values
5. Using a new Array()
It shows that the first 4 cases are really bad for performance. Looping out of bounds is a bit better than the other 3, but all 4 are roughly 98% slower than the best case.
The new Array() case is almost as good as the raw array, just a few percent slower.

o[str] vs (o => o.str)

I'm coding a function that takes an object and a projection to know on which field it has to work.
I'm wondering if I should use a string like this :
const o = {
a: 'Hello There'
};
function foo(o, str) {
const a = o[str];
/* ... */
}
foo(o, 'a');
Or with a function :
function bar(o, proj) {
const a = proj(o);
/* ... */
}
bar(o, o => o.a);
I think V8 is creating classes with my javascript objects. If I use a string to access dynamically a field, will it be still able to create a class with my object and not a hashtable or something else ?
V8 developer here. The answer to "which pattern should I use?" is probably "it depends". I can think of scenarios where the one or the other would (likely) be (a bit) faster, depending on your app's behavior. So I would suggest that you either try both (in real code, not a microbenchmark!) and measure yourself, or simply pick whichever you prefer and/or makes more sense in the larger context, and not worry about it until profiling shows that this is an actual bottleneck that's worth spending time on.
If the properties are indeed known at the call site, then the fastest option is probably to load the property before the call:
function baz(o, str, a) {
/* ... */
}
baz(o, "a", o.a);
I realize that if things actually were this simple, you probably wouldn't be asking this question; if that assumption is true then this is a great example for how simplifications in microbenchmarks can easily change what the right answer is.
The answer to the classes question is that this decision has no impact on how V8 represents your objects under the hood -- that mostly depends on how you modify your objects, not on how you read from them. Also, for the record:
every object has a "hidden class"; whether or not it uses hash table representation is orthogonal to that
whether hash table mode or shape-tracking mode is better for any given object is one of the things that depend on the use case, which is precisely why both modes exist. I wouldn't worry too much about it, unless you know (from profiling) that it happens to be a problem in your case (more often than not, V8's heuristics get it right; manual intervention is rarely necessary).

Translating conditional statements in string format to actual logic

I have a good knowledge of real time graphics programming and web development, and I've started a project that requires me to take a user-created conditional string and actually use those conditions in code. This is an entirely new kind of programming problem for me.
I've tried a few experiments using loops and slicing up the conditional string...but I feel like I am missing some kind of technique that would make this more efficient and straightforward. I have a feeling regular expressions may be useful here, but perhaps not.
Here is an example string:
"IF#VAR#>=2AND$VAR2$==1OR#VAR3#<=3"
The values for those actual variables will come from an array of objects. Also, the different marker symbols around the variables denote different object arrays where the actual value can be found (variable name is an index).
I have complete control over how the conditional string is formatted (adding symbols around IF/ELSE/ELSEIF AND/OR
well as special symbols around the different operands) so my options are fairly open. How would you approach such a programming problem?
The problem you're facing is called parsing and there are numerous solutions to it. First, you can write your own "interpreter" for your mini-language, including lexer (which splits the string into tokens), parser (which builds a tree structure from a stream of tokens) and executor, which walks the tree and computes the final value. Or you can use a parser generator like PEG and have the whole thing built for you automatically - you just provide the rules of your language. Finally, you can utilize javascript built-in parser/evaluator eval. This is by far the simplest option, but eval only understands javascript syntax - so you'll have to translate your language to javascript before eval'ing it. And since eval can run arbitrary code, it's not for use in untrusted environments.
Here's an example on how to use eval with your sample input:
expr = "#VAR#>=2AND$VAR2$==1OR#VAR3#<=3"
vars = {
"#": {"VAR":5},
"$": {"VAR2":1},
"#": {"VAR3":7}
}
expr = expr.replace(/([##$])(\w+)(\1)/g, function($0, $1, $2) {
return "vars['" + $1 + "']." + $2;
}).replace(/OR/g, "||").replace(/AND/g, "&&")
result = eval(expr) // returns true

Why is String.replace() with lambda slower than a while-loop repeatedly calling RegExp.exec()?

One problem:
I want to process a string (str) so that any parenthesised digits (matched by rgx) are replaced by values taken from the appropriate place in an array (sub):
var rgx = /\((\d+)\)/,
str = "this (0) a (1) sentence",
sub = [
"is",
"test"
],
result;
The result, given the variables declared above, should be 'this is a test sentence'.
Two solutions:
This works:
var mch,
parsed = '',
remainder = str;
while (mch = rgx.exec(remainder)) { // Not JSLint approved.
parsed += remainder.substring(0, mch.index) + sub[mch[1]];
remainder = remainder.substring(mch.index + mch[0].length);
}
result = (parsed) ? parsed + remainder : str;
But I thought the following code would be faster. It has fewer variables, is much more concise, and uses an anonymous function expression (or lambda):
result = str.replace(rgx, function() {
return sub[arguments[1]];
});
This works too, but I was wrong about the speed; in Chrome it's surprisingly (~50%, last time I checked) slower!
...
Three questions:
Why does this process appear to be slower in Chrome and (for example) faster in Firefox?
Is there a chance that the replace() method will be faster compared to the while() loop given a bigger string or array? If not, what are its benefits outside Code Golf?
Is there a way optimise this process, making it both more efficient and as fuss-free as the functional second approach?
I'd welcome any insights into what's going on behind these processes.
...
[Fo(u)r the record: I'm happy to be called out on my uses of the words 'lambda' and/or 'functional'. I'm still learning about the concepts, so don't assume I know exactly what I'm talking about and feel free to correct me if I'm misapplying the terms here.]
Why does this process appear to be slower in Chrome and (for example) faster in Firefox?
Because it has to call a (non-native) function, which is costly. Firefox's engine might be able to optimize that away by recognizing and inlining the lookup.
Is there a chance that the replace() method will be faster compared to the while() loop given a bigger string or array?
Yes, it has to do less string concatenation and assignments, and - as you said - less variables to initialize. Yet you can only test it to prove my assumptions (and also have a look at http://jsperf.com/match-and-substitute/4 for other snippets - you for example can see Opera optimizing the lambda-replace2 which does not use arguments).
If not, what are its benefits outside Code Golf?
I don't think code golf is the right term. Software quality is about readabilty and comprehensibility, in whose terms the conciseness and elegance (which is subjective though) of the functional code are the reasons to use this approach (actually I've never seen a replace with exec, substring and re-concatenation).
Is there a way optimise this process, making it both more efficient and as fuss-free as the functional second approach?
You don't need that remainder variable. The rgx has a lastIndex property which will automatically advance the match through str.
Your while loop with exec() is slightly slower than it should be, since you are doing extra work (substring) as you use exec() on a non-global regex. If you need to loop through all matches, you should use a while loop on a global regex (g flag enabled); this way, you avoid doing extra work trimming the processed part of the string.
var rgR = /\((\d+)\)/g;
var mch,
result = '',
lastAppend = 0;
while ((mch = rgR.exec(str)) !== null) {
result += str.substring(lastAppend, mch.index) + sub[mch[1]];
lastAppend = rgR.lastIndex;
}
result += str.substring(lastAppend);
This factor doesn't disturb the performance disparity between different browser, though.
It seems the performance difference comes from the implementation of the browser. Due to the unfamiliarity with the implementation, I cannot answer where the difference comes from.
In terms of power, exec() and replace() have the same power. This includes the cases where you don't use the returned value from replace(). Example 1. Example 2.
replace() method is more readable (the intention is clearer) than a while loop with exec() if you are using the value returned by the function (i.e. you are doing real replacement in the anonymous function). You also don't have to reconstruct the replaced string yourself. This is where replace is preferred over exec(). (I hope this answers the second part of question 2).
I would imagine exec() to be used for the purposes other than replacement (except for very special cases such as this). Replacement, if possible, should be done with replace().
Optimization is only necessary, if performance degrades badly on actual input. I don't have any optimization to show, since the 2 only possible options are already analyzed, with contradicting performance between 2 different browser. This may change in the future, but for now, you can choose the one that has better worst-performance-across-browser to work with.

Good resources for extreme minified JavaScript (js1k-style)

As I'm sure most of the JavaScripters out there are aware, there's a new, Christmas-themed js1k. I'm planning on entering this time, but I have no experience producing such minified code. Does anyone know any good resources for this kind of thing?
Google Closure Compiler is a good javascript minifier.
There is a good online tool for quick use, or you can download the tool and run it as part of a web site build process.
Edit: Added a non-exhaustive list of tricks that you can use to minify JavaScript extremely, before using a minifier:
Shorten long variable names
Use shortened references to built in variables like d=document;w=window.
Set Interval
The setInterval function can take either a function or a string. Pass in a string to reduce the number of characters used: setInterval('a--;b++',10). Note that passing in a string forces an eval invokation so it will be slower than passing in a function.
Reduce Mathematical Calculations
Example a=b+b+b can be reduced to a=3*b.
Use Scientific Notation
10000 can be expressed in scientific notation as 1E4 saving 2 bytes.
Drop leading Zeroes
0.2 = .2 saves a byte
Ternery Operator
if (a > b) {
result = x;
}
else {
result = y;
}
can be expressed as result=a>b?x:y
Drop Braces
Braces are only required for blocks of more than one statement.
Operator Precedence
Rely on operator precedence rather than adding unneeded brackets which aid code readability.
Shorten Variable Assignment
Rather than function x(){a=1,b=2;...}() pass values into the function, function x(a,b){...}(1,2)
Think outside the box
Don't automatically reach for standard ways of doing things. Rather than using d.getElementById('p') to get a reference to a DOM element, could you use b.children[4] where d=document;b=body.
Original source for above list of tricks:
http://thingsinjars.com/post/293/the-quest-for-extreme-javascript-minification/
Spolto is right.
Any code minifier won't do the trick alone. You need to first optimize your code and then make some dirty manual tweaks.
In addition to Spolto's list of tricks I want to encourage the use of logical operators instead of the classical if else syntax. ex:
The following code
if(condition){
exp1;
}else{
exp2;
}
is somewhat equivalent to
condition&&exp1||exp2;
Another thing to consider might be multiple variable declaration :
var a = 1;var b = 2;var c = 1;
can be rewritten as :
var a=c=1,b=2;
Spolto is also right about the braces. You should drop them. But in addition, you should know that they can be dropped even for blocks of more expressions by writing the expressions delimited by a comma(with a leading ; of course) :
if(condition){
exp1;
exp2;
exp3;
}else{
exp4;
exp5;
}
Can be rewritten as :
if(condition)exp1,exp2,exp3;
else exp4,exp5;
Although it's not much (it saves you only 1 character/block for those who are counting), it might come in handy. (By the way, the latest Google Closure Compiler does this trick too).
Another trick worth mentioning is the controversial with functionality.
If you care more about the size, then you should use this because it might reduce code size.
For example, let's consider this object method:
object.method=function(){
this.a=this.b;
this.c++;
this.d(this.e);
}
This can be rewritten as :
object.method=function(){
with(this){
a=b;
c++;
d(e);
}
}
which is in most cases signifficantly smaller.
Something that most code packers & minifiers do not do is replacing large repeating tokens in the code with smaller ones. This is a nasty hack that also requires the use of eval, but since we're in it for the space, I don't think that should be a problem. Let's say you have this code :
a=function(){/*code here*/};
b=function(){/*code here*/};
c=function(){/*code here*/};
/*...*/
z=function(){/*code here*/};
This code has many "function" keywords repeating. What if you could replace them with a single(unused) character and then evaluate the code?
Here's how I would do it :
eval('a=F(){/*codehere*/};b=F(){/*codehere*/};c=F(){/*codehere*/};/*...*/z=F(){/*codehere*/};'.replace(/function/g,'F'));
Of course the replaced token(s) can be anything since our code is reduced to an evaluated string (ex: we could've replaced =function(){ with F, thus saving even more characters).
Note that this technique must be used with caution, because you can easily screw up your code with multiple text replacements; moreover, you should use it only in cases where it helps (ex: if you only have 4 function tokens, replacing them with a smaller token and then evaluating the code might actually increase the code length :
var a = "eval(''.replace(/function/g,'F'))".length,
b = ('function'.length-'F'.length)*4;
alert("you should" + (a<b?"":" NOT") + " use this technique!");
In the following link you'll find surprisingly good tricks to minify js code for this competition:
http://www.claudiocc.com/javascript-golfing/
One example: (extracted from section Short-circuit operators):
if (p) p=q; // before
p=p&&q; // after
if (!p) p=q; // before
p=p||q; // after
Or the more essoteric Canvas context hash trick:
// before
a.beginPath
a.fillRect
a.lineTo
a.stroke
a.transform
a.arc
// after
for(Z in a)a[Z[0]+(Z[6]||Z[2])]=a[Z];
a.ba
a.fc
a.ln
a.sr
a.to
a.ac
And here is another resource link with amazingly good tricks: https://github.com/jed/140bytes/wiki/Byte-saving-techniques
First off all, just throwing your code into a minifier won't help you that much. You need to have the extreme small file size in mind when you write the code. So in part, you need to learn all the tricks yourself.
Also, when it comes to minifiers, UglifyJS is the new shooting star here, its output is smaller than GCC's and it's way faster too. And since it's written in pure JavaScript it should be trivial for you to find out what all the tricks are that it applies.
But in the end it all comes down to whether you can find an intelligent, small solution for something that's awsome.
Also:
Dean Edwards Packer
http://dean.edwards.name/packer/
Uglify JS
http://marijnhaverbeke.nl/uglifyjs
A friend wrote jscrush packer for js1k.
Keep in mind to keep as much code self-similar as possible.
My workflow for extreme packing is: closure (pretty print) -> hand optimizations, function similarity, other code similarity -> closure (whitespace only) -> jscrush.
This packs away about 25% of the data.
There's also packify, but I haven't tested that myself.
This is the only online version of #cowboy's packer script:
http://iwantaneff.in/packer/
Very handy for packing / minifying JS

Categories

Resources