JavaScript String split: fixed-width vs. delimited performance

I am building a string to be parsed into an array by JavaScript. I can make it delimited or I can make the fields fixed-width. To test it, I built this jsperf test using a data string where the fields are both fixed-width and comma-delimited:
https://jsperf.com/string-split-fixed
I have only tested on Windows with Firefox and Chrome, so please run the test from other OSes and browsers. The results in those two browsers are clear: String.prototype.split() is the winner by a large margin.
Is my fixed-width code not efficient enough, or is the built-in string split function simply superior? Is there a way to code it so that the fixed-width parsing triumphs? If this were C/C++, the fixed-width code, written properly, would be the clear winner. But I know JavaScript is an entirely different beast.

String.prototype.split() is a built-in JavaScript function. Expect it to be highly optimized for the particular JS engine and to be written not in JavaScript but in C++.
It should thus come as no surprise that you can't match its performance with pure JavaScript code.
String operations like splitting a delimited string are inherently memory-bound. Hence, knowing the location of delimiters doesn't really help much, since the entire string still needs to be traversed at least once (to copy the delimited fragments). Fixed-position splitting might be faster for strings that exceed D-cache size, but your string is just 13KB long, so traversing it multiple times isn't going to matter.
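To make the comparison concrete, here is an illustrative sketch of the two approaches (the original jsperf code is not reproduced here, so the field width and sample data below are assumptions):

// Illustrative only: the field width and data are made up, not the jsperf test string.
var FIELD_WIDTH = 10; // 9 data characters plus the trailing comma
var data = "aaaaaaaaa,bbbbbbbbb,ccccccccc";

// Delimited: a single call into the engine's native split implementation.
var byDelimiter = data.split(",");

// Fixed-width: slice the string in a JavaScript loop.
var byWidth = [];
for (var i = 0; i < data.length; i += FIELD_WIDTH) {
    byWidth.push(data.slice(i, i + FIELD_WIDTH - 1)); // drop the delimiter
}

console.log(byDelimiter); // ["aaaaaaaaa", "bbbbbbbbb", "ccccccccc"]
console.log(byWidth);     // ["aaaaaaaaa", "bbbbbbbbb", "ccccccccc"]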

Related

JS lexing---multi line string

I am making a JS lexer as part of my studies. In JS, single-line strings start with " or ' and end with the same character, except when that character is preceded by a backslash.
In my current code, I loop through every character and append it to the existing token based on flags like "string" or "regex". So it feels natural to implement multi-line strings with " or ' as well, because it seems that it would not affect any other part of my lexer.
Is there any practical reason why a newline is not allowed inside a string literal?
Many languages, but not all, prohibit unescaped newlines in string literals. So JavaScript is certainly not unique here.
But the motivation really has little to do with the ease, difficulty or efficiency of lexical analysis. In fact, for lexical analysis the simplest syntax is to allow any character rather than having to include special-case checks. [Note 1]
There are other considerations, though; notably, the importance of a program to be readable and easy to debug. Long strings put an extra load on someone reading the code, because they may not be aware that a section of program text is actually part of a string literal. (There's a similar problem with multiline comments, which is why it's usually considered good style to mark every line in a long comment in some way, for example with a vertical column of stars at the left-hand margin. No such solution exists for string literals, though.)
Also, unterminated multiline strings can be annoying to correct. If strings cannot span lines, the error will be detected on the line containing the problem. But a multiline string might continue until the beginning of the next string, triggering a syntax error only when the contents of that next string are accidentally parsed as program text. Or worse, resulting in a completely incorrect parse of what was supposed to be program text, followed by another incorrect string literal starting where the second literal ends, and continuing from there.
That also makes it hard for developer tools, such as editors and syntax highlighters, to deal with program text as it is being typed.
In the end, you may or may not find these arguments compelling, and a language designer might have other aesthetic preferences as well. I can't really speak for the original designers of the JavaScript language, and neither of us can take a voyage in time to argue with them and maybe change their decision.
For better or worse, languages are designed according to particular subjective judgements, and if the language is successful these judgements become permanent features. They are things you have to accept if you are using a language and they're not usually worth obsessing about. You get used to them, or you find a different language to program in, with its own syntax quirks.
When you design your own language, you will need to resolve a large number of syntactic questions, and you will undoubtedly run into cases where the answer is not clearcut because there is no objectively correct unique solution. Whatever you do, someone will want to argue with you. Perhaps you can refer them to this answer.
Notes:
There is actually a historic reason for not allowing multiline string literals, which is much clearer but has been more or less irrelevant for several decades.
Once Upon A Time, common filesystems considered text files to be linear arrays of fixed-length lines (often 80 character lines, matching a Hollerith card). One advantage of such a filesystem is that it could instantly navigate to a particular line number in a file, since all lines were the same length. But in any case, for systems where programs were entered on punched cards, the fixed length lines were just part of the environment.
To make all lines the same length, lines needed to be filled out with space characters. This would obviously make multiline string literals awkward, and that's why C never allowed multiline string literals, instead relying on a syntactic feature where consecutive string literals are automatically concatenated into a single literal.
In the end, fixed-line-length filesystems proved to be unpopular, and I don't think you're likely to run into one these days. But a careful reading of the C and Posix standards shows that such filesystems must still be usable by conforming implementations, with the consequence that a fully portable program must be prepared to deal with line length limits on output and trailing whitespace on input.
There is also this syntax (a backslash at the end of a line continues the string literal without inserting a newline):
const string =
'line1\
line2\
line3'

Javascript: substring time complexity [duplicate]

Suppose we have a huge string, named str1, say 5 million characters long, and then str2 = str1.substr(5555, 100), so that str2 is 100 characters long and is a substring of str1 starting at position 5555 (or any other randomly selected position).
How does JavaScript store str2 internally? Are the string contents copied, or is the new string sort of virtual, storing only a reference to the original string plus values for position and size?
I know this is implementation dependent, and the ECMAScript standard (probably) does not define what's under the hood of the string implementation. But I want to hear from some expert who knows V8 or SpiderMonkey from the inside well enough to clarify this.
Thank you
AFAIK V8 has four string representations:
ASCII
UTF-16
concatenation of multiple strings
slice of another string
Adventures in the land of substrings and RegExps has great explanations and illustrations.
Thus, it does not have to copy the string; it just has to store beginning and ending markers that point into the other string.
SpiderMonkey does the same thing. (See Large substrings ~9000x faster in Firefox than Chrome: why? ... though the answer for Chrome is outdated.)
This can give real speed boosts, but sometimes this is undesirable, since it can cause small strings to hold onto the memory of the larger parent string (V8 bug report)
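A commonly mentioned, engine-dependent workaround (not from the answers above; whether concatenation actually produces a new flat string is an implementation detail) is to force a copy of the slice so the large parent string can be garbage-collected:

// Hedged sketch: treat the flattening trick as a heuristic, not a guarantee.
var str1 = new Array(5e6).join("x");   // ~5 million characters
var str2 = str1.substr(5555, 100);     // may be stored internally as a slice of str1

var flattened = (" " + str2).slice(1); // concatenation usually yields a new, flat string
str1 = null;                           // the huge parent can now (hopefully) be collected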
This old blog post of mine explains it, as well as some other string representation forms: https://web.archive.org/web/20170607033600/http://blog.cdleary.com:80/2012/01/string-representation-in-spidermonkey/
Search for "dependent string". I think I know what you might be getting at with the question: they can be problematic things, at times, because if there are no references to the original, you can keep a giant string around in order to keep a bitty little substring that's actually semantically reachable. There are things that an implementation could do to mitigate that problem, like record information on a GC-generation basis to see if such one-dependent-string entities exist and collapse them to their minimal size, but last I knew of that was not being done. (Essentially with that kind of approach you're recovering runtime_refcount == 1 style information at GC-sweep time.)

JavaScript String.split on a string literal to produce an array

I've seen a few javascript programmers use this pattern to produce an array:
"test,one,two,three".split(','); // => ["test", "one", "two", "three"]
They're not splitting user input or some variable holding a string value; they're splitting a hard-coded string literal to produce an array. In every case where I've seen a line like the above, it would seem perfectly reasonable to just use an array literal rather than relying on split to create an array from a string. Are there any reasons that the above pattern for creating an array makes sense, or is somehow more efficient than simply using an array literal?
When splitting a string at run-time instead of using an array literal, you are trading a small amount of execution time for a small amount of bandwidth.
In most cases I would argue that it is not worth it. If you are minifying and gzipping your code before publishing it, as you should be, using a single comma inside of a string versus a quote-comma-quote from two strings in an array would have little to no impact on bandwidth savings. In fact after minification and gzipping, the version using the split string could possibly be longer due to the addition of the less compressible .split(',').
Splitting a string instead of creating an array literal of strings does mean a little less typing, but we spend more time reading code than writing it. Using the array literal would be more maintainable in the future. If you wanted to add a comma to an item in the array, you just add it as another string in the array literal; using split you would have to re-write the whole string using a different separator.
The only situation that I use split and a string literal to create an array is if I need an array that consists only of single characters, i.e. the alphabet, numbers or alphanumeric characters, etc.
var letters = 'abcdefghijklmnopqrstuvwxyz'.split(''),
    numbers = '0123456789'.split(''),
    alphanumeric = letters.concat(numbers);
You'll notice that I used concat to create alphanumeric. If I had instead copy-pasted the contents of letters and numbers into one long string this code would compress better. The reason I did not is that that would be a micro-optimization that would hurt future maintainability. If in the future characters with accents, tildes or umlauts need to be added to the list of letters, they can be added in one place; no need to remember to copy-paste them into alphanumeric too.
Splitting a string may be useful for code golf, but in a production environment where minification and gzipping are factors and writing easily readable, maintainable code is important, just using an array literal is almost always the better choice.
For example, in Ruby, the array ["test", "one", "two", "three"]
could also be written as %w(test one two three), which saves you a few characters of typing.
But JavaScript doesn't support such a notation, so some people use the split method to achieve the same effect.
If a large number of arrays are built manually, it could potentially reduce the load time of your page, since fewer characters are transmitted. But you would need a large number of arrays, or very large arrays, to see a considerable difference. Performance-wise, array creation via split might be slower, but it is faster to type. For large arrays I use a spreadsheet to apply the formatting around each value with a formula like ="'"&A1&"',". I stick with the array literal.
It makes more sense to use the split method when you can't control the string, i.e. user input or the string output of some other method. If you're trying to get a specific value out of a string that is separated by some delimiter, it is easier to use split. Of course, if you're controlling the values, it doesn't always make sense.
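A small illustrative example (the data is made up): split earns its keep when the string comes from somewhere you don't control, such as a form field or an API response:

var input = "red, green , blue"; // imagine this came from user input
var tags = input.split(",").map(function (tag) {
    return tag.trim();
});
console.log(tags); // ["red", "green", "blue"]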

JavaScript string concatenation speed

Can someone explain this one to me:
http://jsperf.com/string-concatenation-1/2
If you're lazy, I tested A) vs B):
A)
var innerHTML = "";
items.forEach(function(item) {
    innerHTML += item;
});
B)
var innerHTML = items.join("");
Where items for both tests is the same 500-element array of strings, with each string being random and between 100 and 400 characters in length.
A) ends up being 10x faster. How can this be--I always thought concatenating using join("") was an optimization trick. Is there something flawed with my tests?
Using join("") was an optimization trick for composing large strings on IE6 to avoid O(n**2) buffer copies. It was never expected to be a huge performance win for composing small strings since the O(n**2) only really dominates the overhead of an array for largish n.
Modern interpreters get around this by using "dependent strings". See this mozilla bug for an explanation of dependent strings and some of the advantages and drawbacks.
Basically, modern interpreters know about a number of different kinds of strings:
An array of characters
A slice (substring) of another string
A concatenation of two other strings
This makes concatenation and substring O(1) at the cost of sometimes keeping too much of a substringed buffer alive resulting in inefficiency or complexity in the garbage collector.
Some modern interpreters have played around with the idea of further decomposing (1) into byte[]s for ASCII only strings, and arrays of uint16s when a string contains a UTF-16 code unit that can't fit into one byte. But I don't know if that idea is actually in any interpreter.
Here the author of the Lua programming language explains the buffer-copy overhead that Mike Samuel describes above. The examples are in Lua, but the issue is the same in JavaScript.
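If you want to reproduce the comparison outside jsperf, here is a minimal, unscientific sketch (the array size and string lengths follow the question; absolute timings will vary by engine):

// Build 500 strings of roughly 100-400 characters, as in the question.
var items = [];
for (var i = 0; i < 500; i++) {
    items.push(new Array(100 + Math.floor(Math.random() * 300)).join("x"));
}

console.time("concatenation");
var a = "";
items.forEach(function (item) { a += item; });
console.timeEnd("concatenation");

console.time("join");
var b = items.join("");
console.timeEnd("join");

console.log(a === b); // true; only the construction strategy differs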

Can I depend on the behavior of charCodeAt() and fromCharCode() to remain the same?

I have written a personal web app that uses charCodeAt() to convert text that is input by the user into the relevant character codes (for example ⊇ is converted to 8839 for storage), which is then sent to Perl, which sends them to MySQL. To retrieve the input text, the app uses fromCharCode() to convert the numbers back to text.
I chose to do this because Perl's unicode support is very hard to deal with correctly. So Perl and MySQL only see numbers, which makes life a lot simpler.
My question is can I depend on fromCharCode() to always convert a number like 8834 to the relevant character? I don't know what standard it uses, but let's say it uses UTF-8, if it is changed to use UTF-16 in the future, this will obviously break my program if there is no backward compatibility.
I know that my ideas about these concepts aren't that clear, therefore please care to clarify if I've shown a misunderstanding.
fromCharCode and charCodeAt deal with UTF-16 code units, i.e. numbers between 0 and 65535 (0xffff), which coincide with Unicode code points as long as all characters are in the Basic Multilingual Plane (BMP). Unicode and its code points are permanent, so you can trust them to remain the same forever.
Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and output a byte stream. JavaScript is somewhat strange in that characters outside the BMP have to be constructed by two calls to fromCharCode (and read back with two calls to charCodeAt), according to UTF-16 surrogate-pair rules. However, virtually every character you'll ever encounter (including Chinese, Japanese etc.) is in the BMP, so your program will work even if you don't handle these cases.
One thing you can do is convert the numbers back into bytes (in big-endian int16 format) and interpret the resulting text as UTF-16. The behavior of fromCharCode and charCodeAt is fixed in current JavaScript implementations and will not change.
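A minimal sketch of the round trip the question describes (the helper names toCharCodes and fromCharCodes are made up for illustration):

function toCharCodes(text) {
    var codes = [];
    for (var i = 0; i < text.length; i++) {
        codes.push(text.charCodeAt(i)); // one number per UTF-16 code unit
    }
    return codes;
}

function fromCharCodes(codes) {
    return String.fromCharCode.apply(null, codes);
}

var codes = toCharCodes("⊇");      // [8839], as in the question
console.log(fromCharCodes(codes)); // "⊇"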
I chose to do this because Perl's unicode support is very hard to deal with correctly.
This is ɴᴏᴛ true!
Perl has the strongest Unicode support of any major programming language. It is much easier to work with Unicode if you use Perl than if you use any of C, C++, Java, C♯, Python, Ruby, PHP, or Javascript. This is not hyperbole and boosterism from uneducated, blind allegiance; it is a considered appraisal based on more than ten years of professional experience and study.
The problems encountered by naïve users are virtually always because they have deceived themselves about what Unicode is. The number-one worst brain-bug is thinking that Unicode is like ASCII but bigger. This is absolutely and completely wrong. As I wrote elsewhere:
It’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s true of nothing more than the stultified ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much much more than just the assignment of numbers to glyphs: rules for collation and comparisons, three forms of casing, non-letter casing, multi-codepoint casefolding, both canonical and compatible composed and decomposed normalization forms, serialization forms, grapheme clusters, word- and line-breaking, scripts, numeric equivs, widths, bidirectionality, mirroring, print widths, logical ordering exclusions, glyph variants, contextual behavior, locales, regexes, multiple forms of combining classes, multiple types of decompositions, hundreds and hundreds of critically useful properties, and much much much more‼
Yes, that’s a lot, but it has nothing to do with Perl. It has to do with Unicode. That Perl allows you to access these things when you work with Unicode is not a bug but a feature. That those other languages do not allow you full access to Unicode can by no means be construed as a point in their favor: rather, those are all major bugs of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.
Perl is not. And it is a gazillion times easier to do those things right in Perl than in those other languages; in most of them, you cannot even begin to work around their design flaws. You’re just plain screwed. If a language doesn’t provide full Unicode support, it is not fit for this century; discard it.
Perl makes Unicode infinitely easier than languages that don’t let you use Unicode properly can ever do.
In this answer, you will find at the front, Seven Simple Steps for dealing with Unicode in Perl, and at the bottom of that same answer, you will find some boilerplate code that will help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.
And that is why there is no simple answer. Perl makes it easy to work with Unicode, provided that you understand what Unicode really is. And if you’re dealing with external sources, you are going to have to arrange for that source to use some sort of encoding.
Also read up on all the stuff I said about 𝔸𝕤𝕤𝕦𝕞𝕖 𝔹𝕣𝕠𝕜𝕖𝕟𝕟𝕖𝕤𝕤. Those are things that you truly need to understand. Another brokenness issue that falls out of Rule #49 is that Javascript is broken because it doesn’t treat all valid Unicode code points in exactly the same way irrespective of their plane. Javascript is broken in almost all the other ways, too. It is unsuitable for Unicode work. Just Rule #34 will kill you, since you can’t get Javascript to follow the required standard about what things like \w are defined to do in Unicode regexes.
It’s amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of those!
In my opinion it won't break.
Read Joel Spolsky's article on Unicode and character encoding. The relevant part of the article is quoted below:
Every letter in every alphabet is assigned a number by the Unicode consortium which is written like this: U+0639. This number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. The English letter A would be U+0041.
It does not matter whether this magical number is encoded in UTF-8 or UTF-16 or any other encoding. The number will still be the same.
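A short illustration of that point (this assumes a runtime with TextEncoder, such as a modern browser or Node.js; it is not part of the quoted article):

var s = "\u0639";                          // ARABIC LETTER AIN, code point U+0639
console.log(s.charCodeAt(0).toString(16)); // "639" -- the code point never changes
console.log(new TextEncoder().encode(s));  // Uint8Array [216, 185] -- its UTF-8 byte form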
As pointed out in other answers, fromCharCode() and charCodeAt() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). Strings in JavaScript are sequences of UTF-16 code units (often described as UCS-2), and any code point outside the BMP is represented as two JavaScript characters (a surrogate pair). None of these things are going to change.
To handle any Unicode character on the JavaScript side, you can use the following function, which will return an array of numbers representing the sequence of Unicode code points for the specified string:
var getStringCodePoints = (function() {
    // Combine a UTF-16 surrogate pair into a single code point.
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    // Read the string code unit by code unit and build an array of code points.
    return function(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            if ((charCode & 0xF800) == 0xD800) {
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    };
})();

var str = "𝌆";
var codePoints = getStringCodePoints(str);

console.log(str.length);                 // 2
console.log(codePoints.length);          // 1
console.log(codePoints[0].toString(16)); // 1d306
JavaScript strings are UTF-16; this isn't something that is going to change.
But don't forget that UTF-16 is a variable-length encoding.
In 2018, you can use String.prototype.codePointAt() and String.fromCodePoint().
These methods work even if a character is not in the Basic Multilingual Plane (BMP).
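A short sketch of those code-point-aware methods (the astral character below is just an example):

var str = "𝌆";                                // U+1D306, outside the BMP
console.log(str.length);                      // 2 (two UTF-16 code units)
console.log(str.codePointAt(0).toString(16)); // "1d306"
console.log(String.fromCodePoint(0x1D306));   // "𝌆"
console.log([...str].length);                 // 1 (string iteration is code-point aware)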
