Comparing large strings in JavaScript with a hash

Comparing large strings in JavaScript with a hash - javascript

I have a form with a textarea that can contain large amounts of content (say, articles for a blog) edited using one of a number of third party rich text editors. I'm trying to implement something like an autosave feature, which should submit the content through ajax if it's changed. However, I have to work around the fact that some of the editors I have as options don't support an "isdirty" flag, or an "onchange" event which I can use to see if the content has changed since the last save.
So, as a workaround, what I'd like to do is keep a copy of the content in a variable (let's call it lastSaveContent), as of the last save, and compare it with the current text when the "autosave" function fires (on a timer) to see if it's different. However, I'm worried about how much memory that could take up with very large documents.
Would it be more efficient to store some sort of hash in the lastSaveContent variable, instead of the entire string, and then compare the hash values? If so, can you recommend a good javascript library/jquery plugin that implements an appropriate hash for this requirement?

In short, you're better off just storing and comparing the two strings.
Computing a proper hash is not cheap. For example, check out the pseudo code or an actual JavaScript implementation for computing the MD5 hash of a string. Furthermore, all proper hash implementations will require enumerating the characters of the string anyway.
Furthermore, in the context of modern computing, a string has to be really, really long before comparing it against another string is slow. What you're doing here is effectively a micro-optimization. Memory won't be an issue, nor will the CPU cycles to compare the two strings.
As with all cases of optimizing: check that this is actually a problem before you solve it. In a quick test I did, computing and comparing 2 MD5 sums took 382ms. Comparing the two strings directly took 0ms. This was using a string that was 10000 words long. See http://jsfiddle.net/DjM8S.
If you really see this as an issue, I would also strongly consider using a poor-mans comparison; and just comparing the length of the 2 strings, to see if they have changed or not, rather than actual string comparisons.
..

An MD5 hash is often used to verify the integrity of a file or document; it should work for your purposes. Here's a good article on generating an MD5 hash in Javascript.

I made a JSperf rev that might be useful here for performance measuring. Please add different revisions and different types of checks to the ones I made!
http://jsperf.com/long-string-comparison/2
I found two major results
When strings are identical performance is murdered; from ~9000000 ops/s to ~250 ops/sec (chrome)
The 64bit version of IE9 is much slower on my PC, results from the same tests:
+------------+------------+
| IE9 64bit | IE9 32bit |
+------------+------------+
| 4,270,414 | 8,667,472 |
| 2,270,234 | 8,682,461 |
+------------+------------+
Sadly, jsperf logged both results as simply "IE 9".
Even a precursory look at JS MD5 performance tells me that it is very, very slow (at least for large strings, see http://jsperf.com/md5-shootout/18 - peaks at 70 ops/sec). I would want to go as far as to try AJAXing the hash calculation or the comparison to the backend but I don't have time to test, sorry!

Related

Store language (ISO 639) as Number

I'm working on a MongoDB database and so far have stored some information as Numbers instead of Strings because I assumed that would be more efficient. For example, I store countries following ISO 3166-1 numeric and sex following ISO/IEC 5218. But so far I have not found a similar standard for languages, ISO 639 does not appear to have a matching list of numeric codes.
What would be the right way to do this? Should I just use the String codes?
Thanks!

If you're a fan of the numbers, you can use country calling codes, although they "only" represent the ITU members (193 countries according to Wikipedia). But hey, they have Somalia and Palestine, so that's a good hint about how global this is.
However, storing everything in an encoded format (numbers here) implies a decoding step on the fly when any piece of data is requested (with translation tables stored in RAM instead of DB's ROM). Probably on the server whose CPU is precious, but you might have deported the issue on the client, overworking the precious, time-critical server-client link in the process.
So, back in the 90s, when a 40MB HDD was expensive, that might have been interesting. Today, the cost of storing data vs. the cost of processing data is not on the same side of 1... Not counting the time it takes you to think and implement the transformations. All being said "IMHO", I think this level of efficiency actually kills efficiency. ;)
EDIT: Oops, just realized I misthought (does that verb even exist?) the country/language issue. Countries you have sorted out already, my bad. I know no numbered list of languages. However, the second part of the post might still be relevant...

If you are after raw performance and/or want to achieve really small data sizes, I would suggest you use either the three-letter (higher granularity) or the two-letter (lower granularity) codes from IOC ISO-639-1/2.
To my knowledge, there's no helper or anything for this standard built into any programming language that I know, so you'd need to build your own translator (code<->full name) which, however, should be trivial.
And as others already mentioned, you have to assess the cost involved with this (e.g. not being able to simply look at the data and understand it right away anymore) for yourself. I personally do recommend keeping data sizes small since BSON parsing and string operations are horribly expensive compared to dealing with numbers (or shorter strings for that matter). When dealing with small data sets, this won't make a noticeable difference. If, however, you need to churn through millions of documents or more optimizations like this can become mission critical.

Javascript: substring time complexity [duplicate]

If we have a huge string, named str1, say 5 million characters long, and then str2 = str1.substr(5555, 100) so that str2 is 100 characters long and is a substring of str1 starting at 5555 (or any other randomly selected position).
How JavaScript stores str2 internally? Is the string contents copied or the new string is sort of virtual and only a reference to the original string and values for position and size are stored?
I know this is implementation dependent, ECMAScript standard (probably) does not define what's under the hood of the string implementation. But I want to know from some expert who knows V8 or SpiderMonkey from inside well enough to clarify this.
Thank you

AFAIK V8 has four string representations:
ASCII
UTF-16
concatenation of multiple strings
slice of another string
Adventures in the land of substrings and RegExps has great explanations and illustrations.
Thus, it does not have to copy the string; it just has to beginning and ending markers to the other string.
SpiderMonkey does the same thing. (See Large substrings ~9000x faster in Firefox than Chrome: why? ... though the answer for Chrome is outdated.)
This can give real speed boosts, but sometimes this is undesirable, since it can cause small strings to hold onto the memory of the larger parent string (V8 bug report)

This old blog post of mine explains it, as well as some other string representation forms: https://web.archive.org/web/20170607033600/http://blog.cdleary.com:80/2012/01/string-representation-in-spidermonkey/
Search for "dependent string". I think I know what you might be getting at with the question: they can be problematic things, at times, because if there are no references to the original, you can keep a giant string around in order to keep a bitty little substring that's actually semantically reachable. There are things that an implementation could do to mitigate that problem, like record information on a GC-generation basis to see if such one-dependent-string entities exist and collapse them to their minimal size, but last I knew of that was not being done. (Essentially with that kind of approach you're recovering runtime_refcount == 1 style information at GC-sweep time.)

How to test an MD5 implementation?

I am considering using a JS MD5 implementation.
But I noticed that there are only a few tests. Is there a good way of verifying that implementation is correct?
I know I can try it with a few different values and see if it works, but that only means it is correct for some inputs. I would like to see if it is correct for all inputs.

The corresponding RFC has a good description of the algorithm, an example implementation in C, and a handful of test values at the end. All three together let you make a good guess about the quality of the examined implementation and that's all you can get: a good guess.
Testing an applications with an infinite or at least a very large input set as a black box is hard, impossible even in most cases. So you have to check if the code implements the algorithm correctly. The algorithm is described in RFC-3121 (linked to above). This description is sufficient for an implementation. The algorithm itself is well known (in the scientific sense, i.e.: many papers have been written about it and many flaws have been found) and simple enough to skip the formal part, just inspect the implementation.
Problems to expect with MD5 in JavaScript: input of one or more zero bytes (you can check the one and two bytes long inputs thoroughly), endianess (should be no problem but easy to check) and the problem of the unsigned integer used for bit-manipulation in JavaScript (">>" vs. ">>>" but also easy to check for). I would also test with a handful of data with all bits set.
The algorithm needs padding, too, you can check it with all possible input of length shorter than the limit.
Oh, and for all of you dismissing the MD5-hash: it still has its uses as a fast non-cryptographic hash with a low collision-rate and a good mixing (some call the effect of the mixing "avalanche", one bit change in the input changes many bits in the output). I still use it for larger, non-cryptographic Bloom-filters. Yes, one should use a special hash fitting the expected input but constructing such a hash function is a pain in the part of the body Nature gave us to sit on.

Why does node.js suddenly use less memory?

I have a 25MB json file, that I "require" why my app starts up. Initial it seems that the node.js process takes up almost 200MB of memory.
But if I leave it running and come back to it, Activity monitor reports that it is using only 9MB which makes no sense at all! At the very least, it should be a few MB more, since even a simple node.js app that does almost nothing (acting like a server), uses 9MB.
The app seems to work fine - it is a server, that provides search suggestions form a word list of 220,000 words.
Is Activity Monitor wrong ?
Why is it using only 9MB, but initially used ~200MB when the application started up ?

Since it's JavaScript things that are no longer being used are removed via Garbage Collector(GC), freeing memory. Everything (or many things) may have been loaded into memory at the start. Then items that were not longer needed were removed from memory by the GC. Usually generation can take more memory in progress and lose some afterwards, for example temporary data-structures can be used in progress but are not longer needed when the process is done.
It's also possible that items in memory where swapped out and written to the disk temporally (and may be later retrieved), this swapping this is done by your OS and tends to be used more on programs that reserve a lot of memory.

How much memory it takes to load the file depends on a number of factors.
What text encoding is being used to store the file? JavaScript uses UTF-16 internally, so if that's not what's being used on disk, the size may be different. If the file is in UTF-32, for example, then the in-memory UTF-16 version will be smaller unless it's full of astrals. If the file is in UTF-8, then things are reversed: the in-memory version will be larger unless it's full of astrals. But for now, let's just assume that they're about the same size, either because they use the same encoding or the pattern of astrals just happens to make the file sizes more or less the same.
You're right that it takes at least 25MB to load the file (assuming that encodings don't interfere). The semantics of the JSON API being what they are, you need to have the whole file in memory as a string, so the app will take up at least that much memory at that time. That doesn't count whatever the parser needs to run, so you need at least 34MB: 25 for the file, 9 for Node, and then whatever your particular app uses for itself.
But your app doesn't need all of that memory all the time. Depending on how you've written the app, you're probably destroying your references to the file at some point.
Because of the semantics of JSON, there's no way to avoid loading the whole file into memory, which takes 25MB because that's the size of the file. There's also no way to avoid taking up whatever memory the JSON parser needs to do its work and build the object.
But depending on how you've written the app, there probably comes a point when you no longer need that data. Either you exit the function that you used to load the file, or you assign that variable to something else, or any of a number of other possibilities. However it happens, JavaScript reclaims memory that's not being used anymore. This is called garbage collection, and it's popular among so-called "scripting languages" (though other programming languages can use it too).
There's also the question of text representation versus in-memory representation. Strings require about the same amount of space in memory versus on-disk, unless you change the encoding, but Numbers and Booleans are another matter entirely. In JavaScript, all Numbers are 64-bit floating-point numbers, so if most of your numbers on disk are more than four characters long, then the in-memory representation will be smaller, possibly by quite a bit. Note that I said characters, not digits: it's true that digits are characters, but +, -, e, and . are characters too, so -1e0 takes up as twice as much space as -1 when written as text, even though they represent the same value in memory. As another example, 3.14 takes up as much space as 1000 as text (and happen to take up the same amount of space in memory: 64 bits each). But -0.00000001 and 100000000 take up much less space in memory than on disk, because the in-memory representation is smaller. Booleans can be even smaller: different engines store them in different ways, but you could theoretically do it in as little as one bit. That's a far cry from the 8 bytes it takes to store "true", or 10 to store "false".
So if your data is mostly about Numbers and Booleans, then the in-memory representation stands to get a lot smaller. If it's mostly Strings, then not so much.

Creating and parsing huge strings with javascript?

I have a simple piece of data that I'm storing on a server, as a plain string. It is kind of ridiculous, but it looks like this:
name|date|grade|description|name|date|grade|description|repeat for a long time
this string can be up to 1.4mb in size. The idea is that it's a bunch of student records, just strung together with a simple pipe delimeter. It's a very poor serialization method.
Once this massive string is pushed to the client, it is split along the pipes into student records again, using javascript.
I've been timing how long it takes to create, and split, these strings on the client side. The times are actually quite good, the slowest run I've seen on a few different machines is 0.2 seconds for 10,000 'student records', which has a final string size of ~1.4mb.
I realize this is quite bizarre, just wondering if there are any inherent problems with creating and splitting such large strings using javascript? I don't know how different browsers implement their javascript engines. I've tried this on the 'major' browsers, but don't know how this would perform on earlier versions of each.
Yeah looking for any comments on this, this is more for fun than anything else!
Thanks

String splitting for 1.4mb data is not a problem for decent machines, instead you should worry about the internet connection speed of your users. I've tried to do spell check with 800 kb dictionary (which is half of your data), main issue was loading time.
But looks like your students records data could be put in database, and might not need to load everything at loading time, So, how about do a pagination to show user records or use ajax to request to search certain user names?

If it's a really large string it may pay to continuously slice the string with 'string'.slice(from, to) to only process a smaller subset, appending all of the individual items to the end of the output with list.push() or something similar might work.
String split methods are probably the most efficient way of doing this though, even in IE. Processing individual characters using string.charAt(x) is extremely slow and will often show a security error as it stalls the browser. Using string split methods would certainly be much faster than splitting using regular expressions.
It may also be possible to encode the data using a JSON array, some newer browsers such as IE8/Webkit/FF3.5 have fast JSON parsing built in using JSON.parse(data). But using eval(JSON) may overflow the browser if there's enough data, so is probably a bad idea. It may pay to compare for performance though.
A much better approach in a lot of cases is to use AJAX and only load some of the data at once from the server, which would also save download time.

Besides S. Mark's excellent comments about local vs. x-fer speed and the tip to re-encode using AJAX, I suggest a (longterm) move away from JavaScript in the Browser (assuming that's were it runs) to either a non-browser implementation of JS (or possibly another language).
A browser based JS seems a week link in a data-x-fer chain and nothing I would want to run unmonitored, since the browsers are upgraded from time to time and breaking your JS-x-fer might be an unanticipates side effect!

Develop Reference

JavaScript is the programming language of the Web.