The question title almost says it all: do longer keys make for slower lookup?
Is:
someObj["abcdefghijklmnopqrstuv"]
Slower than:
someObj["a"]
Another sub-question is whether the type of the characters in the string used as key matters. Are alphanumeric key-strings faster?
I tried to do some research; there doesn't seem to be much info online about this. Any help/insight would be extremely appreciated.
In general, no. In the majority of languages, string literals are 'interned': the string is hashed once and reused, which makes lookups much faster. There may be some discrepancies between different JavaScript engines, but overall, if they're implemented well (cough IE cough), lookups should be fairly equal. Since JavaScript engines are constantly being developed, this is (probably) an easy thing to optimize, and the situation will improve over time.
However, some engines also limit the length of strings that are interned, so YMMV across browsers. The jsperf test (linked in the comments on the question) offers some insight as well: Firefox evidently does much more aggressive interning.
As for the types of characters, the string is treated as just a bunch of bytes regardless of the charset, so that probably won't matter either. Engines might optimize keys that can be used in dot notation, but I don't have any evidence for that.
The performance is the same if we are talking about Chrome, which uses the V8 JavaScript engine. Based on V8's design documentation, the sections on "fast property access" and "dynamic machine code generation" show that in the end those keys are compiled much like any other C++ class member.
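If you want to check on your own engine, a rough micro-benchmark along these lines can give an idea. This is only a sketch; the object, key names, and iteration count are arbitrary, and JIT warm-up can easily skew the numbers:

var obj = { a: 1, abcdefghijklmnopqrstuv: 2 };

function timeLookup(label, key) {
  var start = Date.now();
  var sum = 0;
  for (var i = 0; i < 10000000; i++) {
    sum += obj[key];              // the property lookup under test
  }
  console.log(label + ": " + (Date.now() - start) + " ms (" + sum + ")");
}

timeLookup("short key", "a");
timeLookup("long key", "abcdefghijklmnopqrstuv");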
Related
I recently happened to think about object property access times in JavaScript and came across this question, which seemed to reasonably suggest that access should be constant time. It also made me wonder whether there is a limit on object property key lengths. Apparently modern browsers support key lengths of up to 2^30, which seems quite good for a hash function. That said,
Is anyone aware of the kind of hash functions that are used by JS engines?
Is it possible to experimentally create collisions of JavaScript's property accessors?
Is anyone aware of the kind of hash functions that are used by JS engines?
Yes, their developers are certainly aware of the hash functions and the problems they have. In fact, attacks based on hash collisions were demonstrated in 2011 against a variety of languages, among others as a DoS attack against node.js servers.
The V8 team solved the issue; you can read about the details at https://v8project.blogspot.de/2017/08/about-that-hash-flooding-vulnerability.html.
Is it possible to experimentally create collisions of JavaScript's property accessors?
It appears so: https://github.com/hastebrot/V8-Hash-Collision-Generator
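To make the collision idea concrete, here is a toy example. This is not V8's actual hash function; it is a simple 31-based rolling hash (the Java String.hashCode style), used only to illustrate that two distinct keys can hash to the same value:

// Toy 31-based rolling hash, NOT what V8 uses.
function toyHash(str) {
  var h = 0;
  for (var i = 0; i < str.length; i++) {
    h = (h * 31 + str.charCodeAt(i)) | 0;  // keep the value in 32-bit range
  }
  return h;
}

console.log(toyHash("Aa"), toyHash("BB"));  // both print 2112: a collision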
I have a script which is obfuscated and begins like this:
var _0xfb0b=["\x48\x2E\x31\x36\x28\x22\x4B\x2E
...it continues like that for more than 435,000 chars (the file is 425 kB), and at the end comes this:
while(_0x8b47x3--){if(_0x8b47x4[_0x8b47x3]){_0x8b47x1=_0x8b47x1[_0xfb0b[8]](
new RegExp(_0xfb0b[6]+_0x8b47x5(_0x8b47x3)+_0xfb0b[6],_0xfb0b[7]),
_0x8b47x4[_0x8b47x3]);} ;} ;return _0x8b47x1;}
(_0xfb0b[0],62,2263,_0xfb0b[3][_0xfb0b[2]](_0xfb0b[1])));
My question is: isn't it much harder for a browser to execute this compared to a non-obfuscated script, and if so, how much time am I probably losing because of the obfuscation? Older browsers like IE6, which are really not that performant at JavaScript, must spend a lot more time on it, right?
It certainly slows things down more significantly on older browsers (specifically when initializing), but it definitely slows things down even afterwards. I had a heavily obfuscated file that took about 1.2 seconds to initialize; unobfuscated, in the same browser and PC, it took about 0.2 seconds, so the difference is significant.
It depends on what the obfuscator does.
If it primarily just renames identifiers, I would expect it to have little impact on performance unless the identifier names it used were artificially long.
If it scrambles control or data flow, it could have arbitrary impact on code execution.
Some control flow scrambling can be done with only constant overhead.
You'll have to investigate the method of obfuscation to know the answer to this. Might be easier to just measure the difference.
The obfuscation you're using seems to just store all string constants in one array and put them back into the code where they originally were. The strings are obfuscated into the array but still come out as strings. (Try console.log(_0xfb0b) to see what I mean.)
It does, definitely, slow down the code INITIALIZATION. However, once that array has been initialized, the impact on the script is negligible.
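For illustration, a heavily simplified sketch of that pattern (not the actual obfuscator output) looks like this:

// All string constants are collected into one array of hex escapes,
// and the rest of the code refers to them only by index.
var _strings = ["\x6C\x6F\x67", "\x68\x65\x6C\x6C\x6F"];  // decodes to ["log", "hello"]
console[_strings[0]](_strings[1]);                         // same as console.log("hello")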
Enumerating the keys of JavaScript objects yields the keys in the order of insertion:
> for (key in {'z':1,'a':1,'b':1}) { console.log(key); }
z
a
b
This is not part of the standard, but is widely implemented (as discussed here):
ECMA-262 does not specify enumeration order. The de facto standard is to match
insertion order, which V8 also does, but with one exception:
V8 gives no guarantees on the enumeration order for array indices (i.e., a property
name that can be parsed as a 32-bit unsigned integer).
Is it acceptable practice to rely on this behavior when constructing Node.js libraries?
Absolutely not! It's not a matter of style so much as a matter of correctness.
If you depend on this "de facto" standard, your code might fail on an ECMA-262 5th Edition compliant interpreter, because that spec does not specify the enumeration order. Moreover, the V8 engine might change its behavior in the future, say in the interest of performance.
Definitely do not rely on the order of the keys. If the standard doesn't specify an order, then implementations are free to do as they please. Hash tables often underlie objects like these, and you have no way of knowing when one might be used. Javascript has many implementations, and they are all competing to be the fastest. Key order will vary between implementations, if not now, then in the future.
No. Rely on the ECMAScript standard, or you'll have to argue with the developers about whether a "de facto standard" exists like the people on that bug.
It's not advisable to rely on it naively.
You should also do your best to stick to the spec/standard.
However, there are often cases where the spec or standard limits what you can do. In programming I've encountered many implementations that deviate from or extend the specification, often because the specification doesn't cater to everything.
Sometimes people using specifics of an implementation will have test cases for that, though it's hard to make a reliable test case for keys being in order. It would mostly succeed by accident, or rather, it's difficult behavior to reliably reproduce.
If you do rely on an implementation specific, then you must document that. If your project requires portability (code to run on other people's setups out of your control, where you want maximum compatibility), then it's not a good idea to rely on an implementation specific such as key order.
Where you do have full control of the implementation being used, it's entirely up to you which implementation specifics you rely on, keeping in mind that you may still be forced to cater to portability due to the common need or desire to upgrade the implementation.
The best form of documentation for cases like this is inline, in the code itself, often with the intention of at least making it easy to identify areas to be changed should you switch from an implementation guaranteeing order to one not doing so.
You can make up whatever format you like, but it could be something like...
/** #portability: insertion_ordered_keys */
for(let key in object) console.log(key);
You might even wrap such cases up in code:
forEachKeyInOrderOfInsertion(object, console.log)
Again, likely something less verbose, but enough to identify the cases that depend on that.
Where your implementation guarantees key order, that just translates to the same as the original for loop.
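A minimal sketch of that wrapper, assuming the running engine happens to enumerate own string keys in insertion order, could look like this:

function forEachKeyInOrderOfInsertion(object, callback) {
  // Relies on the engine enumerating own string keys in insertion order.
  for (var key in object) {
    if (Object.prototype.hasOwnProperty.call(object, key)) {
      callback(key);
    }
  }
}

forEachKeyInOrderOfInsertion({ z: 1, a: 2, b: 3 }, console.log);  // z, a, b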
You can implement that as a JS function with platform detection, with CPP-style templating, with transpiling, etc. You might also want to wrap object creation, and be very careful about things crossing boundaries. If something loses order before reaching you (like a JSON decode of input from a client over the network), then you'll likely not have a solution to that solely within your library; the same applies when someone else is calling your library.
You'll likely not need all of that; as a minimum, mark the cases where you do something that might break later and document that the potential exists.
An obvious exception to that is if the implementation guarantees consistency. In that case you'll probably be wasting your time decorating everything if it's not really going to vary and it's already documented via the implementation. The implementation often is a spec, or has its own; you can choose to stick to that rather than a more generalised spec.
Ultimately, in each case you'll need to make a judgement call, and you may also choose to take a chance. As long as you're fully aware of the potential problems, including the potential of wasting time avoiding problems you won't necessarily actually have, that is, you know all the stakes and have considered your circumstances, then it's up to you what to do. There's no "should" or "shouldn't"; it's case specific.
If you're making public Node.js libraries, or libraries to be widely distributed beyond the scope of your control, then I'd say it's not good to rely on implementation specifics. Instead, at least include a disclaimer in the release notes that the library only caters to your stack, and that if people want to use it elsewhere they can fix it and put in a pull request. Otherwise, if it's not documented, it should be fixed.
I am searching for creative ways to obfuscate my JS code, so the users couldn't "beautify" it in less than 1 hour.
To be specific, I have an array whose values I need to hide, for at least an hour, from users who understand JS. At this point I'm planning to use ASCII codes and a Caesar cipher, for example. Any more creative ideas would be appreciated.
You could use a hashing algorithm, and only store the hashed result of the correct answer. To compare the entered answer to the correct answer, hash the entered answer and compare the hash codes. Although not completely safe, it will take some serious time to crack.
This of course requires that there are a lot of possible answers. For a question like "Which year was N.N. born?", you could easily brute force every possible answer in less than a second.
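A rough sketch of that approach, assuming a browser with the Web Crypto API (CORRECT_ANSWER_HASH is a placeholder for a digest you precompute offline):

// Only the precomputed digest of the correct answer ships with the page.
var CORRECT_ANSWER_HASH = "<precomputed sha-256 hex digest of the answer>";

async function sha256Hex(text) {
  var bytes = new TextEncoder().encode(text);
  var digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map(function (b) { return b.toString(16).padStart(2, "0"); })
    .join("");
}

async function isCorrectAnswer(enteredAnswer) {
  return (await sha256Hex(enteredAnswer)) === CORRECT_ANSWER_HASH;
}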
You could also add a series of noise patterns that aren't used in the contents of the array and eliminate them with various regexps. But why on earth would you want to do this? Please tell me that this isn't for a production environment. Obscurity is not security.
I would suggest extensive use of the comma operator, confusing shortcircuits, loops, prefix and postfix incrementing that doesn't do what it looks like it should, and making really good use of the way the assignment operator is evaluated left-to-right. For example:
j=(a=[],i=0);while(i-10)a[i++]=++j+i;
Would give 2, 4, 6, 8, 10, 12, 14, 16, 18, 20. It might not seem that confusing here, but the confusion really adds up depending on the size of the code.
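For reference, that one-liner is roughly equivalent to this plainly written version:

// Both versions leave a = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
var a = [];
var j = 0;
for (var i = 0; i < 10; i++) {
  j = j + 1;            // the ++j in the original
  a[i] = j + (i + 1);   // the postfix i++ has already advanced i before the addition
}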
The best way I know is to use the Closure Compiler in Advanced Mode (NOT Simple Mode).
A JavaScript program processed by the Closure Compiler in Advanced Mode is next to impossible to reverse-engineer, even after passing through a beautifier.
The downside is that there are many, many restrictions when using Advanced Mode. For those who can use it, it is definitely worth it.
I personally use the Dojo Toolkit and have my mobile programs optimized by the Closure Compiler in Advanced Mode. It makes the resulting files around 25% smaller than simple minification (which can be beaten by a beautifier) due to dead-code removal, inlining, virtualization, namespace flattening, etc. Performance is also enhanced because of the vigorous optimizations. I regularly put the resulting files through numerous beautifiers to make sure that they cannot be easily reverse-engineered.
I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided to write a parser instead. I started by writing a simple parser that remembered state by pushing/popping off of a stack, but things kept escalating until I had a recursive descent parser on my hands.
Soon after, I compared the performance of all my previous parsing methods. The recursive descent parser was by far the slowest. I'm stuck: Is it worth using a recursive descent parser for something simple, or am I justified in taking shortcuts? I would love to go the pure regex route, which is insanely fast (almost 3 times faster than the RD parser), but is very hacky and unmaintainable to a degree. I suppose performance isn't terribly important because compiled templates are cached, but is a recursive descent parser the right tool for every task? I guess my question could be viewed as more of a philosophical one: to what degree is it worth sacrificing maintainability/flexibility for performance?
Recursive descent parsers can be extremely fast.
These are usually organized with a lexer, which uses regular expressions to recognize language tokens that are fed to the parser. Most of the work in processing the source text is done character-by-character by the lexer using the insanely fast FSAs that the REs are often compiled into.
The parser only sees tokens occasionally compared to the rate at which the lexer sees characters, so its speed often doesn't matter. However, when comparing parser-to-parser speeds (ignoring the time required to lex the tokens), recursive descent parsers can be very fast, because they implement the parser stack using function calls, which are very efficient compared to a generic parser pushing its current state onto a simulated stack.
So, you can have your cake and eat it, too. Use regexps for the lexemes. Use the parser (any kind; recursive descent is just fine) to process lexemes. You should be pleased with the performance.
This approach also satisfies the observation made by other answers: write it in a way that makes it maintainable. Lexer/parser separation does this very nicely, I assure you.
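As a rough illustration of that split, here is a toy lexer/recursive-descent-parser pair for a made-up grammar of numbers, + and *. The names and grammar are purely illustrative, not the asker's template language:

// Lexer: a sticky regex does the character-level work and yields tokens lazily.
var TOKEN_RE = /\s*(?:(\d+)|(\+)|(\*))/y;

function* lex(src) {
  TOKEN_RE.lastIndex = 0;
  while (TOKEN_RE.lastIndex < src.length) {
    var m = TOKEN_RE.exec(src);
    if (!m) throw new Error("Unexpected character at " + TOKEN_RE.lastIndex);
    if (m[1]) yield { type: "num", value: Number(m[1]) };
    else if (m[2]) yield { type: "+" };
    else yield { type: "*" };
  }
}

// Recursive descent parser with one-token lookahead, pulling tokens as it goes.
// Grammar: expr -> term ('+' term)* ; term -> num ('*' num)*
function parse(src) {
  var stream = lex(src);
  var lookahead = stream.next().value;
  function advance() { var t = lookahead; lookahead = stream.next().value; return t; }

  function term() {
    var value = advance().value;                     // expects a number token
    while (lookahead && lookahead.type === "*") { advance(); value *= advance().value; }
    return value;
  }
  function expr() {
    var value = term();
    while (lookahead && lookahead.type === "+") { advance(); value += term(); }
    return value;
  }
  return expr();
}

console.log(parse("2 + 3 * 4"));  // 14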
Readability first, performance later...
So if your parser makes code more readable then it is the right tool
to what degree is it worth sacrificing maintainability/flexibility for performance?
I think it's very important to write clear, maintainable code as a first priority. Until your code not only shows that it is a bottleneck, but also that your application's performance suffers because of it, you should consider clear code to be the best code.
It's also important not to reinvent the wheel. The comment about taking a look at another parser is a very good one. Common solutions can often be found for writing routines such as this.
Recursion is very elegant when applied to something applicable. In my own experience, slow code due to recursion is the exception, not the norm.
A Recursive Descent Parser should be faster
...or you're doing something wrong.
First off, your code should be broken into two distinct steps: Lexer + Parser.
Some reference examples online will tokenize the entire input first into a large intermediate data structure, then pass that along to the parser. While that's good for demonstration, don't do it; it doubles time and memory complexity. Instead, as soon as a match is determined by the lexer, notify the parser of either a state transition or a state transition plus data.
As for the lexer, this is probably where you'll find your current bottleneck. If the lexer is cleanly separated from your parser, you can try swapping between Regex and non-Regex implementations to compare performance.
Regex isn't, by any means, faster than reading raw strings. It just avoids some common mistakes by default, specifically the unnecessary creation of string objects. Ideally, your lexer should scan your code and produce output with zero intermediate data beyond the bare minimum required to track state within your parser. Memory-wise, you should have:
Raw input (i.e. the source)
Parser state (e.g. isExpression, isStatement, row, col)
Data (e.g. AST, tree, 2D array, etc.)
For instance, if your current lexer matches a non-terminal and copies every char over one by one until it reaches the next terminal, you're essentially recreating that string for every letter matched. Keep in mind that strings are immutable; concat will always create a new string. You should be scanning the text using pointer arithmetic or its equivalent.
To fix this problem, you need to scan from the startPos of the non-terminal to the end of the non-terminal and copy only when a match is complete.
Regex supports all of this by default out of the box, which is why it's a preferred tool for writing lexers. Instead of trying to write a Regex that parses your entire grammar, write one that focuses only on matching terminals and non-terminals as capture groups. Skip tokenization, and pass the results directly into your parser/state machine.
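As a sketch of that idea, here is a sticky regex with two capture groups scanning a made-up {{ ... }} template syntax and handing each match straight to callbacks, with no intermediate token list. The syntax and names are invented for illustration:

// Group 1 captures an expression inside {{ ... }}, group 2 captures literal text.
var CHUNK_RE = /\{\{\s*([^}]+?)\s*\}\}|([^{]+|\{)/y;

function scan(src, onExpression, onText) {
  CHUNK_RE.lastIndex = 0;
  var m;
  while (CHUNK_RE.lastIndex < src.length && (m = CHUNK_RE.exec(src))) {
    if (m[1] !== undefined) onExpression(m[1]);  // hand the captured slice straight to the parser
    else onText(m[2]);                           // no per-character copying in user code
  }
}

scan("Hello {{ name }}, you have {{ count }} messages",
     function (expr) { console.log("expr:", expr); },
     function (text) { console.log("text:", JSON.stringify(text)); });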
The key here is, don't try to use Regex as a state machine. At best it will only work for Regular (ie Chomsky Type III, no stack) declarative syntaxes -- hence the name Regular Expression. For example, HTML is a Context-Free (ie Chomsky Type II, stack based) declarative syntax, which is why a Regex alone is never enough to parse it. Your grammar, and generally all templating syntaxes, fall into this category. You've clearly hit the limit of Regex already, so you're on the right track.
Use Regex for tokenization only. If you're really concerned with performance, rewrite your lexer to eliminate any and all unnecessary string copying and/or intermediate data. See if you can outperform the Regex version.
The key being: the Regex version is easier to understand and maintain, whereas your hand-rolled lexer will likely be just a tinge faster if written correctly. Conventional wisdom says do yourself a favor and prefer the former. In terms of Big-O complexity, there shouldn't be any difference between the two; they're two forms of the same thing.