I read article about how JavaScript Engine works, but there is thing that confusing me, it says the following:
JavaScript code first parsed
The source code translated to bytecode
The bytecode gets optimized
The code generator takes bytecode and translates into low level assembly code
Is last step true?
Does above how JavaScript Engines work like "V8"?
(V8 developer here.)
Yes, the JavaScript engines used in "modern" (since 2008) browsers have just-in-time compilers that compile JavaScript to machine code. That's probably what that article meant to say. If you want to distinguish the terms "assembly language" and "machine code", then the former would be the human-readable form (such as mov eax, ebx, written by humans and produced by disassemblers) and the latter would be the binary-encoded form that the CPU understands (such as 0x89 0xD8, produced by compilers/assemblers). I'd say that the term "assembly code" is sufficiently ambiguous that it could refer to either, or could imply that you don't want to distinguish.
I find the third step in your description more misleading: byte code is typically not optimized. The bytecode interpreter, if it exists, is usually the engine's first execution tier, and its purpose is to start execution as soon as possible, without first spending any time on optimizations. If a function runs hot enough, the engine will eventually decide to spend the time to optimize it to machine code (depending on the engine, possibly in a succession of several increasingly powerful but costly compilers). These later, optimizing tiers may or may not take the bytecode as input; alternatively they can parse the source again to build an AST (taking V8 as a specific example, it used to do the latter and is currently doing the former).
Side note: that article is pretty silly indeed. Example:
techniques like inlining (removing white space)
That's so wrong that it's outright funny :-D
Related
The American Fuzzy Lop, and the conceptually related LLVM libfuzzer not only generate random fuzzy strings, but they also watch branch coverage of the code under test and use genetic algorithms to try to cover as many branches as possible. This increases the hit frequency of the more interesting code further downstream as otherwise most of the generated inputs will be stopped early in some deserialization or validation.
But those tools work at native code level, which is not useful for JavaScript applications as it would be trying to cover the interpreter, but not really the interpreted code.
So is there a way to fuzz JavaScript (preferably in browser, but tests running in node.js would help too) with coverage guidance?
I looked at the tools mentioned in this old question, but those that do javascript don't seem to mention anything about coverage profiling. And while radamsa mentions optionally pairing it with coverage analsysis, I haven't found any documentation on how to actually do it.
How can one fuzz-test java-script (in browser) application with coverage guidance?
Fuzzing a JavaScript engine draws a lot of attention as the number of browser users is about 4 Billion. Several works have been done to find bugs in JS engines, including popular large engines, e.g, v8, webkit, chakracore, gecko, or some small embedded engines, like jerryscript, QuickJS, jsish, mjs, mujs.
It is really difficult to find bugs using AFL as the mutation mechanisms provided by AFL is not practical for JS files, e.g, bitflip can hardly be a valid mutation. Since JS is a structured language, several works using ECMAScript grammar to mutate/generate JS files(seeds):
LangFuzz parses sample JS files and splits them into code fragments. It then recombines the fragments to produce test cases.
jsfunfuzz randomly generates syntactically valid JS statements from JS grammar manually written for fuzzing.
Dharma is a generation-based, context-free grammar fuzzer, generating files based on given grammar.
Superion extends AFL using tree-based mutation guided by JS grammar.
The above works can easily pass the syntax checks but fail at semantic checks. A lot of generated JS seeds are semantically invalid.
CodeAlchemist uses a semantics-aware approach to generate code segments based on a static type analysis.
There are two levels of bugs related to JS engines: simple parser/interpreter bugs and deep inside logic bugs. Recently, there is a trend that the number of simple bugs decreases while more and more deep bugs come out.
DIE uses aspect-preserving mutation to preserves the desirable properties of CVEs. It also using type analysis to generate semantic-valid bugs.
Some works focus on mutating intermediate representations.
Fuzzilli is a coverage-guided fuzzer based on mutation on the IR level. The mutations on IR can guarantee semantic validity and can be transferred to JS.
Fuzzing JS is an interesting and hot topic according to the top conference of security/SE in recent years. Hope this information is helpful.
On https://v8.dev/docs/ignition we can see that:
Ignition is a fast low-level register-based interpreter written using the backend of TurboFan
on https://docs.google.com/document/d/11T2CRex9hXxoJwbYqVQ32yIPMh0uouUZLdyrtmMoL44/edit?ts=56f27d9d#
The aim of the Ignition project is to build an interpreter for V8 which executes a low-level bytecode, thus enabling run-once or non-hot code to be stored more compactly in bytecode form.
The interpreter itself consists of a set of bytecode handler code snippets, each of which handles a specific bytecode and dispatches to the handler for the next bytecode. These bytecode handlers
To compile a function to bytecode, the JavaScript code is parsed to generate its AST (Abstract Syntax Tree). The BytecodeGenerator walks this AST and generates bytecode for each of the AST nodes as appropriate.
Once the graph for a bytecode handler is produced it is passed through a simplified version of Turbofan’s pipeline and assigned to the corresponding entry in the interpreter table.
So it seems that Ignition job is to take bytecode generated by BytecodeGenerator convert it to bytecode handlers and execute it through Turbofan.
But here:
and here:
You can see that it is ignition that produces bytecode.
What is more, in this video https://youtu.be/p-iiEDtpy6I?t=722 Ignition is said to be a baseline compiler.
So what's it?
A baseline compiler? A bytecode interpreter? An AST to bytecode transformer?
This image seems to be most appropriate:
where ignition is just an interpreter and everything before is no-name bytecode generator/optimizer thing.
V8 developer here.
On https://v8.dev/docs/ignition we can see that:
Ignition is a fast low-level register-based interpreter written using the backend of TurboFan
Yes, that sums it up. To add a little more detail:
The name "Ignition" refers to both the Bytecode Generator and the Bytecode Interpreter. Often, the entire thing is also seen as one big black box and casually called "the interpreter", which can sometimes lead to a bit of confusion around the terms.
The Bytecode Generator takes the AST produced by the Parser for a given JavaScript function, and generates bytecode from it.
The Bytecode Interpreter takes the bytecode generated by the Bytecode Generator and executes it by interpreting it by sending it to a set of Bytecode Handlers.
The Bytecode Handlers that make up the Bytecode Interpreter are generated using parts of the Turbofan pipeline. This happens at V8 compilation time, not at runtime. In other words, you need Turbofan to build (parts of) Ignition, but not to run Ignition.
The Parser (and the AST/Abstract Syntax Tree it produces are not part of Ignition.
Once the graph for a bytecode handler is produced it is passed through a simplified version of Turbofan’s pipeline and assigned to the corresponding entry in the interpreter table.
So it seems that Ignition job is to take bytecode generated by BytecodeGenerator convert it to bytecode handlers and execute it through Turbofan
This section of the design document talks about generating the Bytecode Handlers, which happens "ahead of time" (i.e. when V8 is compiled) using parts of Turbofan. At runtime, bytecode is not converted to handlers, it is "handled" (=run, executed, interpreted) by the existing fixed set of handlers, and Turbofan is not involved.
What is more, in this video https://youtu.be/p-iiEDtpy6I?t=722 Ignition is said to be a baseline compiler.
At that moment, the talk is referring to the general idea that all modern JavaScript engines have a "baseline compiler" (in a very general, conceptual sense -- I agree that the slide could have made that clearer). Note that the slide does not say anything about Ignition. The next slide says that Ignition fills that role in V8. So more accurate would be to say "Ignition takes the place of a baseline compiler" or "Ignition is a baseline execution engine". Or you could redefine your terms slightly and say "Ignition is a compiler that produces bytecode and then interprets it".
ignition is just an interpreter and everything before is no-name bytecode generator/optimizer thing
That slide shows the "Interpreter" box as part of the "Ignition Bytecode Pipeline". The Bytecode Generator/Optimizer are also part of Ignition.
As I mentioned in a comment, sadly some of the docs are out of date, including the one with your first graphic above. Full-codegen and Crankshaft are no longer used at all, it's purely parsing and Ignition + TurboFan. (you've removed the image from the outdated docs that sadly are still linked by some of the V8 docs)
Ignition is a high-speed bytecode interpreter.
V8's parser produces Ignition bytecode. That bytecode is executed (interpreted) by Ignition. Code that only runs once (startup code and such) or isn't run often stays at the bytecode level and continues to be executed by Ignition.
"Hot" code goes to the second phase, where TurboFan kicks in: TurboFan's input is the same bytecode that Ignition interprets (rather than source code, as it was with Crankshaft), which it then aggressively compiles to highly-optimized machine code that runs directly (rather than being interpreted).
This article goes into the motivations for moving off Full-codegen and Crankshaft (memory savings in the former case, difficulty implementing and in particular optimizing language features in the second). The design of TurboFan also helps the V8 authors minimize the amount of platform-specific code they have to write (by having an intermediate representation, which amongst other things they can also use to write Ignition's bytecode handlers).
Please correct my mistake in understanding the conversion process:
JS code parsed into AST
AST sent to Ignition to be converted into bytecode
Generate Machine Code
If byte code is optimal, Send it to the CPU for processing
If byte code needs optimization, it is sent to Turbofan for optimization
If Turbofan received byte code, it further optimizes, and sent to computer for processing.
To improve performance JavaScript engines sometimes only fully parse functions when they are actually called.
For example, from the Spidermonkey source code:
Checking the syntax of a function is several times faster than doing a full parse/emit, and lazy parsing improves both performance and memory usage significantly when pages contain large amounts of code that never executes (which happens often).
What steps can the parser skip while still being able to validate the syntax?
It appears that in Spidermonkey some of the savings come from not emitting bytecode, like after a full parse. Does a full parse in e.g. V8 also include generating machine code?
First off a clarification: the two steps are called "pre-parsing" and "full parsing". "Lazy parsing" describes the strategy of doing the former first, and then the latter when needed.
The two big reasons why pre-parsing is faster are:
it doesn't build the AST (abstract syntax tree), which is usually the output of the parser
it doesn't do scope resolution for variables.
There are a few other internal steps done by the parser in preparation for code generation that aren't needed when only checking for errors, but the above two are the major points.
(Full) parsing and code generation are separate steps.
I have read the question How to test and develop with asm.js?, and the accepted answer gives a link to http://kripken.github.com/mloc_emscripten_talk/#/.
The conclusion of that slide show is that "Statically-typed languages and especially C/C++ can be compiled effectively to JavaScript", so we can "expect the speed of compiled C/C++ to get to just 2X slower than native code, or better, later this year".
But what about non-statically-typed languages, such as regular JavaScript itself? Can it be compiled to asm.js?
Can JavaScript itself be compiled to asm.js?
Not really, because of its dynamic nature. It's the same problem as when trying to compile it to C or even to native code - you actually would need to ship a VM with it to take care of those non-static aspects. At least, such a VM is possible:
js.js is a JavaScript interpreter in JavaScript. Instead of trying to create an interpreter from scratch, SpiderMonkey is compiled into LLVM and then emscripten translates the output into JavaScript.
But if asmjs code runs faster than regular JS, then it makes sense to compile JS to asmjs, no?
No. asm.js is a quite restricted subset of JS that can be easily translated to bytecode. Yet you first would need to break down all the advanced features of JS to that subset for getting this advantage - a quite complicated task imo. But JavaScript engines are designed and optimized to translate all those advanced features directly into bytecode - so why bother about an intermediate step like asm.js? Js.js claims to be around 200 times slower than "native" JS.
And what about non-statically-typed languages in general?
The slideshow talks about that from …Just C/C++? onwards. Specifically:
Dynamic Languages
Entire C/C++ runtimes can be compiled and the original language
interpreted with proper semantics, but this is not lightweight
Source-to-source compilers from such languages to JavaScript ignore
semantic differences (for example, numeric types)
Actually, these languages depend on special VMs to be efficient
Source-to-source compilers for them lose out on the optimizations done in those VMs
In response to the general question "is it possible?" then the answer is that sure, both JavaScript and the asm.js subset are Turing complete so a translation exists.
Whether one should do this and expect a performance benefit is a different question. The short answer is "no, you shouldn't." I liken this to trying to compress a compressed file; yes, it is possible to run the compression algorithm, but in general you should not expect the resulting file to be smaller.
The short answer: The performance cost of dynamically-typed languages comes from the meaning of the code; a statically-typed program with an equivalent meaning would carry the same costs.
To understand this, it is important to understand why asm.js offers a performance benefit at all; or, more generally, why statically-typed languages perform better than dynamically-typed ones. The short answer is "run-time type checking takes time," and a longer answer would include the improved feasibility of optimizing statically-typed code. For example:
function a(x) { return x + 1; }
function b(x) { return x - 1; }
function c(x, y) { return a(x) + b(y); }
If x and y are both known to be integers, I can optimize function c to a couple of machine code instructions. If they could be integers or strings, the optimization problem becomes much harder; I have to treat these as string appends in some cases, and addition in other cases. In particular, there are four possible interpretations of the addition operation that occurs in c; it could be addition, or string append, or two different variants of coerce-to-string-and-append. As you add more possible types, the number of possible permutations grows; in the worst case for a dynamically-typed language, you have k^n possible interpretations of an expression involving n terms which could each have any number of k types. In a statically typed language, k=1, so there is always 1 interpretation of any given expression. Because of this, optimizers are fundamentally more efficient at optimizing statically-typed code than dynamically-typed code: There are fewer permutations to consider when searching for opportunities to optimize.
The point here is that when converting from dynamically-typed code to statically-typed code (as you'd be doing when going from JavaScript to asm.js), you have to account for the semantics of the original code. Meaning the type-checking still occurs (it's just now been spelled out statically-typed code) and all those permutations are still present to stifle the compiler.
A few facts about asm.js, which hopefully make the concept clear:
Yes you can write the asm.js dialect by hand.
If you did look at the examples for asm.js, they are very far from being user friendly. Obviously Javascript is not the front end language for creating this code.
Translating vanilla Javascript to asm.js dialect is not possible.
Think about it - if you already could translate standard Javascript in a fully statically manner, why would there be a need for asm.js? The sole existance of asm.js means that the Javascript JIT people at some people gave up on their promise that Javascript will get faster without any effort from the developer.
There are several reasons for this, but let's just say it would be really hard for the JIT to understand a dynamic language as good as a static compiler. And then probably for the developers to fully understand the JIT.
In the end it boils down to using the right tool for the task. If you want static, very performant code, use C / C++ ( / Java ) - if you want a dynamic language, use Javascript, Python, ...
asm.js has been created by the need of have a small subset of javascript which can be easily optimized. If you can have a way to convert javascript to javascript/asm.js, asm.js is not needed anymore - that method can be inserted in js interpreters directly.
In theory, it is possible to convert / compile / transpile any JavaScript operation to asm.js if it can be expressed with the limited subset of the language present in asm.js. In practice, however, there is no tool capable of converting ordinary JavaScript to asm.js at the moment (June, 2017).
Either way, it would make more sense to convert a language with static typing to asm.js, because static typing is a requirement of asm.js and the lack thereof one of the features of ordinary JavaScript that makes it exceptionally hard to compile to asm.js.
Back in 2013, when asm.js was hot, there has been an attempt to compile a statically typed language similar to JavaScript, but both the language and the compiler seem to have been abandoned.
Today, in 2017, JavaScipt subsets TypeScript and Flow would be suitable candidates for conversion to asm.js, but the core dev teams of neither language is interested in such conversion. LLJS had a fork that could compile to asm.js, but that project is pretty much dead. ThinScript is a much more recent attempt and is based on TypeScript, but it doesn't appear to be active either.
So, the best and easiest way to produce asm.js code is still to write your code in C/C++ and convert / compile / transpile it. However, it remains to be seen whether we'll even want to do this in the forseeable future. Web Assembly may soon replace asm.js altogether and there's already popping up TypeScript-like languages like TurboScript and AssemblyScript that convert to Web Assembly. In fact, TurboScript was originally based on ThinScript and used to compile to asm.js, but they appear to have abandoned this feature.
It may be possible to convert regular JavaScript to asm.js by first compiling it to C or C++, and then compiling the generated code to asm.js using Emscripten. I'm not sure if this would be practical, but it's an interesting concept nonetheless.
There is also a compiler called NectarJS that compiles JavaScript to WebAssembly and ASM.js.
check this http://badassjs.com/post/43420901994/asm-js-a-low-level-highly-optimizable-subset-of
basically you need check that your code would be asm.js compatible (no coercion or type casting, you need to manage the memory, etc). The idea behind this is write your code in javascript, detect the bottle neck and do the changes in your code for use asm.js and aot compilation instead jit and dynamic compilation...is a bit PITA but you can still use javascript or other languages like c++ or better..in a near future, lljs.....
Lua is small and can be easily embedded. The current JavaScript VMs are quite big and hard to integrate into existing applications.
So wouldn't it be possible to compile JavaScript to Lua or Lua bytecode?
Especially for the constraints in mobile applications this seems like a good fit. Being able to easily integrate one of the most popular scripting languages into any iPhone or Android app would be great.
I'm not very familiar with Lua so I don't know if this is technically feasible.
With Luvit there is an active project trying to port the Node.js architecture to Lua. So the evented JavaScript world can't be too far away from whats possible in Lua.
The wins of compiling Javascript to Lua are not as great as you might first imagine. Javascript's semantics are very different to Lua's (the LuaJIT author cites Lua's design as one of the main reasons LuaJIT can compete so favourably with Javascript JIT compilers).
Take this code:
if("1" == 1)
{
print("Yes");
}
This prints "Yes" in Javascript. The equivalent code in Lua does not, as strings are never equal to numbers in Lua. This may seem like a small point, but it has a fundamental consequence: we can no longer use Lua's built-in equality testing.
There are two solutions we could take. We could rewrite 1 == "1" to javascript_equals(1, "1"). Or we could wrap every Javascript value in Lua, and use Lua's metatables to override the == operator behaviour.
So we already lost a some efficiency from Lua by mapping Javascript to it. This is a simple example, but it continues like this all the way down. For example all the operator rules are different between Javascript and Lua.
We would even have to wrap Javascript objects, because they aren't the same as Lua tables. For example Javascript objects only support string keys, and coerce any index to a string:
> a = {}
{}
> a[1] = "Hello"
'Hello'
> a["1"]
'Hello'
You also have to watch out for Javascript's scoping rules, vararg functions, and so on.
Now, all of these things are surmountable, if someone were to put the effort into a full compiler. However any efficiency gains would soon be drowned out. You would essentially end up building a Javascript interpreter in Lua. Most Javascript interpreters are written in C and already optimised for Javascript's semantics.
So, doing it for efficiency is a lost cause. There may be other reasons - such as supporting Javascript in a Lua-only environment, though even then if possible just writing Lua bindings to an existing Javascript interpreter would probably be less work.
If you want to have a play with a Javascript->Lua source-to-source translator, take a look at js2lua, which is a toy project I created some time back. It's not anywhere complete, but playing with it would certainly give some food for thought. It already includes a Javascript lexer, so that hard work is done already.