I'm trying to understand how V8 works, but I'm unable to locate where in the code it actually receives the raw JS source in order to parse it and compile it into C++.
I've looked at api.cc and tried to set a breakpoint in the compile function below, but with no luck (I'm using Chromium to do so): the breakpoint is never hit.
MaybeLocal<Script> ScriptCompiler::Compile(Local<Context> context,
Source* source,
CompileOptions options,
NoCacheReason no_cache_reason)
***** UPDATE ****
After @jmrk's reply, I've been trying to figure out where the JS actually starts coming in. What I'm really interested in is understanding how a website renders and then passes the script to V8 for compilation.
I have found quite a lot of information on the topic, but I'm still unable to see the whole picture:
Turns out the first step isn't the Parser but the Scanner, which gets a UTF-16 stream as an input.
The source code is first broken up in chunks; each chunk may be
associated with a different encoding. A stream then unifies all chunks
under the UTF-16 encoding.
Prior to parsing, the scanner then breaks up the UTF-16 stream into
tokens. A token is the smallest unit of a script that has semantic
meaning. There are several categories of tokens, including whitespace
(used for automatic semicolon insertion), identifiers, keywords, and
surrogate pairs (combined to make identifiers only when the pair is
not recognized as anything else). These tokens are then fed first to
the preparser and then to the parser.
https://blog.logrocket.com/how-javascript-works-optimizing-for-parsing-efficiency/
I have also found out it indeed gets this stream from Blink:
The UTF16CharacterStream provides a (possibly buffered) UTF-16 view over the underlying Latin1, UTF-8, or UTF-16 encoding that V8 receives from Chrome, which Chrome in turn received from the network. In addition to supporting more than one encoding, the separation between scanner and character stream allows V8 to transparently scan as if the entire source is available, even though we may only have received a portion of the data over the network so far.
https://v8.dev/blog/scanner
It also seems like the scanner feeds tokens to the parser:
V8’s parser consumes ‘tokens’ provided by the ‘scanner’. Tokens are
blocks of one or more characters that have a single semantic meaning:
a string, an identifier, an operator like ++. The scanner constructs
these tokens by combining consecutive characters in an underlying
character stream.
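As a rough illustration of that description, here is a toy scanner that combines consecutive characters into tokens. It is only a sketch: the token categories, the `scan` name, and the regex-over-a-string approach are my own assumptions and have nothing to do with V8's actual scanner, which reads from the character stream.

```javascript
// Toy scanner: groups consecutive characters into tokens.
// Token categories and names are illustrative, not V8's.
function* scan(source) {
  const rules = [
    ['whitespace', /^\s+/],
    ['number', /^\d+(\.\d+)?/],
    ['identifier', /^[A-Za-z_$][A-Za-z0-9_$]*/],
    ['string', /^"(?:[^"\\]|\\.)*"/],
    ['operator', /^(\+\+|--|[-+*\/=;(){}])/],
  ];
  let pos = 0;
  while (pos < source.length) {
    const rest = source.slice(pos);
    let matched = false;
    for (const [type, pattern] of rules) {
      const m = pattern.exec(rest);
      if (m) {
        // Whitespace is consumed but not reported in this toy version.
        if (type !== 'whitespace') yield { type, text: m[0], pos };
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new SyntaxError(`Unexpected character at ${pos}: ${source[pos]}`);
    }
  }
}

const tokens = [...scan('count++ + 42;')];
console.log(tokens.map(t => `${t.type}:${t.text}`).join(' '));
// identifier:count operator:++ operator:+ number:42 operator:;
```

Note how `++` comes out as a single token while the lone `+` is a separate one, which is the "one or more characters with a single meaning" idea from the quote.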
But the question remains: where does the raw JavaScript code come in from Blink into V8?
How can I see what Chrome reads, and where does it initialize V8?
It's complicated :-)
ScriptCompiler::Compile is generally correct as the outermost entrypoint. Note that there are two overloads of it. Additionally, Chrome tries to do streaming compilation when it can, which takes a different path. Also, when working with Chrome/Chromium, note that you have to set the breakpoints in the renderer processes, not the browser process.
It's easier to work with the d8 shell when poking around V8. Look for Shell::ExecuteString (which calls ScriptCompiler::Compile) in d8.cc.
Also, to clarify, V8 does not compile JavaScript to C++. It compiles it first to its own internal bytecode format, which is executed by the "Ignition" interpreter; hot functions are then later compiled to machine code by the "TurboFan" optimizing compiler.
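If you just want to see "source string in, compiled script out" without a debugger, a small sketch from the JavaScript side may help. It assumes Node.js, whose `vm` module is, as far as I know, a thin layer over V8's script compilation API; the source text and file name are made up for the example.

```javascript
const vm = require('vm');

// Raw JavaScript source, as an embedder would hand it to V8.
const source = 'const answer = 6 * 7; answer;';

// Compile step: the script is parsed and compiled, but nothing runs yet.
const script = new vm.Script(source, { filename: 'example.js' });

// V8's code cache for this script: a Buffer of serialized compilation
// output (V8-internal data, not C++ source).
const cache = script.createCachedData();
console.log(`code cache: ${cache.length} bytes`);

// Run step: execute the compiled script in a fresh context.
const result = script.runInNewContext({});
console.log(result); // 42
```

The point is only that compiling and running are separate steps, and that what comes out of compilation is engine-internal data rather than C++.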
Don't be discouraged if you have trouble understanding the whole pipeline. No single person does; V8 is too big and too complicated for that. Focus on what you're interested in (parser? interpreter? optimizing compiler?) and dig into that.
The American Fuzzy Lop, and the conceptually related LLVM libfuzzer not only generate random fuzzy strings, but they also watch branch coverage of the code under test and use genetic algorithms to try to cover as many branches as possible. This increases the hit frequency of the more interesting code further downstream as otherwise most of the generated inputs will be stopped early in some deserialization or validation.
But those tools work at native code level, which is not useful for JavaScript applications as it would be trying to cover the interpreter, but not really the interpreted code.
So is there a way to fuzz JavaScript (preferably in browser, but tests running in node.js would help too) with coverage guidance?
I looked at the tools mentioned in this old question, but those that handle JavaScript don't seem to mention anything about coverage profiling. And while radamsa mentions optionally pairing it with coverage analysis, I haven't found any documentation on how to actually do it.
How can one fuzz-test a JavaScript (in-browser) application with coverage guidance?
Fuzzing JavaScript engines draws a lot of attention, as the number of browser users is about 4 billion. A lot of work has been done to find bugs in JS engines, including popular large engines, e.g., V8, WebKit, ChakraCore, and Gecko, as well as small embedded engines such as JerryScript, QuickJS, jsish, mjs, and MuJS.
It is really difficult to find bugs using AFL, as the mutation mechanisms AFL provides are not practical for JS files; e.g., a bit flip can hardly ever produce a valid mutation. Since JS is a structured language, several works use the ECMAScript grammar to mutate or generate JS files (seeds):
LangFuzz parses sample JS files and splits them into code fragments. It then recombines the fragments to produce test cases.
jsfunfuzz randomly generates syntactically valid JS statements from JS grammar manually written for fuzzing.
Dharma is a generation-based, context-free grammar fuzzer, generating files based on given grammar.
Superion extends AFL using tree-based mutation guided by JS grammar.
The above works can easily pass the syntax checks but fail at semantic checks. A lot of generated JS seeds are semantically invalid.
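To make the grammar-based idea concrete, here is a minimal sketch of a generator in the spirit of the tools above. The grammar, the depth limit, and the function names are all invented for illustration; none of it is taken from jsfunfuzz, Dharma, or any other real fuzzer.

```javascript
// Tiny made-up grammar: each nonterminal maps to a list of productions.
// Any symbol that is not a key of `grammar` is emitted literally.
const grammar = {
  stmt: [
    ['var ', 'ident', ' = ', 'expr', ';'],
    ['if (', 'expr', ') { ', 'stmt', ' }'],
    ['expr', ';'],
  ],
  expr: [
    ['num'],
    ['ident'],
    ['(', 'expr', ' ', 'op', ' ', 'expr', ')'],
    ['[', 'expr', ', ', 'expr', ']'],
  ],
  op: [['+'], ['-'], ['*'], ['<'], ['===']],
  ident: [['a'], ['b'], ['c']],
  num: [['0'], ['1'], ['42'], ['-1']],
};

function generate(symbol, rng = Math.random, depth = 0) {
  if (!Object.prototype.hasOwnProperty.call(grammar, symbol)) {
    return symbol; // terminal
  }
  const productions = grammar[symbol];
  // Past the depth limit, always take the first (non-recursive) production
  // so that generation terminates.
  const choice = depth >= 4
    ? productions[0]
    : productions[Math.floor(rng() * productions.length)];
  return choice.map(part => generate(part, rng, depth + 1)).join('');
}

for (let i = 0; i < 5; i++) {
  const code = generate('stmt');
  new Function(code); // throws SyntaxError if the code does not parse
  console.log(code);
}
```

Every generated statement parses, but many would throw a ReferenceError when executed because `a`, `b`, or `c` was never declared, which is the syntax-versus-semantics gap described above.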
CodeAlchemist uses a semantics-aware approach to generate code segments based on a static type analysis.
There are two levels of bugs in JS engines: simple parser/interpreter bugs and deep logic bugs. Recently, the trend is that the number of simple bugs is decreasing while more and more deep bugs are being found.
DIE uses aspect-preserving mutation to preserve the desirable properties of CVEs. It also uses type analysis to generate semantically valid test cases.
Some works focus on mutating intermediate representations.
Fuzzilli is a coverage-guided fuzzer based on mutation on the IR level. The mutations on IR can guarantee semantic validity and can be transferred to JS.
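For readers new to the term, the coverage-guided feedback loop itself can be sketched in a few lines. This is a toy under loud assumptions: the target is instrumented by hand, the mutation is a single character edit on a string, and nothing here reflects how Fuzzilli or AFL are implemented.

```javascript
// Hand-instrumented target: each branch records an ID in `coverage`.
function target(input, coverage) {
  coverage.add('entry');
  if (input[0] === 'F') {
    coverage.add('F');
    if (input[1] === 'U') {
      coverage.add('FU');
      if (input[2] === 'Z') {
        coverage.add('FUZ');
        if (input[3] === 'Z') {
          throw new Error('bug reached');
        }
      }
    }
  }
}

// Replace one character, or append one when pos lands past the end.
function mutate(input, rng) {
  const alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
  const ch = alphabet[Math.floor(rng() * alphabet.length)];
  const pos = Math.floor(rng() * (input.length + 1));
  return input.slice(0, pos) + ch + input.slice(pos + 1);
}

function fuzz(iterations, rng = Math.random) {
  const corpus = [''];    // inputs that reached new branches
  const seen = new Set(); // branch IDs covered so far
  for (let i = 0; i < iterations; i++) {
    const parent = corpus[Math.floor(rng() * corpus.length)];
    const candidate = mutate(parent, rng);
    const coverage = new Set();
    try {
      target(candidate, coverage);
    } catch (e) {
      return { crash: candidate, iterations: i + 1, corpus };
    }
    let isNew = false;
    for (const id of coverage) {
      if (!seen.has(id)) { seen.add(id); isNew = true; }
    }
    // The feedback step: keep only inputs that covered something new.
    if (isNew) corpus.push(candidate);
  }
  return { crash: null, iterations, corpus };
}

const result = fuzz(20000);
console.log(result.crash
  ? `crash on "${result.crash}" after ${result.iterations} runs`
  : 'no crash found');
```

Without the corpus feedback, a random four-letter guess hits the bug about once in 26^4 tries; with it, each prefix is kept as soon as it is found, so the search proceeds one branch at a time.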
Fuzzing JS is an interesting and hot topic, judging by the top security and software engineering conferences of recent years. Hope this information is helpful.
To improve performance, JavaScript engines sometimes only fully parse functions when they are actually called.
For example, from the SpiderMonkey source code:
Checking the syntax of a function is several times faster than doing a full parse/emit, and lazy parsing improves both performance and memory usage significantly when pages contain large amounts of code that never executes (which happens often).
What steps can the parser skip while still being able to validate the syntax?
It appears that in SpiderMonkey some of the savings come from not emitting bytecode, as would happen after a full parse. Does a full parse in e.g. V8 also include generating machine code?
First off a clarification: the two steps are called "pre-parsing" and "full parsing". "Lazy parsing" describes the strategy of doing the former first, and then the latter when needed.
The two big reasons why pre-parsing is faster are:
it doesn't build the AST (abstract syntax tree), which is usually the output of the parser
it doesn't do scope resolution for variables.
There are a few other internal steps done by the parser in preparation for code generation that aren't needed when only checking for errors, but the above two are the major points.
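To illustrate the first point, here is a toy sketch of one parser that runs in two modes: a "pre-parse" mode that only checks syntax, and a "full" mode that also builds an AST. The tiny expression grammar and all names are assumptions made for the example; scope resolution is not modeled at all, and this is not how any real engine is structured.

```javascript
function tokenize(src) {
  // Numbers, identifiers, the operators + * ( ), and any other
  // non-space character (which the parser will then reject).
  return src.match(/\d+|[A-Za-z_]\w*|[+*()]|\S/g) || [];
}

// Grammar: expr := term ('+' term)* ; term := atom ('*' atom)* ;
//          atom := NUMBER | IDENT | '(' expr ')'
function parse(src, { buildAst }) {
  const tokens = tokenize(src);
  let pos = 0;
  // In pre-parse mode the node factories return null, so no tree is built.
  const make = buildAst
    ? {
        leaf: (type, value) => ({ type, value }),
        binary: (op, left, right) => ({ type: 'Binary', op, left, right }),
      }
    : { leaf: () => null, binary: () => null };

  function atom() {
    const tok = tokens[pos++];
    if (tok === undefined) throw new SyntaxError('Unexpected end of input');
    if (/^\d+$/.test(tok)) return make.leaf('Number', Number(tok));
    if (/^[A-Za-z_]\w*$/.test(tok)) return make.leaf('Identifier', tok);
    if (tok === '(') {
      const inner = expr();
      if (tokens[pos++] !== ')') throw new SyntaxError("Expected ')'");
      return inner;
    }
    throw new SyntaxError(`Unexpected token ${tok}`);
  }
  function term() {
    let left = atom();
    while (tokens[pos] === '*') {
      pos++;
      left = make.binary('*', left, atom());
    }
    return left;
  }
  function expr() {
    let left = term();
    while (tokens[pos] === '+') {
      pos++;
      left = make.binary('+', left, term());
    }
    return left;
  }

  const ast = expr();
  if (pos < tokens.length) throw new SyntaxError(`Unexpected token ${tokens[pos]}`);
  return ast;
}

console.log(parse('1 + 2 * x', { buildAst: false })); // null: syntax OK, no tree
console.log(JSON.stringify(parse('1 + 2 * x', { buildAst: true })));
```

Both modes walk the same grammar and reject the same inputs (for example `(1 + 2`), but only the full mode pays for allocating the tree.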
(Full) parsing and code generation are separate steps.
I'm just wondering whether there is a difference in performance from removing the spaces before and after equals signs, as in these two code snippets.
first
int i = 0;
second
int i=0;
I'm using the first one, but my friend, who is learning HTML/JavaScript, told me that my coding is inefficient. Is that true in HTML/JavaScript? And does it make a big difference in performance? Is it the same in C++/C# and other programming languages? As for indentation, he said 3 spaces are better than a tab, but I'm already used to coding like this. So I just want to know if he is correct.
Your friend is a bit misguided.
The extra spaces in the code will make a small difference in the size of the JS file which could make a small difference in the download speed, though I'd be surprised if it was noticeable or meaningful.
The extra spaces are unlikely to make a meaningful difference in the time to parse the file.
Once the file is parsed, the extra spaces will not make any difference in execution speed since they are not part of the parsed code.
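A quick way to convince yourself of this, as a small sketch (the two snippets are made up for the demonstration): compile the same code with and without extra spaces and compare.

```javascript
// Same code, once with generous spacing and once without.
const spaced = 'var i = 0;   return   i + 1;';
const compact = 'var i=0;return i+1;';

// new Function parses each string into a function body.
const f = new Function(spaced);
const g = new Function(compact);

console.log(spaced.length > compact.length); // true: the source text is longer
console.log(f() === g());                    // true: both return 1
```

The only thing the spaces change is the number of characters that have to be downloaded and scanned.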
If you really want to optimize download or parse speed, write your code in the most readable fashion possible for best maintainability, and then use a minimizer for the deployed code. This is standard practice on many web sites. It gives you the best of both worlds: maintainable, readable code and minimum deployed size.
A minimizer will remove all unnecessary spacing, shorten the names of variables, remove comments, collapse lines, etc... all designed to make the deployed code as small as possible without changing the run-time meaning of the code at all.
C++ is a compiled language. As such, only the compiler that the developer uses sees any extra spaces (same with comments). Those spaces are gone once the code has been compiled into native code which is what the end-user gets and runs. So, issues about spaces between elements in a line are simply not applicable at all for C++.
JavaScript is an interpreted language. That means the source code is downloaded to the browser, and the browser then parses the code at runtime into some opcode form that the interpreter can run. The spaces in JavaScript will be part of the downloaded code (if you don't use a minimizer to remove them), but once the code is parsed, those extra spaces play no part in the run-time performance of the code. Thus, the spaces could have a small influence on the download time and perhaps an even smaller influence on the parse time (though I'm guessing it is unlikely to be measurable or meaningful). As I said above, the way to optimize this for JavaScript is to use spaces to enhance readability in the source code and then run a minimizer over the code to generate a deployed version that minimizes the size of the file. This preserves maximum readability and minimizes download size.
There is little (JavaScript) to no (C#, C++, Java) difference in performance. In the compiled languages in particular, the source code compiles to the exact same machine code.
Using spaces instead of tabs can be a good idea, but not because of performance. Rather, if you aren't careful, use of tabs can result in "tab rot", where there are tabs in some places and spaces in others, and the indentation of the source code depends on your tab settings, making it hard to read.
How would I go about writing a lightweight JavaScript-to-JavaScript parser? Something simple that can convert some snippets of code.
I would basically like to make the internal scope objects in functions public.
So something like this
var outer = 42;
window.addEventListener('load', function() {
    var inner = 42;
    function magic() {
        var in_magic = inner + outer;
        console.log(in_magic);
    }
    magic();
}, false);
Would compile to
__Scope__.set('outer', 42);
__Scope__.set('console', console);
window.addEventListener('load', constructScopeWrapper(__Scope__, function(__Scope__) {
    __Scope__.set('inner', 42);
    __Scope__.set('magic', constructScopeWrapper(__Scope__, function _magic(__Scope__) {
        __Scope__.set('in_magic', __Scope__.get('inner') + __Scope__.get('outer'));
        __Scope__.get('console').log(__Scope__.get('in_magic'));
    }));
    __Scope__.get('magic')();
}), false);
Demonstration Example
Motivation behind this is to serialize the state of functions and closures and keep them synchronized across different machines (client, server, multiple servers). For this I would need a representation of [[Scope]]
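For concreteness, a minimal sketch of what the `__Scope__` object and `constructScopeWrapper` helper could look like. Both are hypothetical: the `Scope` class, its `snapshot` method, and the exact calling convention are assumptions made for illustration, and serializing the functions themselves is not covered.

```javascript
// Hypothetical scope object: a variable map plus a link to the enclosing scope.
class Scope {
  constructor(parent = null) {
    this.parent = parent;
    this.vars = new Map();
  }
  // Declare or overwrite a variable in this scope. (A real version would
  // have to distinguish declarations from assignments to outer variables.)
  set(name, value) {
    this.vars.set(name, value);
  }
  // Look a variable up, walking outwards along the scope chain.
  get(name) {
    for (let s = this; s !== null; s = s.parent) {
      if (s.vars.has(name)) return s.vars.get(name);
    }
    throw new ReferenceError(`${name} is not defined`);
  }
  // Plain-object snapshot of this scope and its ancestors, innermost first.
  snapshot() {
    const chain = [];
    for (let s = this; s !== null; s = s.parent) {
      chain.push(Object.fromEntries(s.vars));
    }
    return chain;
  }
}

// Wrap fn so that each call receives a fresh child scope of `parent`
// as its first argument, followed by the original arguments.
function constructScopeWrapper(parent, fn) {
  return function (...args) {
    return fn.call(this, new Scope(parent), ...args);
  };
}

const __Scope__ = new Scope();
__Scope__.set('outer', 42);
const onLoad = constructScopeWrapper(__Scope__, function (__Scope__) {
  __Scope__.set('inner', 42);
  __Scope__.set('in_magic', __Scope__.get('inner') + __Scope__.get('outer'));
  return __Scope__.snapshot();
});
console.log(JSON.stringify(onLoad()));
// [{"inner":42,"in_magic":84},{"outer":42}]
```

The snapshot is the part that would be sent across machines; it only works as-is for values that survive JSON serialization.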
Questions:
Can I do this kind of compiler without writing a full JavaScript -> (slightly different) JavaScript compiler?
How would I go about writing such a compiler?
Can I re-use existing js -> js compilers?
I don't think your task is easy or short given that you want to access and restore all the program state. One of the issues is that you might have to capture the program state at any moment during a computation, right? That means the example as shown isn't quite right; that captures state sort of before execution of that code (except that you've precomputed the sum that initializes magic, and that won't happen before the code runs for the original JavaScript). I assume you might want to capture the state at any instant during execution.
The way you've stated your problem, you want a JavaScript parser in JavaScript.
I assume you are imagining that your existing JavaScript code J includes such a JavaScript parser and whatever else is necessary to generate your resulting code G, and that when J starts up it feeds copies of itself to G, manufacturing the serialization code S and somehow loading that up.
(I think G is pretty big and hoary if it can handle all of JavaScript.)
So your JavaScript image contains J, big G, S and does an expensive operation (feed J to G) when it starts up.
What I think might serve you better is a tool G that processes your original JavaScript code J offline, and generates program state/closure serialization code S (to save and restore that state) that can be added to/replace J for execution. J+S are sent to the client, who never sees G or its execution. This decouples the generation of S from the runtime execution of J, saving on client execution time and space.
In this case, you want a tool that will make generation of such code S easiest. A pure JavaScript parser is a start but isn't likely enough; you'll need symbol table support to know which function code is connected to a function call F(...), and which variable definition in which scope corresponds to assignments or accesses to a variable V. You may need to actually modify your original code J to insert points of access where the program state can be captured. You may need flow analysis to find out where some values went. Insisting that all of this be done in JavaScript narrows your range of solutions.
For these tasks, you will likely find a program transformation tool useful. Such tools contain parsers for the language of interest, build ASTs representing the program, enable the construction of identifier-to-definition maps ("symbol tables"), can carry out modifications to the ASTs representing insertion of access points, or synthesis of ASTs representing your demonstration example, and then regenerate valid JavaScript code containing the modified J and the additions S.
Of all the program transformation systems that I know about (which includes all the ones at the Wikipedia site), none are implemented in JavaScript.
Our DMS Software Reengineering Toolkit is such a program transformation system offering all the features I just described. (Yes, it's big and hoary; it has to be to handle the complexities of real computer languages). It has a JavaScript front end that contains a complete JavaScript parser to ASTs, and the machinery to regenerate JavaScript code from modified or synthesized ASTs. (Also big and hoary; good thing that hoary + hoary is still just hoary). Should it be useful, DMS also provides support for building control and dataflow analysis.
If you want something with a simple interface, you could try node-burrito: https://github.com/substack/node-burrito
It generates an AST using the uglify-js parser and then recursively walks the nodes. All you have to do is give a single callback which tests each node. You can alter the ones you need to change, and it outputs the resulting code.
I'd try to look for an existing parser to modify. Perhaps you could adapt JSLint/JSHint?
There is a problem with the rewriting above: you're not hoisting the initialization of magic to the top of the scope.
There are a number of projects out there that parse JavaScript.
Crock's Pratt parser which works well on JavaScript that fits within "The good parts" and less well on other JS.
The es-lab parser based on ometa which handles the full grammar including a lot of corner cases that Crock's parser misses. It may not perform as well as Crock's.
The Narcissus parser and evaluator. I don't have much experience with this.
There are also a number of high-quality lexers for JavaScript that let you manipulate JS at the token level. This can be tougher than it sounds though since JavaScript is not lexically regular, and predicting semicolon insertion is difficult without a full parse.
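The classic example of that difficulty is `/`, which can start a regular expression literal or be a division operator depending on what came before. Below is a toy sketch of the kind of previous-token heuristic a standalone lexer might use; the function name and keyword list are assumptions for illustration, it assumes the previous token is an identifier, keyword, number, or punctuator, and it is not taken from any particular lexer.

```javascript
// Guess whether a '/' starts a regex literal, looking only at the
// previous significant token (undefined means start of input).
function slashStartsRegex(prevToken) {
  if (prevToken === undefined) return true;
  if (/^[\w$][\w$.]*$/.test(prevToken)) {
    // After an identifier or number, '/' divides. After certain
    // keywords, an expression (and hence a regex) may follow.
    const keywords = ['return', 'typeof', 'case', 'in', 'of', 'delete', 'void', 'throw'];
    return keywords.includes(prevToken);
  }
  // After a closing bracket we guess division; after any other
  // punctuator an expression is expected, so we guess regex.
  return ![')', ']', '}'].includes(prevToken);
}

console.log(slashStartsRegex('='));      // true:  x = /ab/g
console.log(slashStartsRegex('a'));      // false: a / b
console.log(slashStartsRegex('return')); // true:  return /ab/.test(s)
console.log(slashStartsRegex(')'));      // false: (a + b) / 2
```

The last rule is where the heuristic breaks down: in `if (x) /re/.test(s)` the `/` after `)` does start a regex, and telling the two cases apart requires knowing that the parenthesis closed an `if` condition, which is parser knowledge rather than lexer knowledge.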
My es5-lexer is a carefully constructed and efficient lexer for EcmaScript 5 that provides the ability to tokenize JavaScript. It is heuristic where JavaScript's grammar is not lexically regular, but the heuristic is very good. It also provides a means to transform a token stream so that an interpreter is guaranteed to interpret it the way the lexer interpreted the tokens. So if you don't trust your input, you can still be sure that the interpretation underlying the security transformations is sound, even if it is not correct according to the spec for some bizarre inputs.
Your problem seems to be in the same family of problems as the one solved by JS obfuscators and JS compressors: they, like you, need to be able to parse the JS and reformat it into an equivalent script.
There was a good discussion on obfuscators here, and a possible solution to your problem could be to leverage the parser and generator parts of one of the FOSS versions.
One callout: your example code does not take into account the scopes of the variables you want to set/get, and that will eventually become a problem that you will have to solve.
Addition
Given the scope problem for closure-defined functions, you are unlikely to be able to solve this as a static parsing problem, since the scope variables outside the closure will have to be imported/exported to resolve, save, and re-instantiate the scope. Hence you may need to dig into the evaluation engine itself, perhaps taking the V8 engine and hacking the interpreter. That assumes you do not need this to be generic across all script engines and that you can tie it down to a single implementation which you control.
I have found that the size of the compiled JavaScript grows faster than I had expected. Adding a few lines of Java code to my project can increase the script size by several KB.
At the moment my compiled project weighs 1 MB. I'm not using any external libraries except for those for MVP (Activities & Places), testing (JUnit), and logging.
I would like to know if there are any coding practices or recommendations for keeping the compiled script as small as possible. I'm not referring to code splitting, but to coding techniques or patterns that can make the compiled JavaScript effectively smaller.
Many thanks
GWT uses a "pay as you go" design philosophy, and since you're not allowed to use reflection the compiler can statically prove (on a method-by-method basis) that a section of code is "reachable", and eliminate those that are not. For example, if you never use the remove() method on ArrayList, then that code does not get included in the resulting JavaScript.
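The reachability idea can be sketched with a toy call graph. Everything here is hypothetical: the method names and edges are invented for the example, it is written in JavaScript rather than Java only to keep it short, and it says nothing about how the GWT compiler is actually implemented.

```javascript
// Hypothetical call graph: method -> methods it calls.
const callGraph = {
  'App.main': ['ArrayList.add', 'ArrayList.get'],
  'ArrayList.add': ['ArrayList.ensureCapacity'],
  'ArrayList.get': ['ArrayList.checkIndex'],
  'ArrayList.remove': ['ArrayList.checkIndex', 'System.arraycopy'],
  'ArrayList.ensureCapacity': [],
  'ArrayList.checkIndex': [],
  'System.arraycopy': [],
};

// Worklist traversal: everything transitively callable from the entry point.
function reachableFrom(entry, graph) {
  const reachable = new Set([entry]);
  const worklist = [entry];
  while (worklist.length > 0) {
    const method = worklist.pop();
    for (const callee of graph[method] || []) {
      if (!reachable.has(callee)) {
        reachable.add(callee);
        worklist.push(callee);
      }
    }
  }
  return reachable;
}

const kept = reachableFrom('App.main', callGraph);
const pruned = Object.keys(callGraph).filter(m => !kept.has(m));
console.log(pruned); // [ 'ArrayList.remove', 'System.arraycopy' ]
```

Adding a single call to `ArrayList.remove` in `App.main` would pull in both pruned methods at once, which is the same effect as a two-line change causing a multi-kilobyte jump.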
If you are seeing several kilobyte jumps with the addition of just a few lines, it probably means that you've introduced the use of a new type (and possibly one that depends on other new types) that you had not yet been using. It might also mean that you've made a change to send this new type "over the wire" back to the server, in which case a GWT generator had to include JavaScript for marshaling that type, and any new types that are reachable via its "has-a" and "is-a" references.
So if it were me, I would begin there: when you catch a 2-line change making a multi-kilobyte increase, start by looking at the types and asking whether it is a type that you have used before, and whether you're sending a new type over the wire, and whether that type also depends on other types under the hood.
One final thought: in Ray Ryan's 2009 presentation at Google I/O he mentioned a superstition that he had picked up from the GWT compiler team, where they recommended against using generic types (I'm not speaking of Java Generics here, but rather supertypes) as RPC arguments & return values. In particular, instead of having your RPC call take or return a Map, have it take or return a HashMap instead. The belief is that the GWT generator can then narrow the amount of serialization code that it has to create at compile time (because it could, for example, refrain from generating serialization code for a TreeMap).
I hope this helps.
GWT creates different output versions for each supported browser, so when you say the project size is 1 MB, are you referring to the combined size of these? (Each browser only downloads the one it actually needs.)
I have tried to experiment with the generated output when using various inheritance/class/generics constructs. Unfortunately, the extra complexity introduced far outweighs the small size improvements gained (e.g. when dropping generics).
I have been on some large GWT projects (50,000+ lines) and have found that code obfuscation, coupled with turning on compression on the web server, is the simplest and most effective way to minimize the downloads. If this does not shrink the code enough, look into GWT's compilation report, which you can use to pinpoint potentially problematic classes and places to insert code splitting.