Parsing Custom JavaScript Annotations

While implementing a large JavaScript application with a lot of scripts, it has become necessary to put together a build script. JavaScript labels being ubiquitous, I've decided to use them as annotations for a custom script collator. So far, I'm just employing the use statement, like this:
use: com.example.Class;
However, I want to support an 'optional quotes' syntax, so the following would be parsed correctly as well
use: 'com.example.Class';
I'm currently using this pattern to parse the first form:
/\s*use:\s*(\S+);\s*/g
The '\S+' gloms all characters between the annotation name declaration and the terminating semicolon. What rule can I write to substitute for \S+ that will return the annotation value without quotes, whether or not it was quoted to begin with? I can do it in two steps, but I want to do it in one.
Thanks- I know I've put this a little awkwardly
Edit 1.
I've been able to use this, but IMHO it's a mess. Any more elegant solutions? (By the way, this one will parse ALL label names.)
/\s*([a-z]+):\s*(?:['])([a-zA-Z0-9_.]+)(?:['])|([a-zA-Z0-9_.]+);/g
Edit 2.
The logic is the same, but expressed a little more succinctly. However, it poses a problem, as it seems to pull in all sorts of JavaScript code as well.
/\s*([a-z]+):\s*'([\w_\.]+)'|([\w_\.]+);/g

Ok -this seemed to do it. Hope someone can improve on it.
/\s*([a-z]+): *('[\w_\/\.]+'|[\w_\/\.]+);/g
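One way to get the unquoted value in a single pass is to capture the optional opening quote in its own group and require the same character again through a backreference. The following is only a sketch under some assumptions: annotations are taken to start at the beginning of a line, parseAnnotations is a hypothetical helper name, and accepting double quotes goes slightly beyond what the question asked for.

```javascript
// Hypothetical helper, not part of the original post.
// Group 1: label name. Group 2: optional opening quote.
// Group 3: the value without quotes. \2: the matching closing quote.
const ANNOTATION = /^[ \t]*([a-z]+):[ \t]*(['"]?)([\w.\/]+)\2[ \t]*;/gm;

function parseAnnotations(source) {
  const annotations = [];
  for (const match of source.matchAll(ANNOTATION)) {
    annotations.push({ name: match[1], value: match[3] });
  }
  return annotations;
}

const sample = [
  "use: com.example.Class;",
  "use: 'com.example.Other';",
  "var x = cond ? a : b;", // ordinary code, not an annotation
].join("\n");

console.log(parseAnnotations(sample));
```

Anchoring the pattern to the start of a line is what keeps it from picking up fragments such as the tail of a ternary expression; whether that restriction suits the collator is a design choice.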

Related

How to change the don't-escape HTML delimiter in Mustache.js

I know I can change the default delimiter using Mustache.tags('[[', ']]');
I dug into the source code, but I can't figure out how to change the don't-escape HTML delimiter, which is {{{ }}} by default. Any help is appreciated.

I believe your question is how to turn off the default HTML entity escaping behaviour of a mustache template when you have specified custom delimiters. This can be a bit confusing, since the default behaviour that you will see if you look this up is to use triple braces such as {{{some-value}}}. I'm going to assume you mean from a user's point of view and not a developer's point of view, despite the reference to the source code.
There are two ways:
Mustache provides an alternative syntax for turning off HTML escaping using the & character. So with your custom delimiters of '[[' and ']]' you would specify your placeholder as
[[&some-value]]
Simply use '{ }' within your custom delimiters. E.g.
[[{some-value}]]
I don't believe there is any way to change either of these inner syntaxes. Some templating systems are a lot more flexible (e.g. doT uses regexes for all matching), but mustache is less flexible (which many will see as an advantage)
Hope that clears things up. I know this is an old question, but perhaps this might still help you or anyone else also looking this up.

Changing the don't-escape HTML delimiter is only possible by modifying the source, because it's hardcoded into the parser and defined as openingTag + "{" and "}" + closingTag. By hardcoded I mean that you'd possibly have to change logic, not just a (few) regex. Thanks to #Thomas for dedicating his time to me.

Jison / Flex: Trying to capture anything (.*) between two tokens but having problems

I'm currently working on a small little dsl, not unlike rabl. I'm struggling with the implementation of one of my rules. Before we get to the problem, I'll explain a bit about my syntax/grammar.
In my little language you can define properties, object/array blocks, or custom blocks (these are all used to build a json object/array). A "custom block" can either be one that contains my standard expressions (property, object/array block, etc) or some JavaScript. These expressions are written as such -
-- An object block
object #model
-- A property node
property some, key(name="value")
-- A custom node
object custom_obj as
property some(name="key")
end
-- A custom script node
property full_name as (u)
// This is JavaScript
return u.first_name + ' ' + u.last_name;
end
end
The problem I'm running into is with my custom script node. I'm having a real hard time defining the script token so that JISON can properly capture the stuff inside the block.
In my lexer, I currently have...
# script_param is basically a regex to match "(some_ident)"
{script_param} { this.begin('js'); return 'SCRIPT_PARAM'; }
<js>(.|\n|\r)*?"end" %{
this.popState();
yytext = yytext.substr(0, yyleng - 3).trim();
return 'SCRIPT';
%}
That SCRIPT token will basically match anything after (u) up to (and including) the end token (which usually ends a block). I really dislike this because my usual block terminator (end) is actually part of the script token, which feels totally hacky to me. Unfortunately, I'm not able to find a better way to capture ANYTHING between (..) and end.
I've tried writing a regex that captures anything that ends with a ";", but that poses problems when I have multiple script nodes in my dsl code. I've only been able to make this work by including the "end" keyword as part of my capture.
Here are the links to my grammar and lexer files.
I'd greatly appreciate any insight into solving my problem! If I didn't explain my problem clearly, let me know and I'll try my best to clarify!
Many thanks in advance!!
I will also happily accept any advice as to how to clean up my grammar. I'm still fairly new at this stuff and feel like my stuff is a mess right now :)

It's easy enough to match a string up to but not including the first instance of end:
([^e]|e[^n]|en[^d])*
(And it doesn't even need non-greedy repetition.)
However, that's not what you want. The included JavaScript might include:
variables whose names happen to include the characters end (tendency)
comments (/* Take the values up to the end of the line */)
character strings (if (word == "end"))
and, indeed, the word end itself, which is not a reserved word in js.
Really, the only clean solution is to be able to lex javascript. Fortunately, you don't have to do it precisely, because you're not interpreting it, but even so it is a bit of work. The most annoying part of javascript lexing, like other similar languages, is identifying when / is the beginning of a regular expression, and when it is just division; getting that right requires most of a javascript parser, particularly since it also interacts with the semicolon rule.
To deal with the fact that the included javascript might actually use a variable named end, you have a couple of choices, as far as I can see:
Document the fact that end is a reserved word.
Only recognize end when it appears outside of parentheses and in a place where a statement might start (not too difficult if you end up building enough of a JS parser to correctly identify regular expressions)
Only recognize end when it appears by itself on a line.
This last choice would really simplify your problem a lot, so you might want to think about it, although it's not really very elegant.
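The last option, treating end as a terminator only when it stands alone on a line, can be sketched in plain JavaScript outside of Jison. The function name extractScriptBody and its return shape are assumptions made for illustration; in a real lexer the same idea would be written as a rule in the js start condition.

```javascript
// Hypothetical helper: collects lines starting at startIndex and stops at
// the first line that contains nothing but the word `end`.
function extractScriptBody(lines, startIndex) {
  const body = [];
  for (let i = startIndex; i < lines.length; i++) {
    if (lines[i].trim() === "end") {
      // The terminator is reported separately, not folded into the script.
      return { script: body.join("\n"), endLine: i };
    }
    body.push(lines[i]);
  }
  throw new Error("unterminated script block");
}

const dsl = [
  "property full_name as (u)",
  "  // the end of the name is the surname",
  "  var tendency = u.first_name;",
  "  return tendency + ' ' + u.last_name;",
  "end",
];

const { script, endLine } = extractScriptBody(dsl, 1);
console.log(endLine); // 4
console.log(script);
```

Because only a bare end line terminates the block, identifiers such as tendency and comments mentioning "end" pass through untouched, and the block terminator stays a separate token.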

Finding comments in HTML

I have an HTML file and within it there may be Javascript, PHP and all this stuff people may or may not put into their HTML file.
I want to extract all comments from this html file.
I can point out two problems in doing this:
What is a comment in one language may not be a comment in another.
In JavaScript, the remainder of a line is commented out using the // marker. But URLs also contain //, so I may well eliminate parts of URLs if I simply replace // and the remainder of the line with nothing.
So this is not a trivial problem.
Is there anywhere some solution for this already available?
Has anybody already done this?

Problem 2: Isn't every URL quoted, with either "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether a quote mark precedes the slashes to know if it's a real URL or just a comment.

Look into parser generators like ANTLR which has grammars for many languages and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even then, it won't be 100% accurate.
Consider
Problem 3, a comment in a language is not always a comment in a language.
<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>
Problem 4, a comment embedded in a language may not obviously be a comment.
<button onclick="// this is a comment//
notAComment()">
Problem 5, what is a comment may depend on how the browser is configured.
<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->
I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.
https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.

From your wording, it seems you are pondering an approach based on regular expressions. Doing that on the whole file is a pain. Try instead to use some tools to highlight or discard the interesting or uninteresting text, and then work on what is left in your sieve according to the keep/discard criteria. Have a look at HTML::Tree and TreeBuilder; they could be very useful for dealing with the HTML markup.

I would convert the HTML file into a character array and parse it. You can detect key strings like "<", "--" ,"www", "http", as you move forward and either skip or delete those segments.
The start/end indices will have to be identified properly, which is a challenge but you will have full power.
There are also other ways to simplify the process if performance is not a problem. For example, all tags can be grabbed with XML::Twig and the string can be parsed to detect JS comments.
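The character-by-character scan suggested in this answer might look like the following for the JavaScript part alone. It is a simplified sketch: findJsComments is a hypothetical name, and the scanner deliberately ignores regular expression literals and template strings, so it will misreport input such as /[/*]/.

```javascript
// Hypothetical helper: returns the // and /* */ comments found in a piece of
// JavaScript source, skipping over single- and double-quoted strings.
// Limitation: regex literals and template strings are not recognised.
function findJsComments(code) {
  const comments = [];
  let i = 0;
  while (i < code.length) {
    const ch = code[i];
    const next = code[i + 1];
    if (ch === '"' || ch === "'") {
      // Skip a string literal, honouring backslash escapes.
      i++;
      while (i < code.length && code[i] !== ch) {
        if (code[i] === "\\") i++;
        i++;
      }
      i++; // step past the closing quote
    } else if (ch === "/" && next === "/") {
      let end = code.indexOf("\n", i);
      if (end === -1) end = code.length;
      comments.push(code.slice(i, end));
      i = end;
    } else if (ch === "/" && next === "*") {
      let end = code.indexOf("*/", i + 2);
      end = end === -1 ? code.length : end + 2;
      comments.push(code.slice(i, end));
      i = end;
    } else {
      i++;
    }
  }
  return comments;
}

const js = 'var url = "http://example.com"; // real comment\n/* block */ go();';
console.log(findJsComments(js));
// [ '// real comment', '/* block */' ]
```

The // inside the quoted URL is skipped because the scanner is in "string" mode at that point, which is exactly the distinction a blind search-and-replace cannot make.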

Writing a Parser for javascript code

I want to extract JavaScript code and find out whether it contains any dynamic tag creation such as document.createElement('script');. I have tried to do this with regular expressions, but they restrict me to matching only some formats, so I thought of writing a JavaScript parser that extracts all the keywords, strings and functions from the JavaScript code.

In general there is no way to know if a given line of code will ever run, you would need to solve the halting problem.
If you restrict your analysis to just finding occurrences of a function call, you don't make much progress. Naive methods will still be easy to trick: if you just regex match for document.createElement, you would not be able to match something as simple as document["create" + "Element"]. In general you would need to not only parse the code but evaluate it as well to get around this. And to be sure that you can evaluate the code, you would again need to solve the halting problem.
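A small illustration of that point, with an assumed naive pattern (NAIVE is a made-up name, not something from the question):

```javascript
// A naive pattern that looks for the literal call in the source text.
const NAIVE = /document\.createElement\s*\(/;

const direct = "var s = document.createElement('script');";
const computed = "var s = document['create' + 'Element']('script');";

console.log(NAIVE.test(direct));   // true
console.log(NAIVE.test(computed)); // false, although both create a script element
```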

Maybe you should try using Burrito

Well the first rule is never use regex for big things like this, or DOM, or ... . You have to parse it by tokens. The good news is that you don't have to write your own. There are a few JS to JS parsers.
UglifyJS
narcissus
Esprima
ZeParser
They may be a bit hard to work with, but it is still better to use them. Other projects, such as burrito or code surgeon, are built on these parsers, so you can have a look at their source code and see how they use them.
But there is bad news too: people can still outsmart other people, let alone the parsers and the code they write. At the very least you need to evaluate the code with some execution-time variables and see whether it tries to access the DOM or not.

How do i avoid eval() from converting 1e-1 to 0.1?

I'm using a third-party JavaScript library that uses eval(), so when I call one of its functions with the "1e-1" value as a parameter I get 0.1 returned. How can I escape this or prevent it from parsing the number?
A basic example would be:
console.log(eval("1e-1"));
I want the result to be 1e-1, but eval still needs to be there.
EDIT:
Okay Ignore the console example above
THIS is the example it should work on:
There is no way around using this library. Sorry.

Don't use eval(). Of course, Number("1e-1") has the same "problem". However, if you want a string back from eval, you have to feed it one: eval("'1e-1'").
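The difference can be checked directly. The use of JSON.stringify to add the quotes is an assumption layered on top of this answer, and it only helps if you control the string before the library hands it to eval:

```javascript
console.log(eval("1e-1"));    // 0.1  (a number)
console.log(Number("1e-1"));  // 0.1  (a number)
console.log(eval("'1e-1'"));  // 1e-1 (a string)

// JSON.stringify wraps the value in quotes and escapes it, so eval sees a
// string literal instead of a numeric literal.
const raw = "1e-1";
const viaEval = eval(JSON.stringify(raw));
console.log(viaEval, typeof viaEval); // 1e-1 string
```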

One quick way to do this is to simply replace the hyphen with its character entity code instead:
console.log(eval("1e-1"));
Update
After experimenting for quite a while, the only thing that was close is placing spaces before and after the hyphen:
features[1].attributes.tag= "1e - 1";
I thought it worth mentioning in case this will suffice for what you need.

Develop Reference

JavaScript is the programming language of the Web.
