How to write a parser using javascript? - javascript

In our product we are trying to parse the following different formats from a given piece of text -
${{node::123456}}
${{node:123456}}
$fn{{#functionName('abcd',',',' somethingWithASpace')}}
$fn{{#functionName('abcd','#','${{node::123456}}')}}
${{rmtrqst:someText[]->abcd}}
Sample of the text is like -
Hi, how are you ${{node::123456}}? Your order id is ${{node::636636}}.
or
Your order was placed on $fn{{#dateConverterFunction('abcd','#','${{node::123456}}')}}
I tried with the regex /\$((fn)\{{2}(\#|)(\w*)((\(.*\))|([^\$]*))\}{2})/gi, but this is not helping much. Can anyone suggest how to write a parser for this?
A grammar could be like this -
Every expression starts with $ followed by either fn{{ or {{
After that there will be a string like node or #functionName or something else
That might be followed by a parenthesis-enclosed string (this may contain a whole expression like ${{node::1234}} inside it; we should ignore whatever is inside the parentheses)
Finally it will be closed by }}

Use a tokenizer and let it break the strings down into a meaningful structure.
The nearley.js library is a popular choice for parsing non-linear structures like yours. You can choose to keep your expressions simple or, if you choose otherwise, the library can create an abstract syntax tree for a more complicated grammar.
To write a parser using the library, define your grammar in a separate file and use it for parsing.
Or you can use the tokenizer directly to get your string tokenized.
@{%
const moo = require("moo");
const lexer = moo.compile({
ws: /[ \t]+/,
number: /[0-9]+/,
word: /[a-z]+/,
times: /\*|x/
});
%}
# Pass your lexer object using the @lexer option:
@lexer lexer
# Use %token to match any token of that type instead of "token":
multiplication -> %number %ws %times %ws %number {% ([first, , , , second]) => first * second %}
# Literal strings now match tokens with that text:
trig -> "sin" %number
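If a parser generator feels heavy for the grammar described in the question, a small hand-written scanner is another option. The sketch below uses neither nearley nor moo. The function name `findExpressions` and the `kind`/`body` fields are made up for illustration, and it assumes that single-quoted arguments never contain an escaped quote.

```javascript
// Hypothetical helper: finds ${{...}} and $fn{{...}} expressions in a text.
// Anything inside parentheses (including nested ${{...}}) is skipped over.
function findExpressions(text) {
  const found = [];
  let i = 0;
  while (i < text.length) {
    let kind = null;
    let bodyStart = -1;
    if (text.startsWith('$fn{{', i)) {
      kind = 'fn';
      bodyStart = i + 5;
    } else if (text.startsWith('${{', i)) {
      kind = 'ref';
      bodyStart = i + 3;
    }
    if (kind === null) {
      i += 1;
      continue;
    }
    let depth = 0;       // current parenthesis nesting
    let inQuote = false; // inside a '...' argument (only tracked within parentheses)
    let end = -1;
    for (let j = bodyStart; j < text.length; j += 1) {
      const ch = text[j];
      if (inQuote) {
        if (ch === "'") inQuote = false;
      } else if (depth > 0 && ch === "'") {
        inQuote = true;
      } else if (ch === '(') {
        depth += 1;
      } else if (ch === ')' && depth > 0) {
        depth -= 1;
      } else if (depth === 0 && text.startsWith('}}', j)) {
        end = j;
        break;
      }
    }
    if (end === -1) {
      i += 1; // unterminated expression: treat the "$" as plain text
      continue;
    }
    found.push({ kind, body: text.slice(bodyStart, end), start: i, end: end + 2 });
    i = end + 2;
  }
  return found;
}

const sampleText = "Your order was placed on $fn{{#dateConverterFunction('abcd','#','${{node::123456}}')}}";
console.log(findExpressions(sampleText).map((e) => e.kind)); // [ 'fn' ]
```

Each result carries the raw body, so a second pass can split a body such as `node::123456` or a function call further if needed.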

Related

[Nearley]: how to parse matching opening and closing tag

I'm trying to parse a very simple language with nearley: you can put a string between matching opening and closing tags, and you can chain some tags. It looks like a kind of XML, but with [ instead of <, with tags always 2 characters long, and without nesting.
[aa]My text[/aa][ab]Another Text[/ab]
But I can't seem to parse this correctly: as soon as I have more than one tag, the parse comes out ambiguous, although the grammar should be unambiguous.
The grammar that I have right now:
@builtin "string.ne"
@builtin "whitespace.ne"
openAndCloseTag[X] -> "[" $X "]" string "[/" $X "]"
languages -> openAndCloseTag[[a-zA-Z] [a-zA-Z]] (_ openAndCloseTag[[a-zA-Z] [a-zA-Z]]):*
string -> sstrchar:* {% (d) => d[0].join("") %}
Relatedly, I would ideally like the tags to be case-insensitive (e.g. [bc]TESt[/BC] would be valid).
Does anyone have an idea how we can do that? I wasn't able to find a nearley XML parser example.
Your language is almost too simple to need a parser generator. And at the same time, it is not context-free, which makes it difficult to use a parser generator. So it is quite possible that the Nearley parser is not the best tool for you, although it is probably possible to make it work with a bit of hackery.
First things first. You have not actually provided an unambiguous definition of your language, which is why your parser reports an ambiguity. To see the ambiguity, consider the input
[aa]My text[/ab][ab]Another Text[/aa]
That's very similar to your test input; all I did was swap a pair of letters. Now, here's the question: Is that a valid input consisting of a single aa tag? Or is it a syntax error? (That's a serious question. Some definitions of tagging systems like this consider a tag to only be closed by a matching close tag, so that things which look like different tags are considered to be plain text. Such systems would accept the input as a single tagged value.)
The problem is that you define string as sstrchar:*, and if we look at the definition of sstrchar in string.ne, we see (leaving out the postprocessing actions, which are irrelevant):
sstrchar -> [^\\'\n]
| "\\" strescape
| "\\'"
Now, the first possibility is "any character other than a backslash, a single quote or a newline", and it's easy to see that all of the characters in [/ab] are in sstrchar. (It's not clear to me why you chose sstrchar; single quotes don't appear to be special in your language. Or perhaps you just didn't mention their significance.) So a string could extend up to the end of the input. Of course, the syntax requires a closing tag, and the Nearley parser is determined to find a match if there is one. But, in fact, there are two of them. So the parser declares an ambiguity, since it doesn't have any criterion to choose between the two close tags.
And here's where we come up against the issue that your language is not context-free. (Actually, it is context-free in some technical sense, because there are "only" 676 two-letter case-insensitive tags, and it would theoretically be possible to list all 676 possibilities. But I'm guessing you don't want to do that.)
A context-free grammar cannot express a language that insists that two non-terminals expand to the same string. That's the very definition of context-free: if one non-terminal can only match the same input as a previous non-terminal, then the second non-terminal's match is dependent on the context, specifically on the match produced by the first non-terminal. In a context-free grammar, a non-terminal expands to the same thing, regardless of the rest of the text. The context in which the non-terminal appears is not allowed to influence the expansion.
Now, you quite possibly expected that your macro definition:
openAndCloseTag[X] -> "[" $X "]" string "[/" $X "]"
is expressing a context-sensitive match by repeating the $X macro parameter. But it is not by accident that the Nearley documentation describes this construct as a macro. X here refers exactly to the string used in the macro invocation. So when you say:
openAndCloseTag[[a-zA-Z] [a-zA-Z]]
Nearley macro-expands that to
"[" [a-zA-Z] [a-zA-Z] "]" string "[/" [a-zA-Z] [a-zA-Z] "]"
and that's what it will use as the grammar production. Observe that the two $X macro parameters were expanded to the same argument, but that doesn't mean that they will match the same input text. Each of those subpatterns will independently match any two alphabetic characters. Context-freely.
As I alluded to earlier, you could use this macro to write out the 676 possible tag patterns:
tag -> openAndCloseTag["aa"i]
| openAndCloseTag["ab"i]
| openAndCloseTag["ac"i]
| ...
| openAndCloseTag["zz"i]
If you did that (and you managed to correctly list all of the possibilities) then the parser would not complain about ambiguity as long as you never use the same tag twice in the same input. So it would be ok with both your original input and my altered input (as long as you accept the interpretation that my input is a single tagged object). But it would still report the following as ambiguous:
[aa]My text[/aa][aa]Another Text[/aa]
That's ambiguous because the grammar allows it to be either a single aa tagged string (whose text includes characters which look like close and open tags) or as two consecutive aa tagged strings.
To eliminate the ambiguity you would have to write the string pattern in a way which does not permit internal tags, in the same way that sstrchar doesn't allow internal single quotes. Except, of course, it is not nearly as simple to match a string which doesn't contain a pattern as it is to match a string which doesn't contain a single character. It could be done using Nearley, but I really don't think that it's what you want.
Probably your best bet is to use native Javascript regular expressions to match tagged strings. This will prove simpler because Javascript regular expressions are much more powerful than mathematical regular expressions, even allowing the possibility of matching (certain) context-sensitive constructions. You could, for example, use Javascript regular expressions with the Moo lexer, which integrates well into Nearley. Or you could just use the regular expressions directly, since once you match the tagged text, there isn't much else you need to do.
To get you started, here's a simple Javascript regular expression which matches tagged strings with matching case-insensitive labels (the i flag at the end):
/\[([a-zA-Z]{2})\].*?\[\/\1\]/gmi
You can play with it online using Regex 101
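As a rough illustration of using that regular expression directly, the sketch below collects every tagged string from an input. The helper name `parseTags` is hypothetical, and a second capture group around the text has been added to the pattern above so the content can be read back; the `m` flag is dropped because the pattern has no `^` or `$`.

```javascript
// Same idea as the pattern above, with an extra capture group for the text.
// The backreference \1 honours the i flag, so [bc]...[/BC] still matches.
const tagPattern = /\[([a-zA-Z]{2})\](.*?)\[\/\1\]/gi;

// Hypothetical helper: returns one { tag, text } object per tagged string.
function parseTags(input) {
  return Array.from(input.matchAll(tagPattern), (match) => ({
    tag: match[1].toLowerCase(),
    text: match[2],
  }));
}

console.log(parseTags('[aa]My text[/aa][ab]Another Text[/ab]'));
console.log(parseTags('[bc]TESt[/BC]'));
```

With the swapped input `[aa]My text[/ab][ab]Another Text[/aa]`, this returns a single `aa` entry whose text contains the inner `[/ab][ab]`, which is the "single tagged value" interpretation discussed above.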

Storing Regex as strings in database and then retrieving for use in Javascript and PHP

Main Question: Should escaped backslashes also be stored in the database for Javascript and how well that would play with PHP's regex engine?
Details
I have a number of regex patterns which can be used to classify strings into various categories. An example is as below:
(^A)|(\(A)
This can recognize for example an "A" in the start of the string or if it is immediately after an opening bracket ( but not if it is anywhere else in the string.
DBC(ABC)AA
ABC(DBC)AA
My project uses these regex patterns in two languages PHP and Javascript.
I want to store these patterns in a MySQL database and since there is no datatype for regex, I thought I could store it as VARCHAR or TEXT.
The issue arises if I directly use strings in Javascript: the \( is counted only as ( because the \ backslash is used as an escape character. If this is used to create a new RegExp, it gives an error:
Uncaught SyntaxError: unterminated parenthetical
For example:
let regexstring = "(^A)|(\(A)";
console.log(regexstring); // outputs => "(^A)|((A)"
let regex = new RegExp(regexstring); // Gives Uncaught SyntaxError: unterminated parenthetical
Based on this answer in StackOverflow, the solution is to escape the backslashes like:
let regexstring = "(^A)|(\\(A)";
console.log(regexstring); // Outputs => (^A)|(\(A)
regex = new RegExp(regexstring);
The question is therefore, should escaped backslashes also be stored in the database and how well that would play with PHP's regex engine?
I would store the raw regular expression.
The additional escape character is not actually part of the regex. It's there for JS to process the string correctly, because \ has a special meaning. You need to specify it when writing the string as "hardcoded" text. In fact, it would also be needed in the PHP side, if you were to use the same assignment technique in PHP, you would write it with the escape backslash:
$regexstring = "(^A)|(\\(A)";
You could also get rid of it if you changed the way you initialize regexstring in your JS:
<?
...
$regexstring = $results[0]["regexstring"];
?>
let regexstring = decodeURIComponent("<?=rawurlencode($regexstring);?>");
console.log(regexstring);
Another option is to just add the escaping backslashes in the PHP side:
<?
...
$regexstring = $results[0]["regexstring"];
$escapedRegexstring = str_replace('\\', '\\\\', $regexstring); // in PHP source, '\\' is one backslash
?>
let regexstring = "<?=$escapedRegexstring;?>";
However, regardless of escaping, you should note that there are other differences in syntax between PHP's regex engine and the one used by JS, so you may end up having to maintain two copies anyway.
Lastly, if these regex expressions are meant to be provided by users, then keep in mind that outputting them as-is into JS code is very dangerous as it can easily cause an XSS vulnerability. The first method, of passing it through rawurlencode (in the PHP side) and decodeURIComponent (in the JS side) - should eliminate this risk.

es6 multiline template strings with no new lines and allow indents

Been using es6 more and more for most work these days. One caveat is template strings.
I like to limit my line character count to 80. So if I need to concatenate a long string, it works fine because concatenation can be multiple lines like this:
const insert = 'dog';
const str = 'a really long ' + insert + ' can be a great asset for ' +
insert + ' when it is a ' + insert;
However, trying to do that with template literals would just give you a multi-line string with ${insert} placing dog in the resulting string. Not ideal when you want to use template literals for things like url assembly, etc.
I haven't yet found a good way to maintain my line character limit and still use long template literals. Anyone have some ideas?
The other question that is marked as accepted is only a partial answer. Below is another problem with template literals that I forgot to include before.
The problem with using new line characters is that it doesn't allow for indentation without inserting spaces into the final string. i.e.
const insert = 'dog';
const str = `a really long ${insert} can be a great asset for\
${insert} when it is a ${insert}`;
The resulting string looks like this:
a really long dog can be a great asset for dog when it is a dog
Overall this is a minor issue but would be interesting if there was a fix to allow multiline indenting.
Two answers for this problem, but only one may be considered optimal.
Inside template literals, JavaScript can be used inside of expressions like ${}. It's therefore possible to have indented multiline template literals such as the following. The caveat is that some valid JS expression must be present inside the braces, such as an empty string or a variable.
const templateLiteral = `abcdefgh${''
}ijklmnopqrst${''
}uvwxyz`;
// "abcdefghijklmnopqrstuvwxyz"
This method makes your code look like crap. Not recommended.
The second method was recommended by @SzybkiSasza and seems to be the best option available. For some reason concatenating template literals didn't occur to me as possible. I'm derp.
const templateLiteral = `abcdefgh` +
`ijklmnopqrst` +
`uvwxyz`;
// "abcdefghijklmnopqrstuvwxyz"
Why not use a tagged template literal function?
function noWhiteSpace(strings, ...placeholders) {
let withSpace = strings.reduce((result, string, i) => (result + placeholders[i - 1] + string));
let withoutSpace = withSpace.replace(/$\n^\s*/gm, ' ');
return withoutSpace;
}
Then you can just tag any template literal you want to have line breaks in:
let myString = noWhiteSpace`This is a really long string, that needs to wrap over
several lines. With a normal template literal you can't do that, but you can
use a template literal tag to allow line breaks and indents.`;
The provided function will strip all line breaks and line-leading tabs & spaces, yielding the following:
> This is a really long string, that needs to wrap over several lines. With a normal template literal you can't do that, but you can use a template literal tag to allow line breaks and indents.
I published this as the compress-tag library.
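For the indented case from the question, a variation on the same tag idea might look like the sketch below. The name `oneLine` is hypothetical (it is not the API of the compress-tag library), and it assumes every line break plus its surrounding whitespace should collapse to a single space.

```javascript
// Hypothetical tag: joins the template pieces, then collapses each line break
// (with any indentation around it) into one space and trims the ends.
function oneLine(strings, ...values) {
  const joined = strings.reduce(
    (result, part, i) => result + String(values[i - 1]) + part
  );
  return joined.replace(/\s*\n\s*/g, ' ').trim();
}

const insert = 'dog';
const str = oneLine`
  a really long ${insert} can be a great asset for
  ${insert} when it is a ${insert}
`;

console.log(str);
// a really long dog can be a great asset for dog when it is a dog
```

Note that line breaks inside interpolated values are collapsed too, which may or may not be what you want.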

Is it possible to have a comment inside a es6 Template-String?

Let's say we have a multiline es6 Template-String to describe e.g. some URL params for a request:
const fields = `
id,
message,
created_time,
permalink_url,
type
`;
Is there any way to have comments inside that backtick Template-String? Like:
const fields = `
// post id
id,
// post status/message
message,
// .....
created_time,
permalink_url,
type
`;
Option 1: Interpolation
We can create interpolation blocks that return an empty string, and embed the comments inside them.
const fields = `
id,${ /* post id */'' }
message,${ /* post status/message */'' }
created_time,
permalink_url,
type
`;
console.log(fields);
Option 2: Tagged Templates
Using tagged templates we can clear the comments and reconstruct the strings. Here is a simple commented function that uses Array.map(), String.replace(), and a regex expression (which needs some work) to clear comments, and return the clean string:
const commented = (strings, ...values) => {
const pattern = /\/{2}.+$/gm; // basic idea
return strings
.map((str, i) =>
`${str}${values[i] !== undefined ? values[i] : ''}`)
.join('')
.replace(pattern, '');
};
const d = 10;
const fields = commented`
${d}
id, // post ID
${d}
message, // post/status message
created_time, // ...
permalink_uri,
type
`;
console.log(fields);
I know it's an old answer, but seeing the answers above I feel compelled to both answer the pure question, and then to answer the spirit of the asker's question.
Can you use comments in template literal strings?
Yes. Yes you can. But it's not pretty.
const fields = `
id, ${/* post ID */''}
message, ${/* post/status message */''}
created_time, ${/*... */''}
permalink_url,
type
`;
Note that you have to put '' (an empty string) in the ${ } braces so that Javascript has an expression to insert. Not doing so will result in a syntax error. The quotes can go anywhere outside of the comment.
I'm not a huge fan of this. It's pretty ugly and makes commenting cumbersome, nevermind that toggling comments becomes difficult in most IDEs.
Personally, I use template strings wherever possible, as they are a fraction more efficient than regular Strings, and they capture literally all the text you want, mostly without escaping. You can even put function calls in there!
The string in the example above will be a little odd, and potentially useless for what you're looking for, however, as there will be an initial line-break, extra space between the comma and the comment, as well as an extra final line-break. Removing that unwanted space could be a small performance hit. You could use a regex for that, for speed and efficiency, though... more on that below...
Now to answer the intent of the question:
How do I write a comma-delimited list string, with comments on every line?
const fields = [
"id", // post ID
"message", // post/status message
"created_time", //...
"permalink_url",
"type"
].join(",\n");
Joining an Array is one way... (as suggested by @jared-smith)
However, in this case, you are creating an array and then immediately discarding the organized data when you only assign the return value of the join() function. Not only that, but you are creating a memory pointer for each string in the array, which won't be garbage collected till end of scope. In that case, it might be more useful to capture the array, joining on the fly as use dictates, or to use a template literal and differently comment your implementation, like ghostDoc style.
It seems that you are only using template literals in order to satisfy a desire to not have quote marks on each line, minimizing cognitive dissonance between the 'string' query parameters as they look in the url and the code. You should be aware that this preserves line breaks, and I doubt you want that. Consider instead:
/****************
* Fields:
* id : post ID
* message : post/status message
* created_time : some other comment...
*/
const fields = `
id,
message,
created_time,
permalink_uri,
type
`.replace(/\s/g,'');
This uses a regex to filter out all the whitespace, while keeping the list readable and rearrangeable. All the regex literal is doing is capturing the whitespace and then the replace method replaces the captured text with '' (the g on the end just tells the regex not to stop at the first match it finds, in this case, the first newline char.)
or, most nastily, you could just put the comments directly in your template literal, and then strip them with a regex:
const fields = `
id, // post ID
message, // post/status message
created_time, // ...
permalink_uri,
type
`.replace(/\s+\/\/.*\n/g,'').replace(/\s/g,'');
That first regex will find and replace with an empty string ('') all instances of: one or more whitespace characters that precede a double slash (each slash is escaped by a backslash), followed by the rest of the line up to and including the newline character. If you wanted to use /* multiline */ comments, this regex becomes a little more complex; you'll have to add another .replace() on the end:
.replace(/\/\*.*\*\//g,'')
That regex can only go after you strip the \n newlines out, or the regex won't match the now-not-multiline comment. That would look something like this:
const fields = `
id, // post ID
message, /* post/
status message */
created_time, // ...
permalink_uri,
type
`.replace(/\s+\/\/.*\n/g,'').replace(/\s/g,'').replace(/\/\*.*\*\//g,'');
All of the above will result in this string:
"id,message,created_time,permalink_uri,type"
There's probably a way to do that with only one regex, but it's beyond the scope here, really. And besides, I'd encourage you to fall in love with regexes by playing with them yourself!
I'll try to get a https://jsperf.com/ up on this later. I'm super curious now!
No.
That syntax is valid, but will just return a string containing \n// post id\nid, rather than removing the comments and creating a string without them.
If you look at §11.8.6 of the spec, you can see that the only token recognized between the backtick delimiters is TemplateCharacters, which accepts escape sequences, line breaks, and normal characters. In §A.1, SourceCharacter is defined to be any Unicode point (except the ones excluded in 11.8.6).
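A quick check makes this concrete: the comment text simply becomes part of the string.

```javascript
const fields = `
// post id
id,
message
`;

// The "comment" is ordinary template text and is kept verbatim.
console.log(JSON.stringify(fields)); // "\n// post id\nid,\nmessage\n"
console.log(fields.includes('// post id')); // true
```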
Just don't use template strings:
const fields = [
'id', // comment blah blah
'message',
'created_time',
'permalink_url',
'type'
].join(',');
You pay the cost of the array and method call on initialization (assuming the JIT isn't smart enough to optimize it away entirely).
As pointed out by ssube, the resulting string will not retain the linebreaks or whitespace. It depends on how important that is, you can manually add ' ' and '\n' if necessary or decide you don't really need inline comments that badly.
UPDATE
Note that storing programmatic data in strings is generally held to be a bad idea: store them as named vars or object properties instead. Since your comment reflects you're just converting a bunch of stuff into a URL query string:
const makeQueryString = (url, data) => {
return url + '?' + Object.keys(data)
.map(k => `${k}=${encodeURIComponent(data[k])}`)
.join('&');
};
let qs = makeQueryString(url, {
id: 3,
message: 'blah blah',
// etc.
});
Now you have stuff that is easier to change, understand, reuse, and more transparent to code analysis tools (like those in your IDE of choice).
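In environments that provide the standard `URLSearchParams` class (modern browsers and Node.js), a similar helper can lean on it instead of hand-encoding. This is only a sketch: the helper name and the example URL are made up, and unlike the version above it also encodes the keys and writes spaces as `+`.

```javascript
// Hypothetical variant of the helper above, built on URLSearchParams.
const makeQueryStringWithParams = (url, data) => {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(data)) {
    params.append(key, String(value)); // values are stringified explicitly
  }
  return `${url}?${params.toString()}`;
};

const qs = makeQueryStringWithParams('https://example.com/posts', {
  id: 3,
  message: 'blah blah',
});
console.log(qs); // https://example.com/posts?id=3&message=blah+blah
```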
Yes it is possible
Use <!-- content here -->

Understand code from string using Regex or something - Js

I wanted to run a string replace function on a piece of code and make sure that all the strings in the code is intact and unchanged using javascript. For example if I have a code like below:
var a = "I am ok";
if (a == "I am ok") {
alert("That's great to know");
}
Now, I want to run a string replace on this code block. But it should only effect the code part of it. Not the strings which are in double quotes. Can this be done using regex or any other method?
AST
To avoid any chance of error in code manipulation using an Abstract Syntax Tree (AST) type solution is best. One example implementation is in UglifyJS2 which is a JavaScript parser, minifier, compressor or beautifier toolkit.
RegEx
Alternatively if an AST is over the top for your specific task you can use RegEx.
But do you have to contend with comments too?
The process might look like this:
Use a carefully formed regex to split the JavaScript code string based on these in this order:
comment blocks
comment lines
quoted strings both single and double quotes (remembering to contend with escaping of characters).
Iterate through the split components. If a component is a string (begins with " or ') or a comment (begins with // or /*), ignore it; otherwise run your replacement.
(and the simple part) join array of strings back together.
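The steps above can be sketched roughly as follows. The function name `replaceOutsideStrings` is hypothetical, and the sketch deliberately ignores regex literals and template literals, which is one reason an AST is the safer route for real code.

```javascript
// Hypothetical helper: applies a replacement only to code that is outside
// comments and quoted strings. Regex and template literals are NOT handled.
function replaceOutsideStrings(code, search, replacement) {
  // Order matters: block comments, line comments, then "..." and '...' strings
  // (with backslash escapes). The capture group makes split() keep the matches.
  const protectedParts =
    /(\/\*[\s\S]*?\*\/|\/\/[^\n]*|"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')/g;
  return code
    .split(protectedParts)
    // Odd indexes are the captured comments/strings: leave them untouched.
    .map((part, index) => (index % 2 === 1 ? part : part.replace(search, replacement)))
    .join('');
}

const source = [
  'var a = "I am ok";',
  'if (a == "I am ok") {',
  '  alert("That\'s great to know");',
  '}',
].join('\n');

// Rename the variable a to status without touching the string contents.
console.log(replaceOutsideStrings(source, /\ba\b/g, 'status'));
```

Passing a global regular expression as `search` replaces every occurrence in each code segment; a plain string would only replace the first occurrence per segment.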
You would have to place the function code in a string variable, run a normal regex operation over that string, and then convert it to a function afterwards with:
var func = new Function('a', 'b', 'return a + b');
EDIT: Use regex to exclude the text between double quotes if you need to.
