With Javascript, suppose I have a string like (1)(((2)(3))4), can I get a regex to match just (1) and (((2)(3))4), or do I need to do something more complicated?
Ideally the regex would return ["((2)(3))","4"] if you searched ((2)(3))4. Actually that's really a requirement. The point is to group things into the chunks that need to be worked on first, like the way parentheses work in math.
No, there is no way to match only top-level parentheses with regex.
Looking only at the top level doesn't make the problem easier than general "parsing" of recursive structures. (See this relevant popular SO question with a great answer).
Here's a simple intuitive reason why Regex can't parse arbitrary levels of nesting:
To keep track of the level of nesting, one must count. If one wants to be able to keep track of an arbitrary level of nesting, one needs an arbitrarily large number while running the program.
But regular expressions are exactly those that can be implemented by DFAs, that is, deterministic finite automata. These have only a finite number of states. Thus they can't keep track of an arbitrarily large number.
This argument works also for your specific concern of being only interested in the top level parentheses.
To recognize the top level parentheses, you must keep track of arbitrary nesting preceding any one of them:
((((..arbitrarily deep nesting...))))((.....)).......()......
^toplevel                           ^^       ^       ^^
So yes, you need something more powerful than regex.
While, if you are very pragmatic, it might be okay for your concrete application to say that you won't encounter any nesting deeper than, say, 1000 (and so you might be willing to go with regex), it's also a very practical fact that any regex recognizing more than 2 levels of nesting is basically unreadable.
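If regex is ruled out, a simple depth-counting scan is enough for this particular need. Here is a minimal sketch (my own illustration, not part of the answer above; the function name is made up) that splits a string into its top-level chunks:

// Split a string into top-level chunks: parenthesized groups at depth 0 and
// any bare text between them, e.g. "((2)(3))4" -> ["((2)(3))", "4"].
function topLevelChunks(str) {
  var chunks = [];
  var depth = 0;
  var current = '';
  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    if (ch === '(') {
      if (depth === 0 && current !== '') { chunks.push(current); current = ''; }
      depth++;
      current += ch;
    } else if (ch === ')') {
      depth--;
      current += ch;
      if (depth === 0) { chunks.push(current); current = ''; }
    } else {
      current += ch;
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}

topLevelChunks('(1)(((2)(3))4)'); // ["(1)", "(((2)(3))4)"]
topLevelChunks('((2)(3))4');      // ["((2)(3))", "4"]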
Well, here is one way to do it. As Jo So pointed out, you can't really handle an indefinite amount of nesting with regex in JavaScript, but you can make something arbitrarily deeply nested pretty easily. I'm not sure how the performance scales, though.
First I figured out that you need recursion. Then I realized that you can make your regex 'recursive' by copying and pasting it into itself, like so (using curly braces for clarity):
Starting regex
Finds stuff in brackets that isn't itself brackets.
/{([^{}])*}/g
Then copy and paste the whole regex inside itself! (I spaced it out so you can see where it was pasted in.) So now it is basically like a( x | a( x )b )b
/{([^{}] | {([^{}])*} )*}/g
That will get you one level of recursion, and you can continue ad nauseam in this fashion and actually double the amount of recursion each time:
//matches {4{3{2{1}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*}/g
//matches {8{7{6{5{4{3{2{1}}}}}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*})*})*})*})*}/g
Finally I just add |[^{}]+ on the end of the expression to match stuff that is completely outside of brackets. Crazy, but it works for my needs. I feel like there is probably some clever way to combine this concept with a recursive function in order to get a truly recursive matcher, but I can't think of it now.
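The copy-and-paste step can also be automated. A small sketch of that idea (my own addition, not part of the answer above; it adds one level of nesting per paste rather than doubling):

// Build the curly-brace matcher with `levels` extra levels of nesting by
// repeatedly pasting the starting pattern into a new outer layer.
function nestedBraceRegex(levels) {
  var pattern = '{([^{}])*}';                   // the starting, non-nested pattern
  for (var i = 0; i < levels; i++) {
    pattern = '{([^{}]|' + pattern + ')*}';     // paste it into the alternation
  }
  return new RegExp(pattern + '|[^{}]+', 'g');  // plus the "outside brackets" part
}

'{a{b}} c {d}'.match(nestedBraceRegex(1)); // ["{a{b}}", " c ", "{d}"]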
If you can be sure that the parentheses are balanced (I'm sure there are other resources out there that can answer that question for you if required), and if by "top-level" you're happy to find local as well as global maxima, then all you need to do is find any content that starts with an open bracket and ends with a close bracket, with no intermediate open bracket between the two:
I think the following should do that for you and helpfully group any "top-level" content:
\(([^\(]*?)\)
That content may not all be at the same "level", but if you think of the nested brackets as describing the branching of a tree, the regex will return to you the leaves. If you pre-process your text to be wrapped in parentheses to start with, and the earlier assumptions are met, you can guarantee always getting at least one "leaf".
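For illustration, here is one way to use that leaf idea (my own sketch, with a slightly tightened character class that also excludes close-brackets): repeatedly match the innermost groups and replace each one with whatever it evaluates to, so the inner chunks get processed first, the way parentheses work in math.

// Repeatedly find innermost "(...)" groups (the leaves) and hand each one to a
// callback, replacing the leaf with whatever the callback returns.
function reduceLeaves(str, evalLeaf) {
  var leaf = /\(([^()]*?)\)/;   // no parens inside, i.e. an innermost group
  while (leaf.test(str)) {
    str = str.replace(leaf, function (whole, inner) {
      return evalLeaf(inner);
    });
  }
  return str;
}

// e.g. collapse every leaf to the number of characters it contained
// (assuming the callback's result contains no parentheses itself):
reduceLeaves('(1)(((2)(3))4)', function (inner) { return String(inner.length); });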
Some Context:
• I'm still learning to code atm (started less than a year ago)
• I'm mostly self-taught, since I think my computer science class feels too slow.
• The website I'm learning on is code.org, specifically in the "game lab"
• The site's coding environments only use ES5 because they don't want to update them to ES6, or something like that
• In class we're making function libraries, and while it's not required, I want mine to be "highly usable," for lack of a better term, while also being reasonably short (I prefer not to automate things if I can get them done quicker somehow, but that's just personal preference).
So now for where the actual question comes in: in a stringified array, is it possible to differentiate between a quotation mark that was inside a string and a quotation mark that actually denotes a string? Because I noticed something confusing with the output of JSON.parse(JSON.stringify()) on code.org, specifically, if you write something like,
JSON.parse(JSON.stringify(['hi","hi']))
the output will be ["hi","hi"], which looks just like an array containing two strings (on code.org it doesn't show the \'s) but still contains just one. That's fine unless you're using a regular expression to detect whether or not a match is within a string (by checking that every quotation mark after the match has a "partner"), which is what I'm doing in 4 different functions. One flattens a list (since ES5 doesn't have Array.prototype.flat()), one removes all instances of the arguments from a list, one removes all instances of specified operand types, and one replaces all instances of an argument with the one that follows it.
Now I know the odds of a string containing an odd number of quotation marks (whether single or double) are likely extremely low, but it still bothers me that I have no way to differentiate between quotes formerly within a string and quotes which formerly denoted a string (in an array after it's been stringified), as these functions otherwise work exactly as intended. The regular expression I'm using to determine if there's an even number of quotes left in the stringified array is /(?=[^"]*(?:(?:"[^"]*){2})*$)/, where you put the match before the lookahead assertion and anything you absolutely want to follow before the first [^"]*.
To highlight the actual issue I'm trying to solve, this is my flatten function (since it's the shortest of the 4). And yeah, yeah, I know "eval bad", but it's extremely convenient to use here since it shortens the actual modification to a single line, and I highly doubt anyone's actually going to find a way to abuse it given its implementation ("this" needs to be an array for splice to work, so if I'm not mistaken there isn't really a way to abuse it, but tell me if I'm wrong, since I probably am).
Array.prototype.flatten = function() {
eval(('this.splice(0,this.length,' + JSON.stringify(this).replace(/[\[\]](?=[^"]*(?:(?:"[^"]*){2})*$)/g, '') + ')').replace(/,(?=((,[^"]*(?:(?:"[^"]*){2})*)*.$))/g, ''));
return this;
};
This works really well outside of the previously specified conditions, but if I were to call it with something like [1,'"'], it'd find 3 quotation marks after the \[ and wouldn't be able to remove it, but would still be able to remove the \]. So when eval actually gets to .splice(), the call would look like eval('this.splice(0,this.length,[1,"\"")'), causing the error Unexpected token ')' to be thrown.
Any help on this is appreciated, even if it's just telling me it isn't possible, thanks for reading my ramblings.
TL;DR: in a stringified array is it possible to differentiate between " and \" (string wrapping quotes of strings within a stringified array and quotes within a string within a stringified array) in a regular expression or any other method using only the tools available in ES5 (site I'm learning on doesn't want to update their project environments for whatever reason)
You are having a problem because your input is not a regular language (it needs at least a context-free grammar) and cannot be correctly parsed with regular expressions.
Can you explain why JSON.parse is unacceptable? It is even in ancient browsers and versions of node.js.
Someone writing a json parser might use bison or yacc, so if this is a learning experience consider playing with jison.
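To illustrate the JSON.parse suggestion (a small ES5-compatible sketch of my own): the parser already distinguishes a quote that delimits a string from an escaped quote inside one, so the "partner quote" counting can be avoided entirely by parsing first and doing the work on the real array.

var str = JSON.stringify([1, 'hi","hi']);  // the string [1,"hi\",\"hi"]
var arr = JSON.parse(str);                 // back to the real structure

arr.length;  // 2 -- the escaped quotes never split the string
arr[1];      // 'hi","hi'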
I ended up finding a way to do this. For whatever reason (either I didn't notice last night because I was tired or it legitimately changed overnight, though likely the former), I can now see the \" when viewing the value of the stringified array, and lo and behold, modifying the regular expression so that it ignores instances of \" resolved the issue.
New regular expression for quotation mark pair matching now reads:
// old even number of quotation marks after match check
/(?=[^"]*(?:(?:"[^"]*){2})*$)/
// new even number of quotation marks after match check
/(?=(\\"|[^"])*(?:(?:(?<!\\)"(\\"|[^"])*){2})*$)/
// (only real difference is that it accounts for the \)
Sorry to anyone who may have misunderstood the question due to how all over the place it was. I'm aware that I tend to write a lot more than is necessary, and it often leads to tangents that muddle my view of what I was initially asking, which in turn makes the point I'm actually trying to get across even harder to grasp. Thanks to those who still tried to help me regardless of how much of a mess of a first question this was.
Disclaimer: my question is not focused on the exercise, it's just an example (although if you have any interesting tips on the example itself, feel free to share!).
Say I'm working with parsing some strings with Regex in JavaScript, and my main focus is performance (speed).
I have a piece of regex which checks for a numeric string, and then parses it using Number if it's numeric:
if (/^\[[0-9]+\]$/.test(str)) {
  val = Number(str.match(/^\[([0-9]+)\]$/)[1]);
}
Note how the conditional test does not have a capture group around the digits. This leads to writing out basically the same regex twice, except with a capture group the second time.
What I would like to know is this: does adding a capture group to a regex used alongside test() in a condition affect performance in any way? I'd like to simply use the capture regex in both places, as long as there is no performance hit.
And as to why I'm doing test() then match() rather than match() and checking for null: I want to keep parsing as fast as possible when there's a miss, but it's okay to be a little slower when there's a hit.
If it's not clear from the above, I'm referring to JavaScript's regex engine - although if this differs across engines it'd be nice to know too. I'm working specifically in Node.js here, should it also differ across JS engines.
Thanks in advance!
Doing 2 regexps that are very similar in scope will almost always be slower than doing a single one, because the input has to be scanned twice, and since regexps are greedy (they try to match as much as they can), each scan can already take close to the maximum amount of time possible.
What you're asking is basically: is saving a bit of memory in the worst-case scenario (using .test to avoid the capture) faster than just spending the extra memory? The answer is no; using the extra memory speeds up your process.
Don't take my word for it though, here's a jsperf: http://jsperf.com/regex-perf-numbers
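In other words (my own sketch, with made-up names, not taken from the linked jsperf), you can keep the capture group and still exit cheaply on a miss by running the regex once and checking the result:

var NUM_RE = /^\[([0-9]+)\]$/;     // one regex, with the capture group

function parseBracketedNumber(str) {
  var m = NUM_RE.exec(str);        // the engine runs exactly once
  return m ? Number(m[1]) : null;  // a miss costs one failed scan, not two
}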
I was writing some regexes in my text editor (Sublime) today in an attempt to quickly find specific segments of source code, and it required getting a little creative because sometimes the function call might contain more function calls. For example I was looking for jQuery selectors:
$("div[class='should_be_using_dot_notation']");
$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));
I don't consider it unreasonable to expect one of my favorite powertools (regex) to help me do this sort of searching, but it's clear that the expression required to parse the second bit of code there will be somewhat complex as there are two levels of nested parens.
I am sufficiently versed in the theory to know that this sort of parsing is exactly what a context-free grammar parser is for, and that building out a regex is likely to suck up more memory and time (perhaps in an exponential rather than O(n^3) fashion). However I am not expecting to see that sort of feature available in my text editor or web browser any time soon, and I just wanted to squeak by with a big nasty regex.
Starting from this (This matches zero levels of nested parens, and no trivial empty ones):
\$\([^)(]+?\)
Here's what the one-level nested parens one I came up with looks like:
\$\(((\([^)(]*\))|[^)(])+?\)
Breaking it down:
\$\( begin text
( groups the contents of the $() call
(\( groups a level 1 nested pair of parens
[^)(]* only accept a valid pair of parens (it shall contain anything but parens)
\)) close level 1 nesting
| contents also can be
[^)(] anything else that also is not made of parens
)+? not sure if this should be plus or star, or if it can be greedy (the contents are made up of either a level 1 paren group or any other character)
\) end
This worked great! But I need one more level of nesting.
I started typing up the two-level nested expression in my editor and it began to pause for 2-3 seconds at a time when I put in *'s.
So I gave up on that and moved to regextester.com, and before very long at all, the entire browser tab was frozen.
My question is two-fold.
What's a good way of constructing an arbitrary-level regex? Is this something that only human pattern-recognition can ever hope to achieve? It seems to me that I can get a good deal of intuition for how to go about making the regex capable of matching two levels of nesting based on the similarities between the first two. I think this could just be distilled down into a few "guidelines".
Why does regex parsing on non-enormous regexes block or freeze for so long?
I understand that the O(n) linear time is in terms of n, the length of the input the regex runs over (i.e. my test strings). But in a system where the regex is recompiled each time I type a new character into it, what would cause it to freeze up? Is this necessarily a bug in the regex code (I hope not, I thought the JavaScript regex implementation was pretty solid)? Part of my reasoning for moving to a different regex tester from my editor was that I'd no longer be running it (on each keypress) over all ~2000 lines of source code, but that did not prevent the whole environment from locking up as I edited my regex. It would make sense if each character changed in the regex corresponded to some simple transformation in the DFA that represents that expression. But this appears not to be the case. If there are certain exponential time or space consequences to adding a star in a regex, that could explain this super-slow-to-update behavior.
Meanwhile I'll just go work out the next higher nested regexes by hand and copy them into the fields once I'm ready to test them...
Um. Okay, so nobody wants to write the answer, but basically the answer here is
Backtracking
It can cause exponential runtime when certain constructs (like the nested, non-greedy quantifiers here) give the engine many different ways to split up the same text.
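A classic demonstration of the effect (my own example, not one of the patterns from the question): when the branches of an alternation inside a repeated group can match the same text, a failing match forces the engine to try every way of splitting the input.

var evil = /^(a|aa)*$/;                    // both branches can match "a"
var input = new Array(33).join('a') + 'b'; // 32 a's plus a char that forces failure
evil.test(input);                          // the number of splits to try grows
                                           // like the Fibonacci numbers, i.e.
                                           // exponentially in the input length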
The answer to the first part of my question:
The two-nested expression is as follows:
\$\(((\(((\([^)(]*\))|[^)(])*\))|[^)(])*\)
The transformation to make the next nested expression is to replace instances of [^)(]* with ((\([^)(]*\))|[^)(])*, or, as a meta-regex (where the replace-with section does not need escaping):
s/\[^\)\(\]\*/((\([^)(]*\))|[^)(])*/
This is conceptually straightforward: In the expression matching N levels of nesting, if we replace the part that forbids more nesting with something that matches one more level of nesting then we get the expression for N+1 levels of nesting!
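That substitution can be applied mechanically. Here is a rough sketch (mine, not the answer's) that starts from the innermost "no more nesting" piece and applies the replacement N times; for N = 2 it produces exactly the expression above:

// Build the $(...) matcher for `levels` levels of nested parentheses by
// repeatedly wrapping the "no nesting allowed" part, as described above.
function jqCallRegex(levels) {
  var body = '[^)(]*';                        // level 0: no parens at all
  for (var i = 0; i < levels; i++) {
    body = '((\\(' + body + '\\))|[^)(])*';   // allow one more level
  }
  return new RegExp('\\$\\(' + body + '\\)', 'g');
}

var src = '$(escapeJQSelector("[name=\'crazy{"+getName(object)+"}\']"));';
src.match(jqCallRegex(2)); // matches the whole $( ... ) call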
To match an arbitrary number of nested (), with only one pair on each level of nesting, you could use the following, changing 2 to whatever number of nested () you require
/(?:\([^)(]*){2}(?:[^)(]*\)){2}/
To avoid excessive backtracking you want to avoid using nested quantifiers, particularly when the sub-pattern on both sides of an inner alternation is capable of matching the same substring.
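As an illustration of the construction above (my own, not part of the answer), the repetition count can be generated for any depth; the sub-patterns on either side of each repetition can't match the same text, which is what keeps backtracking under control:

// Matches exactly `depth` nested "(...)" pairs, one pair per level.
function onePairPerLevel(depth) {
  return new RegExp('(?:\\([^)(]*){' + depth + '}(?:[^)(]*\\)){' + depth + '}');
}

onePairPerLevel(2).test('(a(b)c)'); // true
onePairPerLevel(3).test('(a(b)c)'); // false: only two levels deep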
I'm interested in finding a more-sophisticated-than-typical algorithm for finding differences between strings, that can be "tuned" via some parameters, to balance between such things as "maximize count of identical characters" vs. "maximize the length of spans" vs. "try to keep whole words intact".
Ultimately, I want to be able to make the results as human readable as possible. For instance, if a long sentence has been replaced with an entirely new sentence, where the only things it has in common with the original are the words "the" "and" and "a" in that order, I might want it treated as if the whole sentence is changed, rather than just that 4 particular spans are changed --- just like how a reasonable person would see it.
Does such a thing exist? Although I'm working in javascript/node.js, an algorithm in any language would be helpful.
I'm actually ok with something that uses Monte Carlo methods or the like, if its results are better. Computation time is not an issue (within reason), nor is determinism.
Note: although this is beyond the scope of what I'm asking, I'll throw one more thing out there just in case: it would also be great if it could recognize changes that are out of order. For instance, if someone changes the order of two paragraphs while leaving them otherwise identical, it would be awesome if it recognized that as a simple move, rather than as one subtraction and one unrelated addition.
I've had good luck with diff_match_patch. There are some good options for tuning it for readability.
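A small sketch of the kind of tuning meant here (my own example; it assumes the standard Google diff-match-patch JavaScript API, and how you load the library depends on your packaging):

// The library exposes a diff_match_patch constructor once loaded.
var dmp = new diff_match_patch();

var oldText = 'The quick brown fox jumps over the lazy dog.';
var newText = 'A slow brown bear walks around the sleepy dog.';

var diffs = dmp.diff_main(oldText, newText);
// Fold small coincidental equalities (a lone "the" or "a") into the surrounding
// insertions/deletions so the diff reads the way a person would describe it.
dmp.diff_cleanupSemantic(diffs);
// diffs is now an array of [operation, text] pairs: -1 delete, 0 equal, 1 insert.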
Try http://prettydiff.com/. Its code is already formatted for compatibility with CommonJS, which is the module system Node uses.
I am looking for a way, using static analysis of two JavaScript functions, to tell if they are the same. Let me define multiple definitions of "the same".
Level 1: The functions are the same except for possible different whitespace, e.g. TABS, CR, LF and SPACES.
Level 2: The functions may have different whitespace like Level 1, but also may have different variable names.
Level 3: ???
For level one, I think I could just remove all (non-literal, which may be tough) whitespace from each of the two strings containing the JS function definitions, and then compare the strings.
For level two, I think I would need to use something like SpiderMonkey's parser to generate two parse trees, and then write a comparer which walks the trees and allows variables to have different names.
[Edit] Williham makes a good point below. I do mean identical. Now I'm looking for some practical strategies, particularly with regard to using parse trees.
Reedit:
To expound on my suggestion for determining identical functions, the following flow can be suggested:
Level 1: Remove any whitespace that is not part of a string literal; insert newlines after each {, ; and }, and compare. If equal, the functions are identical; if not:
Level 2: Move all variable declarations and assignments that don't depend on the state of other variables defined in the same scope to the start of the scope they are declared in (or, if you don't want to actually parse the JS, the start of the enclosing braces), and order them by line length, treating all variable names as being 4 characters long and falling back to alphabetical ordering (ignoring variable names) in case of tied lengths. Reorder all collections in alphabetical order, and rename all variables vSNN, where v is literal, S is the number of nested braces and NN is the order in which the variable was encountered.
Compare; if equal, the functions are identical, if not:
Level 3: Replace all string literals with "sNN", where " and s are literal and NN is the order in which the string was encountered. Compare; if equal, the functions are identical, if not:
Level 4: Normalize the names of any functions known to be the same by using the name of the function with the highest priority according to alphabetical order (in the example below, any calls to p_strlen() would be replaced with c_strlen()). Repeat the re-orderings as per Level 2 if necessary. Compare; if equal, the functions are identical; if not, the functions are almost certainly not identical.
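For concreteness, a rough sketch of just the Level 1 step (my own code; it ignores comments and regex literals for brevity):

// Level 1 normalization: drop whitespace that is not inside a string literal,
// then put a line break after every {, ; and } so the two versions line up.
function normalizeLevel1(src) {
  var out = '';
  var inString = false, quote = '';
  for (var i = 0; i < src.length; i++) {
    var ch = src.charAt(i);
    if (inString) {
      out += ch;
      if (ch === '\\') { out += src.charAt(++i); }   // keep escape sequences intact
      else if (ch === quote) { inString = false; }
    } else if (ch === '"' || ch === "'") {
      inString = true; quote = ch; out += ch;
    } else if (/\s/.test(ch)) {
      // skip whitespace outside string literals
    } else {
      out += ch;
      if (ch === '{' || ch === ';' || ch === '}') { out += '\n'; }
    }
  }
  return out;
}

function identicalAtLevel1(fnA, fnB) {
  return normalizeLevel1(String(fnA)) === normalizeLevel1(String(fnB));
}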
Original answer:
I think you'll find that you mean "identical", not "the same".
The difference, as you'll find, is critical:
Two functions are identical if, following some manner of normalization, (removing non-literal whitespace, renaming and reordering variables to a normalized order, replacing string literals with placeholders, …) they compare to literally equal.
Two functions are the same if, when called for any given input value they give the same return value. Consider, in the general case, a programming language which has counted, zero-terminated strings (hybrid Pascal/C strings, if you will). A function p_strlen(str) might look at the character count of the string and return that. A function c_strlen(str) might count the number of characters in the string and return that.
While these functions certainly won't be identical, they will be the same: For any given (valid) input value they will give the same value.
My point is:
Determining whether two functions are identical (what you seem to want to achieve) is a (moderately) trivial problem, done as you describe.
Determining whether two functions are truly the same (what you might actually want to achieve) is non-trivial; in fact it's downright Hard, closely related to the Halting Problem (and undecidable in general), and not something that can be done with static analysis.
Edit: Of course, functions that are identical are also the same; but in a highly specific and rarely useful way for complete analysis.
Your approach for level 1 seems reasonable.
For level 2, how about doing some rudimentary variable substitution on each function and then applying the level 1 approach? Start at the beginning and, for each variable declaration you encounter, rename it to var1, var2, ... varX.
This does not help if the functions declare variables in different orders... var i and var j may be used the same way in both functions but are declared in different orders. Then you are probably left doing a comparison of parse trees like you mention.
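A naive sketch of that substitution (mine; it only picks up simple var declarations, ignores scoping, and will happily rewrite matching words inside strings or comments):

// Rename declared variables to var1, var2, ... in order of first declaration.
function renameVars(src) {
  var decl = /\bvar\s+([A-Za-z_][A-Za-z0-9_]*)/g;
  var names = [], m;
  while ((m = decl.exec(src)) !== null) {
    if (names.indexOf(m[1]) === -1) names.push(m[1]);
  }
  for (var i = 0; i < names.length; i++) {
    src = src.replace(new RegExp('\\b' + names[i] + '\\b', 'g'), 'var' + (i + 1));
  }
  return src;
}

renameVars('var count = 0; count += n;') === renameVars('var total = 0; total += n;'); // true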
See my company's (Semantic Designs) Smart Differencer tool. This family of tools parses source code according to compiler-level-detail grammar for the language of interest (in your case, JavaScript), builds ASTs, and then compares the ASTs (which effectively ignores whitespace and comments). Literal values are normalized, so it doesn't matter how they are "spelled"; 10E17 has the same normalized value as 1E18.
If the two trees are the same, it will tell you "no differences". If they differ by a consistent renaming of an identifier, the tool will tell you the consistent renaming and the block in which it occurs. Other differences are reported as language-element (identifier, statement, block, class, ...) insertions, deletions, copies, or moves. The goal is to report the small set of deltas that plausibly explain the differences. You can see examples for a number of languages at the web site.
You can't in practice go much beyond this; to determine if two functions compute the same answer, in principle you have to solve the halting problem. You might be able to detect where two language elements that are members of a commutative list can be commuted without effect; we're working on this. You might be able to apply normalization rewrites to canonicalize certain forms (e.g., map all multiple declarations into a sequence of lexically sorted single declarations). You might be able to convert the source code into its equivalent set of dataflows and do a graph isomorphism match (the Programmer's Apprentice from MIT back in the 1980s proposed to do this, but I don't think they ever got there).
All of these are more work than you might expect.
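For a rough sense of what AST-level comparison with consistent renaming looks like in code (my own sketch, not Semantic Designs' tool; it assumes the esprima parser and handles only the easy cases):

// Compare two pieces of JS structurally: same tree shape, identifiers allowed
// to differ as long as the renaming is consistent, literal spelling ignored.
var esprima = require('esprima');

function sameUpToRenaming(srcA, srcB) {
  var rename = {};  // identifier name in A -> name in B

  function walk(a, b) {
    if (a === null || b === null) return a === b;
    if (Array.isArray(a) || Array.isArray(b)) {
      if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) return false;
      for (var i = 0; i < a.length; i++) { if (!walk(a[i], b[i])) return false; }
      return true;
    }
    if (typeof a !== 'object' || typeof b !== 'object') return a === b;
    if (a.type !== b.type) return false;
    if (a.type === 'Identifier') {
      if (!Object.prototype.hasOwnProperty.call(rename, a.name)) { rename[a.name] = b.name; }
      return rename[a.name] === b.name;
    }
    // ignore positions and the literal's source spelling ("10E17" vs "1E18")
    var keys = Object.keys(a).filter(function (k) {
      return k !== 'range' && k !== 'loc' && k !== 'raw';
    });
    for (var j = 0; j < keys.length; j++) {
      if (!walk(a[keys[j]], b[keys[j]])) return false;
    }
    return true;
  }

  return walk(esprima.parseScript(srcA), esprima.parseScript(srcB));
}

sameUpToRenaming('function f(a){return a+1}', 'function g(x){return x+1}'); // true
sameUpToRenaming('function f(a){return a+1}', 'function g(x){return x+2}'); // false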