Is it possible to parse regex strings with a regex - javascript

Just out of curiosity, is it possible to parse a string that is totally made out of random but valid regular expressions with a single regular expression?
Given this string of concatenated regexes:
<[^>]*>\xA9
parses to:
<[^>]*>
\xA9
Here the first one matches an HTML tag and the second one matches a copyright symbol.
Edit:
I found a similar question asked on SO claiming that it may be possible. Here, I'm referring to regex in JavaScript (ECMA-262) only.

No, it is not possible: the regular expression language allows parenthesized expressions representing capturing and non-capturing groups, lookarounds, etc., where parentheses must be balanced. It is not possible, even in theory, to write a regular expression that verifies whether parentheses are balanced in a given string. Without the ability to do that, you wouldn't know where one regexp ends and the next one starts.
In general, regex grammar is relatively complex. To get an idea of just how complex it is, take a look at the parser in the source of Java's Pattern class.
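To illustrate why balance checking needs more than a regex: a minimal sketch, assuming JavaScript regex syntax, of a scanner that tracks group depth with a counter, which is exactly the state a regular expression cannot keep. The function name hasBalancedGroups is hypothetical, and the scanner only looks at escapes, character classes, and parentheses; it does not validate the rest of the pattern.

```javascript
// Hypothetical helper: checks only that groups are balanced in a regex source
// string. Assumes JavaScript syntax, where "]" directly after "[" closes the class.
function hasBalancedGroups(source) {
  let depth = 0;        // current nesting depth of ( ... )
  let inClass = false;  // true while inside [ ... ]
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\') { i++; continue; }  // skip the escaped character
    if (inClass) {
      if (ch === ']') inClass = false;   // parentheses inside a class are literal
      continue;
    }
    if (ch === '[') { inClass = true; continue; }
    if (ch === '(') depth++;
    else if (ch === ')') {
      depth--;
      if (depth < 0) return false;       // closing paren with no opener
    }
  }
  return depth === 0 && !inClass;
}
```

For the string from the question, hasBalancedGroups('<[^>]*>\\xA9') returns true, but note that balance alone still does not say where one regex ends and the next begins.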

Related

RegExp & PCRE convert to tree with own syntax

I am looking for a pre-processor for creating my own regular expression syntax, based on RegExp and PCRE syntax, that can be compiled down to PCRE syntax. There is an example at the end.
I guess I need a regular expression processor that outputs a tree structure representing the regular expression, so I can traverse the tree, hot-swap some parts, and then compile it back to a regular expression string.
The processor must also allow adding custom syntax parsing/processing.
Has someone already made a processor like this? I wrote one myself some time ago, but I am looking for a more professional solution.
Of course, we are talking about node.js/JavaScript.
Yes, node.js has no built-in support for PCRE, but there is an npm module for using PCRE with node.js, and it works great!
Why would someone need it?
For example, you can build a big regular expression from smaller ones:
(John (like|love)s every (animal|creature) on earth: (#animals))
(#...) is a hash tag group: in its place goes another regular expression containing alternatives for all animals.
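A minimal sketch of how such a hash tag group could be expanded, assuming plain string substitution rather than the tree-based processor being asked for. The function name expandHashGroups, the groups map, and the animal list are all hypothetical, and the substitution does not account for an escaped \(# sequence.

```javascript
// Hypothetical helper: replaces every (#name) placeholder with a capturing
// group built from groups[name]. Plain text substitution, not a real parser.
function expandHashGroups(template, groups) {
  return template.replace(/\(#(\w+)\)/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(groups, name)) {
      throw new Error('Unknown hash tag group: ' + name);
    }
    return '(' + groups[name] + ')';
  });
}

// Illustrative data only; the real list of animals would come from elsewhere.
const expandedSource = expandHashGroups(
  '(John (like|love)s every (animal|creature) on earth: (#animals))',
  { animals: 'cat|dog|horse' }
);
const expandedRegex = new RegExp(expandedSource);
```

With this input, expandedRegex matches 'John loves every creature on earth: dog'.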
Another example, you can create more sophisticated kind of groups:
(#(a|x)(b)(c))
A permutation group matches all of its bracketed parts (three here, but it could be fewer or more) in any order:
(a|x)(b)(c)
(a|x)(c)(b)
(b)(a|x)(c)
(b)(c)(a|x)
(c)(a|x)(b)
(c)(b)(a|x)
There are more examples, but I guess I've made my point.
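The permutation group can be sketched as a function that generates every ordering and joins them into one alternation. This is only an illustration under assumptions: the names permutations and permutationGroup are hypothetical, the parts are passed in already split, and the output grows factorially, so it is practical only for a handful of parts. Capturing groups inside the parts are repeated in every alternative, which changes group numbering.

```javascript
// Returns all orderings of the given array (n! results for n items).
function permutations(items) {
  if (items.length <= 1) return [items.slice()];
  const result = [];
  items.forEach((item, i) => {
    const rest = items.slice(0, i).concat(items.slice(i + 1));
    for (const tail of permutations(rest)) {
      result.push([item, ...tail]);
    }
  });
  return result;
}

// Hypothetical helper: builds one non-capturing alternation over all orderings.
function permutationGroup(parts) {
  return '(?:' + permutations(parts).map(p => p.join('')).join('|') + ')';
}

const permutationSource = permutationGroup(['(a|x)', '(b)', '(c)']);
const permutationRegex = new RegExp('^' + permutationSource + '$');
```

Here permutationRegex accepts strings such as 'cbx' and rejects 'abb'.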

Confusion regarding RegExp matches, HTML tags, and newlines

I am attempting to create a Markdown-to-HTML parser. I am trying to use regular expressions to match an input string that may or may not contain HTML tags and whitespace/newlines. I have encountered an interesting case that I do not understand at all.
My regular expression is regex = /\*([\w\s]+|<.+>)\*/g.
The following works:
'*words\nmorewords*'.match(regex)
'*<b>words</b>*'.match(regex)
However, this does not work:
'*<b>words\nmore words</b>*'.match(regex)
If anyone can help me understand why this is so, I would appreciate it.
Edit: I see my faulty logic, thanks to Ry. The expression regex = /\*(<[a-z]+>)?[\w\s]+(<\/[a-z]+>)?\*/g solves this case.
This should work for your purpose:
\*(<.+>)?([\w\s]+)(<.+>)?\*
The HTML tags are optional, hence (<.+>)?. The \n is matched by \s (whitespace).
I'm also going to link the canonical don't parse HTML with regex answer, because regex is not suitable for (or even capable of) parsing HTML beyond fairly restricted subsets. Have a read, it's informative (and funny)!
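The behaviour from the question can be reproduced with a short script. The original pattern picks one alternative for the whole content between the asterisks, and . does not match a newline, so the third input fits neither branch. The variable names below are only for illustration.

```javascript
// Pattern from the question and the pattern suggested in the answer above.
const originalPattern = /\*([\w\s]+|<.+>)\*/g;
const suggestedPattern = /\*(<.+>)?([\w\s]+)(<.+>)?\*/g;

const sampleInputs = [
  '*words\nmorewords*',
  '*<b>words</b>*',
  '*<b>words\nmore words</b>*',
];

// For each input, record whether each pattern finds a match.
const matchReport = sampleInputs.map(input => ({
  input,
  original: input.match(originalPattern) !== null,
  suggested: input.match(suggestedPattern) !== null,
}));

console.log(matchReport);
```

The original pattern matches the first two inputs but not the third; the suggested pattern matches all three.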
Recall the Chomsky hierarchy. Regular expressions can parse regular languages. HTML is not a regular language: arbitrarily nested tags need at least a context-free grammar, which is the next level up.
There are extensions to some regular expression engines that give them recursive capability. You can probably parse HTML with these, but there are better ways, such as using a proper HTML parser, for example DOMParser.

JavaScript function to escape Java regular expression string

Earlier questions on StackOverflow discuss escaping for JavaScript regular expressions, e.g.:
How to escape regular expression in javascript?
Escape string for use in Javascript regex
An implementation suggested is the following:
RegExp.quote = function(str) {
  return str.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
};
Given that regular expressions in the two languages are not identical, is anyone aware of a JavaScript method that properly escapes strings to be used for Java regular expressions?
There's no need for any escaping at all. Those questions are about what needs to be done when the regular expression is being constructed as a string in the source language. Since you're reading the string from an input field, there's no layer of interpretation to worry about.
Just send the string to the server, where it will be discovered to be a valid regex or not.
edit — though I can't think of any, the real thing to worry about might be any sort of "injection" attack that could be conducted through this avenue. Seems to me that if you're just passing a regex to Pattern.compile() there aren't any side-effect channels that could be exploited.
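If the goal were instead to have Java match the string literally, which is an assumption and not what the answer above describes, one option is Java's \Q...\E quoting. The sketch below is meant to produce the same kind of quoted string as java.util.regex.Pattern.quote, to the best of my understanding of that method; the function name quoteForJava is hypothetical.

```javascript
// Hypothetical helper: wraps a string in \Q...\E so that Java's regex engine
// treats it as a literal. An embedded \E would end the quote early, so each
// one is closed, emitted as an escaped backslash plus E, and the quote reopened.
function quoteForJava(str) {
  return '\\Q' + str.split('\\E').join('\\E\\\\E\\Q') + '\\E';
}
```

For example, quoteForJava('1+1=2') yields the ten characters \Q1+1=2\E.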

javascript Regular expression for validating arithmetic expression

I have an arithmetic expression
((20+30)-25)/5
I want to validate it using a regular expression. The expression can only contain integers, floating point numbers, operators and parentheses.
How can I write a regular expression to validate it? Please help, or suggest any other way to validate that string using JavaScript.
As I said in a comment, this is impossible using one JavaScript regular expression. However, you can do it using a loop: replace subexpressions with atoms, repeat until you get an atom. If you can't reduce any more, and whatever is left is not an atom, it does not validate. This is actually pretty much the same procedure you'd do to evaluate it (just skipping the abstract syntax tree). You can search for \(\d+\)|\d+[-+/*]\d+ and replace with 0:
Example:
((20+30)-25)/5
((0)-25)/5
(0-25)/5
(0)/5
0/5
0
Done
If you failed to match and didn't have just 0, it's a fail.
(To evaluate as opposed to validate, you'd just replace with the actual value rather than a dummy stand-in; everything else is the same.)
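The reduction loop described above can be sketched as follows. This is an adaptation under stated assumptions, not the answer's exact pattern: numbers may have a decimal part, and lookbehind/lookahead guards (which need an ES2018-capable engine) are added so that a reduced 0 never merges with a neighbouring digit or parenthesis, as it otherwise would for input like (1)(2). The names NUM, STEP, ATOM and isValidArithmetic are hypothetical, and whitespace and unary minus are rejected.

```javascript
// One number: digits with an optional decimal part.
const NUM = '\\d+(?:\\.\\d+)?';

// One reduction step: either "(atom)" or "atom op atom". The guards keep a
// match from starting or ending in the middle of a number or next to a paren.
const STEP = new RegExp(
  '(?<![\\d.)])\\(' + NUM + '\\)(?![\\d.(])' +
  '|(?<![\\d.])' + NUM + '[-+/*]' + NUM + '(?![\\d.])'
);
const ATOM = new RegExp('^' + NUM + '$');

function isValidArithmetic(expr) {
  let current = expr;
  let previous;
  do {
    previous = current;
    current = current.replace(STEP, '0');  // replace one subexpression per pass
  } while (current !== previous);          // stop when nothing reduces further
  return ATOM.test(current);               // valid only if a single atom is left
}
```

With this sketch, '((20+30)-25)/5' validates, while '((20+30)-25)5' and '2*(3+4' do not.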
JavaScript's "eval" function is the best validator.
Try to do this:
eval("((20+30)-25)5");
and you will get a sufficiently detailed error description.
You will only be able to do this with regular expressions if you impose a maximum depth to the parenthesis nesting. Otherwise, the set of arithmetic expressions forms a context free language but not a regular language.
If I had to use regex, the approach I would use is to write a regular grammar for your set of arithmetic expressions and then convert that to a regular expression.
Another approach is to write a recursive descent parser, which is a fairly simple project and works very nicely for arithmetic expressions.
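A minimal sketch of such a recursive descent parser, assuming the usual grammar (expression = term with + or -, term = factor with * or /, factor = number or parenthesized expression) and no unary minus. The function name parseArithmetic is hypothetical; it evaluates while parsing and throws a SyntaxError on invalid input, so it doubles as a validator.

```javascript
function parseArithmetic(input) {
  // Tokens: numbers, operators, parentheses; any other non-space character
  // becomes its own token so that the parser can reject it.
  const tokens = input.match(/\d+(?:\.\d+)?|[-+*/()]|\S/g) || [];
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  // expression := term (('+' | '-') term)*
  function expression() {
    let value = term();
    while (peek() === '+' || peek() === '-') {
      const op = next();
      const rhs = term();
      value = op === '+' ? value + rhs : value - rhs;
    }
    return value;
  }

  // term := factor (('*' | '/') factor)*
  function term() {
    let value = factor();
    while (peek() === '*' || peek() === '/') {
      const op = next();
      const rhs = factor();
      value = op === '*' ? value * rhs : value / rhs;
    }
    return value;
  }

  // factor := number | '(' expression ')'
  function factor() {
    const token = next();
    if (token === '(') {
      const value = expression();
      if (next() !== ')') throw new SyntaxError('Expected ")"');
      return value;
    }
    if (token !== undefined && /^\d/.test(token)) return Number(token);
    throw new SyntaxError('Unexpected token: ' + token);
  }

  const result = expression();
  if (pos < tokens.length) throw new SyntaxError('Unexpected token: ' + peek());
  return result;
}
```

For the expression from the question, parseArithmetic('((20+30)-25)/5') returns 5, and the variant with the missing operator, '((20+30)-25)5', throws.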

What Javascript Regular Expression features are unique to Javascript?

I hope this question isn't too broad, but I would expect the regular expression engines of JavaScript (and other languages) to share most of their functionality with what is considered standard or expected regular expression behavior.
I made a statement about C# having unique regular expression capabilities in this post: RegEx match open tags except XHTML self-contained tags
Specifically, here is the statement:
C# is unique when it comes to regular expressions in that it supports Balancing Group Definitions.
See Matching Balanced Constructs with .NET Regular Expressions
See .NET Regular Expressions: Regex and Balanced Matching
See Microsoft's docs on Balancing Group Definitions
I'm curious what unique regular expression capabilities JavaScript has, if any.
Although JavaScript's regular expression library supports features that are considered common (see comparison table), there is one particular expression that I haven't seen in others:
/[^]/
This matches any character at all, like /[\s\S]/ (or any other union of complementary character classes). It is handy because JavaScript, at the time of this answer, had no s modifier to make . match line breaks as other flavors do; the s (dotAll) flag was only added in ES2018.
Similar to that:
/[]/
This evaluates to an empty character set and can’t match anything at all.
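A quick check of both character classes, plus the s (dotAll) flag that ES2018 added for comparison. The object name classChecks is only for illustration, and the last line assumes an engine with ES2018 support.

```javascript
const classChecks = {
  // "." does not match a line break by default.
  dotMatchesNewline: /./.test('\n'),
  // [^] is "not nothing", so it matches any character, line breaks included.
  negatedEmptyMatchesNewline: /[^]/.test('\n'),
  // The common cross-flavor equivalent.
  complementUnionMatchesNewline: /[\s\S]/.test('\n'),
  // [] is an empty set: it can never match a character.
  emptyClassMatchesLetter: /[]/.test('a'),
  emptyClassMatchesEmptyString: /[]/.test(''),
  // ES2018 dotAll flag: "." matches line breaks too.
  dotAllMatchesNewline: /./s.test('\n'),
};

console.log(classChecks);
```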
JavaScript regexes are a subset of Perl regexes.
Meaning, they have no unique features, but they are missing quite a few.
Javascript regular expressions are modeled on Perl's regular expressions.
See: http://www.regular-expressions.info/javascript.html
JavaScript's regex engine is merely a subset of Perl's engine, meaning that it doesn't add anything new and is missing many of the features Perl contains.
You can read more about it here: http://www.regular-expressions.info/javascript.html.
