JavaScript function to escape Java regular expression string

Earlier questions on StackOverflow discuss escaping for JavaScript regular expressions, e.g.:
How to escape regular expression in javascript?
Escape string for use in Javascript regex
An implementation suggested is the following:
RegExp.quote = function(str) {
    return str.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
};
Given that regular expressions in the two languages are not identical, is anyone aware of a JavaScript method that properly escapes strings to be used for Java regular expressions?

There's no need for any escaping at all. Those questions are about what needs to be done when the regular expression is being constructed as a string in the source language. Since you're reading the string from an input field, there's no layer of interpretation to worry about.
Just send the string to the server, where it will be discovered to be a valid regex or not.
Edit: though I can't think of any, the real thing to worry about might be some sort of "injection" attack conducted through this avenue. It seems to me that if you're just passing a regex to Pattern.compile() there aren't any side-effect channels that could be exploited, although a pathological pattern could still burn a lot of CPU time through catastrophic backtracking when it is matched.
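If the goal were instead to have Java treat the user's text as a literal rather than as a pattern, one option is to mirror what Java's Pattern.quote() does: wrap the text in \Q...\E and split around any \E already inside it. A minimal sketch, where quoteForJavaRegex is a hypothetical helper name and not a built-in:

```javascript
// Hypothetical helper: mimics java.util.regex.Pattern.quote() on the client.
// Java treats everything between \Q and \E as literal text.
function quoteForJavaRegex(str) {
  // An embedded \E would end the quoted section early, so close the quote,
  // emit an escaped backslash followed by E, and reopen the quote.
  return '\\Q' + str.split('\\E').join('\\E\\\\E\\Q') + '\\E';
}

console.log(quoteForJavaRegex('1+1=2?')); // \Q1+1=2?\E
console.log(quoteForJavaRegex('a\\Eb'));  // \Qa\E\\E\Qb\E
```

Calling Pattern.quote() on the server is usually simpler, since the server has to treat the input as untrusted anyway.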

Related

Javascript string literal syntax rules

Could somebody please explain why
const btn1 = document.querySelector('input[id="btn"]')
requires me to use ('input[id="btn"]') and not ("input[id="btn"]") or ('input[id='btn']').
Because the JavaScript engine needs to be able to unambiguously determine where a string literal begins and ends.
If an unescaped character which is the same as the string delimiter was permitted inside the string, how would the interpreter determine whether the character was there to terminate the string, or if it was to be interpreted as a literal ' (or ") to be part of the string?
Rigorous syntactical rules are required for the unambiguous evaluation of JavaScript source text to be possible. That said, string escaping is pretty trivial and common to almost all programming languages. Learn it once, in any language, and you'll probably be well placed to understand how it works in many others. In JS, it's really not that hard compared to more complicated constructs (like async/await).
You can also drop the quotes in the attribute selector to sidestep the quoting problem entirely:
const btn1 = document.querySelector('input[id=btn]')
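The delimiter rule described above leaves several equivalent ways to write the same selector string. The sketch below only builds the strings (the variable names are illustrative), so it runs without a DOM:

```javascript
// All four literals are valid: the delimiter never appears unescaped inside.
const single   = 'input[id="btn"]';    // " is an ordinary character inside '...'
const escaped  = "input[id=\"btn\"]";  // \" keeps the string open
const template = `input[id="btn"]`;    // backticks free up both quote kinds
const swapped  = "input[id='btn']";    // ' is an ordinary character inside "..."

console.log(single === escaped && escaped === template); // true
console.log(swapped);                                    // input[id='btn']
```

By contrast, "input[id="btn"]" is read as the string "input[id=" followed by the bare identifier btn, which is a syntax error.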

Confusion regarding RegExp matches, HTML tags, and newlines

I am attempting to create a Markdown-to-HTML parser. I am trying to use regular expressions to match an input string that may or may not contain HTML tags and whitespace/newlines. I have encountered an interesting case that I do not understand at all.
My regular expression is regex = /\*([\w\s]+|<.+>)\*/g.
The following works:
'*words\nmorewords*'.match(regex)
'*<b>words</b>*'.match(regex)
However, this does not work:
'*<b>words\nmore words</b>*'.match(regex)
If anyone can help me understand why this is so, I would appreciate it.
Edit: I see my faulty logic, thanks to Ry. The expression regex = /\*(<[a-z]+>)?[\w\s]+(<\/[a-z]+>)?\*/g solves this case.
This should work for your purpose:
\*(<.+>)?([\w\s]+)(<.+>)?\*
The HTML tags are optional, which is what (<.+>)? expresses. The \n is matched by \s (whitespace).
I'm also going to link the canonical don't parse HTML with regex answer, because regex is not suitable for (or even capable of) parsing HTML beyond fairly restricted subsets. Have a read, it's informative (and funny)!
Recall the Chomsky hierarchy. Regular expressions can parse regular languages. HTML is not a regular language: arbitrarily nested tags need at least the next level up, a context-free grammar.
Some regular expression engines have extensions that give them recursive capability. You can probably parse HTML with these, but there are better ways, such as using a proper HTML parser, for example DOMParser.
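To see why the third input in the question fails, it helps to run all three against the original pattern and then against the pattern from the answer. This is a small sketch using only the strings and expressions quoted above; the variable names are illustrative:

```javascript
const regex = /\*([\w\s]+|<.+>)\*/g;

// Neither branch of the alternation can consume "<b>words\nmore words</b>":
// [\w\s]+ stops at "<", and "." inside <.+> stops at the newline.
const plain   = '*words\nmorewords*'.match(regex);         // one match
const tagged  = '*<b>words</b>*'.match(regex);             // one match
const failing = '*<b>words\nmore words</b>*'.match(regex); // null

// The answer's pattern splits the tags and the text into separate pieces,
// so the newline is consumed by \s rather than by ".".
const fixed = /\*(<.+>)?([\w\s]+)(<.+>)?\*/g;
const repaired = '*<b>words\nmore words</b>*'.match(fixed);

console.log(plain, tagged, failing, repaired);
```

The same splitting idea is what makes the pattern in the question's edit work.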

Is it possible to parse regex strings with a regex

Just out of curiosity, is it possible to parse a string that is totally made out of random but valid regular expressions with a single regular expression?
Given the regex string:
<[^>]*>\xA9
parses to:
<[^>]*>
\xA9
where the first one matches an HTML tag and the second one matches a copyright symbol.
Edit:
I found a similar question on SO claiming that it may be possible. Here, I'm referring only to regex in JavaScript (ECMA-262).
No, it is not possible: regular expression language allows parenthesized expressions representing capturing and non-capturing groups, lookarounds, etc., where parentheses must be balanced. It is not possible even in theory to write a regular expression that verifies if parentheses are balanced in a given string. Without an ability to do that you wouldn't know where one regexp ends and the other one starts.
In general, regex grammar is relatively complex. To get an idea of just how complex it is, take a look at the parser in the source of Java's Pattern class.
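The balanced-parentheses point can be made concrete: checking balance needs a counter that can grow without bound, which is exactly what a true regular expression lacks. A minimal sketch of such a check for JavaScript regex source text, where parensBalanced is a hypothetical helper that only understands escapes and character classes, not the full grammar:

```javascript
// Hypothetical helper: reports whether the unescaped parentheses in a
// JavaScript regex source string are balanced.
function parensBalanced(source) {
  let depth = 0;        // the counter a true regular expression cannot keep
  let inClass = false;  // inside [...] parentheses are literal characters
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\') { i++; continue; }  // skip the escaped character
    if (inClass) {
      if (ch === ']') inClass = false;
      continue;
    }
    if (ch === '[') inClass = true;
    else if (ch === '(') depth++;
    else if (ch === ')' && --depth < 0) return false;  // closed too early
  }
  return depth === 0;
}

console.log(parensBalanced('(a(b))'));       // true
console.log(parensBalanced('(a'));           // false
console.log(parensBalanced('<[^>]*>\\xA9')); // true
```

Even with this check, splitting a concatenation such as the one in the question into its intended parts is ambiguous, since most valid expressions can be cut at many points.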

What Javascript Regular Expression features are unique to Javascript?

I hope this question isn't too broad, but then again I would expect the JavaScript (and other languages') regular expression engines to share most of their functionality with what is considered standard / expected regular expression behavior.
I made a statement about C# having unique regular expression capabilities in this post :: RegEx match open tags except XHTML self-contained tags
Specifically, here is the statement:
C# is unique when it comes to regular expressions in that it supports Balancing Group Definitions.
See Matching Balanced Constructs with .NET Regular Expressions
See .NET Regular Expressions: Regex and Balanced Matching
See Microsoft's docs on Balancing Group Definitions
I'm curious what unique regular expression capabilities JavaScript has, if any.
Although JavaScript's regular expression library supports features that are considered common (see comparison table), there is one particular expression that I haven't seen in others:
/[^]/
This matches any character at all, like /[\s\S]/ (or any other union of complementary character classes). It is handy because JavaScript historically had no s modifier, as other flavors have, to make . match line breaks too (the s, or dotAll, flag only arrived in ES2018).
Similar to that:
/[]/
This evaluates to an empty character set and can’t match anything at all.
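Both classes are easy to try out. The sketch below compares them with the plain dot on a string containing a line break; the s (dotAll) flag, available since ES2018, is included for comparison, and the variable names are illustrative:

```javascript
const text = 'a\nb';

const dotPlain    = /a.b/.test(text);        // false: "." skips line breaks
const anyChar     = /a[^]b/.test(text);      // true: [^] matches any character
const complements = /a[\s\S]b/.test(text);   // true: same idea, portable
const dotAll      = /a.b/s.test(text);       // true: ES2018 dotAll flag
const emptyClass  = /[]/.test('anything');   // false: [] never matches

console.log(dotPlain, anyChar, complements, dotAll, emptyClass);
```

In most other flavors, [^] and [] are syntax errors or are read as the start of a class whose first member is a literal ].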
JavaScript regexes are a subset of Perl regexes.
Meaning, it has no unique features, but it's missing quite a few.
JavaScript regular expressions are modeled on Perl's regular expressions.
See: http://www.regular-expressions.info/javascript.html
JavaScript's regex engine is merely a subset of Perl's engine, meaning that it doesn't add anything new and is missing many of the features Perl contains.
You can read more about it here: http://www.regular-expressions.info/javascript.html.

How to find a URL within full text using regular expression

What is wrong with the following regular expression, which works in many online JavaScript regular expression testers (and RegEx Buddy), yet doesn't work in my application?
It is intended to replace URLs with hyperlinks. The JavaScript is in a separate .js file.
var fixed = text.replace(/\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]/ig, "<a href='$&' target='blank'>$&</a>");
Chrome, for example, complains that & is not valid (as does IE8). Is there some way to escape the ampersand (or whatever else is wrong), without resorting to the RegEx object?
Those testers let you input the regex in its raw form, but when you use it in source code you have to write it in the form of a string literal or (as is the case here) a regex literal. JavaScript uses forward-slashes for its regex-literal delimiters, so you have to escape any slashes in the regex itself to avoid confusing the interpreter.
Once you escape the slashes it should stop complaining about the ampersand. That was most likely caused by the malformed regex literal.
I recognize that regex, having used it myself the other day; you got it from RegexBuddy's Library, didn't you? If you had used RB's "Use" feature to create a JS-compatible regex, it would have escaped the slashes for you.
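As a sketch of the fix described above, here is a pattern along the lines of the one in the question with every forward slash escaped so the regex literal parses. The sample text and the target value _blank are assumptions made for the demonstration:

```javascript
// Slashes inside the pattern are written as \/ so that the first unescaped
// "/" after the opening one is the real end of the literal.
const urlPattern =
  /\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/ig;

// Assumed sample input for the demonstration.
const text = 'See http://example.com/a?b=1&c=2, then ftp://files.example.com/pub.';

// $& in the replacement string stands for the whole matched URL.
const fixed = text.replace(urlPattern, "<a href='$&' target='_blank'>$&</a>");

console.log(fixed);
```

The trailing comma and full stop stay outside the links because the final character class excludes punctuation that usually ends a sentence.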
This works for me in Chrome
var fixed = text.replace(/(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/igm, "<a href='$&' target='blank'>$&</a>");
