What are the differences between javascript and PCRE regular expressions? [duplicate]

I'm just a noob when it comes to regexp. I know Perl is amazing with regexp and I don't know much Perl. Recently started learning JavaScript and came across regex for validating user inputs... haven't used them much.
How does JavaScript regexp compare with Perl regexp? Similarities and differences?
Can all regexp(s) written in JS be used in Perl and vice-versa?
Similar syntax?

From ECMAScript 2018 onwards, many of JavaScript's regex deficiencies have been fixed.
It now supports lookbehind assertions, even unbounded ones.
Unicode property escapes have been added.
There finally is a DOTALL (/s) flag.
What is still missing:
JavaScript doesn't have a way to prevent backtracking by making matches final (using possessive quantifiers ++/*+/?+ or atomic groups (?>...)).
Recursive/balanced subgroup matching is not supported.
One other (cosmetic) thing is that JavaScript doesn't support verbose (free-spacing, /x) regexes, which can make complex patterns harder to read.
Other than that, the basic regex syntax is very similar in both flavors.
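
The three ES2018 additions listed above can be tried out with a short sketch like the following. The sample strings and variable names are made up for illustration, and it assumes an engine that implements ECMAScript 2018 (current browsers and Node.js do).

```javascript
// Lookbehind: match digits only when they are preceded by a dollar sign.
const price = "Cost: $42".match(/(?<=\$)\d+/);

// Unicode property escape (needs the u flag): any letter, not just ASCII.
const hasLetter = /\p{L}/u.test("\u00DF"); // "ß"

// dotAll flag: with /s the dot also matches line terminators.
const withoutS = /a.b/.test("a\nb");
const withS = /a.b/s.test("a\nb");

console.log(price[0], hasLetter, withoutS, withS);
```

Possessive quantifiers, atomic groups and recursion have no such flag or syntax, so patterns relying on them still need to be rewritten by hand.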

This comparison will answer all your queries.

Another difference: before ECMAScript 2018, JavaScript had no s modifier, so the dot "." never matches a newline character there. As a replacement for ".", the character class [\s\S] can be used in JavaScript, which works like /./s in Perl.
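
A minimal sketch of the [\s\S] replacement described above; the sample text is invented for the example.

```javascript
const text = "first line\nsecond line";

// The plain dot stops at the line break, so this cannot reach "second".
const dotOnly = /first.*second/.test(text);

// [\s\S] matches whitespace or non-whitespace, i.e. any character at all.
const anyChar = /first[\s\S]*second/.test(text);

console.log(dotOnly, anyChar);
```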

I just ran into an instance where \d (decimal digit) was not recognized in some versions of JavaScript; I had to use [0-9] instead.

Related

JS : Test if string contains any unicode capital

Do you know if there is a JS regular expression that would catch any possible Unicode capital letter? Of course [A-Z] works, but there are thousands of alternate capitals.
Thanks in advance for the hints.

The only Unicode support in JavaScript regex (at least ecmascript 5 and below) is matching specific code points of the form \uFFFF. You can use those in ranges in character classes. (see this question)
This, of course, makes your task difficult. But I did find an online utility that says it:
Compiles character ranges suitable for use in JavaScript, using the cset library.
Selecting "uppercase letter", then, produces this regex:
[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİĲĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸ-ŹŻŽƁ-ƂƄƆ-ƇƉ-ƋƎ-ƑƓ-ƔƖ-ƘƜ-ƝƟ-ƠƢƤƦ-ƧƩƬƮ-ƯƱ-ƳƵƷ-ƸƼǄǇǊǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮǱǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺ-ȻȽ-ȾɁɃ-ɆɈɊɌɎͰͲͶΆΈ-ΊΌΎ-ΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹ-ϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀ-ӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԱ-ՖႠ-ჅḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ΆῈ-ΉῘ-ΊῨ-ῬῸ-Ώℂℇℋ-ℍℐ-ℒℕℙ-ℝℤΩℨK-ℭℰ-ℳℾ-ℿⅅↃⰀ-ⰮⱠⱢ-ⱤⱧⱩⱫⱭ-ⱯⱲⱵⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖꙘꙚꙜꙞꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽ-ꝾꞀꞂꞄꞆꞋＡ-Ｚ]|\ud801[\udc00-\udc27]|\ud835[\udc00-\udc19\udc34-\udc4d\udc68-\udc81\udc9c\udc9e-\udc9f\udca2\udca5-\udca6\udca9-\udcac\udcae-\udcb5\udcd0-\udce9\udd04-\udd05\udd07-\udd0a\udd0d-\udd14\udd16-\udd1c\udd38-\udd39\udd3b-\udd3e\udd40-\udd44\udd46\udd4a-\udd50\udd6c-\udd85\udda0-\uddb9\uddd4-\udded\ude08-\ude21\ude3c-\ude55\ude70-\ude89\udea8-\udec0\udee2-\udefa\udf1c-\udf34\udf56-\udf6e\udf90-\udfa8\udfca]
I've also read (but not used personally) that the XRegExp javascript library is good and would allow you to use \p{Lu}.

Here's a link containing all Unicode capital letters. This is based on the GREP engine of Adobe InDesign CC2015, searching for the posix expression [[:upper:]]:
http://www.id-extras.com/uploads/AllUnicodeCapitals.html

With any JavaScript environment supporting the ECMAScript2018+ standard, you can use
/\p{Lu}/u
to test if a string contains any Unicode uppercase letter.
See a JavaScript demo:
console.log(/\p{Lu}/u.test('... Yes!'));
console.log(/\p{Lu}/u.test('Łąka'));
console.log(/\p{Lu}/u.test('и Витя с ними'));
console.log(/\p{Lu}/u.test('nonono'));

RegEx to test if a string contains more than X Unicode words

I saw many solutions that match Latin characters words like this one: /^\W*(\w+\b\W*){80,}$/
I'm looking for the equivalent expression that will support any language with Unicode characters.
The RegEx need to be JavaScript compatible.

EDIT: JavaScript before ECMAScript 2018 sadly doesn't support this solution... You might want to look into XRegExp.
I'll leave this here in case it's of use to anyone working in a more Perl-compatible language, but this doesn't answer your question, sorry.
For unicode support you can use the \p{...} pattern.
Your pattern would become
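/^\P{L}*(\p{L}+(\P{L}+|$)){80,}$/
Here \P{L} stands for anything but a letter and \p{L} for any letter (but not a digit or a _, so it's a little bit different from \w). Each word must be followed by at least one non-letter or by the end of the string; without that, a single run of letters could be split up and counted as several words.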
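
In engines that implement the ECMAScript 2018 property escapes mentioned elsewhere on this page, the same idea can be sketched natively with the u flag. The helper names hasAtLeastWords and countWords are hypothetical, and "word" here means a run of letters only. The separator after each word is made mandatory (non-letters or end of string) so that one run of letters cannot be split into several matches.

```javascript
// True when text contains at least `min` runs of Unicode letters.
function hasAtLeastWords(text, min) {
  const pattern = new RegExp(
    `^\\P{L}*(?:\\p{L}+(?:\\P{L}+|$)){${min},}$`,
    "u"
  );
  return pattern.test(text);
}

// Simpler alternative: count the letter runs directly.
function countWords(text) {
  return (text.match(/\p{L}+/gu) || []).length;
}

console.log(hasAtLeastWords("Das ist sch\u00F6n", 3));
console.log(countWords("\u0438 \u0412\u0438\u0442\u044F"));
```

Counting with match() is usually easier to read than one large anchored pattern, and it avoids building a regex with a large repetition count.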

What Javascript Regular Expression features are unique to Javascript?

I hope this question isn't too broad, but then again I would expect the JavaScript (and other languages') regular expression engines to share most of their functionality with what is considered standard/expected regular expression behavior.
I made a statement about C# having unique regular expression capabilities in this post: RegEx match open tags except XHTML self-contained tags
Specifically, here is the statement:
C# is unique when it comes to regular expressions in that it supports Balancing Group Definitions.
See Matching Balanced Constructs with .NET Regular Expressions
See .NET Regular Expressions: Regex and Balanced Matching
See Microsoft's docs on Balancing Group Definitions
I'm curious what unique regular expression capabilities javascript has if any.

Although JavaScript’s regular expression library supports features that are considered common (see comparison table), there is one particular expression that I haven’t seen in others:
/[^]/
This matches any arbitrary character, similar to /[\s\S]/ (or any other union of complementary character classes), and can be handy because JavaScript (before ECMAScript 2018) does not have an s modifier, as other flavors do, to make . match line breaks too.
Similar to that:
/[]/
This evaluates to an empty character set and can’t match anything at all.
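
Both classes can be checked with a small sketch; the test strings are arbitrary.

```javascript
// [^] is "not nothing": it matches every character, line breaks included.
const anything = /^[^]+$/.test("line one\nline two");

// [] is the empty set: there is no character it could ever match.
const nothingInText = /[]/.test("abc");
const nothingInEmpty = /[]/.test("");

console.log(anything, nothingInText, nothingInEmpty);
```

Note that other flavors such as PCRE parse [^] and [] differently (the ] is taken as a literal member of the class), so these two patterns do not port.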

javascript regexes are a subset of perl regexes.
Meaning, it has no unique features, but it's missing quite a few.

Javascript regular expressions are modeled on Perl's regular expressions.
See: http://www.regular-expressions.info/javascript.html

JavaScript's regex engine is merely a subset of Perl's engine, meaning that it doesn't add anything new and is missing many of the features Perl contains.
You can read more about it here: http://www.regular-expressions.info/javascript.html.

Why is this regex not working for German words?

I am trying to break the following sentence in words and wrap them in span.
<p class="german_p big">Das ist ein schönes Armband</p>
I followed this:
How to get a word under cursor using JavaScript?
$('p').each(function() {
    var $this = $(this);
    $this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
});
The only problem I am facing is that, after wrapping the words in spans, the resulting HTML looks like this:
<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>
So schönes is broken into three parts: sch, ö and nes. Why is this happening? What would be the correct regex for this?

Unicode in Javascript Regexen
Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.
The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970s‐ish; it makes absolutely no sense in the 21ˢᵗ century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.
It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the regex features supported in various languages.
This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.
However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.
Unicode Support in Other Languages
Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J‐thingies do not.
In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].
Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.
I don't believe that this workaround is available in Javascript, though. You can also use Unicode properties like those in Perl and PCRE, and in Ruby 1.9, but not in Python.
The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.
The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.
SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.
Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.
Sorry 'bout that. ☹
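
This answer predates ECMAScript 2018. In engines that implement the property escapes mentioned elsewhere on this page, the Java-style class [\pL\p{Nd}_] can be written almost unchanged, as in this sketch. The u flag is required, the short form \pL is not accepted, and the constant name UNICODE_WORD is made up for the example.

```javascript
// A Unicode-aware stand-in for \w: letters, decimal digits, underscore.
const UNICODE_WORD = /^[\p{L}\p{Nd}_]+$/u;

const asciiOnly = /^\w+$/.test("sch\u00F6nes");          // plain \w
const unicodeAware = UNICODE_WORD.test("sch\u00F6nes_1"); // emulated \w
const arabicDigit = UNICODE_WORD.test("\u0663");          // ARABIC-INDIC DIGIT THREE

console.log(asciiOnly, unicodeAware, arabicDigit);
```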

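You can also use
/([äöüÄÖÜß\w]+)/g
instead of
/\b(\w+)\b/g
in order to handle the umlauts. The \b anchors are left out because \b is ASCII-only: with them, a word that starts or ends with an umlaut or ß (such as Maß) would be cut short.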

\w only matches A-Z, a-z, 0-9, and _ (underscore).
You could use something like \S+ to match all non-space characters, including non-ASCII characters like ö. This might or might not work depending on how the rest of your string is formatted.
Reference: http://www.javascriptkit.com/javatutors/redev2.shtml

To include all the Latin 1 Supplement characters like äöüßÒÿ you can use:
[\w\u00C0-\u00ff]
However, there are even more funny characters in the Latin Extended-A and Latin Extended-B Unicode blocks, like ČŇů. To include those you can use:
[\w\u00C0-\u024f]
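
As a rough sketch, the extended class can be dropped into the word-wrapping replacement from the question. The function name wrapWords is hypothetical, and the jQuery/DOM part is left out so that only the string handling is shown. No \b is used, since the greedy + already consumes each complete run of word characters.

```javascript
// Wrap every run of ASCII word characters or Latin letters
// from U+00C0 to U+024F in a <span>.
function wrapWords(text) {
  return text.replace(/[\w\u00C0-\u024F]+/g, "<span>$&</span>");
}

console.log(wrapWords("Das ist ein sch\u00F6nes Armband"));
```

One caveat: the range also covers the multiplication sign (U+00D7) and the division sign (U+00F7), which are not letters.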

\w and \b are not Unicode-aware in JavaScript; they only match ASCII word/boundary characters. If your use cases all allow splitting on whitespace, you can use \s/\S, which are Unicode-aware.
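
A small sketch of the whitespace-based approach, with a hypothetical helper name wrapBySpaces. It treats every run of non-whitespace as a word, so punctuation stays attached to the neighbouring word.

```javascript
// Wrap each run of non-whitespace characters in a <span>.
function wrapBySpaces(text) {
  return text.replace(/\S+/g, "<span>$&</span>");
}

console.log(wrapBySpaces("sch\u00F6nes Armband."));
```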

As others note, the \w shortcut is not very useful for non-Latin character sets. If you need to match other text ranges you should use hex* notation (Ref1) (Ref2) for the appropriate range.
* could be hex or octal or unicode, you'll often see these collectively referred as hex notation.

The \b's will also not work correctly. It is possible to use the XRegExp library's \p{L} token for Unicode support; however, there is still no \b support, so you won't be able to find the word boundaries. It would be nice to provide \b support by doing lookbehinds/lookaheads with \P{L}, using the following implementation:
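
Where the engine implements ECMAScript 2018 (native lookbehind and \p{...} with the u flag, both mentioned elsewhere on this page), the boundary idea can be sketched without any library. This is not the XRegExp approach from the link below, only a native alternative; the names UNICODE_WORD_RUN and wrapUnicodeWords are invented for the example.

```javascript
// A "word" is a run of letters with no letter directly before or after it,
// which plays the role of \b ... \b for Unicode letters.
const UNICODE_WORD_RUN = /(?<!\p{L})\p{L}+(?!\p{L})/gu;

function wrapUnicodeWords(text) {
  return text.replace(UNICODE_WORD_RUN, "<span>$&</span>");
}

console.log(wrapUnicodeWords("\u00C4pfel und Ma\u00DF"));
```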
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript

While javascript doesn't support Unicode natively, you could use this library to work around it: http://xregexp.com/
