Regex matching JS source that's not in a string or regex literal - javascript

Do there exist comprehensive regular expressions that, when applied to JavaScript source code, will match all valid string literals (such as "say \"Hello\"") and regex literals (such as /and\/or/)? The expressions would have to cover all edge cases, including line breaks and escape sequences.
Alternatively, does anyone know of regexes for matching patterns outside of string and regex literals?
My goal is to implement a simple JavaScript syntax extension that allows macros in delimeters (e.g. {{#foo.bar}} or ##foo.bar#) to be expanded by a preprocessor. However, I'd like the macros to be processed only outside of literals.
For now, I'm trying to accomplish this using just string replacement, without having to augment an existing JavaScript lexer/parser.
This JavaScript preprocessor will itself be implemented in JavaScript.

This is the regex that I've been using to match quoted strings which is pretty good since it should work with almost all engines since it does not require backtracking or backreferences or any of that voodoo. This will match all text INSIDE literals.
"(\\.|[^"])*"
Depending on the engine, it might support non capturing groups. In that case you can use
"(?:\\.|[^"])*"
and it should be faster.

I think this is too much for regexes.
Consider var foo = "//" // /"(?:\\.|[^"])*"/. Where do the strings, comments and regex literals start and end? You would need to write a complete JavaScript parser to cover all edge cases. Of course, the parser will be using regexes...

I would probably go about doing something like the following. It will need to be improved for certain possible conditions, though.
var str = '"aaa \"sss \\t bbb" sss #3 ss# ((t sdsds)) ff ';
str += '/gg sdfd \/dsds/ {aaa bbb} {{ss}} {#sdsd#}';
var repeating = ['"','\\\'','/','\\~','\\#'];
// "example" 'example' /example/ ~example~ #example#
var enclosing = [];
enclosing.push(['\\{','\\}']);
enclosing.push(['\\{\\{','\\}\\}']);
enclosing.push(['\\[','\\]']);
enclosing.push(['\\(\\(','\\)\\)']);
// {example} {{example}} [example] ((example))
for (var forEnclosing='',i = 0 ; i < enclosing.length; i++) {
var e = enclosing[i];
var r = e[0]+'(\\\\['+e[0]+e[1]+']|[^'+e[0]+e[1]+'])*'+e[1];
forEnclosing += r + (i < enclosing.length-1 ? '|' : '');
}
for (var forRepeating='',i = 0; i < repeating.length; i++) {
var e = repeating[i];
var r = e+'(\\'+e+'|[^'+e+'])*'+e;
forRepeating += r + (i < repeating.length-1 ? '|' : '');
}
var rx = new RegExp('('+forEnclosing+'|'+forRepeating+')','g');
var m = str.match(rx);
try { for (var i = 0; i < m.length; i++) console.log(m[i]) }
catch(e) {}
Outputs:
"aaa "sss \t bbb"
#3 ss#
((t sdsds))
/gg sdfd /dsds/
{aaa bbb}
{{ss}}
{#sdsd#}

The closest you can get with a regex is to have one regex that matches EITHER a string literal (single- or double-quoted) OR a regex OR a comment (OR whatever else might contain bogus matches) OR one of your macro thingies:
"[^"\\]*(?:\\.[^"\\]*)*"
|
'[^'\\]*(?:\\.[^'\\]*)*'
|
/[^/\\]*(?:\\.[^/\\]*)*/[gim]*
|
/\*[^*]*(?:\*(?!/)[^*]*)*\*/
|
##(\w+\.\w+)#
If group #1 contains anything after the match, it must be what you're looking for. Otherwise, ignore this match and go on to the next one.

Related

Trying to design a WORD SEARCH puzzle with Unicode Letters (TAMIL) Using HTML and JAVASCRIPT [duplicate]

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).
JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16) but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane).
To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively.
I'm looking for how to split a js string by codepoint, whether the codepoints require one or two JavaScript "characters" (code units).
Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.
For the purposes of this question I do not require splitting by grapheme cluster.
#bobince's answer has (luckily) become a bit dated; you can now simply use
var chars = Array.from( text )
to obtain a list of single-codepoint strings which does respect astral / 32bit / surrogate Unicode characters.
Along the lines of #John Frazer's answer, one can use this even succincter form of string iteration:
const chars = [...text]
e.g., with:
const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]
In ECMAScript 6 you'll be able to use a string as an iterator to get code points, or you could search a string for /./ug, or you could call getCodePointAt(i) repeatedly.
Unfortunately for..of syntax and regexp flags can't be polyfilled and calling a polyfilled getCodePoint() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.
So doing it the manual way:
String.prototype.toCodePoints= function() {
chars = [];
for (var i= 0; i<this.length; i++) {
var c1= this.charCodeAt(i);
if (c1>=0xD800 && c1<0xDC00 && i+1<this.length) {
var c2= this.charCodeAt(i+1);
if (c2>=0xDC00 && c2<0xE000) {
chars.push(0x10000 + ((c1-0xD800)<<10) + (c2-0xDC00));
i++;
continue;
}
}
chars.push(c1);
}
return chars;
}
For the inverse to this see https://stackoverflow.com/a/3759300/18936
Another method using codePointAt:
String.prototype.toCodePoints = function () {
var arCP = [];
for (var i = 0; i < this.length; i += 1) {
var cP = this.codePointAt(i);
arCP.push(cP);
if (cP >= 0x10000) {
i += 1;
}
}
return arCP;
}

Regex split comma except escaped [duplicate]

I have this string:
a\,bcde,fgh,ijk\,lmno,pqrst\,uv
I need a JavaScript function that will split the string by every , but only those that don't have a \ before them
How can this be done?
Here's the shortest thing I could come up with:
'a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv'.replace(/([^\\]),/g, '$1\u000B').split('\u000B')
The idea behind is to find every place where comma isn't prefixed with a backslash, replace those with string that is uncommon to come up in your strings and then split by that uncommon string.
Note that backslashes before commas have to be escaped using another backslash. Otherwise, javascript treats form \, as escaped comma and produce simply a comma out of it! In other words if you won't escape the backslash, javascript sees this: a\,bcde,fgh,ijk\,lmno,pqrst\,uv as this a,bcde,fgh,ijk,lmno,pqrst,uv.
Since regular expressions in JavaScript does not support lookbehinds, I'm not going to cook up a giant hack to mimic this behavior. Instead, you can just split() on all commas (,) and then glue back the pieces that shouldn't have been split in the first place.
Quick 'n' dirty demo:
var str = 'a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv'.split(','), // Split on all commas
out = []; // Output
for (var i = 0, j = str.length - 1; i < j; i++) { // Iterate all but last (last can never be glued to non-existing next)
var curr = str[i]; // This piece
if (curr.charAt(curr.length - 1) == '\\') { // If ends with \ ...
curr += ',' + str[++i]; // ... glue with next and skip next (increment i)
}
out.push(curr); // Add to output
}
Another ugly hack around the lack of look-behinds:
function rev(s) {
return s.split('').reverse().join('');
}
var s = 'a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv';
// Enter bizarro world...
var r = rev(s);
// Split with a look-ahead
var rparts = r.split(/,(?!\\)/);
// And put it back together with double reversing.
var sparts = [ ];
while(rparts.length)
sparts.push(rev(rparts.pop()));
for(var i = 0; i < sparts.length; ++i)
$('#out').append('<pre>' + sparts[i] + '</pre>');
Demo: http://jsfiddle.net/ambiguous/QbBfw/1/
I don't think I'd do this in real life but it works even if it does make me feel dirty. Consider this a curiosity rather than something you should really use.
In case if need remove backslashes also:
var test='a\\.b.c';
var result = test.replace(/\\?\./g, function (t) { return t == '.' ? '\u000B' : '.'; }).split('\u000B');
//result: ["a.b", "c"]
In 2022 most of browsers support lookbehinds:
https://caniuse.com/js-regexp-lookbehind
Safari should be your only concern.
With a lookbehind you can split your string this way:
"a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv".split(/(?<!\\),/)
// => ['a\\,bcde', 'fgh', 'ijk\\,lmno', 'pqrst\\,uv']
You can use regex to do the split.
Here is the link to regex in javascript http://www.w3schools.com/jsref/jsref_obj_regexp.asp
Here is the link to other post where the author have used regex for split Javascript won't split using regex
From the first link if you note you can create a regular expression using
?!n Matches any string that is not followed by a specific string n
[,]!\\

Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).
JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16) but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane).
To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively.
I'm looking for how to split a js string by codepoint, whether the codepoints require one or two JavaScript "characters" (code units).
Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.
For the purposes of this question I do not require splitting by grapheme cluster.
#bobince's answer has (luckily) become a bit dated; you can now simply use
var chars = Array.from( text )
to obtain a list of single-codepoint strings which does respect astral / 32bit / surrogate Unicode characters.
Along the lines of #John Frazer's answer, one can use this even succincter form of string iteration:
const chars = [...text]
e.g., with:
const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]
In ECMAScript 6 you'll be able to use a string as an iterator to get code points, or you could search a string for /./ug, or you could call getCodePointAt(i) repeatedly.
Unfortunately for..of syntax and regexp flags can't be polyfilled and calling a polyfilled getCodePoint() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.
So doing it the manual way:
String.prototype.toCodePoints= function() {
chars = [];
for (var i= 0; i<this.length; i++) {
var c1= this.charCodeAt(i);
if (c1>=0xD800 && c1<0xDC00 && i+1<this.length) {
var c2= this.charCodeAt(i+1);
if (c2>=0xDC00 && c2<0xE000) {
chars.push(0x10000 + ((c1-0xD800)<<10) + (c2-0xDC00));
i++;
continue;
}
}
chars.push(c1);
}
return chars;
}
For the inverse to this see https://stackoverflow.com/a/3759300/18936
Another method using codePointAt:
String.prototype.toCodePoints = function () {
var arCP = [];
for (var i = 0; i < this.length; i += 1) {
var cP = this.codePointAt(i);
arCP.push(cP);
if (cP >= 0x10000) {
i += 1;
}
}
return arCP;
}

Javascript Remove strings in beginning and end

base on the following string
...here..
..there...
.their.here.
How can i remove the . on the beginning and end of string like the trim that removes all spaces, using javascript
the output should be
here
there
their.here
These are the reasons why the RegEx for this task is /(^\.+|\.+$)/mg:
Inside /()/ is where you write the pattern of the substring you want to find in the string:
/(ol)/ This will find the substring ol in the string.
var x = "colt".replace(/(ol)/, 'a'); will give you x == "cat";
The ^\.+|\.+$ in /()/ is separated into 2 parts by the symbol | [means or]
^\.+ and \.+$
^\.+ means to find as many . as possible at the start.
^ means at the start; \ is to escape the character; adding + behind a character means to match any string containing one or more that character
\.+$ means to find as many . as possible at the end.
$ means at the end.
The m behind /()/ is used to specify that if the string has newline or carriage return characters, the ^ and $ operators will now match against a newline boundary, instead of a string boundary.
The g behind /()/ is used to perform a global match: so it find all matches rather than stopping after the first match.
To learn more about RegEx you can check out this guide.
Try to use the following regex
var text = '...here..\n..there...\n.their.here.';
var replaced = text.replace(/(^\.+|\.+$)/mg, '');
Here is working Demo
Use Regex /(^\.+|\.+$)/mg
^ represent at start
\.+ one or many full stops
$ represents at end
so:
var text = '...here..\n..there...\n.their.here.';
alert(text.replace(/(^\.+|\.+$)/mg, ''));
Here is an non regular expression answer which utilizes String.prototype
String.prototype.strim = function(needle){
var first_pos = 0;
var last_pos = this.length-1;
//find first non needle char position
for(var i = 0; i<this.length;i++){
if(this.charAt(i) !== needle){
first_pos = (i == 0? 0:i);
break;
}
}
//find last non needle char position
for(var i = this.length-1; i>0;i--){
if(this.charAt(i) !== needle){
last_pos = (i == this.length? this.length:i+1);
break;
}
}
return this.substring(first_pos,last_pos);
}
alert("...here..".strim('.'));
alert("..there...".strim('.'))
alert(".their.here.".strim('.'))
alert("hereagain..".strim('.'))
and see it working here : http://jsfiddle.net/cettox/VQPbp/
Slightly more code-golfy, if not readable, non-regexp prototype extension:
String.prototype.strim = function(needle) {
var out = this;
while (0 === out.indexOf(needle))
out = out.substr(needle.length);
while (out.length === out.lastIndexOf(needle) + needle.length)
out = out.slice(0,out.length-needle.length);
return out;
}
var spam = "this is a string that ends with thisthis";
alert("#" + spam.strim("this") + "#");
Fiddle-ige
Use RegEx with javaScript Replace
var res = s.replace(/(^\.+|\.+$)/mg, '');
We can use replace() method to remove the unwanted string in a string
Example:
var str = '<pre>I'm big fan of Stackoverflow</pre>'
str.replace(/<pre>/g, '').replace(/<\/pre>/g, '')
console.log(str)
output:
Check rules on RULES blotter

regular expression javascript returning unexpected results

In the below code, I want to validate messageText with first validationPattern and display the corresponding message from the validationPatterns array. Pattern and Message are separated by Pipe "|" character.
for this I am using the below code and always getting wrong result. Can some one look at this and help me?
var messageText = "Message1234";
var validationPatterns = [
['\/^.{6,7}$/|message one'],
['\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b|message two']
];
for (var i = 0; i < validationPatterns.length; i++) {
var validationvalues = validationPatterns[i].toString();
var expr = validationvalues.split("|")[0];
console.log(expr.constructor);
if(expr.test(messageText)) {
console.log("yes");
} else {
console.log("no");
}
}
I know that we cannot use pipe as separator as pipe is also part of regular expression. However I will change that later.
Your validationpatterns are strings. That means:
The backslashes get eaten as they just string-escape the following characters. "\b" is equivalent to "b". You would need to double escape them: "\\b"
You cannot call the test method on them. You would need to construct RegExp objects out of them.
While it's possible to fix this, it would be better if you just used regex literals and separated them from the message as distinct properties of an object (or in an array).
var inputText = "Message1234";
var validationPatterns = [
[/^.{6,7}$/, 'message one'],
[/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/, 'message two']
];
for (var i = 0; i < validationPatterns.length; i++) {
var expr = validationPatterns[i][0],
message = validationPatterns[i][1];
console.log(expr.constructor); // RegExp now, not String
if(expr.test(inputText)) {
console.log(message+": yes");
} else {
console.log(message+": no");
}
}
Your expr variable is still just a string (validationvalues.split("|")[0] will return a string). That's the reason it does not work as a regular expression.
You need to add a line after the initial definition of expr.
expr = new RegExp(expr, 'i');
The 'i' is just an example of how you would use a case-insensitive flag or other flags. Use an empty string if you want a case-sensitive search (the default).
Also, you need to take out the / and / which are surrounding your first pattern. They are only needed when using regular expression literals in JavaScript code, and are not needed when converting strings into regular expressions.

Categories

Resources