RegEx: find all comments, but not within quotation marks - javascript

I came up with an example that I can use to learn RegEx properly. I designed a kind of "script language" (as a string) that I want to parse and interpret.
PS: Yes, it is a bit reminiscent of LINQ, but that's just a coincidence.
As a first step I want to remove all the comments, because they shouldn't be interpreted.
I only look for comments of the form /*...*/ and //...\n
However, these should of course not be matched inside quotation marks: "..." and '...'
How can I use RegEx to find comments that are not inside quotation marks?
String:
//get means only read, but not to mutate data
Get(BooksWithAuthors)
//default queries via mycel
.Query()
//junction table to pair books and authors
.From(BookAuthor.As(BA))
//main table for books
.Join(left: Books.As(B) => B.Id == BA.BookId)
//main table for authors
.Join(left: Authors.As(A) => A.Id == BA.AuthorId)
//groups by column, body allows to restore data (restructuring)
.GroupBy(B.Id, => B.Authors.Add(A))
//ignore still registerd data objects for the response
.SelectIgnore(BA)
//or select only that fields or objects you want to response
.Select(B)
.Foo("//wrong-comment-inside-quotes")
.Foo('//wrong-comment-inside-single-quotes')
.Foo('some /*wrong-comment*/ inside')
;
//get means only read, but not to mutate data
Get(BooksWithAuthorsByMethod)
//using individual backside methods (created by own)
.GetBooksWithAuthors(id:6, filter:{key:'minAuthorAge', value:17})
;
/*
comments
"over"
'multiply
lines' //with wrong comments inside
*/
RegEx:
.*[^'"].*([\/]{2}.*[\r\n|\r|\n]).*[^'"].*
(https://regex101.com/r/zPzBFj/1)
Yes, I only tried it with //, but not every occurrence was found, and it also matches the comments inside quotation marks. Maybe ?! is not the right way. But how can I do that?
I'm sure I'll have one or two more questions about this example. But as I said, I'm still learning RegEx, so step by step...

If you match the string with the regular expression
/'.*?'|".*?"|(\/\/[^\r\n]*|\/\*.*?\*\/)/gs
comments will be saved to capture group 1. The idea is to match but not capture what you don't want and match and capture what you do want. Pay no attention to matches that are not captured.
Without the DOTALL flag (/s), periods match any character other than line terminators; with the flag set, periods match any character, including line terminators.
Demo
At the demo link matches that are not captured (not comments, so disregard) are shown in blue, whereas matches that are captured (comments) are shown in green.
The regular expression can be broken down as follows.
'.*?' # match a single-quote followed by >= 0 chars, lazily,
# followed by a single-quote
| # or
".*?" # match a double-quote followed by >= 0 chars, lazily,
# followed by a double-quote
| # or
( # begin capture group 1
\/\/ # match '//'
[^\r\n]* # match >= 0 chars other than line terminators
| # or
\/\* # match '/*'
.*? # match >= 0 chars, lazily
\*\/ # match '*/'
) # end capture group 1
Here is an example of how this works. Suppose the string were as follows.
A dog "is // a\nman's" /* best */ 'friend /* so it */ is' // said
The regex engine performs the following steps.
Fail to match A.
Fail to match the space after A, then fail to match d, o, g and the following space.
Match but do not capture "is // a\nman's". (1)
Fail to match the space.
Match and capture the comment /* best */.
Fail to match the space.
Match but do not capture 'friend /* so it */ is'.
Fail to match the space.
Match and capture the comment // said.
(1) After this match the regex engine's string pointer is between the (last) double-quote just matched and the following space.
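A minimal sketch of how this regex might be used from JavaScript follows. The helper names findComments and stripComments are illustrative only, and the sketch assumes an engine that supports the s flag (ES2018 or later). Note that the regex as written does not account for escaped quotes inside strings, such as "a \" b".

```javascript
// Quoted strings are matched first so that comment markers inside them are
// skipped; only real comments end up in capture group 1.
const COMMENT_OR_STRING = /'.*?'|".*?"|(\/\/[^\r\n]*|\/\*.*?\*\/)/gs;

// Return every comment found outside of quoted strings.
function findComments(source) {
  return Array.from(source.matchAll(COMMENT_OR_STRING))
    .map(m => m[1])
    .filter(comment => comment !== undefined);
}

// Remove comments but leave quoted strings untouched.
function stripComments(source) {
  return source.replace(COMMENT_OR_STRING, (match, comment) =>
    comment === undefined ? match : '');
}

const sample =
  'A dog "is // a\nman\'s" /* best */ \'friend /* so it */ is\' // said';
console.log(findComments(sample));  // the two real comments
console.log(stripComments(sample)); // sample text with both comments removed
```

The replace callback returns the whole match unchanged whenever group 1 did not participate, which is how the "match but do not capture" idea carries over to replacement.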

This returns what you're looking for in the example, let me know if you find any edge cases. You'll have to post-process the matches based on whether it's a comment or quoted string.
(?:(?:(\/)(\*)|(["'])).*?(?:\2\1|\3))|(?:\/\/[^\n]+)
https://regex101.com/r/uqx1cJ/1

Related

exclude full word with javascript regex word boundary

I'm looking to exclude matches that contain a specific word or phrase. For example, how could I match only the lines marked "match" below? The \b word boundary does not work as intuitively as I expected.
foo.js # match
foo_test.js # do not match
foo.ts # match
fun_tset.js # match
fun_tset_test.ts # do not match
UPDATE
What I want to exclude is strings ending explicitly with _test before the extension. At first I had something like [^_test], but that also excludes any combination of those characters (like line 3).
Regex: ^(?!.*_test\.).*$
Working examples: https://regex101.com/r/HdGom7/1
Why it works: uses negative lookahead to check if _test. exists somewhere in the string, and if so doesn't match it.
Adding to @pretzelhammer's answer, it looks like you want to grab strings that are file names ending in ts or js:
^(?!.*_test)(.*\.[jt]s)
The expression in the first parentheses is a negative lookahead that excludes any string containing _test; the second parenthesized group matches text up to and including a period followed by [jt] (j or t) and then s.
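As a quick check, both patterns can be run against the file names from the question. This is only a sketch; the variable names are made up for illustration.

```javascript
// Reject any name that contains "_test." anywhere; accept everything else.
const notATestFile = /^(?!.*_test\.).*$/;
// Variant that also requires a .js or .ts extension somewhere in the name.
const sourceFile = /^(?!.*_test)(.*\.[jt]s)/;

const files = [
  'foo.js',
  'foo_test.js',
  'foo.ts',
  'fun_tset.js',
  'fun_tset_test.ts',
];

const keptByFirst = files.filter(name => notATestFile.test(name));
const keptBySecond = files.filter(name => sourceFile.test(name));
console.log(keptByFirst);
console.log(keptBySecond);
```

Both filters keep foo.js, foo.ts and fun_tset.js, because only the exact sequence _test triggers the lookahead.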

Extract specific string using RegExp

I have a string something like this,
Example 1 #abc#xy-example.com and #xyz#abc.com Example 2
Now, I want to remove the string occurring after the second # encountered, i.e. #xy-example.com and #abc.com, and preserve the rest of the data, which should look like this:
Example 1 #abc and #xyz Example 2
I have tried a lot of RegExp and saw many examples but have had no luck so far.
If anyone has tried something similar, it'd be great if you can help me out.
For the second half of your sample you can simply match both #s and replace only the second one, by grouping and using the group in the replacement.
Pattern: /(#[^\s#]*)#[^\s#]*/g
Replacement: '$1'
This matches a # followed by anything but spaces and #, and stores it in group 1. It then matches the next # and again anything but spaces and #.
If there might be other stuff between both #s, you could adjust your pattern to use (#[^#]*) for the capturing group.
For the first part of your sample, you would have to find a better pattern to match what follows the second #. This could be something along the lines of #[^\s#<]*(?:<[^<>]*>[^<>]*<\/[^<>]*>), but I'm not quite sure about your requirements, and matching along tags is always tricky.
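Applied to the sample string from the question, the pattern and replacement above could be used like this (a small sketch; the variable names are illustrative):

```javascript
const input = 'Example 1 #abc#xy-example.com and #xyz#abc.com Example 2';

// Group 1 keeps the first "#word"; the second "#..." run is matched
// as part of the same hit and therefore dropped by the replacement.
const result = input.replace(/(#[^\s#]*)#[^\s#]*/g, '$1');

console.log(result);
```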
Capture the first # followed by non-# characters in a group, then match # again followed by non-space characters, and replace with the first captured group:
(#[^#]+?)#[^ ]+?(?= )
Result:
Example 1 #abc and #xyz Example 2
https://regex101.com/r/mXRlsZ/1
Note that this will also replace any #s in a row, past the first - eg #abc#xy#foo will become #abc
Try this pattern: ^[^#]*#[^#]+(?=#). It matches everything before the second #. It anchors at the beginning of the string, matches any run of characters other than # ([^#]*), then a #, then one or more characters other than # ([^#]+), and stops just before the next # ((?=#), a positive lookahead).
Demo
using System.Text.RegularExpressions;

string input = @"Example 1 #abc#xy-example.com and #xyz#abc.com Example 2";
string pattern = @"#.+?(?=#|$)";
int x = 0;
string s = Regex.Replace(
    input,
    pattern,
    m => ++x == 2 ? Regex.Match(m.Value, @">(\s+.+?\s+)$").Groups[1].Value : m.Value);
Explanation
First, search for the # symbol. The variable x tracks the number of # occurrences seen so far. When it reaches 2, we extract everything between > and the end of that match. If x is not equal to 2, we just return the match (m.Value).

JS Regex: Remove anything (ONLY) after a word

I want to remove all of the symbols (The symbol depends on what I select at the time) after each word, without knowing what the word could be. But leave them in before each word.
A couple of examples:
!!hello! my! !!name!!! is !!bob!! should return...
!!hello my !!name is !!bob ; for !
and
$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$ should return...
$remove the targetted# $$symbol# only $after a $word ; for $
You need to use capture groups and replace:
"!!hello! my! !!name!!! is !!bob!!".replace(/([a-zA-Z]+)(!+)/g, '$1');
Which works for your test string. To work for any generic character or group of characters:
var stripTrailing = trail => {
let regex = new RegExp(`([a-zA-Z0-9]+)(${trail}+)`, 'g');
return str => str.replace(regex, '$1');
};
Note that this fails on any characters that have meaning in a regular expression: []{}+*^$. etc. Escaping those programmatically is left as an exercise for the reader.
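For readers who do not want to do that exercise by hand, here is one possible sketch. The helper name escapeRegExp and the variant name stripTrailingEscaped are assumptions made for this example, not part of the answer above; the character class used for escaping is the commonly used one that covers regex metacharacters. The sketch also wraps the symbol in a non-capturing group so that a multi-character symbol repeats as a whole.

```javascript
// Put a backslash in front of every character that has special meaning
// in a regular expression, so the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Same idea as stripTrailing above, but safe for symbols such as $ or +.
function stripTrailingEscaped(trail) {
  const regex = new RegExp(`([a-zA-Z0-9]+)(?:${escapeRegExp(trail)})+`, 'g');
  return str => str.replace(regex, '$1');
}

const stripBang = stripTrailingEscaped('!');
const stripDollar = stripTrailingEscaped('$');
console.log(stripBang('!!hello! my! !!name!!! is !!bob!!'));
console.log(stripDollar('$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$'));
```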
UPDATE
Per your comment I thought an explanation might help you, so:
First, there's no way in this case to replace only part of a match, you have to replace the entire match. So we need to find a pattern that matches, split it into the part we want to keep and the part we don't, and replace the whole match with the part of it we want to keep. So let's break up my regex above into multiple lines to see what's going on:
First we want to match any number of sequential alphanumeric characters, that would be the 'word' to strip the trailing symbol from:
( // denotes capturing group for the 'word'
[ // [] means 'match any character listed inside brackets'
a-z // list of alpha character a-z
A-Z // same as above but capitalized
0-9 // list of digits 0 to 9
]+ // plus means one or more times
)
The capturing group means we want to have access to just that part of the match.
Then we have another group
(
! // I used ES6's string interpolation to insert the arg here
+ // match that exclamation (or whatever) one or more times
)
Then we add the g flag so the replace happens for every match in the target string; without the flag it stops after the first match. JavaScript provides a convenient shorthand for accessing the capturing groups in the form of automatically interpolated symbols: the '$1' above means 'insert the contents of the first capture group here in this string'.
So, in the above, if you replaced '$1' with '$1$2' you'd see the same string you started with, if you did 'foo$2' you'd see foo in place of every word trailed by one or more !, etc.
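The three replacement strings discussed above can be compared side by side. This is just a small sketch using the test string from the question.

```javascript
const text = '!!hello! my! !!name!!! is !!bob!!';
const pattern = /([a-zA-Z]+)(!+)/g;

// Keep only the word (group 1) and drop the trailing symbols (group 2).
const wordOnly = text.replace(pattern, '$1');
// Re-insert both groups: the string comes back unchanged.
const unchanged = text.replace(pattern, '$1$2');
// Replace the word but keep whatever symbols followed it.
const fooed = text.replace(pattern, 'foo$2');

console.log(wordOnly);
console.log(unchanged === text);
console.log(fooed);
```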

Regular expression for excluding some characters with multiline matching

I want to ensure that the user input doesn't contain characters like <, > or &#, whether it is text input or textarea. My pattern:
var pattern = /^((?!&#|<|>).)*$/m;
The problem is, that it still matches multiline strings from a textarea like
this text matches
though this should not, because of this character <
EDIT:
To be more clear, I need exclude &# combination only, not & or #.
Please suggest the solution. Very grateful.
You're probably not looking for the m (multiline) switch but for the s (DOTALL) switch. Unfortunately, s did not exist in JavaScript before ES2018, so older engines do not support it.
The good news is that DOTALL can be simulated using [\s\S]. Try the following regex:
/^(?![\s\S]*?(&#|<|>))[\s\S]*$/
OR:
/^((?!&#|<|>)[\s\S])*$/
Live Demo
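The difference between the per-line check and the whole-string check can be seen in a short sketch. The variable names are illustrative, and the last pattern is an assumed equivalent for engines that do support the s flag (ES2018 or later).

```javascript
const multiLine =
  'this text matches\nthough this should not, because of this character <';

// Original pattern: with the m flag, ^ and $ match around each line,
// so the clean first line alone is enough for test() to succeed.
const perLine = /^((?!&#|<|>).)*$/m;

// Whole-string check: [\s\S] matches any character, including newlines.
const wholeString = /^(?![\s\S]*?(&#|<|>))[\s\S]*$/;

// Same idea written with the s (dotAll) flag.
const wholeStringDotAll = /^(?!.*(?:&#|<|>)).*$/s;

console.log(perLine.test(multiLine));           // accepted, which is the bug
console.log(wholeString.test(multiLine));       // rejected
console.log(wholeStringDotAll.test(multiLine)); // rejected
console.log(wholeString.test('a & b # c'));     // & and # alone are allowed
```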
I don't think you need a lookaround assertion in this case. Simply use a negated character class:
var pattern = /^[^<>&#]*$/m;
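If you're also disallowing the characters -, [ and ], escape the brackets and put the hyphen last. (In JavaScript an unescaped ] directly after [^ closes the class, so the "put ] first" trick from other regex flavors does not work.)
var pattern = /^[^\]\[<>&#-]*$/m;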
Alternate answer to specific question:
anubhava's solution works accurately, but it is slow because it must perform a negative lookahead at every character position in the string. A simpler approach is to use reverse logic, i.e., instead of verifying that /^((?!&#|<|>)[\s\S])*$/ does match, verify that /[<>]|&#/ does NOT match. To illustrate this, let's create a function, hasSpecial(), which tests whether a string has one of the special chars. Here are two versions; the first uses anubhava's second regex:
function hasSpecial_1(text) {
// If regex matches, then string does NOT contain special chars.
return /^((?!&#|<|>)[\s\S])*$/.test(text) ? false : true;
}
function hasSpecial_2(text) {
// If regex matches, then string contains (at least) one special char.
return /[<>]|&#/.test(text) ? true : false;
}
These two functions are functionally equivalent, but the second one is probably quite a bit faster.
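A small sketch can confirm that the two approaches agree on a handful of inputs. The arrow-function names below are stand-ins for the two functions above, and the sample strings are made up for illustration; no timing is measured here.

```javascript
// Lookahead at every position: a match means "no special chars".
const hasSpecialLookahead = text => !/^((?!&#|<|>)[\s\S])*$/.test(text);
// Direct search: a match means "at least one special char".
const hasSpecialDirect = text => /[<>]|&#/.test(text);

const samples = ['plain text', 'a & b # c', 'x &# y', 'line one\nline <two>', ''];

const results = samples.map(text => ({
  text,
  lookahead: hasSpecialLookahead(text),
  direct: hasSpecialDirect(text),
}));
console.log(results);
```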
Note that when I originally read this question, I misinterpreted it to really want to exclude HTML special chars (including HTML entities). If that were the case, then the following solution will do just that.
Test if a string contains HTML special Chars:
It appears that the OP wants to ensure a string does not contain any special HTML characters, including <, >, as well as decimal and hex HTML entities such as &#160; and &#xA0;. If this is the case then the solution should probably also exclude the other (named) type of HTML entities such as &amp; and &lt;. The solution below excludes all three forms of HTML entities as well as the <> tag delimiters.
Here are two approaches: (Note that both approaches do allow the sequence: &# if it is not part of a valid HTML entity.)
FALSE test using positive regex:
function hasHtmlSpecial_1(text) {
/* Commented regex:
# Match string having no special HTML chars.
^ # Anchor to start of string.
[^<>&]* # Zero or more non-[<>&] (normal*).
(?: # Unroll the loop. ((special normal*)*)
& # Allow a & but only if
(?! # not an HTML entity (3 valid types).
(?: # One from 3 types of HTML entities.
[a-z\d]+ # either a named entity,
| \#\d+ # or a decimal entity,
| \#x[a-f\d]+ # or a hex entity.
) # End group of HTML entity types.
; # All entities end with ";".
) # End negative lookahead.
[^<>&]* # More (normal*).
)* # End unroll the loop.
$ # Anchor to end of string.
*/
var re = /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;
// If regex matches, then string does NOT contain HTML special chars.
return re.test(text) ? false : true;
}
Note that the above regex utilizes Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique and will run very quickly for both matching and non-matching cases. (See his regex masterpiece: Mastering Regular Expressions (3rd Edition))
TRUE test using negative regex:
function hasHtmlSpecial_2(text) {
/* Commented regex:
# Match string having one special HTML char.
[<>] # Either a tag delimiter
| & # or a & if start of
(?: # one of 3 types of HTML entities.
[a-z\d]+ # either a named entity,
| \#\d+ # or a decimal entity,
| \#x[a-f\d]+ # or a hex entity.
) # End group of HTML entity types.
; # All entities end with ";".
*/
var re = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;
// If regex matches, then string contains (at least) one special HTML char.
return re.test(text) ? true : false;
}
Note also that I have included a commented version of each of these (non-trivial) regexes in the form of a JavaScript comment.
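To see how the two regexes behave, they can be tried on a few strings. This is only a sketch; the variable names and sample inputs are chosen for illustration.

```javascript
// Matches only when the text is free of tag delimiters and HTML entities.
const noHtmlSpecial =
  /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;
// Matches as soon as one tag delimiter or HTML entity is found.
const anyHtmlSpecial = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;

const cases = [
  'Tom & Jerry',     // bare ampersand: allowed
  'a &# b',          // &# without a complete entity: allowed
  'Tom &amp; Jerry', // named entity
  '&#160;',          // decimal entity
  '&#xA0;',          // hex entity
  'a < b',           // tag delimiter
];

const verdicts = cases.map(text => ({
  text,
  viaPositive: !noHtmlSpecial.test(text),
  viaNegative: anyHtmlSpecial.test(text),
}));
console.log(verdicts);
```

Both columns agree: only the last four strings are reported as containing HTML special characters.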

Javascript multiple regex pattern

I'm trying to exclude some internal IP addresses and some internal IP address formats from viewing certain logos and links on the site. I have multiple ranges of IP addresses (sample given below). Is it possible to write a regex that could match all the IP addresses in the list below using JavaScript?
10.X.X.X
12.122.X.X
12.211.X.X
64.X.X.X
64.23.X.X
74.23.211.92
and 10 more
Quote the periods, replace the X's with \d+, and join them all together with pipes:
const allowedIPpatterns = [
"10.X.X.X",
"12.122.X.X",
"12.211.X.X",
"64.X.X.X",
"64.23.X.X",
"74.23.211.92" //, etc.
];
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
')$';
const allowedRegexp = new RegExp(allowedRegexStr);
Then you're all set:
'10.1.2.3'.match(allowedRegexp) // => ['10.1.2.3']
'100.1.2.3'.match(allowedRegexp) // => null
How it works:
First, we have to turn the individual IP patterns into regular expressions matching their intent. One regular expression for "all IPs of the form '12.122.X.X'" is this:
^12\.122\.\d+\.\d+$
^ means the match has to start at the beginning of the string; otherwise, 112.122.X.X IPs would also match.
12 etc: digits match themselves
\.: a period in a regex matches any character at all; we want literal periods, so we put a backslash in front.
\d: shorthand for [0-9]; matches any digit.
+: means "1 or more" - 1 or more digits, in this case.
$: similarly to ^, this means the match has to end at the end of the string.
So, we turn the IP patterns into regexes like that. For an individual pattern you could use code like this:
const regexStr = `^` + ipXpattern.
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
`$`;
Which just replaces all .s with \. and Xs with \d+ and sticks the ^ and $ on the ends.
(Note the doubled backslashes; both string parsing and regex parsing use backslashes, so wherever we want a literal one to make it past the string parser to the regular expression parser, we have to double it.)
In a regular expression, the alternation this|that matches anything that matches either this or that. So we can check for a match against all the IPs at once if we turn the list into a single regex of the form re1|re2|re3|...|relast.
Then we can do some refactoring to make the regex matcher's job easier; in this case, since all the regexes are going to have ^...$, we can move those constraints out of the individual regexes and put them on the whole thing: ^(10\.\d+\.\d+\.\d+|12\.122\.\d+\.\d+|...)$. The parentheses keep the ^ from being only part of the first pattern and $ from being only part of the last. But since plain parentheses capture as well as group, and we don't need to capture anything, I replaced them with the non-capturing version (?:...).
And in this case we can do the global search-and-replace once on the giant string instead of individually on each pattern. So the result is the code above:
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
')$';
That's still just a string; we have to turn it into an actual RegExp object to do the matching:
const allowedRegexp = new RegExp(allowedRegexStr);
As written, this doesn't filter out illegal IPs - for instance, 10.1234.5678.9012 would match the first pattern. If you want to limit the individual byte values to the decimal range 0-255, you can use a more complicated regex than \d+, like this:
(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])
That matches "any one or two digits, or '1' followed by any two digits, or '2' followed by any of '0' through '4' followed by any digit, or '25' followed by any of '0' through '5'". Replacing the \d+ with that turns the full string-munging expression into this:
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '(?:\\d{1,2}|1\\d{2}|2[0-4]\\d|25[0-5])') +
')$';
And makes the actual regex look much more unwieldy:
^(?:10\.(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])|12\.122\....
but you don't have to look at it, just match against it. :)
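Putting the pieces together, the range-limited version might be wrapped in a small helper like the sketch below. The names OCTET and buildIpRegExp are made up for this example, and the pattern list contains only the six samples from the question. A function is used as the replacement for X so that no character in the octet pattern can be interpreted as a replacement pattern.

```javascript
// One decimal byte: 0-99, 100-199, 200-249 or 250-255.
const OCTET = '(?:\\d{1,2}|1\\d{2}|2[0-4]\\d|25[0-5])';

// Turn patterns such as "12.122.X.X" into one anchored alternation.
function buildIpRegExp(patterns) {
  const body = patterns
    .join('|')
    .replace(/\./g, '\\.')     // literal dots
    .replace(/X/g, () => OCTET); // each X becomes one byte
  return new RegExp('^(?:' + body + ')$');
}

const internalIps = buildIpRegExp([
  '10.X.X.X',
  '12.122.X.X',
  '12.211.X.X',
  '64.X.X.X',
  '64.23.X.X',
  '74.23.211.92',
]);

console.log(internalIps.test('10.1.2.3'));          // inside 10.X.X.X
console.log(internalIps.test('100.1.2.3'));         // not in any range
console.log(internalIps.test('10.1234.5678.9012')); // bytes out of range
```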
You could do it in regex, but it's not going to be pretty, especially since JavaScript doesn't even support verbose regexes, which means that it has to be one humongous line of regex without any comments. Furthermore, regexes are ill-suited for matching ranges of numbers. I suspect that there are better tools for dealing with this.
Well, OK, here goes (for the samples you provided):
var myregexp = /\b(?:74\.23\.211\.92|(?:12\.(?:122|211)|64\.23)\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])|(?:10|64)\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]))\b/g;
As a verbose ("readable") regex:
\b # start of number
(?: # Either match...
74\.23\.211\.92 # an explicit address
| # or
(?: # an address that starts with
12\.(?:122|211) # 12.122 or 12.211
| # or
64\.23 # 64.23
)
\. # .
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]) # followed by 0..255
| # or
(?:10|64) # match 10 or 64
\. # .
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]) # followed by 0..255
)
\b # end of number
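Since JavaScript has no verbose mode, one way to keep this regex readable is to assemble it from commented string pieces. The sketch below is an assumed reconstruction of the pattern above, not the answerer's code; the names OCTET_0_255 and internalIp are illustrative. The g flag is left out on purpose, because a global regex keeps state in lastIndex between test() calls.

```javascript
// One number in the range 0..255.
const OCTET_0_255 = '(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])';

const internalIp = new RegExp(
  '\\b(?:' +
    // an explicit address
    '74\\.23\\.211\\.92' +
    // or 12.122.x.x, 12.211.x.x, 64.23.x.x
    '|(?:12\\.(?:122|211)|64\\.23)\\.' + OCTET_0_255 + '\\.' + OCTET_0_255 +
    // or 10.x.x.x, 64.x.x.x
    '|(?:10|64)\\.' + OCTET_0_255 + '\\.' + OCTET_0_255 + '\\.' + OCTET_0_255 +
  ')\\b'
);

console.log(internalIp.test('12.211.3.4'));
console.log(internalIp.test('11.1.2.3'));
```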
/^(X|\d{1,3})(\.(X|\d{1,3})){3}$/ should do it.
If you don't actually need to match the "X" character you could use this:
\b(?:\d{1,3}\.){3}\d{1,3}\b
Otherwise I would use the solution cebarrett provided.
I'm not entirely sure of what you're trying to achieve here (it doesn't look like anyone else is either).
However, if it's validation, then here's a solution to validate an IP address that doesn't use RegEx. First, split the input string at the dots. Then, using parseInt on each segment, make sure it isn't higher than 255.
function ipValidator(ipAddress) {
  // Split the address into its dot-separated segments.
  var ipSegments = ipAddress.split('.');
  for (var i = 0; i < ipSegments.length; i++) {
    // Pass radix 10 so every segment is parsed as a decimal number.
    if (parseInt(ipSegments[i], 10) > 255) {
      return 'fail';
    }
  }
  return 'match';
}
Running the following returns 'match':
document.write(ipValidator('10.255.255.125'));
Whereas this will return 'fail':
document.write(ipValidator('10.255.256.125'));
Here's an annotated version in a jsfiddle with some examples: http://jsfiddle.net/VGp2p/2/
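The function above only checks the upper bound of each segment. If stricter validation is wanted, a variant along the following lines could also require exactly four purely numeric segments. This is a sketch under that assumption; the name isValidIpv4 and the boolean return value are choices made for this example, not part of the answer.

```javascript
function isValidIpv4(ipAddress) {
  const segments = ipAddress.split('.');
  // An IPv4 address has exactly four dot-separated parts.
  if (segments.length !== 4) {
    return false;
  }
  // Each part must be one to three digits and at most 255.
  return segments.every(segment =>
    /^\d{1,3}$/.test(segment) && Number(segment) <= 255);
}

console.log(isValidIpv4('10.255.255.125'));
console.log(isValidIpv4('10.255.256.125'));
console.log(isValidIpv4('10.a.2.3'));
```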
