Making Regex more safe - javascript

I'm trying to make a set of regular expressions safer. By "safer" I mean more accurate.
I'm very new to RegExp, and I want to know whether I'm going about this the right way (not the regex itself, but the approach to making it stricter).
This is the first RegExp I want to change; I want to extract the 01/2011.
Past RegExp:
var text = 'INSCRIÇÃO: 60.537.263/0001-66 COMP: 01/2011 COD REC: 150';
var reg = /COMP.*?(\d\S*)/;
var match = reg.exec(text);
console.log(match[1]);
New RegExp:
var text = 'INSCRIÇÃO: 60.537.263/0001-66 COMP: 01/2011 COD REC: 150';
var reg = /COMP:\s([0-9]{0,2}\/[0-9]{0,4})/;
var match = reg.exec(text);
console.log(match[1]);
Why? This text is only part of a much larger text, so I need accuracy.
My other question is about making the regex optional, so that if nothing matches, the result is undefined.
Thanks.

According to your feedback:
I specifically want to extract the value with two digits, one / and four digits
You can use
/\bCOMP:\s*(\d{2}\/\d{4})(?!\d)/g
The \b is a word boundary, thus 5COMP won't be matched.
The \s* will match 0 or more whitespace (if there must be whitespace, use + quantifier instead).
The \d{2} will match exactly 2 digits.
The \d{4} will match 4 digits and no more because of the look-ahead (?!\d). This look-ahead just makes sure there is no digit after the 4 previous digits. You may use \b here as well to ensure matching a word boundary.
var arr = [];
var re = /\bCOMP:\s*(\d{2}\/\d{4})(?!\d)/g;
var str = 'COMP:10/9995, COMP: 21/1234, COMP: 21/123434, REGCOMP: 21/1234';
var m;
while ((m = re.exec(str)) !== null) {
arr.push(m[1]);
}
console.log(arr);
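On the second part of the question (getting undefined when nothing matches): exec returns null on failure, so reading match[1] directly would throw. A minimal sketch, assuming a single value per text is wanted; the helper name extractComp is hypothetical and not part of the original answer:

```javascript
// Hypothetical helper: returns the MM/YYYY value after "COMP:", or undefined.
function extractComp(text) {
  // No g flag here: we only want the first occurrence, and a non-global
  // regex keeps no lastIndex state between calls.
  const match = /\bCOMP:\s*(\d{2}\/\d{4})(?!\d)/.exec(text);
  // exec returns null when nothing matches, so guard before indexing.
  return match ? match[1] : undefined;
}

console.log(extractComp('INSCRIÇÃO: 60.537.263/0001-66 COMP: 01/2011 COD REC: 150')); // 01/2011
console.log(extractComp('COD REC: 150')); // undefined
```

The same guard applies to the original snippets: check the result of exec before reading a capture group.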

Related

Regex match apostrophe inside, but not around words, inside a character set

I'm counting how many times different words appear in a text using Regular Expressions in JavaScript. My problem is when I have quoted words: 'word' should be counted simply as word (without the quotes, otherwise they'll behave as two different words), while it's should be counted as a whole word.
(?<=\w)(')(?=\w)
This regex can identify apostrophes inside, but not around words. Problem is, I can't use it inside a character set such as [\w]+.
(?<=\w)(')(?=\w)|[\w]+
Will count it's a 'miracle' of nature as 7 words, instead of 5 (it, ', s becoming 3 different words). Also, the third word should be selected simply as miracle, and not as 'miracle'.
To make things even more complicated, I need to capture diacritics too, so I'm using [A-Za-zÀ-ÖØ-öø-ÿ] instead of \w.
How can I accomplish that?
1) You can simply use /[^\s]+/g regex
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g);
console.log(result.length);
console.log(result);
2) If you are calculating total number of words in a string then you can also use split as:
const str = `it's a 'miracle' of nature`;
const result = str.split(/\s+/);
console.log(result.length);
console.log(result);
3) If you want each word without a quote at the start or at the end, you can do this:
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g).map((s) => {
s = s[0] === "'" ? s.slice(1) : s;
s = s[s.length - 1] === "'" ? s.slice(0, -1) : s;
return s;
});
console.log(result.length);
console.log(result);
You might use an alternation with 2 capture groups, and then check for the values of those groups.
(?<!\S)'(\S+)'(?!\S)|(\S+)
(?<!\S)' Negative lookbehind, assert a whitespace boundary to the left and match '
(\S+) Capture group 1, match 1+ non whitespace chars
'(?!\S) Match ' and assert a whitespace boundary to the right
| Or
(\S+) Capture group 2, match 1+ non whitespace chars
See a regex demo.
const regex = /(?<!\S)'(\S+)'(?!\S)|(\S+)/g;
const s = "it's a 'miracle' of nature";
Array.from(s.matchAll(regex), m => {
if (m[1]) console.log(m[1])
if (m[2]) console.log(m[2])
});
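Another way to approach the question, sketched here under the assumption that a word is a run of letters that may contain apostrophes only between letters. It reuses the diacritics range from the question; countWords is a hypothetical helper name:

```javascript
// Letter class taken from the question (ASCII letters plus Latin-1 diacritics).
const LETTER = "[A-Za-zÀ-ÖØ-öø-ÿ]";
// A word: one or more letters, optionally continued by apostrophe + letters.
// A leading or trailing quote is never part of the match.
const WORD = new RegExp(`${LETTER}+(?:'${LETTER}+)*`, "g");

// Hypothetical helper: counts case-insensitive word occurrences.
function countWords(text) {
  const counts = new Map();
  for (const word of text.match(WORD) || []) {
    const key = word.toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const counts = countWords("it's a 'miracle' of nature, a miracle");
console.log(counts.size);           // 5
console.log(counts.get("miracle")); // 2
```

Because the apostrophe is only accepted when letters follow it, no lookbehind is needed and the quotes around 'miracle' are skipped.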

Multiple OR conditions for words in JavaScript regular expression

I'm trying to write a regular expression that finds the text between two words, but the second word is not fixed.
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
This is my text. I'm trying to find the word between Soyadı and Sınıfı, in this case ERTANĞA, but instead of Sınıfı the text may also continue with no, numara or any digit. This is what I tried.
soyad[ıi](.*)S[ıi]n[ıi]f[ıi]|no|numara|[0-9]
[ıi] is for Turkish character issue, don't mind that.
You can remove everything around the word with a pattern like this:
/.*Soyad(ı|i)|S(ı|i)n(ı|i)f(ı|i).*|no.*|numara.*|[0-9]/gmi
Here is the link I worked on : https://regex101.com/r/QXLjLF/1
In JS code:
const regex = /.*Soyad(ı|i)|S(ı|i)n(ı|i)f(ı|i).*|no.*|numara.*|[0-9]/gmi;
var str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303`;
var newStr = str.replace(regex, '');
console.log(newStr);
You can use a single capture group to get the word ERTANĞA, keep the character class [ıi] instead of using an alternation for (ı|i) and group the alternatives at the end of the pattern using a non capture group (?:
soyad[ıi](.+?)(?:S[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])
soyad[ıi] Match soyadı or soyadi
(.+?) Capture group 1, match 1 or more chars, as few as possible
(?: Non capture group
S[ıi]n[ıi]f[ıi] Match S and then ı or i etc..
| Or
n(?:o|umara) Match either no or numara
| Or
[0-9] Match a digit 0-9
) Close non capture group
Note that you don't need the /m flag as there are no anchors in the pattern.
Regex demo
const regex = /soyad[ıi](.+?)(?:S[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])/gi;
const str = "2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303\n";
console.log(Array.from(str.matchAll(regex), m => m[1]));
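To see why the grouping matters, here is a small comparison of the pattern from the question with the grouped version. Alternation has the lowest precedence, so the ungrouped pattern is really four independent alternatives, and the bare [0-9] wins at the very first character. The variable names are only for illustration:

```javascript
const str = '2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303';

// Pattern from the question: parsed as  soyad…Sınıfı  OR  no  OR  numara  OR  [0-9]
const ungrouped = /soyad[ıi](.*)S[ıi]n[ıi]f[ıi]|no|numara|[0-9]/i;
// Grouped version: the alternatives only describe what may follow the name.
const grouped = /soyad[ıi](.+?)(?:S[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])/i;

const first = ungrouped.exec(str);
console.log(first[0], first[1]);   // 2 undefined
console.log(grouped.exec(str)[1]); // ERTANĞA
```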
This might do it
const str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnumaraE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnoE10/ENo303`
const re = /(?:Soyad(ı|i))(.*?)(?:S(ı|i)n(ı|i)f(ı|i)|no|numara)/gmi
console.log([...str.matchAll(re)].map(x => x[2]))
Without matchAll (exec in a loop):
const str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnumaraE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnoE10/ENo303`
const re = /(?:Soyad(ı|i))(.*?)(?:S(ı|i)n(ı|i)f(ı|i)|no|numara)/gmi
const res = []
let match;
while ((match = re.exec(str)) !== null) res.push(match[2])
console.log(res)

How to slice optional arguments in RegEx?

I have the following regular expression:
/^(?:(?:\,([A-Za-z]{5}))?)+$/g
The accepted input should look like ,IGORA, but ,IGORA,GIANC,LOLLI is also valid, and in that case I want to split the string into 3 groups. In general, the number of groups should equal the number of items in an input that passes the RegExp test.
I tried something like this in JavaScript, but it returns only the last value:
var str = ',GIANC,IGORA';
var arr = str.match(/^(?:(?:\,([A-Za-z]{5}))?)+$/).slice(1);
alert(arr);
So the output is 'IGORA', while I want it to be 'GIANC', 'IGORA'.
Here is another example:
/^([A-Z]{5})(?:(?:\,([A-Za-z]{2}))?)+$/g
The input must contain at least one 5-character string, but it can also contain further 5-character strings separated by commas, so from the input
IGORA,CIAOA,POPOP
I would have an array of ["IGORA","CIAOA","POPOP"]
You can capture each word in a capturing group surrounded by an optional preceding comma and an optional trailing comma.
You can test the regex here: ,?([A-Za-z]+),?
const pattern = /,?([A-Za-z]+),?/gm;
const str = `,IGORA,GIANC,LOLLI`;
let matches = [];
let m;
// Iterate until no match found
while ((m = pattern.exec(str))) {
// The first captured group is the match
matches.push(m[1]);
}
console.log(matches);
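The original pattern returns only the last item because a repeated capture group keeps just its final iteration. One way around that, sketched here with a hypothetical helper parseCodes, is to validate the whole string with an anchored regex first and only then split it. The sketch assumes the ",WORD" format with exactly 5 letters per item, as in the first example of the question:

```javascript
// Hypothetical helper: returns the list of 5-letter codes, or null if the
// input does not consist solely of ",XXXXX" items.
function parseCodes(input) {
  // Anchored check: one or more items, each a comma followed by 5 letters.
  if (!/^(?:,[A-Za-z]{5})+$/.test(input)) {
    return null;
  }
  // Drop the leading comma, then split on the remaining ones.
  return input.slice(1).split(',');
}

console.log(parseCodes(',GIANC,IGORA'));       // [ 'GIANC', 'IGORA' ]
console.log(parseCodes(',IGORA,GIANC,LOLLI')); // [ 'IGORA', 'GIANC', 'LOLLI' ]
console.log(parseCodes(',GIANC,IG'));          // null
```

Unlike the unanchored ,?([A-Za-z]+),? loop, this rejects malformed input instead of silently extracting whatever letters it finds.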
There are other ways to do this, but I found that one of the simple ways is by using the replace method, as it can replace all instances that match that regex.
For example:
var regex = /,[A-Za-z]{5}/g; // unanchored, so each ",WORD" item is matched separately
var str = ',GIANC,IGORA';
var arr = [];
str.replace(regex, function(match) {
arr[arr.length] = match;
return match;
});
console.log(arr);
Also, in my code snippet you can see that there is an extra comma in each string; you can solve that by changing line 5 to arr[arr.length] = match.replace(/^,/, '').
Is this what you're looking for?
Explanation:
\b word boundary (starting or ending a word)
\w a word character ([A-Za-z0-9_])
{5} exactly 5 of the previous token
So it matches all 5-character words but not NANANANA
var str = 'IGORA,CIAOA,POPOP,NANANANA';
var arr = str.match(/\b\w{5}\b/g);
console.log(arr); //['IGORA', 'CIAOA', 'POPOP']
If you only wish to select words separated by commas and nothing else, you can test for them like so:
(?<=,\s*|^) preceded by , with any number of trailing space, OR is the first word in list.
(?=,\s*|$) followed by , and any number of trailing spaces OR is last word in list.
In the following code, POPOP and MOMMA are rejected because they are not separated by a comma, and NANANANA fails because it is not 5 characters long.
var str = 'IGORA, CIAOA, POPOP MOMMA, NANANANA, MEOWI';
var arr = str.match(/(?<=,\s*|^)\b\w{5}\b(?=,\s*|$)/g);
console.log(arr); //['IGORA', 'CIAOA', 'MEOWI']
If you can't have any trailing spaces after the comma, just leave out the \s* from both (?<=,\s*|^) and (?=,\s*|$).
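Lookbehind assertions were only added to JavaScript in ES2018, so older engines may reject the pattern above. A lookbehind-free sketch of the same filter, assuming items are separated by commas with optional surrounding whitespace:

```javascript
const list = 'IGORA, CIAOA, POPOP MOMMA, NANANANA, MEOWI';

// Split on commas (swallowing surrounding whitespace), then keep only the
// items that are exactly 5 word characters long.
const fiveCharItems = list
  .split(/\s*,\s*/)
  .filter(function (item) { return /^\w{5}$/.test(item); });

console.log(fiveCharItems); // [ 'IGORA', 'CIAOA', 'MEOWI' ]
```

"POPOP MOMMA" stays one item after the split and fails the anchored test, which gives the same result as the lookaround version.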

Matching whole words with Javascript's Regex with a few restrictions

I am trying to create a regex that can extract all words from a given string that only contain alphanumeric characters.
Yes
yes absolutely
#no
*NotThis
orThis--
Good *Bad*
1ThisIsOkay2 ButNotThis2)
Words that should have been extracted: Yes, yes, absolutely, Good, 1ThisIsOkay2
Here is the work I have done thus far:
/(?:^|\b)[a-zA-Z0-9]+(?=\b|$)/g
I had found this expression that works in Ruby ( with some tweaking ) but I have not been able to convert it to Javascript regex.
Use /(?:^|\s)\w+(?!\S)/g to match 1 or more word chars in between start of string/whitespace and another whitespace or end of string:
var s = "Yes\nyes absolutely\n#no\n*NotThis\norThis-- \nGood *Bad*\n1ThisIsOkay2 ButNotThis2)";
var re = /(?:^|\s)\w+(?!\S)/g;
var res = s.match(re).map(function(m) {
return m.trim();
});
console.log(res);
Or another variation:
var s = "Yes\nyes absolutely\n#no\n*NotThis\norThis-- \nGood *Bad*\n1ThisIsOkay2 ButNotThis2)";
var re = /(?:^|\s)(\w+)(?!\S)/g;
var res = [];
var m;
while ((m=re.exec(s)) !== null) {
res.push(m[1]);
}
console.log(res);
Pattern details:
(?:^|\s) - either start of string or whitespace (consumed, that is why trim() is necessary in Snippet 1)
\w+ - 1 or more word chars (in Snippet 2, captured into Group 1 used to populate the resulting array)
(?!\S) - negative lookahead that fails the match if the word chars are immediately followed by a non-whitespace char.
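In engines that support lookbehind (ES2018 and later), the leading whitespace does not have to be consumed at all, so neither trim() nor a capture group is needed. A sketch of that variant, using [A-Za-z0-9] rather than \w on the assumption that underscores should not count as alphanumeric:

```javascript
const s = "Yes\nyes absolutely\n#no\n*NotThis\norThis-- \nGood *Bad*\n1ThisIsOkay2 ButNotThis2)";

// (?<!\S) asserts start of string or whitespace on the left without consuming it;
// (?!\S) asserts end of string or whitespace on the right.
const words = s.match(/(?<!\S)[A-Za-z0-9]+(?!\S)/g) || [];

console.log(words); // [ 'Yes', 'yes', 'absolutely', 'Good', '1ThisIsOkay2' ]
```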
You can do that (where s is your string) to match all the words:
var m = s.split(/\s+/).filter(function(i) { return !/\W/.test(i); });
If you want to proceed to a replacement, you can do that:
var res = s.split(/(\s+)/).map(function(i) { return i.replace(/^\w+$/, "#");}).join('');

RegExp in JavaScript doesn't reject a number inside a string

I'm learning regular expressions in JavaScript and there is something I don't understand.
The following regexp should match only strings made of the letters a to z, but if I add a number it still reports a match:
var patron = /[a-zA-Z]/;
var regex = new RegExp(patron);
var v= "hello word 512";
if(v.match(regex))
{
//should not match but it does
}else
{
objInput.style.color = "red";
}
And then I tried this:
var patron = /[a-zA-Z\D]/;
var regex = new RegExp(patron);
var v= "hello word 512";
if(v.match(regex))
{
// should not match, but it still does
}else
{
objInput.style.color = "red";
}
Also, parentheses are not being matched:
var patron = /[a-zA-Z\"\']/;
var regex = new RegExp(patron);
var v= "hello word 512";
if(v.match(regex))
{
// it matches whenever the double quote is followed by the single quote
}else
{
objInput.style.color = "red";
}
About the first example you provided: your regex /[a-zA-Z]/ checks whether any single character in the input string is a letter. Since it finds h in your input string, it reports a match.
What you need to do is place start and end anchors, ^ and $ in your regex. New regex would look like this:
/^[a-zA-Z]+$/
You can change all your regexes accordingly.
Parentheses need a backslash only outside a character class: \( matches ( and \) matches ). Inside a character class such as [a-zA-Z()] they are already literal.
You should match the whole string, using the ^ (matches the beginning of the string) and $ (matches the end of the string) operators, for example:
/^[a-zA-Z\s]+$/.test("any string followed by numbers! 555") // will return false
This will not allow anything else than a-z chars and spaces in your string.
The match function only looks for at least ONE match; in your case that is the first character, which is a letter.
If you want ONLY letters, anchor the pattern: /^[a-zA-Z]+$/.test("your string")
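The difference between the unanchored and anchored patterns can be checked directly. This is a small sketch; the variable names are made up for illustration:

```javascript
const value = 'hello word 512';

// Unanchored: true as soon as ANY character is a letter.
const anyLetter = /[a-zA-Z]/;
// Anchored: true only if EVERY character is a letter.
const onlyLetters = /^[a-zA-Z]+$/;
// Anchored, also allowing whitespace and literal parentheses.
const lettersSpacesParens = /^[a-zA-Z()\s]+$/;

console.log(anyLetter.test(value));                     // true
console.log(onlyLetters.test(value));                   // false
console.log(lettersSpacesParens.test(value));           // false (digits are not allowed)
console.log(lettersSpacesParens.test('hello (word)'));  // true
```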
