Regex match apostrophe inside, but not around words, inside a character set - javascript

I'm counting how many times different words appear in a text using Regular Expressions in JavaScript. My problem is when I have quoted words: 'word' should be counted simply as word (without the quotes, otherwise they'll behave as two different words), while it's should be counted as a whole word.
(?<=\w)(')(?=\w)
This regex can identify apostrophes inside, but not around words. Problem is, I can't use it inside a character set such as [\w]+.
(?<=\w)(')(?=\w)|[\w]+
This counts it's a 'miracle' of nature as 7 words instead of 5 (it, ' and s become 3 separate words). Also, the third word should be matched simply as miracle, not as 'miracle'.
To make things even more complicated, I need to capture diacritics too, so I'm using [A-Za-zÀ-ÖØ-öø-ÿ] instead of \w.
How can I accomplish that?

1) You can simply use the regex /[^\s]+/g (equivalent to /\S+/g):
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g);
console.log(result.length);
console.log(result);
2) If you only need the total number of words in a string, you can also use split:
const str = `it's a 'miracle' of nature`;
const result = str.split(/\s+/);
console.log(result.length);
console.log(result);
3) If you want each word without a quote at the start or at the end, you can strip the quotes like this:
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g).map((s) => {
s = s[0] === "'" ? s.slice(1) : s;
s = s[s.length - 1] === "'" ? s.slice(0, -1) : s;
return s;
});
console.log(result.length);
console.log(result);

You might use an alternation with 2 capture groups, and then check for the values of those groups.
(?<!\S)'(\S+)'(?!\S)|(\S+)
(?<!\S)' Negative lookbehind, assert a whitespace boundary to the left and match '
(\S+) Capture group 1, match 1+ non whitespace chars
'(?!\S) Match ' and assert a whitespace boundary to the right
| Or
(\S+) Capture group 2, match 1+ non whitespace chars
const regex = /(?<!\S)'(\S+)'(?!\S)|(\S+)/g;
const s = "it's a 'miracle' of nature";
Array.from(s.matchAll(regex), m => {
if (m[1]) console.log(m[1])
if (m[2]) console.log(m[2])
});
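Neither answer above uses the diacritic-aware character class from the question. A minimal sketch, assuming the range [A-Za-zÀ-ÖØ-öø-ÿ] from the question covers the letters you need and that an apostrophe only belongs to a word when it has letters on both sides. The names LETTER, WORD_RE and countWords are made up for illustration:

```javascript
// Letter class taken from the question (Latin letters plus Latin-1 diacritics).
const LETTER = "[A-Za-zÀ-ÖØ-öø-ÿ]";
// A word is a run of letters, optionally followed by groups of "'letters".
// Leading and trailing quotes are never part of the match.
const WORD_RE = new RegExp(`${LETTER}+(?:'${LETTER}+)*`, "g");

// Hypothetical helper: returns a Map of lower-cased word -> occurrences.
function countWords(text) {
  const counts = new Map();
  for (const word of text.match(WORD_RE) ?? []) {
    const key = word.toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const words = "it's a 'miracle' of nature".match(WORD_RE);
console.log(words.length); // 5
console.log(words); // it's, a, miracle, of, nature
console.log(countWords("Café 'café' it's it's").get("café")); // 2
```

Because the quotes are simply never matched, 'miracle' and miracle end up under the same key without a separate stripping step.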

Related

Multiple OR conditions for words in JavaScript regular expression

I'm trying to write a regular expression that finds the text between two words, but the second word is not always the same.
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
This is my text. I'm trying to find the word between Soyadı and Sınıfı, in this case ERTANĞA, but the word Sınıfı also can be no, numara or any number. This is what I did.
soyad[ıi](.*)S[ıi]n[ıi]f[ıi]|no|numara|[0-9]
[ıi] is for Turkish character issue, don't mind that.
You can use something like the following:
/.*Soyad(ı|i)|S(ı|i)n(ı|i)f(ı|i).*|no.*|numara.*|[0-9]/gmi
Here is the link I worked on: https://regex101.com/r/QXLjLF/1
In JS code:
const regex = /.*Soyad(ı|i)|S(ı|i)n(ı|i)f(ı|i).*|no.*|numara.*|[0-9]/gmi;
var str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303`;
var newStr = str.replace(regex, '');
console.log(newStr);
You can use a single capture group to get the word ERTANĞA. Keep the character class [ıi] instead of using an alternation (ı|i), and group the alternatives at the end of the pattern using a non-capture group (?:...):
soyad[ıi](.+?)(?:S[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])
soyad[ıi] Match soyadı or soyadi
(.+?) Capture group 1, match 1 or more chars, as few as possible
(?: Non capture group
S[ıi]n[ıi]f[ıi] Match S and then ı or i etc..
| Or
n(?:o|umara) Match either no or numara
| Or
[0-9] Match a digit 0-9
) Close non capture group
Note that you don't need the /m flag as there are no anchors in the pattern.
const regex = /soyad[ıi](.+?)(?:S[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])/gi;
const str = "2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303\n";
console.log(Array.from(str.matchAll(regex), m => m[1]));
This might do it
const str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnumaraE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnoE10/ENo303`
const re = /(?:Soyad(ı|i))(.*?)(?:S(ı|i)n(ı|i)f(ı|i)|no|numara)/gmi
console.log([...str.matchAll(re)].map(x => x[2]))
Without matchAll, using exec in a loop:
const str = `2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnumaraE10/ENo303
2015ÖĞLEYEMEKKARTI(2016-20.AdıMEVLÜTSoyadıERTANĞAnoE10/ENo303`
const re = /(?:Soyad(ı|i))(.*?)(?:S(ı|i)n(ı|i)f(ı|i)|no|numara)/gmi
const res = []
let match;
while ((match = re.exec(str)) !== null) res.push(match[2])
console.log(res)
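The capture-group approach can also be wrapped in a small helper. This is only a sketch: extractSurname is a hypothetical name, the named group surname needs an engine with ES2018 named-group support, and it assumes the field always ends at Sınıfı, no, numara or a digit as described in the question. The last sample string is made up to exercise the digit case:

```javascript
// Hypothetical helper: returns the text between "Soyadı" and the first
// terminator (Sınıfı / no / numara / a digit), or null when nothing matches.
function extractSurname(text) {
  const re = /soyad[ıi](?<surname>.+?)(?:s[ıi]n[ıi]f[ıi]|n(?:o|umara)|[0-9])/i;
  const match = re.exec(text);
  return match ? match.groups.surname : null;
}

const samples = [
  "AdıMEVLÜTSoyadıERTANĞASınıfıE10/ENo303",
  "AdıMEVLÜTSoyadıERTANĞAnumaraE10/ENo303",
  "AdıMEVLÜTSoyadıERTANĞAnoE10/ENo303",
  "AdıMEVLÜTSoyadıERTANĞA303", // invented sample: digit as terminator
];
const surnames = samples.map(extractSurname);
console.log(surnames); // ERTANĞA for each sample
```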

Capture only letter followed by letter, excluding some words - Regex

I need to capture a letter in a string followed by a letter, excluding some specific words. I have the following string in LaTeX:
22+2p+p^{pp^{2p+pp}}+\delta+\pi+sqrt(2p)+\\frac{2}+{2p}+ppp+2P+\sqrt+xx+\to+p2+\pi+px+ab+\alpha
I want to add * between the letters, but the following words should be left untouched:
\frac
\delta
\pi
\sqrt
\alpha
The output should be as follows:
22+2p+p^{p*p^{2p+p*p}}+\delta+\pi+\sqrt(2p)+\\frac{2}+{2p}+p*p*p+2P+\sqrt(9)+x*x+\to+p2+\pi+p*x+a*b+\alpha
The letters are dynamic entries, which can be any of the alphabet. I thought about using "positive lookbehind" but its support is limited.
You can achieve the result you want with a string replace with callback, using a regex:
(delta|frac|pi|sqrt|alpha|to)|([a-z](?=[a-z]))
that matches one of the excluded words in group 1, or a letter that is followed by another letter in group 2. In the callback, if group 1 is present, it is returned unchanged; otherwise group 2 is returned followed by a *:
let str = '22+2p+p^{pp^{2p+pp}}+\\delta+\\pi+\\sqrt(2p)+\\\\frac{2}+{2p}+ppp+2P+\\sqrt(9)+xx+\\to+p2+\\pi+px+ab+\\alpha';
const replacer = (m, p1, p2) => {
return p1 ? p1 : (p2 + '*');
}
console.log(str.replace(/(delta|frac|pi|sqrt|alpha|to)|([a-z](?=[a-z]))/gi, replacer));
You can do something like this:
const str = "22+2p+p^{pp^{2p+pp}}+\\delta+\\pi+\\sqrt(2p)+\\\\frac{2}+{2p}+ppp+2P+\\sqrt+xx+\\to+p2+\\pi+px+ab+\\alpha";
const result = str.replace(/\\?[a-zA-Z]{2,}/g, (v) => {
if (v.startsWith('\\')) {
return v;
}
return v.split("").join("*");
});
console.log(result);
This matches every run of 2 or more consecutive letters, optionally preceded by a \. In the replace function, if the match does not start with \, its letters are split and joined with *; otherwise the match is returned unchanged.
You could use negative lookbehind to solve this.
const regex = /(?<!\\{1,})(\b[a-zA-Z]{2,}\b)/g;
const str = `22+2p+p^{pp^{2p+pp}}+\\delta+\\pi+\\sqrt(2p)+\\\\frac{2}+{2p}+ppp+2P+\\sqrt+xx+\\to+p2+\\pi+px+ab+\\alpha`;
let result = str.replace(regex, function(match) {
return match.split("").join("*");
});
console.log("Match: ",str.match(regex).toString());
console.log(result);

regex to extract numbers starting from second symbol

Sorry for one more to the tons of regexp questions but I can't find anything similar to my needs. I want to output the string which can contain number or letter 'A' as the first symbol and numbers only on other positions. Input is any string, for example:
---INPUT--- -OUTPUT-
A123asdf456 -> A123456
0qw#$56-398 -> 056398
B12376B6f90 -> 12376690
12A12345BCt -> 1212345
What I tried is replace(/[^A\d]/g, '') (I use JS), which almost does the job except the case when there's A in the middle of the string. I tried to use ^ anchor but then the pattern doesn't match other numbers in the string. Not sure what is easier - extract matching characters or remove unmatching.
I think you can do it like this using a negative lookahead and then replace with an empty string.
In a non-capturing group (?:, use a negative lookahead (?! to assert that what follows is neither an A at the beginning of the string (^A) nor a digit (\d). If the assertion holds, match any character with . and repeat the group one or more times.
(?:(?!^A|\d).)+
var pattern = /(?:(?!^A|\d).)+/g;
var strings = [
"A123asdf456",
"0qw#$56-398",
"B12376B6f90",
"12A12345BCt"
];
for (var i = 0; i < strings.length; i++) {
console.log(strings[i] + " ==> " + strings[i].replace(pattern, ""));
}
You can match and capture desired and undesired characters within two different sides of an alternation, then replace those undesired with nothing:
^(A)|\D
JS code:
var inputStrings = [
"A-123asdf456",
"A123asdf456",
"0qw#$56-398",
"B12376B6f90",
"12A12345BCt"
];
console.log(
inputStrings.map(v => v.replace(/^(A)|\D/g, "$1"))
);
You can use the following regex: /(^A|\d)/g
var arr = ['A123asdf456','0qw#$56-398','B12376B6f90','12A12345BCt', 'A-123asdf456'],
result = arr.map(s => s.match(/(^A|\d)/g).join(''));
console.log(result);
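The sample table from the question can serve as a quick check. A sketch, assuming the alternation pattern ^(A)|\D from the answer above; normalize and cases are illustrative names:

```javascript
// Keep a leading "A" (captured in group 1) and drop every other non-digit.
// When the \D branch matches, group 1 did not participate, so "$1" inserts nothing.
const normalize = (s) => s.replace(/^(A)|\D/g, "$1");

// Input -> expected output pairs copied from the question.
const cases = [
  ["A123asdf456", "A123456"],
  ["0qw#$56-398", "056398"],
  ["B12376B6f90", "12376690"],
  ["12A12345BCt", "1212345"],
];

for (const [input, expected] of cases) {
  console.log(input, "->", normalize(input), normalize(input) === expected);
}
```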

Split string by all spaces except those in parentheses

I'm trying to split the following line on spaces:
var line = "Text (what is)|what's a story|fable called|named|about {Search}|{Title}"
but I want it to ignore the spaces within parentheses. This should produce an array with:
var words = ["Text", "(what is)|what's", "a", "story|fable", "called|named|about", "{Search}|{Title}"];
I know this should involve some sort of regex with line.match(). Bonus points if the regex removes the parentheses. I know that word.replace() would get rid of them in a subsequent step.
Use the following approach with a regex pattern based on negative lookahead assertions:
var line = "Text (what is)|what's a story|fable called|named|about {Search}|{Title}",
words = line.split(/(?!\(.*)\s(?![^(]*?\))/g);
console.log(words);
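(?!\(.*) is a negative lookahead placed in front of the separator \s; since the next character has to be whitespace, it can never see a ( there, so it does not filter anything out
(?![^(]*?\)) rejects a separator \s when a ) follows it with no ( in between, that is, when the space sits inside parentheses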
Not a single regexp: this removes the parentheses and splits the text by spaces. Note that it also splits on the space that was inside the parentheses.
var words = line.replace(/[\(\)]/g,'').split(" ");
One approach which is useful in some cases is to replace spaces inside parens with a placeholder, then split, then unreplace:
var line = "Text (what is)|what's a story|fable called|named|about {Search}|{Title}";
var result = line.replace(/\((.*?)\)/g, m => m.replace(/ /g, 'SPACE'))
.split(' ')
.map(x => x.replace(/SPACE/g, ' '));
console.log(result);
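A match-based sketch that also covers the bonus request. It assumes parentheses are balanced and not nested; TOKEN_RE and splitOutsideParens are names made up for this example:

```javascript
// A token is a run of "(...)" groups and non-space characters.
// The parenthesized branch comes first so that spaces inside (...) are
// consumed as part of the token instead of ending it.
const TOKEN_RE = /(?:\([^)]*\)|\S)+/g;

// Hypothetical helper: split on spaces outside parentheses and
// optionally drop the parentheses themselves.
function splitOutsideParens(line, stripParens = false) {
  const tokens = line.match(TOKEN_RE) ?? [];
  return stripParens ? tokens.map((t) => t.replace(/[()]/g, "")) : tokens;
}

const input = "Text (what is)|what's a story|fable called|named|about {Search}|{Title}";
console.log(splitOutsideParens(input));
console.log(splitOutsideParens(input, true));
```

The order of the alternation matters: with \S first, the opening parenthesis would be matched as an ordinary character and the token would end at the space after what.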

Matching whole words with Javascript's Regex with a few restrictions

I am trying to create a regex that can extract all words from a given string that only contain alphanumeric characters.
Yes
yes absolutely
#no
*NotThis
orThis--
Good *Bad*
1ThisIsOkay2 ButNotThis2)
Words that should have been extracted: Yes, yes, absolutely, Good, 1ThisIsOkay2
Here is the work I have done thus far:
/(?:^|\b)[a-zA-Z0-9]+(?=\b|$)/g
I found this expression, which works in Ruby (with some tweaking), but I have not been able to convert it to a JavaScript regex.
Use /(?:^|\s)\w+(?!\S)/g to match 1 or more word chars in between start of string/whitespace and another whitespace or end of string:
var s = "Yes\nyes absolutely\n#no\n*NotThis\norThis-- \nGood *Bad*\n1ThisIsOkay2 ButNotThis2)";
var re = /(?:^|\s)\w+(?!\S)/g;
var res = s.match(re).map(function(m) {
return m.trim();
});
console.log(res);
Or another variation:
var s = "Yes\nyes absolutely\n#no\n*NotThis\norThis-- \nGood *Bad*\n1ThisIsOkay2 ButNotThis2)";
var re = /(?:^|\s)(\w+)(?!\S)/g;
var res = [];
var m; // declare the match variable instead of leaking an implicit global
while ((m = re.exec(s)) !== null) {
res.push(m[1]);
}
console.log(res);
Pattern details:
(?:^|\s) - either start of string or whitespace (consumed, that is why trim() is necessary in Snippet 1)
\w+ - 1 or more word chars (in Snippet 2, captured into Group 1 used to populate the resulting array)
(?!\S) - negative lookahead that fails the match if the word chars are immediately followed by a non-whitespace char.
You can do this (where s is your string) to get all the words:
var m = s.split(/\s+/).filter(function(i) { return !/\W/.test(i); });
If you want to replace those words instead, you can do this:
var res = s.split(/(\s+)/).map(function(i) { return i.replace(/^\w+$/, "#");}).join('');
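In engines that support lookbehind (ES2018 and later), the leading whitespace does not have to be consumed, so neither trim() nor a capture group is needed. A sketch, using [a-zA-Z0-9] rather than \w so that underscores are excluded; sampleText and alnumWords are illustrative names:

```javascript
const sampleText =
  "Yes\nyes absolutely\n#no\n*NotThis\norThis-- \nGood *Bad*\n1ThisIsOkay2 ButNotThis2)";

// (?<!\S) : start of string or whitespace on the left
// (?!\S)  : end of string or whitespace on the right
const alnumWords = sampleText.match(/(?<!\S)[a-zA-Z0-9]+(?!\S)/g) ?? [];
console.log(alnumWords); // Yes, yes, absolutely, Good, 1ThisIsOkay2
```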
