Regex string with pattern - javascript

For Regex fans... What I have is this string:
"Lorem ipsum dolor FOO IO BAR BA"
I'd like to extract the Title, and an Array of the UPPERCASE suffixes:
"Lorem ipsum dolor"
["FOO", "IO", "BAR", "BA"]
Here's my attempt:
function retrieveGroups( string )
{
var regexp = new RegExp(/(FOO|BAR|BA|IO)/g);
var groups = string.match( regexp ) || [];
var title = string.replace( regexp, "" );
return {title:title, groups:groups};
}
results in:
title : "Lorem ipsum dolor ",
groups : ["FOO" , "IO", "BAR", "BA"]
which is great, but it won't handle cases like this:
LoremFOO ipBAsum IO dolor FOO
where, in that case, I need only ["FOO"] in the resulting groups.
The rule seems simple...
Get the title.
Title could be all uppercase ("LOREM IPSUM").
Get an array of uppercase suffixes.
Groups (FOO, BAR, IO, BA) might not be present in the string.
Don't match a group name unless it is a suffix and is preceded by whitespace.
Start matching from the end of the string (if possible?) so that duplicate group names encountered earlier in the string are not matched (see the problem example above).
I've also tried to string.replace(regexp, function(val) .... but I'm not sure how it could help...
Don't know if it helps but fiddle is here. Thank you!

To get an array of uppercase suffixes.
> "Lorem ipsum dolor FOO IO BAR BA".match(/\b[A-Z]+\b(?!\s+\S*[^A-Z\s]\S*)/g)
[ 'FOO',
'IO',
'BAR',
'BA' ]
> "LoremFOO ipBAsum IO dolor FOO".match(/\b[A-Z]+\b(?!\s+\S*[^A-Z\s]\S*)/g)
[ 'FOO' ]
To get the title array.
> "LoremFOO ipBAsum IO dolor FOO".match(/^.*?(?=\s*\b[A-Z]+\b(?:\s+[A-Z]+\b|$))/g)
[ 'LoremFOO ipBAsum IO dolor' ]
> "Lorem ipsum dolor FOO IO BAR BA".match(/^.*?(?=\s*\b[A-Z]+\b(?:\s+[A-Z]+\b|$))/g)
[ 'Lorem ipsum dolor' ]
Update:
> "LoremFOO ipBAsum IO dolor FOO".match(/\b(?:FOO|BAR|BA|IO)\b(?!\s+\S*[^A-Z\s]\S*)/g)
[ 'FOO' ]
\b is a word boundary; it matches between a word character and a non-word character.
(?:FOO|BAR|BA|IO)\b matches FOO or BAR or BA or IO and also the following word boundary,
(?!\s+\S*[^A-Z\s]\S*) only if it's not followed by one or more whitespace characters, zero or more non-space characters, a character other than whitespace or an uppercase letter, and again zero or more non-space characters. So this fails for IO because it's followed by a word which contains at least one lowercase letter. (?!...) is a negative lookahead assertion.
> "Lorem ipsum dolor FOO IO BAR BA".match(/\b(?:FOO|BAR|BA|IO)\b(?!\s+\S*[^A-Z\s]\S*)/g)
[ 'FOO',
'IO',
'BAR',
'BA' ]
You could also use a regex based on a positive lookahead. (?=...) is a positive lookahead assertion.
> "LoremFOO ipBAsum IO dolor FOO".match(/\b(?:FOO|BAR|BA|IO)\b(?=\s+(?:FOO|BAR|BA|IO)\b|$)/g)
[ 'FOO' ]
To get the title array.
> "Lorem ipsum dolor FOO IO BAR BA".match(/^.*?(?=\s*\b(?:FOO|BAR|BA|IO)\b(?:\s+(?:FOO|BAR|BA|IO)\b|$))/g)
[ 'Lorem ipsum dolor' ]
> "LoremFOO ipBAsum IO dolor FOO".match(/^.*?(?=\s*\b(?:FOO|BAR|BA|IO)\b(?:\s+(?:FOO|BAR|BA|IO)\b|$))/g)
[ 'LoremFOO ipBAsum IO dolor' ]
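The patterns above repeat the FOO|BAR|BA|IO alternation several times. A minimal sketch of building both regexes from a list of names could look like this. The helper name buildSuffixRegexes is hypothetical, and it assumes the names contain only letters or digits, so nothing needs escaping.

```javascript
// Hypothetical helper: builds the suffix and title regexes from a list of group names.
// Assumption: names are plain letters/digits, so no regex escaping is needed.
function buildSuffixRegexes(names) {
  var alt = "(?:" + names.join("|") + ")";
  return {
    // A group name directly followed by another group name or by the end of the string.
    suffix: new RegExp("\\b" + alt + "\\b(?=\\s+" + alt + "\\b|$)", "g"),
    // Shortest prefix that is followed by such a group name.
    title: new RegExp("^.*?(?=\\s*\\b" + alt + "\\b(?:\\s+" + alt + "\\b|$))")
  };
}

var rx = buildSuffixRegexes(["FOO", "BAR", "BA", "IO"]);
console.log("LoremFOO ipBAsum IO dolor FOO".match(rx.suffix)); // [ 'FOO' ]
console.log("LoremFOO ipBAsum IO dolor FOO".match(rx.title)[0]); // LoremFOO ipBAsum IO dolor
```

Note that match() returns null when the string has no suffix at all, so real code would need a guard before indexing into the title result.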

Maybe this is what you are looking for:
function retrieveGroups( string )
{
var regexp = new RegExp(/^(.*?)\s*([ A-Z]+)*$/);
var result = string.match( regexp ) || [];
var title = result[1];
var groups = result[2] ? result[2].trim().split(/\s+/) : []; // result[2] is undefined when there are no suffixes
return {title:title, groups:groups};
}
Edit:
Here a solution for a fixed set of Uppercase Words:
function retrieveGroups( string )
{
var regexp = new RegExp(/^(.*?)\s*((?:\s|FOO|BAR|IO|BA)+)?$/);
var result = string.match( regexp ) || [];
var title = result[1];
var groups = result[2] ? result[2].trim().split(/\s+/) : []; // result[2] is undefined when there are no suffixes
return {title:title, groups:groups};
}

By using Avinash's RegEx one can extract all the valid suffixes.
The title would be all text before the first suffix.
So the final JavaScript code will look like below:
var arr = ['Lorem ipsum dolor FOO IO BAR BA', 'LoremFOO ipBAsum IO dolor FOO']
arr.forEach(function(str) {
var o = retrieveGroups(str);
alert("Parsed title = " + o.title + ", groups=" + o.groups);
});
function retrieveGroups( string ) {
var regex = /\b(?:FOO|BAR|BA|IO)\b(?=\s+(?:FOO|BAR|BA|IO)\b|$)/g
var groups = string.match( regex ) || [];
var title = string.replace( regex, '').trim();
return {'title':title, 'groups':groups};
}
Here is DEMO
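The question also asks about matching from the end of the string. A sketch without lookaheads, assuming the groups are always separated by whitespace, is to split into tokens and pop known names off the end. The function name retrieveGroupsFromEnd and the ALLOWED list are only for illustration.

```javascript
// Fixed set of group names taken from the question.
var ALLOWED = ["FOO", "BAR", "BA", "IO"];

// Sketch: walk whitespace-separated tokens from the end of the string.
function retrieveGroupsFromEnd(string) {
  var tokens = string.trim().split(/\s+/);
  var groups = [];
  // Pop tokens off the end while they are known group names.
  while (tokens.length > 0 && ALLOWED.indexOf(tokens[tokens.length - 1]) !== -1) {
    groups.unshift(tokens.pop());
  }
  // Joining with a single space normalizes any repeated whitespace in the title.
  return { title: tokens.join(" "), groups: groups };
}

console.log(retrieveGroupsFromEnd("LoremFOO ipBAsum IO dolor FOO"));
// { title: 'LoremFOO ipBAsum IO dolor', groups: [ 'FOO' ] }
```

One caveat: a string made up only of group names ends up with an empty title, which may or may not be what you want for an all-uppercase title.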

Related

How to use regex to extract boolean operators followed by words until the next operator?

I'm trying to put together a relatively simple expression for extracting boolean string operators (AND, OR, NOT, etc) coming from user input, in a way that the resulting array of matches would contain words and the preceding operator until the next operator:
const query = 'lorem AND ipsum dolor OR fizz NOT buzz';
results should be like:
[
['AND', 'ipsum dolor'],
['OR', 'fizz'],
['NOT', 'buzz']
]
I've created this for getting single words after each operator, which is fine:
^(\w+\s?)+?|(AND) (\w+)|(OR) (\w+)|(NOT) (\w+)
then tried to modify it to handle multiple words after an operator in order to obtain the above result, but it's always greedy and captures the whole string input:
(AND|OR|NOT) (\w+\s?)+ (?:AND|OR|NOT)
UPDATE
I've figured it out, but I'm not sure how pretty or efficient it is:
^(\w+)|(AND|OR|NOT) (.*?(?= AND|OR|NOT))|(AND|OR|NOT) .*?$
You might also use a negative lookahead to assert that the words after the operator do not start with any of the alternatives:
\b(AND|OR|NOT) ((?!AND|OR|NOT)\b\w+(?: (?!AND|OR|NOT)\w+)*)
const regex = /\b(AND|OR|NOT) ((?!AND|OR|NOT)\b\w+(?: (?!AND|OR|NOT)\w+)*)/gm;
const str = `lorem AND ipsum dolor OR fizz NOT buzz`;
let m;
let result = [];
while ((m = regex.exec(str)) !== null) {
result.push([m[1], m[2]]);
}
console.log(result);
I don't think you can get there purely with regular expressions in JavaScript, but you can get awfully close:
const query = 'lorem AND ipsum dolor OR fizz NOT buzz';
const rex = /\b(AND|OR|NOT|NEAR)\b\s*(.*?)\s*(?=$|\b(?:AND|OR|NOT|NEAR)\b)/ig;
const result = [...query.matchAll(rex)].map(([_, op, text]) => [op, text]);
console.log(result);
The regex /\b(AND|OR|NOT|NEAR)\b\s*(.*?)\s*(?=$|\b(?:AND|OR|NOT|NEAR)\b)/ig looks for:
A word break
One of your operators (capturing it)
A word break
Zero or more whitespace chars
A non-greedy match for anything (capturing it)
Zero or more whitespace chars
Either another operator or the end of the string
The map after the matchAll call is just there to remove the initial array entry (the one with the full text of the match). I've done it with destructuring, but you could use slice instead:
const result = [...query.matchAll(rex)].map(match => match.slice(1));
const query = 'lorem AND ipsum dolor OR fizz NOT buzz';
const rex = /\b(AND|OR|NOT|NEAR)\b\s*(.*?)\s*(?=$|\b(?:AND|OR|NOT|NEAR)\b)/ig;
const result = [...query.matchAll(rex)].map(match => match.slice(1));
console.log(result);
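If the operator list needs to be configurable, the same pattern can be built from an array. This is only a sketch: parseQuery is a made-up name, it assumes the operators are plain words (no regex metacharacters), and it also returns the text before the first operator, which the regex above skips.

```javascript
// Hypothetical helper: same regex as above, built from a list of operators.
function parseQuery(query, operators) {
  // Assumption: operators are plain words, so no escaping is needed.
  const alt = operators.join("|");
  const rex = new RegExp(`\\b(${alt})\\b\\s*(.*?)\\s*(?=$|\\b(?:${alt})\\b)`, "gi");
  const pairs = [...query.matchAll(rex)].map(([, op, text]) => [op, text]);
  // Text before the first operator (the whole query if there is none).
  const first = query.search(new RegExp(`\\b(?:${alt})\\b`, "i"));
  const head = (first === -1 ? query : query.slice(0, first)).trim();
  return { head, pairs };
}

console.log(parseQuery("lorem AND ipsum dolor OR fizz NOT buzz", ["AND", "OR", "NOT"]));
// { head: 'lorem', pairs: [ [ 'AND', 'ipsum dolor' ], [ 'OR', 'fizz' ], [ 'NOT', 'buzz' ] ] }
```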
I'm not able to do that with a regexp, but this is a super simple solution that could work.
let q = 'lorem AND ipsum dolor OR fizz NOT buzz';
let special = ['AND', 'OR', 'NOT'];
let fullResult = [];
let skip = true;
q.split(' ').forEach( word => {
if ( special.indexOf(word) !== -1 ) {
fullResult.push([word]);
skip = false;
} else if (!skip){
fullResult[fullResult.length-1].push(word);
}
});
console.log(fullResult);

Replace a character of a string

I have a string that looks like this: [TITLE|prefix=a].
From that string, the text |prefix=a is dynamic. So it could be anything or empty. I would like to replace (in that case) [TITLE|prefix=a] with [TITLE|prefix=a|suffix=z].
So the idea is to replace ] from a string that starts with [TITLE with |suffix=z].
For instance, if the string is [TITLE|prefix=a], it should be replaced with [TITLE|prefix=a|suffix=z]. If it's [TITLE], it should be replaced with [TITLE|suffix=z] and so on.
How can I do this with RegEx?
I have tried it this way but it gives an error:
let str = 'Lorem ipsum [TITLE|prefix=a] dolor [sit] amet [consectetur]';
const x = 'TITLE';
const regex = new RegExp(`([${x})*]`, 'gi');
str = str.replace(regex, "$1|suffix=z]");
console.log(str);
I have also tried to escape the characters [ and ] with new RegExp(`(\[${x})*\]`, 'gi'); but that didn't help.
You need to remember to use \\ in a regular string literal to define a single literal backslash.
Then, you need a pattern like
/(\[TITLE(?:\|[^\][]*)?)]/gi
See the regex demo. Details:
(\[TITLE(?:\|[^\][]*)?) - Capturing group 1:
\[TITLE - [TITLE text
(?:\|[^\][]*)? - an optional occurrence of a | char followed with 0 or more chars other than ] and [
] - a ] char.
Inside your JavaScript code, use the following to define the dynamic pattern:
const regex = new RegExp(`(\\[${x}(?:\\|[^\\][]*)?)]`, 'gi');
See JS demo:
let str = 'Lorem ipsum [TITLE|prefix=a] dolor [sit] amet [consectetur] [TITLE]';
const x = 'TITLE';
const regex = new RegExp(`(\\[${x}(?:\\|[^\\][]*)?)]`, 'gi');
str = str.replace(regex, "$1|suffix=z]");
console.log(str);
// => Lorem ipsum [TITLE|prefix=a|suffix=z] dolor [sit] amet [consectetur] [TITLE|suffix=z]
I think the solution to your problem would look similar to this:
let str = 'Lorem ipsum [TITLE|prefix=a] dolor [sit] amet [consectetur]';
str = str.replace(/(\[TITLE)(\|[^\]]*)?\]/g, "$1$2|suffix=z]"); // restricted to [TITLE...] so [sit] and [consectetur] stay untouched
console.log(str);
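Since the tag name comes from a variable, it may be worth escaping it before it goes into the RegExp constructor. A rough sketch follows; escapeRegExp and addSuffix are hypothetical helper names, not built-in functions.

```javascript
// Hypothetical helper: escapes regex metacharacters in a dynamic tag name.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Appends |suffix=<suffix> to every [tag] or [tag|...] occurrence.
function addSuffix(str, tag, suffix) {
  const regex = new RegExp(`(\\[${escapeRegExp(tag)}(?:\\|[^\\][]*)?)]`, "gi");
  // A replacer function keeps any "$" in suffix from being read as a replacement pattern.
  return str.replace(regex, (_, head) => `${head}|suffix=${suffix}]`);
}

console.log(addSuffix("Lorem [TITLE|prefix=a] dolor [sit] amet [TITLE]", "TITLE", "z"));
// Lorem [TITLE|prefix=a|suffix=z] dolor [sit] amet [TITLE|suffix=z]
```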

Compare two strings and return different stretch

I have two strings like these:
var str1 = "lorem ipsum dolor sit amet";
var str2 = "ipsum dolor";
And I'm trying to compare them and get, as a result, an array with everything that doesn't match, plus the match itself: one element for the beginning and another for the ending.
E.g., in the case above, the result should be an array like this:
result[0] // should keep the beginning **"lorem "** (with the blank space after the word)
result[1] // should keep the ending **" sit amet"** (with the blank space before the word)
result[2] // should keep the match **"ipsum dolor"**
All I got was an elegant solution posted by #Mateja Petrovic, but I can't get these values separately.
Just like this:
const A = "ipsum dolor"
const B = "lorem ipsum dolor sit amet"
const diff = (diffMe, diffBy) => diffMe.split(diffBy).join('')
const C = diff(B, A)
console.log(C) // "lorem  sit amet" (both parts joined, with two spaces in the middle)
I'm really stuck! Any idea?
Thanks a lot!
const diff = (str, query) => [...str.split(query), query];
/*
1. we split the string into an array whenever query is found
2. we spread (...) the array into a new array
3. we add the query at the end of the new array
*/
var str = "lorem ipsum dolor sit amet";
var query = "ipsum dolor";
const result = diff(str, query)
console.log(result)
console.log(result[0])
console.log(result[1])
console.log(result[2])
I don't think what you are asking is logically possible in general. A "string" can be a single letter, which is going to get very messy. At best, maybe something like this?
var str1 = "lorem ipsum dolor sit amet";
var str2 = "ipsum dolor";
var parts = str1.split(str2);
console.log(parts); // ['lorem ', ' sit amet']
You could split the target string by the search string, and append the latter to the solution.
const haystack = 'lorem ipsum dolor sit amet';
const needle = 'ipsum dolor';
const diff = (needle, haystack) =>
[...haystack.split(needle), needle]
console.log(diff(needle,haystack));
// ["lorem ", " sit amet", "ipsum dolor"]
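One thing to keep in mind with split(): if the needle occurs more than once you get more than two pieces, and if it doesn't occur at all you get the whole string back. A small sketch using indexOf and slice, which only looks at the first occurrence and returns null on a miss (splitAround is just an illustrative name):

```javascript
// Sketch: split around the FIRST occurrence only, and report a miss as null.
function splitAround(haystack, needle) {
  const start = haystack.indexOf(needle);
  if (start === -1) return null; // needle not found
  const before = haystack.slice(0, start);
  const after = haystack.slice(start + needle.length);
  return [before, after, needle];
}

console.log(splitAround("lorem ipsum dolor sit amet", "ipsum dolor"));
// [ 'lorem ', ' sit amet', 'ipsum dolor' ]
```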

Regexp to match words two by two (or n by n)

I'm looking for a regexp which is able to match words n by n. Let's say n := 2, it would yield:
Lorem ipsum dolor sit amet, consectetur adipiscing elit
Lorem ipsum, ipsum dolor, dolor sit, sit amet (notice the comma here), consectetur adipiscing, adipiscing elit.
I have tried using \b for word boundaries to no avail. I am really lost trying to find a regex capable of giving me n words... /\b(\w+)\b(\w+)\b/i can't cut it, and even tried multiple combinations.
Regular expressions are not really what you need here, other than to split the input into words. The problem is that this problem involves matching overlapping substrings, which regexp is not very good at, especially the JavaScript flavor. Instead, simply break the input into words, and a quick piece of JavaScript will generate the "n-grams" (which is the correct term for your n-word groups).
const input = "Lorem ipsum dolor sit amet, consectetur adipiscing elit";
// From an array of words, generate n-grams.
function ngrams(words, n) {
const results = [];
for (let i = 0; i < words.length - n + 1; i++)
results.push(words.slice(i, i + n));
return results;
}
console.log(ngrams(input.match(/\w+./g), 2));
A word boundary \b does not consume any characters. It is a zero-width assertion that only asserts a position: between a word char and a non-word char, between the start of the string and a word char, or between a word char and the end of the string.
You need to use \s+ to consume whitespaces between words, and use capturing inside a positive lookahead technique to get overlapping matches:
var n = 2;
var s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit";
var re = new RegExp("(?=(\\b\\w+(?:\\s+\\w+){" + (n-1) + "}\\b))", "g");
var res = [], m;
while ((m=re.exec(s)) !== null) { // Iterating through matches
if (m.index === re.lastIndex) { // This is necessary to avoid
re.lastIndex++; // infinite loops with
} // zero-width matches
res.push(m[1]); // Collecting the results (group 1 values)
}
console.log(res);
The final pattern will be built dynamically since you need to pass a variable to the regex, thus you need a RegExp constructor notation. It will look like
/(?=(\b\w+(?:\s+\w+){1}\b))/g
And it will find all locations in the string that are followed with the following sequence:
\b - a word boundary
\w+ - 1 or more word chars
(?:\s+\w+){n-1} - n-1 sequences of:
\s+ - 1 or more whitespaces
\w+ - 1 or more word chars
\b - a trailing word boundary
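For what it's worth, the same lookahead pattern can be wrapped in a small function. This is only a sketch: wordGroups is a made-up name, it assumes n is at least 1, and it relies on String.prototype.matchAll, which steps past zero-width matches on its own.

```javascript
// Sketch: overlapping n-word groups via a capturing group inside a lookahead.
function wordGroups(text, n) {
  const re = new RegExp("(?=(\\b\\w+(?:\\s+\\w+){" + (n - 1) + "}\\b))", "g");
  // matchAll advances past zero-width matches itself, so no manual lastIndex bump is needed.
  return Array.from(text.matchAll(re), m => m[1]);
}

console.log(wordGroups("Lorem ipsum dolor sit amet, consectetur adipiscing elit", 2));
// [ 'Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet',
//   'consectetur adipiscing', 'adipiscing elit' ]
```

Because \s+ cannot cross the comma, no group spans "amet, consectetur", which matches the expected output in the question.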
Not a pure regex solution, but it works and is easy to read and understand:
let input = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit';
let matches = input.match(/(\w+,? \w+)/g)
.map(str => str.replace(',', ''));
console.log(matches) // ['Lorem ipsum', 'dolor sit', 'amet consectetur', 'adipiscing elit']
Warning: Does not check for no matches (match() returns null)

Regex get the middle section of each word javascript

So essentially what I'm trying to do is loop through every word in a html document and replace the first letter of each word with 'A', the second - second last letter with 'b' and the last letter with 'c', completely replacing the word. I'm not sure if regular expressions are the way to go about doing this (should I instead be using for loops and checking each character?) however I'll ask anyway.
Currently I'm doing:
document.body.innerHTML = document.body.innerHTML.replace(/\b(\w)/g, 'A'); to get the first letter of each word
document.body.innerHTML = document.body.innerHTML.replace(/\w\b/g, 'c'); to get the last letter of each word
So if I had the string: Lorem ipsum dolor sit amet I can currently make it Aorec Apsuc Aoloc Aic Amec but I'd like to do Abbbc Abbbc Abbbc Abc Abbc in javascript.
Any help is much appreciated - regular expressions really confuse me.
You almost got it.
str = "Lorem ipsum dolor sit amet"
str = str
.replace(/\w/g, 'b')
.replace(/\b\w/g, 'A')
.replace(/\w\b/g, 'c')
document.write(str);
Fancier replacement rules can be handled with a callback function, e.g.
str = "Lorem ipsum dolor sit amet"
str = str.replace(/\w+/g, function(word) {
if (word === "dolor")
return word;
return word.length < 2 ? 'A' : 'A' + 'b'.repeat(word.length - 2) + 'c'; // repeat() throws on a negative count, so guard one-letter words
});
document.write(str);
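To target only the middle section of each word, as the title asks, \B (a non-word-boundary) can be placed on both sides of \w. A minimal sketch, with maskWords as a hypothetical name and plain strings instead of document.body.innerHTML:

```javascript
// Sketch: \B\w\B matches a word character that has word characters on both sides,
// i.e. the "middle section" of each word.
function maskWords(text) {
  return text
    .replace(/\B\w\B/g, "b") // middle letters
    .replace(/\b\w/g, "A")   // first letter of each word
    .replace(/\w\b/g, "c");  // last letter (a one-letter word ends up as "c")
}

console.log(maskWords("Lorem ipsum dolor sit amet"));
// Abbbc Abbbc Abbbc Abc Abbc
```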
