Regex - collect characters between obligatory prefix and optional groups - javascript

I'm writing a regex in JavaScript that finds all occurrences of several groups, all of them optional.
I can now collect the optional groups (thanks to @wiktor-stribiżew). What is still missing is capturing the characters between the new- prefix and the first group that occurs.
Input:
new-rooms-3-area-50
new-poland-warsaw-rooms-3-area-50-bar
new-some-important-location-rooms-3-asdads-anything-area-50-uiop
new-another-location-area-50-else
Requested output:
["rooms-3", "area-50"]
["poland-warsaw", "rooms-3", "area-50"]
["some-important-location", "rooms-3", "area-50"]
["another-location", "area-50"]
My current regex is
new-(?:.*?(rooms-\d+))?.*?(area-\d+)
I suspect that collecting .* between new- and rooms|area is a poor solution.
Online demo: https://regex101.com/r/QvmYN0/5
Note: I created two separate questions because they concern two separate problems. I hope this helps anyone who runs into similar problems in the future.

I think it is better to split by steps like this:
// Split on \n to work with each line
const getArrays = input => input.split`\n`.map(x => {
  // Split on the dashes that are followed by "area-" or "rooms-"
  return x.split(/-(?=area-|rooms-)/g).map(y => {
    // Remove "new-" from the start, or the non-digits trailing the number
    // (no g flag, so only the first match is removed)
    return y.replace(/^new-|\D+$/, '');
  // Drop the empty strings
  }).filter(y => y);
});
var txt = `new-rooms-3-area-50
new-poland-warsaw-rooms-3-area-50-bar
new-some-important-location-rooms-3-asdads-anything-area-50-uiop
new-another-location-area-50-else`;
console.log(getArrays(txt));
EDIT:
The above code returns the requested output. However, you might prefer an array of models instead:
// initial state of your model
getModel = () => ({
new: '',
area: 0,
rooms: 0,
});
// the function that will return the array of models:
getModels = input => input.split`\n`.map(line => {
var model = getModel();
// set delimiters:
var delimiters = new RegExp(
'-(?=(?:' + Object.keys(model).join`|` + ')-)', 'g');
// set the properties of your model:
line.split(delimiters).forEach(item => {
// remove non-digits after the last digit:
item.replace(/(\d)\D+$/, '$1')
// set each matched property:
.replace(/^([^-]+)-(.*)/,
(whole_match, key, val) => model[key] = val);
});
return model;
});
var txt = `new-rooms-3-area-50
new-poland-warsaw-rooms-3-area-50-bar
new-some-important-location-rooms-3-asdads-anything-area-50-uiop
new-another-location-area-50-else`;
console.log(getModels(txt));
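For comparison, here is a minimal single-regex sketch that also captures the text between new- and the first group. It is not taken from any of the answers: the names partsPattern and collectParts are assumptions made for illustration. Like the regex in the question, it requires an area-N group and presumes that rooms-N, when present, comes before it.

```javascript
// Hypothetical helper, not from the original answers.
// Group 1: optional text between "new-" and the first rooms-N / area-N group.
// Group 2: optional rooms-N. Group 3: area-N (required, as in the question's regex).
const partsPattern =
  /^new-(?:(?!(?:rooms|area)-\d)(.+?)-(?=(?:rooms|area)-\d))?(?:.*?(rooms-\d+))?.*?(area-\d+)/;

function collectParts(line) {
  const m = line.match(partsPattern);
  // Drop the full match, then drop the groups that did not participate.
  return m ? m.slice(1).filter(part => part !== undefined) : [];
}

console.log(collectParts('new-poland-warsaw-rooms-3-area-50-bar'));
```

The negative lookahead keeps group 1 from starting on a rooms-N or area-N group, and the positive lookahead stops it at the dash right before the first such group.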

This is the high-end solution that does everything at once.
It doesn't split or massage the data; it just takes the input as it is (and as it always will be).
It may not be for beginners, but rather for the more experienced.
(Note that I don't know JS, but I can tell you, this took about 20 minutes
googling about strings. This is just too easy, do people really get paid
to do this ?!)
This uses exec to push each element ( group 2 )
and create an array of records, one for each line.
   ( ^ new )                     # (1)
|
   (                             # (2 start)
        (?: rooms | area )
        - \d+
     |  (?:
             (?:
                  (?!
                       (?: rooms | area )
                       - \d+
                  )
                  [a-z]
             )+
             (?:
                  -
                  (?:
                       (?!
                            (?: rooms | area )
                            - \d+
                       )
                       [a-z]
                  )+
             )+
        )
   )                             # (2 end)
var strTarget = "\
new-rooms-3-area-50\n\
new-poland-warsaw-rooms-3-area-50-bar\n\
new-some-important-location-rooms-3-asdads-anything-area-50-uiop\n\
new-another-location-area-50-else\n\
";
var RxLine = /^new.+/mg;
var RxRecord = /(^new)|((?:rooms|area)-\d+|(?:(?:(?!(?:rooms|area)-\d+)[a-z])+(?:-(?:(?!(?:rooms|area)-\d+)[a-z])+)+))/g;
var records = [];
var matches
var match;
while( (match = RxLine.exec( strTarget )) ){
var line = match[0];
matches = [];
while( (match = RxRecord.exec( line )) ){
if ( match[2] )
matches.push( match[2] );
}
records.push( matches );
}
console.log( records );

Here you go:
new-(.*?)?-?(rooms-\d+|area-\d+).*?(area-\d+)?.*
Demo: https://regex101.com/r/Qvdkdx/1

Related

Use just regexp to split a string into a 'tuple' of filename and extension?

I know there are easier ways to get file extensions with JavaScript, but partly to practice my regexp skills I wanted to try and use a regular expression to split a filename into two strings, before and after the final dot (. character).
Here's what I have so far
const myRegex = /^((?:[^.]+(?:\.)*)+?)(\w+)?$/
const [, filename1, extension1] = 'foo.baz.bing.bong'.match(myRegex);
// filename1 = 'foo.baz.bing.'
// extension1 = 'bong'
const [, filename2, extension2] = 'one.two'.match(myRegex);
// filename2 = 'one.'
// extension2 = 'two'
const [, filename3, extension3] = 'noextension'.match(myRegex);
// filename3 = 'noextension'
// extension3 = undefined
I've tried to use a lookahead to say 'only match a literal . if it's followed by a word that ends in a .', by changing (?:\.)* to (?:\.(?=\w+.))*, like so:
/^((?:[^.]+(?:\.(?=(\w+\.))))*)(\w+)$/gm
But I want to exclude that final period using just the regexp, and preferably have 'noextension' matched in the initial group. How can I do that with just a regexp?
Here is my regexp scratch file: https://regex101.com/r/RTPRNU/1
For the first capture group, you could start the match with 1 or more word characters, then optionally repeat a . followed by 1 or more word characters.
Then you can use an optional non-capture group that matches a . and captures 1 or more word characters in group 2.
As the second non-capture group is optional, the first repetition should be non-greedy.
^(\w+(?:\.\w+)*?)(?:\.(\w+))?$
The pattern matches
^ Start of string
( Capture group 1
\w+(?:\.\w+)*? Match 1+ word characters, and optionally repeat . and 1+ word characters
) Close group 1
(?: Non capture group to match as a whole
\.(\w+) Match a . and capture 1+ word chars in capture group 2
)? Close non capture group and make it optional
$ End of string
Regex demo
const regex = /^(\w+(?:\.\w+)*?)(?:\.(\w+))?$/;
[
"foo.baz.bing.bong",
"one.two",
"noextension"
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m[1]);
console.log(m[2]);
console.log("----");
}
});
Another option, as @Wiktor Stribiżew posted in the comments, is to use a non-greedy dot to match any character for the filename:
^(.*?)(?:\.(\w+))?$
Regex demo
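As a rough usage sketch of the non-greedy dot pattern above (the names lazyDotPattern and splitFilename are hypothetical, chosen only for this example), the two groups can be read with destructuring. One edge case worth knowing: for a dotfile such as '.htaccess' this pattern puts everything after the dot into the extension group and leaves the name empty.

```javascript
// Hypothetical wrapper around the pattern ^(.*?)(?:\.(\w+))?$ shown above.
const lazyDotPattern = /^(.*?)(?:\.(\w+))?$/;

function splitFilename(name) {
  // Skip index 0 (the full match); default missing groups to ''.
  const [, base = '', ext = ''] = name.match(lazyDotPattern) ?? [];
  return [base, ext];
}

console.log(splitFilename('foo.baz.bing.bong'));
console.log(splitFilename('noextension'));
console.log(splitFilename('.htaccess'));
```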
Just wanted to do a late pitch-in on this, because I wanted to split a filename into a "name" part and an "extension" part and wasn't able to find any good solution supporting all my test cases. I wanted filenames starting with "." to be returned as the "name", and I wanted to support files without any extension too.
So I'm using this line, which handles all my use cases:
const [name, ext] = (filename.match(/(.+)\.(.+)/) || ['', filename]).slice(1)
Which will give this output
'.htaccess' => ['.htaccess', undefined]
'foo' => ['foo', undefined]
'foo.png' => ['foo', 'png']
'foo.bar.png' => ['foo.bar', 'png']
'' => ['', undefined]
I find that to be what I want.
If you really want to use regex, I would suggest using two regexes:
// example with 'foo.baz.bing.bong'
const firstString = /^.+(?=\.\w+)./g // match 'foo.baz.bing.'
const secondString = /\w+$/g // match 'bong'
How about something more explicit and accurate, without lookarounds ...
named groups variant ... /^(?<noextension>\w+)$|(?<filename>\w+(?:\.\w+)*)\.(?<extension>\w+)$/
without named groups ... /^(\w+)$|(\w+(?:\.\w+)*)\.(\w+)$/
Both of the just shown variants can be shortened to 2 capture groups instead of the above variant's 3 capture groups, which in my opinion makes the regex easier to work with at the cost of being less readable ...
named groups variant ... /(?<filename>\w+(?:\.\w+)*?)(?:\.(?<extension>\w+))?$/
without named groups ... /(\w+(?:\.\w+)*?)(?:\.(\w+))?$/
const testData = [
'foo.baz.bing.bong',
'one.two',
'noextension',
];
// https://regex101.com/r/RTPRNU/5
const regXTwoNamedFileNameCaptures = /(?<filename>\w+(?:\.\w+)*?)(?:\.(?<extension>\w+))?$/;
// https://regex101.com/r/RTPRNU/4
const regXTwoFileNameCaptures = /(\w+(?:\.\w+)*?)(?:\.(\w+))?$/;
// https://regex101.com/r/RTPRNU/3
const regXThreeNamedFileNameCaptures = /^(?<noextension>\w+)$|(?<filename>\w+(?:\.\w+)*)\.(?<extension>\w+)$/
// https://regex101.com/r/RTPRNU/3
const regXThreeFileNameCaptures = /^(\w+)$|(\w+(?:\.\w+)*)\.(\w+)$/
console.log(
'based on 2 named file name captures ...\n',
testData, ' =>',
testData.map(str =>
regXTwoNamedFileNameCaptures.exec(str)?.groups ?? {}
)
);
console.log(
'based on 2 unnamed file name captures ...\n',
testData, ' =>',
testData.map(str => {
const [
match,
filename,
extension,
] = str.match(regXTwoFileNameCaptures) ?? [];
//] = regXTwoFileNameCaptures.exec(str) ?? [];
return {
filename,
extension,
}
})
);
console.log(
'based on 3 named file name captures ...\n',
testData, ' =>',
testData.map(str => {
const {
filename = '',
extension = '',
noextension = '',
} = regXThreeNamedFileNameCaptures.exec(str)?.groups ?? {};
return {
filename: filename || noextension,
extension,
}
})
);
console.log(
'based on 3 unnamed file name captures ...\n',
testData, ' =>',
testData.map(str => {
const [
match,
noextension = '',
filename = '',
extension = '',
] = str.match(regXThreeFileNameCaptures) ?? [];
//] = regXThreeFileNameCaptures.exec(str) ?? [];
return {
filename: filename || noextension,
extension,
}
})
);

How to extract string between dash with regex in javascript?

I have a string in javascript:
const str = "bugfix/SOME-9234-add-company"; // output should be SOME-9234
const str2 = "SOME/SOME-933234-add-company"; // output should be SOME-933234
const str3 = "test/SOME-5559234-add-company"; // output should be SOME-5559234
and I want to extract the part that starts with SOME- and runs up to the next - character.
I tried these regexes, but they didn't work. What is the correct regex?
const s = "bugfix/SOME-9234-add-company";
const r1 = s.match(/SOME-([1-9])/);
const r2 = s.match(/SOME-(.*)/);
const r3 = s.match(/SOME-(.*)-$/);
console.log({ r1, r2, r3 });
You could use the /(SOME-[\d]+)/g regex, e.g.
const strings = [
"bugfix/SOME-9234-add-company", // output should be SOME-9234
"SOME/SOME-933234-add-company", // output should be SOME-933234
"test/SOME-5559234-add-company" // output should be SOME-5559234
];
strings.forEach(string => {
const regex = /(SOME-[\d]+)/g;
const found = string.match(regex);
console.log(found[0]);
});
const s = "bugfix/SOME-9234-add-company";
// This one is close, but will return only one digit
console.log( s.match(/SOME-([1-9])/) );
// What you wanted:
console.log( s.match(/SOME-([1-9]+)/) ); // note the +, meaning '1 or more'
// These are also close.
// This'll give you everything after 'SOME-':
console.log( s.match(/SOME-(.*)/) );
// This'll match only if the last character of the line is a -
console.log( s.match(/SOME-(.*)-$/) );
// What you wanted:
console.log( s.match(/SOME-(.*?)-/) ); // no end-of-line anchor, but 'ungreedy' (the ?), which means 'until the first - you encounter'
// Instead of [1-9], you can also use \d (any digit, including 0) for readability:
console.log( s.match(/SOME-(\d+)/) );
/(SOME-[^-]+)/ should work well (it captures everything that is not a hyphen after SOME-).
Or, if you know you have only digits, something close to what you tried:
/(SOME-[1-9]+)/
You were missing a + to match more than one character. Note that [1-9] excludes 0, so use \d if the number can contain a zero.
I also moved the parentheses to capture exactly what you show in the question (i.e., including the SOME part).
If you don't want to use a regex, then this could be a simple alternative:
const s = "bugfix/SOME-9234-add-company";
const str2 = "SOME/SOME-933234-add-company";
const str3 = "test/SOME-5559234-add-company";
const r1 = s.split("/")[1].split("-",2).join("-"); // "SOME-9234"
const r2 = str2.split("/")[1].split("-",2).join("-"); // "SOME-933234"
const r3 = str3.split("/")[1].split("-",2).join("-"); // "SOME-5559234"
In the patterns that you tried:
$ asserts the end of the string
[1-9] matches a single digit from 1 to 9, so not 0
.* matches any character except a newline, 0 or more times
There is no need for capturing groups, you could match:
\bSOME-\d+
See a regex demo
Note that match will return an array, from which you could take the 0 index.
[
"bugfix/SOME-9234-add-company",
"SOME/SOME-933234-add-company",
"test/SOME-5559234-add-company"
].forEach(s => console.log(s.match(/\bSOME-\d+/)[0]));
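One caveat with indexing the result directly: match returns null when nothing matches, so [0] would throw on a string without a SOME-&lt;digits&gt; part. A small defensive sketch follows; the function name extractTicket is an assumption made for this example, not something from the answers.

```javascript
// Hypothetical helper: returns the SOME-<digits> part, or null if absent.
function extractTicket(branch) {
  const m = branch.match(/\bSOME-\d+/);
  // match() yields null on no match, so guard before reading index 0.
  return m ? m[0] : null;
}

console.log(extractTicket('bugfix/SOME-9234-add-company'));
console.log(extractTicket('main'));
```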

Regex to match all symbols except a word

How do I write a regex that matches all symbols except a given word?
(.*) matches all symbols.
[^v] matches any symbol except the letter v.
But how do I match all symbols except a word?
Solution (written below):
((?:(?!here any word for block)[\s\S])*?)
or
((?:(?!here any word for block).)*?)
((?:(?!video)[\s\S])*?)
I want to find everything except |end| and replace everything except |end|.
What I tried (I need everything except |end|):
var str = '|video| |end| |water| |sun| |cloud|';
// May be:
//var str = '|end| |video| |water| |sun| |cloud|';
//var str = '|cloud| |video| |water| |sun| |end|';
str.replace(/\|((?!end|end$).*?)\|/gm, test_fun2);
function test_fun2(match, p1, offset, str_full) {
console.log("--------------");
p1 = "["+p1+"]";
console.log(p1);
console.log("--------------");
return p1;
}
Output console log:
--------------
[video]
--------------
--------------
[ ]
--------------
--------------
[ ]
--------------
--------------
[ ]
--------------
Example of what I need:
Any symbols except [video](
input - '[video](text-1 *******any symbols except: "[video](" ******* [video](text-2 any symbols) [video](text-3 any symbols) [video](text-4 any symbols) [video](text-5 any symbols)'
output - <div>text-1 *******any symbols except: "[video](" *******</div> <div>text-2 any symbols</div><div>text-3 any symbols</div><div>text-4 any symbols</div><div>text-5 any symbols</div>
Scenario 1
Use the best trick ever:
One key to this technique, a key to which I'll return several times, is that we completely disregard the overall matches returned by the regex engine: that's the trash bin. Instead, we inspect the Group 1 matches, which, when set, contain what we are looking for.
Solution:
s = s.replace(/\|end\||\|([^|]*)\|/g, function ($0, $1) {
return $1 ? "[" + $1 + "]" : $0;
});
Details
\|end\| - |end| is matched
| - or
\|([^|]*)\| - | is matched, any 0+ chars other than | are captured into Group 1, and then | is matched.
If Group 1 matched ($1 ?) the replacement occurs, else, $0, the whole match, is returned back to the result.
JS test:
console.log(
"|video| |end| |water| |sun| |cloud|".replace(/\|end\||\|([^|]*)\|/g, function ($0, $1) {
return $1 ? "[" + $1 + "]" : $0;
})
)
Scenario 2
Use
.replace(/\[(?!end])[^\]]*]\(((?:(?!\[video]\()[\s\S])*?)\)/g, '<div>$1</div>')
See the regex demo
Details
\[ - a [ char
(?!end]) - no end] allowed right after the current position
[^\]]* - 0+ chars other than ]
] - a ] char
\( - a ( char
((?:(?!\[video]\()[\s\S])*?) - Group 1 that captures any char ([\s\S]), 0 or more occurrences, but as few as possible (*?), that does not start a [video]( char sequence
\) - a ) char.
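To see the Scenario 2 pattern in action, here is a small runnable sketch on a simplified input. The function name wrapVideoLinks and the sample string are assumptions made for this example; the regex itself is the one given above.

```javascript
// Hypothetical wrapper around the Scenario 2 regex shown above.
function wrapVideoLinks(text) {
  // [label](body) becomes <div>body</div>, unless the label starts with "end]".
  return text.replace(
    /\[(?!end])[^\]]*]\(((?:(?!\[video]\()[\s\S])*?)\)/g,
    '<div>$1</div>'
  );
}

console.log(wrapVideoLinks('[video](text-1) [end](keep) [video](text-2)'));
```

The [end](keep) part is left untouched because the negative lookahead rejects the match right after its opening bracket.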
Something like this is better done in multiple steps. Also, if you're matching stuff, you should use match.
var str = '|video| |end| |water| |sun| |cloud|';
var matches = str.match(/\|.*?\|/g);
// strip pipe characters...
matches = matches.map(m=>m.slice(1,-1));
// filter out unwanted words
matches = matches.filter(m=>!['end'].includes(m));
// this allows you to add more filter words easily
// if you'll only ever need "end", just do (m=>m!='end')
console.log(matches); // ["video","water","sun","cloud"]
Notice how this is a lot easier to understand what's going on, and also much easier to maintain and change in future as needed.
You are on the right track. Here is what you need to do with regex:
var str = '|video| |end| |water| |sun| |cloud|';
console.log(str.replace(/(?!\|end\|)\|(\S*?)\|/gm, test_fun2));
function test_fun2(match, p1, offset, str_full) {
return "["+p1+"]";
}
And an explanation of what was wrong: you had your negative lookahead placed after the | character. That means the matching engine would do the following:
Match |video|, because the pattern works with it.
Grab the next |.
Find that the next text is end, which is excluded by the negative lookahead, and drop that |.
Grab the | immediately after end.
Grab the space and the next | character, since this passes the negative lookahead and also works with .*?
Continue grabbing the intermediate | | sequences, because the | at the beginning of each word was consumed by the previous match.
So you end up matching the following things
var str = '|video| |end| |water| |sun| |cloud|';
           ^^^^^^^     ^^^     ^^^   ^^^
|video| ______|         |       |     |
| | ____________________|       |     |
| | ____________________________|     |
| | __________________________________|
All because the |end match was dropped.
You can see this if you print out the matches
var str = '|video| |end| |water| |sun| |cloud|';
str.replace(/\|((?!end|end$).*?)\|/gm, test_fun2);
function test_fun2(match, p1, offset, str_full) {
console.log(match, p1, offset);
}
You will see that the second, third, and fourth matches are | |, the captured item p1 is a blank space (not displayed very well, but it is there), and the offsets at which they were found are 12, 20, and 26:
|video| |end| |water| |sun| |cloud|
01234567890123456789012345678901234
            ^       ^     ^
12 _________|       |     |
20 _________________|     |
26 _______________________|
The change I made was to look explicitly for the |end| pattern in a negative lookahead and to match only non-whitespace characters, so you don't grab | | again.
Also worth noting: you can move your filtering logic out of the regex and into the replacement callback. This simplifies the regex but makes your replacement more complex. Still, it's a fair tradeoff, as code is usually easier to maintain than a regex when the conditions get more complex:
var str = '|video| |end| |water| |sun| |cloud|';
//capturing word characters - an alternative to "non-whitespace"
console.log(str.replace(/\|(\w*)\|/gm, test_fun2));
function test_fun2(match, p1, offset, str_full) {
if (p1 === 'end') {
return match;
} else {
return "[" + p1 + "]"
}
}

Understanding this regular expression

function compress(source) {
var keys = {};
source.replace(
/([^=&]+)=([^&]*)/g,
function(full, key, value) {
keys[key] =
(keys[key] ? keys[key] + "," : "") + value;
return "";
}
);
var result = [];
for (var key in keys) {
result.push(key + "=" + keys[key]);
}
return result.join("&");
}
alert(compress("foo=1&foo=2&blah=a&blah=b&foo=3"));
I'm still confused by /([^=&]+)=([^&]*)/g. What are the + and * used for?
The ^ inside [] means NOT these characters, the + means one or more matches, and the () are groups. The * means any number of matches (0 or more).
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
So, looking at it, the pattern matches key=value pairs: one or more characters that are not = or &, then an =, then zero or more characters that are not &.
+ and * are called quantifiers. They determine how many times the preceding element (a single character, or a set or group written with [] or ()) may repeat.
/     start of regex
(     group 1 starts
[^    any character that is not
=&    an equals sign or an ampersand
]+    one or more of the above
)     group 1 ends
=     followed by an equals sign, followed by
(     group 2 starts
[^    any character that is not
&     an ampersand
]*    zero or more of the above
)     group 2 ends
/g    end of regex, with the global flag
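To make the difference between + and * concrete, here is a small sketch that lists what the two groups capture. The function name keyValuePairs is hypothetical and used only for this illustration. Because the key uses + it can never be empty, while the value uses * and may be empty.

```javascript
// Hypothetical helper: lists the [key, value] pairs the regex captures.
function keyValuePairs(query) {
  // matchAll needs the g flag; each result holds the groups at index 1 and 2.
  return Array.from(query.matchAll(/([^=&]+)=([^&]*)/g), m => [m[1], m[2]]);
}

console.log(keyValuePairs('foo=1&foo=2&blah='));
console.log(keyValuePairs('=x'));
```

In the first call the last pair has an empty value, which * allows. In the second call nothing matches, because + requires at least one key character before the equals sign.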

Extract all matches from given string

I have string:
=?windows-1256?B?IObH4cPM5dLJIA==?= =?windows-1256?B?x+HYyO3JIC4uLg==?= =?windows-1256?B?LiDH4djj5s3Hyg==?= =?windows-1256?B?Rlc6IOTP5skgKA==?=
I need to extract all matches between ?B? and ==?=.
As a result I need:
IObH4cPM5dLJIA
x+HYyO3JIC4uLg
LiDH4djj5s3Hyg
Rlc6IOTP5skgKA
P.S. This string is taken from a textarea, and after the function runs, the script should replace the current textarea value with the result. I've tried everything.
var result = str.substring(str.indexOf('?B?')+3,str.indexOf('==?='));
This works almost the way I need, but it only finds the first match. And this doesn't work:
function Doit(){
var str = $('#test').text();
var pattern = /(?B?)([\s\S]*?)(==?=)/g;
var result = str.match(pattern);
for (var i = 0; i < result.length; i++) {
$('#test').html(result);
};
}
? has a special meaning in regex: it matches the preceding character 0 or 1 times.
So ? should be escaped as \?
So the regex should be
(?:\?B\?)(.*?)(?:==\?=)
[\s\S] is only needed if the text between the markers can contain newlines; otherwise . is enough.
The metacharacter ? needs escaping, i.e. \? so it is treated as a literal ?.
[\s\S] is important as it matches all characters including newlines.
var m,
pattern = /\?B\?([\s\S]*?)==\?=/g;
while ( m = pattern.exec( str ) ) {
console.log( m[1] );
}
// IObH4cPM5dLJIA
// x+HYyO3JIC4uLg
// LiDH4djj5s3Hyg
// Rlc6IOTP5skgKA
Or a longer but perhaps clearer way of writing the above loop:
m = pattern.exec( str );
while ( m != null ) {
console.log( m[1] );
m = pattern.exec( str );
}
The String match method does not return capture groups when the global flag is used, but only the full match itself.
Instead, the capture group matches of a global match can be collected from multiple calls to the RegExp exec method. Index 0 of a match is the full match, and the further indices correspond to each capture group match. See MDN exec.
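In engines that support String.prototype.matchAll (ES2020), the capture groups of a global regex can also be collected without a manual exec loop. A minimal sketch, in which the function name extractEncodedWords is an assumption made for this example:

```javascript
// Hypothetical helper: collects the text between ?B? and ==?= for every match.
function extractEncodedWords(str) {
  // matchAll requires the g flag and yields one match array per occurrence;
  // index 1 of each match array is the first capture group.
  return Array.from(str.matchAll(/\?B\?([\s\S]*?)==\?=/g), m => m[1]);
}

const sample =
  '=?windows-1256?B?IObH4cPM5dLJIA==?= =?windows-1256?B?x+HYyO3JIC4uLg==?=';
console.log(extractEncodedWords(sample).join('\n'));
```

The joined string could then be written back to the textarea in place of the original value.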
