RegExp Match all text parts except given words

RegExp Match all text parts except given words - javascript

I have a text and I need to match all text parts except given words with regexp
For example if text is ' Something went wrong and I could not do anything ' and given words are 'and' and 'not' then the result must be ['Something went wrong', 'I could', 'do anything']
Please don't advise me to use string.split() or string.replace() and etc. I know a several ways how I can do this with build-in methods. I'm wonder if there a regex which can do this, when I will execute text.math(/regexp/g)
Please note that the regular expression must work at least in Chrome, Firefox and Safari versions not lower than the current one by 3! At the moment of asking this question the actual versions are 100.0, 98.0.2 and 15.3 respectively. For example you can not use lookbehind feature in Safari
Please, before answering my question, go to https://regexr.com/ and check your answer!. Your regular expression should highlight all parts of a sentence, including spaces between words of need parts and except empty spaces around need parts, except for the given words
Before asking this question I tried to do my own search but this links didn't help me. I also tried non accepted answers:
Match everything except for specified strings
Regex: match everything but a specific pattern
Regex to match all words except a given list
Regex to match all words except a given list (2)
Need to find a regular expression for any word except word1 or word2
Matching all words except one
Javascript match eveything except given words

It's possible with only using match and lookaheads in javascript.
/\b(?=\w)(?!(?:and|not)\b).*?(?=\s+(?:and|not)\b|\s*$)/gi
Test on RegExr here
Basically match the start of a word that's not a restricted word
\b(?=\w)(?!(?:and|not)\b)
Then a lazy match till the next whitespaces and restricted word, or the end of the line without including last whitespaces.
.*?(?=\s+(?:and|not)\b|\s*$)
Test Snippet :
const re = /\b(?=\w)(?!(?:and|not)\b).*?(?=\s+(?:and|not)\b|\s*$)/gi
let str = ` Something went wrong and I could not do anything `;
let arr = str.match(re);
console.log(arr);

See Edit further down.
You can use this regex, which only use look ahead:
/(?!and|not)\b.*?(?=and|not|$)/g
Explanation:
(?!and|not) - negative look ahead for and or not
\b - match word boundary, to prevent matching nd and ot
.*? - match any char zero or more times, as few as possible
(?=and|not|$) - look ahead for and or not or end of text
If your text has multiple lines you can add the m flag (multiline). Alternatively you can replace dot (.) with [\s\S].
Edit:
I have changed it a little so spaces around the forbidden words are removed:
/(?!and|not)\b\w.*?(?= and| not|$)/g
I have added a \w character match to push the start of the match after the space and added spaces in the look ahead.
Edit2: (to handle multiple spaces around words):
You were very close! All you need is a \s* before the dollar sign and specified words:
/(?!and|not|\s)\b.*?(?=\s*(and|not|$))/g
Updated link: regexr.com

Related

RegExp capturing non-match

I have a regex for a game that should match strings in the form of go [anything] or [cardinal direction], and capture either the [anything] or the [cardinal direction]. For example, the following would match:
go north
go foo
north
And the following would not match:
foo
go
I was able to do this using two separate regexes: /^(?:go (.+))$/ to match the first case, and /^(north|east|south|west)$/ to match the second case. I tried to combine the regexes to be /^(?:go (.+))|(north|east|south|west)$/. The regex matches all of my test cases correctly, but it doesn't correctly capture for the second case. I tried plugging the regex into RegExr and noticed that even though the first case wasn't being matched against, it was still being captured.
How can I correct this?

Try using the positive lookbehind feature to find the word "go".
(north|east|south|west|(?<=go ).+)$
Note that this solution prevents you from including ^ at the start of the regex, because the text "go" is not actually included in the group.

You have to move the closing parenthesis to the end of the pattern to have both patterns between anchors, or else you would allow a match before one of the cardinal directions and it would still capture the cardinal direction at the end of the string.
Then in the JavaScript you can check for the group 1 or group 2 value.
^(?:go (.+)|(north|east|south|west))$
^
Regex demo
Using a lookbehind assertion (if supported), you might also get a match only instead of capture groups.
In that case, you can match the rest of the line, asserting go to the left at the start of the string, or match only 1 of the cardinal directions:
(?<=^go ).+|^(?:north|east|south|west)$
Regex demo

Regex for bible references

I am working on some code for an online bible. I need to identify when references are written out. I have looked all through stackoverflow and tried various regex examples but they all seem to fail with single books (eg Jude) as they require a number to proceed the book name. Here is my solution so far :
/((?:(I+|1st|2nd|3rd|First|Second|Third|1|2|3))?( )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Jude|Revelation))(([ .)\n|])([^a-zA-Z]))([\d])?([:\d])?([:\d])?/gi;
Here is the regex code with some sample text to match:
https://regexr.com/5pfg3
On the above you will notice, Jude if double spaced will work. If I put a full stop after it will work. I know the issue is this section :
(([ .)\n|])([^a-zA-Z]))
What I want is to match spaces, brackets, new lines BUT not a letter.

It does not match as it expects 2 characters using (([ .)\n|])([^a-zA-Z])) where the second one can not be a char a-zA-Z due to the negated character class, so it can not match the s in Jude some.
What you might do is make the character class in the second part optional, if you intent to keep all the capture groups.
You could also add word boundaries \b to make the pattern a bit more performant as it is right now.
See a regex demo
(Note that Jude is listed twice in the alternation)
If you only want to use 3 groups, you can write the first part as:
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?
The second part will be the alternation with the names, and in the 3rd part you can match one of the character class followed by the digit part and make that optional as a whole (so you don't match a trailing space or char after the word without the digits).
(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
The full pattern will look like
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Revelation)\b(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
Regex demo of the full pattern

Finding words between special characters using Unicode regex

I have a working regular expression which matches the words below.
Input:
(T1.Test)
(AT.Test)
Match:
T1.Test
AT.Test
But when I try replacing /w with unicode \p{L}, the regex does not work properly anymore.
Current expression: /(?:\w+\()+|\b(\p{L}+(?:\.\p{L}+)?)\b(?!')/gu
Input:
(T1.Test)
(AT.Test)
(ワーク.Test)
Match:
Test
Test
Test
How do I make my regex works properly now it has unicode flag?
My expected output should be:
T1.Test
AT.Test
ワーク.Test

First of all \p{L} does not catch numbers, so (T1.Test) will not be matched, while with \w would be.
Your regex is diveded in two big OR parts "1 | 2":
(?:\w+\()+ this non capturing group is matching anything of the shape anyAmmountOfLetter(. If this has success will totally ignore the rest of the regex, I don't know if it was intentional. This for example will trigger your regex: aaa(333.6780) with aaa( as full match, but 0 groups as you are not capturing it.
\b(\p{L}+(?:\.\p{L}+)?)\b(?!') this requires that you start your expression with a word boundary. But \b is valid in between two characters (Regex Tutorial) only if one is a word character an the other is not.
In your case, your starting round bracket will not be matched against the word boundary so (クーク.Test) will not work, but 3クーク.Test) will.
For fix that you can use only the second part (if the first is not really needed for checking something else of what you had shown in the question inputs):
// slight edited, can use digits: (3123.123) => 3123.123
input.match(/[\b]*\(([\d\p{L}]+(?:\.[\d\p{L}]+)?)\)[\b]*(?!')/gu)
// slight edited, must start with letter: (A1.Test) works, (1A.Test) doesn't
input.match(/[\b]*\((\p{L}[\d\p{L}]*(?:\.[\d\p{L}]+)?)\)[\b]*(?!')/gu)
Also the last part \b(?!') is optional for the input cases you gave, but I suppose it is usefull for other purposes.
If you want to keep the regex simple for those inputs, this would also work:
// can use digits: (3123.123) => 3123.123
input.match(/\(([\p{L}\d]+(?:\.[\p{L}\d]+))\)/gu)
// must start with letter: (A1.Test) works, (1A.Test) doesn't
input.match(/\((\p{L}[\p{L}\d]*(?:\.[\p{L}\d]+))\)/gu)

Exclude characters from [ ] group regex, while still looking for characters

I want to match all the words that do not contain letter l. I tried this:
[a-z^k]+
But apparently ^ only works right behind [. If it was just letter l, I guess this would do:
[a-km-z]+
Of course apart from the fact that it just treats l-words as two words:
But this is not the real concern, the question remains just as in title:
Q: How do I search for list of characters, but exclude another list of characters?

You need to use \b word boundary to make sure that the match doesn't start and end within words.
\b[a-km-z]+\b
Alternatively, you can create an exclusion list using lookahead.
\b(?:(?![l])[a-z])+\b
Demo on regex101

Regexp: excluding a word but including non-standard punctuation

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
sequel = stringInputAsAbove;
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);
Console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, #stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.

Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ means here beginning of a line, not negation of next part, so
the \bselect\b^ means, select word followed by beginning of a
line. After removal of ^ regex start to match something
(DEMO) but it is still invalid.
in multiline text .* without modification will not match new line,
so regex will match only select...from in single lines, but if you
change it for (.|\n)* (as a simple example) it will match
multiline, but still invalid
the * is greede quantifire, so it will match as much a possible,
but if you use reluctant quantifire *?, regex will match to first
occurance of from word, and int will start to return relativly
correct result.
\bselect\b(?!from) means match separate select word which is not
directly followed by separate from word, so it would be
selectfrom somehow composed of separate words (because
select\bfrom) so (?!from) doesn't work and it is redundant
In effect you will get regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In third expression you meke same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( using ^ as (I assume) a negation, not beginning of a line. Remove ^ and you will again get relativly valid result (match from select through from to closing parathesis ) ).
Your second regex works similar to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relativly valid result", as I also think, that parsing SQL with regex could be very camplicated, so I am not sure if it will work in every case.
You can also try to use positive lookahead to match just position in text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
DEMO - the () was added to regex just to return beginning index of match in groups, so it would be easier to check it validity
Negation in regex
We use ^ as negation in character class, for example [^a-z] means match anything but not letter, so it will match number, symbol, whitespace, etc, but not letter from range a to z (Look here). But this negation is on a level of single character. I you use [^from] it will prevent regex from matching characters f,r,o and m (demo). Also the [^from]{4} will avoid matching from but also form, morf, etc.
To exlude the whole word from matching by regex, you need to use negative look ahead, like (?!from), which will fail to match, if there will be chosen word from fallowing given position. To avoid matching whole line containing from you could use ^(?!.*from.*).+$ (demo).
However in your case, you don't need to use this construction, because if you replace greedy quantifire .*\bfrom with .*?\bfrom it will match to first occurance of this word. Whats more it would couse problems. Take a look on this regex, it will not match anything because (?![\s\S]*from[\s\S]*) is not restricted by anything, so it will match only if there is no from after select, but we want to match also from! in effect this regex try to match and exclude from at once, and fail. so the (?!.*word.*) construction works much better to exclude matching line with given word.
So what to do if we don't what to match a word in a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from, but will not exclude from.
But if you would like to match only select...from not followed by ( then it is good idea to use (?!\() like. But in your regex (multiline, use of (.|\n)*? or [\s\S]*? it will cause to match up to next select...from part, because reluctant quantifire will chenge a plece where it need to match to make whole regex . In my opinion, good solution would be to use again:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap additional select..from and will not match if there is \( after select...from - check it here

Develop Reference

JavaScript is the programming language of the Web.

RegExp Match all text parts except given words - javascript

Related

RegExp capturing non-match

Regex for bible references

Finding words between special characters using Unicode regex

Exclude characters from [ ] group regex, while still looking for characters

Regexp: excluding a word but including non-standard punctuation

Categories

Resources