Regex for bible references - javascript

I am working on some code for an online bible. I need to identify when references are written out. I have looked all through stackoverflow and tried various regex examples but they all seem to fail with single books (eg Jude) as they require a number to proceed the book name. Here is my solution so far :
/((?:(I+|1st|2nd|3rd|First|Second|Third|1|2|3))?( )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Jude|Revelation))(([ .)\n|])([^a-zA-Z]))([\d])?([:\d])?([:\d])?/gi;
Here is the regex code with some sample text to match:
https://regexr.com/5pfg3
On the above you will notice, Jude if double spaced will work. If I put a full stop after it will work. I know the issue is this section :
(([ .)\n|])([^a-zA-Z]))
What I want is to match spaces, brackets, new lines BUT not a letter.

It does not match as it expects 2 characters using (([ .)\n|])([^a-zA-Z])) where the second one can not be a char a-zA-Z due to the negated character class, so it can not match the s in Jude some.
What you might do is make the character class in the second part optional, if you intent to keep all the capture groups.
You could also add word boundaries \b to make the pattern a bit more performant as it is right now.
See a regex demo
(Note that Jude is listed twice in the alternation)
If you only want to use 3 groups, you can write the first part as:
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?
The second part will be the alternation with the names, and in the 3rd part you can match one of the character class followed by the digit part and make that optional as a whole (so you don't match a trailing space or char after the word without the digits).
(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
The full pattern will look like
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Revelation)\b(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
Regex demo of the full pattern

Related

RegExp Match all text parts except given words

I have a text and I need to match all text parts except given words with regexp
For example if text is ' Something went wrong and I could not do anything ' and given words are 'and' and 'not' then the result must be ['Something went wrong', 'I could', 'do anything']
Please don't advise me to use string.split() or string.replace() and etc. I know a several ways how I can do this with build-in methods. I'm wonder if there a regex which can do this, when I will execute text.math(/regexp/g)
Please note that the regular expression must work at least in Chrome, Firefox and Safari versions not lower than the current one by 3! At the moment of asking this question the actual versions are 100.0, 98.0.2 and 15.3 respectively. For example you can not use lookbehind feature in Safari
Please, before answering my question, go to https://regexr.com/ and check your answer!. Your regular expression should highlight all parts of a sentence, including spaces between words of need parts and except empty spaces around need parts, except for the given words
Before asking this question I tried to do my own search but this links didn't help me. I also tried non accepted answers:
Match everything except for specified strings
Regex: match everything but a specific pattern
Regex to match all words except a given list
Regex to match all words except a given list (2)
Need to find a regular expression for any word except word1 or word2
Matching all words except one
Javascript match eveything except given words
It's possible with only using match and lookaheads in javascript.
/\b(?=\w)(?!(?:and|not)\b).*?(?=\s+(?:and|not)\b|\s*$)/gi
Test on RegExr here
Basically match the start of a word that's not a restricted word
\b(?=\w)(?!(?:and|not)\b)
Then a lazy match till the next whitespaces and restricted word, or the end of the line without including last whitespaces.
.*?(?=\s+(?:and|not)\b|\s*$)
Test Snippet :
const re = /\b(?=\w)(?!(?:and|not)\b).*?(?=\s+(?:and|not)\b|\s*$)/gi
let str = ` Something went wrong and I could not do anything `;
let arr = str.match(re);
console.log(arr);
See Edit further down.
You can use this regex, which only use look ahead:
/(?!and|not)\b.*?(?=and|not|$)/g
Explanation:
(?!and|not) - negative look ahead for and or not
\b - match word boundary, to prevent matching nd and ot
.*? - match any char zero or more times, as few as possible
(?=and|not|$) - look ahead for and or not or end of text
If your text has multiple lines you can add the m flag (multiline). Alternatively you can replace dot (.) with [\s\S].
Edit:
I have changed it a little so spaces around the forbidden words are removed:
/(?!and|not)\b\w.*?(?= and| not|$)/g
I have added a \w character match to push the start of the match after the space and added spaces in the look ahead.
Edit2: (to handle multiple spaces around words):
You were very close! All you need is a \s* before the dollar sign and specified words:
/(?!and|not|\s)\b.*?(?=\s*(and|not|$))/g
Updated link: regexr.com

RegExp capturing non-match

I have a regex for a game that should match strings in the form of go [anything] or [cardinal direction], and capture either the [anything] or the [cardinal direction]. For example, the following would match:
go north
go foo
north
And the following would not match:
foo
go
I was able to do this using two separate regexes: /^(?:go (.+))$/ to match the first case, and /^(north|east|south|west)$/ to match the second case. I tried to combine the regexes to be /^(?:go (.+))|(north|east|south|west)$/. The regex matches all of my test cases correctly, but it doesn't correctly capture for the second case. I tried plugging the regex into RegExr and noticed that even though the first case wasn't being matched against, it was still being captured.
How can I correct this?
Try using the positive lookbehind feature to find the word "go".
(north|east|south|west|(?<=go ).+)$
Note that this solution prevents you from including ^ at the start of the regex, because the text "go" is not actually included in the group.
You have to move the closing parenthesis to the end of the pattern to have both patterns between anchors, or else you would allow a match before one of the cardinal directions and it would still capture the cardinal direction at the end of the string.
Then in the JavaScript you can check for the group 1 or group 2 value.
^(?:go (.+)|(north|east|south|west))$
^
Regex demo
Using a lookbehind assertion (if supported), you might also get a match only instead of capture groups.
In that case, you can match the rest of the line, asserting go to the left at the start of the string, or match only 1 of the cardinal directions:
(?<=^go ).+|^(?:north|east|south|west)$
Regex demo

how to negate a capture group?

Using a javascript regexp, I would like to find strings like "/foo" or "/foo d/" but not "/foo /"; ie, "annotation character", then either word with no terminating annotation, or multiple words, where the termination comes at the end of the phrase (with no space). Complicating the situation, there are three possible annotation symbols: /, \ and |.
I've tried something like:
/(?:^|\s)([\\\/|])((?:[\w_-]+(?![^\1]+[\w_-]\1))|(?:[\w\s]+[\w](?=\1)))/g
That is, start with space, then annotation, then
word not followed by (anything but annotation) then letter and annotation... or
possibly multiple words, immediately followed by annotation character.
The problem is the [^\1]: this doesn't read as "anything but the annotation character" in the angle brackets.
I could repeat the whole phrase three times, one for each annotation character. Any better ideas?
As you've mentioned, [^\1] doesn't work - it matches anything that is not the character 1. In JavaScript, you can negate \1 by using a lookahead: (?:(?!\1).)* . This is not as efficient, but it works.
Your pattern can be written as:
([\\\/|])([\w\-]+(?:(?:(?!\1).)*[\w\-]\1)?)
Working example at Regex101
\w already contains underscore.
Instead of alternation (a|ab) I'm using an optional group (a(?:b)?) - we always match the first word, with optional further words and tags.
You may still want to include (?:^|\s) at the beginning.

Regex for adding a space or period for new sentence under certain conditions

I'm trying to create a regex (implementable in Javascript/Node.js) to:
Add a space whenever a letter or character (A-Z, a-z, !##$%^&*(), etc. but
NOT a number) is followed by a period which is then is followed by a
capital letter (with no space in between) and/or,
Add a period (.) whenever a whitespace is followed by a single capital letter (A-Z, a-z but NOT a number or character) UNLESS there is more than one capital letter such as in an acronym, and/or,
Add a period (.) whenever any character, letter or number is NOT followed by anything else in a string.
For example, in the first case:
This is a sample sentence.This is a sample new sentence.
Should become:
This is a sample sentence. This is a sample new sentence.
In the second case, for example:
This is a sample sentence This is a sample new sentence.
Should become:
This is a sample sentence. This is a sample new sentence.
But also, in the second case:
This is a sample sentence with TEST This is a sample new sentence.
Should become:
This is a sample sentence with TEST. This is a sample new sentence.
In the third case, for example:
This is a sample sentence. This is a sample new sentence
Should become:
This is a sample sentence. This is a sample new sentence.
Notice the differences in placement of periods and spacing for these examples that I am looking to search for and change.
I've searched for variants of this and found some, but nothing that fits the exact criteria listed above. I'm only worried about periods and spaces at this point in time, not other types of punctuation unless there is a more universal solution that can apply to more than just these cases. I'm looking to use this to start cleaning up the grammar in some log files and other areas.
I apologize in advance if this reads too complicated. Leave a comment and I will gladly clarify if needed.
While I should include the standard caution against using programmatic means to mess around with natural languages (which are very complex and difficult for computers to understand), a series of regexes that (when run in sequence on the string) do what you want appears below.
For the first scenario:
s/([^0-9.])\.([^0-9])/\1. \2/g
For the second scenario:
s/([^.]) ([A-Z][a-z])/\1. \2/g
For the third scenario:
s/([^.])$/\1./g
To break it down a little:
s/A/B/g means "replace every occurrence of regex A in the text with B".
(A) means "capture A so we can use it again later" (this is known as a capture group).
[^0-9.] means "match all characters that are not numeric characters or the period character". This is a negated character class.
\. matches the literal period (".") character.
$ is the end-of-line anchor - it matches the end of the string.
\1 and \2 refer to the first and second capture groups, respectively.
So, basically, what these regexes do is to capture the stuff around the region to be modified, then replace that stuff plus the region with the stuff plus the modification.
For the 1st case, use the following to match and replace with space:
(?=\.[^\d\s])
For the 2nd and 3rd cases, use the following regex to match and replace with .
(?<!\.)$|(?=\s[A-Z])

Need help to find the right regex pattern to match

my RegEx is not working the way i think, it should.
[^a-zA-Z](\d+-)?OSM\d*(?![a-zA-Z])
I will use this regex in a javascript, to check if a string match with it.
Should match:
12345612-OSM34
12-OSM34
OSM56
7-OSM
OSM
Should not match:
-OSM
a-OSM
rOSMann
rOSMa
asdrOSMa
rOSM89
01-OSMann
OSMond
23OSM
45OSM678
One line, represents a string in my javascript.
https://www.regex101.com/r/xQ0zG1/3
The rules for matching:
match OSM if it stands alone
optional match if line starts with digit/s AND is followed by a -
optional match if line ends with digit/s
match all 3 above combined
no match if line starts with a character/word except OSM
no match if line end with chracter/word except OSM
I Hope someone can help.
You can use the following simplified pattern using anchors:
^(?:\d+-)?OSM\d*$
The flags needed (if matching multi-line paragraph) would be: g for global match and m for multi-line match, so that ^ and $ match the begin/end of each line.
EDIT
Changed the (\d+-) match to (?:\d+-) so that it doesn't group.
[^a-zA-Z](\d+-)?OSM\d*(?![a-zA-Z])
[^a-zA-Z] In regex, you specify what you want, not what you don't want. This piece of code says there must be one character that isn't a letter. I believe what you wanted to say is to match the start of a line. You don't need to specify that there's no letter, you're about to specify what there will be on the line anyway. The start of a regex is represented with ^ (outside of brackets). You'll have to use the m flag to make the regex multi-line.
(\d+-)? means one or more digits followed by a - character. The ? means this whole block isn't required. If you don't want foreign digits, you might want to use [0-9] instead, but it's not as important. This part of the code, you got right. However, if you don't need capture blocks, you could write (?:) instead of ().
\d*(?![a-zA-Z]) uses lookahead, but you almost never need to do that. Again, specifying what you don't want is a bad idea because then I could write OSMé and it would match because you didn't specify that é is forbidden. It's much simpler to specify what is allowed. In your case since you want to match line ends. So instead, you can write \d*$ which means zero or more digits followed by the end of the line.
/^(?:\d+-)?OSM\d*$/gm is the final result.

Categories

Resources