Whitespace causing problems in regex to capture addresses

I'm having trouble creating a regex to capture Icelandic home addresses.
Icelandic addresses can have a couple of formats:
address 3
address 3b
add-ress
add-ress 2453
ad dr ess
Basically, almost any phrase, followed by an optional number and letter.
I have come up with the following regex:
^(\D+)\s*?(\d+\w*)?
This works pretty well, except that the \D+ is greedy and always consumes the whitespace between the street/house name and the number.
I've tried many different quantifiers, and also positive and negative lookarounds, without success.
I know I can always trim the whitespace from the address in code after it has been captured, but I want to know if there is any way to do this properly using regex.

I would put the separating whitespace inside the optional number part, using a non-capturing group, so that it stays out of both captured groups:
^(\D+)(?:\s+(\d+\w*))?$
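As a rough check, the pattern above can be run against the sample addresses from the question. The helper name parseAddress is made up for this sketch and is not part of the original answer.

```javascript
// The separator whitespace sits in a non-capturing group together with the
// optional number, so neither capture group contains it.
const groupedRe = /^(\D+)(?:\s+(\d+\w*))?$/;

// Hypothetical helper: returns the street name and the optional house number.
function parseAddress(input) {
  const m = groupedRe.exec(input);
  if (m === null) return null;
  return { street: m[1], number: m[2] ?? null };
}

const samples = ["address 3", "address 3b", "add-ress", "add-ress 2453", "ad dr ess"];
for (const sample of samples) {
  console.log(JSON.stringify(sample), "->", JSON.stringify(parseAddress(sample)));
}
```

The trailing $ matters here: it forces the greedy \D+ to give the space back so that the optional group can take the number.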

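Instead of "one or more non-digits" (\D+), you want "one or more non-digits, of which the last one is also non-whitespace", i.e. "zero or more non-digits, plus one character that is neither a digit nor whitespace" (\D*[^\d\s]). The whitespace should also be greedy and the end anchored: with a lazy \s*? and no $, the engine matches zero spaces, skips the optional number group and stops there, so the number is never captured.
^(\D*[^\d\s])\s*(\d+\w*)?$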
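A minimal sketch of this idea in JavaScript follows. It uses \D*[^\d\s] for the name, a greedy \s* and a closing $, so the optional number is actually reached after the whitespace; the helper name splitAddress is an assumption for the example.

```javascript
// Group 1 must end on a character that is neither a digit nor whitespace,
// so trailing spaces can never become part of the street name.
const tightNameRe = /^(\D*[^\d\s])\s*(\d+\w*)?$/;

// Hypothetical helper returning [street, number]; number is null when absent.
function splitAddress(input) {
  const m = tightNameRe.exec(input);
  return m === null ? null : [m[1], m[2] ?? null];
}

console.log(splitAddress("address 3"));    // name and number separated
console.log(splitAddress("address   3b")); // several spaces are tolerated
console.log(splitAddress("ad dr ess"));    // no number at all
```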

Related

Regex for bible references

I am working on some code for an online bible. I need to identify when references are written out. I have looked all through Stack Overflow and tried various regex examples, but they all seem to fail with single books (e.g. Jude) as they require a number to precede the book name. Here is my solution so far:
/((?:(I+|1st|2nd|3rd|First|Second|Third|1|2|3))?( )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Jude|Revelation))(([ .)\n|])([^a-zA-Z]))([\d])?([:\d])?([:\d])?/gi;
Here is the regex code with some sample text to match:
https://regexr.com/5pfg3
In the sample above you will notice that Jude is matched if it is followed by two spaces, or if I put a full stop after it. I know the issue is this section:
(([ .)\n|])([^a-zA-Z]))
What I want is to match spaces, brackets, new lines BUT not a letter.

It does not match because (([ .)\n|])([^a-zA-Z])) expects 2 characters, and the second one cannot be a letter a-zA-Z due to the negated character class, so it cannot match the s in Jude some.
What you might do is make the character class in the second part optional, if you intend to keep all the capture groups.
You could also add word boundaries \b to make the pattern a bit more performant than it is right now.
(Note that Jude is listed twice in the alternation)
If you only want to use 3 groups, you can write the first part as:
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?
The second part will be the alternation with the names, and in the third part you can match one character from the character class followed by the digit part, and make that optional as a whole (so you don't match a trailing space or other character after the word when there are no digits).
(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
The full pattern will look like
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Revelation)\b(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
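To see how the three parts fit together, here is a small sketch that assembles them in JavaScript. The book list is deliberately cut down to a handful of names for readability (an assumption of this example; the full alternation above would be used in practice), and findReferences is a hypothetical helper name.

```javascript
// Deliberately shortened book list, for illustration only.
const books = ["Genesis", "Gen", "Samuel", "Sam", "John", "Jn", "Jude", "Revelation", "Rev"];

// Part 1: optional ordinal prefix followed by a single space.
const ordinalPart = String.raw`\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?`;
// Part 2: the book name, closed by a word boundary.
const bookPart = String.raw`(${books.join("|")})\b`;
// Part 3: optional separator plus chapter, with up to two ":number" parts.
const numberPart = String.raw`(?:[ .)\n|](\d+(?::\d+){0,2}\b))?`;

const referenceRe = new RegExp(ordinalPart + bookPart + numberPart, "gi");

// Hypothetical helper: lists every reference found in the text.
function findReferences(text) {
  return Array.from(text.matchAll(referenceRe), (m) => ({
    ordinal: m[1] ?? null,
    book: m[2],
    reference: m[3] ?? null,
  }));
}

console.log(findReferences("See Jude some day, 1 John 3:16, and Gen 1:1."));
```

Because the whole third part is optional, Jude followed by an ordinary word still matches as a book name on its own, which was the failing case in the question.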

Regex - Only detect if all single digits in all four octets [duplicate]

Overview:
I am trying to combine two REGEX queries into one:
\d+\.\d+\.\d+\.\d+
^(?!(10\.|169\.)).*$
I wrote this as a two-part query. The first part isolates IPs in a block of text; after I copy and paste the results, the second part selects everything that does not begin with 10 or 169.
Questions:
It seems like I am overcomplicating this:
Can anybody see a better way to do this?
Is there a way to combine these two queries?

Sure. Just put the anchored negative look ahead at the start:
^(?!10\.|169\.)\d+\.\d+\.\d+\.\d+$
Note: Unnecessary brackets have been removed.
To match within a line, remove the anchors and use a "word boundary" \b as the leading anchor instead:
\b(?!10\.|169\.)\d+\.\d+\.\d+\.\d+
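For instance, the unanchored version could be used in JavaScript to pull matching addresses out of a longer text. This is only a sketch; extractIps and the sample text are made up for the illustration.

```javascript
// Word boundary instead of ^, so a match can start anywhere in the text.
// The lookahead rejects candidates whose first octet is 10 or 169.
const ipRe = /\b(?!10\.|169\.)\d+\.\d+\.\d+\.\d+/g;

// Hypothetical helper: returns all matching IP-like substrings.
function extractIps(text) {
  return text.match(ipRe) ?? [];
}

console.log(extractIps("hosts: 10.0.0.5, 192.168.1.20, 169.254.1.1 and 8.8.8.8"));
```

Note that the pattern only checks the shape of the address, not that each octet is in the range 0 to 255.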

A quick-and-gimme-regex style answer
Basic one (whole string looks like an IP): ^\d+\.\d+\.\d+\.\d+$
Lite (four period-separated chunks of digits, as a whole word): \b\d+\.\d+\.\d+\.\d+\b
Medium (excluding junk like 1.2.4.6.7.9.0): (?<!\d\.)\b\d+\.\d+\.\d+\.\d+\b(?!\.\d+)
Advanced 1 (not starting with 10 or 169): (?<!\d\.)\b(?!(?:1(?:0|69))\.)\d+\.\d+\.\d+\.\d+\b(?!\.\d+)
Advanced 2 (not ending with 8 or 10): (?<!\d\.)\b\d+\.\d+\.\d+\.(?!(?:8|10)\b)\d+\b(?!\.\d+)
Details for the curious
The \b is a word boundary that makes it possible to match exact "words" (entities consisting of [a-zA-Z0-9_] characters) inside a longer text. So, if we do not want to match 12.12.23.56 inside g12.12.23.56g, we use the Lite version.
The lookarounds, together with the word boundary, make it possible to further restrict the matches. (?<!\d\.), a negative lookbehind, and (?!\.\d+), a negative lookahead, will fail a match if the IP-resembling substring is preceded by a digit and a ., or followed by a . and a digit. So, we do not match 12.12.34.56.78.90899-like entities with this regex. Choose the Medium regex for that case.
Now, you need to restrict the matches to those that do not start with some numeric value. You need to make use of either a lookbehind or a lookahead. When choosing between a lookbehind and a lookahead solution, prefer the lookahead, because 1) it is less resource consuming, and 2) more flavors support it. Thus, to fail all matches where the first number of the IP is equal to 10 or 169, we can use a negative lookahead placed right after the leading word boundary: (?!(?:1(?:0|69))\.). The syntax is (?!...) and inside, we match either 1 followed by 0 and then a ., or 1 followed by 69 and then a .. Note that we could write (?!10\.|169\.), but then there is some redundant backtracking overhead, as the 1 part is repeated. Best practice is to "contract" alternations so that the beginning of each branch does not repeat, which makes the alternation group more linear. So, use the Advanced 1 regex version to get those IPs.
A similar case is the Advanced 2 regex for getting some IPs that do not end with some value.
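The Medium and Advanced variants can be compared side by side in JavaScript, assuming an engine with lookbehind support (ES2018 or later). The variable names, the findAll helper and the sample strings below are made up for this sketch.

```javascript
// Medium: exactly four dot-separated numbers, not part of a longer dotted run.
const medium = /(?<!\d\.)\b\d+\.\d+\.\d+\.\d+\b(?!\.\d+)/g;
// Advanced 1: additionally, the first number must not be 10 or 169.
const notStarting10or169 = /(?<!\d\.)\b(?!(?:1(?:0|69))\.)\d+\.\d+\.\d+\.\d+\b(?!\.\d+)/g;
// Advanced 2: instead, the last number must not be 8 or 10.
const notEnding8or10 = /(?<!\d\.)\b\d+\.\d+\.\d+\.(?!(?:8|10)\b)\d+\b(?!\.\d+)/g;

// Hypothetical helper: all matches of a global regex, or an empty array.
const findAll = (re, text) => text.match(re) ?? [];

console.log(findAll(medium, "junk 1.2.4.6.7.9.0 but 12.12.23.56 is fine"));
console.log(findAll(notStarting10or169, "10.1.1.1 169.254.0.1 172.16.0.1"));
console.log(findAll(notEnding8or10, "8.8.8.8 1.1.1.10 1.1.1.80 1.1.1.1"));
```

In the last call, 1.1.1.80 is kept because the \b inside the lookahead requires the final number to be exactly 8 or 10, not merely to start with those digits.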

javascript regex requiring at least one letter, one number and prevents from adding certain words

I'm trying to modify my regex, which requires the user to type at least one letter and one digit and looks like this:
new RegExp('^(?=.*[a-z])(?=.*[0-9]).+$')
I also want to prevent the user from using certain words, such as the part of their email address before the #.
So let's assume the email address is example#example.com
I want to force the user to use a string that doesn't contain example anywhere in it.
This is what I have so far:
\b(?:(?!example)\w)+\b
But it doesn't really force the user to use at least one letter and one digit.
When I try to restrict it, I end up with this:
\b(?:(?!example).*[a-z].*[0-9])+\b
But now the strings must follow the order of example then a [a-z] and then [0-9]
Any help greatly appreciated
Thanks!

In your sample, the negative lookahead disallows example only at the start of the string. Just add .*? in front of the word, as a wildcard, to disallow it anywhere in the string, and use word boundaries \b if needed.
/^(?!.*?example)(?=\D*\d)[^a-z]*[a-z].*$/i
(?!.*?example) first lookahead disallows example anywhere in the line
(?=\D*\d) second lookahead requires a digit after any amount of \D non-digits.
[^a-z]*[a-z] matches any amount of non-alphabetic characters up to the first alphabetic one.
Actually, you only need two lookaheads to make the conditions independent of their order: one for the required digit and one for the forbidden word. The requirement for a letter can be handled inside the main pattern.
Lookarounds are zero-length assertions triggered at a certain position.
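Since the forbidden word comes from the user's email address, the pattern would probably be built at runtime. The following is a sketch under that assumption; makeValidator and escapeRegExp are hypothetical helper names, not part of the answer.

```javascript
// Escape characters that are special in a regex, so the forbidden word is
// always treated as literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds a validator that rejects `forbidden` anywhere
// in the value (case-insensitively) and requires a digit and a letter.
function makeValidator(forbidden) {
  const source = `^(?!.*?${escapeRegExp(forbidden)})(?=\\D*\\d)[^a-z]*[a-z].*$`;
  const re = new RegExp(source, "i");
  return (value) => re.test(value);
}

const isValid = makeValidator("example");
for (const candidate of ["abc123", "1a", "myExample9", "abcdef", "123456"]) {
  console.log(candidate, "->", isValid(candidate));
}
```

Escaping the word matters as soon as the email part contains characters such as . or +, which would otherwise be read as regex syntax.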

Regular Expressions - Match all alphanumeric characters except individual numbers

I would like to create a RegEx to match only English alphanumeric characters but ignore (or discard) isolated numbers in Ruby (and if possible in JS too).
Examples:
1) I would like the following to be matched:
4chan
9gag
test91323432
asf5asdfaf35edfdfad
afafaffe
But not:
92342424
343424
34432
and so on..
The above is exactly what I would want.
Edit: I deleted the second sub-question. Just focus on the first one, thank you very much for your answers!!
Sorry, my regex skills aren't that great (hence this question!)
Thank you.

You can try the following expression in Ruby:
^(?!^\d+$)[[:alnum:]]+$
This first ensures the string is not just digits by using a negative lookahead (?!^\d+$), then it matches one or more alphanumeric characters. Unicode characters are supported, which means this works with French letters too. Note that JavaScript has no POSIX bracket classes such as [[:alnum:]], so this form does not carry over to JS as written.
EDIT: If you only want the English alphabet (this one works in both Ruby and JavaScript; note that \w also accepts the underscore):
^(?!^\d+$)\w+$
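For the JavaScript side of the question, the same lookahead idea might look like the sketch below. Both patterns are adaptations rather than the answer's exact expressions: the first is limited to ASCII letters and digits, the second assumes Unicode property escapes with the u flag (ES2018).

```javascript
// ASCII-only variant: letters and digits, but not digits alone.
const asciiAlnum = /^(?!\d+$)[a-z\d]+$/i;
// Unicode-aware variant: any letter or number, but not numbers alone.
const unicodeAlnum = /^(?!\p{N}+$)[\p{L}\p{N}]+$/u;

const wanted = ["4chan", "9gag", "test91323432", "asf5asdfaf35edfdfad", "afafaffe"];
const unwanted = ["92342424", "343424", "34432"];

for (const word of [...wanted, ...unwanted]) {
  console.log(word, "->", asciiAlnum.test(word));
}

// An accented letter only passes the Unicode-aware variant.
console.log(asciiAlnum.test("caf\u00e99"), unicodeAlnum.test("caf\u00e99"));
```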

For any Latin letters:
/(?=.*\p{Alpha})\p{Alnum}+/

I'm pretty sure that you can't do what you want to do with one regex. A single alpha character, anywhere in a group of numbers, will make it a valid match, and there is no way to represent that in regex, because what you are really saying is something along the lines of "a letter is required at the front of this word, but only if there isn't a letter in the middle or at the end", and regex won't do that.
Your best bet is to do two passes:
one that matches your alphanumeric, plus special "French" characters (pattern: TBD, based on what special characters you want to accept), and
one that matches numbers only (pattern: would include [0-9]+ . . . need more information about the specific situation to give you a final, complete regex)
The values that you want in the end would need to pass the first regex and fail the second one.
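The two-pass idea could be sketched in JavaScript as follows. Since the accepted character set is still to be decided, the first pattern simply assumes ASCII letters and digits; isAcceptable is a made-up helper name.

```javascript
// Pass 1 (assumed character set): only ASCII letters and digits.
const allowedChars = /^[a-z0-9]+$/i;
// Pass 2: the value consists of digits only.
const digitsOnly = /^[0-9]+$/;

// A value is accepted when it passes the first test and fails the second.
function isAcceptable(value) {
  return allowedChars.test(value) && !digitsOnly.test(value);
}

const inputs = ["4chan", "9gag", "test91323432", "92342424", "34432", "afafaffe"];
console.log(inputs.filter(isAcceptable));
```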
Also . . .
To give you a better answer, we'll need to know a couple of things:
Are you testing that an entire string matches the pattern?
Are you trying to capture a single instance of the pattern in a bigger string?
Are you trying to capture all of the instances of the pattern in a bigger string?
The answers to those questions have a big impact on the final regex pattern that you will need.
And, finally . . .
A note on the "French" characters . . . you need to be very specific about which special characters are acceptable and which aren't. There are three main approaches to special character matching in regex: groups, additive, and subtractive
groups - these are characters that represent a preset group of characters in the version of regex that you are using. For example, \s matches all whitespaces
additive - this is the process of listing out each acceptable character (or range of characters) in your regex. This is better when you have a small group of acceptable characters
subtractive - this is the process of listing out each UNacceptable character (or range of characters) in your regex. This is better when you have a large group of acceptable characters
If you can clear up some of these questions, we should be able to give you a better answer.

Maybe this ^(?![0-9]+$)[a-zA-Z0-9\x80-\xa5]+$
Edit - fixed cut&paste error and added Extended character range \x80-\xa5
which includes the accent chars (depending on locale set, the figures may be different)

RegEx in JS to find No 3 Identical consecutive characters

How to find a sequence of 3 characters, 'abb' is valid while 'abbb' is not valid, in JS using Regex (could be alphabets,numerics and non alpha numerics).
This question is a variation of one that I asked here: How to combine these regex for javascript.
This is wrong : /(^([0-9a-zA-Z]|[^0-9a-zA-Z]))\1\1/ , so what is the right way to do it?

This depends on what you actually mean. If you only want to match three non-identical characters (that is, if abb is valid for you), you can use this negative lookahead:
(?!(.)\1\1).{3}
It first asserts that the current position is not followed by the same character three times. Then it matches those three characters.
If you really want to match 3 different characters (only stuff like abc), it gets a bit more complicated. Use these two negative lookaheads instead:
(.)(?!\1)(.)(?!\1|\2).
First, match one character. Then we assert that this is not followed by the same character. If the assertion holds, we match another character. Then we assert that these are followed by neither the first nor the second character. Then we match a third character.
Note that those negative lookaheads ((?!...)) do not consume any characters. That is why they are called lookaheads. They just check what is coming next (or in this case what is not coming next) and then the regex continues from where it left off.
Note also that this matches anything but line breaks, or really anything if you use the DOTALL or SINGLELINE option. Since you are using JavaScript, you can activate the option by appending s after the regex's closing delimiter (supported since ES2018). If (for some reason) you don't want to use this option, replace the .s with [\s\S] (this always matches any character).
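Both patterns from this part of the answer can be tried out in JavaScript with the s flag, assuming an ES2018-capable engine. The constant names below are invented for the sketch.

```javascript
// Three characters that are not all the same (so "abb" is accepted).
const notAllSame = /(?!(.)\1\1).{3}/s;
// Three pairwise different characters (only things like "abc").
const allDifferent = /(.)(?!\1)(.)(?!\1|\2)./s;

for (const sample of ["abb", "aaa", "aaab", "aba", "aabc"]) {
  const loose = notAllSame.exec(sample);
  const strict = allDifferent.exec(sample);
  console.log(sample, "->", loose && loose[0], "/", strict && strict[0]);
}
```

Neither pattern is anchored, so the engine slides along the string until it finds a position where the lookahead holds; for aaab that position is the second character.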
Update:
After clarification in the comments, I realised that you do not want to find three non-identical characters, but instead you want to assert that your string does not contain three identical (and consecutive) characters.
This is a bit easier, and closer to your former question, since it only requires one negative lookahead. What we do is this: we search the string from the beginning for three consecutive identical characters. But since we want to assert that these do not exist we wrap this in a negative lookahead:
^(?!.*(.)\1\1)
The lookahead is anchored to the beginning of the string, so this is the only place where we will look. The pattern in the lookahead then tries to find three identical characters from any position in the string (because of the .*; the identical characters are matched in the same way as in your previous question). If the pattern finds these, the negative lookahead fails, and so the string is invalid. If no three identical consecutive characters can be found, the inner pattern never matches, so the negative lookahead succeeds.
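Used as a validation check in JavaScript, this might look like the sketch below. The s flag is added so that line breaks count as characters too, and hasNoTripleRepeat is a made-up helper name.

```javascript
// Succeeds only when no character appears three times in a row anywhere
// in the string. The pattern itself consumes nothing; it is a pure assertion.
const noTripleRepeat = /^(?!.*(.)\1\1)/s;

// Hypothetical helper wrapping the test.
function hasNoTripleRepeat(value) {
  return noTripleRepeat.test(value);
}

for (const sample of ["abb", "abbb", "aabbcc", "pass111word"]) {
  console.log(sample, "->", hasNoTripleRepeat(sample));
}
```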

To find three characters that are not all identical, use the regex pattern:
([\s\S])(?!\1\1)[\s\S]{2}
