I'm trying to extract groups of numbers from a string.
These numbers can either be on their own or as a range in the format \d+ - \d+, while the range indicator between the two numbers can vary and the numbers can either have the prefix M- or STR . These groups can occur 1 to n times in a given string, but the matching should stop if a group is followed by any character that is not a number, whitespace or one of the prefixes mentioned above, even if further numbers can be found afterwards.
As an example, the following lines
01
05,07
05, 7
M-01, M-12
311,STR 02
M-56
STR 17
01 - Random String 25-31 Random other string
M-04 Random String 01
M-17,3,148,14 to 31
M-17,3,STR 148,14 to 31 - Random String
M-17,3,148,14- 31 Random, String 02 Random, other string
STR 17,3,12 to 18, 148 ,M-14- 31 : Random String 02
Should return
01
05;07
05;7
01;12
311;02
56
17
01
04
17;3;148;14 to 31
17;3;148;14 to 31
17;3;148;14- 31
17;3;12 to 18;148;14- 31
I'm using javascript and can almost get a correct result by running
var pattern = /(\d+)\s?(?:-|~|to)?\s?(\d+)?/ig
while (result = pattern.exec(line)) {console.log(result)}
but I can't figure out how to not match numbers after the first string, i.e. M-17,3,148,14 to 31 - Random string 46 Random string would return the values 17;3;148;14 to 31;46, while 46 should not be matched.
I'm not really concerned about the format of the results since I'm sanitizing them afterwards anyway, so it doesn't matter if '03 ' comes back as '03' or '03 '. This is also true for number ranges: 15 - 17 can either be returned as 15 - 17 or, like in the example above, use capturing groups to determine the upper and lower bound. I still need to be able to tell whether two numbers are separate or a range, so 5,8,10-12 can't be returned as 5;8;10;12.
My ultimate goal is to extract all possible values in each line. After I extracted all number ranges, I loop through each result to get all possible values, e.g. 5,8,10-12 would become 5;8;10;11;12.
If it is somehow possible, and this is purely optional, I'd also like to preserve the string after the last number range, e.g. STR 14, 23 Some String 18 Some other string should return 14;23 and, separately, Some String 18 Some other string.
I'd be grateful if anybody has an idea on how to solve this.
Here's my attempt.
[
'01',
'05,07',
'05, 7',
'M-01, M-12',
'311,STR 02',
'M-56',
'STR 17',
'01 - Random String 25-31 Random other string',
'M-04 Random String 01',
'M-17,3,148,14 to 31',
'M-17,3,STR 148,14 to 31 - Random String',
'M-17,3,148,14- 31 Random, String 02 Random, other string',
'STR 17,3,12 to 18, 148 ,M-14- 31 : Random String 02',
'14 ~ 16',
'Random String 15',
'1to3',
'M-01 to STR 6',
'17 56'
].forEach(function(str) {
var rangeRe = /(?:\s*,\s*)(?:M-|STR )?(\d+)(?:\s*(?:-|~|to)\s*(\d+))?/g,
ranges = [],
lastIndex = 1,
match;
str = ',' + str;
while (match = rangeRe.exec(str)) {
// Push a lower and upper bound onto the list of ranges
ranges.push([+match[1], +(match[2] || match[1])]);
lastIndex = rangeRe.lastIndex;
}
// Log the original string, the ranges and the remainder
console.log([
str.slice(1),
ranges.map(function(pair) {
return pair[0] + '-' + pair[1];
}).join(' ; '),
str.slice(lastIndex)
]);
});
Here are the rules I've followed:
Numbers consists of consecutive digits.
A range consists of either a single number or a pair of numbers.
If a range features a pair they can be separated by -, ~ or to, plus arbitrary whitespace either side the separator.
A range (note range, not number) can be prefixed by M- or STR. No extra whitespace is permitted between the prefix and the range.
Ranges are separated by , plus arbitrary whitespace either side of the ,.
Each range is parsed into an array pair consisting of the lower and upper bound. For a single-number range the same value is used for both bounds.
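Since the asker's end goal is the full list of values, the [lower, upper] pairs can be expanded in a second step. This is a minimal sketch, assuming pairs shaped like the ones pushed onto ranges above; expandRanges is a hypothetical helper name, and tolerating reversed bounds is my assumption rather than a stated rule.

```javascript
// Hypothetical helper: turn [lower, upper] pairs into every value they cover.
function expandRanges(pairs) {
  const values = [];
  pairs.forEach(function (pair) {
    // Tolerate reversed bounds such as "12 - 10" (an assumption, not a stated rule)
    const low = Math.min(pair[0], pair[1]);
    const high = Math.max(pair[0], pair[1]);
    for (let n = low; n <= high; n++) {
      values.push(n);
    }
  });
  return values;
}

// 5,8,10-12 parsed into pairs
console.log(expandRanges([[5, 5], [8, 8], [10, 12]]).join(';')); // prints 5;8;10;11;12
```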
I've used the statefulness of exec with the g flag. Each iteration of the loop starts searching at the lastIndex left by the previous match. The lastIndex is tracked so that we can generate the remaining 'random string' at the end.
I add a , out the front of the string before I start. This allows the RegExp to assume that all ranges start with a ,, avoiding the need for a special case of the first range.
A key difference from some of the RegExps that you posted is that I made the entire 'range separator and upper bound' section optional as a unit, rather than making the parts individually optional. The result is that an input like 17 56 treats the 56 as the 'random string' and not as an upper bound. The range is treated as 17-17.
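One caveat with a g-flagged RegExp: exec searches forward from lastIndex, so a comma followed by a number somewhere in the free text (for example Random, 5) can still produce a match. A possible variant, sketched here under the assumption that your runtime supports the ES2015 sticky flag (y), only accepts a match that begins exactly at lastIndex. parseLeadingRanges is a hypothetical name, and trimming the remainder is my choice.

```javascript
// Hypothetical helper: parse the leading ranges and return them with the remainder.
function parseLeadingRanges(line) {
  // 'y' (sticky) forces every match to start exactly at lastIndex
  const rangeRe = /\s*,\s*(?:M-|STR )?(\d+)(?:\s*(?:-|~|to)\s*(\d+))?/y;
  const str = ',' + line; // same leading-comma trick as in the answer
  const ranges = [];
  let lastIndex = 1; // skip the added comma if nothing matches
  let match;
  while ((match = rangeRe.exec(str)) !== null) {
    // Lower and upper bound; a single number is used for both
    ranges.push([Number(match[1]), Number(match[2] || match[1])]);
    lastIndex = rangeRe.lastIndex;
  }
  return { ranges: ranges, rest: str.slice(lastIndex).trim() };
}

console.log(parseLeadingRanges('M-17,3,148,14 to 31 - Random string 46 Random string'));
console.log(parseLeadingRanges('Random, 5'));
```

With the sticky flag the loop stops at the first position that is not a range, so the 46 in the first line and the 5 in the second end up in rest instead of in ranges.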
So, after getting a coffee, I think I figured out something close to a solution:
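function extractNumbers(line){
var str = line.replace(/(?:M-\s?|STR )(\d+)/ig,'$1')
// First letter and everything after it; null when the line contains no letters.
// Caveat: the "to" of a range such as "14 to 31" also counts as a letter here.
var rightpart = str.match(/([a-z].*)/i)
var rest = rightpart ? rightpart[1] : ''
var leftpart = rightpart ? str.slice(0, rightpart.index) : str
var pattern = /(\d+)\s?(?:-|~|to)?\s?(\d+)?/ig
var result
while (result = pattern.exec(leftpart)) {console.log(result)}
console.log(rest)
}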
This function outputs all number ranges and then the rest of the string to the console. There are chances for false positives because it first replaces all occurrences of M- and STR followed by a number, even if they occur in the right part of the string. The chances of this exact sequence of characters occurring in the right part are probably small, but the risk is still there.
If anybody has an answer to the original question or an idea on how to eliminate the chance for false positives, I would love to see it.
I am trying to split the d attribute on a path tag in an svg file into tokens.
This one is relatively easy:
d = "M 2 -12 C 5 15 21 19 27 -2 C 17 12 -3 40 5 7"
tokens = d.split(/[\s,]/)
But this is also a valid d attribute:
d = "M2-12C5,15,21,19,27-2C17,12-3,40,5,7"
The tricky parts are letters and numbers are no longer separated and negative numbers use only the negative sign as the separator. How can I create a regex that handles this?
The rules seem to be:
split wherever there is white space or a comma
split numerics from letters (and keep "-" with the numeric)
I know I can use lookaround, for example:
tokens = pathdef.split(/(?<=\d)(?=\D)|(?<=\D)(?=\d)/)
I'm having trouble forming a single regex that also splits on the minus signs and keeps the minus sign with the numbers.
The above code should tokenize as follows:
[ 'M', '2', '-12', 'C', '5', '15', '21', '19', '27', '-2', 'C', '17', '12', '-3', '40', '5', '7' ]
Brief
When this answer was written, JavaScript didn't support lookbehinds (they were added in ES2018), so your options were fairly limited. The regex in the Other Regex Engines section below only works in engines with lookbehind support.
Other Regex Engines
Note: The regex in this section (Other Regex Engines) will not work in JavaScript engines that lack lookbehind support (anything predating ES2018). See the JavaScript solution in the Code section instead.
I think with your original regex you were trying to get to:
[, ]|(?<![, ])(?=-|(?<=[a-z])\d|(?<=\d)[a-z])
This regex allows you to split on those matches: a comma, a space, a location that is followed by -, a location where a letter precedes a digit, or a location where a digit precedes a letter. The zero-width locations only count when they are not preceded by a comma or a space.
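For completeness, here is how that split-based regex could be used in a JavaScript runtime that implements ES2018 lookbehind assertions (an assumption about your environment; older engines throw a SyntaxError on this pattern). The i flag is added so that uppercase command letters count as letters, and tokenizeBySplit is a hypothetical name.

```javascript
// Requires lookbehind support (ES2018); the i flag makes [a-z] match uppercase commands too.
const splitter = /[, ]|(?<![, ])(?=-|(?<=[a-z])\d|(?<=\d)[a-z])/i;

function tokenizeBySplit(d) {
  // filter(Boolean) drops empty strings produced by doubled separators
  return d.split(splitter).filter(Boolean);
}

console.log(tokenizeBySplit("M2-12C5,15,21,19,27-2C17,12-3,40,5,7"));
console.log(tokenizeBySplit("M 2 -12 C 5 15 21 19 27 -2 C 17 12 -3 40 5 7"));
```

Both calls should yield the token list given in the question.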
Code
var a = [
"M 2 -12 C 5 15 21 19 27 -2 C 17 12 -3 40 5 7",
"M2-12C5,15,21,19,27-2C17,12-3,40,5,7"
]
var r = /-?(?:\d*\.)?\d+|[a-z]/gi
a.forEach(function(s){
console.log(s.match(r));
});
Explanation
-?(?:\d*\.)?\d+|[a-z] Match either of the following
-?(?:\d*\.)?\d+
-? Match - literally zero or one time
(?:\d*\.)? Match the following zero or one time
\d* Match any number of digits
\. Match a literal dot
\d+ Match one or more digits
[a-z] Match any character in the range from a-z (any lowercase alpha character - since i modifier is used this also matches uppercase variants of those letters)
I added (?:\d*\.)? because (to the best of my knowledge) you can have decimal number values in SVG d attributes.
Note: Changed the original regex portion of \d+(?:\.\d+)? to (?:\d*\.)?\d+ in order to catch numbers that don't have the whole number part, such as .5, as per @Thomas (see comments below question).
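To see the decimal handling in action, here is a small sketch built on the same regex. tokenizePath is a hypothetical name, and converting numeric tokens to numbers is my addition. One known gap: exponent notation such as 1e-5, which SVG path data permits, is not handled, because the e would be read as a command letter.

```javascript
const tokenRe = /-?(?:\d*\.)?\d+|[a-z]/gi;

// Hypothetical helper: split a path "d" attribute into command letters and numbers.
function tokenizePath(d) {
  // match() returns null when nothing matches, so fall back to an empty array
  const tokens = d.match(tokenRe) || [];
  // Convert numeric tokens to numbers, keep command letters as strings
  return tokens.map(function (t) {
    return /[a-z]/i.test(t) ? t : Number(t);
  });
}

console.log(tokenizePath("M.5-1.25L10,20")); // [ 'M', 0.5, -1.25, 'L', 10, 20 ]
```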
You could go for
-?\d+|[A-Z]
See a demo on regex101.com.
Here, instead of splitting, you could very well just match them:
var matches = "M 2 -12 C 5 15 21 19 27 -2 C 17 12 -3 40 5 7".match(/-?\d+|[A-Z]/g)
// matches holds the different tokens
I'm weak with regexes but have put together the following regex, which selects when my pattern is met. The problem is that I need to select any characters that do not fit the pattern.
/^\d{1,2}[ ]\d{1,2}[ ]\d{1,2}[ ][AB]/i
Correct pattern is:
## ## ## A|B aka [0 < x <= 90]*space*[0 < x <= 90] [0 < x <= 90] [A|B]
EG:
12 34 56 A → good
12 34 56 B → good
12 34 5.6 A → bad - select .
12 34 5.6 C → bad - select . and C
1A 23 45 6 → bad - select A and 6
Edit:
My impression was that a regex can validate both the characters and their pattern/sequence at the same time. The simple question is how to select the characters that do not fit the category of non-negative numbers, spaces and the distinct characters (A or B).
Answer 1
Brief
This isn't really realizable with 1 regex due to the nature of the regex. This answer provides a regex that will capture the last incorrect entry. For multiple incorrect entries, a loop must be used. You can correct the incorrect entries by running some code logic on the resulting captured groups to determine why it isn't valid.
My ultimate suggestion would be to split the string by a known delimiter (in this case the space character) and then use some logic (or even a small regex) to determine why it's incorrect and how to fix it, as seen in Answer 2.
Non-matches
The following logic is applied in my second answer.
For any users wondering what I did to catch incorrect matches: At the most basic level, all this regex is doing is adding |(.*) to every subsection of the regex. Some sections required additional changes for catching specific invalid string formats, but the |(.*) or slight modifications of this will likely solve anyone else's issues.
Other modifications include:
Using opposite tokens
For example: Matching a digit
Original regex: \d
Opposite regex \D
For example: Matching a digit or whitespace
Original regex: [\d\s]
Opposite regex: [^\d\s]
Note: [\D\S] is incorrect because it is the union of the two classes, so it matches any character that is a non-digit or a non-whitespace character. Since digits are non-whitespace and whitespace characters are non-digits, every character is matched.
Negative lookaheads
For example: Catching up to 31 days in a month
Original regex \b(?:[0-2]?\d|3[01])\b
Opposite regex: \b(?![0-2]?\d\b|3[01]\b)\d+\b
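The "opposite" patterns above can be checked quickly in a console. This is only an illustrative sketch; the sample strings and variable names are my own.

```javascript
const sample = "1 a";

// Characters that are neither a digit nor whitespace
const onlyInvalid = sample.match(/[^\d\s]/g);
// The incorrect class: matches every character
const everything = sample.match(/[\D\S]/g);

// Numbers that are NOT a valid day of the month (0 to 31)
const notADay = /\b(?![0-2]?\d\b|3[01]\b)\d+\b/g;
const badDays = "5 31 32 100".match(notADay);

console.log(onlyInvalid); // [ 'a' ]
console.log(everything);  // [ '1', ' ', 'a' ]
console.log(badDays);     // [ '32', '100' ]
```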
Code
First, creating a more correct regex that also ensures 0 < x <= 90 as per the OP's question.
^(?:(?:[0-8]?\d|90) ){3}[AB]$
See regex in use here
^(?:(?:(?:[0-8]?\d|90) |(\S*) ?)){3}(?:[AB]|(.*))$
Note: This regex uses the mi flags (multiline - assuming input is in that format, and case-insensitive)
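As a quick check of the stricter validation-only pattern, ^(?:(?:[0-8]?\d|90) ){3}[AB]$, here is a minimal sketch that tests one line at a time, so only the i flag is needed. isValidLine is a hypothetical name. Note that the pattern accepts 0 (and 00), although the question states 0 < x.

```javascript
const validLine = /^(?:(?:[0-8]?\d|90) ){3}[AB]$/i;

// Hypothetical helper: true when the line is three numbers (0 to 90) followed by A or B.
function isValidLine(line) {
  return validLine.test(line);
}

["0 45 90 A", "0 45 91 A", "12 34 5.6 A", "12 12 A"].forEach(function (line) {
  console.log(line, isValidLine(line) ? "VALID" : "INVALID");
});
```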
Other Formats
Realistically, this following regex would be ideal. Unfortunately, JavaScript doesn't support some of the tokens used in the regex, but I feel it may be useful to the OP or other users that see this question.
See regex in use here
^(?:(?:(?:[0-8]?\d|90) |(?<n>\S*?) |(?<n>\S*?) ?)){3}(?:(?<n>\S*) )?(?:[AB]|(.*))$
Results
Input
The first section (sections separated by the extra newline/break) shows valid strings, while the second shows invalid strings.
0 45 90 A
0 45 90 B
-1 45 90 A
0 45 91 A
12 34 5.6 A
12 34 56 C
1A 23 45 6
11 1A 12 12 A
12 12 A
12 12 A
Output
0 45 90 A VALID
0 45 90 B VALID
-1 45 90 A INVALID: -1
0 45 91 A INVALID: 91
12 34 5.6 A INVALID: 5.6
12 34 56 C INVALID: C
1A 23 45 6 INVALID: 1A, 6
11 1A 12 12 A INVALID: 12 A
12 12 A INVALID: (missing value)
12 12 A INVALID: A, (missing value)
Note: The last entry shows an odd output, but that's due to a limitation with JavaScript's regex engine. The Other Formats section describes this and another method to use to properly catch these cases (using a different regex engine)
Explanation
This uses a simple | (OR) and captures the incorrect matches into a capture group.
^ Assert position at the start of the line
(?:(?:(?:[0-8]?\d|90) |(\S*) ?)){3} Match the following exactly 3 times
(?:(?:[0-8]?\d|90) |(\S*) ?) Match either of the following
(?:[0-8]?\d|90) Match either of the following, followed by a space character literally
[0-8]?\d Match between zero and one of the characters in the set 0-8 (a digit between 0 and 8), followed by any digit
90 Match 90 literally
(\S*) ? Capture any non-whitespace character any number of times (zero or more) into capture group 1, followed by zero or one space character literally
(?:[AB]|(.*)) Match either of the following
[AB] Match any character present in the set (A or B)
(.*) Capture any character any number of times into capture group 2
$ Assert position at the end of the line
Answer 2
Brief
This method splits the string on the given delimiter and tests each section for the proper set of characters. It outputs a message if the value is incorrect. You would likely replace the console outputs with whatever logic you want to use.
Code
var arr = [
"0 45 90 A",
"0 45 90 B",
"-1 45 90 A",
"0 45 91 A",
"12 34 5.6 A",
"12 34 56 C",
"1A 23 45 6",
"11 1A 12 12 A",
"12 12 A",
"12 12 A"
];
arr.forEach(function(e) {
var s = e.split(" ");
var l = s.pop();
var numElements = 3;
var maxNum = 90;
var syntaxErrors = [];
if(s.length != numElements) {
syntaxErrors.push(`Invalid number of elements: Number = ${numElements}, Given = ${s.length}`);
}
s.forEach(function(v) {
if(v.match(/\D/)) {
syntaxErrors.push(`Invalid value "${v}" exists`);
} else if(!v.length) {
syntaxErrors.push(`An empty value or double space exists`);
} else if(Number(v) > maxNum) {
syntaxErrors.push(`Value greater than ${maxNum} exists: ${v}`);
}
});
if(l.match(/[^AB]/)) {
syntaxErrors.push(`Last element ${l} in "${e}" is invalid`);
}
if(syntaxErrors.length) {
console.log(`"${e}" [\n\t${syntaxErrors.join('\n\t')}\n]`);
} else {
console.log(`No errors found in "${e}"`);
}
});
I want to setup some validation on an <input> to prevent the user from entering wrong characters. For this I am using ng-pattern. It currently disables the user from entering wrong characters, but I also noticed this is not the expected behavior so I am also planning on creating a directive.
I am using
AngularJS: 1.6.1
What should the regex match
Below are the requirements for the regex string:
Number 0x to xx (example 01 to 93)
Number x to xx (example 9 to 60)
Characters are not allowed
Special characters are not allowed
Notice:
the 'x' is variable and could be any number between 0 and 100.
The number on the place of 'x' is variable so if it is possible to create a string that is easily changeable that would be appreciated!
What I tried
A few regex strings I tried were:
1) ^0*([0-9]\d{1,2})$
--> Does match 01 but not 1
--> Does match 32 where it shouldn't
2) ^[1-9][0-9]?$|^31$
--> Does match 1 but not 01
--> Does match 32 where it shouldn't
For testing I am using https://regex101.com/tests.
What am I missing in my attempts?
If your aim is to match 1 to 100 (with single digits optionally written with a leading zero), here's a way, based on the previous solution.
\b(0?[1-9]|[1-9][0-9]|100)\b
Basically, there's 3 parts to that match...
0?[1-9] covers numbers 1 to 9, with an optional leading 0
[1-9][0-9] covers numbers 10 to 99, the [1-9] representing the tens
100 covers 100
Here's an example of it
Were you to set the upper boundary to 42, the middle part of the expression would become [1-3][0-9] (covering 10 to 39) and the last part would become 4[0-2] (covering 40 to 42) like so:
\b(0?[1-9]|[1-3][0-9]|4[0-2])\b
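Because the question asks for an upper bound that is easy to change, an alternative to rewriting the regex for every limit is to check only the notation with a regex and compare the value numerically. This swaps in a different technique from the answer above. isInRange is a hypothetical helper, and it uses ^ and $ (whole-input matching) where the answer uses \b.

```javascript
// Hypothetical helper: accept 1..max; single digits may carry one leading zero.
function isInRange(text, max) {
  if (!/^(?:0?\d|[1-9]\d+)$/.test(text)) {
    return false; // not a plain number in the accepted notation
  }
  const n = Number(text);
  return n >= 1 && n <= max;
}

// The hand-written pattern for 1..42 from the answer, anchored to the whole input
const upTo42 = /^(0?[1-9]|[1-3][0-9]|4[0-2])$/;

console.log(isInRange("07", 42), upTo42.test("07")); // true true
console.log(isInRange("43", 42), upTo42.test("43")); // false false
```

Changing the limit then only means passing a different max.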
This should work:
^(0?[1-9]|[12][0-9]|3[01])$
https://regex101.com/r/BYSDwz/1
In general, I don't think I've come across a generic solution to this problem. How do you match a string that can be a range, or just a single value?
Say I want to match [complex] dates:
1999 - 2010
323 BCE - 100 CE
323 BC
1995-99
323 - 322 BC
What is the general regular expression "template" that can parse both of these cases:
The start/end date if it exists
Otherwise, just a single date
To match "1999 - 2010", you can just do
/(\d+\s*)-(\s*\d+)/ // where $1 and $2 are start and end
To match the more complex "323 BCE - 100 CE", you can do
/(\w+\s*\w+)\s*-\s*(\w+\s*\w+)/
And to match the simpler "323 BC", you can do
/\w+\s*\w+/
But how do you write one expression that first checks for the range (323 BCE - 100 CE), and if that doesn't exist, checks for a single value (323 BC), that can also handle the other examples in the list above?
By making the latter part of the match optional.
/(\w+\s*\w+)(?:\s*-\s*(\w+\s*\w+))?/
Examples (JavaScript)
"1900 - 2000".match(/(\w+\s*\w+)(?:\s*-\s*(\w+\s*\w+))?/);
//["1900 - 2000", "1900", "2000"]
"1900 BC".match(/(\w+\s*\w+)(?:\s*-\s*(\w+\s*\w+))?/);
//["1900 BC", "1900 BC", undefined]
Note the outer, optional part is made to be non-capturing, so the array of results contains only the sub-matches you're interested in.
It would also be an idea to tighten the pattern efficiency-wise e.g. look for numbers rather than anything alphanumeric, and allow only single spaces (if this was acceptable) rather than zero or more.
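One way such a tightened pattern might look is sketched below: digits for the year, an optional era written in capital letters after a single space, and at most one space on either side of the hyphen. The exact rules (capitals only, whole-string anchoring) are my assumptions, and parseDateRange is a hypothetical name.

```javascript
// One date: digits, optionally followed by a single space and an era in capitals
const datePart = "\\d+(?: [A-Z]+)?";
// A date, optionally followed by "-" (with at most one space either side) and a second date
const dateRangeRe = new RegExp("^(" + datePart + ")(?: ?- ?(" + datePart + "))?$");

function parseDateRange(text) {
  const m = dateRangeRe.exec(text);
  if (!m) {
    return null; // not a date or date range in the assumed format
  }
  return { start: m[1], end: m[2] === undefined ? null : m[2] };
}

["1999 - 2010", "323 BCE - 100 CE", "323 BC", "1995-99", "323 - 322 BC"].forEach(function (s) {
  console.log(s, parseDateRange(s));
});
```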
Just to throw in another pattern that might work the way you want it;
((\d+)( [A-Za-z]+|))((-| - )\d+( [A-Za-z]+|)|)
And as with Utkanos' pattern, this might need some tightening to not match anything else.
You're probably looking for something like this:
var pattern = /(\d+)(\s*(\w+))?(\s*-\s*(\d+)(\s*(\w+))?)?/;
var strings = [
'1999 - 2010',
'323 BCE - 100 CE',
'323 BC',
'1995-99',
'323 - 322 BC'
];
for (var i=0, s; s = strings[i]; i++) {
var m = s.match(pattern);
console.log(
m[1], // beginning year
m[3], // beginning b/c/e
m[5], // end year
m[7] // end b/c/e
);
}
which outputs
1999 undefined 2010 undefined
323 BCE 100 CE
323 BC undefined undefined
1995 undefined 99 undefined
323 undefined 322 BC
The trick here is to understand that (group)? makes (group) optional. Analogous to this, (foo)+ and (foo){3} can be used to make the group match at least once or exactly three times.
Groups (foo) are, by default, capturing groups. That means their result will be contained in the array returned by String#match(). You can mark groups to be non-capture like so: (?:wont-be-captured). With this, we can modify above pattern even further:
var pattern = /(\d+)(?:\s*(\w+))?(?:\s*-\s*(\d+)(?:\s*(\w+))?)?/;
for (var i=0, s; s = strings[i]; i++) {
var m = s.match(pattern);
console.log(m[1], m[2], m[3], m[4]);
}
Alright, so I was messing around with parseInt to see how it handles values not yet initialized and I stumbled upon this gem. The below happens for any radix 24 or above.
parseInt(null, 24) === 23 // evaluates to true
I tested it in IE, Chrome and Firefox and they all alert true, so I'm thinking it must be in the specification somewhere. A quick Google search didn't give me any results so here I am, hoping someone can explain.
I remember listening to a Crockford speech where he was saying typeof null === "object" because of an oversight causing Object and Null to have a near identical type identifier in memory or something along those lines, but I can't find that video now.
Try it: http://jsfiddle.net/robert/txjwP/
Edit Correction: a higher radix returns different results, 32 returns 785077
Edit 2 From zzzzBov: [24...30]:23, 31:714695, 32:785077, 33:859935, 34:939407, 35:1023631, 36:1112745
tl;dr
Explain why parseInt(null, 24) === 23 is a true statement.
It's converting null to the string "null" and trying to convert it. For radixes 0 through 23, there are no numerals it can convert, so it returns NaN. At 24, "n", the 14th letter, is added to the numeral system. At 31, "u", the 21st letter, is added and the entire string can be decoded. From 37 on, the radix is outside the range parseInt accepts (2 to 36), so NaN is returned.
js> parseInt(null, 36)
1112745
>>> reduce(lambda x, y: x * 36 + y, [(string.digits + string.lowercase).index(x) for x in 'null'])
1112745
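The same arithmetic can be reproduced in JavaScript itself. This is a sketch that assumes lowercase input made only of valid digits; valueInRadix and DIGITS are hypothetical names.

```javascript
const DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz";

// Evaluate a digit string in the given radix by hand (Horner's scheme)
function valueInRadix(text, radix) {
  return Array.from(text).reduce(function (acc, ch) {
    return acc * radix + DIGITS.indexOf(ch);
  }, 0);
}

console.log(valueInRadix("null", 36)); // 1112745
console.log(parseInt(null, 36));       // 1112745

// How the result changes with the radix
[23, 24, 30, 31, 37].forEach(function (radix) {
  console.log(radix, parseInt(null, radix));
});
```

The loop prints NaN for 23, 23 for 24 and 30, 714695 for 31, and NaN for 37.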
Mozilla tells us:
function parseInt converts its first
argument to a string, parses it, and
returns an integer or NaN. If not NaN,
the returned value will be the decimal
integer representation of the first
argument taken as a number in the
specified radix (base). For example, a
radix of 10 indicates to convert from
a decimal number, 8 octal, 16
hexadecimal, and so on. For radices
above 10, the letters of the alphabet
indicate numerals greater than 9. For
example, for hexadecimal numbers (base
16), A through F are used.
In the spec, 15.1.2.2/1 tells us that the conversion to string is performed using the built-in ToString, which (as per 9.8) yields "null" (not to be confused with toString, which would yield "[object Window]"!).
So, let's consider parseInt("null", 24).
Of course, this isn't a base-24 numeric string in entirety, but "n" is: it's decimal 23.
Now, parsing stops after the decimal 23 is pulled out, because "u" isn't found in the base-24 system:
If S contains any character that is
not a radix-R digit, then let Z be the
substring of S consisting of all
characters before the first such
character; otherwise, let Z be S. [15.1.2.2/11]
(And this is why parseInt(null, 23) (and lower radices) gives you NaN rather than 23: "n" is not in the base-23 system.)
Ignacio Vazquez-Abrams is correct, but let's see exactly how it works...
From 15.1.2.2 parseInt (string , radix):
When the parseInt function is called,
the following steps are taken:
Let inputString be ToString(string).
Let S be a newly created substring of inputString consisting of the first
character that is not a
StrWhiteSpaceChar and all characters
following that character. (In other
words, remove leading white space.)
Let sign be 1.
If S is not empty and the first character of S is a minus sign -, let
sign be −1.
If S is not empty and the first character of S is a plus sign + or a
minus sign -, then remove the first
character from S.
Let R = ToInt32(radix).
Let stripPrefix be true.
If R ≠ 0, then a. If R < 2 or R > 36, then return NaN. b. If R ≠ 16, let
stripPrefix be false.
Else, R = 0 a. Let R = 10.
If stripPrefix is true, then a. If the length of S is at least 2 and the
first two characters of S are either
“0x” or “0X”, then remove the first
two characters from S and let R = 16.
If S contains any character that is not a radix-R digit, then let Z be the
substring of S consisting of all
characters before the first such
character; otherwise, let Z be S.
If Z is empty, return NaN.
Let mathInt be the mathematical integer value that is represented by Z
in radix-R notation, using the letters
A-Z and a-z for digits with values 10
through 35. (However, if R is 10 and Z
contains more than 20 significant
digits, every significant digit after
the 20th may be replaced by a 0 digit,
at the option of the implementation;
and if R is not 2, 4, 8, 10, 16, or
32, then mathInt may be an
implementation-dependent approximation
to the mathematical integer value that
is represented by Z in radix-R
notation.)
Let number be the Number value for mathInt.
Return sign × number.
NOTE parseInt may interpret only a
leading portion of string as an
integer value; it ignores any
characters that cannot be interpreted
as part of the notation of an integer,
and no indication is given that any
such characters were ignored.
There are two important parts here. I bolded both of them. So first of all, we have to find out what the toString representation of null is. We need to look at Table 13 — ToString Conversions in section 9.8.0 for that information:
Great, so now we know that doing toString(null) internally yields a 'null' string. Great, but how exactly does it handle digits (characters) that aren't valid within the radix provided?
We look above to 15.1.2.2 and we see the following remark:
If S contains any character that is
not a radix-R digit, then let Z be the
substring of S consisting of all
characters before the first such
character; otherwise, let Z be S.
That means that we parse all characters before the first one that is not a valid digit in the specified radix and ignore everything from that character on.
Basically, doing parseInt(null, 24) is the same thing as parseInt('null', 24). The u causes the two l's to be ignored (even though they ARE valid digits in radix 24). Therefore, we can only parse n, making the entire statement synonymous to parseInt('n', 24). :)
Either way, great question!
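The truncation step quoted above can be mimicked by hand. This sketch covers only that one step (it ignores whitespace, sign and 0x handling), and leadingDigits is a hypothetical helper, not something from the spec.

```javascript
const DIGIT_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz";

// Hypothetical helper: keep the characters before the first one that is not a radix-R digit
function leadingDigits(text, radix) {
  const allowed = DIGIT_CHARS.slice(0, radix);
  const lower = text.toLowerCase();
  let end = 0;
  while (end < lower.length && allowed.includes(lower[end])) {
    end++;
  }
  return text.slice(0, end);
}

console.log(leadingDigits("null", 24)); // "n"
console.log(leadingDigits("null", 31)); // "null"
console.log(leadingDigits("null", 23)); // "" (empty, so parseInt returns NaN)
console.log(parseInt(leadingDigits("null", 24), 24) === parseInt(null, 24)); // true
```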
parseInt( null, 24 ) === 23
Is equivalent to
parseInt( String(null), 24 ) === 23
which is equivalent to
parseInt( "null", 24 ) === 23
The digits for base 24 are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, ..., n.
The language spec says
If S contains any character that is not a radix-R digit, then let Z be the substring of S consisting of all characters before the first such character; otherwise, let Z be S.
which is the part that ensures that C-style integer literals like 15L parse properly,
so the above is equivalent to
parseInt( "n", 24 ) === 23
"n" sits at zero-based index 23 of the digit list above, so its value is 23.
Q.E.D.
I guess null gets converted to a string "null". So n is actually 23 in 'base24' (same in 'base25'+), u is invalid in 'base24' so the rest of the string null will be ignored. That's why it outputs 23 until u will become valid in 'base31'.
parseInt uses alphanumeric representation, then in base-24 "n" is valid, but "u" is invalid character, then parseInt only parses the value "n"....
parseInt("n",24) -> 23
as an example, try with this:
alert(parseInt("3x", 24))
The result will be "3".