What is the regex that properly splits SVG 'd' attributes into tokens?


I am trying to split the d attribute on a path tag in an svg file into tokens.
This one is relatively easy:
d = "M 2 -12 C 5 15 21 19 27 -2 C 17 12 -3 40 5 7"
tokens = d.split(/[\s,]/)
But this is also a valid d attribute:
d = "M2-12C5,15,21,19,27-2C17,12-3,40,5,7"
The tricky parts are that letters and numbers are no longer separated, and negative numbers use only the minus sign as the separator. How can I create a regex that handles this?
The rules seem to be:
split wherever there is white space or a comma
split numerics from letters (and keep "-" with the numeric)
I know I can use lookaround, for example:
tokens = pathdef.split(/(?<=\d)(?=\D)|(?<=\D)(?=\d)/)
I'm having trouble forming a single regex that also splits on the minus signs and keeps the minus sign with the numbers.
The above code should tokenize as follows:
[ 'M', '2', '-12', 'C', '5', '15', '21', '19', '27', '-2', 'C', '17', '12', '-3', '40', '5', '7' ]

Brief
Unfortunately, JavaScript engines that predate ES2018 don't support lookbehinds, so in those environments your options are fairly limited and the regex in the Other Regex Engines section below will not work for you (although it will with some other regex engines).
Other Regex Engines
Note: The regex in this section (Other Regex Engines) will not work in JavaScript engines without lookbehind support. See the JavaScript solution in the Code section instead.
I think with your original regex you were trying to get to:
[, ]|(?<![, ])(?=-|(?<=[a-z])\d|(?<=\d)[a-z])
This regex allows you to split on those matches: a comma or a space, or positions that are followed by -, or positions where a letter precedes a digit, or positions where a digit precedes a letter.
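For engines that do support lookbehind (it was added to JavaScript in ES2018), the split-based approach could be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name splitPathTokens is hypothetical, and the i flag is an added assumption so that the [a-z] classes also cover the uppercase command letters used in the question.

```javascript
// Hypothetical helper: split an SVG path string with the lookbehind regex above.
// Assumes an engine with lookbehind support (ES2018 or later).
function splitPathTokens(d) {
  // Split on a comma or a space, or at zero-width positions (not directly
  // after a comma/space) that precede "-" or sit on a letter/digit boundary.
  const splitRe = /[, ]|(?<![, ])(?=-|(?<=[a-z])\d|(?<=\d)[a-z])/i;
  return d.split(splitRe);
}

console.log(splitPathTokens("M2-12C5,15,21,19,27-2C17,12-3,40,5,7"));
console.log(splitPathTokens("M 2 -12 C 5 15 21 19 27 -2 C 17 12 -3 40 5 7"));
```

Consecutive separators (for example a comma followed by a space) would produce empty strings with this approach, which is one reason matching tokens tends to be simpler than splitting.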
Code
var a = [
  "M 2 -12 C 5 15 21 19 27 -2 C 17 12 -3 40 5 7",
  "M2-12C5,15,21,19,27-2C17,12-3,40,5,7"
]
var r = /-?(?:\d*\.)?\d+|[a-z]/gi
a.forEach(function(s){
  console.log(s.match(r));
});
Explanation
-?(?:\d*\.)?\d+|[a-z] Match either of the following
-?(?:\d*\.)?\d+
-? Match - literally zero or one time
(?:\d*\.)? Match the following zero or one time
\d* Match any number of digits
\. Match a literal dot
\d+ Match one or more digits
[a-z] Match any character in the range from a-z (any lowercase alpha character - since i modifier is used this also matches uppercase variants of those letters)
I added (?:\d*\.)? because (to the best of my knowledge) you can have decimal number values in SVG d attributes.
Note: Changed the original regex portion of \d+(?:\.\d+)? to (?:\d*\.)?\d+ in order to catch numbers that don't have the whole number part such as .5 as per @Thomas (see comments below question).
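As a small sketch of how the match-based regex might be wrapped for reuse, the helper below (tokenizePath is a hypothetical name, not from the answer) returns command letters as strings and numeric tokens as numbers. It assumes plain decimal notation only; exponent forms such as 1e-5 would be split apart by this regex.

```javascript
// Hypothetical helper built on the match-based regex from this answer.
function tokenizePath(d) {
  const tokenRe = /-?(?:\d*\.)?\d+|[a-z]/gi;
  // match() returns null when nothing matches, so fall back to an empty list.
  const tokens = d.match(tokenRe) || [];
  // Convert numeric tokens to numbers; leave command letters as strings.
  return tokens.map(function (t) {
    return /[a-z]/i.test(t) ? t : Number(t);
  });
}

console.log(tokenizePath("M2-12C5,15,21,19,27-2C17,12-3,40,5,7"));
console.log(tokenizePath("m.5-.25")); // [ 'm', 0.5, -0.25 ]
```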

You could go for
-?\d+|[A-Z]
See a demo on regex101.com.
Here, instead of splitting, you could very well just match them:
matches = "M 2 -12 C 5 15 21 19 27 -2 C 17 12 -3 40 5 7".match(/-?\d+|[A-Z]/g)
// matches holds the different tokens

Related

Regex to match numbers from large document in Javascript

Trying to create a regex that could match numbers from large document.
Find at least 10 continuous digits (which can go to maximum 15 digits) that could be separated by one or multiple
-
_
\s
(
)
[
]
Tried-
/(?:((\d([ \-_\s]+?)){5,8}))/
Eg:
1-2-3-4-5-6-7-8-9-0-12-34
1 2 3 4 5 6 7 8 9 0
123-456-789-0
123---456---789---987
12 34 56 78 90
12_ -34_-56--78__90

You may use
/\d(?:[-_\][()\s]*\d){9,14}/g
See the regex demo
Details
\d - a digit
(?:[-_\][()\s]*\d){9,14} - 9 to 14 repetitions of
[-_\][()\s]* - 0 or more repetitions of -, _, ], [, (, ) or whitespace
\d - a digit.
Note you do not need to escape [ inside a character class, it is parsed as a literal [ in a JS regex. However, ] must be escaped there, otherwise, it will close the character class prematurely.
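A minimal sketch of using this regex on a larger text, assuming you also want the separators stripped from each hit (the helper name extractDigitRuns is hypothetical):

```javascript
// Hypothetical helper: find runs of 10 to 15 digits that may be separated by
// -, _, brackets, parentheses or whitespace, and return only the digits.
function extractDigitRuns(text) {
  const runRe = /\d(?:[-_\][()\s]*\d){9,14}/g;
  const matches = text.match(runRe) || []; // null when there is no match
  return matches.map(function (m) {
    return m.replace(/\D/g, ""); // drop every non-digit separator
  });
}

console.log(extractDigitRuns("123-456-789-0"));       // [ '1234567890' ]
console.log(extractDigitRuns("12_ -34_-56--78__90")); // [ '1234567890' ]
console.log(extractDigitRuns("123-456"));             // []
```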

Regex to select all characters that do not match a pattern

I'm weak with regexes but have put together the following regex, which matches when my pattern is met. The problem is that I need to select any characters that do not fit the pattern.
/^\d{1,2}[ ]\d{1,2}[ ]\d{1,2}[ ][AB]/i
Correct pattern is:
## ## ## A|B aka [0 < x <= 90]*space*[0 < x <= 90] [0 < x <= 90] [A|B]
EG:
12 34 56 A → good
12 34 56 B → good
12 34 5.6 A → bad - select .
12 34 5.6 C → bad - select . and C
1A 23 45 6 → bad - select A and 6
Edit:
My impression was that a regex can validate both the characters and the pattern/sequence at the same time. The simple question is how to select characters that are not non-negative numbers, spaces, or the allowed letters.

Answer 1
Brief
This isn't really realizable with 1 regex due to the nature of the regex. This answer provides a regex that will capture the last incorrect entry. For multiple incorrect entries, a loop must be used. You can correct the incorrect entries by running some code logic on the resulting captured groups to determine why it isn't valid.
My ultimate suggestion would be to split the string by a known delimiter (in this case the space character) and then use some logic (or even a small regex) to determine why it's incorrect and how to fix it, as seen in Answer 2.
Non-matches
The following logic is applied in my second answer.
For any users wondering what I did to catch incorrect matches: At the most basic level, all this regex is doing is adding |(.*) to every subsection of the regex. Some sections required additional changes for catching specific invalid string formats, but the |(.*) or slight modifications of this will likely solve anyone else's issues.
Other modifications include:
Using opposite tokens
For example: Matching a digit
Original regex: \d
Opposite regex \D
For example: Matching a digit or whitespace
Original regex: [\d\s]
Opposite regex: [^\d\s]
Note: [\D\S] is incorrect because it matches the union of both sets, that is, any non-digit or any non-whitespace character. Since non-whitespace includes digits and non-digits include whitespace, every character is matched.
Negative lookaheads
For example: Catching up to 31 days in a month
Original regex \b(?:[0-2]?\d|3[01])\b
Opposite regex: \b(?![0-2]?\d\b|3[01]\b)\d+\b
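The opposite-token and negative-lookahead ideas above can be checked with a short sketch. The sample strings and variable names are made up for illustration.

```javascript
const sample = "a1 ";

// [^\d\s]: characters that are neither digits nor whitespace
console.log(sample.match(/[^\d\s]/g)); // [ 'a' ]

// [\D\S]: non-digit OR non-whitespace, which covers every character
console.log(sample.match(/[\D\S]/g)); // [ 'a', '1', ' ' ]

// Negative lookahead: numbers that are NOT accepted by the day-of-month regex
const invalidDayRe = /\b(?![0-2]?\d\b|3[01]\b)\d+\b/g;
console.log("5 31 32 45".match(invalidDayRe)); // [ '32', '45' ]
```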
Code
First, creating a more correct regex that also ensures 0 < x <= 90 as per the OP's question.
^(?:(?:[0-8]?\d|90) ){3}[AB]$
See regex in use here
^(?:(?:(?:[0-8]?\d|90) |(\S*) ?)){3}(?:[AB]|(.*))$
Note: This regex uses the mi flags (multiline - assuming input is in that format, and case-insensitive)
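The stricter validation regex at the start of this section (the one without capture groups) could be exercised on individual lines roughly like this. The helper name isValidLine is hypothetical, and only the i flag is used here because each string is tested on its own rather than as multiline input.

```javascript
// Validation-only regex from this answer: three numbers (0-90), then A or B.
const validLineRe = /^(?:(?:[0-8]?\d|90) ){3}[AB]$/i;

// Hypothetical helper: true when the whole line fits the pattern.
function isValidLine(line) {
  return validLineRe.test(line);
}

console.log(isValidLine("0 45 90 A"));   // true
console.log(isValidLine("0 45 91 A"));   // false (91 is out of range)
console.log(isValidLine("12 34 5.6 A")); // false
console.log(isValidLine("12 34 56 C"));  // false
```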
Other Formats
Realistically, this following regex would be ideal. Unfortunately, JavaScript doesn't support some of the tokens used in the regex, but I feel it may be useful to the OP or other users that see this question.
See regex in use here
^(?:(?:(?:[0-8]?\d|90) |(?<n>\S*?) |(?<n>\S*?) ?)){3}(?:(?<n>\S*) )?(?:[AB]|(.*))$
Results
Input
The first section (sections separated by the extra newline/break) shows valid strings, while the second shows invalid strings.
0 45 90 A
0 45 90 B
-1 45 90 A
0 45 91 A
12 34 5.6 A
12 34 56 C
1A 23 45 6
11 1A 12 12 A
12 12 A
12 12 A
Output
0 45 90 A VALID
0 45 90 B VALID
-1 45 90 A INVALID: -1
0 45 91 A INVALID: 91
12 34 5.6 A INVALID: 5.6
12 34 56 C INVALID: C
1A 23 45 6 INVALID: 1A, 6
11 1A 12 12 A INVALID: 12 A
12 12 A INVALID: (missing value)
12 12 A INVALID: A, (missing value)
Note: The last entry shows an odd output, but that's due to a limitation with JavaScript's regex engine. The Other Formats section describes this and another method to use to properly catch these cases (using a different regex engine)
Explanation
This uses a simple | (OR) and captures the incorrect matches into a capture group.
^ Assert position at the start of the line
(?:(?:(?:[0-8]?\d|90) |(\S*) ?)){3} Match the following exactly 3 times
(?:(?:[0-8]?\d|90) |(\S*) ?) Match either of the following
(?:[0-8]?\d|90) Match either of the following, followed by a space character literally
[0-8]?\d Match between zero and one of the characters in the set 0-8 (a digit between 0 and 8), followed by any digit
90 Match 90 literally
(\S*) ? Capture any non-whitespace character zero or more times into capture group 1, followed by zero or one space character literally
(?:[AB]|(.*)) Match either of the following
[AB] Match any character present in the set (A or B)
(.*) Capture any character any number of times into capture group 2
$ Assert position at the end of the line
Answer 2
Brief
This method splits the string on the given delimiter and tests each section for the proper set of characters. It outputs a message if the value is incorrect. You would likely replace the console outputs with whatever logic you want to use.
Code
var arr = [
  "0 45 90 A",
  "0 45 90 B",
  "-1 45 90 A",
  "0 45 91 A",
  "12 34 5.6 A",
  "12 34 56 C",
  "1A 23 45 6",
  "11 1A 12 12 A",
  "12 12 A",
  "12 12 A"
];
arr.forEach(function(e) {
  var s = e.split(" ");
  var l = s.pop();
  var numElements = 3;
  var maxNum = 90;
  var syntaxErrors = [];
  if(s.length != numElements) {
    syntaxErrors.push(`Invalid number of elements: Number = ${numElements}, Given = ${s.length}`);
  }
  s.forEach(function(v) {
    if(v.match(/\D/)) {
      syntaxErrors.push(`Invalid value "${v}" exists`);
    } else if(!v.length) {
      syntaxErrors.push(`An empty value or double space exists`);
    } else if(Number(v) > maxNum) {
      syntaxErrors.push(`Value greater than ${maxNum} exists: ${v}`);
    }
  });
  if(l.match(/[^AB]/)) {
    syntaxErrors.push(`Last element ${l} in "${e}" is invalid`);
  }
  if(syntaxErrors.length) {
    console.log(`"${e}" [\n\t${syntaxErrors.join('\n\t')}\n]`);
  } else {
    console.log(`No errors found in "${e}"`);
  }
});

Global regex matching stopping mid-string

I'm trying to extract groups of numbers from a string.
These numbers can either be on their own or as a range in the format \d+ - \d+, while the range indicator between the two numbers can vary and the numbers can either have the prefix M- or STR . These groups can occur 1 to n times in a given string, but the matching should stop if a group is followed by any character that is not a number, whitespace or one of the prefixes mentioned above, even if further numbers can be found afterwards.
As an example, the following lines
01
05,07
05, 7
M-01, M-12
311,STR 02
M-56
STR 17
01 - Random String 25-31 Random other string
M-04 Random String 01
M-17,3,148,14 to 31
M-17,3,STR 148,14 to 31 - Random String
M-17,3,148,14- 31 Random, String 02 Random, other string
STR 17,3,12 to 18, 148 ,M-14- 31 : Random String 02
Should return
01
05;07
05;7
01;12
311;02
56
17
01
04
17;3;148;14 to 31
17;3;148;14 to 31
17;3;148;14- 31
17;3;12 to 18;148;14- 31
I'm using javascript and can almost get a correct result by running
var pattern = /(\d+)\s?(?:-|~|to)?\s?(\d+)?/ig
while (result = pattern.exec(line)) {console.log(result)}
but I can't figure out how to not match numbers after the first string, i.e. M-17,3,148,14 to 31 - Random string 46 Random string would return the values 17;3;148;14 to 31;46, while 46 should not be matched.
I'm not really concerned over the format of the results since I'm sanitizing them anyway afterwards, so it doesn't matter if '03 ' comes back as '03' or '03 '. This is also true for number ranges, 15 - 17 can either be returned as 15 - 17 or, like in the example above, use capturing groups to determine the upper and lower bound, but I still need to be able to tell if two numbers are separate or a range, so 5,8,10-12 can't be returned as 5;8;10;12.
My ultimate goal is to extract all possible values in each line. After I extracted all number ranges, I loop through each result to get all possible values, e.g. 5,8,10-12 would become 5;8;10;11;12.
If it is somehow possible, and this is purely optional, I'd also like to preserve the string after the last number range, e.g. STR 14, 23 Some String 18 Some other string should return 14;23 and separately Some String 18 Some other string.
I'd be grateful if anybody has an idea on how to solve this.

Here's my attempt.
[
  '01',
  '05,07',
  '05, 7',
  'M-01, M-12',
  '311,STR 02',
  'M-56',
  'STR 17',
  '01 - Random String 25-31 Random other string',
  'M-04 Random String 01',
  'M-17,3,148,14 to 31',
  'M-17,3,STR 148,14 to 31 - Random String',
  'M-17,3,148,14- 31 Random, String 02 Random, other string',
  'STR 17,3,12 to 18, 148 ,M-14- 31 : Random String 02',
  '14 ~ 16',
  'Random String 15',
  '1to3',
  'M-01 to STR 6',
  '17 56'
].forEach(function(str) {
  var rangeRe = /(?:\s*,\s*)(?:M-|STR )?(\d+)(?:\s*(?:-|~|to)\s*(\d+))?/g,
      ranges = [],
      lastIndex = 1,
      match;
  str = ',' + str;
  while (match = rangeRe.exec(str)) {
    // Push a lower and upper bound onto the list of ranges
    ranges.push([+match[1], +(match[2] || match[1])]);
    lastIndex = rangeRe.lastIndex;
  }
  // Log the original string, the ranges and the remainder
  console.log([
    str.slice(1),
    ranges.map(function(pair) {
      return pair[0] + '-' + pair[1];
    }).join(' ; '),
    str.slice(lastIndex)
  ]);
});
Here are the rules I've followed:
Numbers consist of consecutive digits.
A range consists of either a single number or a pair of numbers.
If a range features a pair they can be separated by -, ~ or to, plus arbitrary whitespace either side of the separator.
A range (note range, not number) can be prefixed by M- or STR. No extra whitespace is permitted between the prefix and the range.
Ranges are separated by , plus arbitrary whitespace either side of the ,.
Each range is parsed into an array pair consisting of the lower and upper bound. For a single-number range the same value is used for both bounds.
I've used the statefulness of exec. Each iteration of the loop begins matching where the previous match left off. The lastIndex is tracked so that we can generate the remaining 'random string' at the end.
I add a , out the front of the string before I start. This allows the RegExp to assume that all ranges start with a ,, avoiding the need for a special case of the first range.
A key difference from some of the RegExps that you posted is that I made the entire 'range separator and upper bound' section optional as a unit, rather than making them individually optional. The result of this is that an input like 17 56 would treat the 56 as the 'random string' and not as an upper bound. The range would be treated as 17-17.
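Since the question's stated end goal is to turn each range into all of its values (5,8,10-12 becoming 5;8;10;11;12), the lower/upper pairs produced above could be expanded with something like the following sketch. The name expandRanges is hypothetical, and it assumes each pair is already ordered as [lower, upper]; a pair with lower greater than upper contributes nothing.

```javascript
// Hypothetical helper: expand [lower, upper] pairs into every value they cover.
function expandRanges(ranges) {
  const values = [];
  for (const [lower, upper] of ranges) {
    for (let n = lower; n <= upper; n++) {
      values.push(n);
    }
  }
  return values;
}

console.log(expandRanges([[5, 5], [8, 8], [10, 12]]).join(";")); // 5;8;10;11;12
```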

So, after getting a coffee I think I figured out something close to a solution:
function extractNumbers(line){
  // Strip the M- and STR prefixes in front of numbers
  var str = line.replace(/(?:M-\s?|STR )(\d+)/ig,'$1')
  // Everything from the first letter onwards is the trailing text (may be absent)
  var rightpart = str.match(/([a-z].*)/i)
  var rest = rightpart ? rightpart[1] : ''
  var leftpart = rest ? str.replace(rest,'') : str
  var pattern = /(\d+)\s?(?:-|~|to)?\s?(\d+)?/ig
  var result
  while (result = pattern.exec(leftpart)) {console.log(result)}
  console.log(rest)
}
This function outputs all number ranges and then the rest of the string to the console. There are chances for false positives because it first replaces all occurrences of M- and STR followed by a number, even if they occur in the right part of the string. The chances of this exact sequence of characters occurring in the right part are probably small, but still..
If anybody has an answer to the original question or an idea on how to eliminate the chance for false positives, I would love to see it.

Regex match number between 1 and 31 with or without leading 0

I want to setup some validation on an <input> to prevent the user from entering wrong characters. For this I am using ng-pattern. It currently disables the user from entering wrong characters, but I also noticed this is not the expected behavior so I am also planning on creating a directive.
I am using
AngularJS: 1.6.1
What should the regex match
Below are the requirements for the regex string:
Number 0x to xx (example 01 to 93)
Number x to xx (example 9 to 60)
Characters are not allowed
Special characters are not allowed
Notice:
the 'x' is variable and could be any number between 0 and 100.
The number on the place of 'x' is variable so if it is possible to create a string that is easily changeable that would be appreciated!
What I tried
A few regex strings I tried were:
1) ^0*([0-9]\d{1,2})$
--> Does match 01 but not 1
--> Does match 32 where it shouldn't
2) ^[1-9][0-9]?$|^31$
--> Does match 1 but not 01
--> Does match 32 where it shouldn't
For testing I am using https://regex101.com/tests.
What am I missing in my attempts?

If your aim is to match 1 to 100, here's a way, based on the previous solution.
\b(0?[1-9]|[1-9][0-9]|100)\b
Basically, there are 3 parts to that match...
0?[1-9] Addresses numbers 1 to 9, by mentioning that 0 might be present
[1-9][0-9] covers numbers 10 to 99, the [1-9] representing the tens
100 covers 100
Were you to set the upper boundary to 42, the middle part of the expression would become [1-3][0-9] (covering 10 to 39) and the last part would become 4[0-2] (covering 40 to 42) like so:
\b(0?[1-9]|[1-3][0-9]|4[0-2])\b
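Because the regex has to be rewritten by hand for every new upper bound, one alternative (a different technique from the pure-regex answer above, shown only as a sketch) is to let a regex check the shape of the input and compare the value numerically. The names makeRangeValidator and isUpTo42 are hypothetical.

```javascript
// Hypothetical factory: returns a validator for whole numbers from 1 to max.
function makeRangeValidator(max) {
  // Shape check: a single digit 1-9 with an optional leading 0, or a
  // multi-digit number without a leading zero.
  const shapeRe = /^(?:0?[1-9]|[1-9]\d+)$/;
  return function (input) {
    return shapeRe.test(input) && Number(input) <= max;
  };
}

const isUpTo42 = makeRangeValidator(42);
console.log(isUpTo42("01"), isUpTo42("42"), isUpTo42("43"), isUpTo42("0"));
// true true false false
```

Changing the bound is then a matter of passing a different number rather than editing the pattern.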

This should work:
^(0?[1-9]|[12][0-9]|3[01])$
https://regex101.com/r/BYSDwz/1

Regular Expression Phone Number Validation

I wrote this regular expression for Lebanese phone numbers. Basically, a number should start with
00961 or +961, which is the international code, then the area code, which
could be either any digit from 0 to 9 or a cellular code "70" or "76" or
"79", then exactly a 6-digit number.
I have coded the following regex without the 6-digit part:
^(([0][0]|[+])([9][6][1])([0-9]{1}|[7][0]|[7][1]|[7][6]|[7][8]))$
When I want to add code to ensure only 6 more digits are allowed to the expression:
^(([0][0]|[+])([9][6][1])([0-9]{1}|[7][0]|[7][1]|[7][6]|[7][8])([0-9]{6}))$
it seems to accept 5 or 6 digits, not exactly 6 digits.
I am having difficulty finding what's wrong.

Use this regex: ((00)|(\+))961((\d)|(7[0168]))\d{6}

This is what I would use.
/^(00|\+)961(\d|7[069])\d{6}$/
00 or +
961
a 1-digit number or 70 or 76 or 79
a 6-digit number

The [0-9]{1} will also match the first digit of the cellular codes 7x, since 7 is between 0 and 9. This means that a cellular code followed by only 5 digits will still match: the 7 is taken as the area code and the remaining six digits as the number.
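A quick sketch of that effect, using the anchored regex /^(00|\+)961(\d|7[069])\d{6}$/ suggested earlier on this page (the variable name looseRe is made up for illustration):

```javascript
// The single-digit alternative \d also accepts 7, so a cellular code followed
// by only five digits is still accepted.
const looseRe = /^(00|\+)961(\d|7[069])\d{6}$/;

// 76 followed by six digits: accepted, area code captured as "76"
console.log(looseRe.exec("0096176123456")[2]); // 76

// 76 followed by only five digits: also accepted, area code captured as "7"
console.log(looseRe.exec("009617612345")[2]); // 7
```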

Try
/^(00961|\+961)([0-9]|70|76|79)\d{6}$/.test( phonenumber );
//^ start of string
// ^^^^^^^^^^^^^ 00961 or +961
//              ^^^^^^^^^^^^^^^^ a digit 0 to 9 or 70 or 76 or 79
//                              ^^^^^ 6 digits
//                                   ^ end of string

The cellular code is forming a trap, as @ellak points out:
/^((00)|(\+))961((\d)|(7[0168]))\d{6}$/.test("009617612345"); // true
Here the code should break like this: 00 961 76 12345,
but the RegEx actually breaks it like this: 00 961 7 612345, because 7 is matched by \d, and the remaining six digits match \d{6} exactly.
I'm not sure if this is actually valid, but I guess this is not what you want, otherwise the RegEx in your question should work.
Here's a kinda long RegEx that avoids the trap:
/^(00|\+)961([0-68-9]\d{6}|7[234579]\d{5}|7[0168]\d{6})$/
A few test results:
/(00|\+)961([0-68-9]\d{6}|7[234579]\d{5}|7[0168]\d{6})/.test("009617012345")
false
/(00|\+)961([0-68-9]\d{6}|7[234579]\d{5}|7[0168]\d{6})/.test("009618012345")
true
/(00|\+)961([0-68-9]\d{6}|7[234579]\d{5}|7[0168]\d{6})/.test("009617612345")
false
/(00|\+)961([0-68-9]\d{6}|7[234579]\d{5}|7[0168]\d{6})/.test("0096176123456")
true

Just recently, the Lebanese Ministry of Telecommunication has changed area codes on the IMS. So the current Regex matcher becomes:
^(00|\+)961[ -]?(2[1245789]|7[0168]|8[16]|\d)[ -]?\d{6}$
Prefix: 00 OR +
Country code: 961
Area code: 1-digit or 2-digits; including 2*, 7*, 8*..., OR a single digit for Ogero numbers on the old IMS network starting with 0*, and finally older mobile lines starting with 03.
The 6-digit number
News on the961.com
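As a rough sketch, the updated pattern could be wrapped in a small check like the one below. The helper name isLebaneseNumber and the sample numbers are made up for illustration.

```javascript
// Updated pattern from this answer, with optional space or dash separators.
const lebaneseRe = /^(00|\+)961[ -]?(2[1245789]|7[0168]|8[16]|\d)[ -]?\d{6}$/;

// Hypothetical helper: true when the string fits the pattern.
function isLebaneseNumber(s) {
  return lebaneseRe.test(s);
}

console.log(isLebaneseNumber("+961 70 123456")); // true
console.log(isLebaneseNumber("0096170123456"));  // true
console.log(isLebaneseNumber("+961-3-123456"));  // true
console.log(isLebaneseNumber("+96170 12345"));   // false (only five digits)
console.log(isLebaneseNumber("961 70 123456"));  // false (missing 00 or +)
```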
