Regex for custom decimal and thousand separator - javascript

I am using the regex below to handle a custom thousand separator, which can be any of , or . or a space character. It works for the thousand separator but not for the decimal indicator.
I am trying to add a new capturing group to handle the decimal indicator (, or .) with a maximum of 2 decimals, but with it the regex breaks for the thousand separator.
^[+]?(?:\d{1,3}(?:(,|.| )\d{3})*|\d+)?,?$
How to add a capturing group to handle decimal with custom character? Any Ideas?
Valid Inputs:
1234
123.45
123,45
1234.56
1234,56
123
1,234
12,345
1,234,567
12,345,678
123,456,789
12
1.234
12.345
1.234.567
12.345.678
123.456.789
123
1 234
12 345
123 456
1 234 567
12 345 678
123 456 789
123.4567
123,4567
1,345.67
1.345,67
1 345.67
12,345.67
12.345,67
12 345.67
123,456,789.34
123.456.789,34
123 456 789.34
Not Valid:
12.345.67
12,345,67
12 345 67
123 456 789 34

Your specification is ambiguous: if ',' is accepted as the decimal indicator, should 123,456 be parsed as the number 123456 or as 123.456 (one thousandth of it)? You can resolve the ambiguity by disallowing exactly three decimals, but at a high cost: the user has to understand that typing three decimals by mistake gives odd results (123,456 is parsed as 123456.0, while 123,4560 is parsed as 123.456). That is hard for a user to accept. A more workable rule is that a single , or . is a decimal point, while if both indicators appear, the first is the group separator and the second is the decimal point.
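The rule proposed above can be sketched in JavaScript. This is only an illustration under stated assumptions: the function name parseLooseNumber is made up here, spaces are treated purely as group separators, a separator that repeats with no other separator present is assumed to be a group separator, and no validation of group sizes is attempted.

```javascript
// Hypothetical helper illustrating the "single separator = decimal point,
// two kinds of separator = group separator then decimal point" rule.
function parseLooseNumber(text) {
  // Assumption: spaces are only ever group separators, so drop them.
  const compact = text.replace(/ /g, '');
  const separators = compact.match(/[.,]/g) || [];
  const kinds = new Set(separators);
  let normalized;
  if (kinds.size === 2) {
    // Both , and . appear: the one that occurs last is the decimal point.
    const decimal = separators[separators.length - 1];
    const group = decimal === '.' ? ',' : '.';
    normalized = compact.split(group).join('').replace(decimal, '.');
  } else if (separators.length === 1) {
    // A single separator is taken as the decimal point.
    normalized = compact.replace(separators[0], '.');
  } else {
    // Assumption: one kind of separator used several times (or none at all)
    // means group separators only.
    normalized = compact.replace(/[.,]/g, '');
  }
  return Number(normalized);
}
```

With this rule, parseLooseNumber('1.345,67') and parseLooseNumber('1,345.67') both give 1345.67, while parseLooseNumber('123,456') gives 123.456 rather than 123456.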
IMHO the space should never be used as a decimal indicator (if it is used as a group separator, use it as the only digit group separator; some programming languages, e.g. Java, allow _ as a digit group separator); nobody uses it that way. It can be preferable to use no decimal indicator at all (making the number an integer scaled by 10, 100 or 1000, as desktop calculators have long done), because people doing quick data entry would rather key the extra zeros than move a finger to find the decimal point and then type two more digits most of the time. It is worse still if they have to go to the letter keys to find the space bar. (Of course it is even harder to reach the underscore _ character, but quick typists don't use group separators.)
On the other hand, people normally don't key thousands separators; they exist only for readability (computers add them when printing, but never need them when reading). In that scenario, users sometimes don't want rigid groups of three digits but want to group digits arbitrarily. For example, a user may want groups of three to the left of the decimal point and groups of five or ten to the right (something your specification does not contemplate at all), so that pi appears as:
3.14159 26535 89793 23846 26433 83
I agree that using the alternate decimal point as group separator could be interesting, but at both sides of the actual decimal point, and never forcing groups of three.
Anyway, just to fit on your specs, I've written the following lex(1) specification to parse your input.
pfx [1-9][0-9]?[0-9]?
grp [0-9][0-9][0-9]
dec [0-9]*
e1 [+-]?{pfx}([.]{grp})*([,]{dec})?
e2 [+-]?{pfx}([,]{grp})*([.]{dec})?
e3 [+-]?{pfx}([ ]{grp})*([.,]{dec})?
e4 [+-]?[1-9][0-9]*([,.]{dec})?
e5 [+-]?0?([,.]{dec})?
%%
{e1}|{e2}|{e3}|{e4}|{e5} printf("\033[32m[%s]\033[m\n", yytext);
[0-9., +-]* printf("\033[31m[%s]\033[m\n", yytext);
. |
\n |
\t ;
%%
int main()
{
    yylex();
}
int yywrap()
{
    return 1;
}
Your regular expression, complete, should be something like:
[+-]?[0-9]{1,3}([ ][0-9]{3})*([,.]([0-9]{3}[ ])*[0-9]{1,3})?|[+-]?[0-9]{1,3}([ ][0-9]{3})*([,.][0-9]{0,2})?|[+-]?[0-9]{0,2}[,.]([0-9]{3}[ ])*[0-9]{1,3}|[+-]?[0-9]{1,3}([,][0-9]{3})*([.]([0-9]{3}[,])*[0-9]{1,3})?|[+-]?[0-9]{1,3}([,][0-9]{3})*([.][0-9]{0,2})?|[+-]?[0-9]{0,2}[.]([0-9]{3}[,])*[0-9]{1,3}|[+-]?[0-9]{1,3}([.][0-9]{3})*([,]([0-9]{3}[.])*[0-9]{1,3})?|[+-]?[0-9]{1,3}([.][0-9]{3})*([,][0-9]{0,2})?|[+-]?[0-9]{0,2}[,]([0-9]{3}[.])*[0-9]{1,3}|[+-]?[0-9]*[,.][0-9]+|[+-]?[0-9]+[,.][0-9]*|[+-]?[0-9]+
Note
Some regexp engines don't treat the | operator as commutative, as it should be: they try the alternatives from left to right and keep the first one that matches, instead of the longest match as POSIX tools such as sed(1) do (the most visible case I know is regex101.com). That forces you to put the operands in a particular order to match some strings. I consider this a bug, and unfortunately it is widespread. The expression above works fine with sed(1), but it doesn't match correctly on regex101 (there should be far fewer matches).
I've also written a bash script (shown below) that runs sed(1) with the above regexp, so you can see how it works on your own machine:
dig="[0-9]"
af0="${dig}{0,2}"
af1="${dig}{1,3}"
grp="${dig}{3}"
t01="[+-]?${af1}([ ]${grp})*([,.](${grp}[ ])*${af1})?"
t02="[+-]?${af1}([ ]${grp})*([,.]${af0})?"
t03="[+-]?${af0}[,.](${grp}[ ])*${af1}"
t04="[+-]?${af1}([,]${grp})*([.](${grp}[,])*${af1})?"
t05="[+-]?${af1}([,]${grp})*([.]${af0})?"
t06="[+-]?${af0}[.](${grp}[,])*${af1}"
t07="[+-]?${af1}([.]${grp})*([,](${grp}[.])*${af1})?"
t08="[+-]?${af1}([.]${grp})*([,]${af0})?"
t09="[+-]?${af0}[,](${grp}[.])*${af1}"
t10="[+-]?${dig}*[,.]${dig}+"
t11="[+-]?${dig}+[,.]${dig}*"
t12="[+-]?${dig}+"
s01="${t01}|${t02}|${t03}"
s02="${t04}|${t05}|${t06}"
s03="${t07}|${t08}|${t09}"
s04="${t10}|${t11}|${t12}"
reg="${s01}|${s02}|${s03}|${s04}"
echo "$reg"
sed -E -e "s/${reg}/<&>/g"
You can find all this code (and updates) here.

The following regex will match all the cases from your example:
^[+]?(?:\d{1,3}(?:([,. ])\d{3})*|\d+)?(?:[,.]\d+?){0,1}$
The last part, (?:[,.]\d+?){0,1}, makes the decimal part optional.

There you go:
^[+]?(?:\d{1,3}(?:(,|.| )\d{3})*|\d+)?((?<!,\d{3})(,\d+)|(?<!\.\d{3})(\.\d+))?$

Assuming
123.4567
123,4567
123 4567
are not valid, you can use:
^[+-]?(?:(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d\d)?|(?:\d{1,3}(?:\.\d{3})*|\d+)(?:,\d\d)?|(?:\d{1,3}(?: \d{3})*|\d+)(?:[,.]\d\d)?)$
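As a quick check, the last pattern above can be run against some of the question's sample inputs in JavaScript. The pattern is copied from the answer; the constant name groupedNumber and the helper isGroupedNumber are made up for this sketch.

```javascript
// Pattern copied from the answer above: one alternative per group separator
// (comma, dot, space), each followed by an optional two-digit decimal part.
const groupedNumber = /^[+-]?(?:(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d\d)?|(?:\d{1,3}(?:\.\d{3})*|\d+)(?:,\d\d)?|(?:\d{1,3}(?: \d{3})*|\d+)(?:[,.]\d\d)?)$/;

// Hypothetical wrapper so the check reads naturally at the call site.
function isGroupedNumber(text) {
  return groupedNumber.test(text);
}
```

For example, isGroupedNumber('1.345,67') and isGroupedNumber('123 456 789.34') are true, while isGroupedNumber('12.345.67') and isGroupedNumber('123 456 789 34') are false. Because the decimal part is fixed at two digits, 123.4567 is rejected as well, in line with the assumption stated in the answer.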

Related

Need regex for mobile number start with 61 to 99 with 10 digit number

The requirement is that the mobile number must start with 61 to 99,
like 61xxxxxxxx, 62xxxxxxxx, ..., 99xxxxxxxx.
I need a regular expression to match this case.
If the mobile number starts with 0, 11, 12 or anything less than 61, it should be invalid.
The mobile number is at most 10 digits; no country code is needed.
You're probably better off using whatever programming language you have to check whether the first 2 digits are in range; that is far simpler and probably faster too. However, if you strictly want to use a regex, this will do:
^(?:6[1-9]|[7-9][0-9])\d{8}$
It checks the first digit: if it is a 6, the next digit must be in the range [1-9]; if it is a 7, 8 or 9 (i.e. range [7-9]), the next digit can be anything in [0-9]. Exactly 8 more digits must then follow, and the ^ and $ anchors make sure nothing else surrounds those 10 digits.
Of course, this is a simple and easy to understand solution that checks the first digit and then matches the next. If your regex flavor supports negative lookbehind, you could probably shorten it a bit more (sacrificing readability for brevity), but I prefer this version.
You could generate the prefix for the numbers and add a pattern for the remaining 8 digits.
Something like this
const regexp = new RegExp('('+[...Array(39).keys()].map(key => key + 61).join('|') + ')\\d{8,8}')
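Both suggestions from the answers above can be compared side by side. The following is a minimal sketch, assuming the number arrives as a string of exactly 10 digits; the names mobilePattern and isValidMobile are invented for the example.

```javascript
// Regex approach: first two digits in 61..99, then exactly 8 more digits.
const mobilePattern = /^(?:6[1-9]|[7-9][0-9])\d{8}$/;

// Plain-code approach: check the shape first, then compare the prefix
// numerically. Hypothetical helper, not taken from the answers.
function isValidMobile(value) {
  if (!/^\d{10}$/.test(value)) return false;
  const prefix = Number(value.slice(0, 2));
  return prefix >= 61 && prefix <= 99;
}
```

The two checks should agree on every input; the second one keeps the range 61 to 99 visible as numbers, which can be easier to change later.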

How do I combine 2 regex patterns into 1 and use it within a function

I have a regex for checking that a number has no more than 15 significant figures, borrowed from this SO answer:
/^-?(?=\d{1,15}(?:[.,]0+)?0*$|(?:(?=.{1,16}0*$)(?:\d+[.,]\d+))).+$/
The other is used to check that the same number has up to 2 decimal places (truncating it):
/^-?(\d*\.?\d{0,2}).*/
I have almost 0 regex skill.
Question: how do I combine the 2 regexes so that the result does the work of both, i.e. AND rather than OR? (OR is what the | character does, and I am not sure it achieves the same thing as combining both.)
Something like:
/^-?(?=\d{1,15}(?:[.,]0+)?0*$|(?:(?=.{1,16}0*$)(?:\d+[.,]\d+))).+$ <AND&&NOTOR>(\d*\.?\d{0,2}).*/
Thanks in advance
EDIT: edit moved to a separate SO question
If you only want to add the condition of a maximum of 2 decimal places to the first regex, try this:
^-?(?=\d{1,15}(?:[.,]0+)?0*$|(?:(?=[,.\d]{1,16}0*$)(?:\d+[.,]\d{1,2}$))).+$
Here I changed the original \d+ after the separator to \d{1,2}$ (and the . in the inner lookahead to [,.\d]).
Edited for the request to extract 15 significant figures: the following version wraps the match in capture group 1 ($1) and limits its length, so that the significant figures can be extracted easily.
^(-?(?=\d{1,15}(?:[.,]0+)?0*$|(?:(?=[,.\d]{1,16}0*$)(?:\d+[.,]\d{1,2}$))).{1,16}).*$
Here .+$ was changed to .{1,16}.
If the number matches, it can be replaced with $1; if it does not match, nothing is replaced, so the original unmatched number remains.
Therefore, if you want to extract 15 significant figures by replacing with $1 only when your condition is satisfied, use this regex in your function:
^(-?(?=\d{1,15}(?:[.,]0+)?0*$|(?:(?=[,.\d]{1,16}0*$)(?:\d+[.,]\d{1,2}$))).{1,16}).*$|^.*$
Here all numbers are matched, but only the numbers satisfying your condition are captured in $1, in the 15-significant-figure format.
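A different way to get the AND behaviour inside a function is simply to apply the two patterns one after the other instead of merging them. The sketch below does that; it is an alternative to the single-regex answer above, not a restatement of it. The first pattern is the one from the question. The second is a variant of the question's truncation pattern with the trailing .* removed, so that the match itself is the truncated number. The function name truncateIfValid is hypothetical.

```javascript
// Pattern from the question: at most 15 significant figures.
const maxFifteenSignificant = /^-?(?=\d{1,15}(?:[.,]0+)?0*$|(?:(?=.{1,16}0*$)(?:\d+[.,]\d+))).+$/;

// Variant of the question's second pattern (assumption: no trailing .*),
// so that match[0] is the number cut off after two decimal places.
const upToTwoDecimals = /^-?\d*\.?\d{0,2}/;

// Hypothetical helper: returns the truncated number as a string, or null
// when the first condition is not met.
function truncateIfValid(text) {
  if (!maxFifteenSignificant.test(text)) return null;
  // upToTwoDecimals can match the empty string, so exec never returns null.
  return upToTwoDecimals.exec(text)[0];
}
```

For example, truncateIfValid('123.456') returns '123.45', and a 16-digit integer such as '1234567890123456' returns null.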

RegEx to filter out all but one decimal point [duplicate]

I need a regular expression for decimal/float numbers like 12, 12.2, 1236.32, 123.333 and +12.00 or -12.00 or ...123.123... for use in JavaScript and jQuery.
Thank you.
Optionally match a + or - at the beginning, followed by one or more decimal digits, optionally followed by a decimal point and one or more decimal digits, until the end of the string:
/^[+-]?\d+(\.\d+)?$/
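The right expression should be as follows:
[+-]?([0-9]*[.])?[0-9]+
Used as an unanchored search, it finds a number in each of the following (for inputs with a trailing dot, such as +1. and 1., the match stops before the dot):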
+1
+1.
+.1
+0.1
1
1.
.1
0.1
Here is a Python example:
import re
#print if found
print(bool(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0')))
#print result
print(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0').group(0))
Output:
True
1.0
If you are using a Mac, you can test it on the command line:
python -c "import re; print(bool(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0')))"
python -c "import re; print(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0').group(0))"
You can validate the text and also enforce a single decimal point by combining the regex with isNaN:
var val = $('#textbox').val();
var floatValues = /[+-]?([0-9]*[.])?[0-9]+/;
if (val.match(floatValues) && !isNaN(val)) {
    // your function
}
This is an old post but it was the top search result for "regular expression for floating point" or something like that and doesn't quite answer _my_ question. Since I worked it out I will share my result so the next person who comes across this thread doesn't have to work it out for themselves.
All of the answers thus far accept a leading 0 on numbers with two (or more) digits to the left of the decimal point (e.g. 0123 instead of just 123). This isn't really valid, and in some contexts it is used to indicate that the number is in octal (base-8) rather than the regular decimal (base-10) format.
These expressions also accept a decimal with no leading zero (.14 instead of 0.14) or without a trailing fractional part (3. instead of 3.0). That is valid in some programming contexts (including JavaScript), but I want to disallow both (because for my purposes they are more likely to be an error than intentional).
Ignoring "scientific notation" like 1.234E7, here is an expression that meets my criteria:
/^((-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/
or if you really want to accept a leading +, then:
/^((\+|-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/
I believe that regular expression will perform a strict test for the typical integer or decimal-style floating point number.
When matched:
$1 contains the full number that matched
$2 contains the (possibly empty) leading sign (+/-)
$3 contains the value to the left of the decimal point
$5 contains the value to the right of the decimal point, including the leading .
By "strict" I mean that the number must be the only thing in the string you are testing.
If you want to extract just the float value out of a string that contains other content use this expression:
/((\b|\+|-)(0|([1-9][0-9]*))(\.[0-9]+)?)\b/
Which will find -3.14 in "negative pi is approximately -3.14." or in "(-3.14)" etc.
The numbered groups have the same meaning as above (except that $2 is now an empty string ("") when there is no leading sign, rather than null).
But be aware that it will also try to extract whatever numbers it can find. E.g., it will extract 127.0 from 127.0.0.1.
If you want something more sophisticated than that then I think you might want to look at lexical analysis instead of regular expressions. I'm guessing one could create a look-ahead-based expression that would recognize that "Pi is 3.14." contains a floating point number but Home is 127.0.0.1. does not, but it would be complex at best. If your pattern depends on the characters that come after it in non-trivial ways you're starting to venture outside of regular expressions' sweet-spot.
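The behaviour described above can be tried out with a short JavaScript sketch. Both patterns are copied from this answer; the helper names splitNumber and findNumber are invented here for illustration.

```javascript
// Strict pattern (with optional leading + or -) and the "embedded" pattern,
// both as given in the answer above.
const strictFloat = /^((\+|-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/;
const embeddedFloat = /((\b|\+|-)(0|([1-9][0-9]*))(\.[0-9]+)?)\b/;

// Hypothetical helper: split a strictly formatted number into its parts,
// or return null when the whole string is not such a number.
function splitNumber(text) {
  const m = strictFloat.exec(text);
  if (m === null) return null;
  // Groups 2 and 5 are optional, so fall back to an empty string.
  return { sign: m[2] || '', whole: m[3], fraction: m[5] || '' };
}

// Hypothetical helper: return the first number found inside other text.
function findNumber(text) {
  const m = embeddedFloat.exec(text);
  return m === null ? null : m[1];
}
```

splitNumber('-12.50') gives { sign: '-', whole: '12', fraction: '.50' }, while '0123', '.14' and '3.' all give null. findNumber('negative pi is approximately -3.14.') returns '-3.14', and findNumber('127.0.0.1') returns '127.0', which is the over-eager extraction the answer warns about.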
Paulpro and lbsweek answers led me to this:
re=/^[+-]?(?:\d*\.)?\d+$/;
>> /^[+-]?(?:\d*\.)?\d+$/
re.exec("1")
>> Array [ "1" ]
re.exec("1.5")
>> Array [ "1.5" ]
re.exec("-1")
>> Array [ "-1" ]
re.exec("-1.5")
>> Array [ "-1.5" ]
re.exec(".5")
>> Array [ ".5" ]
re.exec("")
>> null
re.exec("qsdq")
>> null
For anyone new:
I made a RegExp for the E scientific notation (without spaces).
const floatR = /^([+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?)$/;
let str = "-2.3E23";
let m = floatR.exec(str);
parseFloat(m[1]); //=> -2.3e+23
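If you want to accept Unicode decimal digits, note that in JavaScript \d is equivalent to [0-9], even with the u flag. You would instead replace every [0-9] with \p{Nd},
which requires the Unicode flag u at the end of the RegExp. (parseFloat only understands ASCII digits, though.)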
For a better understanding of the pattern see https://regexper.com/.
And for making RegExp, I can suggest https://regex101.com/.
EDIT: found another site for viewing RegExp in color: https://jex.im/regulex/.
EDIT 2: although op asks for RegExp specifically you can check a string in JS directly:
const isNum = (num)=>!Number.isNaN(Number(num));
isNum("123.12345678E+3");//=> true
isNum("80F");//=> false
converting the string to a number (or NaN) with Number()
then checking if it is NOT NaN with !Number.isNaN()
If you want it to work with e, use this expression:
[+-]?[0-9]+([.][0-9]+)?([eE][+-]?[0-9]+)?
Here is a JavaScript example:
var re = /^[+-]?[0-9]+([.][0-9]+)?([eE][+-]?[0-9]+)?$/;
console.log(re.test('1'));
console.log(re.test('1.5'));
console.log(re.test('-1'));
console.log(re.test('-1.5'));
console.log(re.test('1E-100'));
console.log(re.test('1E+100'));
console.log(re.test('.5'));
console.log(re.test('foo'));
Here is my JS method, which handles 0s at the head of the string.
1- ^0[0-9]+\.?[0-9]*$ : finds numbers that start with 0 followed by more digits before the decimal separator ("."). I use this to distinguish strings such as "0.111" from "01.111".
2- ([1-9]{1}[0-9]*\.?[0-9]*) : if the string starts with 0, only the part beginning at the first nonzero digit is taken into account. The parentheses are there because I want to capture only the part that conforms to the regex.
3- ([0-9]*\.?[0-9]*) : used otherwise, to capture only the decimal-number part of the string.
In JavaScript, st.match(regex) returns an array whose first element contains the conforming part. I use this method in the input element's onChange event: if the user enters something that violates the regex, the violating part is not shown in the element's value at all, but any part that conforms to the regex stays in the element's value.
const floatRegexCheck = (st) => {
    const regx1 = new RegExp("^0[0-9]+\\.?[0-9]*$"); // for finding numbers starting with 0
    let regx2 = new RegExp("([1-9]{1}[0-9]*\\.?[0-9]*)"); // if regx1 matches then this will remove 0s at the head.
    if (!st.match(regx1)) {
        regx2 = new RegExp("([0-9]*\\.?[0-9]*)"); // if the number does not have a 0 at the head of the string then standard decimal formatting takes place
    }
    st = st.match(regx2);
    if (st?.length > 0) {
        st = st[0];
    }
    return st;
}
Here is a more rigorous answer:
^[+-]?0(?![0-9])\.[0-9]*(?![.])$|^[+-]?[1-9]{1}[0-9]*\.[0-9]*$|^[+-]?\.[0-9]+$
The following values will match (a leading + or - sign also works):
.11234
0.1143424
11.21
1.
The following values will not match:
00.1
1.0.00
12.2350.0.0.0.0.
.
....
How it works
(?!regex) is a negative lookahead, i.e. a NOT operation.
Let's break the regex down at the | operator, which is the same as a logical OR.
^[+-]?0(?![0-9])\.[0-9]*(?![.])$
This part checks values that start with 0.
First, match an optional + or - sign (0 or 1 times): ^[+-]?
Then match the leading zero: 0
The character after it must not be a digit, because we don't want to accept 00.123: (?![0-9])
Then match the dot exactly once, followed by any number of fraction digits: \.[0-9]*
Last, reject the value if another dot follows the fraction part: (?![.])$
Now the second part:
^[+-]?[1-9]{1}[0-9]*\.[0-9]*$
^[+-]? is the same as above.
If the value starts with a nonzero digit, match that digit exactly once, followed by any number of digits: [1-9]{1}[0-9]* e.g. 12.3, 1.2, 105.6
Match the dot once, followed by any number of digits: \.[0-9]*$
Now the third part:
^[+-]?\.[0-9]+$
This part checks values that start with a dot, e.g. .12, .34565
^[+-]? is the same as above.
Match the dot once, followed by one or more digits: \.[0-9]+$
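The listed examples can be checked with a few lines of JavaScript. This sketch assumes the dots in the pattern are meant as literal dots and therefore writes them as \. throughout; the names dottedNumber and isDottedNumber are made up for the example.

```javascript
// The three alternatives explained above, with every literal dot escaped
// (assumption: a literal dot is what the explanation intends).
const dottedNumber = /^[+-]?0(?![0-9])\.[0-9]*(?![.])$|^[+-]?[1-9]{1}[0-9]*\.[0-9]*$|^[+-]?\.[0-9]+$/;

// Hypothetical wrapper around the pattern.
function isDottedNumber(text) {
  return dottedNumber.test(text);
}
```

With this pattern, isDottedNumber returns true for .11234, 0.1143424, 11.21 and 1. and false for 00.1, 1.0.00, a lone dot and a run of dots. Note that, written this way, every alternative requires a dot, so a plain integer such as 12 is not accepted.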

Regular expression to check phone numbers at word level

I'm trying to write a RegEx to test whether a number is valid. By valid I mean any number that matches the country calling codes and also follows the telephone number format standardized by ITU-T in recommendation E.164. This specifies that the entire number should be 15 digits or shorter and begin with a country prefix, as said here, so I did this:
^\+\d{2}|\d{3}([0-9])\d{7}$
But it's not working. In my case (VE numbers can't match the RegEx, since they are validated in another way) these inputs are valid:
+1420XXXXXXXXXXX // Slovakia - X is a digit and there could be more, though 5 is the minimum
001420XXXXXXXXXX // Slovakia - I've changed from + to 00
420XXXXXXXXXXXXX // Slovakia - I've removed the 00 or + but the number is still valid
+40XXXXXXXXXXXXX // Romania
Invalid numbers are the ones that don't match the RegEx and the ones starting with +58, since those are from VE. So, in summary, a valid number should have:
+XX|+XXX plus 12|11 digits (5 minimum), where XX|XXX is the country code; since the maximum is 15 digits, 12 or 11 digits follow depending on the country format
Can anyone help me with this? It's one I would call complex.
A few strange things are going on with your regexp:
\d is shorthand for [0-9]; it's fine to use both, but I'm wondering why they're mixed
what you are searching for with your OR (|) is "something that starts with +XX", i.e. a plus and two digits (^\+\d{2}), OR "something that ends with XXXXXXXXXXX", i.e. 11 digits (\d{3}([0-9])\d{7}$)
You need to group the OR choices (with parentheses), otherwise it is everything to the left or everything to the right (simplistically):
^\+(\d{2}|\d{3})([0-9])\d{7}$
There is, however, another way of giving the number of occurrences: {m,n} means "occurs between m and n times". So you could say ^\+\d{7,15}$ (where 7 is your minimum of 5 plus the minimum country code length of 2).
To really do this, however, you might want to take a look at libphonenumber (https://code.google.com/p/libphonenumber/), which provides complete validation and formatting for all phone numbers and is available as JavaScript.
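As a rough illustration of the {m,n} suggestion, the check can be split into small steps in JavaScript. This is a sketch under assumptions taken from the question rather than a real E.164 validator: a leading + or 00 is optional, 7 to 15 digits must follow, and numbers whose country code is 58 are rejected. The function name isPlausibleNumber is invented for the example.

```javascript
// Hypothetical helper combining the answer's length check with the
// question's extra rules (optional + or 00 prefix, no country code 58).
function isPlausibleNumber(input) {
  // Remove a single leading "+" or "00" if present.
  const digits = input.replace(/^(?:\+|00)/, '');
  // Between 7 and 15 digits, as suggested in the answer above.
  if (!/^\d{7,15}$/.test(digits)) return false;
  // Numbers from VE (country code 58) are validated elsewhere.
  return !digits.startsWith('58');
}
```

Stripping the prefix before testing avoids a subtle trap: a single pattern with an optional 00 prefix and a (?!58) lookahead can backtrack and treat the 00 as ordinary digits, which would let 0058... through.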

Any way to reliably compress a short string?

I have a string exactly 53 characters long that contains a limited set of possible characters.
[A-Za-z0-9\.\-~_+]{53}
I need to reduce this to length 50 without loss of information and using the same set of characters.
I think it should be possible to compress most strings down to 50 length, but is it possible for all possible length 53 strings? We know that in the worst case 14 characters from the possible set will be unused. Can we use this information at all?
Thanks for reading.
If, as you stated, your output strings have to use the same set of characters as the input string, and if you don't know anything special about the requirements of the input string, then no, it's not possible to compress every possible 53-character string down to 50 characters. This is a simple application of the pigeonhole principle.
Your input strings can be represented as a 53-digit number in base 67, i.e., an integer from 0 to 67^53 - 1 ≅ 6*10^96.
You want to map those numbers to an integer from 0 to 67^50 - 1 ≅ 2*10^91.
So by the pigeonhole principle, you're guaranteed that 67^3 = 300,763 different inputs will map to each possible output -- which means that, when you go to decompress, you have no way to know which of those 300,763 originals you're supposed to map back to.
To make this work, you have to change your requirements. You could use a larger set of characters to encode the output (you could get it down to 50 characters if each one had 87 possible values, instead of the 67 in the input). Or you could identify redundancy in the input -- perhaps the first character can only be a '3' or a '5', the nineteenth and twentieth are a state abbreviation that can only have 62 different possible values, that sort of thing.
If you can't do either of those things, you'll have to use a compression algorithm, like Huffman coding, and accept the fact that some strings will be compressible (and get shorter) and others will not (and will get longer).
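The counting argument in this answer can be reproduced exactly with JavaScript's BigInt. The snippet below is only a check of the arithmetic; the variable and function names are made up for the example.

```javascript
// 67 symbols: A-Z, a-z, 0-9 and the five punctuation characters . - ~ _ +
const alphabetSize = 67n;
const inputCount = alphabetSize ** 53n;   // distinct 53-character strings
const outputCount = alphabetSize ** 50n;  // distinct 50-character strings

// How many inputs must share each output on average: 67^3.
const inputsPerOutput = inputCount / outputCount;

// Smallest alphabet size for which `length` characters can still
// distinguish `count` different values (simple linear search).
function smallestAlphabetFor(count, length) {
  let size = 2n;
  while (size ** length < count) size += 1n;
  return size;
}

const neededSize = smallestAlphabetFor(inputCount, 50n);
```

Here inputsPerOutput evaluates to 300763n and neededSize to 87n, matching the figures quoted in the answer.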
What you ask is not possible in the most general case, which can be proven very simply.
Say it was possible to encode an arbitrary 53 character string to 50 chars in the same set. Do that, then add three random characters to the encoded string. Then you have another arbitrary, 53 character string. How do you compress that?
So what you want can not be guaranteed to work for any possible data. However, it is possible that all your real data has low enough entropy that you can devise a scheme that will work.
In that case, you will probably want to do some variant of Huffman coding, which basically allocates variable-bit-length encodings for the characters in your set, using the shortest encodings for the most commonly used characters. You can analyze all your data to come up with a set of encodings. After Huffman coding, your string will be a (hopefully shorter) bitstream, which you encode to your character set at 6 bits per character. It may be short enough for all your real data.
A library-based encoding like Smaz (referenced in another answer) may work as well. Again, it is impossible to guarantee that it will work for all possible data.
One byte (character) can encode 256 values (0-255) but your set of valid characters uses only 67 values, which can be represented in 7 bits (alas, 6 bits gets you only 64) and none of your characters uses the high bit of the byte.
Given that, you can throw away the high bit and store only 7 bits, running the initial bits of the next character into the "spare" space of the first character. This would require only 47 bytes of space to store (53 x 7 = 371 bits, and 371 / 8 = 46.4, rounded up to 47).
This is not really considered compression, but rather a change in encoding.
For example "ABC" is 0x41 0x42 0x43
0x41 0x42 0x43 // hex values
0100 0001 0100 0010 0100 0011 // binary
100 0001 100 0010 100 0011 // drop high bit
// run it all together
100000110000101000011
// split as 8 bits (and pad to 8)
10000011 00001010 00011[000]
0x83 0x0A 0x18
As an example these 3 characters won't save any space, but your 53 characters will always come out as 47, guaranteed.
Note, however, that the output will not be in your original character set, if that is important to you.
The process becomes:
original-text --> encode --> store output-text (in database?)
retrieve --> decode --> original-text restored
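The 7-bit packing described above could look roughly like this in JavaScript. It is a deliberately simple sketch that builds a string of bits rather than using bit shifts; the function names pack7 and unpack7 are invented, and the decoder is assumed to know the original length (53 in the question).

```javascript
// Pack 7-bit ASCII text into bytes, 7 bits per character, zero-padded
// at the end to a whole number of bytes.
function pack7(text) {
  let bits = '';
  for (const ch of text) {
    const code = ch.charCodeAt(0);
    if (code > 127) throw new RangeError('not a 7-bit character: ' + ch);
    bits += code.toString(2).padStart(7, '0'); // drop the unused high bit
  }
  while (bits.length % 8 !== 0) bits += '0';   // pad to a byte boundary
  const bytes = new Uint8Array(bits.length / 8);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(bits.slice(i * 8, i * 8 + 8), 2);
  }
  return bytes;
}

// Reverse the packing; `length` is the number of original characters.
function unpack7(bytes, length) {
  let bits = '';
  for (const b of bytes) bits += b.toString(2).padStart(8, '0');
  let text = '';
  for (let i = 0; i < length; i++) {
    text += String.fromCharCode(parseInt(bits.slice(i * 7, i * 7 + 7), 2));
  }
  return text;
}
```

pack7('ABC') yields the bytes 0x83 0x0A 0x18 from the worked example, and any 53-character input yields 47 bytes.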
If I remember correctly, Huffman coding is going to be the most compact way to store the data. It has been too long since I used it for me to write the algorithm quickly, but if I remember correctly what you do is:
get the count for each character that is used
prioritize them based on how frequently they occurred
build a tree based off the prioritization
get the compressed bit representation of each character by traversing the tree (start at the root, left = 0 right = 1)
replace each character with the bits from the tree
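The steps listed above can be sketched in JavaScript as follows. This is a minimal illustration that re-sorts a plain array instead of using a priority queue, which is fine for a 67-symbol alphabet; the function names buildHuffmanCodes and encodeBits are made up for the example, and the output is a string of '0' and '1' characters rather than packed bytes.

```javascript
// Build a map from character to bit string following the steps above.
function buildHuffmanCodes(text) {
  // 1. Count how often each character is used.
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1);

  let nodes = [...counts].map(([ch, count]) => ({ ch, count, left: null, right: null }));
  if (nodes.length === 0) return new Map();
  if (nodes.length === 1) return new Map([[nodes[0].ch, '0']]);

  // 2 and 3. Repeatedly merge the two least frequent nodes into a tree.
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.count - b.count);
    const [left, right] = nodes.splice(0, 2);
    nodes.push({ ch: null, count: left.count + right.count, left, right });
  }

  // 4. Walk the tree from the root: left = 0, right = 1.
  const codes = new Map();
  const walk = (node, prefix) => {
    if (node.ch !== null) {
      codes.set(node.ch, prefix);
      return;
    }
    walk(node.left, prefix + '0');
    walk(node.right, prefix + '1');
  };
  walk(nodes[0], '');
  return codes;
}

// 5. Replace each character with its bits from the tree.
function encodeBits(text, codes) {
  return [...text].map((ch) => codes.get(ch)).join('');
}
```

For the input 'aaaabbc', the most frequent character gets a 1-bit code and the other two get 2-bit codes, so the encoded form is 10 bits long instead of the 21 bits that a fixed 7 bits per character would need for 3 characters alone.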
Smaz is a simple compression library suitable for compressing very short strings.
