JS RegEx challenge: replacing recurring patterns of commas and numbers

I am fiddling with a program to convert Japanese addresses into romaji (latin alphabet) for use in an emergency broadcast system for foreigners living in a Japanese city.
Emergency evacuation warnings are sent out to lists of areas all at once. I would like to be able to copy/paste this Japanese list of areas and spit out the romanized equivalent.
example Japanese input:
3条4～12丁目、15～18条12丁目、2、3条5丁目
(this list is of three areas, where 条(jo) and 丁目(chome) indicate block numbers in north-south and east-west directions, respectively)
The numbers are fine as they are, and I have already written code to replace the characters 条 and 丁目 with their romanized equivalents. My program currently outputs the first two areas (correctly) as "3-jo 4~12-chome" and "15~18-jo 12-chome"
However, I would like to replace patterns like the one in the last area, "2、3条5丁目" (meaning blocks 2 and 3 of 5-chome), such that the output reads "2&3-jo 5-chome".
The regular expression that denotes this pattern is \d*、\d* (note the Japanese format comma)
I am still getting used to regex - how can I replace the comma found in all \d*、\d* patterns with an "&"? Note that I can't simply replace all commas because they are also used to separate areas.

The easiest way is to isolate sequences like 15、18 and replace all commas in them.
var text = "3条4～12丁目、15～18条12丁目、2、3条5丁目";
var result = text
  // Inside runs like "2、3", turn each comma into "&"
  .replace(/(?:\d+、)+\d+/g, function (match) {
    return match.replace(/、/g, "&");
  })
  .replace(/条/g, '-jō ')
  .replace(/丁目/g, '-chōme')
  .replace(/～/g, '-')
  // Any commas still left separate areas
  .replace(/、/g, ', ');
// result => "3-jō 4-12-chōme, 15-18-jō 12-chōme, 2&3-jō 5-chōme"
(Also... Where the heck do you live that has 丁 well-ordered by cardinal directions? Where I live, addresses are a mess... :P )
(Also also, thanks to sainaen for nitpicking my regexps into perfection :) )

Related

Matching all expressions using JS regex

I need to match every expression that is followed by its acronym (example: Laugh out Loud (LoL)), whether it has 2, 3 or more words. My regex only works for expressions with a 3-character acronym. How do I make the regex generic (without hard-coding the length as 3) so that expressions of any length are selected?
The link shared provides an overview of it.
The last expressions,
light amplification by stimulated emission of radiation (LASER)
Green Skill Development Programme (GSDP)
are not selected by the regexes below:
\b(\w)[\w']*[^a-zA-Z()]* (\w)[\w']*[^a-zA-Z()]* (\w)[\w']*[^a-zA-Z()]* \(\1\2\3\)
\b(?:\w[\w']* [^a-zA-Z]*){3} ?\([A-Z]{3}\)
https://regex101.com/r/QPMo5M/1

You can try the following:
/\b(\w)[-'\w]* (?:[-'\w]* ){1,}\(\1[A-Z]{1,}\)/gi
UPDATE
As @ikegami commented, this sloppy regex also matches things like Bring some drinks (beer) and Bring something to put on the grill (BBQ). I think these cases can be filtered out with ordinary JavaScript code after the regex matching. In the case of Bring some drinks (beer), we can detect it from the fact that (beer) has no uppercase letters. In the case of Bring something to put on the grill (BBQ), we can detect it from the fact that there are no matching initial letters for the second B and the Q in Bring something to put on the grill.
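The post-filtering idea could be sketched roughly as follows. The helper name isPlausibleAcronym and the list of skipped small words are assumptions made for illustration; they are not part of the original answer. The check rejects acronyms without any uppercase letter and compares the acronym with the initials of the last words of the phrase.

```javascript
// Hypothetical list of small words that do not contribute a letter (an assumption).
var SKIPPED_WORDS = ["by", "of"];

// Hypothetical helper: returns true when `acronym` looks like it was built
// from the initials of the last words of `phrase`.
function isPlausibleAcronym(phrase, acronym) {
  // Reject empty acronyms and ones with no uppercase letter, e.g. "beer".
  if (!acronym || !/[A-Z]/.test(acronym)) return false;

  var initials = phrase
    .trim()
    .split(/\s+/)
    .filter(function (word) {
      return SKIPPED_WORDS.indexOf(word.toLowerCase()) === -1;
    })
    .map(function (word) {
      return word.charAt(0).toUpperCase();
    });

  // Compare (case-insensitively) with the initials of the last N remaining words.
  return initials.slice(-acronym.length).join("") === acronym.toUpperCase();
}
```

Because only the last N words are compared, a phrase with extra leading words (such as a whole sentence ending in Natural Language Processing) still passes; trimming the match down to those N words would be a separate step.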
UPDATE 2
When we match the following string by using the regex above:
We need to use technologies from Natural Language Processing (NLP).
It matches "need to use technologies from Natural Language Processing (NLP)", not "Natural Language Processing (NLP)". These problems should be tackled also.
UPDATE 3
The following regex matches acronyms whose length is from 2 to 5 and it doesn't have the issues mentioned above. And I think it can be quite easily extended to support longer length as you want:
/\b(\w)\S* (?:(?:by |of )?(\w)\S* (?:(?:by |of )?(\w)\S* (?:(?:by |of )?(\w)\S* (?:(?:by |of )?(\w)\S* )?)?)?) *\(\1\2\3\4\5\)/gi
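As a rough sketch of how the pattern might be extended to longer acronyms, the nested optional groups can be generated instead of typed by hand. The function name buildAcronymRegex is hypothetical, and the sketch assumes a maximum length of at least 2; for a maximum of 5 it produces the same pattern as the regex above.

```javascript
// Hypothetical builder: generates the nested pattern for acronyms of
// 2..maxLen letters. Assumes maxLen >= 2.
function buildAcronymRegex(maxLen) {
  // One word, optionally preceded by a skipped "by " or "of ".
  var unit = "(?:by |of )?(\\w)\\S* ";

  // Words 3..maxLen are optional and nested inside each other.
  var tail = "";
  for (var i = maxLen; i >= 3; i--) {
    tail = "(?:" + unit + tail + ")?";
  }

  // Backreferences \1\2...\N; unmatched groups match the empty string in JavaScript.
  var refs = "";
  for (var j = 1; j <= maxLen; j++) {
    refs += "\\" + j;
  }

  return new RegExp("\\b(\\w)\\S* (?:" + unit + tail + ") *\\(" + refs + "\\)", "gi");
}

var sentence = "We need to use technologies from Natural Language Processing (NLP).";
var found = sentence.match(buildAcronymRegex(5));
// found[0] is "Natural Language Processing (NLP)"
```

Very large maximum lengths mean deep nesting and more backtracking, so it is probably wise to keep the limit modest.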

\b(\w)[-'\w]* (?:[-`."?,~=#!/\\|+:;%°*#£&^€$¢¥§'\w]* ){2,}\(\1[A-Z]{2,}\)
I placed some special characters in between

Mixing european numbers and units with persian text

I am working on localizing the frontend of our app for the Iranian market, but we are currently stuck on mixing European numerals and units with Persian text. For example, our desired format for one text is:
Result: <value>s (e.g. Result: 50s)
But when I try to compose this string in JavaScript, the number 50 ends up before the text (after it, in Persian reading order), like this:
50 :persiantext s
Is there a way to mix these things together, or does it not even make sense to mix them, and it should all be in Persian?
Thank you for all help/suggestions.

Use placeholders in your text and then replace it with the number. If you have
Result: {$d}s
for english, and
{$d}s :نتیجه
for persian, then you can replace {$d} with 50 and get the correct text in both cases.
There are a few libraries that can help with replacing variables, like Underscore or Lodash (though they use a slightly different syntax for the variables); see their template function.
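Without a library, the placeholder replacement could be sketched like this. The function name fillTemplate is hypothetical, and the sketch assumes placeholders of the form {$name} as in the examples above; unknown placeholders are left untouched.

```javascript
// Hypothetical helper: replaces {$name} placeholders with values[name].
function fillTemplate(template, values) {
  return template.replace(/\{\$(\w+)\}/g, function (match, key) {
    // Leave the placeholder as-is when no value was supplied for it.
    return Object.prototype.hasOwnProperty.call(values, key)
      ? String(values[key])
      : match;
  });
}

var english = fillTemplate("Result: {$d}s", { d: 50 });
// english is "Result: 50s"
```

The same call works for the Persian template, because the translator decides where the placeholder sits in the string.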

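When concatenating mixed RTL/LTR text, you should wrap the embedded run in the corresponding Unicode directional formatting characters (as shown in this Java question - String concatenation containing Arabic and Western characters). In the line below, \u202A is LEFT-TO-RIGHT EMBEDDING and \u202C is POP DIRECTIONAL FORMATTING: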
myArabicString + "\u202A" + englishNumberAndText + "\u202C" + moreArabic
Alternatively, most RTL languages have their own native digits. To use that approach you will need to write your own code to replace each individual digit, similar to the code in convert numbers from arabic to english.
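A minimal sketch of such a digit replacement, assuming the target is the Persian digit set (U+06F0 to U+06F9); the function name toPersianDigits is hypothetical:

```javascript
// Hypothetical helper: maps the Western digits 0-9 to Persian digits
// (assumed here to be the code points U+06F0..U+06F9).
function toPersianDigits(text) {
  return String(text).replace(/[0-9]/g, function (digit) {
    return String.fromCharCode(0x06F0 + Number(digit));
  });
}

var converted = toPersianDigits("50s");
// converted is "\u06F5\u06F0s": only the digits change, the unit stays as-is
```

Whether the unit should also be translated is a separate question for someone who reads Persian.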
Mixing in Latin punctuation like ":" and "!" needs to be done carefully: you may need to wrap it in RTL/LTR marks, and make sure to review the results with people who actually know how the text should look.
Side note: you may want to check out JavaScript equivalent to printf/string.format if you need a lot of formatting.

Phone number validation - excluding non repeating separators

I have the following regex for phone number validation
function validatePhonenumber(phoneNum) {
var regex = /^[1-9]{3}[-\s\.]{0,1}[0-9]{3}[-\s\.]{0,1}[0-9]{4}$/;
return regex.test(phoneNum);
}
However, I would like to make sure it doesn't pass for mixed separators, such as in
111-222.3333
Any ideas how to make sure the separators are the same always?

Just make sure beforehand that there is at most one kind of separator, then pass the string through the regex as you were doing.
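function validatePhonenumber(phoneNum) {
  var separators = extractSeparators(phoneNum);
  if (separators.length > 1) return false;
  var regex = /^[1-9]{3}[-\s\.]{0,1}[0-9]{3}[-\s\.]{0,1}[0-9]{4}$/;
  return regex.test(phoneNum);
}
function extractSeparators(str) {
  // Return an array with all the distinct chars
  // that are present in the passed string
  // and are not numeric (0-9)
  var distinct = [];
  var nonDigits = str.replace(/[0-9]/g, "");
  for (var i = 0; i < nonDigits.length; i++) {
    if (distinct.indexOf(nonDigits[i]) === -1) distinct.push(nonDigits[i]);
  }
  return distinct;
}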

You can use the following regex instead:
\d{3}([-\s\.])?\d{3}\1?\d{4}
Here is a working example:
http://regex101.com/r/nN9nT7/1
It gives the following results:
111-222-3333 --> ok
111.222.3333 --> ok
111 222 3333 --> ok
111-222.3333 --> no match
111.222-3333 --> no match
111-222 3333 --> no match
111 222-3333 --> no match
EDIT: after Alan Moore's suggestion:
Also matches 111-2223333. That's because you made the \1 optional,
which isn't necessary. One of JavaScript's stranger quirks is that a
backreference to a group that did not participate in the match,
succeeds anyway. So if there's no first separator, ([-\s.])? succeeds
because the ? made it optional, and \1 succeeds because it's
JavaScript. But I would have used ([-\s.]?) to capture the first
separator (which might be nothing), and \1 to match the same thing
again. This works in any flavor, including JavaScript.
We can improve the regex to:
^\d{3}([-\s\.]?)\d{3}\1\d{4}$
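As a quick sanity check, the improved pattern could be wrapped in a small function like the one below. The function name hasUniformSeparators is hypothetical; the regex is the one above, with the separator (possibly empty) captured once and required again through \1.

```javascript
// Hypothetical wrapper around the improved regex.
function hasUniformSeparators(phoneNum) {
  // Group 1 captures the first separator, which may be the empty string;
  // \1 then demands exactly the same thing between the second and third blocks.
  return /^\d{3}([-\s.]?)\d{3}\1\d{4}$/.test(phoneNum);
}

var samples = ["111-222-3333", "1112223333", "111-222.3333", "111-2223333"];
var results = samples.map(hasUniformSeparators);
// results is [true, true, false, false]
```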

You'll need at least two passes to keep this maintainable and extensible.
JS' RegEx doesn't allow for creating variables for use later in the RegEx, if you want to support older browsers.
If you are only supporting modern browsers, Fede's answer is just fine...
As such, with ghetto-support, you aren't going to be able to reliably check that one separator is the same value every time, without writing a really, really, really, stupidly-long RegEx, using | to basically write out the RegEx 3 times.
A better way might be to grab all of the separators, and use a reduction or a filter to check that they all have the same value.
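var userEnteredNumber = "999.231 3055";
// Format check only; this is the regex from the question
var numRegEx = /^[1-9]{3}[-\s\.]{0,1}[0-9]{3}[-\s\.]{0,1}[0-9]{4}$/;
var validNumber = numRegEx.test(userEnteredNumber);
// Drop the digits; whatever is left is the list of separators
var separators = userEnteredNumber.replace(/\d+/g, "").split("");
var firstSeparator = separators[0];
var uniformSeparators = separators.every(function (separator) { return separator === firstSeparator; });
if (!uniformSeparators) { /* also not valid */ }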
You could make that a little neater, using closures and some applied functions, but that's the idea.
Alternatively, here's the big, ugly RegEx that would allow you to test exactly what the user entered.
var separatorTest = /^(?:([0-9]{3}\.[0-9]{3}\.[0-9]{3,4})|([0-9]{3}-[0-9]{3}-[0-9]{3,4})|([0-9]{3} [0-9]{3} [0-9]{3,4})|([0-9]{9,10}))$/;
Notice I had to include the exact same number-test three times, wrap each one in parens (to be treated as a single group), and then separate each group with an | to check each group, like an if, else if, else... ...and then plug in a separate special case for having no separator at all...
...not pretty.
I'm also not using \d; spelling out [0-9] keeps the digit class visually distinct from the literal - and . separators, which is easy to lose track of when trying to maintain one of these abominations.
Now, a word or two of warning:
People are liable to enter all kinds of crap; if this is for a commercial site, it's likely better to just strip separators entirely and validate the number is the right size, and conforms to some specifics (eg: doesn't start with /^555555/).
If not given any instruction about number format, people will happily use either no separator or a formal number, like (555) 555-5555 (or +1 (555) 555-5555 for the really pedantic), which is obviously going to fail hard, in this system (see point #1).
Be prepared to trim what you get, before validating.
Depending on your country/region/etc laws about data-security and consumer-vs-transaction record-keeping (again, may or may not be more important in a commercial setting), it's likely better to store both a "user-given" ugly number, and a system-usable number, which you either clean on the back-end, or submit along with the user-entered text.
From a user-interaction perspective, either forcing the number to conform, explicitly (placeholders showing them xxx-xxx-xxxx right above the input, in bold), or accepting any text, and prepping it yourself, is going to be 1000x better than accepting certain forms, but not bothering to tell the user up-front, and instead telling them what they did was wrong, after they try.
It's not cool for relationships; it's equally not cool, here.
You've got 9-digit and 10-digit numbers, so if you're trying for an international solution, be prepared to deal with all international separators (, \.\-\(\)\+) etc... again, why stripping is more useful, because THAT RegEx would be insane.
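The "strip the separators and validate the size" advice could be sketched as follows. The function name normalizePhoneNumber and the fixed length of 10 digits are assumptions for illustration; a real system would choose its own length rules.

```javascript
// Hypothetical helper: keeps only the digits and accepts the result
// when it has the expected length (assumed here to be 10).
function normalizePhoneNumber(raw) {
  var digits = String(raw).replace(/[^0-9]/g, "");
  return digits.length === 10 ? digits : null;
}

var cleaned = normalizePhoneNumber("(555) 555-5555");
// cleaned is "5555555555"
```

With this assumption a number entered with a country code, such as +1 (555) 555-5555, yields 11 digits and is rejected, so the length rule is the part most likely to need adjusting.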

Match interconnected arabic characters

I need to match interconnected Arabic characters to do expansion like this:
بسم الله الرحمن الرحيم
becomes
بـسـم الـلـه الـرحـمـن الـرحـيـم
is there a way to do that using regular expressions?

How about something like this:
"بسم الله الرحمن الرحيم".replace(/(ب|ت|ث|ج|ح|خ|س|ش|ص|ض|ط|ظ|ع|غ|ف|ق|ك|ل|م|ن|ه|ي)(?=\S)/g, "$1ـ");
returns:
"بـسـم الـلـه الـرحـمـن الـرحـيـم"
Clarification:
We match letters that can connect to the following character by using an OR group over all those characters, then make sure the letter is not followed by whitespace (i.e. it is not at the end of a word). Then we replace the matched group (the letter) with itself ($1) followed by the elongation character (tatweel, ـ).
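A possible variation, sketched under two assumptions: the alternation is written as a character class, and the lookahead requires an Arabic letter (taken here to be U+0621 to U+064A) rather than any non-whitespace character, so that no tatweel is added before punctuation. The names stretchArabic, JOINING_LETTERS and TATWEEL are hypothetical.

```javascript
// Letters that join to the letter after them (same set as the alternation above).
var JOINING_LETTERS = "بتثجحخسشصضطظعغفقكلمنهي";
// U+0640 ARABIC TATWEEL, the elongation character.
var TATWEEL = "\u0640";

// Hypothetical helper: inserts a tatweel after every joining letter
// that is directly followed by another Arabic letter.
function stretchArabic(text) {
  var pattern = new RegExp("([" + JOINING_LETTERS + "])(?=[\\u0621-\\u064A])", "g");
  return text.replace(pattern, "$1" + TATWEEL);
}
```

For plain words the result should be the same as with the original regex; the two differ only when a joining letter is followed by something like a full stop or a digit.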

I had a project once in which I had to choose the correct unicode codes to render depending on the position of the letters; so that they appear connected (or disconnected) as appropriate, because I was using a system non-compliant with Unicode.
The Unicode value for the disconnected Meem (م) is different from the one for the connected form. BUT:
Unfortunately for your case, and most fortunately for many other cases, the Unicode specification separates how a letter is displayed from its actual code point. This is why you might have the code point for a disconnected Meem and still see it displayed as connected! The specification also provides that comparing the connected Meem to a disconnected one yields the semantically correct result, i.e. they are treated as equivalent. This makes things a lot easier!
What I ended up doing was to create a static data structure (hard-coded dictionaries or arrays, XML, or whatever). This data structure tells us whether each Arabic letter connects to the letter before it, the letter after it, or both.
For example:
//list of chars that can connect before and after
var canConnectBeforeAfter = new List<char>() { 'ع', 'ت', 'ب', 'ي' /*and so on*/ };
//list of chars that can connect only to the character before them (if that character can connect to the one after it! watch out for وو)
var cannotConnectAfter = new List<char>() { 'ر', 'و' };
var cannotConnect = new List<char>() { 'ء' };
You will need to add the right characters for the right lists. I hope you don't have to deal with Harakat!!!!
سلام, let me know if you need clarification

JavaScript regex valid name

I want to make a JavaScript regular expression that checks for valid names.
minimum 2 chars (space can't count)
space and some special chars allowed (éàëä...)
I know how to write some of these separately, but not combined.
If I use /^([A-Za-z éàë]{2,40})$/, the user could input 2 spaces as a name
If I use /^([A-Za-z]{2,40}[ éàë]{0,40})$/, the user must use 2 letters first and after using space or special char, can't use letters again.
Searched around a bit, but hard to formulate search string for my problem. Any ideas?

Please, please pretty please, don't do this. You will only end up upsetting people by telling them their name is not valid. Several examples of surnames that would be rejected by your scheme: O'Neill, Sørensen, Юдович, 李. Trying to cover all these cases and more is doomed to failure.
Just do something like this:
strip leading and trailing blanks
collapse consecutive blanks into one space
check if the result is not empty
In JavaScript, that would look like:
name = name.replace(/^\s+/, "").replace(/\s+$/, "").replace(/\s+/g, " ");
if (name == "") {
// show error
} else {
// valid: maybe put trimmed name back into form
}

Most solutions don't consider the many different names there might be. There can be names with only two characters, like Al or Bo, or someone who writes their name like F. Middlename Lastname.
This RegExp will validate most names but you can optimize it to whatever you want:
/^[a-z\u00C0-\u02AB'´`]+\.?\s([a-z\u00C0-\u02AB'´`]+\.?\s?)+$/i
This will allow:
Li Huang Wu
Cevahir Özgür
Yiğit Aydın
Finlay Þunor Boivin
Josué Mikko Norris
Tatiana Zlata Zdravkov
Ariadna Eliisabet O'Taidhg
sergej lisette rijnders
BRIANA NORMINA HAUPT
BihOtZ AmON PavLOv
Eoghan Murdo Stanek
Filimena J. Van Der Veen
D. Blair Wallace
But will not allow:
Shirley24
66Bryant Hunt88
http://stackoverflow.com
laoise_ibtihaj
hippolyte@example.com
Cy4n 4ur0r4 Blyth3 3ll1
Justisne
Danny
If the name needs to be capitalized, uppercase, lowercase, trimmed or single spaced, that's a task a formatter should do, not the user.
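To try the pattern against your own list of names, a small check like the following may help. The function name isAcceptedName is hypothetical; the regex is the one from this answer, and the input is trimmed first in line with the note about formatting.

```javascript
// The pattern from the answer above (no g flag, so test() keeps no state).
var namePattern = /^[a-z\u00C0-\u02AB'´`]+\.?\s([a-z\u00C0-\u02AB'´`]+\.?\s?)+$/i;

// Hypothetical helper: trims the input, then applies the pattern.
function isAcceptedName(name) {
  return namePattern.test(name.trim());
}

var accepted = ["Li Huang Wu", "D. Blair Wallace", "Cevahir Özgür"].map(isAcceptedName);
var rejected = ["Shirley24", "Danny", "laoise_ibtihaj"].map(isAcceptedName);
// accepted is [true, true, true]; rejected is [false, false, false]
```

Note that the pattern requires at least two words, so a single-word name such as Danny is rejected even though it contains only letters.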

I would like to propose a RegEx that would match all latin based languages with their special characters:
/^([ áàíóúéëöüñÄĞİŞȘØøğışÐÝÞðýþA-Za-z'-]*)$/
P.S. I've included all characters I could find, but please feel free to edit the answer in case I've missed any.

Why not
var reg= /^([A-Za-z]{2}[ éàëA-Za-z]*)$/;
2 letters, then as many spaces, letters or special characters as you want.
I wouldn't allow spaces in usernames though - it's begging for trouble when you have usernames like
ab ba
who's going to remember how many spaces they used?

You could do this:
/^([A-Za-zéàë]{2,40} ?)+$/
2-40 characters, and then optionally a space, repeated at least once. This will allow a space at the end, but you could trim it off separately.

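After trimming the input value, the following will match your request, but only for Latin surnames.
// Regex literal: inside a string passed to new RegExp, "\w" would lose its backslash
rn = /([\w\u00C0-\u02AB']+ ?)+/gi;
// ln holds the (trimmed) surname entered by the user
m = ln.match(rn);
valid = (m && m.length)? true: false;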
Note that I am using '+' instead of '{2,}'; that is because some surnames use just one letter as a separate word, like "Ortega y Gasset".
You can see I am not using RegExp.test. That method does not behave reliably here: on a regex with the g flag, test() remembers lastIndex between calls, so repeated calls on the same regex can return different results.
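One likely explanation for unreliable test() results is the g flag: a global regex stores the position of its last match in lastIndex and starts the next test() from there. A minimal demonstration, with variable names chosen only for illustration:

```javascript
var globalRegex = /a/g;

var first = globalRegex.test("a");  // true; lastIndex is now 1
var second = globalRegex.test("a"); // false; the search started at index 1, lastIndex resets to 0
var third = globalRegex.test("a");  // true again

// Without the g flag, test() keeps no state between calls.
var plainRegex = /a/;
var alwaysTrue = plainRegex.test("a") && plainRegex.test("a");
```

Dropping the g flag for validation, or resetting lastIndex to 0 before each call, avoids the alternating results.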
In my country, people from non-Latin-language countries usually transliterate their names, so the previous RegExp would be enough. However, if you attempt to match any surname in the world, you may add more ranges of \u#### characters, taking care not to include symbols, numbers or other types of character. Or perhaps the xregexp library may help you.
And, please, do not forget to validate the input on the server side, and to escape it before using it in SQL statements (if you have them).
