For example, if the initial string s is "0123456789", desired output would be an array ["01", "23", "45", "67", "89"].
Looking for elegant solutions in JavaScript.
What I was thinking (very non-elegantly) is to iterate through the string by splitting on the empty string and using the Array.forEach method, insert a delimiter after every two characters, then split by that delimiter. This is not a good solution, but it's my starting point.
Edit: A RegExp solution has been posted. I'd love to see if there are any other approaches.
How about:
var array = ("0123456789").match(/\w{1,2}/g);
Here we use .match() on your string to match any two or single ({1,2}) word characters (\w) and return an array of the results.
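One thing to watch with \w is that it silently drops anything that is not a word character. A small sketch of the difference (the function names here are made up for illustration):

```javascript
// Chunks of two word characters; a trailing single character is kept.
function pairsOfWordChars(s) {
  // match() returns null when nothing matches, so fall back to [].
  return s.match(/\w{1,2}/g) || [];
}

// Variant that keeps every character. [^] matches any character,
// including spaces, punctuation and newlines.
function pairsOfAnyChars(s) {
  return s.match(/[^]{1,2}/g) || [];
}

console.log(pairsOfWordChars("0123456789")); // ["01", "23", "45", "67", "89"]
console.log(pairsOfWordChars("ab-cd e"));    // ["ab", "cd", "e"]
console.log(pairsOfAnyChars("ab-cd e"));     // ["ab", "-c", "d ", "e"]
```

If the input is guaranteed to be digits or letters, the two behave the same.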
Regarding your edit for a non-regex solution; you could do a far less elegant function like this:
String.prototype.getPairs = function()
{
var pairs = [];
for(var i = 0; i < this.length; i += 2)
{
pairs[pairs.length] = this.substr(i, 2);
}
return pairs;
}
var array = ("01234567890").getPairs();
If you want to use split (and why not), you could do the following:
s.split(/([^][^])/).filter(function(x){return x})
Which splits using two consecutive characters as a delimiter (but because they're in a capture group, they're also part of split's result). Filtering that with the identity function serves to eliminate the empty strings (between the "delimiters"). Note that in the case of an odd number of characters, the last character will be output as a split, not a delimiter, but it doesn't matter since it will still test truthy.
([^] is how you spell . in javascript if you really want to match any character. I had to look that up.)
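To see the odd-length behaviour concretely, here is the same expression wrapped in a helper (pairsBySplit is just an illustrative name):

```javascript
// Split on "any two characters"; the capture group keeps the pairs in the
// result, and filter() drops the empty strings between them.
function pairsBySplit(s) {
  return s.split(/([^][^])/).filter(function (x) { return x; });
}

console.log(pairsBySplit("0123456789")); // ["01", "23", "45", "67", "89"]
console.log(pairsBySplit("01234"));      // ["01", "23", "4"]
```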
What is the regular expression to validate a comma delimited list like this one:
12365, 45236, 458, 1, 99996332, ......
I suggest doing it the following way:
(\d+)(,\s*\d+)*
which would work for a list containing 1 or more elements.
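Note that for validation the pattern needs anchors, otherwise it will happily match a valid-looking piece in the middle of an invalid string. A minimal sketch in JavaScript (the question didn't name a language, so that choice and the function name are assumptions):

```javascript
// Anchored version of (\d+)(,\s*\d+)* so the whole string must be a list.
const numberListPattern = /^\d+(,\s*\d+)*$/;

function isNumberList(text) {
  return numberListPattern.test(text);
}

console.log(isNumberList("12365, 45236, 458, 1, 99996332")); // true
console.log(isNumberList("12,,3"));                          // false
console.log(isNumberList("12, "));                           // false
```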
This regex extracts an element from a comma separated list, regardless of contents:
(.+?)(?:,|$)
If you just replace the comma with something else, it should work for any delimiter.
It depends a bit on your exact requirements. I'm assuming: all numbers, any length; numbers cannot have leading zeros nor contain commas or decimal points; individual numbers are always separated by a comma then a space; and the last number does NOT have a comma and space after it. Any of these being wrong would simplify the solution.
([1-9][0-9]*,[ ])*[1-9][0-9]*
Here's how I built that mentally:
[0-9] any digit.
[1-9][0-9]* leading non-zero digit followed by any number of digits
[1-9][0-9]*, as above, followed by a comma
[1-9][0-9]*,[ ] as above, followed by a space
([1-9][0-9]*,[ ])* as above, repeated 0 or more times
([1-9][0-9]*,[ ])*[1-9][0-9]* as above, with a final number that doesn't have a comma.
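A quick way to check the finished pattern against those assumptions, sketched in JavaScript with ^ and $ added so the whole string has to match (the anchors and the function name are my additions):

```javascript
// Comma-then-space separated numbers, no leading zeros, no trailing separator.
const strictListPattern = /^([1-9][0-9]*,[ ])*[1-9][0-9]*$/;

function isStrictNumberList(text) {
  return strictListPattern.test(text);
}

console.log(isStrictNumberList("1, 20, 300")); // true
console.log(isStrictNumberList("1,20"));       // false: no space after the comma
console.log(isStrictNumberList("01, 2"));      // false: leading zero
console.log(isStrictNumberList("1, 2, "));     // false: trailing separator
```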
Match duplicate comma-delimited items:
(?<=,|^)([^,]*)(,\1)+(?=,|$)
Reference.
This regex can be used to split the values of a comma-delimited list. List elements may be quoted, unquoted or empty. Commas inside a pair of quotation marks are not matched.
,(?!(?<=(?:^|,)\s*"(?:[^"]|""|\\")*,)(?:[^"]|""|\\")*"\s*(?:,|$))
Reference.
/^\d+(?:, ?\d+)*$/
I used this for a list of items that had to be alphanumeric without underscores at the front of each item.
^(([0-9a-zA-Z][0-9a-zA-Z_]*)([,][0-9a-zA-Z][0-9a-zA-Z_]*)*)$
You might want to specify language just to be safe, but
(\d+, ?)+(\d+)?
ought to work
I had a slightly different requirement, to parse an encoded dictionary/hashtable with escaped commas, like this:
"1=This is something, 2=This is something,,with an escaped comma, 3=This is something else"
I think this is an elegant solution, with a trick that avoids a lot of regex complexity:
if (string.IsNullOrEmpty(encodedValues))
{
return null;
}
else
{
var retVal = new Dictionary<int, string>();
var reFields = new Regex(@"([0-9]+)\=(([A-Za-z0-9\s]|(,,))+),");
foreach (Match match in reFields.Matches(encodedValues + ","))
{
var id = match.Groups[1].Value;
var value = match.Groups[2].Value;
retVal[int.Parse(id)] = value.Replace(",,", ",");
}
return retVal;
}
I think it can be adapted to the original question with an expression like @"([0-9]+),\s?" and parse on Groups[1].
I hope it's helpful to somebody and thanks for the tips on getting it close to there, especially Asaph!
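For anyone who needs the same trick outside .NET, here is a rough JavaScript port. It is only a sketch: parseEncoded is a made-up name, a Map stands in for the Dictionary, and it keeps the original's assumption that values contain only letters, digits, whitespace and doubled commas.

```javascript
// Parse "id=value, id=value" where a literal comma in a value is written ",,".
function parseEncoded(encodedValues) {
  if (!encodedValues) {
    return null;
  }
  const result = new Map();
  // Same idea as the C# version: append a "," so every field ends with one.
  const reFields = /([0-9]+)=((?:[A-Za-z0-9\s]|,,)+),/g;
  for (const match of (encodedValues + ",").matchAll(reFields)) {
    result.set(Number(match[1]), match[2].replace(/,,/g, ","));
  }
  return result;
}

const parsed = parseEncoded(
  "1=This is something, 2=This is something,,with an escaped comma, 3=This is something else"
);
console.log(parsed.get(2)); // "This is something,with an escaped comma"
```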
In JavaScript, use split to help out, and catch any negative digits as well:
'-1,2,-3'.match(/(-?\d+)(,\s*-?\d+)*/)[0].split(',');
// ["-1", "2", "-3"]
// may need trimming if digits are space-separated
The following will match any comma delimited word/digit/space combination
(((.)*,)*)(.)*
Why don't you work with groups:
^(\d+(, )?)+$
If you had a more complicated regex, e.g. for valid URLs rather than just numbers, you could loop through each element and test each of them individually against your regex:
const validRelativeUrlRegex = /^(^$|(?!.*(\W\W))\/[a-zA-Z0-9\/-]+[^\W_]$)/;
const relativeUrls = "/url1,/url-2,url3";
const startsWithComma = relativeUrls.startsWith(",");
const endsWithComma = relativeUrls.endsWith(",");
const areAllURLsValid = relativeUrls
.split(",")
.every(url => validRelativeUrlRegex.test(url));
const isValid = areAllURLsValid && !endsWithComma && !startsWithComma
All I want is to test whether a string contains non-overlapping substrings to match the array of regexes in the following way: if a substring matches some item of the array, remove the corresponding regex from the array, and continue. I will need a function func(arg1, arg2) that will take two arguments: the first one is the string itself, and the second one is an array of regular expressions to test.
I've read some explanations (such as Regular Expressions: Is there an AND operator?), but they do not answer this specific question. For example, in Javascript, the following three code snippets will return true:
/(?=ab)(?=abc)(?=abcd)/gi.test("eabzzzabcde");
/(?=.*ab)(?=.*abc)(?=.*abcd)/gi.test("eabzzzabcde");
/(?=.*?ab)(?=.*?abc)(?=.*?abcd)/gi.test("eabzzzabcde");
which is, obviously, not what I want (because "abc" and "abcd" in "eabzzzabcde" are just mixed together in an overlapping way). So, func("eabzzzabcde", [/ab/gi, /abc/gi, /abcd/gi]) should return false.
But, func("Fhh, fabcw wxabcdy yz... zab.", [/ab/gi, /abc/gi, /abcd/gi]) should return true because none of "ab", "abc" and "abcd" substrings overlap each other. The logic is the following. We have an array of regexes: [/ab/gi, /abc/gi, /abcd/gi], and some possible combination of three (where 3 is equal to the length of that array) non-overlapping, separate substrings of the original string: fabcw, xabcdy and zab. Does fabcw match /abc/gi? Yes. Okay, we remove /abc/gi from the array, and we have [/ab/gi, /abcd/gi] for xabcdy and zab. Does xabcdy match /abcd/gi? Yes. Okay, we remove /abcd/gi from the current array, and we have [/ab/gi] for zab. Does zab. match /ab/gi? Yes. No more regexes left in the current array, and we always answered "yes", so — return true.
The tricky part here is to find an efficient (such that performance is not too terrible) way to get at least one possible “good” combination of non-overlapping substrings.
The more complex case is e.g. func("acdxbaab ababaacb", [/.*?a.*?b.*?c/gi, /.*?c.*?b.*?a/gi]). Using the logic described above, we can see that if we take two non-overlapping parts of the original string — "acdxba" (or "cdxba") and "abaac" (or "abaacb", "babaac" etc.) — the first one matches /.*?c.*?b.*?a/gi, and the second one matches /.*?a.*?b.*?c/gi. So, func("acdxbaab ababaacb", [/.*?a.*?b.*?c/gi, /.*?c.*?b.*?a/gi]) should return true.
Is there any efficient way to solve such a problem?
Assuming each pattern should match exactly once, then we can construct a regexp of all of their permutations:
const patterns= ['ab', 'abc', 'abcd'];
const input = "Fhh, fabcw wxabcdy yz... zab.";
// Create a regexp of the form
// (.*?ab.*?abc.*?abcd.*?)
function build(patterns) {
return `(${['', ...patterns, ''].join('.*?')})`;
}
function match(input, patterns) {
const regexps = [...permute(patterns)].map(build);
// Create a regexp of the form
// /(.*?ab.*?abc.*?abcd.*?)|(.*?ab.*?abcd.*?abc.*?)|.../
const regexp = new RegExp(regexps.join('|'));
return regexp.test(input);
}
// Simple permutation generator.
function *permute(a, n = a.length) {
if (n <= 1) yield a.slice();
else for (let i = 0; i < n; i++) {
yield *permute(a, n - 1);
const j = n % 2 ? 0 : i;
[a[n-1], a[j]] = [a[j], a[n-1]];
}
}
console.log(match(input, patterns));
This will result in a very long regexp if there are more than a half-dozen or so patterns. To deal with this, we can test each permutation one at a time:
function match(input, patterns) {
return Array.from(permute(patterns))
.some(perm => input.match(build(perm)));
}
If there are ten patterns, that is 10! = 3,628,800 permutations, so in the worst case we will end up doing a few million tests.
Disclaimers
This uses ES6 features. Fall back to equivalent ES5 syntax if you need to.
The input patterns here are strings. To handle regexps instead would require a little bit of logic to extract the pattern from the regexp, and also escape any special regexp characters appearing in it.
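As a sketch of what that extra logic might look like: take .source from RegExp inputs and escape plain strings. escapeRegExp, toSource and buildFromMixed are hypothetical helpers, not part of the code above, and flags on the input regexps are ignored here.

```javascript
// Escape characters that have a special meaning inside a regexp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// RegExp objects contribute their source (wrapped in a non-capturing group so
// alternations stay contained); plain strings are matched literally.
function toSource(pattern) {
  return pattern instanceof RegExp ? `(?:${pattern.source})` : escapeRegExp(pattern);
}

// Same shape as build() above: .*?p1.*?p2.*?
function buildFromMixed(patterns) {
  return ["", ...patterns.map(toSource), ""].join(".*?");
}

console.log(buildFromMixed(["a.b", /c|d/])); // .*?a\.b.*?(?:c|d).*?
```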
Is there an efficient way to test whether a string contains non-overlapping substrings to match the array of regular expressions?
I doubt that you would call the above solution "efficient", but I don't know if there is a more efficient one. As far as I can see, any approach to this problem is going to involve backtracking. You could match the first nine of ten patterns, and then discover that the last one won't match because one of the earlier nine greedily ate up part of what the tenth needed, even though it could have matched itself somewhere later in the string. Therefore, I will go out on a limb and say that this problem is intrinsically of order O(n!).
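For comparison, the backtracking can also be done on match positions directly, which works with real RegExp objects. This is only a sketch under one big assumption: for each start index it considers just the one match the engine returns there, so a pattern that could also match a shorter or longer span from the same start is not fully explored. spans and place are made-up names; func is the name from the question.

```javascript
// Collect [start, end) for the engine's match at every start index.
function spans(re, s) {
  const flags = re.flags.replace(/[gy]/g, "") + "y"; // sticky: match exactly at lastIndex
  const sticky = new RegExp(re.source, flags);
  const out = [];
  for (let i = 0; i < s.length; i++) {
    sticky.lastIndex = i;
    const m = sticky.exec(s);
    if (m && m[0].length > 0) out.push([i, i + m[0].length]);
  }
  return out;
}

function func(s, regexes) {
  const candidates = regexes.map(re => spans(re, s));
  const used = [];
  const overlaps = ([a, b]) => used.some(([c, d]) => a < d && c < b);

  // Try to give regex k a span that does not overlap the spans already taken.
  function place(k) {
    if (k === candidates.length) return true;
    for (const span of candidates[k]) {
      if (overlaps(span)) continue;
      used.push(span);
      if (place(k + 1)) return true;
      used.pop(); // backtrack
    }
    return false;
  }
  return place(0);
}

console.log(func("eabzzzabcde", [/ab/gi, /abc/gi, /abcd/gi]));                   // false
console.log(func("Fhh, fabcw wxabcdy yz... zab.", [/ab/gi, /abc/gi, /abcd/gi])); // true
```

The worst case is still exponential, it just avoids building every permutation up front.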
The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?
var query = "word",
words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'],
reStoredMatches = [],
indexOfMatches = [];
function match(word, query) {
var len = word.length,
charMatches = [],
charMatch,
char;
while (len--) {
char = word[len];
charMatch = query.indexOf(char);
if (charMatch !== -1) {
charMatches.push(char);
}
}
return charMatches.length === query.length;
}
function linearIndexOf(words, query) {
var wordsLen = words.length,
wordMatch,
word;
while (wordsLen--) {
word = words[wordsLen];
wordMatch = match(word, query);
if (wordMatch) {
indexOfMatches.push(word);
}
}
}
function linearRegexStored(words, query) {
var wordsLen = words.length,
re = new RegExp('[' + query + ']', 'g'),
match,
word;
while (wordsLen--) {
word = words[wordsLen];
match = word.match(re);
if (match !== null) {
if (match.length >= query.length) {
reStoredMatches.push(word);
}
}
}
}
Note that your regex is wrong, that's most certainly why it goes so fast.
Right now, if your query is "word" (as in your example), the regex is going to be:
/[word]/g
This means: look for any one of the characters 'w', 'o', 'r', or 'd'. With the 'g' flag, match() returns an array of every such character it finds, so comparing match.length to query.length only counts hits: a word like "wwww" passes without containing 'o', 'r', or 'd'. That is definitely a lot cheaper than what a correct indexOf() version has to do.
Also, you mention the idea/concept of any number of characters, I suppose as shown here:
'word', 'wwoorrddss'
The indexOf() version will definitely not catch that properly if you really mean "any number" for each and every character, because you would have to match an infinite number of cases. Something like this as a regex:
/w+o+r+d+s+/g
You will certainly have a hard time writing that correctly in plain JavaScript rather than with a regex. However, either way, that's going to be somewhat slow.
From the comment below, all the letters of the word are required. To check that with a regex, you need 3! (3 factorial) alternatives for a 3-letter word:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
Obviously, a factorial is going to very quickly grow your number of possibilities and blow away your memory in a super long regex (although you can simplify if a word has the same letter multiple times, you do not have to test that letter more than once).
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
...
That's probably why your properly written test in plain JavaScript is much slower.
Also, in your case you should store the words nearly as Scrabble dictionaries do: each letter once, in alphabetical order (Scrabble keeps duplicates). So the word "word" would be "dorw" and, as shown in your example, the word "wwoorrddss" would be "dorsw". You can have some backend tool generate your table of words (so you still write them as "word" and "words", and the tool converts them to "dorw" and "dorsw"). Then you can sort the letters of the words you are testing in alphabetical order, and the result is that you do not need a silly factorial for the regex; you can simply do this:
/d.*o.*r.*w/
And that will match any word that includes all the letters of "word", such as "password" (stored as "adoprsw").
One easy way to sort the letters is to split your word into an array of letters, and then sort the array. You will still get duplicates: JavaScript's Array.prototype.sort does not remove them, so deduplicate separately.
One more detail: if your test is supposed to be case insensitive, then you want to transform your strings to lowercase before running the test. So something like:
query = query.toLowerCase();
early on in your top function.
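Putting the sorted-letters idea together, a minimal sketch could look like this. signature and containsAllLetters are names I made up, and a Set is just one way to drop the duplicates:

```javascript
function escapeRegExp(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// "wwoorrddss" -> "dorsw": lowercase, unique letters, alphabetical order.
function signature(word) {
  return [...new Set(word.toLowerCase())].sort().join("");
}

// Build /d.*o.*r.*w/ from the query and test it against the word's signature.
function containsAllLetters(word, query) {
  const pattern = [...signature(query)].map(escapeRegExp).join(".*");
  return new RegExp(pattern).test(signature(word));
}

const words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
console.log(words.filter(w => containsAllLetters(w, "word")));
// ["word", "wwoorrddss", "words", "sword", "drowns"]
```

In a real setup you would compute the signatures once up front rather than on every lookup.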
You are trying to speed up the algorithm "chars in word are a subset of the chars of query." You can short-circuit this check and avoid some assignments (they make the code more readable but are not strictly needed). Try the following version of match:
function match(word, query) {
var len = word.length;
while (len--) {
if (query.indexOf(word[len]) === -1) { // found a missing char
return false;
}
}
return true; // couldn't find any missing chars
}
This gives a 4-5X improvement.
Depending on the application you could try presorting words and presorting each word in words as another optimization.
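Note that the question describes a match as "the word contains all the characters in the query", which is the opposite direction from the subset check above. If that reading is the intended one, the same short-circuit idea applies with the loop running over the query instead. A sketch (containsAllChars is a hypothetical name):

```javascript
// True when every character of query occurs somewhere in word.
function containsAllChars(word, query) {
  var len = query.length;
  while (len--) {
    if (word.indexOf(query[len]) === -1) { // query char missing from word
      return false;
    }
  }
  return true;
}

var words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
console.log(words.filter(function (w) { return containsAllChars(w, "word"); }));
// ["word", "wwoorrddss", "words", "sword", "drowns"]
```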
The regexp match algorithm constructs a finite state automaton and makes its decisions based on the current state and the character read, going left to right. This involves reading each character once and making a decision.
For static strings (looking for a fixed string in a body of text) there are better algorithms, such as Knuth-Morris-Pratt, which never re-reads a text character after a mismatch, or Boyer-Moore, which can skip ahead by more than one character at a time. You must understand, though, that these algorithms are not for matching regular expressions, just plain strings.
If you are interested in Knuth-Morris-Pratt (there are several other algorithms), have a look at Wikipedia: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
A good thing to do is to investigate whether your regexp match routines use a DFA or an NFA: NFAs occupy less memory and are easier to build, while DFAs match faster, at the cost of some compilation time and more memory.
The Knuth-Morris-Pratt algorithm also needs to compile the string into an automaton before working, so perhaps it doesn't pay off for your problem if you are using it just to find one word in some string.
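To make the "compile first, then search" point concrete, here is a compact Knuth-Morris-Pratt sketch in JavaScript. The function names are mine; the failure table is the precomputed part.

```javascript
// For each prefix of the pattern, the length of its longest proper prefix
// that is also a suffix. This is the "compilation" step.
function buildFailureTable(pattern) {
  const fail = new Array(pattern.length).fill(0);
  let k = 0;
  for (let i = 1; i < pattern.length; i++) {
    while (k > 0 && pattern[i] !== pattern[k]) k = fail[k - 1];
    if (pattern[i] === pattern[k]) k++;
    fail[i] = k;
  }
  return fail;
}

// Index of the first occurrence of pattern in text, or -1.
// The text index i only ever moves forward.
function kmpSearch(text, pattern) {
  if (pattern.length === 0) return 0;
  const fail = buildFailureTable(pattern);
  let k = 0; // number of pattern characters matched so far
  for (let i = 0; i < text.length; i++) {
    while (k > 0 && text[i] !== pattern[k]) k = fail[k - 1];
    if (text[i] === pattern[k]) k++;
    if (k === pattern.length) return i - k + 1;
  }
  return -1;
}

console.log(buildFailureTable("abab"));        // [0, 0, 1, 2]
console.log(kmpSearch("abcabcabd", "abcabd")); // 3
```

For a single short lookup, the built-in indexOf() is almost certainly the better choice.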
Now I have two strings:
var str1 = "A10B1C101D11";
var str2 = "A1B22C101D110E1";
What I intend to do is to tell the difference between them, the result will look like
A10B1C101D11
A10 B22 C101 D110E1
It follows the same pattern, one character and a number. And if the character doesn't exist or the number is different between them, I will say they are different, and highlight the different part. Can regular expression do it or any other good solution? thanks in advance!
Let me start by stating that regexp might not be the best tool for this. As the strings have a simple format that you are aware of it will be faster and safer to parse the strings into tokens and then compare the tokens.
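A minimal sketch of that token approach, assuming every token is one non-digit character followed by digits (tokenize and diffTokens are illustrative names):

```javascript
// "A10B1C101D11" -> ["A10", "B1", "C101", "D11"]
function tokenize(s) {
  return s.match(/\D\d*/g) || [];
}

// Tokens of a that do not appear anywhere in b.
function diffTokens(a, b) {
  const inB = new Set(tokenize(b));
  return tokenize(a).filter(function (token) { return !inB.has(token); });
}

var str1 = "A10B1C101D11";
var str2 = "A1B22C101D110E1";
console.log(diffTokens(str1, str2)); // ["A10", "B1", "D11"]
console.log(diffTokens(str2, str1)); // ["A1", "B22", "D110", "E1"]
```

The returned tokens are the parts you would highlight in each string.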
However, you can do this with a regexp, although in JavaScript engines that predate ES2018 you are hampered by the lack of lookbehind.
The way to do this is to use negative lookahead to prevent matches that are included in the other string. However, without lookbehind you might need to search from both directions.
We do this by concatenating the strings, with a delimiter that we can test for.
If using '|' as a delimiter the regexp becomes:
/(\D\d*)(?=(?:\||\D.*\|))(?!.*\|(.*\d)?\1(\D|$))/g
To find the tokens in the second string that are not present in the first you do:
var bothstring=str2.concat("|",str1);
var re=/(\D\d*)(?=(?:\||\D.*\|))(?!.*\|(.*\d)?\1(\D|$))/g;
var match=re.exec(bothstring);
Subsequent calls to re.exec will return later matches, so you can iterate over them as in the following example:
while (match!=null){
alert("\""+match+"\" At position "+match.index);
match=re.exec(bothstring);
}
As stated this gives tokens in str2 that are different in str1. To get the tokens in str1 that are different use the same code but change the order of str1 and str2 when you concatenate the strings.
The above code might not be safe if dealing with potentially dirty input. In particular it might misbehave if fed a string like "A100|A100": the first A100 will not be considered as having a missing object because the regexp is not aware that the source is supposed to be two different strings. If this is a potential issue then search for occurrences of the delimiting character first.
You can break each string into an array:
var aStr1 = str1.split('');
var aStr2 = str2.split('');
Then check which one has more characters, and save the smaller number
var totalCharacters;
if(aStr1.length > aStr2.length) {
totalCharacters = aStr2.length
} else {
totalCharacters = aStr1.length
}
And loop comparing both
var diff = [];
for(var i = 0; i<totalCharacters; i++) {
if(aStr1[i] != aStr2[i]) {
diff.push(aStr1[i]); // or something else
}
}
At the very end you can concat those last characters from the bigger String (since they obviously are different from the other one).
Does that help?
I've grown spoiled by ColdFusion's lists, and have run across a situation or two where a comma delimited list shows up in Javascript. Is there an equivalent of listFindNoCase('string','list'), or a performant way to implement it in Javascript?
Oh, and it should be able to handle list items with commas, like:
( "Smith, John" , "Doe, Jane" , "etc..." )
That's what is really tripping me up.
FYI: jList's implementation: https://github.com/davidwaterston/jList
Although, this will fail your requirement that "it should be able to handle list items with commas"
listFind : function (list, value, delimiter) {
delimiter = (typeof delimiter === "undefined") ? "," : delimiter;
var i,
arr = list.split(delimiter);
if (arr.indexOf !== undefined) {
return arr.indexOf(value) + 1;
}
for (i = 0; i < list.length; i += 1) {
if (arr[i] === value) {
return i + 1;
}
}
return 0;
},
listFindNoCase : function (list, value, delimiter) {
delimiter = (typeof delimiter === "undefined") ? "," : delimiter;
list = list.toUpperCase();
value = String(value).toUpperCase();
return this.listFind(list, value, delimiter);
},
One relevant observation here is that CF lists themselves don't support the delimiter char also being part of the data. Your sample "list" of '"Smith, John", "Doe, Jane"' is a four-element comma-delimited list of '"Smith', 'John"', '"Doe', 'Jane"'. To fulfil your requirement here you don't want a JS equivalent of CF's listFindNoCase(), because listFindNoCase() does not actually fulfil your requirement from the CF perspective, and nothing native to CF does. To handle elements that have embedded commas, you need to use a different char as a delimiter.
TBH, CF lists are a bit rubbish (for the reason cited above), as they're only really useful in very mundane situations, which a) don't often crop up; b) aren't better served via an array anyhow. One observation to make here is you are asking about a performant solution here: not using string-based lists would be the first step in being performant (this applies equally to CF as it does to JS: CF's string-based lists are not at all performant).
So my first answer here would be: I think you ought to revise your requirement away from using lists, and look to use arrays instead.
With that in mind, how is the data getting to JS? Are you somehow stuck with using a string-based list? If not: simply don't. If your source data is a string-based list, are you in the position to convert it to an array first? You're in trouble with the "schema" of your example list as I mentioned before: from CF's perspective you can't have a comma being both a delimiter and data. And you do have a bit of work ahead of you writing code to identify that a quoted comma is data, and a non-quoted comma is a delimiter. You should have a look around at CSV-parsing algorithms to deal with that sort of thing.
However if you can change the delimiter (to say a pipe or a semi-colon or something that will not show up in the data), then it's easy enough to turn that into an array (listToArray() in CF, or split() in JS). Then you can just use indexOf() as others have said.
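As a sketch of that array route, assuming the data can be delivered with a pipe as the delimiter. The function below is a made-up array-based stand-in that returns a 1-based position (0 when not found), the way CF's listFindNoCase() does:

```javascript
// Case-insensitive lookup in an array; 1-based result, 0 when absent.
function listFindNoCase(items, value) {
  var needle = String(value).toLowerCase();
  return items.findIndex(function (item) {
    return String(item).toLowerCase() === needle;
  }) + 1;
}

var names = "Smith, John|Doe, Jane|etc...".split("|");
console.log(listFindNoCase(names, "doe, jane")); // 2
console.log(listFindNoCase(names, "Smith"));     // 0
```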
For the sake of sh!ts 'n' giggles, if you were stuck with a string - provided you could change the delimiter - you could do this, I think:
Find the position of the first match of the substring in the string. A plain indexOf() is not enough for this; you will need a regex (with search() or exec()) that matches the substring when it is delimited by your delimiter char, or runs from the beginning of the string to a delimiter char, or from a delimiter char to the end of the string, with no intermediary delimiter chars. I could come up with the regex for this if needs must. This will not be list-aware yet, but we know where it'll be in the string.
Take a substring of the original string from the beginning to the position indexOf() returned.
Use split() on that, splitting on the delimiter
The length of the ensuing array will be the position in the original list that the match was at.
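Purely to illustrate the steps above, a rough sketch (listFindInString and its regex are my guess at what was meant, not tested against CF's behaviour):

```javascript
function listFindInString(list, value, delimiter) {
  var esc = function (s) { return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); };
  var d = esc(delimiter);
  // The value must start at the beginning or after a delimiter,
  // and end at a delimiter or at the end of the string.
  var re = new RegExp("(?:^|" + d + ")" + esc(value) + "(?=" + d + "|$)", "i");
  var m = re.exec(list);
  if (!m) {
    return 0;
  }
  // Skip the leading delimiter, if the match included one.
  var start = m.index + (m[0].length - value.length);
  // Everything before the item, split on the delimiter, gives the position.
  return list.slice(0, start).split(delimiter).length;
}

var list = "Smith, John|Doe, Jane|etc...";
console.log(listFindInString(list, "doe, jane", "|")); // 2
console.log(listFindInString(list, "Smith", "|"));     // 0
```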
However I stress you should not do that. Use an array instead of a string from the outset.
You can use indexOf combined with .toLowerCase()
var list = '"Smith, John" , "Doe, Jane" , "etc..."';
if(list.toLowerCase().indexOf('"Smith, John"'.toLowerCase()) !== -1)
If you need an exact match, like "Smith" when "Smithson" exists, just pad the strings with your delimiter. For example, let's say your delimiter is a semi-colon (because you have commas in your string), pad the left and right sides of your string like so:
";Smith, John;Doe, Jane;"
Also pad the search value, so if you're looking for Smith, the value would become:
";Smith;"
With both strings lowercased, .toLowerCase().indexOf() would return -1 (not found) for ";smith;". But ";smith, john;" would return 0.
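The padding trick as a small helper, assuming a semi-colon delimiter that never appears in the data (paddedContains is a hypothetical name):

```javascript
// Pad both the list and the value with the delimiter so that only whole
// items can match, then compare case-insensitively.
function paddedContains(list, value, delimiter) {
  var haystack = (delimiter + list + delimiter).toLowerCase();
  var needle = (delimiter + value + delimiter).toLowerCase();
  return haystack.indexOf(needle) !== -1;
}

console.log(paddedContains("Smith, John;Doe, Jane", "Smith", ";"));       // false
console.log(paddedContains("Smith, John;Doe, Jane", "smith, john", ";")); // true
```

Note the explicit comparison with -1: indexOf() returns 0 for a match at the start, which is falsy.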