Regex issues in javascript? [duplicate] - javascript

I have this nice regex:
*(?:(?:([0-9]+)(?:d| ?days?)(?:, ?| )?)|(?:([0-9]+)(?:h| ?hours?)(?:, ?| )?)|(?:([0-9]+)(?:m| ?minutes?)(?:, ?| )?)|(?:([0-9]+)(?:s| ?seconds?)(?:, ?| )?))+
that pretty much matches a human-readable time delta. It works in PHP, Python, and Go, but for some reason the capture groups do not work in JavaScript. Here is a working PHP example on regex101 that shows the working capture groups. You will notice that upon changing it to JavaScript (ECMAScript) mode, the capture groups will only capture the last value. Can somebody please help and clarify what I am doing wrong, and why it doesn't work in JS?

Here's a simpler example that demonstrates the issue:
console.log(
'34'.match(/(?:(3)|(4))+/)
);
In PHP, whenever a capture group is matched, it will be put into the result. In contrast, in JavaScript, things are more complicated: when there are capturing groups on one side of an alternation |, whenever the whole alternation token is entered, there are 2 possibilities:
The alternation that is taken contains the capture group, and the result will have the capture group index set to the matched value
The alternation that is taken does not contain the capture group, in which case the result will have undefined assigned to that index - even if the capturing group was matched previously.
This is described in the specification:
Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings.
and
Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated.
because each iteration of the outermost * clears all captured Strings contained in the quantified Atom
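Both effects can be observed directly. The following is a small sketch: the first call is the simplified example from above, and the second pattern is, as far as I recall, the one the specification uses to illustrate that note. The variable names are only for illustration.

```javascript
// Each iteration of the quantified group resets the captures inside it,
// so only the captures set during the last iteration survive.
const simple = '34'.match(/(?:(3)|(4))+/);
console.log(simple[0], simple[1], simple[2]); // 34 undefined 4

// Group 4, (b+), matched "bbb" in the second iteration but is cleared
// again when the third iteration ("ac") starts.
const specExample = /(z)((a+)?(b+)?(c))*/.exec('zaacbbbcac');
console.log(Array.from(specExample));
// [ 'zaacbbbcac', 'z', 'ac', 'a', undefined, 'c' ]
```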
In your case, the easiest fix is to remove the outermost repeated group (the non-capturing group with the + quantifier), so that the pattern matches only one subsequence at a time, e.g. 1m, and then 1d. Then iterate through the matches with a global regex instead of trying to match everything in one go. To ensure that all the matches are next to each other (e.g. 1m1d, and not 1m foo 1d), check each match's index while iterating to see whether it starts where the previous match ended.
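A minimal sketch of that approach, under some assumptions: parseTimeDelta is a hypothetical helper name, and the per-piece pattern is a simplified variant of the original (it puts the unit in a single group and, unlike the original, also tolerates a space before a one-letter unit).

```javascript
// Hypothetical helper: parses strings such as "1d 2h, 30m".
// Returns totals per unit, or null if the pieces are not contiguous.
function parseTimeDelta(input) {
  // One "number + unit + optional separator" piece per match.
  const piece = /([0-9]+) ?(days?|hours?|minutes?|seconds?|[dhms])(?:, ?| )?/g;
  const result = { d: 0, h: 0, m: 0, s: 0 };
  let expectedIndex = 0;
  for (const match of input.matchAll(piece)) {
    // A match that does not start where the previous one ended
    // means there is unrelated text in between.
    if (match.index !== expectedIndex) return null;
    result[match[2][0]] += Number(match[1]); // first letter of the unit is the key
    expectedIndex = match.index + match[0].length;
  }
  // Require at least one piece and no trailing garbage.
  return expectedIndex > 0 && expectedIndex === input.length ? result : null;
}

console.log(parseTimeDelta('1d 2h, 30m')); // { d: 1, h: 2, m: 30, s: 0 }
console.log(parseTimeDelta('1d foo 2h'));  // null
```

Because every piece is matched on its own, each match carries its own captures and nothing is cleared by a surrounding quantifier.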

Related

How can I capture an optional asterisk with my regex?

I am trying to get sections from an API which is in markdown, so I'm using this:
(?<=\*\*(Test)\*\*)(.*?)(?=\*\*end\*\*)
https://regex101.com/r/r8aiVk/1
Result here should be Test and then this is a test, which it is. Awesome.
This works fine, however, some titles have an asterisk at the end, which is where I'm running into an issue. I loop through titles with the one regex, but I want to capture that one optional asterisk.
So with this example following, I want to be able to capture the asterisk along with the rest:
https://regex101.com/r/r8aiVk/2
The result here should be Test* this is a test.
I've tried various different ways, such as (\*?) and a few other variants, but I am unable to get this working.
The lookbehind implementation in JavaScript tricked you: to match the lookbehind pattern, the regex iterator goes backwards, and tries to match its pattern that way. Since it is executed at each location (your lookbehind is the first atom in the regex), it checks the start of string, then *, then **, then T, etc. and once it matches **Test**, it calls it a day. So, the next * is consumed with .*?.
You can get what you need using a mere consuming pattern:
/\*\*(Test\*?)\*\*(.*?)\*\*end\*\*/g
See the regex demo.
This pattern will be processed normally, from left to right, matching
\*\* - a ** substring
(Test\*?) - capturing Test or Test* into Group 1
\*\* - a ** substring
(.*?) - Capturing group 2: any 0+ chars other than line break chars, as few as possible
\*\*end\*\* - **end** substring.
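As a quick check of the consuming pattern, here is a sketch run against two sample strings. The samples are assumptions modelled on the question (the real input lives behind the regex101 links), and the variable names are illustrative.

```javascript
const pattern = /\*\*(Test\*?)\*\*(.*?)\*\*end\*\*/g;

// Assumed inputs: one title without and one with the trailing asterisk.
const samples = [
  '**Test** this is a test**end**',
  '**Test*** this is a test**end**',
];

// Collect [title, body] pairs for every match in every sample.
const results = samples.map((s) =>
  [...s.matchAll(pattern)].map((m) => [m[1], m[2]])
);

console.log(results[0]); // [ [ 'Test', ' this is a test' ] ]
console.log(results[1]); // [ [ 'Test*', ' this is a test' ] ]
```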

Complex string parsing in Javascript

I am attempting to parse a complex string in JavaScript, and I'm pretty horrible with Regular Expressions, so I haven't had much luck. The data is loaded into a variable formatted as follows:
Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)
I am trying to parse that string following these parameters:
1) Each value must be loaded into its own variable (i.e. separate variables for Miami, 2.5 O, (207.5), etc.)
2) String must split at pipe character (I have this working with .split(" | ") )
3) I am dealing with city names that include spaces
4) The date at the end must be isolated and removed
I have a feeling regular expressions must be used, but I'm seriously hoping there is a different way to approach this. The example provided is just that, an example from a much larger data set. I can provide the full data set if requested.
More direct version of my question: Given the data above, what concepts / procedures can I use to intelligently parse the string elements into their own variables?
If RegEx must be used, will I need multiple expressions?
Thanks in advance for your help!
EDIT: In an effort to supply multiple pathways to a solution I'll explain the overarching problem as well. This data is the return of a RSS / XML item. The string mentioned above is sports odds, and is all contained in the title node of the feed I'm using. If anyone has a better XML / RSS feed for sports odds, I would be ecstatic for that as well.
EDIT 2: Thanks to the replies, I can run a RegEx that matches the data points needed. I'm now having trouble iterating through the matches and returning them correctly. I have the RegEx loaded into its own function:
function regExExtract (txt){
var exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;
var comp_arr = exp.exec(txt);
return comp_arr;
}
And it is being called with:
var title_arr = regExExtract(title);
Title is loaded with the data string listed above. I assume I'm using the global flag correctly to ensure all matches are considered, but I'm not sure I'm loading the matches correctly. I apologize for my ignorance, this is all brand new to me.
As requested below, my expected output is ultimately a table with a row for each city, and its subsequent data. Each cell in each row corresponds to a data point.
I have created a JS Fiddle with what I've done, and what the expected output is:
http://jsfiddle.net/vDkQD/2/
Potential Final Edit: With the assistance of Robin and rewt, I have come up with:
http://jsfiddle.net/hMJx3/
Wouldn't a regex like
/([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g
do the trick? Obviously, this is based on the example string you gave, and if there are other patterns possible this should be updated... But if it is that fixed it's not so complicated.
Afterwards you just have to go through the captured groups for each match, and you'll have your data parsed. Live demo for fun: http://regex101.com/r/kF5zD3
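Going through the matches could look like the sketch below. A single exec call returns only one match even with the g flag, so the call is repeated until it returns null. The function name extractAll and the field names (city, line, total, odds, date) are made up for illustration.

```javascript
function extractAll(txt) {
  const exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;
  const rows = [];
  let m;
  // With the g flag, exec resumes from exp.lastIndex on each call.
  while ((m = exp.exec(txt)) !== null) {
    rows.push({
      city: m[1].trim(), // group 1 can carry a leading space after the pipe
      line: m[2],
      total: m[3],
      odds: m[4],
      date: m[5],        // undefined when the optional group did not match
    });
  }
  return rows;
}

const title = 'Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)';
const rows = extractAll(title);
console.log(rows.map((r) => r.city)); // [ 'Miami', 'Oklahoma City' ]
```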
Explanation
[^|\d] everything but a pipe or a digit. This is to account for strange city names that [a-zA-Z ] might not catch
[-\d.] a digit, a dot or a hyphen
\([^)]+\) opening parenthesis, everything that isn't a closing parenthesis, closing parenthesis.
Quick incomplete pointers on regex
Here, the regex is the part between the / characters. The g after it is a flag: thanks to it, the regex won't stop after the first match and will return every match
The match is what the whole expression finds. Here, the match will be everything between two | in your string. Capturing groups are a very useful tool that allows you to extract data from this match: they are delimited by parentheses, which are special characters in regex. (a)b will match ab, and the first captured group of this match will be a
[...] means any one of the characters inside will do. [abc] will match a or b or c.
+ is a quantifier, another special character, meaning "one or more of what precedes me". a+ means "one or more a" and will match aaaaa.
\d is a shortcut for [0-9] (yes, - is a special range character inside of [...]. That's why in [-\d.], which is equivalent to [-0-9.], it's directly following the opening bracket)
since parentheses are special characters, when you actually want to match a parenthesis you need to escape it: the regex (\(a\))b will match (a)b, and the first captured group of this match will be (a) with the parentheses
? means what precedes is optional (zero or one instances)
^ when put at the beginning of a [...] statement means "everything but what's in the brackets". [^a]+ will match bcd-*ù but not aa
If you really know nothing about regex, as I believe they're the right tool for your case, I suggest you take a quick look at a tutorial, just to get a better idea of what you're dealing with. The way to set flags, loop through matches and their respective captured groups will depend on your language and how you call your regex.
[A-Z][a-z]+( [A-Z][a-z]+)* -?[0-9]+\.[0-9] [OU] \(-?[0-9]+\.[0-9]\) -?[0-9]+\.[0-9]
This should match a single part of your long string under the following assumptions:
The city consists only of alpha characters, each word starts with an uppercase character and is at least 2 characters long.
Numbers have an optional sign and exactly one digit after the decimal point
the single character is either O or U
Now it is up to you to:
Properly create capturing parentheses
Check whether my assumptions are right
In order to match the date:
\([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M\)$
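Putting these two patterns together might look like the sketch below. It relies on the assumptions listed above, strips the date first, splits at the pipe, and then applies the per-part pattern with capturing parentheses added. The anchors (^ and $) and all variable names are my additions.

```javascript
const datePattern = /\([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M\)$/;
// Same per-part pattern, with capturing groups and anchors added.
const partPattern = /^([A-Z][a-z]+(?: [A-Z][a-z]+)*) (-?[0-9]+\.[0-9]) ([OU]) \((-?[0-9]+\.[0-9])\) (-?[0-9]+\.[0-9])$/;

const title = 'Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)';

// Isolate the date, then remove it from the string.
const dateMatch = title.match(datePattern);
const withoutDate = title.replace(datePattern, '').trim();

// Split at the pipe and drop index 0 (the whole match) from each result.
const parts = withoutDate
  .split(' | ')
  .map((part) => part.match(partPattern))
  .map((m) => (m ? m.slice(1) : null)); // null if an assumption does not hold

console.log(dateMatch[0]); // (Feb 20, 2014 08:05 PM)
console.log(parts);
// [ [ 'Miami', '2.5', 'O', '207.5', '125.0' ],
//   [ 'Oklahoma City', '-2.5', 'U', '207.5', '-145.0' ] ]
```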

What is the purpose of the passive (non-capturing) group in a Javascript regex?

What is the purpose of the passive group in a Javascript regex?
The passive group is prefaced by a question mark colon: (?:group)
In other words, these 2 things appear identical:
"hello world".match(/hello (?:world)/)
"hello world".match(/hello world/)
In what situations do you need the non capturing group and why?
Two use cases for capturing groups
A capturing group in a regex has actually two distinct goals (as the name "capturing group" itself suggests):
Grouping: if you need a group to be treated as a single entity in order to apply something (a quantifier, for example) to the whole group.
Probably the most trivial example is including an optional sequence of characters, e.g. "foo" optionally followed by "bar", in regex terms: /foo(bar)?/ (capturing group) or /foo(?:bar)?/ (non-capturing group). Note that the trailing ? is applied to the whole group (bar) (which consists of a simple character sequence bar in this case). In case you just want to check if the input matches your regex, it really doesn't matter whether you use a capturing or a non-capturing group — they act the same (except that a non-capturing group is slightly faster).
Capturing — if you need to extract a part of the input.
For example, you want to get number of rabbits from an input like "The farm contains 8 cows and 89 rabbits" (not very good English, I know). The regex could be /(\d+)\s*rabbits\b/. On successful match, you can get the value matched by the capturing group from JavaScript code (or any other programming language).
In this example, you have a single capturing group, so you access it via index 1 of the match result, since index 0 holds the whole match (see this answer for details).
Now imagine you want to ensure that the "place" is called "farm" or "ranch". If it's not the case, then you don't want to extract the number of rabbits (in regex terms, you don't want the regex to match).
So you rewrite your regex as follows: /(farm|ranch).*\b(\d+)\s*rabbits\b/. The regex works by itself, but your JavaScript is broken: there are two capturing groups now and you must change your code to get the contents of the second capturing group for the number of rabbits (i.e. change the index from 1 to 2). The first group now contains the string "farm" or "ranch", which you didn't intend to extract.
A non-capturing group comes to rescue: /(?:farm|ranch).*\b(\d+)\s*rabbits\b/. It still matches either "farm" or "ranch", but doesn't capture it, thus not shifting the indexes of subsequent capturing groups. And your JavaScript code works fine without changing.
The example may be oversimplified, but consider that you have a very complex regex with many groups, and you want to capture only few of them. Non-capturing groups are really helpful then — you don't have to count all of your groups (only capturing ones).
Besides, non-capturing groups serve documentation purposes: for someone who reads your code, a non-capturing group is an indication that you are not interested in extracting its contents, you just want to ensure that it matches.
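The index shift in the rabbit example can be seen in a short sketch. In a JavaScript match result, index 0 is the whole match and the capturing groups start at index 1. The variable names are illustrative.

```javascript
const input = 'The farm contains 8 cows and 89 rabbits';

// One capturing group: the rabbit count is at index 1.
const simple = input.match(/(\d+)\s*rabbits\b/);
console.log(simple[1]); // 89

// Adding a capturing group in front shifts the count to index 2.
const capturing = input.match(/(farm|ranch).*\b(\d+)\s*rabbits\b/);
console.log(capturing[1], capturing[2]); // farm 89

// A non-capturing group keeps the count at index 1.
const nonCapturing = input.match(/(?:farm|ranch).*\b(\d+)\s*rabbits\b/);
console.log(nonCapturing[1]); // 89
```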
A few words on separation of concerns
Capturing groups are a typical example of breaking the SoC principle. This syntax construct serves two distinct purposes. As the problems herewith became evident, an additional construct (?:) was introduced to disable one of the two features.
It was just a design mistake. Maybe a lack of "free special characters" played its role... but it was still a poor design.
Regex is a very old, powerful and widely used concept. For the reasons of backwards compatibility, this flaw is now unlikely to be fixed. It's just a lesson of how important the separation of concerns is.
Non-capturing groups have just one difference from "normal" (capturing) groups: they don't require the regex engine to remember what they matched.
The use case is that sometimes you must (or should) use a group not because you are interested in what it captures but for syntactic reasons. In these situations it makes sense to use a non-capturing group instead of a "standard" capturing one because it is less resource intensive -- but if you don't care about that, a capturing group will behave in the exact same manner.
Your specific example does not make a good case for using non-capturing groups exactly because the two expressions are identical. A better example might be:
input.match(/hello (?:world|there)/)
In addition to the answers above, if you're using String.prototype.split() and you use a capturing group, the output array contains the captured results (see MDN). If you use a non-capturing group that doesn't happen.
var myString = 'Hello 1 word. Sentence number 2.';
var splits = myString.split(/(\d)/);
console.log(splits);
Outputs:
["Hello ", "1", " word. Sentence number ", "2", "."]
Whereas swapping /(\d)/ for /(?:\d)/ results in:
["Hello ", " word. Sentence number ", "."]
When you want to apply modifiers to the group.
/hello (?:world)?/
/hello (?:world)*/
/hello (?:world)+/
/hello (?:world){3,6}/
etc.
Use them when you need a conditional and don't care about which of the choices cause the match.
Non-capturing groups can simplify the result of matching a complex expression. Here, group 1 is always the speaker's name. With a capturing group instead, group 1 would hold world or foobar (or undefined), and the speaker's name would shift to group 2.
/hello (?:world|foobar )?said (.+)/
I have just found a different use for it. I was trying to capture a nested group but wanted the whole collection of the repeating group as one element:
So for AbbbbC
(A)((?:b)*)(C)
gives three groups A bbbb C
For AC it also gives three groups: A, an empty string, and C
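A quick sketch to check this in JavaScript (the names are illustrative). Because the * sits inside the capturing group, the group still participates in the match when there are no b characters, so it captures an empty string rather than undefined.

```javascript
const nested = /(A)((?:b)*)(C)/;

// slice(1) drops index 0 (the whole match) and keeps only the groups.
const withBs = 'AbbbbC'.match(nested).slice(1);
const withoutBs = 'AC'.match(nested).slice(1);

console.log(withBs);    // [ 'A', 'bbbb', 'C' ]
console.log(withoutBs); // [ 'A', '', 'C' ]
```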

The behavior of /g mode matching

On this article, it mentioned
Make sure you are clear on the fact that an expression pattern is
tested on each individual character. And that, just because the engine
moves forward when following the pattern and looking for a match it
still backtracks and examines each character in a string until a match
is found or if the global flag is set until all characters are
examined.
But what I tested in Javascript
"aaa#bbb".match(/a+#b+/g)
does not produce a result like:
["aaa#bbb", "aa#bbb", "a#bbb"]
It only produces ["aaa#bbb"]. It seems it does not examine each character to test the pattern. Can anyone explain the matching steps a little? Thanks.
/g does not mean it will try to find every possible subset of characters in the input string which may match the given pattern. It means that once a match is found, it will continue searching for more substrings which may match the pattern starting from the previous match onward.
For example:
"aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);
Will produce
["aaa#bbb", "aaaa#bbbb"]
That explanation is mixing two distinct concepts that IMO should be kept separated
A) backtracking
When looking for a match the normal behavior for a quantifier (?, *, +) is to be "greedy", i.e. to munch as much as possible... for example in /(a+)([^b]+)/ tested with aaaacccc all the a will be part of group 1 even if of course they also match the char set [^b] (everything but b).
However if grabbing too much is going to prevent a match the RE rules require that the quantifier "backtracks" capturing less if this allows the expression to match. For example in (a+)([^b]+) tested with aaaa the group 1 will get only three as, to leave one for group 2 to match.
You can change this greedy behavior with "non-greedy quantifiers" like *?, +?, ??. In this case the engine still backtracks, but with the reverse meaning: a non-greedy subexpression will eat as little as possible while still allowing the rest of the expression to match. For example (a+)(a+b+) tested with aaabbb will leave two as for group 1 and abbb for group 2, but (a+?)(a+b+) with the same string will instead leave only one a for group 1, because this is the minimum that allows matching the remaining part.
Note that, because of backtracking, the greedy/non-greedy choice doesn't change whether an expression has a match or not, but only how big the match is and how much goes to each sub-expression.
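The four cases above can be checked with a small sketch. The helper name groups is made up, and it simply returns the captured groups of the first match.

```javascript
// Hypothetical helper: captured groups of the first match, without index 0.
const groups = (re, s) => s.match(re).slice(1);

// Greedy: group 1 takes every a.
console.log(groups(/(a+)([^b]+)/, 'aaaacccc')); // [ 'aaaa', 'cccc' ]

// Greedy with backtracking: group 1 gives one a back so group 2 can match.
console.log(groups(/(a+)([^b]+)/, 'aaaa'));     // [ 'aaa', 'a' ]

// Greedy vs. non-greedy on the same input.
console.log(groups(/(a+)(a+b+)/, 'aaabbb'));    // [ 'aa', 'abbb' ]
console.log(groups(/(a+?)(a+b+)/, 'aaabbb'));   // [ 'a', 'aabbb' ]
```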
B) the "global" option
This is something totally unrelated to backtracking and simply means that instead of stopping at the first match the search must find all non-overlapping matches. This is done by finding the first match and then starting again the search after the end of the match.
Note that each match is computed using the standard regexp rules and there is no look-ahead or backtracking between different matches: in other words if making for example a greedy match shorter would give more matches in the string this option is not considered... a+[^b]+ tested with aaaaaa is going to give only one match even if g option is specified and even if the substrings aa, aa, aa would have been each a valid match for the regexp.
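A short sketch of the global option on the inputs discussed here (the variable names are illustrative). Each match is found by the normal rules, and the search then resumes after the end of that match.

```javascript
// One greedy match swallows the whole string; no shorter alternatives are tried.
const single = 'aaaaaa'.match(/a+[^b]+/g);
console.log(single); // [ 'aaaaaa' ]

// The question's input: still a single match.
const question = 'aaa#bbb'.match(/a+#b+/g);
console.log(question); // [ 'aaa#bbb' ]

// Two separate, non-overlapping occurrences give two matches.
const two = 'aaa#bbb ... aaaa#bbbb'.match(/a+#b+/g);
console.log(two); // [ 'aaa#bbb', 'aaaa#bbbb' ]
```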
When the global flag is used, it starts searching for the next match after the end of the previous match, to prevent generating lots of overlapping matches like that.
If you don't specify /g, the engine will stop as soon as a match is found.
If you do specify /g, it will keep going after a match. It will, however, still not produce overlapping matches, which is what you're asking about.
It's because of how regex matching works.
What regex tries to do:
Every regular expression tries to find the best match at each position.
What regex won't do:
It will not return every combination of substrings within a single match, as you expected in your case.
When your "aaa#bbb".match(/a+#b+/g) scenario works:
Try an input like aaa#bbbHiaa#bbbHelloa#bbbSEEYOU instead, which will give you
aaa#bbb
aa#bbb
a#bbb
