Can't find a proper regex for parsing the string - javascript

So I have the string in the following format
[randomstring] [randomtest]
[randomstring] [texttext...
Data:
{"data}]
So the only thing every line has in common is that all text is stored inside exactly two pairs of square brackets per line: [text1][text2]. The problem is when the text spans multiple lines:
[text1][text2
text3
text4]
So I'm looking for a regex to match every [][] pair, per line and came up with this:
https://regex101.com/r/vI0oF6/1
As you can see, only the first line is matched and not the second. Is there a better way?

You have two options. Use the s modifier to match newlines with ., or simply accept newlines inside the square brackets.
With the s modifier
/(\[.+?\]+\s?\[.+?\])/gs
https://regex101.com/r/vI0oF6/5
Without the s-modifier
/(\[(?:.|\n)+?\]+\s?\[(?:.|\n)+?\])/g
https://regex101.com/r/vI0oF6/6
Note that I'm creating a non-capturing group with the (?:.|\n) syntax.
Also, notice that I used the lazy (non-greedy) modifier ? after the quantifier inside the square brackets, so the match stops at the first closing bracket instead of greedily consuming closing brackets with the dot. Visualized, a ? after a quantifier (* or +) does this:
Without ?, .+ is greedy and matches up to the last ] it can find
# Simple example: \[.+\]
[foo][bar]
^--------^
With ?, .+ is non-greedy and matches only up to the first ]
# Simple example: \[.+?\]
[foo][bar]
^---^
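To try the multi-line matching described above without relying on the s flag (which needs an ES2018-capable engine), here is a minimal sketch. It swaps in [\s\S] for (?:.|\n) and drops the + after the first \], which is my own simplification; the sample input is made up to resemble the question's format.

```javascript
// Sample input modeled on the question; the contents are illustrative only.
const input = '[text1] [text2]\n[text1][text2\ntext3\ntext4]';

// [\s\S] matches any character, newlines included, so no `s` flag is needed.
// The lazy +? stops at the first closing bracket it meets.
const pairPattern = /\[[\s\S]+?\]\s?\[[\s\S]+?\]/g;

const pairs = input.match(pairPattern);
console.log(pairs);
// [ '[text1] [text2]', '[text1][text2\ntext3\ntext4]' ]
```

Each array element is one [..][..] pair, with embedded newlines kept as part of the second group.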

Use the s modifier so that the dot also matches newline characters.
https://regex101.com/r/vI0oF6/2

Try with this expression (\[[^\]]+\])
http://regexr.com/3duup

I guess this is the one you are looking for (\[[^\]]+\]).
It matches a [ followed by one or more characters other than ], followed by a ]. If you want to match ones without anything inside the brackets as well, use * instead of +.
Note: My understanding is that you need to match [text1] and [text2] from first line, [text1] and
[text2
text3
text4]
from lines 3 to 5 when input is
[text1] [text2]
[text1][text2
text3
text4]
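A quick sketch of that reading, assuming the four-line input shown above. Because a negated character class matches newlines as well, no extra flag is needed.

```javascript
// Input copied from the note above.
const text = '[text1] [text2]\n[text1][text2\ntext3\ntext4]';

// [^\]]+ means "one or more characters that are not ]", newlines included.
const brackets = text.match(/\[[^\]]+\]/g);
console.log(brackets);
// [ '[text1]', '[text2]', '[text1]', '[text2\ntext3\ntext4]' ]
```

This yields individual bracket groups rather than pairs, so pairing them up (two per entry) would be a separate step.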

Related

Ambiguity in regex in javascript

var a = 'a\na'
console.log(a.match(/.*/g)) // ['a', '', 'a', '']
Why are there two empty strings in the result?
Granted that there are empty strings, why isn't there one at the beginning and at the end of each line as well, giving 4 empty strings?
I am not looking for how to select 'a's but just want to understand the presence of the empty strings.
The best explanation I can offer for the following:
'ab\na'.match(/.*/g)
["ab", "", "a", ""]
Is that the dot in JavaScript does not match line terminators unless the s (dotAll) flag is set. When the .* pattern is applied to ab\na, it first matches ab and stops before the newline. The next attempt starts at that same position, just before the newline, where .* can only match the empty string. The engine then steps past the newline and matches a. A final attempt at the very end of the string yields another empty match, because .* is allowed to match zero characters.
If you just want to extract the non-empty content from each line, you can use + instead of *:
console.log('ab\na'.match(/.+/g))
// ["ab", "a"]
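To see where each match comes from, a small sketch using String.prototype.matchAll can print the start index alongside the matched text. The variable names are only illustrative.

```javascript
const subject = 'ab\na';

// Collect [startIndex, matchedText] for every match of .* in the subject.
const positions = [...subject.matchAll(/.*/g)].map(m => [m.index, m[0]]);
console.log(positions);
// [ [ 0, 'ab' ], [ 2, '' ], [ 3, 'a' ], [ 4, '' ] ]
```

Index 2 is the position just before the newline and index 4 is the end of the string, which are the two places where only an empty match is possible.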
Let's say if there are empty strings, why isn't there one at beginning
and at the end...
.* is greedy. It swallows a complete line in one go, where a line means everything before a line break. When it then reaches the end of the line, it matches again (with zero characters) because of the star quantifier.
If you want 4 empty strings, you can add ? to the star quantifier to make it lazy (.*?), but this regex gives different results in different flavors because of the way they handle zero-length matches.
You can try .*? with both PCRE and JS engines in regex101 and see the differences.
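For the JavaScript half of that comparison, a minimal sketch (the PCRE result is not reproduced here, since it depends on that engine's zero-length match handling):

```javascript
// Lazy .*? prefers to match nothing, so every position yields an empty match.
const lazyMatches = 'a\na'.match(/.*?/g);
console.log(lazyMatches);
// [ '', '', '', '' ]
```

The string has three characters and therefore four positions (0 through 3), and the JavaScript engine reports one empty match at each of them.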
Question:
You may ask: why does the engine try to find a match at the end of the line when the whole thing has already been matched?
Answer:
It's because end of line and end of string are positions in their own right, so the whole input has not actually been consumed. One position is left over that can still produce a match, and the star quantifier allows it.
That leftover position is the end of the line, which is where $ matches when the m flag is on. A lone . cannot match at this position, but .* and .*? can, because they also match zero-length positions, as does any pattern that can match nothing, such as \d*, \D*, a* or b?
The star operator * means there can be any number of occurrences (even 0 occurrences). With the expression used, an empty string can be a match. I'm not sure what you are looking for, but maybe the + operator (1 or more occurrences) would be better?
To add some more info: regex quantifiers are greedy by default (in some languages you can override this behaviour), so the regex picks as much of the text as it can. In this case it picks the first a, which leaves "\na" still to process. "\n" does not match ".", so the only available option at that position is the empty string. Then the next line is processed, and again an "a" can be matched. After this, only the empty string matches the regex.
* Matches the preceding expression 0 or more times.
. matches any single character except the newline character.
That is what the official documentation says about . and *. So I guess the array you received is something like this:
[ the "any character" run of the first line, followed by "nothing", the "any character" run of the second line, followed by "nothing"]
The newline character itself is never part of a match.

Regular expression to retrieve from the URL

/test-test-test/test.aspx
Hi there,
I am having a bit of difficulty retrieving the first segment from the above URL:
test-test-test
I tried /[\w+|-]/g, but it matches the last part, test.aspx, as well.
Please help out.
Thanks
One way of doing it is to use the DOM parser, as described here: https://stackoverflow.com/a/13465791/970247.
Then you could access to the segments of the url using for example: myURL.segments; // = Array = ['test-test-test', 'test.aspx']
You need to use a positive lookahead assertion. A | inside a character class matches a literal | symbol; it does not act as an alternation operator, so I suggest you remove it.
[\w-]+(?=\/)
(?=\/) is a positive lookahead assertion, which asserts that the match must be followed by a forward slash. In our case only test-test-test is followed by a forward slash, so it is the only match. [\w-]+ matches one or more word characters or hyphens; + repeats the previous token one or more times.
Example:
> "/test-test-test/test.aspx".match(/[\w-]+(?=\/)/g)
[ 'test-test-test' ]
[\w+|-] is wrong, should be [\w-]+. "A series of characters that are either word characters or hyphens", not "a single character that is a word character, a plus, a pipe, or a hyphen".
The g flag means global match, so naturally all matches will be found instead of just the first one. So you should remove that.
> '/test-test-test/test.aspx'.match(/[\w-]+/)
< ["test-test-test"]
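For comparison, here is a short sketch of the lookahead approach next to a regex-free alternative based on split. The split version is my own addition and not something proposed in the answers above.

```javascript
const path = '/test-test-test/test.aspx';

// Lookahead: only runs of word characters/hyphens followed by "/" qualify,
// so the g flag is harmless here.
const viaLookahead = path.match(/[\w-]+(?=\/)/g);
console.log(viaLookahead); // [ 'test-test-test' ]

// Alternative without a regex: split on "/" and drop the empty leading piece.
const firstSegment = path.split('/').filter(Boolean)[0];
console.log(firstSegment); // test-test-test
```

The split version makes no assumption about which characters a segment may contain, which can be an advantage if segments are not limited to word characters and hyphens.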

Regexp match optional group unless it's got something inside it

I'm playing around with this regexp: http://regex101.com/r/dL3qX1
!\[(.*?)\](?:\(\)|\[\])?
All the below strings should match. However, should the second set of brackets, that is optional, contain anything within it, the regexp should match nothing.
// Match
![]
![caption]
![]()
![caption]()
![][]
![caption][]
// No match
![][No match]
![caption][No match]
![](No match)
![caption](No match)
I should still be able to match examples that have text at the end of the line.
![] hello
![caption][] hi there
In other words, I only want a match if there is no optional group, or if there is, I only want a match if the optional group is empty (nothing between the brackets).
Is what I'm after possible?
I personally prefer using negated class when it comes to brackets:
^!\[([^\[\]]*)\](?:\(\)|\[\])?$
regex101 demo
I replaced (.*?) with [^\[\]]* and added ^ and $ at the beginning and end respectively.
That is, if I understood what you're looking for correctly, only the first set is matching.
You can use this regex:
^!\[[^\]]*\](?:\(\)|\[\])?$
Working Demo: http://regex101.com/r/eX0sR8
Note the use of [^\]]* instead of .*? inside the first square brackets, which makes sure the match runs only up to the very first ]. It is also better to use the line start/end anchors ^ and $.
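A quick way to check the anchored pattern against the question's examples is sketched below. Note that with ^ and $ the two trailing-text examples (such as "![] hello") do not match. The second pattern, which uses a negative lookahead instead of anchors, is my own assumption about how that requirement could be met, not something taken from the answers.

```javascript
const shouldMatch = ['![]', '![caption]', '![]()', '![caption]()', '![][]', '![caption][]'];
const shouldNotMatch = ['![][No match]', '![caption][No match]', '![](No match)', '![caption](No match)'];
const trailingText = ['![] hello', '![caption][] hi there'];

// Pattern from the answer above: the whole line must be the image tag.
const anchored = /^!\[[^\]]*\](?:\(\)|\[\])?$/;

// Hypothetical variant: no anchors, but the tag must not be directly
// followed by another opening bracket or parenthesis.
const lookahead = /!\[[^\[\]]*\](?:\(\)|\[\])?(?![\[(])/;

const results = {};
for (const [name, re] of [['anchored', anchored], ['lookahead', lookahead]]) {
  results[name] = {
    shouldMatch: shouldMatch.every(s => re.test(s)),
    shouldNotMatch: shouldNotMatch.some(s => re.test(s)),
    trailingText: trailingText.every(s => re.test(s)),
  };
}
console.log(results);
// anchored:  shouldMatch true, shouldNotMatch false, trailingText false
// lookahead: shouldMatch true, shouldNotMatch false, trailingText true
```

Here shouldNotMatch reports whether any of the "No match" strings slipped through, so false is the desired value for both patterns.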

regular expression to replace with ','

I have one RegExp, could anyone explain exactly what it does?
Regexp
b=b.replace(/(\d{1,3}(?=(?:\d\d\d)+(?!\d)))/g,"$1 ")
I think it inserts a space (' ') after each group of digits.
If I'm right, I want it to insert a comma (,) instead of a space (' ').
To explain the regex, let's break it down:
( # Match and capture in group number 1:
\d{1,3} # one to three digits (as many as possible),
(?= # but only if it's possible to match the following afterwards:
(?: # A (non-capturing) group containing
\d\d\d # exactly three digits
)+ # once or more (so, three/six/nine/twelve/... digits)
(?!\d) # but only if there are no further digits ahead.
) # End of (?=...) lookahead assertion
) # End of capturing group
Actually, the outer parentheses are unnecessary if you use $& instead of $1 for the replacement string ($& contains the entire match).
The regex (\d{1,3}(?=(?:\d\d\d)+(?!\d))) matches any 1-3 digits (\d{1,3}) that are followed by a multiple of 3 digits ((?:\d\d\d)+) that isn't followed by another digit ((?!\d)). It replaces the match with "$1 ". $1 is replaced by the first capture group. The space after it is... a space.
See the regular expressions guide on MDN for more information about the different syntaxes.
If you want to separate the numbers with a comma instead of a space, you'll need to replace it with "$1," instead.
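Putting both suggestions together (a comma in the replacement, and $& in place of the outer capture group), a minimal sketch might look like this. The sample value is made up for illustration.

```javascript
let b = '1234567';

// Same lookahead as in the question, without the outer capturing group;
// $& stands for the whole match, followed by a literal comma.
b = b.replace(/\d{1,3}(?=(?:\d\d\d)+(?!\d))/g, '$&,');
console.log(b); // 1,234,567
```

The greedy \d{1,3} backs off until the digits remaining after it form whole groups of three, which is why the first match is just "1" rather than "123".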
Don't try to solve everything by using regular expressions.
Regular expressions are meant for matching, not to fix non-text-encoded-as-text formatting.
If you want to format numbers differently, extract them and use format strings to reformat them on a character processing level. That is just an ugly hack.
It is okay to use regular expressions to find the numbers in the text, e.g. \d{4,} but trying to do the actual formatting with regexp is a crazy abuse.
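One way to follow this advice in JavaScript is sketched below: the regex only finds the digit runs, and the built-in Intl.NumberFormat does the formatting. The 'en-US' locale and the sample sentence are my assumptions, and converting with Number loses precision for integers beyond Number.MAX_SAFE_INTEGER.

```javascript
// Formatter for comma-grouped integers; the locale choice is an assumption.
const formatter = new Intl.NumberFormat('en-US');

const sentence = 'Total: 1234567 items, 42 left';

// The regex only locates runs of 4+ digits; the formatter adds the separators.
const formatted = sentence.replace(/\d{4,}/g, digits => formatter.format(Number(digits)));
console.log(formatted); // Total: 1,234,567 items, 42 left
```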

Regex get all text from # to quotation

Okay so I currently have:
/(#([\"]))/g;
I want to be able to check for a string like:
#23ad23"
What's wrong with my regex?
Your regex (/(#([\"]))/g) breaks down like this:
without the start/end delimiters, flags and capturing parentheses:
#[\"]
which just means # followed by ". The square brackets for the character class are unnecessary, as there is only one item in it, so this is equivalent to:
#"
I think you want to match all characters between # and " inclusive (and captured exclusively).
Start with regex like this:
#.+?"
Which means # followed by anything (.) one or more times (+), non-greedily (?), followed by "
So, with the capturing parentheses and delimiters:
/(#(.+?)")/g
Is this what you mean?
/(#([^\"]+))/g;
This will include everything until it reaches the " char.
For the fewest matches (each as long as possible, greedy): #(.+)\"
For the most matches (each as short as possible, lazy): #(.+?)\"
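The difference between the variants above only shows up when the input contains more than one closing quote. Here is a small sketch with a made-up sample string that contains two tokens:

```javascript
// Hypothetical input with two #..." tokens.
const sample = '#23ad23" and #ff00aa"';

// Helper: collect the first capture group of every match.
const captures = re => [...sample.matchAll(re)].map(m => m[1]);

const lazy = captures(/#(.+?)"/g);     // stops at the first " after each #
const greedy = captures(/#(.+)"/g);    // runs on to the last " in the string
const negated = captures(/#([^"]+)"/g); // cannot cross a " at all

console.log(lazy);    // [ '23ad23', 'ff00aa' ]
console.log(greedy);  // [ '23ad23" and #ff00aa' ]
console.log(negated); // [ '23ad23', 'ff00aa' ]
```

For the single-token string from the question, #23ad23", all three patterns capture the same text, 23ad23.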
