Matching First and Last Occurrences Only

So I am making a simple BBCode parser in JavaScript, nothing too fancy. I first need to get a regular expression that will match only BBCode and will only match the first and last occurrences of the tag. This will help with items that are nested in each other such as
[b][c red]This should output bold red text[/c][/b]
which should be parsed to
<span style="font-weight: bold;"><span style="color: red;">This should output bold red text</span></span>
The current "Master" regex (the one that detects if there is any BBCode in the string) is as follows.
(\[{1}([^\[]{1,3})(| .*?)\]{1}(.*?)\[{1}(\/{1}[^\]]{1,3})\]{1})
Is there any way to alter this in order to detect only the first and last matches?
Note: I want to exclude wikilinks such as [[Main Page]]

Regular expressions aren't the right tool for this job, just as they aren't the right tool for parsing HTML. This is because BBCode, with its nested tags, is a context-free language and not a regular language (hence the name regular expression).
However, I can never complain with someone working on something as a "small problem solving exercise" (that's why I'm on SO). You said my comment helped, so I'll post it and add an explanation.
\[(\w{1,3})\](.*)\[\/\1\]
<$1>$2</$1>
First we look for [ followed by our first capturing group of 1-3 "word" characters ([a-zA-Z0-9_]) followed by the ]. This \w can be replaced with [^\]] to match any character but the closing bracket, or really anything else of your choosing (I'm not entirely sure of the BBCode specs and what a tag can consist of). Then we (greedily) capture 0+ characters into another group. Finally, we look for a [/ followed by our first captured group (\1, which refers back to whatever \w{1,3} matched) followed by a ]. Since we used a greedy capture with (.*), it will keep going until it gets to the last matching closing tag.
Now we have 2 captured groups, one with the tag and one with the contents. You can change the [ to < by simply referencing the groups: <$1>$2</$1>
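To make the greedy behaviour concrete, here is a minimal sketch that applies the pattern above repeatedly so that nested tags get converted from the outside in. The function name convertBBCode and the loop are my own additions for illustration, and tags with attributes (like [c red]) are not handled.

```javascript
// Hypothetical helper: repeatedly applies the greedy pattern from the answer.
// Each pass converts the outermost [tag]...[/tag] pair it finds.
function convertBBCode(text) {
  const pattern = /\[(\w{1,3})\](.*)\[\/\1\]/;
  let previous;
  do {
    previous = text;
    // $1 is the tag name, $2 is everything up to the LAST matching closing tag.
    text = text.replace(pattern, "<$1>$2</$1>");
  } while (text !== previous); // stop once a pass changes nothing
  return text;
}

console.log(convertBBCode("[b][i]bold italic[/i][/b]"));
// prints "<b><i>bold italic</i></b>"
```

The first pass turns the outer [b] pair into <b>, leaving [i]...[/i] inside the second group; the second pass converts that inner pair.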
Regex101

Related

Regex to capture everything (incuding newlines) within a capturing group

I have some text (which some of you will recognize as part of a fortune file)
A day for firm decisions!!!!! Or is it?
%
A few hours grace before the madness begins again.
%
A gift of a flower will soon be made to you.
%
A long-forgotten loved one will appear soon.
Buy the negatives at any price.
%
As can be seen, this contains both single line text and block text (as seen in the last fortune).
I currently have a regex that will capture all single line fortunes, however, it does not capture the multiline fortune.
(?<=%\n)(.*?)(?=\n%)
I understand that there is a /m multiline option, however, I do not want the entire regex to be multiline enabled (I have not gotten it to work at all in that way).
So my question is: How can I select multiline text blocks between delimiters as a local capture group? It should be noted that I will be using this in JavaScript.
Try this,
str.split(/\n%\n/)
This splits the string by lines that contain % only.
To match a newline, you can use [^] or [\s\S]. The dot does not match newlines. This has nothing to do with the m flag, which only controls whether the anchors (^ and $) match at the beginning and end of each line. Other regexp engines have syntax for making the dot match newlines, and one is being proposed for a future version of JS, but for the time being you'll have to use one of the approaches above.
[^] literally means "match any character which is not nothing", which, as it turns out, includes a newline; [\s\S] literally means "match any character which is either whitespace or not whitespace", which also includes a newline.
Assuming your regexp currently works except for this newline problem, use
(?<=%\n)([^]*?)(?=\n%)
See this SO question. There is some information on this in Eloquent JavaScript. The TC39 proposal (for a new s flag) is here.
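As a rough illustration of the difference, the sketch below runs the original dot-based pattern and the [\s\S] variant against a shortened copy of the fortune text. It assumes an engine with lookbehind support; the variable names are made up for the example.

```javascript
// Shortened sample of the fortune file from the question.
const fortunes = [
  "A day for firm decisions!!!!! Or is it?",
  "%",
  "A few hours grace before the madness begins again.",
  "%",
  "A long-forgotten loved one will appear soon.",
  "Buy the negatives at any price.",
  "%",
].join("\n");

// The dot cannot cross a newline, so the two-line fortune is skipped.
const withDot = fortunes.match(/(?<=%\n)(.*?)(?=\n%)/g);

// [\s\S] matches any character, including newlines.
const withAnyChar = fortunes.match(/(?<=%\n)([\s\S]*?)(?=\n%)/g);

console.log(withDot.length);     // 1
console.log(withAnyChar.length); // 2
```

Note that neither pattern picks up the very first fortune, because nothing precedes it for the lookbehind to match.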

Javascript REGEX to match multiple custom tags (also incomplete)

I'm trying to match "custom" tags that might be complete/incomplete as described below.
The bold text is what I'm trying to match.
%end{some text
%start{some text
%start{some text}%end
%start{some text}%end%start{more text}%end
Also, these tags can appear multiple times within a string. For example, the regex:
/%start(.*)%end/gi
applied on the 4th example would capture:
%start{some text}%end%start{more text}%end
How would I go about achieving the matches described in the first 4 examples?
If your data can contain multiple tags on a line, with unclosed tags in other positions than the last one, and the tag content can contain %, it's a little tricky:
Use /%(?:start|end){((?:(?!%(?:start|end){)[^}])+)/g and retrieve the first group.
Here is a regex101 test.
Note that it is about 3 times more expensive than the next two expressions, taking 112 steps to match your fourth data example, while the other two only take 34 steps.
If your data can contain multiple tags on a line, with unclosed tags in other positions than the last one, but the tag content can't contain %, it's already a lot easier:
Use /%(?:start|end){([^}%]+)/g and retrieve the first group.
Here is a regex101 test. Note how it fails on the last dataset.
If your data can't contain unclosed tags in other positions than the last one, it's even easier:
Use /%(?:start|end){([^}]+)/g and retrieve the first group.
Here is a regex101 test. Note that you will need to add linefeed characters to the negated class if you parse multiple lines at once, and also how it fails on the last two datasets.
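For reference, "retrieve the first group" could look something like the sketch below, which uses the first (most general) expression. The helper name tagContents is hypothetical, and String.prototype.matchAll needs a reasonably recent engine.

```javascript
// Hypothetical helper: returns the content captured after each %start{ or %end{.
function tagContents(text) {
  const pattern = /%(?:start|end){((?:(?!%(?:start|end){)[^}])+)/g;
  // Each match object holds the whole match at [0] and the first group at [1].
  return Array.from(text.matchAll(pattern), (match) => match[1]);
}

console.log(tagContents("%start{some text}%end%start{more text}%end"));
// [ 'some text', 'more text' ]
console.log(tagContents("%end{some text"));
// [ 'some text' ]
```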
You can use this pattern:
/%start([^%]*(?:%(?!end)[^%]*)*)(?:%end)?/gi
The idea is to describe the content in a greedy way that can't match the closing tag and to make the closing tag optional.
[^%]* # all that is not a %
(?:
%(?!end) # a % not followed by "end"
[^%]*
)*
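Here is a small sketch of how that pattern behaves when the content itself contains a % and when the closing tag is missing. The function name startBlocks and the sample input are invented for the example.

```javascript
// Hypothetical helper: collects the captured content of every %start block.
function startBlocks(text) {
  const pattern = /%start([^%]*(?:%(?!end)[^%]*)*)(?:%end)?/gi;
  const contents = [];
  let match;
  // exec() with the g flag resumes from the end of the previous match.
  while ((match = pattern.exec(text)) !== null) {
    contents.push(match[1]);
  }
  return contents;
}

console.log(startBlocks("%start{50% done}%end%start{open"));
// [ '{50% done}', '{open' ]
```

The braces stay in the captured group here, since the pattern only describes what lies between %start and the optional %end.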
I assume that the first tag is invalid as it does not have %start, and that if you omit %end then the tag ends at the last word.
So regex would be (example): %start{([a-z0-9\s]+)}?
You could try to use this one:
/{([a-z0-9 ]*)}/gi
You can see the result here:
https://regex101.com/r/uY8jE5/1

How to count two words as 1 in same line

In the text file I've got, each sentence is represented with a specific type such as: contrast.
A contrasting sentence can either be represented with a tag "CONTRAST" or "CONTR" or "WEAKCONTR". For instance:
IMPSENT_CONTRAST_VIS(Studying networks in this way can help to
identify the people from whom an individual learns , where
conflicts_MD:+ in understanding_MD:+ may originate , and which
contextual factors influence learning .)
So I count these with following expression: /(\_(WEAK))|(\_CONTRAST)|(\_CONTR(\_|\())/g which works perfectly fine.
Now the problem is some sentences are expressed with more than one contrast tag such as CONTR & WEAKCONTR together. For instance:
IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS(Studying_MD:+ networks in this way
can help to identify_MD:+ the people from whom an individual learns
, where conflicts_MD:+ in understanding_MD:+ may originate , and
which contextual factors influence learning .)
At this point I have to count these as 1 not 2. Do you have any idea how possible this is with RegExp?
You can use lookaheads to assert it, and then count the matches:
(?=\w*_(?:WEAK|CONTRAST|CONTR[_(]))\b\w+\b
Demo here: http://regex101.com/r/xP2yI7/3
Notice the match count.
This will match the whole IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS expression, but only if it matches the part in the lookahead, which filters for the keywords you're looking for. This will match even if you have multiple such sentences on the same line.
Also, I've simplified your regex a bit, retaining the same meaning. Notice you don't have to escape the _.
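Counting the matches could then be done roughly like this. Note that the sketch uses [_(] after CONTR, to mirror the (\_|\() in the question's original expression; the sample text and the helper name countContrast are assumptions for illustration.

```javascript
// Hypothetical helper: counts sentence labels that carry at least one contrast tag.
function countContrast(text) {
  const pattern = /(?=\w*_(?:WEAK|CONTRAST|CONTR[_(]))\b\w+\b/g;
  const matches = text.match(pattern);
  return matches === null ? 0 : matches.length; // match() returns null when nothing matches
}

const sample = [
  "IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS(Studying_MD:+ networks in this way",
  "IMPSENT_CONTRAST_VIS(Studying networks in this way can help to",
  "IMPSENT_VIS(No contrast tag on this one .)",
].join("\n");

console.log(countContrast(sample)); // 2
```

The first label contains two contrast tags but is still counted once, because the whole label is consumed as a single match.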
You really just care if the tag shows up in the line at all, so just grab the whole line, provided it has a tag, like so:
/^([A-Z_]+(WEAK|CONTRAST|CONTR)+[A-Z_]*)/gm
From the start of the line ^ look for a word block with A-Z or _ followed by the tag, optionally followed by more words/underscores.
DEMO
Can you try adding \w+:
/(\_(WEAK\w+))|(\_CONTRAST\w+)|(\_CONTR(\_\w+|\())/g
Something like this?
(^(\_(WEAK))|(\_CONTRAST)|(\_CONTR(\_|\()))

Complex string parsing in Javascript

I am attempting to parse a complex string in JavaScript, and I'm pretty horrible with Regular Expressions, so I haven't had much luck. The data is loaded into a variable formatted as follows:
Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)
I am trying to parse that string following these parameters:
1) Each value must be loaded into its own variable (i.e. separate variables for Miami, 2.5 O, (207.5), etc.)
2) String must split at pipe character (I have this working with .split(" | ") )
3) I am dealing with city names that include spaces
4) The date at the end must be isolated and removed
I have a feeling regular expressions must be used, but I'm seriously hoping there is a different way to approach this. The example provided is just that, an example from a much larger data set. I can provide the full data set if requested.
More direct version of my question: Given the data above, what concepts / procedures can I use to intelligently parse the string elements into their own variables?
If RegEx must be used, will I need multiple expressions?
Thanks in advance for your help!
EDIT: In an effort to supply multiple pathways to a solution I'll explain the overarching problem as well. This data is the return of a RSS / XML item. The string mentioned above is sports odds, and is all contained in the title node of the feed I'm using. If anyone has a better XML / RSS feed for sports odds, I would be ecstatic for that as well.
EDIT 2: Thanks to the replies, I can run a RegEx that matches the data points needed. I'm now having trouble iterating through the matches and returning them correctly. I have the RegEx loaded into its own function:
function regExExtract (txt){
var exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;
var comp_arr = exp.exec(txt);
return comp_arr;
}
And it is being called with:
var title_arr = regExExtract(title);
Title is loaded with the data string listed above. I assume I'm using the global flag correctly to ensure all matches are considered, but I'm not sure I'm loading the matches correctly. I apologize for my ignorance, this is all brand new to me.
As requested below, my expected output is ultimately a table with a row for each city, and its subsequent data. Each cell in each row corresponds to a data point.
I have created a JS Fiddle with what I've done, and what the expected output is:
http://jsfiddle.net/vDkQD/2/
Potential Final Edit: With the assistance of Robin and rewt, I have come up with:
http://jsfiddle.net/hMJx3/
Wouldn't a regex like
/([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g
do the trick? Obviously, this is based on the example string you gave, and if there are other patterns possible this should be updated... But if it is that fixed it's not so complicated.
Afterwards you just have to go through the captured groups for each match, and you'll have your data parsed. Live demo for fun: http://regex101.com/r/kF5zD3
Explanation
[^|\d] everything but a pipe or a digit. This is to account for strange city names that [a-zA-Z ] might not catch
[-\d.] a digit, a dot or a hyphen
\([^)]+\) opening parenthesis, everything that isn't a closing parenthesis, closing parenthesis.
Quick incomplete pointers on regex
Here, the regex is the part between the /. The g after it is a flag: thanks to it, the regex won't stop after hitting the first match and will return every match
The match is what the whole expression will find. Here, the match will be everything between two | in your string. The capturing groups are a very useful tool that allows you to extract data from this match: they are delimited by parentheses, which are special characters in regex. (a)b will match ab, the first captured group of this match will be a
[...] means any single character inside will do. [abc] will match a or b or c.
+ is a quantifier, another special character, meaning "one or more of what precedes me". a+ means "one or more a" and will match aaaaa.
\d is a shortcut for [0-9] (yes, - is a special range character inside of [...]. That's why in [-\d.], which is equivalent to [-0-9.], it's directly following the opening bracket)
since parentheses are special characters, when you actually want to match a parenthesis you need to escape it: regex (\(a\))b will match (a)b, the first captured group of this match will be (a) with the parentheses
? means what precedes is optional (zero or one instances)
^ when put at the beginning of a [...] statement means "everything but what's in the brackets". [^a]+ will match bcd-*ù but not aa
If you really know nothing about regex, as I believe they're the right tool for your case, I suggest you take a quick look at a tutorial, just to get a better idea of what you're dealing with. The way to set flags, loop through matches and their respective captured groups will depend on your language and how you call your regex.
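In JavaScript specifically, a single exec() call only returns one match even with the g flag, so looping through the matches might look like the sketch below. The function name extractRows and the property names (city, line, total, odds) are placeholders I made up, not anything from the feed.

```javascript
const title = "Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)";

// Hypothetical helper: one object per side of the pipe.
function extractRows(txt) {
  const exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;
  const rows = [];
  let match;
  // With the g flag, each exec() call continues after the previous match
  // and returns null once there are no more matches.
  while ((match = exp.exec(txt)) !== null) {
    rows.push({
      city: match[1].trim(), // the second city is captured with a leading space
      line: match[2],
      total: match[3],
      odds: match[4],
    });
  }
  return rows;
}

console.log(extractRows(title));
```

Each element of the returned array corresponds to one table row, with the captured groups as its cells.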
[A-Z][a-z]+( [A-Z][a-z]+)* -?[0-9]+\.[0-9] [OU] \(-?[0-9]+\.[0-9]\) -?[0-9]+\.[0-9]
This should match a single part of your long string under the following assumptions:
The city consists only of alpha characters, each word starts with an uppercase character and is at least 2 characters long.
Numbers have an optional sign and exactly one digit after the decimal point
the single character is either O or U
Now it is up to you to:
Properly create capturing parentheses
Check whether my assumptions are right
In order to match the date:
\([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M\)$
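A possible way to put the date expression to use is to strip the date first and then split on the pipe, as the question already does. The leading \s* (to swallow the space before the date) and the variable names are my additions.

```javascript
const rawTitle = "Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)";

// Date pattern from above, with an assumed \s* in front to drop the preceding space.
const datePattern = /\s*\([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M\)$/;

const withoutDate = rawTitle.replace(datePattern, "");
const parts = withoutDate.split(" | ");

console.log(parts);
// [ 'Miami 2.5 O (207.5) 125.0', 'Oklahoma City -2.5 U (207.5) -145.0' ]
```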

Regex for parsing a search field for keywords and tags ([])

I'm trying to implement a search input similar to Stackoverflow's (in node.js/javascript).
Parse tags delimited by brackets
Parse keywords delimited by spaces
However, I don't understand regex at all. I don't even know if regexs are the way to go.
For example:
search field [search][search-query] [search-string]
// keywords: ['search', 'field']
// tags: ['search', 'search-query', 'search-string']
Unfortunately, I find it additionally difficult to search any help on this since searching for regex search tags returns HTML questions
Think you'll need something like this:
/(?:\[([^\]]*)\]|([^\s]+))/g
You can apply it repeatedly (e.g. using the Javascript exec method) and then extract values from the first and second capturing groups to capture tags and keywords respectively.
Try it out here:
http://refiddle.com/85o
To explain:
The outermost () brackets enclose a choice of matching either a tag (enclosed by square brackets []) or a keyword (not enclosed by square brackets). The ?: bit excludes this choice bit from a capturing group since we need to know specifically whether the matched expression is a tag or keyword and so need a separate capturing group for each.
The next bit \[([^\]]*)\] matches a tag: the opening and closing square brackets need to be escaped with a backslash to make them literals. The bit within the square brackets is enclosed in normal brackets () to capture the text within in the first capturing group. The [^...] bit matches anything except what is listed after the caret - so in this case anything except the closing square bracket. This is repeated greedily using the *.
The | separates the choice and then we have the matching expression for a keyword: ([^\s]+). Again this is in brackets to make the results appear in a capturing group. This time we are matching anything except for whitespace one or more times.
Finally the /g is the global modifier so that all occurrences are matched.
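Applying it repeatedly with exec could look roughly like this. The function name parseSearch and the shape of the returned object are just assumptions for the sketch.

```javascript
// Hypothetical helper: splits a search string into bracketed tags and plain keywords.
function parseSearch(input) {
  const pattern = /(?:\[([^\]]*)\]|([^\s]+))/g;
  const tags = [];
  const keywords = [];
  let match;
  while ((match = pattern.exec(input)) !== null) {
    if (match[1] !== undefined) {
      tags.push(match[1]);     // first group: text inside [...]
    } else {
      keywords.push(match[2]); // second group: a run of non-whitespace
    }
  }
  return { tags: tags, keywords: keywords };
}

console.log(parseSearch("search field [search][search-query] [search-string]"));
// { tags: [ 'search', 'search-query', 'search-string' ],
//   keywords: [ 'search', 'field' ] }
```

Only one of the two groups is set for any given match, which is how the loop tells tags and keywords apart.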
The following code retrieves the tags from the string as an array:
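var tags = "search field [search][search-query] [search-string]"
    .match(/\[(.*?)\]/g) // with the g flag, match() returns whole matches: ["[search]", "[search-query]", "[search-string]"]
    .map(function (tag) { return tag.slice(1, -1); }); // strip the surrounding brackets
// tags = ["search", "search-query", "search-string"]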
