Is regex a costly operation? It seems so, at least - javascript

I was writing a regex pattern for a string that has a constant structure. What I mean is, it looks like this (the format itself is not so important, have a look at the example below)-
[Data Code Format]: POP N/N/N (N: object): JSON object data
Here N represents a number or digit, and what's inside [ ] is a block of strings. The overall format is constant.
So, I wrote a regex-
\s*((?:\S+\s+){1}\S+)\s*(?:\((?:.*?)\))?:\s*(\S*)\s*(\w+)
Keeping this string example in mind-
%DATA-4-JSON: POP 0/1/0 (1: object): JSON object data
It works perfectly, but regex101.com shows that the successful match took 330 steps to achieve.
My question is: it's taking 330 steps to achieve this (which feels pretty heavy to me, at least), and I guess the same could be achieved with if-else and other comparisons in fewer steps, right?
I mean, is regex string parsing so heavy? 330 steps for each of the tens of thousands of strings I need to parse is going to be heavy, right?

When you are using regexps, they can be costly if they backtrack a lot. When you use quantifiers with consecutive patterns that may match one another (and especially when the patterns between them may match empty strings, usually declared with the *, {0,x} or ? quantifiers), backtracking might play a bad trick on you.
What is bad about your regex?
A slight performance issue is caused by an unnecessary non-capturing group (?:\S+\s+){1}. It can be written as \S+\s+ and that alone will already decrease the number of steps (a bit, to 302 steps). However, the worst part is that \S matches the : delimiter. Thus, the regex engine has to try a lot of possible routes to match the beginning of your expected match. Replace that \S+\s+\S+ with [^:\s]+\s+[^:\s]+, and the number of steps will decrease to 159!
Now, coming to (?:\((?:.*?)\))? - removing the unnecessary inner non-capturing group gives another slight improvement (148 steps), and replacing .*? with a negated character class gives another boost (139 steps).
I'd use this for now:
\s*([^:\s]+\s+[^:\s]+)\s*(?:\([^()]*\))?:\s*(\S*)\s*(\w+)
If the initial \s* was obligatory, it would be another enhancement, but you would have different matches.
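As a quick sanity check, the sketch below runs the original pattern and the rewritten one against the sample line from the question and compares the captured groups. The helper and variable names are illustrative only, and it only shows that the captures agree on this one input; step counts depend on the engine and are not measured here.

```javascript
// Sample line taken from the question.
const line = "%DATA-4-JSON: POP 0/1/0 (1: object): JSON object data";

// Original pattern from the question.
const original = /\s*((?:\S+\s+){1}\S+)\s*(?:\((?:.*?)\))?:\s*(\S*)\s*(\w+)/;

// Rewritten pattern from the answer: negated character classes keep the
// engine from running past the ":" delimiter and the closing parenthesis.
const rewritten = /\s*([^:\s]+\s+[^:\s]+)\s*(?:\([^()]*\))?:\s*(\S*)\s*(\w+)/;

// Hypothetical helper: drop element 0 (the full match), keep the capture groups.
const groupsOf = (re, text) => {
  const m = re.exec(text);
  return m ? m.slice(1) : null;
};

const originalGroups = groupsOf(original, line);
const rewrittenGroups = groupsOf(rewritten, line);

console.log(originalGroups);  // → [ 'POP 0/1/0', 'JSON', 'object' ]
console.log(rewrittenGroups); // → [ 'POP 0/1/0', 'JSON', 'object' ]
```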

Regex issue (long string)

I have the RegExp /^([0-9]*\.?[0-9])*$/ to test strings.
My first string is 1.2.840.346991791506342.1482500253171661 (long string) and my second is 1.2.3.201922311129.10038 (short string).
Both strings are matched successfully, so both are reported as OK.
When I add a space at the end of the second (short) string, it is reported as invalid, which is the right conclusion.
But when I add a space at the end of the first string, it should also be reported as invalid; instead the test hangs. Why does it hang?
Is the RegExp limit exhausted? What would be the solution?
For testing purposes you can check this in Notepad++ by using ^([0-9]*\.?[0-9])*$ directly.
The way you have written your regex, the nested quantifiers lead to catastrophic backtracking, which makes it hang or time out.
Catastrophic Backtracking Demo
You need to simplify your regex to something like this:
^[0-9]*(?:\.[0-9]+)*$
Let me know if this regex preserves your pattern.
Regex Demo not running into timeout
You should in general avoid over-nesting quantifiers in your regex, and rather try to write them as simply as you can. Even for a short string like 1.2.840.3469931313.313, see how many steps your regex takes,
135228 steps taken
and if you increase the string length a little bit, it runs into a timeout/catastrophic backtracking.
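A minimal sketch of the simplified pattern applied to the two strings from the question, with and without a trailing space. The variable names are made up for illustration. The original nested pattern is deliberately not run here, since it is the one that hangs.

```javascript
// Simplified pattern from the answer: each character can only be consumed
// in one way, so a failing input is rejected quickly.
const dotted = /^[0-9]*(?:\.[0-9]+)*$/;

const longId = "1.2.840.346991791506342.1482500253171661";
const shortId = "1.2.3.201922311129.10038";

// Test both strings as-is and with a trailing space appended.
const results = [longId, shortId, longId + " ", shortId + " "].map((s) =>
  dotted.test(s)
);

console.log(results); // → [ true, true, false, false ]
```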

is there an algorithm that indexes spaces instead of characters?

I'm new to this and was wondering, after reading up on indexing and character counts: wouldn't it be more useful to index spaces instead of characters to improve the matching of words?
Looking at the example below, it selects/counts the white spaces at the end of every word. But I want it to count or recognize the space at the end of a word and the beginning of the following word, essentially noticing/collating white space characters. Does that make any sense?
var str = 'This is a string',
index = 0,
res = [];
while ((index = str.indexOf(' ', index + 1)) > 0) {
res.push(index);
}
console.log(res)
Short answer: No, but you can split it into an array of strings by spaces with String.prototype.split().
Long answer:
You could think about this in a few ways... and this applies to more languages than just JS:
Implementation
Strings are more akin to an array of characters. This is easier to visualize if you think about how a string would be recognized by a computer: as a number (wait what?).
As I'm sure you know, computers can really only see 1's and 0's, so some smart people came up with a way to define which character is represented by which number... and then defined some special things for when certain characters come in a certain order, but that's a whole other can of worms. These definitions are what we call charsets. Note that the number of bits available per character is defined by the encoding, e.g. UTF-8 or UTF-16.
So returning to the original question "Why index by characters and not spaces?": not just because we're lazy and that was easy; it's actually pretty convenient, for reasons I'm about to elaborate on.
Mathematics
Let's be honest with ourselves, this is Computer Science, so we should probably back up our reasons with math... which is where Formal Languages come in.
A formal language L over an alphabet Σ is a subset of Σ*, that is, a set of words over that alphabet.
A formal grammar is a set of production rules for strings in a formal language.
A word over an alphabet can be any finite sequence (i.e., string) of letters.
In mathematics a sequence is an enumerated collection of objects in which repetitions are allowed.
Note: Be aware that this post is in no way a proper description of formal languages, just a collection of generalized descriptions taken from Wikipedia.
Which leads me to grammars, your case of a string being indexed by a " " is really more of something that is defined as a grammar and is really a concept derived from those nonsensical human languages.
So what this means is if you boil down what a string really is and what defines it, you can see that it is defined all the way down on the level of mathematics.
Practicality
But wait, I am human why does that still apply?
Well think about it this way, a string can hold more than just a sentence right? Take JSON for example "{\"key1\":\"hello world\",\"key2\":0}" would it make sense to have this string indexed by spaces? There's also the issue that substrings are a lot more tricky because we can't reference the individual characters anymore, so even iterating over a string would become a complicated task.
So why not make another data type?
Honestly this is just a reiteration of before: is it really necessary? Is it common enough of a problem to warrant an entirely new datatype when the key difference is more or less splitting the string and not allowing the programmer to look up the characters individually?
An Algorithm
Well, this is probably the easiest part of the answer... now that we have a general understanding of what a string is. As with all programming, there are a ton of ways to go about it. I'm specifically sticking with JS functions, because that was tagged in the question.
As @Redu mentioned, there's the String.prototype.split() method (which is the easiest way I know of). It allows you to split a string based on another string or on a regular expression (these are also defined in formal languages; most programming languages that have them offer much more "featured" regular expressions, and they can also be used to describe some grammars). So to split a string on a ' ' we can go one of three ways (2 are regular expression approaches).
console.log("Hello World!".split(" ")); // Split on all instances of a single space
console.log("Foo bar".split(/ /g)); // RegEx split on all single spaces
console.log("A   RegEx".split(/\s+/g)); // RegEx split on one or more whitespaces (not necessarily just a space)
console.log("A   RegEx".split(" ")); // For comparison (multiple spaces)
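If what you are after is where each run of whitespace begins and ends (the end of one word and the start of the next), one possible sketch is to collect those positions with String.prototype.matchAll(). The sample string and the names gaps, start and end are assumptions made for this example, not something from the question.

```javascript
// Sample string with a double space between "is" and "a".
const str = "This is  a string";

// For every run of whitespace, record where it starts (end of the previous
// word) and where it ends (start of the next word).
const gaps = [...str.matchAll(/\s+/g)].map((m) => ({
  start: m.index,
  end: m.index + m[0].length,
}));

console.log(gaps);
// → [ { start: 4, end: 5 }, { start: 7, end: 9 }, { start: 10, end: 11 } ]
```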
TL;DR
There are quite a few reasons (blame math), but essentially, it really doesn't make sense to index by spaces. Why not split the string instead?

Complex string parsing in Javascript

I am attempting to parse a complex string in JavaScript, and I'm pretty horrible with Regular Expressions, so I haven't had much luck. The data is loaded into a variable formatted as follows:
Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)
I am trying to parse that string following these parameters:
1) Each value must be loaded into its own variable (i.e. separate variables for Miami, 2.5 O, (207.5) etc.)
2) String must split at pipe character (I have this working with .split(" | ") )
3) I am dealing with city names that include spaces
4) The date at the end must be isolated and removed
I have a feeling regular expressions must be used, but I'm seriously hoping there is a different way to approach this. The example provided is just that, an example from a much larger data set. I can provide the full data set if requested.
More direct version of my question: Given the data above, what concepts / procedures can I use to intelligently parse the string elements into their own variables?
If RegEx must be used, will I need multiple expressions?
Thanks in advance for your help!
EDIT: In an effort to supply multiple pathways to a solution I'll explain the overarching problem as well. This data is the return of a RSS / XML item. The string mentioned above is sports odds, and is all contained in the title node of the feed I'm using. If anyone has a better XML / RSS feed for sports odds, I would be ecstatic for that as well.
EDIT 2: Thanks to the replies, I can run a RegEx that matches the data points needed. I'm now having trouble iterating through the matches and returning them correctly. I have the RegEx loaded into its own function:
function regExExtract (txt){
var exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;
var comp_arr = exp.exec(txt);
return comp_arr;
}
And it is being called with:
var title_arr = regExExtract(title);
Title is loaded with the data string listed above. I assume I'm using the global flag correctly to ensure all matches are considered, but I'm not sure I'm loading the matches correctly. I apologize for my ignorance, this is all brand new to me.
As requested below, my expected output is ultimately a table with a row for each city, and its subsequent data. Each cell in each row corresponds to a data point.
I have created a JS Fiddle with what I've done, and what the expected output is:
http://jsfiddle.net/vDkQD/2/
Potential Final Edit: With the assistance of Robin and rewt, I have come up with:
http://jsfiddle.net/hMJx3/
Wouldn't a regex like
/([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g
do the trick? Obviously, this is based on the example string you gave, and if there are other patterns possible this should be updated... But if it is that fixed it's not so complicated.
Afterwards you just have to go through the captured groups for each match, and you'll have your data parsed. Live demo for fun: http://regex101.com/r/kF5zD3
Explanation
[^|\d] everything but a pipe or a digit. This is to account for unusual city names that [a-zA-Z ] might not catch
[-\d.] a digit, a dot or a hyphen
\([^)]+\) opening parenthesis, everything that isn't a closing parenthesis, closing parenthesis.
Quick incomplete pointers on regex
Here, the regex is the part between the / characters. The g after it is a flag; thanks to it the regex won't stop after hitting the first match and will return every match
The match is what the whole expression will find. Here, the match will be everything between two | in your string. The capturing groups are a very useful tool that allows you to extract data from this match: they are delimited by parentheses, which are special characters in regex. (a)b will match ab, and the first captured group of this match will be a
[...] means any one character inside will do. [abc] will match a or b or c.
+ is a quantifier, another special character, meaning "one or more of what precedes me". a+ means "one or more a" and will match aaaaa.
\d is a shortcut for [0-9] (yes, - is a special range character inside of [...]. That's why in [-\d.], which is equivalent to [-0-9.], it directly follows the opening bracket)
since parentheses are special characters, when you actually want to match a parenthesis you need to escape it: regex (\(a\))b will match (a)b, and the first captured group of this match will be (a) with the parentheses
? means what precedes is optional (zero or one instance)
^ when put at the beginning of a [...] statement means "everything but what's in the brackets". [^a]+ will match bcd-*ù but not aa
If you really know nothing about regex, as I believe they're the right tool for your case, I suggest you take a quick look at a tutorial, just to get a better idea of what you're dealing with. The way to set flags, loop through matches and their respective captured groups will depend on your language and how you call your regex.
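In JavaScript specifically, one way to go through every match and its captured groups is String.prototype.matchAll(), which requires the g flag. The sketch below applies the regex above to the sample string from the question. The property names (city, line, total, price, date) are hypothetical labels chosen for this example, since the question does not name the fields.

```javascript
// Sample title string from the question.
const title =
  "Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)";

// Regex from the answer; the g flag lets matchAll() walk over every match.
const exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;

const rows = [];
for (const m of title.matchAll(exp)) {
  rows.push({
    city: m[1].trim(), // group 1 can pick up the space that follows the pipe
    line: m[2],
    total: m[3],
    price: m[4],
    date: m[5], // undefined when the optional date group did not match
  });
}

console.log(rows);
```

A single call to exec() returns only one match, even with the g flag, so the alternative to matchAll() is calling exec() in a loop until it returns null.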
[A-Z][a-z]+( [A-Z][a-z]+)* -?[0-9]+\.[0-9] [OU] \(-?[0-9]+\.[0-9]\) -?[0-9]+\.[0-9]
This should match a single part of your long string under the following assumptions:
The city consists only of alpha characters, each word starts with an uppercase character and is at least 2 characters long.
Numbers have an optional sign and exactly one digit after the decimal point
the single character is either O or U
Now it is up to you to:
Properly create capturing parentheses
Check whether my assumptions are right
In order to match the date:
\([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M\)$
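A rough sketch of how the two patterns above could be combined: strip the date first, split on the pipe, then match each part. The capturing parentheses, the ^ and $ anchors and all variable names are additions made for this example, and the same assumptions about city names and number formats apply.

```javascript
const title =
  "Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)";

// Date pattern from above, anchored at the end of the string.
const dateRe = /\([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M\)$/;

// Part pattern from above, with capturing groups and anchors added.
const partRe =
  /^([A-Z][a-z]+(?: [A-Z][a-z]+)*) (-?[0-9]+\.[0-9]) ([OU]) \((-?[0-9]+\.[0-9])\) (-?[0-9]+\.[0-9])$/;

// Remove the trailing date, then split the remainder at the pipe.
const parts = title.replace(dateRe, "").trim().split(" | ");

// Keep only the captured groups of each part (null if a part does not match).
const parsed = parts.map((part) => {
  const m = partRe.exec(part);
  return m ? m.slice(1) : null;
});

console.log(parsed);
// → [ [ 'Miami', '2.5', 'O', '207.5', '125.0' ],
//     [ 'Oklahoma City', '-2.5', 'U', '207.5', '-145.0' ] ]
```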

How to invert an existing regular expression in javascript?

I have created a regex to validate time as follows : ([01]?\d|2[0-3]):[0-5]\d.
Matches TRUE : 08:00, 09:00, 9:00, 13:00, 23:59.
Matches FALSE : 10.00, 24:00, 25:30, 23:62, afdasdasd, ten.
QUESTION
How to invert a javascript regular expression to validate if NOT time?
NOTE - I have seen several ways to do this on stack but cannot seem to make them work for my expression because I do not understand how the invert expression should work.
http://regexr.com?38ai1
ANSWER
Simplest solution was to invert the javascript statement and NOT the regex itself.
if (!(/^(([01]?\d|2[0-3]):[0-5]\d)/.test(obj.value)))
Simply adding ! to create an if NOT statement.
A regular expression is usually used for capturing some specific condition(s) - the more specific, the better the regex. What you're looking for is an extremely broad condition to match because just about everything wouldn't be considered "time" (a whitespace, a special character, an alphabet character, etc etc etc).
As suggested in the comments, for what you're trying to achieve, it makes much more sense to look for a time and then check (and negate) the result of that regular expression.
As I mentioned in the comment, the better way is to negate the test rather than create a new regexp that matches any non-time.
However, if you really need the regexp, you could use negative lookahead to match the start of something that is not a time:
/^(?!([01]?\d|2[0-3]):[0-5]\d$)/
DEMO: http://regex101.com/r/bD3aG4
Note that i anchored the regexp (^ and $), which might not work with what you need it for.
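To compare the two approaches side by side, the sketch below checks a handful of the sample values from the question with both the negated test and the negative-lookahead regexp. Both patterns are fully anchored here, which is an assumption about how the validation should behave; the variable names are illustrative.

```javascript
// Time pattern from the question, anchored at both ends.
const timeRe = /^(?:[01]?\d|2[0-3]):[0-5]\d$/;

// Negative-lookahead variant: matches (with an empty match) only when the
// whole string is NOT a time.
const notTimeRe = /^(?!(?:[01]?\d|2[0-3]):[0-5]\d$)/;

const samples = ["08:00", "9:00", "23:59", "10.00", "24:00", "25:30", "23:62", "ten"];

// Approach 1: negate the result of the ordinary test.
const viaNegation = samples.map((s) => !timeRe.test(s));

// Approach 2: test against the inverted regexp.
const viaLookahead = samples.map((s) => notTimeRe.test(s));

console.log(viaNegation);  // → [ false, false, false, true, true, true, true, true ]
console.log(viaLookahead); // → [ false, false, false, true, true, true, true, true ]
```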

Match altered version of first match with only one expression?

I'm writing a brush for Alex Gorbatchev's Syntax Highlighter to get highlighting for Smalltalk code. Now, consider the following Smalltalk code:
aCollection do: [ :each | each shout ]
I want to find the block argument ":each" and then match "each" every time it occurs afterwards (for simplicity, let's say every occurrence and not just inside the brackets).
Note that the argument can have any name, e.g. ":myArg".
My attempt to match ":each":
\:([\d\w]+)
This seems to work. The problem is for me to match the occurrences of "each". I thought something like this could work:
\:([\d\w]+)|\1
But the right hand side of the alternation seems to be treated as an independent expression, so backreferencing doesn't work.
Is it even possible to accomplish what I want in a single expression? Or would I have to use the backreference within a second expression (via another function call)?
You could do it in languages that support variable-length lookbehind (AFAIK only the .NET framework languages do, Perl 6 might). There you could highlight a word if it matches (?<=:(\w+)\b.*)\1. But JavaScript didn't support lookbehind at all when this was written (it only arrived later, with ES2018).
But anyway this regex would be very inefficient (I just checked a simple example in RegexBuddy, and the regex engine needs over 60 steps for nearly every character in the document to decide between match and non-match), so this is not a good idea if you want to use it for code highlighting.
I'd recommend you use the two-step approach you mentioned: First match :(\w+)\b (word boundary inserted for safety, \d is implied in \w), then do a literal search for match result \1.
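The two-step approach could look roughly like this in plain JavaScript. The variable names are illustrative, the second step uses a word-boundary regexp instead of a purely literal search (an assumption, made so that "each" inside a longer identifier is not matched), and the sketch is independent of how a Syntax Highlighter brush would actually wire it in.

```javascript
const code = "aCollection do: [ :each | each shout ]";

// Step 1: find the block argument declaration (the sample is known to contain one).
const decl = /:(\w+)\b/.exec(code);
const name = decl[1];

// Step 2: search for later occurrences of that name as a whole word.
// \w+ can only capture letters, digits and underscores, so the name
// needs no escaping before it is placed into a new RegExp.
const usage = new RegExp("\\b" + name + "\\b", "g");
usage.lastIndex = decl.index + decl[0].length; // start right after ":each"

const positions = [];
let m;
while ((m = usage.exec(code)) !== null) {
  positions.push(m.index);
}

console.log(name);      // → each
console.log(positions); // → [ 26 ]
```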
I believe the only thing stored by the Regex engine between matches is the position of the last match. Therefore, when looking for the next match, you cannot use a backreference to the match before.
So, no, I do not think that this is possible.
