RegEx for an autocomplete feature - javascript

I am writing a search bar with an autocomplete feature that is hooked up to an endpoint. I am using regex to determine the "context" that I am in inside of the query I type in the search bar. The three contexts are "attribute," "value," and "operator." The two operators that are allowed are "AND" and "OR." Below is an example of an example query.
Color: Blue AND Size: "Women's Large" (<-- multi-word values or attribute names are surrounded by quotation marks)
I need my regex to match after you put a space after Blue, and if the user begins type "A/AN/AND/O/OR", I need it to match. Once they have put a space after the operator, I need it to stop matching.
This is the expression I have come up with.
const contextIsOperator = /[\w\d\s"]+: *[\w\s\d"]+ [\w]*$/
It matches once I put a space after "Blue," but matches for everything I put after that. If I replace the last * in the expression with a +, it works when I put a space after "Blue" and start manually typing one of the operators, but not if I just have a space after "Blue."
The pattern I have in my head written in words is:
group of one or more characters/digits/spaces/quotation marks
followed by a colon
followed by an optional space
followed by another group of one or more characters/digits/space/quotation marks
followed by a space (after the value)
followed by one or more characters (this is the operator)
How do I solve this problem?

Change [\w]* to something that just matches AND, OR, or one of their prefixes. Then you can make it optional with ?
[\w\s"]+: *[\w\s"]+ (A|AN|AND|O|OR)?$
DEMO
Note that Size: Women's Large won't match this because the apostrophe isn't in \w; that only matches letters, digits, and underscore. You'll need to add any other punctuation characters that you want to allow in these fields to the character set.

Is as, your language is not deterministic enough to be properly modeled with a regex. That being said, there are 2 approaches you can take:
Require all values (the stuff after a : and before an operator) to be enclosed in quotes
Build a simple state machine that can parse the data more intelligently. (Google Finite State Machine Parser)
If you choose to use the first method, you can use the following regex:
^(("?[\w\s]+"?): ?("[\w\s']+")( (AND|OR) )?)+$
I would explain the different components, but regex101 already does for me with really good visuals and detail.

Edit: this is the final one, check the unit tests here
const regex = /((("[\w\s"'']+(?="\b))"|[\w"'']+):\s?(("[\w\s"'']+(?="\b))"|[\w"'']+)\s(AND|OR)(?=\b\s))+/
That monstrosity should match (NOTE: QUOTED KEYS/VALUES MUST BE DOUBLE QUOTED):
Color: Blue AND "Size5":"Women's Large"
"weird KEy":regularvalue OR otherKey: "quoted value"

Here you go, try this out
^(?:"[^"]*"|[^\s:]+):[ ](?:"[^"]*"|[^\s:]+)[ ](?:A(?:N(?:D(?:[ ](*SKIP)(?!))?)?)?|O(?:R(?:[ ](*SKIP)(?!))?)?)?
https://regex101.com/r/neUQ0g/1
Explained
^ # BOS
(?: # Attribute
"
[^"]*
"
|
[^\s:]+
)
:
[ ]
(?: # Value
"
[^"]*
"
|
[^\s:]+
)
[ ] # Start matching after Attribute: Value + space
(?: # Operator
A
(?:
N
(?:
D
(?: # Stop matching after 'AND '
[ ]
(*SKIP)
(?!)
)?
)?
)?
|
O
(?:
R
(?: # Stop matching after 'OR '
[ ]
(*SKIP)
(?!)
)?
)?
)?

Related

Regular Expression to match text between # and only if # is not preceded by '

Hello I'm trying to find a regular expression that can help me find all matches inside a string when they're inside # and only if # are not preceded by an apostrophe "'".
Basically I need to bold the text just as here when we use double * to bold text like this, but the apostrophe should work as an escape character.
For example
#Hello my name is Noé# should look like Hello my name is Noé
#Hello this has an escape apostrophe '# so I'll match until here# should look like Hello this has an escape apostrophe '# so I'll match until here
Inside a long text there might or might not be several matches:
"Hello I'm a text #I'm bold#, and I need to know how to match my text that's inside two '#, and #I will not match either 'cause I got no end"
So i can print it like
"Hello I'm a text I'm bold, and I need to know how to match my text that's inside two '#, and #I will not match either 'cause I got no end"
If thats not possible with a RegExp I could program a finite state machine, but I was hoping I was possible, thank you in advance God bless you!
Note: I will handle the escape characters later by now I just need to know how to mach this
/(?<!')#.*(?<!')#/gim
This was the only thing I could come up with, but honestly, I have no idea how negative look behind works :(, with this regexp it would match wrong. For example, if I type:
"I'm a text #and I should be a match# and this should not #But this should as well# and I'm just some random extra text"
matches from the first # occurrence until the last one, like so:
"I'm a text #and I should be a match# and this should not #But this should as well# and I'm just some random extra text"
I think this should work:
(?<!')#(.*?)(?<!')#
Here you can see the regexp working with your examples: https://regex101.com/r/wnguiA/1
(?<!') is Negative Lookbehind, it tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a b that is not preceded by an a.
More easy is the (.*?) that matches any character (except for line terminators); adding ? tells the capturing group to be not-greedy and stop at the first occourence of the succesive token.
To prevent triggering the negatilve lookbehind at all the positions not asserting a ' to the left, you can also first match # and do the assertion after it.
#(?<!'#)(.*?)#(?<!'#)
Regex demo
Another option instead of using the non greedy .*? is to use a negated character class matching any char except #
Then when you encounter # only match it if there is ' before it using a positive lookbehind.
#(?<!'#)([^#\n]*(?:#(?<='#)[^#\n]*)*)#(?<!'#)
#(?<!'#) Match # not directly preceded by '
( Capture group 1
[^#\n]* Optionally match any char except # or a newline
(?: Non capture group
#(?<='#) Match # not directly preceded by '
[^#\n]* Match optional repetitions of any char except # or a newline
)* Close non capture group and optionally repeat it to match all occurrences
) Close group 1
#(?<!'#) Match # not directly preceded by '
Regex demo

Regular expression to match line separated size strings

I am writing a reular expression to validate input string, which is a line separated list of sizes ([width]x[height]).
Valid input example:
300x200
50x80
100x100
The regular expression I initially came up with is (https://regex101.com/r/H9JDjA/1):
^(\d+x\d+[\r\n|\r|\n]*)+$
This regular expression matches my input but also matches this invalid input (size can't be 100x100x200):
300x200
50x80
100x100x200
Adding a word boundary at the end seems to have fixed this issue:
^(\d+x\d+[\r\n|\r|\n]*\b)+$
My questions:
Why does the initial regular expression without the word boundary fail? It looks like I am matching one or more instances of a \d+(number), followed by character 'x', followed by a \d+(number), followed by one or more new lines from various operating systems.
How to validate input having multiple training new line characters in this input? The following doesn't work for some kind of inputs like this:
500x500\n100x100\n\n\n384384
^(\d+x\d+[\r\n|\r|\n]\b)+|[\r\n|\r|\n]$
Isolate the problem with this target 100x100x200
For now, forget about the anchors in the regex.
The minimum regex is \d+x\d+ since it only has to be satisfied once
for a match to take place.
The maximum is something like this \d+x\d+ (?: (?:\r?\n | \r)* \d+x\d+ )*
Since \r?\n|\r is optional, it can be reduced to this \d+x\d+ (?: \d+x\d+ )*
The result, when you applied to the target string is:
100x100x200 matches.
But, since you've anchored the regex ^$, it is forced to break up
the middle 100 to make it match.
100x10 from \d+x\d+
0x200 from (?: \d+x\d+ )*
So, that is why the first regex seemingly matches 100x100x200.
To avoid all of that, just require a line break between them, and
make the trailing linebreaks optional (if you need to validate the whole
string, otherwise leave it and the end anchor off).
^\d+x\d+(?:(?:\r?\n|\r)+\d+x\d+)*(?:\r?\n|\r)*$
A better view of it
^
\d+ x \d+
(?:
(?: \r? \n | \r )+
\d+ x \d+
)*
(?: \r? \n | \r )*
$
Your initial regular expression "fails" because of the +:
^(\d+x\d+[\r\n|\r|\n]*)+$
-----------------------^ here
Your parenthesis pattern (\d+x\d+[\r\n|\r|\n]*) says match one or more number followed by an "x" followed by one or more number followed by zero or more newlines. The + after that says match one or more of the entire parenthesis pattern, which means that for an input like 100x200x300 your pattern matches 100x200 and then 200x300, so it looks like it matches the entire line.
If you're simply trying to extract dimensions from a newline-separated string, I would use the following regular expression with a multiline flag:
^(\d+x\d+)$
https://regex101.com/r/H9JDjA/2
Side note: In your expression, [\r\n|\r|\n] is actually saying match any one instance of \r, \n, |, \r, |, or \n (i.e. it's quite redundant, and you probably aren't meaning to match |). If you want to match a sequential set of any combination of \r or \n, you can simply use [\r\n]+.
You can use multiline modifier, which should make life easier:
var input = "\n\
300x200x400\n\
50x80\n\
\n\
\n\
300x200\n\
50x80\n\
100x100x200x100\n";
var allSizes = input.match(/^\d+x\d+/gm); // multiline modifier assumes each line has start and end
for (var size in allSizes)
console.log(allSizes[size]);
Prints:
300x200
50x80
300x200
50x80
100x100
Try this regex out
^[0-9]{1,4}x[0-9]{1,4}|[(\r\n|\r|\n)]+$
It'll match these inputs.
1x1
10x10
100x100
2000x2938
\n
\r
\r\n
but not this 100x100x200

Excluding URLS that contain a string in Regex

I am using regex to add a survey to pages and I want to include it on all pages except payment and signin pages. I can't use look arounds for the regex so I am attempting to use the following but it isn't working.
^/.*[^(credit|signin)].*
Which should capture all urls except those containing credit or signin
[ indicates the start of a character class and the [^ negation is per-character. Thus your regular expression is "anything followed by any character not in this class followed by anything," which is very likely to match anything.
Since you are using specific strings, I don't think a regular expression is appropriate here. It would be a lot simpler to check that credit and signin don't exist in the string, such as with JavaScript:
-1 === string.indexOf("credit") && -1 === string.indexOf("signin")
Or you could check that a regular expression does not match
false === /credit|signin/.test(string)
Whitelisting words in regex is generally pretty easy, and usually follows a form of:
^.*(?:option1|option2).*$
The pattern breaks down to:
^ - start of string
.* - 0 or more non-newline characters*
(?: - open non-capturing group
option1|option2 - | separated list of options to whitelist
) - close non-capturing group
.* - 0 or more non-newline characters
$ - end of string
Blacklisting words in a regex is a bit more complicated to understand, but can be done with a pattern along the lines of:
^(?:(?!option1|option2).)*$
The pattern breaks down to:
^ - start of string
(?: - open non-capturing group
(?! - open negative lookahead (the next characters in the string must not match the value contained in the negative lookahead)
option1|option2 - | separated list of options to blacklist
) - close negative lookahead
. - a single non-newline character*
) - close non-capturing group
* - repeat the group 0 or more times
$ - end of string
Basically this pattern checks that the values in the blacklist do not occur at any point in the string.
* exact characters vary depending on the language, so use caution
The final version:
/^(?:(?!credit|signin).)*$/

JavaScript regular expression to remember last bracket type encountered

In JavaScript want to be able to match text that is:
(surrounded by parentheses)
[surrounded by square brackets]
not surrounded by either type of bracket
In the following expression...
none[square](round)(accept]able)[wrong).text
... there should be 4 matches, for none, [square], (round) and (accept]able). However [wrong) should not match because there is no closing ] to be found.
In my best attempt so far...
([([])[A-Za-z]+[\])]|[^\[()\]]+
... (accept], able and [wrong) are incorrectly matched, while (accept]able) as a whole is not matched. I'm not too concerned about (accept]able); I would prefer no match at all to a match with imbalanced brackets.
I am guessing that I need to replace the [\])] expression with one that checks the value of the initial matching group, and uses ) if the first match was ( or ] if the first match was [.
I have tried working with conditional expressions. These seem to work well in PCRE and Python, but not in JavaScript.
Is this a problem that can be solved in a JavaScript regular expression on its own, or will I have to handle this piecemeal in a bulky JavaScript function?
A way to do that consists to match the two cases (acceptable and non-acceptable) and to separate the results in two different capture groups. So whatever you need to do with the results you only have to test which group succeeds:
/(\[[^\]]*\]|\([^)]*\)|[a-z]+)|([\[(][\s\S]*?(?:[\])]|$))/gi
pattern details:
( # acceptable capture group
\[ [^\]]* \]
|
\( [^)]* \)
|
[a-z]+
)
|
( # non-acceptable capture group
[\[(] [\s\S]*? (?: [\])] | $ ) # unclosed parens
)
This pattern doesn't care if a square bracket is enclosed between round brackets and vice-versa, but you can easily be more constrictive with this pattern that forbids any other brackets between brackets (square or round):
( # acceptable capture group
\[ [^()\[\]]* \]
|
\( [^()\[\]]* \)
|
[a-z]+
)
|
( # non-acceptable capture group
[\[(] [\s\S]*? (?: [\])] | $ ) # unclosed parens
)
Note about these two patterns: You can choose the default behavior when a unclosed bracket is found. The two patterns are designed to stop the non-acceptable part at the first closing bracket or if not found at the end of the string, but you can change this behavior and choose that an unclosing bracket stops always at the end of the string like this: [\[(][\s\S]*$
I'm not quite sure if I get all of the possible strings, but maybe this does the trick?
/\[([A-Za-z]*)\]|\(([\]A-Za-z]*)\)/gm
You can use the following :
/^(\[[^\[]+?\]|\([^\(]+?\)|[^\[\(]+)$/gm
See DEMO
This will do it for you:
\((\w*\s*)\)|\[(\w*)\]|\((\w*\s*|\])*\)|\((\w*\s*|\[)*\)|\[(\w*\s*|\()*\]|\[(\w*\s*|\))*\]|^\b\w*\s*\b
Demo here:
https://regex101.com/r/mV6gD2/2

Javascript multiple regex pattern

I'm trying to exclude some internal IP addresses and some internal IP address formats from viewing certain logos and links in the site.I have multiple range of IP addresses(sample given below). Is it possible to write a regex that could match all the IP addresses in the list below using javascript?
10.X.X.X
12.122.X.X
12.211.X.X
64.X.X.X
64.23.X.X
74.23.211.92
and 10 more
Quote the periods, replace the X's with \d+, and join them all together with pipes:
const allowedIPpatterns = [
"10.X.X.X",
"12.122.X.X",
"12.211.X.X",
"64.X.X.X",
"64.23.X.X",
"74.23.211.92" //, etc.
];
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
')$';
const allowedRegexp = new RegExp(allowedRegexStr);
Then you're all set:
'10.1.2.3'.match(allowedRegexp) // => ['10.1.2.3']
'100.1.2.3'.match(allowedRegexp) // => null
How it works:
First, we have to turn the individual IP patterns into regular expressions matching their intent. One regular expression for "all IPs of the form '12.122.X.X'" is this:
^12\.122\.\d+\.\d+$
^ means the match has to start at the beginning of the string; otherwise, 112.122.X.X IPs would also match.
12 etc: digits match themselves
\.: a period in a regex matches any character at all; we want literal periods, so we put a backslash in front.
\d: shorthand for [0-9]; matches any digit.
+: means "1 or more" - 1 or more digits, in this case.
$: similarly to ^, this means the match has to end at the end of the string.
So, we turn the IP patterns into regexes like that. For an individual pattern you could use code like this:
const regexStr = `^` + ipXpattern.
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
`$`;
Which just replaces all .s with \. and Xs with \d+ and sticks the ^ and $ on the ends.
(Note the doubled backslashes; both string parsing and regex parsing use backslashes, so wherever we want a literal one to make it past the string parser to the regular expression parser, we have to double it.)
In a regular expression, the alternation this|that matches anything that matches either this or that. So we can check for a match against all the IP's at once if we to turn the list into a single regex of the form re1|re2|re3|...|relast.
Then we can do some refactoring to make the regex matcher's job easier; in this case, since all the regexes are going to have ^...$, we can move those constraints out of the individual regexes and put them on the whole thing: ^(10\.\d+\.\d+\.\d+|12\.122\.\d+\.\d+|...)$. The parentheses keep the ^ from being only part of the first pattern and $ from being only part of the last. But since plain parentheses capture as well as group, and we don't need to capture anything, I replaced them with the non-grouping version (?:..).
And in this case we can do the global search-and-replace once on the giant string instead of individually on each pattern. So the result is the code above:
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
')$';
That's still just a string; we have to turn it into an actual RegExp object to do the matching:
const allowedRegexp = new RegExp(allowedRegexStr);
As written, this doesn't filter out illegal IPs - for instance, 10.1234.5678.9012 would match the first pattern. If you want to limit the individual byte values to the decimal range 0-255, you can use a more complicated regex than \d+, like this:
(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])
That matches "any one or two digits, or '1' followed by any two digits, or '2' followed by any of '0' through '4' followed by any digit, or '25' followed by any of '0' through '5'". Replacing the \d with that turns the full string-munging expression into this:
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '(?:\\d{1,2}|1\\d{2}|2[0-4]\\d|25[0-5])') +
')$';
And makes the actual regex look much more unwieldy:
^(?:10\.(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5]).(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])|12\.122\....
but you don't have to look at it, just match against it. :)
You could do it in regex, but it's not going to be pretty, especially since JavaScript doesn't even support verbose regexes, which means that it has to be one humongous line of regex without any comments. Furthermore, regexes are ill-suited for matching ranges of numbers. I suspect that there are better tools for dealing with this.
Well, OK, here goes (for the samples you provided):
var myregexp = /\b(?:74\.23\.211\.92|(?:12\.(?:122|211)|64\.23)\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])|(?:10|64)\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]))\b/g;
As a verbose ("readable") regex:
\b # start of number
(?: # Either match...
74\.23\.211\.92 # an explicit address
| # or
(?: # an address that starts with
12\.(?:122|211) # 12.122 or 12.211
| # or
64\.23 # 64.23
)
\. # .
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]) # followed by 0..255
| # or
(?:10|64) # match 10 or 64
\. # .
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]) # followed by 0..255
)
\b # end of number
/^(X|\d{1,3})(\.(X|\d{1,3})){3}$/ should do it.
If you don't actually need to match the "X" character you could use this:
\b(?:\d{1,3}\.){3}\d{1,3}\b
Otherwise I would use the solution cebarrett provided.
I'm not entirely sure of what you're trying to achieve here (doesn't look anyone else is either).
However, if it's validation, then here's a solution to validate an IP address that doesn't use RegEx. First, split the input string at the dot. Then using parseInt on the number, make sure it isn't higher than 255.
function ipValidator(ipAddress) {
var ipSegments = ipAddress.split('.');
for(var i=0;i<ipSegments.length;i++)
{
if(parseInt(ipSegments[i]) > 255){
return 'fail';
}
}
return 'match';
}
Running the following returns 'match':
document.write(ipValidator('10.255.255.125'));
Whereas this will return 'fail':
document.write(ipValidator('10.255.256.125'));
Here's a noted version in a jsfiddle with some examples, http://jsfiddle.net/VGp2p/2/

Categories

Resources