Having trouble with regular expressions - javascript

I actually have 2 questions. There appears to be a knowledge gap in my understanding of regular expressions, so I was wondering if somebody could help me out.
1)
function LongestWord(sen) {
var word = /[^\W\d]([a-z]+)[$\W\d]/gi;
var answer = word.exec(sen);
return answer;
}
console.log(LongestWord("9hi3"));
Why does this return [hi3, i] as opposed to [9hi3, hi] as intended? I am clearly stating that either the beginning, a number, or a non-word character MUST come before a letter in my match. I also have the + symbol, which, being greedy, should take the entire group hi.
2)
function LongestWord(sen) {
var word = /[\b\d]([a-z]+)[\b\d]/gi;
var answer = word.exec(sen);
return answer;
}
console.log(LongestWord("hi"));
More importantly, why does this return null? #1 was my attempted fix to this. But you get the idea of what I'm trying to do here.
PLEASE TELL ME WHAT IS WRONG WITH MY THINKING IN BOTH PROBLEMS RATHER THAN GIVING ME A SOLUTION. IF I DON'T LEARN WHAT I DID WRONG I WILL GO ON TO REPEAT THE SAME MISTAKES. Thank you!

Let's walk through your regular expressions, using your example string: 9hi3
1) [^\W\d]([a-z]+)[$\W\d]
First, we have [^\W\d]. Normally, ^ matches the start of the string, but when it is inside [], it actually negates that block. So, [^\W\d] actually means any one character that IS a word character, and not a digit. This obviously skips the 9, since that is a digit, and matches on the h.
The next part, ([a-z]+), matches what you are expecting, except the h was already matched, so it only matches the i.
Then, [$\W\d] is matching a $ symbol, a non-word character, or a digit. Notice that just like ^, the $ does NOT match the end of the string when inside the [].
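To see this walkthrough in action, here is a small sketch. The first regex is yours (without the g flag, so each exec call starts fresh). The second is only an illustration of the underlying rule, not the one true fix: because ^ and $ act as anchors only outside [], it moves them into an alternation. The names original and anchored are made up for this example.

```javascript
// Inside [], ^ negates the class and $ is a literal dollar sign.
const original = /[^\W\d]([a-z]+)[$\W\d]/i;
const m1 = original.exec("9hi3");
console.log(m1[0], m1[1]); // hi3 i

// Outside [], ^ and $ are anchors, so they can sit in an alternation
// next to a character class.
const anchored = /(?:^|[\W\d])([a-z]+)(?:$|[\W\d])/i;
const m2 = anchored.exec("9hi3");
console.log(m2[0], m2[1]); // 9hi3 hi

// The anchors also let a bare word match.
const m3 = anchored.exec("hi");
console.log(m3[0], m3[1]); // hi hi
```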
2) [\b\d]([a-z]+)[\b\d]
For this one, you should start by looking at the documentation for exec to see why it can return null. Specifically:
If the match fails, the exec() method returns null.
So, you know that the match is failing. Why?
Again, your confusion is coming from not understanding how special characters change meaning when inside []. In this case, \b changes from matching a word-boundary, to matching a backspace character.
It is worth noting that your second regex will match the string you tested your first one with, 9hi3, because it begins and ends with digits. However, you tested it with hi.
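A quick way to convince yourself of the \b behaviour is to test it directly. This is just a sketch; the variable names are made up for the example.

```javascript
// Inside a character class, \b is the backspace character (U+0008).
const backspaceClass = /[\b]/;
console.log(backspaceClass.test("\b")); // true
console.log(backspaceClass.test("hi")); // false

// Outside a character class, \b is a word boundary.
const wordBoundary = /\bhi\b/;
console.log(wordBoundary.test("hi")); // true

// Your second regex: it needs a backspace or a digit on both sides.
const second = /[\b\d]([a-z]+)[\b\d]/i;
const onPlain = second.exec("hi");
const onDigits = second.exec("9hi3");
console.log(onPlain); // null
console.log(onDigits[0], onDigits[1]); // 9hi3 hi
```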
I hope these explanations have helped you.
For future reference, you should take a look at the RegExp guide on MDN.
Also, a great tool for testing regular expressions is regexpal. I highly recommend using it to help you figure out exactly what your regular expressions are doing.

Related

JS RegEx for finding number of lines in a page, separated by form feed \f

I have a use case that requires a plain-text file whose lines consist of at most 38 characters and whose 'pages' consist of at most 28 lines. To enforce this, I'm using regular expressions. I was able to enforce the line-length without any problems, but the page-length is proving to be much trickier.
After several iterations, I came to the following as a regular expression that I feel should work, but it isn't.
let expression = /(([^\f]*)(\r\n)){29,}\f/;
It simply results in no matches.
If anyone could provide some feedback, I'd greatly appreciate it! - Jacob
Edit 1 - removed code block around second expression, it was probably making my question confusing.
Edit 2 - removed following text, it's not pertinent:
As a comparison, the following expression results in a single match, the entire document. I'm assuming it's matching all lines up until the final
let expression = /(.*(\r\n)){29,}
Edit 3 - So after some thinking, I realized that my issue is due to the initial section of the regex that matches any characters before a newline is including newlines. Therefore, I believe I need to match any characters before a newline EXCEPT (\f\r\n). However, I'm now having trouble implementing this. I tried the following:
let expression = /([^\f^\r^\n]*(\r\n)){29,}\f/;
But it's also not matching. I'm assuming that my negations are wrong...
Edit 4 - I have the following regex that matches each line: let expression = /([^\f\r\n]{0,}(\r\n))/;
This is pretty close to what I want. All I need now is to match any instances of 29 or more lines followed by \f
Thanks for all the help to those who commented, a friend ended up helping me get the final regex
let expression = /([^\f\r\n]*?\r??\n){29,}?\f/;
Edit:
As you clarified more your problem, and provided your updated regex:
/([^\f^\r^\n]*(\r\n)){29,}\f/;
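Your negations are not right here. Only the ^ at the very start of a character class negates it; any later ^ is a literal caret. So [^\f^\r^\n] excludes \f, \r, \n and also the ^ character itself. Use [^\f\r\n] instead, which excludes exactly \f, \r, and \n.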
So, your regex becomes:
/([^\f\r\n]*(\r\n)){29,}\f/;
This will match 29 or more lines of characters (that can be anything but \f, \r or \n), the whole thing followed by a single \f.
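As a rough check of the corrected expression, the sketch below builds synthetic pages and tests them. The helper makePage and the sample line text are assumptions made for this example; your real file will look different.

```javascript
// Only the first ^ in a class negates; later ones are literal carets.
const withCarets = /[^\f^\r^\n]/;
const withoutCarets = /[^\f\r\n]/;
console.log(withCarets.test("^"));    // false, the caret is excluded too
console.log(withoutCarets.test("^")); // true

// Corrected expression: 29 or more CRLF-terminated lines, then a form feed.
const tooLong = /([^\f\r\n]*(\r\n)){29,}\f/;

// Hypothetical helper: a page of n identical lines ending in \f.
const makePage = (n) => "some text\r\n".repeat(n) + "\f";

console.log(tooLong.test(makePage(28))); // false, page is within the limit
console.log(tooLong.test(makePage(29))); // true, page is too long
```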
Original answer:
Your current regular expression:
let expression = /(([^\f]*)(\r\n)){29,}\f/;
Matches strings that consist of 29 or more lines (separated by \r\n), the whole thing followed by one single \f.
As far as I understood, you want each of your lines to end with \f. Did you mean to include the \f inside?
let expression = /(([^\f]*)(\r\n\f)){29,}/;

What does the regular expression, /(?!^)/g mean?

I can see that from here x(?!y): Matches x only if x is not followed by y.
That explains, if ?! is used before any set of characters, it checks for the not followed by condition.
But we have, ?!^. So, if we try to say it in words, it would probably mean not followed by negate. That really does not make me guess a probable statement for it. I mean negate of what?
I'm still guessing and could not reach a fruitful conclusion.
Help is much appreciated.
Thanks!
Circumflex ^ only implies negation in a character class [^...] (when comes as first character in class). Outside of it it means start of input string or line. A negative lookahead that contains ^ only, means match shouldn't be found at start of input string or line.
See it in action
It returns
all matches... (/g, global flag)
...which are not followed by... (x(?!y), negative lookahead where x is the empty string)
...the position at the start of the string (^)
That is, any position between two characters except for the position at the start of the string, as you can see here.
This regex can thus be used to detect empty strings: applied to an empty string it returns no matches, because the only position there is the start of the string. Empty strings may be the result of splitting strings, and you probably don't want to do anything with them.
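A short sketch makes the matched positions visible. It marks every match with a dash and then shows what the regex does with split and with an empty string; the variable names are made up for the example.

```javascript
const notAtStart = /(?!^)/g;

// Every position except the start of the string matches (as an empty match).
const marked = "abc".replace(notAtStart, "-");
console.log(marked); // a-b-c-

// Splitting on those positions yields the individual characters.
const pieces = "abc".split(/(?!^)/);
console.log(pieces); // [ 'a', 'b', 'c' ]

// An empty string has only the start position, so nothing matches.
const onEmpty = "".match(notAtStart);
console.log(onEmpty); // null
```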

Why character class and capturing group show different results in javascript regexp for a whitespace character followed by a dot?

I was solving the chapter exercises from this book - http://eloquentjavascript.net/09_regexp.html
There is a question where I need to write a regular expression for a whitespace character followed by a dot, comma, colon, or semicolon.
I wrote this one
var re1 = /\s(.|,|:|;)/;
The book had this as answer
var re2 = /\s[.,;:]/;
I understand that the second one is correct, and it is more efficient. But leaving behind efficiency, the first one should also give correct results.
The first one doesn't give correct output for the following piece of code -
console.log(re1.test("escape the dot")); // prints true
It should have given "false" but it outputs the opposite. I couldn't understand this. I tried https://www.debuggex.com/ too, but the figure also seems to be okay!
It seems that I am missing some understanding from my end.
Just as I finished this question to post, I realised my mistake that was giving me the wrong output. So, I thought I would rather share both the question and answer here so as to help anyone who might face some similar problem in future.
The thing is the period (dot) itself, when used between square brackets, loses its special meaning. The same goes for other special characters, such as +.
But they retain their special meaning when used in a capturing group.
So, the code
var re1 = /\s(.|,|:|;)/;
console.log(re1.test("escape the dot")); // prints true
is rather looking for the pattern: a whitespace character followed by either any character that is not a newline (because of the period), or a comma, colon, or semicolon.
To get the correct output, the correct re, if used with capturing group, would be,
var re1 = /\s(\.|,|:|;)/;
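Putting the three variants side by side shows the difference. This is only a sketch; the names unescaped, escaped and charClass are made up for the example.

```javascript
const unescaped = /\s(.|,|:|;)/;  // . is a wildcard inside the group
const escaped = /\s(\.|,|:|;)/;   // \. is a literal dot
const charClass = /\s[.,;:]/;     // . is literal inside []

const input = "escape the dot";
console.log(unescaped.test(input)); // true, " t" matches via the wildcard
console.log(escaped.test(input));   // false
console.log(charClass.test(input)); // false

// Both corrected forms still accept real punctuation after whitespace.
console.log(escaped.test("wait ."));   // true
console.log(charClass.test("wait ;")); // true
```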

Regular expression that matches a pattern but excludes a match when it starts with a certain character

I have a string of the format:
PATTERN or abcPATTERNdef or PATTERN or <<[TEST].[PATTERN]>>
I have created a regular expression (in JavaScript) as (^PATTERN)|([^\[]PATTERN) which is returning the first 3 occurrences while ignoring the last one, however I also seem to be getting the preceding character in the returned matches:
"PATTERN", "cPATTERN" and " PATTERN"
What I need are the matches without the preceding character.
I'm new to regular expressions and apologize if the question reflects that.
Any help would be greatly appreciated.
JavaScript doesn't have good support for lookbehinds (they were only added in ES2018, so older engines lack them), but if your format is going to remain the same, you can try something like this:
PATTERN(?!])
It matches all occurrences of "PATTERN" that are not followed by a ].
Let me know if this isn't the case, and I will update my answer to include the check for the opening [.
I'm a little confused as to what you're trying to do. Can you see if this works?
(^PATTERN)|[^\[](PATTERN)
Are you trying to find all the strings between the PATTERN's?
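The suggestions above can be compared on the sample string from the question. This is a sketch under the assumption that the string looks exactly like the example; the lookbehind variant is an extra illustration that is not part of either answer and needs an engine with ES2018 lookbehind support.

```javascript
const text = "PATTERN or abcPATTERNdef or PATTERN or <<[TEST].[PATTERN]>>";

// Negative lookahead: PATTERN not followed by ].
const byLookahead = text.match(/PATTERN(?!])/g);

// Capturing group: the preceding character is matched but not captured,
// so group 1 holds PATTERN on its own.
const byGroup = Array.from(text.matchAll(/(?:^|[^\[])(PATTERN)/g), (m) => m[1]);

// Assumption: the engine supports lookbehind (ES2018 and later).
const byLookbehind = text.match(/(?<!\[)PATTERN/g);

console.log(byLookahead.length, byGroup.length, byLookbehind.length);
```

All three approaches should report the same three occurrences and skip the one inside the brackets.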

JavaScript and regular expressions: get the number of parenthesized subpattern

I have to get the number of parenthesized substring matches in a regular expression:
var reg=/([A-Z]+?)(?:[a-z]*)(?:\([1-3]|[7-9]\))*([1-9]+)/g,
nbr=0;
//Some code
alert(nbr); //2
In the above example, the total is 2: only the first and the last pairs of parentheses will create capturing groups.
How can I find this number for any regular expression?
My first idea was to check the values of RegExp.$1 to RegExp.$9, but even if there are no corresponding parentheses, these values are not null, but empty strings...
I've also seen the RegExp.lastMatch property, but this one represents only the value of the last matched characters, not the corresponding number.
So, I've tried to build another regular expression to scan any RegExp and count this number, but it's quite difficult...
Do you have a better solution to do that?
Thanks in advance!
JavaScript's String.prototype.match() method returns an Array of matches. You might just want to check the length of that result array.
var mystr = "Hello 42 world. This 11 is a string 105 with some 2 numbers 55";
var res = mystr.match(/\d+/g);
console.log( res.length );
Well, judging from the code snippet, we can assume that the input pattern is always a valid regular expression, because otherwise it would fail before the "some code" part, right? That makes the task much easier!
Because we just need to count how many starting capturing parentheses there are!
var reg = /([A-Z]+?)(?:[a-z]*)(?:\([1-3]|[7-9]\))*([1-9]+)/g;
var nbr = (' '+reg.source).match(/[^\\](\\\\)*(?=\([^?])/g);
nbr = nbr ? nbr.length : 0;
alert(nbr); // 2
And here is a breakdown:
[^\\] Make sure we don't start the match with an escaping backslash.
(\\\\)* And we can have any number of escaped backslashes before the starting parenthesis.
(?= Look ahead. More on this later.
\( The starting parenthesis we are looking for.
[^?] Make sure it is not followed by a question mark, which means it is capturing.
) End of look ahead
Why match with look ahead? To check that the parenthesis is not an escaped entity, we need to capture what goes before it. No big deal here. We know JS engines before ES2018 don't have look behind.
Problem is, if there are two starting parentheses sticking together, then once we capture the first parenthesis the second parenthesis would have nothing to back it up: its back has already been captured!
So to make sure a parenthesis can be the starting base of the next one, we need to exclude it from the match.
And the space added to the source? It is there to be the back of the first character, in case it is a starting parenthesis.
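Wrapped up as a function, the scan above might look like the sketch below. The second function is a different technique, not the one from this answer: it appends an empty alternative so the pattern is guaranteed to match the empty string, then reads the group count off the result array. Both function names are made up for this example. One known difference: a literal ( inside a character class, as in /[(]a/, is counted by the scan but not by the matching approach.

```javascript
// Approach from the answer: scan the source for an unescaped "(" that is
// not followed by "?". The leading space gives a first "(" something to
// sit behind.
function countByScanning(re) {
  const hits = (" " + re.source).match(/[^\\](\\\\)*(?=\([^?])/g);
  return hits ? hits.length : 0;
}

// Alternative technique: "pattern|" always matches "", and the result
// array has one slot for the whole match plus one per capturing group.
function countByMatching(re) {
  return new RegExp(re.source + "|", re.flags).exec("").length - 1;
}

const reg = /([A-Z]+?)(?:[a-z]*)(?:\([1-3]|[7-9]\))*([1-9]+)/g;
console.log(countByScanning(reg)); // 2
console.log(countByMatching(reg)); // 2
```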
