Regex removing extension and numbers - javascript

I'm close to becoming a level 3 Regex Sorcerer (where I can find hidden traps and have a pet owl or bat), but I still need some help getting there...
The following works for the first two cases but fails for the third. I tried making the digits greedy but the whole thing fell over and I don't know where I'm going wrong.
Can you please help?
alert(removeNumberAndExtension("file 01.txt")) // works
alert(removeNumberAndExtension("file_01.txt")) // works
alert(removeNumberAndExtension("file.txt")) // fails
function removeNumberAndExtension(fname)
{
var rexp = new RegExp(/\s*\d+\.[a-zA-Z]+/g)
return fname.replace(rexp, "")
}

It's because of \d+: "one or more digits".
You need \d*: "zero or more digits".
Files extensions can also have digits (e.g. ".mp3"), so use [a-zA-Z0-9].
You should add the "end of the string" anchor ($), which makes the global flag (g) useless.
All these together: /\s*\d*\.[a-zA-Z0-9]+$/ :)

Related

Regex to find web addresses in short copy

Having a short copy I need to match all occurrences of links to websites. To keep things simple a need to find out addresses in this format:
www.aaaaaa.bbbbbb
http://aaaaaa.bbbb
https://aa.bbbb
but also I need to take care of longer www/http/https versions:
www.aaaaa.bbbb.ccc.ddd.eeee
etc. So basically number of subdomains is not known. Now I came up with this regex:
(www\.([a-zA-Z0-9-_]|\.(?!\s))+)[\s|,|$]|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s))+)[\s|,|$]
If you test on:
this is some tex with www.somewIebsite.dfd.jhh.hjh inside of it or maybe http://www.ssss.com or maybe https://evenore.com hahaah blah
It works fine with exception of when address is at the very end. $ seems to work only when there is \n in the end and it fails for:
this is some tex with www.somewIebsite.dfd.jhh.hjh
I'm guessing fix is simple and I miss something obvious so how would I fix it? BTW I posted regex here if yu want to quickly play around https://regex101.com/r/eL1bI4/3
The problem is that you placed the end anchor $ inside the character group []
[\s|,|$]
It is then interpreted literally as a dollar sign, and not as the anchor (the pipe character | is also interpreted literally, it's not needed there). The solution is to move the $ anchor outside:
(?:[\s,]|$)
However, in this case it makes more sense to use a positive lookahead instead of the noncapturing group (you don't want trailing spaces, or commas):
(?=[\s,]|$)
In the result you will end up with the following regex pattern:
(www\.([a-zA-Z0-9-_]|\.(?!\s))+)(?=[\s,]|$)|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s))+)(?=[\s,]|$)
See the working demo.
The updated version that handles trailing full stops:
(www\.([a-zA-Z0-9-_]|\.(?!\s|\.|$))+)(?=[\s,.]|$)|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s|\.|$))+)(?=[\s,.]|$)
See the working demo.

why this regexp returns match?

http://jsfiddle.net/sqee98xr/
var reg = /^(?!managed).+\.coffee$/
var match = '20150212214712-test-managed.coffee'.match(reg)
console.log(match) // prints '20150212214712-test-managed.coffee'
I want to match regexp only if there is not word "managed" present in a string - how I can do that?
Negative lookaheads are weird. You have to match more than just the word you are looking for. It's weird, I know.
var reg = /^(?!.*managed).+\.coffee$/
http://jsfiddle.net/sqee98xr/3/
EDIT: It seems I really got under some people's skin with the "weird" descriptor and lay description. It's weird because on a surface level the term "negative lookahead" implies "look ahead and make sure the stuff in these parenthesis isn't up there, then come back and continue matching". As a lover of regex, I still proclaim this naming is weird, especially to first time users of the assertion. To me it's easier to think of it as a "not" operator as opposed to something which actually crawls forward and "looks ahead". In order to get behavior to resemble an actual "look ahead", you have to match everything before the search term, hence the .*.
An even easier solution would have been to remove the start-of-string (^) assertion. Again, to me it's easier to read ?! as "not".
var reg = /(?!managed).+\.coffee$/
While #RyanWheale's solution is correct, the explanation isn't correct. The reason essentially is that a string that contains the word "managed" (such as "test-managed" ) can count as not "managed". To get an idea of this first lets look at the regular expression:
/^(?!managed).+\.coffee$/
// (Not "managed")(one or more characters)(".")("coffee")
So first we cannot have a string with the text "managed", then we can have one or more characters, then a dot, followed by the text "coffee". Here is an example that fulfills this.
"Hello.coffee" [ PASS ]
Makes sense, "Hello" certainly is not "managed". Here is another example that works from your string:
"20150212214712-test-managed.coffee" [ PASS ]
Why? Because "20150212214712-test-managed" is not the string "managed" even though it contains the string, the computer does not know that's what you mean. It thinks that "20150212214712-test-managed" as a string that isn't "managed" in the same way "andflaksfj" isn't "managed". So the only way it fails is if "managed" was at the start of the string:
"managed.coffee" [ FAIL ]
This isn't just because the text "managed" is there. Say the computer said that "managed." was not "managed". It would indeed pass the (?!managed) part but the rest of the string would just be coffee and it would fail because there is no ".".
Finally the solution to this is as suggested by the other answer:
/^(?!.*managed).+\.coffee$/
Now the string "20150212214712-test-managed.coffee" fails because no matter how it's looked at: "test-managed", "-managed", "st-managed", etc. Would still count as (?!.*managed) and fail. As in the example above this one it could try adding a sub-string from ".coffee", but as explained this would cause the string to fail in the rest of the regexp ( .+\.coffee$ ).
Hopefully this long explanation explained that Negative look-aheads are not weird, just takes your request very literally.

Recognizing Multple Sets of Opening Tag, Body, and Closing Tag in Regex

I'm attempting to develop a function that converts [bold]...[/bold] into <b>...</b> for my forum. As of right now, this works perfectly when there is only one set of [bold]...[/bold], however, whenever a second one is added, the regex does not recognize the end of the first [bold] until the second.
To illustrate, if I were to put "[bold]Hello[/bold], how are you [bold]today[/bold]?", I should get this:
Hello, how are you today?
However, what I'm actually getting is this:
Hello[/bold], how are you [bold]today?
Below is my current function:
function formatBolds(str) {
output = output.replace(/(\[bold\])(.*)(\[\/bold\])/g, "<b>$2</b>");
}
All you need it to change the .* to non greedy .*?
function formatBolds(str) {
output = output.replace(/(\[bold\])(.*?)(\[\/bold\])/g, "<b>$2</b>");
}
The problem with .* is that its greedy and tries to match as many as characters as possible, includeing the first [/bold] and following [bold] until the last [/bold]. The ? makes it non greedy, and halts the matching at the minimum length match
I think that using a non-greedy match should solve your problem:
(\[bold\])(.*?)(\[\/bold\])

Regex to find if there is only one block of code

My input is a string, I want to validate that there is only one first-level block of code.
Examples :
{ abc } TRUE
{ a { bc } } TRUE
{ a {{}} } TRUE
{ abc {efg}{hij}} TRUE
{ a b cde }{aa} FALSE
/^\{.*\}$/ is valid for the 5 cases, can you help me to find a regex invalid for the last case ?
Language is JavaScript.
EDIT: I started writing the answer before JavaScript was specified. Will leave it as for the record as it fully explains the regex.
In short: In JavaScript I cannot think of a reliable solution. In other engines there are several options:
Recursion (on which I will expand below)
Balancing group (.NET)
For solutions 2 (which anyhow won't work in JS either), I'll refer you to the example in this question
Recursive Regex
In Perl, PCRE (e.g. Notepad++, PHP, R) and the Matthew Barnett's regex module for Python, you can use:
^({(?:[^{}]++|(?1))*})$
The idea is to match exactly one set of nested braces. Anything more makes the regex fail.
See what matches and fails in the Regex Demo.
Explanation
The ^ anchor asserts that we are at the beginning of the string
The outer parentheses define Group 1 (or Subroutine 1)
{ match the opening brace
(?: ... )* zero or more times, we will...
[^{}]++ match any chars that are not { or }
OR |
(?1) repeat the expression of subroutine 1
} match closing brace
The $ anchor asserts that we are at the end of the string. Therefore,
This is a terrible workaround.
Since this is in Javascript there's not really much to do, but please see the following regex:
/^{([^{}]*|{})*}$/
Where you copy ([^{}]*|{})* and insert it between the last pair of curly brackets (rinse and repeat). Every duplication of this pattern allows another level of nesting between your elements. (This is a workaround for the lack of recursion in JS regex, required to solve nesting problems.)
Online Regex Demo
In JavaScript what you need to do is strip out all the nested blocks until no nested blocks are left and then check whether there are still multiple blocks:
var r = input.replace(/(['"])(?:(?!\1|\\).|\\.)*\1|\/(?![*/])(?:[^\\/]|\\.)+\/[igm]*|\/\/[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\//gi, '');
if (r.split('{').length != r.split('}').length || r.indexOf('}') < r.indexOf('{')) {
// ERROR
continue;
}
while (r.match(/\{[^}]*\{[^{}]*\}/))
r = r.replace(/(\{[^}]*)\{[^{}]*\}/g, '$1');
if (r.match(/\}.*\{/)
// FALSE
else
// TRUE
Working JSFiddle
Be sure to make the regex in the while and the regex in the replace match the same otherwise this might result in infinite loops.
Updated to address ERROR cases and remove anything in comments, strings and regex-literals first after Unihedron asked.
(\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*
Code for brackets

Why is regex failing when input contains a newline?

I've inherited this javascript regex from another developer and now, even though nothing has changed, it doesn't seem to match the required text. Here is the regex:
/^.*(already (active|exists|registered)).*$/i
I need it to match any text that looks like
stuff stuff already exists more stuff etc
It looks perfectly fine to me, it only looks for those 2 words together and should in theory ignore the rest of the string. In my script I check the text like this
var cardUsedRE = /^.*(already (active|exists|registered)).*$/i;
if(cardUsedRE.test(responseText)){
mdiv.className = 'userError';
mdiv.innerHTML = 'The card # has already been registered';
document.getElementById('cardErrMsg').innerHTML = arrowGif;
}
I've stepped through this in FireBug and I've seen it fail to test this string:
> Error: <detail>Card number already registered for CLP.\n</detail>
Am I missing something? What is the likely issue with this?
Here's a simplified but functionally-equivalent regex that should handle newlines:
/(already\s+(active|exists|registered))/i
Not sure why you'd ever want to lead with ^.* or end with .*$ unless your goal is specifically to prevent newlines. Otherwise it's just superfluous.
EDIT: I replaced the space with \s+ so it will be more liberal with how it handles whitespace (e.g. one space, two spaces, a tab, etc. should all match).
tldr; Use the m modifier to make . match newlines. See the MDC regular expression documentation.
Failing (note the "\n" in the string literal):
var str = "Error: <detail>Card number already registered for CLP.\n</detail>"
str.match(/^.*(already (active|exists|registered)).*$/i)
Working (note m flag for "multi-line" behavior of .):
var str = "Error: <detail>Card number already registered for CLP.\n</detail>"
str.match(/^.*(already (active|exists|registered)).*$/mi)
I would use a simpler form, however: (Adjust for definition of "space".)
var str = "Error: <detail>Card number already registered for CLP.\n</detail>";
str.match(/(?:already\s+(?:active|exists|registered))/i)
Happy coding.

Categories

Resources