The significance of space in this JS regexp? - javascript

I've been learning some Javascript regular expressions today and I'm failing to understand how the following code works.
var toswop = 'last first\nlast first\nlast first';
var swapped = toswop.replace(/([\w]+)\b([\w ]+)/g,'$2 $1');
alert(swapped);
It correctly alerts the words swapped round into the correct sequence; however, the following code (note the missing space after the second \w) doesn't work. It just prints them in the original order.
var toswop = 'last first\nlast first\nlast first';
var swapped = toswop.replace(/([\w]+)\b([\w]+)/g,'$2 $1');
alert(swapped);

From the MDN:
\w
Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_].
For example, /\w/ matches 'a' in "apple," '5' in "$5.28,"
and '3' in "3D."
When you add a space, you change the character set from alphanumerics and an underscore to alphanumerics and an underscore and a space.

I think you are incorrectly using '\b' to match a space, but in JavaScript regular expressions '\b' matches the position at the beginning or end of a word and consumes no characters.
Therefore the /([\w]+)\b/ part of the regular expression matches only up to the end of the word 'last'. The remaining string is ' first' (note the space at the beginning).
Then, to match the remainder, you need ([\w ]+), which translates to 'one or more occurrences of any word character or space character'. That is exactly what we need to match the remaining string ' first'.
You can note that even when the words are swapped, there is a space before the word 'first'.
To prove this further: imagine you changed your input to :
var toswop = 'last first another\nlast first another\nlast first another';
You can see your swapped text becomes
first another last
first another last
first another last
That is because the last segment of the regular expression, ([\w ]+), kept matching both spaces and word characters and included the word 'another' in the match.
But if you remove the space from the square brackets, it won't match the remainder ' first', because that is not a string of word characters but a space followed by a string of word characters.
That is why the space is significant here.
But if you change your regex like this:
swapped = toswop.replace(/([\w]+)\s([\w]+)/g,'$2 $1');
Then it works without the space in the brackets because the \s in the middle will match the space between the two words.
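The three variants discussed above can be compared side by side. This is only a minimal sketch: the variable names are made up for illustration, and console.log stands in for alert.

```javascript
// Same input as in the question.
const swapInput = 'last first\nlast first\nlast first';

// Space inside the class: group 2 captures ' first', leading space included.
const withSpace = swapInput.replace(/([\w]+)\b([\w ]+)/g, '$2 $1');

// No space in the class: no word character follows the \b after 'last',
// so nothing matches and the string comes back unchanged.
const withoutSpace = swapInput.replace(/([\w]+)\b([\w]+)/g, '$2 $1');

// \s consumes the separator, so neither group contains a space.
const withWhitespace = swapInput.replace(/([\w]+)\s([\w]+)/g, '$2 $1');

// JSON.stringify makes the leading spaces and newlines visible.
console.log(JSON.stringify(withSpace));
console.log(JSON.stringify(withoutSpace));
console.log(JSON.stringify(withWhitespace));
```

Printing through JSON.stringify is just a convenience here, since the extra leading space in the first result is easy to miss otherwise.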
Hope this clarifies your question.
See here for JavaScript RegEx syntax: http://www.w3schools.com/jsref/jsref_regexp_begin.asp
See here for my fiddle if you want to experiment more: http://jsfiddle.net/BuddhiP/P5Jqm/

It goes through the expression like this. It will look for all word characters until a non-word character is found. That catches the first word.
Then it looks for the next match which is a space or a word character. So without the space in the square brackets the space in the name isn't matched. That is why it's failing for the alternative without the space.
I think it's better to write this explicitly putting the space in rather than the \b.

Related

I'm confused by how RegEx distinguishes between apostrophes and single quotes

I’m trying to better understand RegEx and apostrophes/single quotes.
If I use this code:
const regex = /\b[\'\w]+\b/g
console.log(phrase.match(regex))
Then
let phrase = "'pickle'" // becomes pickle (single quotes disappear)
let phrase = "can't" // becomes can't (apostrophe remains)
I thought I knew what all regex do:
/text/g means everything between the slashes and g means global, to
keep searching after the first hit.
\b is word boundary, spaces on each side
\w+ means alphanumerics, and the '+' indicates it can be for more
than 1 character
[\w\']+ means A-Za-z0-9 and apostrophe of any length.
But I'd like to get this:
let phrase = "'pickle'" // becomes 'pickle' (with single quotes)
What am I missing? I experimented with
const regex2 = /\b[\w+\']\b/g;
console.log(phrase.match(regex2))
let phrase = "can't"
But that becomes ["'", "t"] ... why? I understand now that the + is after the \w, the \' stands alone, but why "t" and where did the "can" go?
I tried
const regex3 = /\b\'[\w+]\'\b/g;
console.log(phrase.match(regex3))
But I get "null". Why?
The question is basically "How do I get word boundaries including apostrophes". Right?
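If so, then the regex you have, /\b[\'\w]+\b/g, explicitly requires \b, a word boundary, which matches the position between a non-word character (like a space or apostrophe) and a word character, or vice versa. A quote at the very start or end of the string therefore falls outside the boundaries and is left out of the match. Like this: https://regex101.com/r/7Pxsru/1 (I added a few more words so that the boundary is clearly seen)
If you would like to get "'pickle'" and "can't", then simply don't look for \b, like this: /[\w+\']+/g (the + inside the brackets is a literal plus sign, so /[\w']+/g is enough if you don't want to match it). See the demo: https://regex101.com/r/FNjlEq/1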
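A small sketch of the difference, using the two phrases from the question. The constant names are invented for illustration, and the unbounded pattern is written without the literal + in the class.

```javascript
const quoted = "'pickle'";
const contraction = "can't";

// With \b, the outer quotes sit outside the word boundaries.
const bounded = /\b['\w]+\b/g;
// Without \b, the character class alone decides what belongs to the match.
const unbounded = /[\w']+/g;

console.log(quoted.match(bounded));        // quotes dropped
console.log(contraction.match(bounded));   // apostrophe kept, it is inside the word
console.log(quoted.match(unbounded));      // quotes kept
console.log(contraction.match(unbounded));
```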
The two regexes you propose mean the following:
/\b[\w+\']\b/g: Look for a word boundary, then exactly one character that is a word character, a literal + (the + is not a quantifier when it is inside a []), or an apostrophe, then another word boundary.
/\b\'[\w+]\'\b/g: Look for a word boundary followed by an apostrophe, then exactly one character that is a word character or a literal + (the [] is what stops the + from acting as a quantifier), then an apostrophe and a word boundary.
const regex2 = /\b[\w+\']\b/g;
In this one, since the + is inside the [], it matches a literal + character, so you're searching for a word boundary, followed by a single alphanumeric character, a +, or a ', followed by a word boundary.
You probably want:
\b(\w+|\')\b
which looks for a word boundary, followed by either at least one alphanumeric character or a single quote.
It would probably help to look at regex101 so you can see what the regex is actually doing: https://regex101.com/r/aJPWAB/1
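The behaviour described in these answers can also be checked directly in a console. The sketch below uses "can't" from the question; the constant names are assumptions made for this example.

```javascript
// Inside [...] the + is an ordinary character, not a quantifier.
const plusInClass = /[\w+]/;
console.log(plusInClass.test('+'));

// regex2: exactly one character between two boundaries. 'c', 'a' and 'n'
// each lack a boundary on at least one side, so only "'" and "t" qualify.
const singleChar = "can't".match(/\b[\w+']\b/g);

// The alternation suggested above, for comparison: it splits the word
// at the apostrophe instead of keeping it whole.
const alternation = "can't".match(/\b(\w+|')\b/g);

// regex3: needs exactly one character between two apostrophes.
const betweenQuotes = "can't".match(/\b'[\w+]'\b/g);

console.log(singleChar, alternation, betweenQuotes);
```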

JS & Regex: how to replace punctuation pattern properly?

Given an input text in which every space has been replaced by one or more _ characters:
Hello_world_?. Hello_other_sentenc3___. World___________.
I want to keep the _ between words, but I want to attach each punctuation mark to the last word of its sentence, with no space between the last word and the punctuation. I want to use the punctuation as the pivot of my regex.
I wrote the following JS-Regex:
str = str.replace(/(_| )*([:punct:])*( |_)/g, "$2$3");
This fails, since it returns :
Hello_world_?. Hello_other_sentenc3_. World_._
Why doesn't it work? How do I delete all the "_" between the last word and the punctuation?
http://jsfiddle.net/9c4z5/
Try the following regex, which makes use of a positive lookahead:
str = str.replace(/_+(?=\.)/g, "");
It replaces every run of underscores that is immediately followed by a period with the empty string, thus removing them.
If you want to match other punctuation characters than just the period, replace the \. part with an appropriate character class.
JavaScript doesn't have :punct: in its regex implementation. I believe you'd have to list out the punctuation characters you care about, perhaps something like this:
str = str.replace(/(_| )+([.,?])/g, "$2");
That is, replace any group of _ or space that is immediately followed by punctuation with just the punctuation.
Demo: http://jsfiddle.net/9c4z5/2/
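Both suggestions can be tried against the sample text from the question. This is a sketch under a few assumptions: the constant names are hypothetical, and the lookahead version is widened from \. to a small hand-picked class of punctuation.

```javascript
const punctInput = 'Hello_world_?. Hello_other_sentenc3___. World___________.';

// [:punct:] is not a POSIX class in JavaScript: it is a plain character
// class containing ':', 'p', 'u', 'n', 'c' and 't'.
const fakePunct = /[:punct:]/;
console.log(fakePunct.test('p'), fakePunct.test('.'));

// Lookahead version: the punctuation is checked but not consumed.
const viaLookahead = punctInput.replace(/_+(?=[.,?!])/g, '');

// Capture-group version: the punctuation is consumed and put back via $2.
const viaCapture = punctInput.replace(/(_| )+([.,?])/g, '$2');

console.log(viaLookahead);
console.log(viaCapture);
```

On this input the two approaches give the same string; which punctuation characters belong in the class depends on the text being cleaned.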

How do I need to write this RegEx to match the given test case? (don't match the ending period)

regex:
/#([\S]*?(?=\s)(?!\. ))/g
given string:
'this string has #var.thing.me two strings to be #var. replaced'.replace(/#([\S]*?(?=\s)(?!\. ))/g,function(){return '7';})
expected result:
'this string has 7 two strings to be 7. replaced'
In case you want to make it "better" I'm trying to match Razor Html Encoded Expressions but mind the case about not matching an ending period followed by a space. The test case above shows that with the second (shorter) #var, whereas the first captures as #var.thing.me
Try with following regex:
var input = 'this string has #var.thing.me two strings to be #var. replaced';
// replace() returns a new string, so keep the result
var output = input.replace(/(#[a-z][a-z.]+[a-z])/gi, function(){
return '7';
});
This regex, (#[a-z][a-z.]+[a-z]), matches #, then a letter (so a dot cannot directly follow the #), then one or more letters or dots, and finally a letter, so the match never ends in a dot.
The i modifier makes the regex case-insensitive.
Your pattern is not restrictive enough i.e., it captures too much. The last #var. (including the dot) in your example string is captured because it is followed by a space (as required by the positive lookahead) which, in addition, is not followed by a dot and a space (as required by the negative lookahead). You can try this pattern:
/#([\S]*?)(?=[.]?\s)/g
It will match the #something substring (which can contain dot characters) both when it is followed by a space (as it happens in the first match of your string) and when it is followed by a dot and a space (as it happens in the second match of your string). Testing it in the chromium browser console it seems to work fine:
> 'this string has #var.thing.me two strings to be #var. replaced'.replace(/#([\S]*?)(?=[.]?\s)/g,function(){return '7';})
"this string has 7 two strings to be 7. replaced"
Try this
#((?!\. )\S)+
See it here at regexr
This matches a # followed by non whitespace characters \S. But it matches the next non whitespace only, if it is not a dot followed by a space. This is ensured by the negative lookahead assertion (?!\. ) before the \S.
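The three patterns proposed in these answers can be run against the test string from the question. A rough sketch: the names razorInput, razorPatterns and razorResults are invented here, a plain '7' string replaces the callback, and the g flag on the third pattern is an assumption since the answer shows it without flags.

```javascript
const razorInput = 'this string has #var.thing.me two strings to be #var. replaced';

const razorPatterns = [
  /(#[a-z][a-z.]+[a-z])/gi,   // letters and dots, must end in a letter
  /#([\S]*?)(?=[.]?\s)/g,     // lazy match up to an optional dot plus whitespace
  /#((?!\. )\S)+/g,           // stop before a dot that is followed by a space
];

const razorResults = razorPatterns.map((re) => razorInput.replace(re, '7'));
razorResults.forEach((result) => console.log(result));
```

All three agree on this input. They differ on edge cases: for a string that ends in '#var.' with nothing after the dot, the first leaves the dot in place, the second does not match at all, and the third swallows the dot, so it is worth testing against real templates.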

simple regex to matching multiple word with spaces/multiple space or no spaces

I am trying to match all words separated by single or multiple spaces. My expression
(\w+\s*)* is not working
edit 1:
Let's say I have a sentence in this form
[[do "hi i am bob"]]
[[do "hi i am Bob"]]
now I have to replace this with
cool("hi i am bob") or
cool("hi i am Bob")
I do not care about replacing multiple spaces with a single one.
I can achieve this for a single word like
\[\[do\"(\w+)\"\]\] and the replacement cool\(\"$1\") but this does not look like an effective solution and does not match multiple words ....
I apologise for the incomplete question.
Any help will be appreciated.
Find this Regular Expression:
/\[\[do\s+("[\w\s]+")\s*\]\]/
And do the following replacement:
'cool($1)'
The only special thing that's being done here is using character classes to our advantage with
[\w\s]+
Matches one or more word or whitespace characters (a-z, A-Z, 0-9, _, and whitespace). That'll eat up your internal stuff no problem.
'[[do "hi i am Bob"]]'.replace(/\[\[do\s+("[\w\s]+")\s*\]\]/, 'cool($1)')
Spits out
cool("hi i am Bob")
Though - if you want to add punctuation (which you probably will), you should do it like this:
/\[\[do\s+("[^"]+")\s*\]\]/
Which will match any character that's not a double quote, preserving your substring. There are more complicated ones to allow you to deal with escaped quotation marks, but I think that's outside the scope of this question.
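To see the difference between the two character classes, here is a short sketch. The punctuated input is a made-up example, not taken from the question, and the constant names are assumptions.

```javascript
const wordOnly = /\[\[do\s+("[\w\s]+")\s*\]\]/;
const anyButQuote = /\[\[do\s+("[^"]+")\s*\]\]/;

const plainTag = '[[do "hi i am Bob"]]';
const punctuatedTag = '[[do "hi, i am Bob!"]]';  // hypothetical input

console.log(plainTag.replace(wordOnly, 'cool($1)'));
// [\w\s]+ cannot cross the comma, so the string is left unchanged.
console.log(punctuatedTag.replace(wordOnly, 'cool($1)'));
// [^"]+ accepts anything up to the closing quote.
console.log(punctuatedTag.replace(anyButQuote, 'cool($1)'));
```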
To match "all words with single or multiple spaces", you cannot use \s*, as it will match even no spaces.
On the other hand, it looks like you want to match even "hi", which is one word with no spaces.
You probably want to match one or more words separated by spaces. If so, use regex pattern
(\w+(?:$|\s+))+
or
\w+(\s+\w+)*
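A quick way to see why the original (\w+\s*)* appears not to work, next to the second pattern suggested above. The sample strings and constant names are assumptions made for this sketch.

```javascript
// (\w+\s*)* may repeat zero times, so it "succeeds" with an empty match
// as soon as the first character is not a word character (here a quote).
const emptyMatch = /(\w+\s*)*/.exec('"hi i am bob"')[0];

// \w+(\s+\w+)* needs at least one word, then any number of
// whitespace-separated words.
const phraseMatches = 'say "hi   i am bob" now'.match(/\w+(\s+\w+)*/g);

console.log(JSON.stringify(emptyMatch));
console.log(phraseMatches);
```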
I'm not sure, but maybe this is what you're trying to get:
"Hi I am bob".match(/\b\w+\b/g); // ["Hi", "I", "am", "bob"]
Use regex pattern \w+(\s+\w+)* as follows:
m = s.match(/\w+(\s+\w+)*/g);
Simple. Match all groups of characters that are not white spaces
var str = "Hi I am Bob";
var matches = str.match(/[^ ]+/g); // => ["Hi", "I", "am", "Bob"]
What your regex is doing is:
/([a-zA-Z0-9_]{1,}[ \r\v\n\t\f]{0,}){0,}/
That is, find the first match of one or more of A through Z, both lower and upper case, along with digits and underscore, followed by zero or more whitespace characters, which are:
A space character
A carriage return character
A vertical tab character
A new line character
A tab character
A form feed character
The whole group is then repeated zero or more times, which is why the pattern can also match the empty string.
\s matches more than just simple spaces; you can put in a literal space instead, and it will work.
I believe you want:
/(\w+ +\w+)/g
Which finds all matches of one or more of A through Z, both lower and upper case, along with digits and underscore, followed by one or more spaces, then followed by one or more of A through Z, both lower and upper case, along with digits and underscore.
Each match is therefore a pair of words separated by spaces.
If you just want to find all clusters of word characters, without punctuation or spaces, then, you would use:
/(\w+)/g
Which will find all word-characters that are grouped together.
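The two patterns from this answer behave quite differently on the same sentence, which a short sketch makes visible (the constant names are invented for illustration):

```javascript
const sentence = 'Hi I am Bob';

// Each match consumes two words, so the result is a list of pairs.
const wordPairs = sentence.match(/(\w+ +\w+)/g);

// One match per run of word characters.
const singleWords = sentence.match(/(\w+)/g);

console.log(wordPairs);
console.log(singleWords);
```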
var regex=/\w+\s+/g;
Live demo: http://jsfiddle.net/GngWn/
[Update] I was just answering the question, but based on the comments this is more likely what you're looking for:
var regex=/\b\w+\b/g;
\b are word boundaries.
Demo: http://jsfiddle.net/GngWn/2/
[Update2] Your edit makes it a completely different question:
string.replace(/\[\[do "([\s\S]+)"\]\]/,'cool("$1")');
Demo: http://jsfiddle.net/GngWn/3/

Remove entire word from string if it contains numeric value

What I'm trying to accomplish is to auto-generate tags/keywords for a file upload, basing these keywords from the filename.
I have accomplished auto-generating titles for each upload, as shown here:
But I have now moved on to trying to auto-generate keywords. Similar to titles, but with more formatting. First, I run the string through this to remove commonly used words from the filename (such as this,that,there... etc)
I am happy with it, but I need to not include words that have numbers in it. I have not found a solution on how to remove a word entirely if it contains a number. The solutions I have found like here only works for a certain match, while this one removes numbers alone. I would like to remove the entire word if it contains ANY numeric digit.
To remove all words which contain a number, use:
string = string.replace(/[a-z]*\d+[a-z]*/gi, '');
Try this expression:
var regex = /\b[^\s]*\d[^\s]*\b/g;
Example:
var str = "normal 5digit dig555it digit5 555";
console.log( str.replace(regex,'') ); // Result -> "normal" followed by the leftover spaces
Apply a simple regular expression to your current filename strings, replacing all occurrences with the empty string. The regular expression matches "words" containing any digits.
Javascript example:
'asdf 8bit jawesome234 mayhem 234'.replace(/\s*\b\w*\d\w*\b/g, '')
Evaluates to:
"asdf mayhem"
Here the regular expression is /\s*\b\w*\d\w*\b/g, which matches maximal sequences consisting of zero or more whitespace characters (\s*) followed by a word-boundary transition (\b), followed by zero or more alphanum characters (\w*), followed by a digit (\d), followed by zero or more alphanum characters, followed by a word-boundary transition (\b). \b matches the empty string at the transition to an alphanumeric character from either the beginning or end of the word or a non-alphanumeric character. The g after the final / of the regular expression means replace all occurrences, not just the first.
Once the digit-words are removed, you can split the string into keywords however you want (by whitespace, for example).
"asdf mayhem".split(/\s+/);
Evaluates to:
["asdf", "mayhem"]
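The two steps above can be wrapped into one function. This is a minimal sketch: keywordsFromName is a hypothetical helper name, and the trim and filter calls are additions to cope with names that start with, or consist only of, digit-words.

```javascript
// Hypothetical helper: drop every word containing a digit, then split
// what is left on whitespace.
function keywordsFromName(name) {
  return name
    .replace(/\s*\b\w*\d\w*\b/g, '')  // remove digit-words and the spaces before them
    .trim()                            // a removed first word can leave a leading space
    .split(/\s+/)
    .filter(Boolean);                  // ''.split(/\s+/) yields [''], so drop empties
}

console.log(keywordsFromName('asdf 8bit jawesome234 mayhem 234'));
console.log(keywordsFromName('Apple Cover Photo 23s423 of your 543634 moms'));
```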
('Apple Cover Photo 23s423 of your 543634 moms').match(/\b([^\d]+)\b/g)
returns
Apple Cover Photo , of your , moms
http://jsfiddle.net/awBPX/2/
Use this to remove words containing digits:
string = string.replace(/\w*[0-9]\w*/g, "");
Hope this helps.
Edited :
check this :
var str = 'one 2two three3 fo4ur 5 six';
var result = str.match(/(^[\D]+\s|\s[\D]+\s|\s[\D]+$|^[\D]+$)+/g).join('');
