match with Regular Expressions - javascript

I want to use match and a regular expression to split a string into an array.
Example:
var strdoc = '<p>noi dung</p>bài viết đúng.Đó thực sự là, cuối cùng';
var arrdocobj = strdoc.match(/(<.+?>)|(\s)|(\w+)(.+?)/g);
When I console.log(arrdocobj), the result is
["<p>", "noi ", "dung<", "p>", "bà", "i ", "viế", "t ", "ng.", " ", "thự", "c ", "sự", " ", "là", " ", "cuố", "i ", "cù", "ng"]
How can I split the string into an array like this?
["<p>", "noi"," ", "dung", "<p>","bài"," ","viết"," ","đúng",".","Đó"," ","thực"," ","sự"," ","là", "," ," ","cuối"," ","cùng"]

You could maybe use something like this:
var strdoc = '<p>noi dung</p>tiêu đề bài viết đúng';
var arrdocobj = strdoc.match(/<[^>]+>|\S+?(?= |$|<)/g);
I was looking into using the \b with the unicode flag, but I guess it isn't available in JS, so I used (?= |$|<) to emulate the word boundary.
EDIT: As per edit of question:
<[^>]+>|[^ .,!?:<]+(?=[ .,!?:<]|$)|.
might do the trick.
I just added a few more punctuation characters, plus the |. alternative to match whatever is left over.

I think the following regex does what you are asking for in your edit:
var strdoc = '<p>noi dung</p>bài viết đúng.Đó thực sự là, cuối cùng';
var arrdocobj = strdoc.match(/<[^>]+>|[\s]+|[^\s<]+/g);
Unfortunately, older JavaScript engines do not support Unicode categories like \p{L} (any Unicode letter). Unicode property escapes were only added in ES2018, and they require the u flag.
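On engines that implement ES2018, Unicode property escapes make it possible to match accented letters directly. The following is only a sketch under that assumption; the function name tokenize and the choice of the categories \p{L}, \p{M}, and \p{N} are illustrative and not part of the original answers.

```javascript
// Sketch: split a string into tags, words, whitespace runs, and single
// punctuation characters. Requires the u flag (ES2018 property escapes).
function tokenize(text) {
  // <[^>]+>             an HTML-like tag
  // [\p{L}\p{M}\p{N}]+  a run of letters, combining marks, and digits
  // \s+                 a run of whitespace
  // [^\s]               any other single character (punctuation, symbols)
  return text.match(/<[^>]+>|[\p{L}\p{M}\p{N}]+|\s+|[^\s]/gu) || [];
}

const strdoc = '<p>noi dung</p>bài viết đúng.Đó thực sự là, cuối cùng';
const tokens = tokenize(strdoc);
console.log(tokens);
// Nothing is dropped, so joining the tokens reproduces the input.
console.log(tokens.join('') === strdoc); // true
```

Unlike the desired output in the question, this keeps a run of several spaces as one entry; use \s instead of \s+ if every space should be a separate element.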

Related

How to replace found regex sub string with spaces with equal length in javascript?

In JavaScript, if I have something like
string.replace(new RegExp(regex, "ig"), " ")
this replaces every match with a single space. But how would I replace every match with a run of spaces of the same length as the match?
So if the regex was \d+, and the string was
"123hello4567"
it changes to
"   hello    "
Thanks
The replacement argument (the 2nd) to .replace can be a function. This function is called for every match, with the matched text as its first argument.
Knowing the length of the matched text, you can return the same number of spaces as the replacement value.
In the code below I use . as the replacement character so that the result is easy to see.
Note: this uses String#repeat, which is not available in IE11 (but then, neither are arrow functions), but you can always use a polyfill and a transpiler.
let regex = "\\d+";
console.log("123hello4567".replace(new RegExp(regex, "ig"), m => '.'.repeat(m.length)));
Internet Exploder friendly version
var regex = "\\d+";
console.log("123hello4567".replace(new RegExp(regex, "ig"), function (m) {
  return Array(m.length+1).join('.');
}));
Thanks to @nnnnnn for the shorter IE-friendly version.
"123hello4567".replace(new RegExp(/[\d]/, "ig"), " ")
1 => " "
2 => " "
3 => " "
"   hello    "
"123hello4567".replace(new RegExp(/[\d]+/, "ig"), " ")
123 => " "
4567 => " "
" hello "
If you just want to replace every digit with a space, keep it simple:
var str = "123hello4567";
var res = str.replace(/\d/g,' ');
"   hello    "
This answers your example, but not exactly your question. What if the regex could match a different number of characters depending on the string, or it isn't as simple as \d repeated? You could do something like this:
var str = "123hello456789goodbye12and456hello12345678again123";
var regex = /(\d+)/;
var match = regex.exec(str);
while (match != null) {
  // Create string of spaces of same length
  var replaceSpaces = match[0].replace(/./g,' ');
  str = str.replace(regex, replaceSpaces);
  match = regex.exec(str);
}
"   hello      goodbye  and   hello        again   "
This loops, executing the regex once per iteration (instead of using the /g global flag).
Performance-wise, this could likely be sped up by building a string of spaces with the same length as match[0] directly, which would remove the regex replace inside the loop. If performance isn't a high priority, this should work fine.
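One way the speed-up described above might look is sketched here. The helper name blankOutMatches is hypothetical, and the single-pass design (a global copy of the regex plus String#repeat) is an assumption about what the answer had in mind, not the answerer's own code.

```javascript
// Sketch: replace every match with spaces of equal length in one pass,
// without running a second regex replace inside the loop.
function blankOutMatches(input, pattern) {
  // Work on a global copy so exec() advances through the string.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const regex = new RegExp(pattern.source, flags);
  let result = '';
  let lastIndex = 0;
  let match;
  while ((match = regex.exec(input)) !== null) {
    if (match[0].length === 0) {
      regex.lastIndex++; // skip empty matches to avoid an infinite loop
      continue;
    }
    // Copy the text before the match, then add the padding.
    result += input.slice(lastIndex, match.index) + ' '.repeat(match[0].length);
    lastIndex = match.index + match[0].length;
  }
  return result + input.slice(lastIndex);
}

console.log(JSON.stringify(blankOutMatches("123hello4567", /\d+/)));
// "   hello    "
```

Because the output always has the same length as the input, character offsets computed on the original string remain valid afterwards.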

How to create an array out of a string and keep spaces & punctuation at the end of each word?

Take this string as example:
'The strong, fast and gullible cat ran over the street!'
I want to create a function that takes this string and creates the following array:
['The ','strong, ','fast ','and ','gullible ','cat ','ran ','over ','the ','street!']
Observe that I want to keep the punctuation and the space after each word.
This regular expression will match what you want: /[^\s]+\s*/g;
EXAMPLE:
var str = 'The strong, fast and gullible cat ran over the street!';
var result = str.match(/[^\s]+\s*/g);
console.log(result);
You can split at word boundaries followed by a word character:
var str = 'The strong, fast and gullible cat ran over the street!';
console.log(str.split(/\b(?=\w)/));
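The two approaches above differ once a word contains a non-word character such as an apostrophe. The sample string below is made up for illustration and is not from the original question.

```javascript
// \b(?=\w) also finds a boundary between "'" and "t", so "Don't" is cut in two,
// while \S+\s* only breaks after whitespace.
const apostropheSample = "Don't stop, cat!";

const bySplit = apostropheSample.split(/\b(?=\w)/);
const byMatch = apostropheSample.match(/\S+\s*/g);

console.log(bySplit); // ["Don'", "t ", "stop, ", "cat!"]
console.log(byMatch); // ["Don't ", "stop, ", "cat!"]
```

If the input may contain contractions or hyphenated words, the whitespace-based pattern is probably the safer choice.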
As @Summoner commented (while I was modifying my code to do it), if we add some character(s) that we want to use as a delimiter, we can then split on that, rather than on the spaces.
var s= 'The strong, fast and gullible cat ran over the street!';
s = s.replace(/\s+/g, " **");
var ary = s.split("**");
console.log(ary);
Gonna toss my solution into the ring as well. :)
var sTestVal = 'The strong, fast and gullible cat ran over the street!';
var regexPattern = /[^ ]+( |$)/g;
var aResult = sTestVal.match(regexPattern)
console.log(aResult);
The result is:
["The ", "strong, ", "fast ", "and ", "gullible ", "cat ", "ran ", "over ", "the ", "street!"]
The regex pattern breaks down like this:
[^ ]+ - matches one or more non-space characters, and then
( |$) - either a space or the end of the string
g - it will match all instances of the pattern

Regex to replace all but the last non-breaking space if multiple words are joined?

Using javascript (including jQuery), I’m trying to replace all but the last non-breaking space if multiple words are joined.
For example:
Replace A&nbsp;String&nbsp;of&nbsp;Words with A String of&nbsp;Words
I think you want something like this:
> "A&nbsp;String&nbsp;of&nbsp;Words".replace(/&nbsp;(?=.*?&nbsp;)/g, " ")
'A String of&nbsp;Words'
The above regex matches every &nbsp; string except the last one.
Assuming your string is like this, you can use a negative lookahead to do this.
var r = 'A&nbsp;String&nbsp;of&nbsp;Words'.replace(/&nbsp;(?![^&]*$)/g, ' ');
//=> "A String of&nbsp;Words"
Alternative to regex, easier to understand:
var fn = function(input, sep) {
  var parts = input.split(sep);
  // Set the last piece aside so the final separator can be kept
  var last = parts.pop();
  return parts.join(" ") + sep + last;
};
> fn("A&nbsp;String&nbsp;of&nbsp;Words", "&nbsp;")
"A String of&nbsp;Words"
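If the text comes from the DOM (for example via jQuery's .text()) rather than from HTML source, the non-breaking spaces are likely to be actual U+00A0 characters instead of the &nbsp; entity. A minimal sketch for that case follows; the function name keepLastNbsp is hypothetical, and the input is assumed to be a single line because . does not match newlines.

```javascript
// Sketch: replace every U+00A0 that is followed by another U+00A0 later
// in the string with a regular space, so only the last one survives.
function keepLastNbsp(text) {
  return text.replace(/\u00a0(?=.*\u00a0)/g, ' ');
}

const joined = 'A\u00a0String\u00a0of\u00a0Words';
console.log(keepLastNbsp(joined) === 'A String of\u00a0Words'); // true
```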

Regular expression to replace words preserving spaces

I'm trying to develop a function in JavaScript that gets a phrase and processes each word, preserving whitespace. It would be something like this:
properCase(' hi,everyone just testing') => ' Hi,Everyone Just Testing'
I tried a couple of regular expressions but I couldn't find the way to get just the words, apply a function, and replace them without touching the spaces.
I'm trying with
' hi,everyone just testing'.match(/([^\w]*(\w*)[^\w]*)?/g, 'x')
[" hi,", "everyone ", "just ", "testing", ""]
But I can't understand why the spaces are being captured. I just want to capture the (\w*) group. I also tried /(?:[^\w]*(\w*)[^\w]*)?/g and the result is the same...
What about something like
' hi,everyone just testing'.replace(/\b[a-z]/g, function(letter) {
  return letter.toUpperCase();
});
If you want to process each word, you can use
' hi,everyone just testing'.replace(/\w+/g, function(word) {
  // do something with each word like
  return word.toUpperCase();
});
When you use the global modifier (g), then the capture groups are basically ignored. The returned array will contain every match of the whole expression. It looks like you just want to match consecutive word characters, in which case \w+ suffices:
>>> ' hi,everyone just testing'.match(/\w+/g)
["hi", "everyone", "just", "testing"]
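When the capture groups are needed together with the global flag, String.prototype.matchAll (ES2020) returns one match object per hit, including its groups. This is a sketch assuming an ES2020 environment; the helper name extractWords is hypothetical.

```javascript
// Sketch: collect only the (\w+) capture group from every global match.
function extractWords(text) {
  // matchAll requires the g flag and yields full match objects,
  // so m[1] is the first capture group of each match.
  return Array.from(text.matchAll(/[^\w]*(\w+)/g), m => m[1]);
}

console.log(extractWords(' hi,everyone just testing'));
// ["hi", "everyone", "just", "testing"]
```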
function capitaliseFirstLetter(match)
{
  return match.charAt(0).toUpperCase() + match.slice(1);
}
var myRe = /\b(\w+)\b/g;
var result = "hi everyone, just testing".replace(myRe,capitaliseFirstLetter);
alert(result);
This matches each word and capitalizes it.
I'm unclear about what you're really after. Why is your regex not working? Or how to get it to work? Here's a way to extract words and spaces in your sentence:
var str = ' hi,everyone just testing';
var words = str.split(/\b/); // [" ", "hi", ",", "everyone", " ", "just", " ", "testing"]
words = words.map(function properCase(word) {
  return word.substr(0,1).toUpperCase() + word.substr(1).toLowerCase();
});
var sentence = words.join(''); // " Hi,Everyone Just Testing"
Note: When doing any string manipulation, replace will be faster, but split/join allows for cleaner, more descriptive code.

How to ignore newline in regexp?

How can I ignore newlines in a regexp in JavaScript?
for example:
data = "\
<test>11\n
1</test>\n\
#EXTM3U\n\
"
var reg = new RegExp( "\<" + "test" + "\>(.*?)\<\/" + "test" + "\>" )
var match = data.match(reg)
console.log(match[1])
result: undefined
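Before ES2018, JavaScript had no flag to tell RegExp() that . should match newlines (ES2018 added the s, or dotAll, flag). Without it, you need a workaround, e.g. [\s\S].
Your RegExp would then look like this (the backslashes are doubled because the pattern is written as a string literal):
var reg = new RegExp( "<" + "test" + ">([\\s\\S]*?)</" + "test" + ">" );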
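Building the pattern from strings has one pitfall worth showing: inside a string literal, "\s" collapses to a plain "s", so the class must be written with doubled backslashes. The helper extractTagContent below is hypothetical and assumes the tag name contains no regex metacharacters.

```javascript
// Hypothetical helper (not from the original answers): returns the text
// between <tagName> and </tagName>, or null when the tag is not found.
function extractTagContent(data, tagName) {
  // "[\\s\\S]" in a string literal becomes the regex class [\s\S].
  const pattern = new RegExp("<" + tagName + ">([\\s\\S]*?)</" + tagName + ">");
  const match = data.match(pattern);
  return match ? match[1] : null;
}

const sample = "<test>11\n1</test>\n#EXTM3U\n";
console.log(JSON.stringify(extractTagContent(sample, "test"))); // "11\n1"

// With single backslashes the string silently loses them:
console.log("[\s\S]" === "[sS]"); // true
```\n#EXTM3U\n", "test") !== "11\n1") throw new Error('multi-line content not extracted');
if (extractTagContent("no tags here", "test") !== null) throw new Error('expected null when the tag is missing');
if (extractTagContent("<a>x</a><a>y</a>", "a") !== "x") throw new Error('lazy quantifier should stop at the first closing tag');
</test>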
You are missing the JS line-continuation character \ at the end of line 2 of the string.
Also, change the regexp to:
var data = "\
<test>11\n\
1</test>\n\
#EXTM3U\n\
";
var reg = new RegExp(/<test>(.|\s)*<\/test>/);
var match = data.match(reg);
console.log(match[0]);
http://jsfiddle.net/samliew/DPc2E/
By reading this one: How to use JavaScript regex over multiple lines?
I came up with this, which works:
var data = "<test>11\n1</test>\n#EXTM3U\n";
reg = /<test>([\s\S]*?)<\/test>/;
var match = data.match(reg);
console.log(match[1]);
Here is a fiddle: http://jsfiddle.net/Rpkj2/
It is better to use [\s\S] instead of . for multiline matching.
It is the most common JavaScript idiom for matching everything including newlines. It's easier on the eyes and much more efficient than an alternation-based approach like (.|\n).
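On engines that implement ES2018, the s (dotAll) flag is an alternative to the [\s\S] idiom. The comparison below is a sketch assuming such an engine; the variable names are illustrative.

```javascript
const m3uData = "<test>11\n1</test>\n#EXTM3U\n";

// Classic idiom: [\s\S] matches any character, including newlines.
const viaCharClass = m3uData.match(/<test>([\s\S]*?)<\/test>/);

// ES2018: with the s flag, . matches newlines as well.
const viaDotAll = m3uData.match(/<test>(.*?)<\/test>/s);

// Without either, . stops at the newline and the match fails.
const withoutFlag = m3uData.match(/<test>(.*?)<\/test>/);

console.log(viaCharClass[1] === viaDotAll[1]); // true
console.log(withoutFlag); // null
```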
EDIT: Got it:
I tried to use this regex in Notepad++, but the problem was that it matched the whole text from beginning to end.
MyRegex:
<hostname-validation>(.|\s)*<\/pathname-validation> (finds everything)
/<hostname-validation>(.|\s)*<\/pathname-validation>/ (finds nothing)
/\<hostname-validation\>([\s\S]*?)\<\/pathname-validation\>/ (finds nothing)
**<hostname-validation>([\s\S]*?)<\/pathname-validation> (my desired result)**
The text where I use in:
<hostname-validation>www.your-tag-name.com</hostname-validation>
<pathname-validation>pathname</pathname-validation> <response-validation nil="true"/>
<validate-absence type="boolean">false</validate-absence> (...) <hostname-validation>www.your-tag-name.com</hostname-validation>
<pathname-validation>pathname</pathname-validation> <response-validation nil="false"/>
<validate-absence type="boolean">false</validate-absence> (...) <hostname-validation>www.your-tag-name.com</hostname-validation>
<pathname-validation>pathname</pathname-validation> <response-validation nil="true"/>
<validate-absence type="boolean">false</validate-absence> (...)
