JavaScript regex only replacing first match occurrence - javascript

I am using regular expressions to do some basic converting of wiki markup code into copy-pastable plain text, and I'm using javascript to do the work.
However, javascript's regex engine behaves much differently to the ones I've used previously as well as the regex in Notepad++ that I use on a daily basis.
For example- given a test string:
==Section Header==
===Subsection 1===
# Content begins here.
## Content continues here.
I want to end up with:
Section Header
Subsection 1
# Content begins here.
## Content continues here.
Simply remove all equals signs.
I began with the regex setup of:
var reg_titles = /(^)(=+)(.+)(=+)/
This regex matches lines that begin with one or more equals signs and contain a second run of one or more equals signs. Rubular shows that it matches my lines accurately and does not catch equals signs in the middle of content. http://www.rubular.com/r/46PrkPx8OB
The code to replace the string based on regex
var lines = $('.tb_in').val().split('\n'); //use jquery to grab text in a textarea, and split into an array of lines based on the \n
for(var i = 0;i < lines.length;i++){
line_temp = lines[i].replace(reg_titles, "");
lines[i] = line_temp; //replace line with temp
}
$('.tb_out').val(lines.join("\n")); //rejoin and print result
My result is unfortunately:
Section Header==
Subsection 1===
# Content begins here.
## Content continues here.
I cannot figure out why the regex replace function, when it finds multiple matches, seems to only replace the first instance it finds, not all instances.
Even when my regex is updated to:
var reg_titles = /(={2,})/
"Find any two or more equals", the output is still identical. It makes a single replacement and ignores all other matches.
No other regex engine I have used behaves this way. Running the same replace multiple times has no effect.
Any advice on how to get my string replace function to replace ALL instances of the matched regex instead of just the first one?

^=+|=+$
You can use this. Do not forget to add the g and m flags. Replace with an empty string. See demo.
http://regex101.com/r/nA6hN9/28
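A minimal sketch of the difference the flags make, using the test string from the question. The variable names are illustrative only and are not taken from the original code.

```javascript
const wiki = [
  "==Section Header==",
  "===Subsection 1===",
  "# Content begins here.",
  "## Content continues here."
].join("\n");

// Without the g flag, replace() stops after the first match in the string.
const firstOnly = "==Section Header==".replace(/={2,}/, "");

// With g (all matches) and m (^ and $ also match at line breaks),
// the leading and trailing runs of equals signs are removed on every line.
const cleaned = wiki.replace(/^=+|=+$/gm, "");

console.log(firstOnly); // Section Header==
console.log(cleaned);
```

Because the m flag makes the anchors work per line, the text does not have to be split into an array first.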

Add the g modifier to do a global search:
var reg_titles = /^(=+)(.+?)(=+)/g

Your regex is needlessly complex, and yet doesn't actually accomplish what you set out to do. :) You might try something like this instead:
var reg_titles = /^=+(.+?)=+$/;
var lines = $('.tb_in').val().split('\n');
lines.forEach(function(v, i, a) {
    // keep only the captured title text between the two runs of equals signs
    a[i] = v.replace(reg_titles, '$1');
});
$('.tb_out').val(lines.join("\n"));

Related

Extracting a complicated part of the string with plain Javascript

I have the following string:
Text
I want to extract from this string, with the use of JavaScript 'pl' or 'pl_company_com'
There are a few variables:
jan_kowalski is a first name and surname; it can change, and sometimes it even has 3 elements
the country code (in this example 'pl') will change to others such as en / de / fr (this is the part of the string I want to get)
the rest of the string remains the same in every case (the beginning + everything starting with _company_com ...)
P.S. I tried to do it with split, but my knowledge of JS is very basic and I can't get what I want. Please help.
An alternative to Randy Casburn's solution using regex
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_(.*_company_com)')[1];
console.log(out);
Or if you want to just get that string with those country codes you specified
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_((en|de|fr|pl)_company_com)')[1];
console.log(out);
A proof of concept that this solution also works for other combinations
let urls = [
new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx'),
new URL('https://my.domain.com/personal/firstname_middlename_lastname_pl_company_com/Documents/Forms/All.aspx')
]
urls.forEach(url => console.log(url.href.match('.*_(en|de|fr|pl).*')[1]))
I have had good results solving this kind of problem with regular expressions:
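// URL taken from the other answer to this question
var string = 'https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx';
var regExp = /([\w]{2})_company_com/;
var find = string.match(regExp);
console.log(find); // array with found matches
console.log(find[1]); // first group of regexp = country code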
First you have your given string. Second you have a regular expression, which is delimited by a slash at the beginning and a slash at the end. A regular expression is mostly used for string searches (you can even replace complicated text in all major editors with it, which can be VERY useful).
In this case it matches exactly two word characters, [\w]{2}, followed directly by _company_com (\w indicates a word character, the [] groups all wanted character types, here only word characters, and the {} indicates the number of characters to be found). To find the wanted part, string.match(regExp) has to be called. It returns an array with the whole matched string followed by all capture groups within the regExp (which are denoted by ()), or null if nothing matches. So in this case you get the country code with find[1], which is the first and only capture group of the regular expression.
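The capture-group approach described above could be wrapped in a small helper, sketched here. The function name extractCountryCode is hypothetical, and the sketch assumes the country code is always two lowercase letters placed directly before _company_com.

```javascript
// Hypothetical helper: returns the two-letter country code,
// or null when the URL does not contain "_xx_company_com".
function extractCountryCode(url) {
  const found = url.match(/_([a-z]{2})_company_com/);
  return found ? found[1] : null;
}

const twoPartName = "https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx";
const threePartName = "https://my.domain.com/personal/firstname_middlename_lastname_de_company_com/Documents/Forms/All.aspx";

console.log(extractCountryCode(twoPartName));   // pl
console.log(extractCountryCode(threePartName)); // de
console.log(extractCountryCode("https://my.domain.com/")); // null
```

Checking for null before reading index 1 avoids a TypeError when the string does not have the expected shape.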

Regex to match only when certain characters follow a string

I need to find a string that contains "script" with as many characters before or after, and enclosed in < and >. I can do this with:<*script.*>
I also want to match only when that string is NOT followed by a <
The closest I've come, so far, is with this: (<*script.*>)([^=?<*]*)$
However, that will fail for something like <script></script> because the last > isn't followed by a < (so it doesn't match).
How can I check whether only the first > is followed by < or not?
For example,
<script> abc () ; </script> MATCH
<< ScriPT >abc (”XXX”);//<</ ScriPT > MATCH
<script></script> DON'T MATCH
And, a case that I still am working on:
<script/script> DON'T MATCH
Thanks!
You were close with your Regex. You just needed to make your first query non-greedy using a ? after the second *. Try this out:
(?i)<*\s*script.*?>[^<]+<*[^>]+>
There is an app called Expresso that really helps with designing Regex strings. Give it a shot.
Explanation: Without the ? non-greedy argument, your second * before the first > makes the search go all the way to the end of the string and grab the > at the end right at that point. None of the other stuff in your query was even being looked at.
EDIT: Added (?i) at the beginning for case-insensitivity. If you want a javascript specific case-insensitive regex, you would do that like this:
/<*\s*script.*?>[^<]+<*[^>]+>/i
I noticed you have parenthesis in your regex to make groups but you didn't specifically say you were trying to capture groups. Do you want to capture what's between the <script> and </script>? If so, that would be:
/<*\s*script.*?>([^<]+)<*[^>]+>/i
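A short sketch of the greedy versus non-greedy behaviour described above, followed by the JavaScript pattern from this answer applied to the question's test cases. The variable names are illustrative, and the second test string uses plain ASCII quotes instead of the typographic ones in the question.

```javascript
const sample = "<script></script>";

// Greedy: .* runs to the end of the string, then backs up to the last ">".
const greedy = sample.match(/<script.*>/)[0];

// Non-greedy: .*? stops at the first ">" it can.
const lazy = sample.match(/<script.*?>/)[0];

console.log(greedy); // <script></script>
console.log(lazy);   // <script>

// Pattern from the answer above, checked against the question's examples.
const pattern = /<*\s*script.*?>[^<]+<*[^>]+>/i;
const cases = [
  "<script> abc () ; </script>",
  '<< ScriPT >abc ("XXX");//<</ ScriPT >',
  "<script></script>",
  "<script/script>"
];
const results = cases.map(function (s) { return pattern.test(s); });
console.log(results); // [ true, true, false, false ]
```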
If I understand what you are looking for give this a try:
regex = "<\s*script\s*>([^<]+)<"
Here is an example in Python:
import re
textlist = ["<script>show this</script>","<script></script>"]
# raw string so that \s reaches the regex engine unchanged;
# the trailing < requires the captured text to be followed by a <
regex = r"<\s*script\s*>([^<]+)<"
for text in textlist:
    thematch = re.search(regex, text, re.IGNORECASE)
    if thematch:
        print ("match found:")
        print (thematch.group(1))
    else:
        print ("no match sir!")
Explanation:
start with < then possible spaces, the word script, possible spaces, a >
then capture all (at least 1) non < and make sure that's followed by a <
Hope that helps!
This would be better solved by using the substring() and/or indexOf() JavaScript methods.

Regex Help for content between two strings (javascript)

Hoping someone might help. I have a string formatted like the example below:
Lipsum text as part of a paragraph here, yada. |EMBED|{"content":"foo"}|/EMBED|. Yada and the text continues...
What I am looking for is a Javascript RegEx to capture the content between the |EMBED||/EMBED| 'tags', run a function on that content, and then to replace the entire |EMBED|...|/EMBED| string with the return of that function.
The catch is that I may have multiple |EMBED| blocks within a larger string. For example:
Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.
I need the RegEx to capture and process each |EMBED| block separately, since the content contained within will be similar, but unique.
My initial thought is that I could just have a RegEx that captures the first iteration of the |EMBED| block, and the function which replaces this |EMBED| block is either part of a loop or recursion to continuously find the next block and replace it, until no more blocks are found in the string.
...but this seems expensive. Is there a more elegant way?
You can use String.prototype.replace to replace a substring found via a regular expression with a modified version of the match using a mapping function, e.g.:
var input = 'Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.'
var output = input.replace(/\|EMBED\|(.*?)\|\/EMBED\|/g, function(match, p1) {
return p1.toUpperCase()
})
console.log(output) // "Yabba...{"CONTENT":"FOO"}. Dabba-do...{"CONTENT":"YO"}."
Make sure that you use a non-greedy selector .*? to select the content between the delimiters to allow multiple replacements per string.
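To run a function on each block's content, as the question asks, the callback can parse the captured text first. This is only a sketch: renderEmbed and replaceEmbeds are hypothetical names, and it assumes every block contains valid JSON with a content property.

```javascript
// Hypothetical handler: turns the parsed embed object into replacement text.
function renderEmbed(data) {
  return "[" + data.content + "]";
}

function replaceEmbeds(text) {
  return text.replace(/\|EMBED\|(.*?)\|\/EMBED\|/g, function (match, json) {
    try {
      return renderEmbed(JSON.parse(json));
    } catch (e) {
      // Leave the block untouched if its content is not valid JSON.
      return match;
    }
  });
}

const embedSample = 'Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.';
console.log(replaceEmbeds(embedSample)); // Yabba...[foo]. Dabba-do...[yo].
```

A single replace call with the g flag visits every block once, so no explicit loop or recursion is needed.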
This is the code which iterates through the matches of the regex:
var str = 'Lipsum text as part of a paragraph here, yada. |EMBED|{"content":"foo"}|/EMBED|. Yada and the text continues...';
var rx = /\|EMBED\|(.*?)\|\/EMBED\|/gi; // non-greedy, so each block is matched separately
var match;
while (true)
{
    match = rx.exec(str); // with the g flag, exec() continues from the previous match
    if (!match)
        break;
    console.log(match[1]); //match[1] is the content between "the tags"
}

Regular expression to match a string which is NOT matched by a given regexp

I've been looking through some answers here, and I can't find a solution to my problem:
I have this regexp which matches everything inside an HTML span tag, including contents:
<span\b[^>]*>(.*?)</span>
and I want to find a way to make a search in all the text, except for what is matched with that regexp.
For example, if my text is:
var text = '...for there is a class of <span class="highlight">guinea</span> pigs which...'
... then the regexp would match:
<span class="highlight">guinea</span>
and I want to be able to make a regexp such that if I search for "class", regexp will match "...for there is a class of..."
and will not match inside the tag, like in
"... class="highlight"..."
The word to be matched ("class") might be anywhere within the text. I've tried
(?!<span\b[^>]*>(.*?)</span>)class
but it keeps searching inside tags as well.
I want to find a solution using only regexp, not dealing with DOM nor JQuery. Thanks in advance :).
Although I wouldn't recommend this, I would do something like below
(class)(?:(?=.*<span\b[^>]*>))|(?:(?<=<\/span>).*)(class)
You can see this in action here
Rubular Link for this regex
You can capture your matches from the groups and work with them as needed. If you can, use a HTML parser and then find matches from the text element.
It's not pretty, but if I get you right, this should do what you want. It's done with a single RegEx, but js can't (to my knowledge) extract the result without joining the results in a loop.
The RegEx: /(?:<span\b[^>]*>.*?<\/span>)|(.)/g
Example js code:
var str = '...for there is a class of <span class="highlight">guinea</span> pigs which...',
    pattern = /(?:<span\b[^>]*>.*?<\/span>)|(.)/g,
    match,
    res = '';
match = pattern.exec(str);
while( match != null )
{
    // match[1] is undefined when the span alternative matched, so skip it
    if (match[1] !== undefined) res += match[1];
    match = pattern.exec(str);
}
document.writeln('Result:' + res);
In English: Do a non capturing test against your tag-expression or capture any character. Do this globally to get the entire string. The result is a capture group for each character in your string, except the tag. As pointed out, this is ugly - can result in a serious number of capture groups - but gets the job done.
If you need to send it in and retrieve the result in one call, I'd have to agree with previous contributors - It can't be done!
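The same alternation idea can be turned around to locate a word outside the tags: let the first alternative consume each span, and capture the word only in the second. This is a sketch under stated assumptions; findOutsideSpans is a hypothetical name, spans are assumed not to nest, and each span is assumed to sit on one line (the dot does not match line breaks).

```javascript
// Hypothetical helper: returns the start index of every occurrence of
// `word` that is not inside a <span ...>...</span> element.
function findOutsideSpans(text, word) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp("<span\\b[^>]*>.*?</span>|(" + escaped + ")", "g");
  const positions = [];
  for (const m of text.matchAll(pattern)) {
    // Group 1 is only set when the word alternative matched.
    if (m[1] !== undefined) positions.push(m.index);
  }
  return positions;
}

const spanText = '...for there is a class of <span class="highlight">guinea</span> pigs which...';
console.log(findOutsideSpans(spanText, "class")); // [ 18 ]
```

The occurrence of class inside the tag's attribute is swallowed by the span alternative, so it never reaches the capture group.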

Javascript Regular expression to remove unwanted <br>, &nbsp;

I have a JS string like this
<div id="grouplogo_nav"><br> <ul><br> <li><a class="group_hlfppt" target="_blank" href="http://www.hlfppt.org/">&nbsp;</a></li><br> </ul><br> </div>
I need to remove all <br> and &nbsp; that are only between > and <. I tried to write a regular expression, but didn't get it right. Does anybody have a solution?
EDIT:
Please note I want to remove only the tags between > and <
Avoid using regex on html!
Try creating a temporary div from the string, and using the DOM to remove any br tags from it. This is much more robust than parsing html with regex, which can be harmful to your health:
var tempDiv = document.createElement('div');
tempDiv.innerHTML = mystringwithBRin;
var nodes = tempDiv.childNodes;
for(var nodeId=nodes.length-1; nodeId >= 0; --nodeId) {
    // tagName is upper case for HTML elements; only direct children are checked here
    if(nodes[nodeId].tagName === 'BR') {
        tempDiv.removeChild(nodes[nodeId]);
    }
}
var newStr = tempDiv.innerHTML;
Note that we iterate in reverse over the child nodes so that the node IDs remain valid after removing a given child node.
http://jsfiddle.net/fxfrt/
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
... where /.../ denotes a regular expression, ^ denotes start of string, (&nbsp;|<br>) denotes "&nbsp; or <br>", and + denotes "one or more occurrences of the previous expression". And then simply replace that full match with an empty string.
s.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g,'$1$2');
Don't use this in production. See the answer from Phil H.
Edit: I try to explain it a bit and hope my english is good enough.
Basically we have two different kinds of parentheses here. The first pair and third pair () are normal parentheses. They are used to remember the characters that are matched by the enclosed pattern and group the characters together. For the second pair, we don't need to remember the characters for later use, so we disable the "remember" functionality by using the form (?:) and only group the characters to make the + work as expected. The + quantifier means "one or more occurrences", so &nbsp; or <br> must be there one or more times. The last part (\s?<) matches a whitespace character (\s), which can be missing or occur one time (?), followed by the character <. $1 and $2 are a kind of variable that is replaced by the remembered characters of the first and third parentheses.
MDN provides a nice table, which explains all the special characters.
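A minimal sketch that applies the pattern explained above to the string from the question, assuming the entities appear literally as &nbsp; in the string. The variable names are illustrative.

```javascript
const html = '<div id="grouplogo_nav"><br> <ul><br> <li><a class="group_hlfppt" target="_blank" href="http://www.hlfppt.org/">&nbsp;</a></li><br> </ul><br> </div>';

// Group 1 keeps the ">" and group 2 keeps the optional whitespace plus "<";
// the run of &nbsp; or <br> between them is dropped.
const stripped = html.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g, "$1$2");

console.log(stripped);
// <div id="grouplogo_nav"> <ul> <li><a class="group_hlfppt" target="_blank" href="http://www.hlfppt.org/"></a></li> </ul> </div>
```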
You need to replace globally. Also don't forget that the <br> can be written in its self-closed form, <br />. Try this:
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/g, '');
This worked for me; please note the m flag for multi-line input:
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/gm, '');
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
hope this helps
