Extracting text that sits between two pairs of special characters - javascript

I am trying to extract a string from a sentence where the string is wrapped in the HTML tags <b></b>, which are themselves wrapped in parentheses ( ).
I can do this with the following code:
const regExp = /\(([^)]+)\)/
// fetch the string within the first pair of parentheses (capture group 1)
let string = regExp.exec('This is some (<b>super cool</b>) text I have here')[1]
// output = '<b>super cool</b>'
// remove the HTML tags
let string2 = string.replace(/<[^>]*>?/gm, '')
// output = 'super cool'
The problem is that I sometimes have sentences with multiple sets of parentheses. The code above only extracts the first set, and a given set may or may not contain the <b></b> tags.
For example, the string
This is (some) (<b>super cool</b>) text I have (here)
will return some using the same code above, but what I want is to return super cool
How can I traverse the entire string to extract only the text that sits within (<b> and </b>)?
EDIT
I forgot to mention (apologies), there may be text that comes in between the closing tag </b> and the closing parenthesis ). For example
This is some (<b>super cool</b> groovy) text I have here
Which adds a bit of complexity (otherwise I could use split() and pop()).

You could use this regExp instead: /(?<=\(<b>)(.*?)(?=<\/b>\))/ which will capture everything between the first (<b> and </b>) encountered.
If you want to capture all instances, just add the global flag /g : /(?<=\(<b>)(.*?)(?=<\/b>\))/g
Also with this method you won't need to do a string.replace() afterwards, saving you another operation.
const regExp = /(?<=\(<b>)(.*?)(?=<\/b>\))/
const str = 'This is some (<b>super cool</b>) text I have here'
console.log(str.match(regExp)[0])
// --> super cool
EDIT: Following OP's edit, if some text can come between the closing tag </b> and the closing ), just change your regExp to: /(?<=\(<b>)(.*?)(?=\))/, which will capture everything between the first (<b> and ) encountered.
But then you will also need to string.replace('</b>', '') to remove the closing </b> tag.
const regExp = /(?<=\(<b>)(.*?)(?=\))/
const str = 'This is some (<b>super cool</b> groovy) text I have here'
console.log(str.match(regExp)[0].replace('</b>', ''))
// --> super cool groovy
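For sentences with several parenthesised groups, one option is to match the whole "(<b>...</b>...)" shape with capture groups and collect every hit with matchAll, which also avoids lookbehind. This is only a sketch: the helper name extractBoldParens is made up for illustration, and it assumes the <b> tag opens immediately after the parenthesis and that the group contains no nested parentheses.

```javascript
// Hypothetical helper (not from the original answer): returns the text of
// every "(<b>...</b> optional trailing text)" group found in the input.
function extractBoldParens(input) {
  // group 1: text inside <b>...</b>; group 2: anything up to the closing ")"
  const pattern = /\(<b>(.*?)<\/b>([^)]*)\)/g;
  return Array.from(input.matchAll(pattern), (m) => m[1] + m[2]);
}

console.log(extractBoldParens('This is (some) (<b>super cool</b> groovy) text I have (here)'));
// --> ['super cool groovy']
```

Groups such as (some) and (here) are skipped because they do not start with (<b>.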

This works for me. Instead of a regex, try split:
const string = 'This is (some) (<b>super cool</b>) text I have (here)';
const str = string.split('<b>').pop().split('</b>')[0];
console.log(str);

Related

Is it possible to move substrings to a specific location with RegEx?

Background: I used quill.js to get some rich text input. The result I want is quite similar to HTML, so I went with the quill.container.firstChild.innerHTML approach instead of actually serializing the data. But when it comes to anchors, instead of
<a href="test.html">Anchor</a>
I actually want
Anchor{{link:test.html}}
With .replace() method I easily got {{link:test.html}}Anchor</a> but I need to put the link description after the Anchor text. Is there a way to swap {{link:test.html}} with the next </a> so I can get the desired result? There can be multiple anchors in the string, like:
str = 'This is <a href="test1.html">a test</a>. And <a href="test2.html">another one</a> here.'
I would like it to become:
str = 'This is a test{{link:test1.html}}. And another one{{link:test2.html}} here.'
You could also use DOM methods. The DOM is a better HTML parser than regex. This is a fairly simple replaceWith:
str = 'This is <a href="test1.html">a test</a>. And <a href="test2.html">another one</a> here.'
var div = document.createElement('div');
div.innerHTML = str;
// replace each anchor element with plain text: its text content followed by the link tag
div.querySelectorAll('a').forEach(a => {
  a.replaceWith(`${a.textContent}{{link:${a.getAttribute('href')}}}`)
})
console.log(div.innerHTML)
Yes, you can use capture groups and placeholders in the replacement string, provided it really is in exactly the format you've shown:
const str = 'This is <a href="test1.html">a test</a>. And <a href="test2.html">another one</a> here.';
const result = str.replace(/<a href="([^"]+)">([^<]+)<\/a>/g, "$2{{link:$1}}");
console.log(result);
This is very fragile, which is why famously you don't use regular expressions to parse HTML. For instance, it would fail with this input string:
const str = 'This is <a href="test1.html">a test <span>blah</span></a>. And <a href="test2.html">another one</a> here.';
...because of the <span>blah</span>.
But if the format is as simple and consistent as you appear to be getting from quill.js, you can apply a regular expression to it.
That said, if you're doing this in a browser or otherwise have a DOM parser available to you, use the DOM as charlietfl demonstrates.
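If no DOM is available, a somewhat more tolerant regex variant can use a replacement function instead of a replacement string. The sketch below is an illustration only: anchorsToLinkTags is a made-up name, and the code assumes double-quoted href attributes and no anchors nested inside anchors. It remains fragile in the same way as any regex applied to HTML.

```javascript
// Hypothetical helper: turns <a href="X">text</a> into text{{link:X}}.
function anchorsToLinkTags(html) {
  // group 1: href value; group 2: everything between <a ...> and </a>
  const anchor = /<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)<\/a>/g;
  return html.replace(anchor, (whole, href, inner) => {
    // drop any tags nested inside the anchor, keeping only their text
    const text = inner.replace(/<[^>]*>/g, '');
    return `${text}{{link:${href}}}`;
  });
}

console.log(anchorsToLinkTags('And <a href="test2.html" target="_blank">another <i>one</i></a> here.'));
// --> And another one{{link:test2.html}} here.
```

Using a function as the second argument to replace makes it possible to post-process each captured group before it is put back into the string.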

Regex Help for content between two strings (javascript)

Hoping someone might help. I have a string formatted like the example below:
Lipsum text as part of a paragraph here, yada. |EMBED|{"content":"foo"}|/EMBED|. Yada and the text continues...
What I am looking for is a Javascript RegEx to capture the content between the |EMBED||/EMBED| 'tags', run a function on that content, and then to replace the entire |EMBED|...|/EMBED| string with the return of that function.
The catch is that I may have multiple |EMBED| blocks within a larger string. For example:
Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.
I need the RegEx to capture and process each |EMBED| block separately, since the content contained within will be similar, but unique.
My initial thought is that I could just have a RegEx that captures the first iteration of the |EMBED| block, and the function which replaces this |EMBED| block is either part of a loop or recursion to continuously find the next block and replace it, until no more blocks are found in the string.
...but this seems expensive. Is there a more elegant way?
You can use String.prototype.replace to replace a substring found via a regular expression with a modified version of the match using a mapping function, e.g.:
var input = 'Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.'
var output = input.replace(/\|EMBED\|(.*?)\|\/EMBED\|/g, function(match, p1) {
return p1.toUpperCase()
})
console.log(output) // "Yabba...{"CONTENT":"FOO"}. Dabba-do...{"CONTENT":"YO"}."
Make sure that you use a non-greedy selector .*? to select the content between the delimiters to allow multiple replacements per string.
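Since the block contents in the question look like JSON, the mapping function could parse each block before rendering it. The following is a sketch under that assumption; renderEmbeds and the render callback are hypothetical names, and blocks that fail to parse are left untouched here, which is just one possible policy.

```javascript
// Hypothetical helper: replaces every |EMBED|...|/EMBED| block with the
// result of calling render() on the parsed JSON inside it.
function renderEmbeds(text, render) {
  return text.replace(/\|EMBED\|(.*?)\|\/EMBED\|/g, (whole, body) => {
    try {
      return render(JSON.parse(body));
    } catch (err) {
      return whole; // not valid JSON: keep the original block as-is
    }
  });
}

const rendered = renderEmbeds(
  'Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.',
  (data) => `[${data.content}]`
);
console.log(rendered);
// --> Yabba...[foo]. Dabba-do...[yo].
```

Each block is handled in a single pass over the string, so no explicit loop or recursion is needed.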
This code iterates through all matches of the regex:
var str = 'Lipsum text as part of a paragraph here, yada. |EMBED|{"content":"foo"}|/EMBED|. Yada and the text continues...';
// non-greedy (.*?) so that each |EMBED| block is matched separately
var rx = /\|EMBED\|(.*?)\|\/EMBED\|/gi;
var match;
// exec() returns null once there are no more matches
while ((match = rx.exec(str)) !== null) {
  console.log(match[1]); // match[1] is the content between the "tags"
}
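The same iteration can be written with String.prototype.matchAll, and doing so side by side with a greedy pattern shows why the quantifier matters when a string holds more than one block. The sample string and variable names below are made up for illustration.

```javascript
// Illustrative input with two blocks (not from the original question).
const sample = 'A |EMBED|one|/EMBED| B |EMBED|two|/EMBED| C';

// Lazy quantifier: stops at the first closing delimiter after each opener.
const lazyBodies = Array.from(sample.matchAll(/\|EMBED\|(.*?)\|\/EMBED\|/g), (m) => m[1]);

// Greedy quantifier: runs to the last closing delimiter in the string.
const greedyBodies = Array.from(sample.matchAll(/\|EMBED\|(.*)\|\/EMBED\|/g), (m) => m[1]);

console.log(lazyBodies);   // --> ['one', 'two']
console.log(greedyBodies); // --> ['one|/EMBED| B |EMBED|two']
```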

Regular expression to match a string which is NOT matched by a given regexp

I've been hovering around some answers here, and I can't find a solution to my problem:
I have this regexp which matches everything inside an HTML span tag, including contents:
<span\b[^>]*>(.*?)</span>
and I want to find a way to make a search in all the text, except for what is matched with that regexp.
For example, if my text is:
var text = '...for there is a class of <span class="highlight">guinea</span> pigs which...'
... then the regexp would match:
<span class="highlight">guinea</span>
and I want to be able to make a regexp such that if I search for "class", regexp will match "...for there is a class of..."
and will not match inside the tag, like in
"... class="highlight"..."
The word to be matched ("class") might be anywhere within the text. I've tried
(?!<span\b[^>]*>(.*?)</span>)class
but it keeps searching inside tags as well.
I want to find a solution using only regexp, not dealing with DOM nor JQuery. Thanks in advance :).
Although I wouldn't recommend this, I would do something like below
(class)(?:(?=.*<span\b[^>]*>))|(?:(?<=<\/span>).*)(class)
You can capture your matches from the groups and work with them as needed. If you can, use an HTML parser and then find matches in the text nodes.
It's not pretty, but if I understand you correctly, this should do what you want. It's done with a single RegEx, but JS can't (to my knowledge) extract the result without joining the matches in a loop.
The RegEx: /(?:<span\b[^>]*>.*?<\/span>)|(.)/g
Example js code:
var str = '...for there is a class of <span class="highlight">guinea</span> pigs which...',
    pattern = /(?:<span\b[^>]*>.*?<\/span>)|(.)/g,
    match,
    res = '';
while ((match = pattern.exec(str)) !== null) {
  // match[1] is undefined when the span alternative matched, so skip it
  if (match[1] !== undefined) {
    res += match[1];
  }
}
document.writeln('Result:' + res);
In English: try a non-capturing match against your tag expression, or else capture any single character. Do this globally to cover the entire string. The result is one match per character outside the tag, each captured in group 1, plus one match without a capture for each whole tag. As pointed out, this is ugly, since it can produce a very large number of matches, but it gets the job done.
If you need to send it in and retrieve the result in one call, I'd have to agree with previous contributors - It can't be done!
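The same "match the tag or capture the target" alternation can also look for the word directly rather than for single characters. The sketch below is an assumption-laden illustration: findOutsideSpans is a hypothetical name, and it only skips complete, non-nested <span>...</span> elements on a single line.

```javascript
// Hypothetical helper: returns the start index of every occurrence of
// `word` that is not inside a complete <span ...>...</span> element.
function findOutsideSpans(text, word) {
  // escape regex metacharacters so `word` is matched literally
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`<span\\b[^>]*>.*?</span>|(${escaped})`, 'g');
  const positions = [];
  for (const m of text.matchAll(pattern)) {
    // group 1 is only set when the word alternative matched
    if (m[1] !== undefined) positions.push(m.index);
  }
  return positions;
}

const text = '...for there is a class of <span class="highlight">guinea</span> pigs which...';
console.log(findOutsideSpans(text, 'class'));
// --> [18]
```

The "class" inside the span's attribute is consumed by the first alternative, so it never reaches the capture group.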

Match text not inside span tags

Using Javascript, I'm trying to wrap span tags around certain text on the page, but I don't want to wrap tags around text already inside a set of span tags.
Currently I'm using:
html = $('#container').html();
var regex = /([\s| ]*)(apple)([\s| ]*)/g;
html = html.replace(regex, '$1<span class="highlight">$2</span>$3');
It works but if it's used on the same string twice or if the string appears in another string later, for example 'a bunch of apples' then later 'apples', I end up with this:
<span class="highlight">a bunch of <span class="highlight">apples</span></span>
I don't want it to replace 'apples' the second time because it's already inside span tags.
It should match 'apples' here:
Red apples are my <span class="highlight">favourite fruit.</span>
But not here:
<span class="highlight">Red apples are my favourite fruit.</span>
I've tried using this but it doesn't work:
([\s| ]*)(apples).*(?!</span)
Any help would be appreciated. Thank you.
First off, you should know that parsing HTML with regex is generally considered a bad idea; a DOM parser is usually recommended. With this disclaimer, I will show you a simple regex solution.
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
We can solve it with a beautifully-simple regex:
<span.*?<\/span>|(\bapples\b)
The left side of the alternation | matches complete <span... /span> tags. We will ignore these matches. The right side matches and captures apples to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This program shows how to use the regex (see the results in the right pane of the online demo). Please note that in the demo I replaced with [span] instead of <span> so that the result would show in the browser (which interprets the html):
var subject = 'Red apples are my <span class="highlight">favourite apples.</span>';
var regex = /<span.*?<\/span>|(\bapples\b)/g;
var replaced = subject.replace(regex, function(m, group1) {
  // group1 is undefined (or "" in some old engines) when a whole span matched: keep it unchanged
  if (!group1) return m;
  return "<span class=\"highlight\">" + group1 + "</span>";
});
document.write("<br>*** Replacements ***<br>");
document.write(replaced);
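Wrapped in a function, the same technique can be applied repeatedly without nesting highlights, which was the original complaint. This is a sketch only: highlightOutsideSpans is a hypothetical name, the word is assumed to contain no regex metacharacters, and nested spans are not handled.

```javascript
// Hypothetical helper: wraps `word` in a highlight span unless it already
// sits inside a complete <span ...>...</span> element.
function highlightOutsideSpans(html, word) {
  const pattern = new RegExp(`<span\\b[^>]*>.*?</span>|\\b(${word})\\b`, 'g');
  return html.replace(pattern, (whole, hit) =>
    // `hit` is undefined when an existing span matched: leave it alone
    hit === undefined ? whole : `<span class="highlight">${hit}</span>`
  );
}

const once = highlightOutsideSpans(
  'Red apples are my <span class="highlight">favourite apples.</span>',
  'apples'
);
const twice = highlightOutsideSpans(once, 'apples');
console.log(once);
// --> Red <span class="highlight">apples</span> are my <span class="highlight">favourite apples.</span>
console.log(once === twice);
// --> true
```

Running the function a second time changes nothing, because every earlier highlight is now consumed by the span alternative.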
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...

Removing non-break-spaces in JavaScript

I am having trouble removing spaces from a string. First I am converting the div to text with text() to remove the tags (which works), and then I'm trying to remove the "&nbsp;" part of the string, but it won't work. Any idea what I'm doing wrong?
newStr = $('#myDiv').text();
newStr = newStr.replace(/&nbsp;/g, '');
$('#myText').val(newStr);
<html>
<div id = "myDiv"><p>remove&nbsp;space</p></div>
<input type = "text" id = "myText" />
</html>
When you use the text function, you're not getting HTML, but text: the entities have been changed to spaces.
So simply replace spaces:
var str = " a     b   ", // bunch of NBSPs
newStr = str.replace(/\s/g,'');
console.log(newStr)
If you want to replace only the spaces coming from &nbsp;, do the replacement before the conversion to text:
newStr = $($('#myDiv').html().replace(/&nbsp;/g,'')).text();
.text()/textContent do not contain HTML entities (such as &nbsp;); these are returned as literal characters. Here's a regular expression using the non-breaking space Unicode escape sequence:
var newStr = $('#myDiv').text().replace(/\u00A0/g, '');
$('#myText').val(newStr);
It is also possible to use a literal non-breaking space character instead of the escape sequence in the Regex, however I find the escape sequence more clear in this case. Nothing that a comment wouldn't solve, though.
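To make the difference between the two replacements concrete, here is a small DOM-free sketch. The sample string is made up; it stands in for what .text() would return for markup containing one &nbsp; entity.

```javascript
// Illustrative text as .text() might return it: \u00A0 is the character
// that the &nbsp; entity decodes to.
const raw = 'remove\u00A0space and keep this';

// Removes only non-breaking spaces; ordinary spaces survive.
const withoutNbsp = raw.replace(/\u00A0/g, '');

// \s matches every whitespace character, including U+00A0.
const withoutAnyWhitespace = raw.replace(/\s/g, '');

console.log(withoutNbsp);          // --> removespace and keep this
console.log(withoutAnyWhitespace); // --> removespaceandkeepthis
```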
It is also possible to use .html()/innerHTML to retrieve the HTML containing HTML entities, as in @Dystroy's answer.
Below is my original answer, where I misinterpreted OP's use case. I'll leave it here in case anyone needs to remove &nbsp; from DOM elements' text content.
[...] However, be aware that re-setting the .html()/innerHTML of an element means trashing out all of the listeners and data associated with it.
So here's a recursive solution that only alters the text content of text nodes, without reparsing HTML nor any side effects.
function removeNbsp($el) {
  $el.contents().each(function() {
    if (this.nodeType === 3) {
      // text node: strip non-breaking spaces in place
      this.nodeValue = this.nodeValue.replace(/\u00A0/g, '');
    } else {
      // any other node: recurse into its children
      removeNbsp( $(this) );
    }
  });
}
removeNbsp( $('#myDiv') );
