Regex replace text outside html tag

I'm working on an autocomplete component that highlights all occurrences of the searched text. I split the input text into words and wrap every occurrence of those words in a <strong> tag.
My code looks like this:
inputText = 'marriott st';
text = "Marriott east side";
textSearch = inputText.split(' ');
for (var i in textSearch) {
  var regexSearch = new RegExp('(?!<\/?strong>)' + textSearch[i], "i");
  var textReplaced = regexSearch.exec(text);
  text = text.replace(regexSearch, '<strong>' + textReplaced + '</strong>');
}
For example, given the result: "marriott east side"
And the input text: "marriott st"
I should get
<strong>marriott</strong> ea<strong>st</strong> side
And I'm getting
<<strong>st</strong>rong>marriot</<strong>st </strong>rong>ea<<strong>st</strong> rong>s</strong> side
Any ideas how I can improve my regex to avoid matching occurrences inside the HTML tags? Thanks
/(?!<\/?strong>)st/

I would process the string in one pass. You can create one regular expression out of the search string:
var search_pattern = '(' + inputText.replace(/\s+/g, '|') + ')';
// `search_pattern` is now `(marriott|st)`
text = text.replace(RegExp(search_pattern, 'gi'), '<strong>$1</strong>');
You could even split the search string first, sort the words by length, and combine them, so that longer matches take precedence.
You should definitely escape special regex characters inside the search string; see "How to escape regular expression special characters using javascript?".
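The three suggestions above (one pass, longer words first, escaped metacharacters) could be combined roughly as follows. This is only a sketch: the helper names `escapeRegExp` and `highlight` are made up for illustration, and it assumes `text` is plain text that contains no markup yet.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap every occurrence of any search word in <strong>.
// Assumes `text` is plain text (no tags in it yet).
function highlight(text, inputText) {
  var words = inputText.split(/\s+/).filter(Boolean);
  if (words.length === 0) return text;
  // Longer words first, so they win over shorter words at the same position.
  words.sort(function (a, b) { return b.length - a.length; });
  var pattern = new RegExp('(' + words.map(escapeRegExp).join('|') + ')', 'gi');
  return text.replace(pattern, '<strong>$1</strong>');
}

console.log(highlight('Marriott east side', 'marriott st'));
// <strong>Marriott</strong> ea<strong>st</strong> side
```

Because the replacement happens in a single pass, the inserted <strong> tags are never scanned again, so the "st" in "strong" cannot be matched.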

Before each search, I suggest getting (or saving) the original, unhighlighted string and working on that each time. In your current case, that means you could replace all '<strong>' and '</strong>' tags with '' before searching. This will help keep your regex simple, especially if you decide to add other HTML tags and formatting in the future.
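A minimal sketch of that idea, assuming the only markup ever added is the <strong> highlighting itself. The helper name `stripHighlight` and the sample strings are hypothetical:

```javascript
// Hypothetical helper: remove highlighting left over from an earlier search.
function stripHighlight(html) {
  return html.replace(/<\/?strong>/g, '');
}

var previous = '<strong>Marriott</strong> ea<strong>st</strong> side';
var clean = stripHighlight(previous);
console.log(clean); // Marriott east side

// A new search can now run against tag-free text.
var rehighlighted = clean.replace(/(east)/gi, '<strong>$1</strong>');
console.log(rehighlighted); // Marriott <strong>east</strong> side
```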

Related

JS conditional RegEx that removes different parts of a string between two delimiters

I have a string of text with HTML line breaks. Some of the <br> immediately follow a number between two delimiters «...» and some do not.
Here's the string:
var str = ("«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>");
I’m looking for a conditional regex that’ll remove the number and delimiters (ex. «1») as well as the line break itself without removing all of the line breaks in the string.
So for instance, at the beginning of my example string, when the script encounters »<br> it’ll remove everything between and including the first « to the left, to »<br> (ex. «1»<br>). However it would not remove «2»some text<br>.
I’ve had some help removing the entire number/delimiters (ex. «1») using the following:
var regex = new RegExp(UsedKeys.join('|'), 'g');
var nextStr = str.replace(/«[^»]*»/g, " ");
I sure hope that makes sense.
Just to be super clear, when the string is rendered in a browser, I’d like to go from this…
«1»
«2»some text
«3»
«4»more text
«5»
«6»even more text
To this…
«2»some text
«4»more text
«6»even more text
Many thanks!

Maybe I'm missing a subtlety here; if so, I apologize. But it seems that you can just replace using the regex /«\d+»<br>/g. This will remove all occurrences of a number between « and » that are immediately followed by <br>:
var str = "«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>"
var newStr = str.replace(/«\d+»<br>/g, '')
console.log(newStr)
To match letters and digits (and underscores), you can use \w instead of \d:
var str = "«a»<br>«b»some text<br>«hel»<br>«4»more text<br>«5»<br>«6»even more text<br>"
var newStr = str.replace(/«\w+?»<br>/g, '')
console.log(newStr)

This snippet assumes that the input within the brackets will always be a number but I think it solves the problem you're trying to solve.
const str = "«1»<br>«2»some text<br>«3»<br>«4»more text<br>«5»<br>«6»even more text<br>";
console.log(str.replace(/(«(\d+)»<br>)/g, ""));
Breaking down /(«(\d+)»<br>)/g:
«(\d+)» matches any brackets containing one or more digits in a row.
If you would prefer to match alphanumeric characters, you could use «(\w+)», or for any characters including symbols you could use «([^»]+)».
<br> matches a line break.
The /g flag matches globally, so that every instance of the substring is found.
Basically we are only removing the bracketed numbers if they are immediately followed by a line break.
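As a quick illustration of the «([^»]+)» variant mentioned above, here is a small sketch. The sample string is made up for this example and is not from the question:

```javascript
// Hypothetical sample: keys may contain letters, digits and symbols.
var str = "«1»<br>«a-1»<br>«2»some text<br>«x.y»more text<br>";

// Remove a «...» key only when it is immediately followed by <br>.
var cleaned = str.replace(/«[^»]+»<br>/g, '');

console.log(cleaned); // «2»some text<br>«x.y»more text<br>
```

Keys that are followed by text are left alone, because the pattern requires <br> directly after the closing ».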

Uppercase for each new word swedish characters and html markup

I was pointed to this post, which does not seem to cover the criteria I have:
Replace a Regex capture group with uppercase in Javascript
I am trying to make a regex that will:
format a string by making the first letter of each word uppercase and the rest of the characters lowercase
ignore HTML markup
accept Swedish characters (åäöÅÄÖ)
Say I've got this string:
<b>app</b>le store östersund
Then I want it to be (changes marked by uppercase characters)
<b>App</b>le Store Östersund
I've been playing around with it and the closest I've got is the following:
(?!([^<])*?>)[åäöÅÄÖ]|\s\b\w
Resulted in
<b>app</b>le Store Östersund
Or this
/(?!([^<])*?>)[åäöÅÄÖ]|\S\b\w/g
Resulted in
<B>App</B>Le store Östersund
Here's a fiddle:
http://refiddle.com/refiddles/598aabef75622d4a531b0000
Any help or advice is much appreciated.

It is not possible to do this with regexp alone, since regexp doesn't understand HTML structure. [*] Instead, we need to process each text node, and carry through our logic for what is the beginning of the word in case a word continues across different text nodes. A character is at start of the word if it is preceded by a whitespace, or if it is at the start of the string and it is either the first text node, or the previous text node ended in whitespace.
function htmlToTitlecase(html, letters) {
  let div = document.createElement('div');
  let re = new RegExp("(^|\\s)([" + letters + "])", "gi");
  div.innerHTML = html;
  let treeWalker = document.createTreeWalker(div, NodeFilter.SHOW_TEXT);
  let startOfWord = true;
  while (treeWalker.nextNode()) {
    let node = treeWalker.currentNode;
    node.data = node.data.replace(re, function(match, space, letter) {
      if (space || startOfWord) {
        return space + letter.toUpperCase();
      } else {
        return match;
      }
    });
    startOfWord = node.data.match(/\s$/);
  }
  return div.innerHTML;
}
console.log(htmlToTitlecase("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
[*] Maybe possible, but even if so, it would be horribly ugly, since it would need to cover an awful amount of corner cases. Also might need a stronger RegExp engine than JavaScript's, like Ruby's or Perl's.
EDIT:
Even if just specifying really simple html tags? The only ones I am actually in need of covering is <b> and </b> at the moment.
This was not specified in the question. The solution is general enough to work for any markup (including simple tags). But...
function simpleHtmlToTitlecaseSwedish(html) {
  return html.replace(/(^|\s)(<\/?b>|)([a-zåäö])/gi, function(match, space, tag, letter) {
    return space + tag + letter.toUpperCase();
  });
}
console.log(simpleHtmlToTitlecaseSwedish("<b>app</b>le store östersund"));
// <b>App</b>le Store Östersund

I have a solution which uses almost only regex. It may not be the most intuitive way to do it, but it should be effective and I find it fun :)
You have to append to the end of your string every lowercase character followed by its uppercase counterpart, like this (it must also be preceded by a space for my regex):
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ
(I don't know which letters are missing, I know nothing about the Swedish alphabet, sorry... I'm counting on you to correct that!)
Then you can use the following regex :
(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$
Replace by :
$1$3
Here is a working javascript code :
// Initialization
var regex = /(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$/g;
var string = "test <b when=\"2>1\">ap<i>p</i></b>le store östersund";
// Processing
result = string + " aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ";
result = result.replace(regex, "$1$3");
// Display result
console.log(result);
Edit : I forgot to handle first word of the string, it's corrected :)

Regex with multiple start and end characters that must be the same

I would like to be able to search for strings inside a special tag in a string in JavaScript. Strings in JavaScript can start with either a " or a ' character.
Here is an example to illustrate what I want to do. My custom tag is called <my-tag>. My regex is /('|")*?<my-tag>((.|\n)[^"']*?)<\/my-tag>*?('|")/g. I use this regex pattern on the following strings:
var a = '<my-tag>Hello World</my-tag>'; //is found as expected
var b = "<my-tag>Hello World" + '</my-tag>'; //is NOT found, this is good!
var c = "<my-tag>Hello World</my-tag>"; //is found as expected
var d = '<my-tag>something "special"</my-tag>'; //here the " char causes a problem
var e = "<my-tag>something 'special'</my-tag>"; //here the ' char causes a problem
It works well with a and c, where it finds the tag with the contained text. It also does not find the text in b, which is what I want. But in cases d and e the tag with its content is not found, due to the occurrence of the " and ' characters. What I want is a regex where " is allowed inside the tag if the string starts with ', and vice versa.
Is it possible to achieve this with one regex, or do I have to work with two separate regular expressions like
/(")*?<my-tag>((.|\n)[^']*?)<\/my-tag>*?(")/g and /(')*?<my-tag>((.|\n)[^"]*?)<\/my-tag>*?(')/g ?

It's not pretty, but I think this would work:
/("<my-tag>((.|\n)[^"]*?)<\/my-tag>"|'<my-tag>((.|\n)[^']*?)<\/my-tag>')/g

You should be able to capture the first match ('|") and reuse it for the closing quote with a backreference. Something like the following:
/('|")<my-tag>.*?<\/my-tag>\1/g
This should make sure to match the same character at the beginning and the end.
But you really shouldn't use regex for parsing HTML.
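To illustrate the backreference idea, here is a sketch run against the five cases from the question, treated as source text. It goes slightly beyond the pattern above by also forbidding the opening quote character inside the tag content (an addition made for this example); the function name `findTaggedLiteral` is hypothetical.

```javascript
// The question's five cases, written as source-code text to scan.
const samples = [
  `'<my-tag>Hello World</my-tag>'`,
  `"<my-tag>Hello World" + '</my-tag>'`,
  `"<my-tag>Hello World</my-tag>"`,
  `'<my-tag>something "special"</my-tag>'`,
  `"<my-tag>something 'special'</my-tag>"`,
];

// Hypothetical helper: return the tag content, or null when nothing matches.
function findTaggedLiteral(source) {
  // (['"]) captures the opening quote; \1 demands the same quote at the end.
  // (?:(?!\1)[\s\S])*? allows any character except the opening quote.
  const re = /(['"])<my-tag>((?:(?!\1)[\s\S])*?)<\/my-tag>\1/;
  const m = re.exec(source);
  return m ? m[2] : null;
}

samples.forEach(function (s) {
  console.log(findTaggedLiteral(s));
});
// Hello World
// null
// Hello World
// something "special"
// something 'special'
```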

regex to match only quotes that aren't in links

Can you tell me how I can use a regex in JavaScript to select quoted text, but not the quotes that are inside a link?
So I don't want to select the quotes that belong to a link's attributes (for example its href).
I want to select only normal quoted text.
I used
result = content.replace(/"(.*?)"/g, "<i>$1</i>");
to replace all quoted text with italic, but it replaces also href quotes
Thanks :)

If you need an adhoc regex solution, you may match and capture tags, and only replace " symbols in other contexts. Defining a tag as <+non-<s up to the first >, we may use
var s = '"replace this" but <div id="not-here"> "and here"</div>';
var re = /(<[^<]*?>)|"(.*?)"/g;
var result = s.replace(re, function (m,g1,g2) {
return g1? g1 : '<i>' + g2 + '</i>';
});
console.log(result);
The (<[^<]*?>)|"(.*?)" matches:
(<[^<]*?>) - Group 1 (g1 later in the callback) that captures <, 0+ symbols other than < as few as possible up to the first >
| - or
"(.*?)" - ", 0+ chars other than a newline as few as possible captured into Group 2 (g2 later) and a ".
In the callback method, Group 1 is checked for a match, and if yes, we just put the tag back into the result, else, replace with the tags.

The simplest answer would be to use:
/[^=]"(.*)"/
instead of
/"(.*?)"/
But that will also skip any ordinary quoted text that happens to be preceded by an = sign, and the match includes the character in front of the opening quote.

Why not only work on the actual text of the element... Like:
var anchors = [],
idx;
anchors = Array.prototype.slice.call(document.getElementsByTagName("a"));
for(idx=0; idx<anchors.length; idx++) {
anchors[idx].innerHTML = anchors[idx].innerHTML.replace(/"([^"]*)"/g, '<i>$1</i>');
}
some text that contains a "quoted" part.
<br/>
more "text" that contains a "quoted" part.
Here we get all anchor elements as an array and, in each anchor's innerHTML, replace every quoted part with an italicized version of itself.

This pattern could be what you're looking for: <.+>.*(\".+\").*</.+>
Used in JavaScript, the following matches "text":
new RegExp('<.+>.*(\".+\").*</.+>', 'g').exec('some "text"')[1]

ASCII character not being recognized in if statement

I am trying to get a string from a html page with jquery and this is what I have.
var text = $(this).text();
var key = text.substring(0,1);
if(key == ' ' || key == '&nbsp;')
key = text.substring(1,2);
text is this: &nbsp;Home
I want to skip the space and/or the &nbsp; entity above, but this code does not work. It only gets text.substring(0,1) instead of text.substring(1,2), because the if statement is not catching it, and I am not sure why. Any help would be super awesome! Thanks!

There are several problems with the code in the question. First, &nbsp; has no special meaning in JavaScript: it is a string literal with six characters. Second, text.substring(1,2) returns simply the second character of text, not all characters from the second one onwards.
Assuming that you wish to remove one leading SPACE or NO-BREAK SPACE (which is what &nbsp; means in HTML; it is not an ASCII character, by the way), then the following code would work:
var first = text.substring(0, 1);
if(first === ' ' || first === '\u00A0') {
text = text.substring(1, text.length);
}
The notation \u00A0 is a JavaScript escape notation for NO-BREAK SPACE U+00A0.
Should you wish to remove multiple spaces at the start, and perhaps at the end too, some modifications are needed. In that case, using a replace operation with regular expression is probably best.
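A sketch of such a replace operation, assuming only SPACE and NO-BREAK SPACE need to be stripped (the function name `stripEdgeSpaces` is made up for this example):

```javascript
// Hypothetical helper: remove runs of SPACE / NO-BREAK SPACE at both ends.
function stripEdgeSpaces(text) {
  return text.replace(/^[ \u00A0]+|[ \u00A0]+$/g, '');
}

console.log(stripEdgeSpaces('\u00A0 Home \u00A0')); // Home
```

Dropping the second alternative (everything after the |) would strip only the leading characters.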

If you want to remove whitespace at the beginning (and end) of a string, you can use the trim function, which also removes no-break spaces:
var myVar = " home";
myVar.trim(); // --> "home"
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/Trim
