Replace surrounded text with "**" with multiple characters javascript - javascript

I have this text:
Hello! this is **Some ** random text that i have to add **html ** format
Some words are surrounded by '**', and I have to replace those with bold HTML tags: "<b>" for the opening pair of asterisks and "</b>" for the closing pair.
I'm not able to search for '**' directly because * is a special character in regular expressions...
Any suggestions?

One possible approach:
const raw = 'Hello! this is **Some ** random text that i have to add **html ** format';
const tagged = raw.replace(/\*{2}([^*]+)\*{2}/g, '<b>$1</b>');
console.log(tagged);
// Hello! this is <b>Some </b> random text that i have to add <b>html </b> format
The trick, as mentioned in comments, is to use backslash to escape the asterisks (those are metacharacters in regex land).
Having said that, I strongly recommend at least considering usage of proper markdown libraries to do markdown stuff AND sanitize your output before injecting it into HTML in one way or another.
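A minimal sketch of the "sanitize first" advice, assuming the input is plain text in which only ** should carry meaning. The helper names escapeHtml and boldToHtml are hypothetical and not part of the original answer; they escape HTML-significant characters before the <b> tags are added.

```javascript
// Hypothetical helper: escape characters that are significant in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: escape first, then turn **...** into <b>...</b>.
function boldToHtml(text) {
  return escapeHtml(text).replace(/\*\*([^*]+)\*\*/g, '<b>$1</b>');
}

console.log(boldToHtml('**Some ** <script> text'));
// <b>Some </b> &lt;script&gt; text
```

Escaping before the replacement means the only tags in the output are the ones the function itself inserts. A real markdown library with a sanitizer is still the safer choice for untrusted input.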

Related

Regex: accurately match bold (**) and italics (*) item(s) from the input

I am trying to parse a markdown content with the use of regex. To grab bold and italic items from the input, I'm currently using a regex:
/(\*\*)(?<bold>[^**]+)(\*\*)|(?<normal>[^`*[~]+)|\*(?<italic>[^*]+)\*/g
Regex101 Link: https://regex101.com/r/2zOMid/1
The problems with this regex are:
if there is a single * inside bold text, the match breaks
if there is a long run like ******* anywhere in between, the match breaks
What I tried:
I tried removing the [^**] part in the bold group, but that broke the bold match: it ran up to the last ** occurrence and included all the ** characters in between
What I want to have:
accurate bold
* allowed inside bold
accurate italics
Language: Javascript
Assumptions:
Bold text wrapped inside **
Italic text wrapped inside *
There was some discussion going on in the chat. Just to mention it: there is no requirement yet on how to deal with escaped characters like \*, so I did not take them into account.
Depending on the desired outcome, I'd pick a two-step solution and keep the patterns simple:
str = str.replace(/\*\*(.+?)\*\*(?!\*)/g,'<b>$1</b>').replace(/\*([^*><]+)\*/g,'<i>$1</i>');
Step 1: Replace bold parts
Replace \*\*(.+?)\*\*(?!\*) with <b>$1</b>
It lazily captures (.+?), one or more characters between **, into $1
and uses a negative lookahead so that the closing ** ends at the outermost *.
Step 2: Now that the number of remaining * is reduced, replace italic parts
Replace the remaining \*([^*><]+)\* with <i>$1</i>
[^*><]+ matches one or more characters that are not *, > or <.
Here is the JS demo at tio.run
Personally, I don't think it's a good idea to rely on the count of the same character to distinguish between kinds of replacement. How it should finally work is a matter of taste.
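As a quick check of the two-step idea, the sketch below wraps the two replacements in a function (the name twoStepMarkdown is hypothetical) and runs it on inputs with a stray * inside bold text and with triple asterisks.

```javascript
// Hypothetical wrapper around the two replacements shown above.
function twoStepMarkdown(str) {
  return str
    .replace(/\*\*(.+?)\*\*(?!\*)/g, '<b>$1</b>') // step 1: bold
    .replace(/\*([^*><]+)\*/g, '<i>$1</i>');      // step 2: italic
}

console.log(twoStepMarkdown('**fi*rst b** and *first i*'));
// <b>fi*rst b</b> and <i>first i</i>

console.log(twoStepMarkdown('a ***x*** b'));
// a <b><i>x</i></b> b
```

In the first example the lone * inside the bold text survives step 2 because the italic pattern refuses to cross the < of the closing </b> tag.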
[^**] will not avoid two consecutive *. It is a character class that is no different from [^*]. The repeated asterisk has no effect.
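A small sketch to confirm this: the two character classes below accept and reject exactly the same sample strings (the variable names are only for illustration).

```javascript
const withDoubled = /^[^**]+$/; // the class from the question
const withSingle = /^[^*]+$/;   // the equivalent, simpler class

const samples = ['plain', 'one*star', 'two**stars'];
for (const s of samples) {
  console.log(s, withDoubled.test(s), withSingle.test(s));
}
// plain true true
// one*star false false
// two**stars false false
```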
The pattern for italic should better come in front of the normal part, which should capture anything that remains. This could even be a sole asterisk (for example) -- the pattern for normal text should allow this.
It will be easier to use split and use the bold/italic pattern for matching the "delimiter" of the split, while still capturing it. All the rest will then be "normal". The downside of split is that you cannot benefit from named capture groups, but they will just be represented by separate entries in the returned array.
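To illustrate the split idea in isolation, here is a sketch with a deliberately simplified pattern (no backslash handling and no whitespace checks, unlike the full regex in this answer). Because the pattern has two capture groups, split returns the pieces in a cycle of three: normal text, mark, formatted text.

```javascript
// Simplified pattern: (mark)(content)(same mark again).
const parts = 'one **two** three *four* five'.split(/(\*\*?)(.+?)\1/);
console.log(parts);
// [ 'one ', '**', 'two', ' three ', '*', 'four', ' five' ]

// Turn the cyclic array into [style, text] pairs.
const tokens = [];
for (let i = 0; i < parts.length; i += 3) {
  if (parts[i]) tokens.push(['normal', parts[i]]);
  if (i + 2 < parts.length) {
    tokens.push([parts[i + 1] === '**' ? 'b' : 'i', parts[i + 2]]);
  }
}
console.log(tokens);
// [ [ 'normal', 'one ' ], [ 'b', 'two' ], [ 'normal', ' three ' ],
//   [ 'i', 'four' ], [ 'normal', ' five' ] ]
```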
I will ignore the other syntax that markdown can have (like you seem to hint at with [ and ~ in your regex). On the other hand, it is important to deal well with backslash, as it is used to escape an asterisk.
Here is the regular expression (link):
(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1
Here is a snippet with two functions:
a function that first splits the input into tokens, where each token is a pair, like ["normal", " this is normal text "] and ["i", "text in italics"]
another function that uses these tokens to generate HTML
The snippet is interactive. Just type the input, and the output will be rendered in HTML using the above sequence.
function tokeniseMarkdown(s) {
    const regex = /(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1/gs;
    const styles = ["i", "b"];
    // Matches follow this cyclic order:
    //   normal text, mark (= "*" or "**"), formatted text, normal text, ...
    const types = ["normal", "mark", ""];
    return s.split(regex).map((match, i, matches) =>
        types[i%3] !== "mark" && match &&
            [types[i%3] || styles[matches[i-1].length-1],
             match.replace(/\\([\\*])/g, "$1")]
    ).filter(Boolean); // Exclude empty matches and marks
}

function tokensToHtml(tokens) {
    const container = document.createElement("span");
    for (const [style, text] of tokens) {
        let node = style === "normal" ? document.createTextNode(text)
                                      : document.createElement(style);
        node.textContent = text;
        container.appendChild(node);
    }
    return container.innerHTML;
}

// I/O management
document.addEventListener("input", refresh);

function refresh() {
    const s = document.querySelector("textarea").value;
    const tokens = tokeniseMarkdown(s);
    document.querySelector("div").innerHTML = tokensToHtml(tokens);
}
refresh();
textarea { width: 100%; height: 6em }
div { font: 22px "Times New Roman" }
<textarea>**fi*rst b** some normal text here **second b** *first i* normal *second i* normal again</textarea><br>
<div></div>
After looking further into negative lookaheads, I came up with this regex:
/\*\*(?<bold>(?:(?!\*\*).)+)\*\*|`(?<code>[^`]+)`|~~(?<strike>(?:(?!~~).)+)~~|\[(?<linkTitle>[^]]+)]\((?<linkHref>.*)\)|(?<normal>[^`[*~]+)|\*(?<italic>[^*]+)\*|(?<tara>[*~]{3,})|(?<sitara>[`[]+)/g
This pretty much works for my input scenarios. If someone has a more optimized regex, please comment.
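The key building block in that regex is the bold part, where (?:(?!\*\*).)+ consumes one character at a time as long as the next two characters are not **. A minimal sketch of just that piece, using an illustrative input string:

```javascript
// Bold part only: any characters, as long as ** does not start here.
const boldRe = /\*\*(?<bold>(?:(?!\*\*).)+)\*\*/g;

const input = 'x **a*b** y **c**';
const found = [...input.matchAll(boldRe)].map((m) => m.groups.bold);

console.log(found);
// [ 'a*b', 'c' ]
```

A single * inside the bold text is allowed because the lookahead only rejects positions where two asterisks start.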
italic: ((?<!\s)\*(?!\s)(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+?|[^\*\*]+?)(?<!\s)\*)
(?<!\s)\*(?!\s) matches the opening * with no whitespace around it,
(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+? matches ** occurring an even number of times, which neutralizes meaningless ** inside italic text.
|[^\*\*]+? means: if there is no match for one or more ** pairs, match any characters other than * (the order of this "or" is important).
(?<!\s)\*) matches the closing * with no whitespace before it.
(?: ... ) is a non-capturing group in JavaScript; you can remove the ?: if you do not need the groups to be non-capturing.
bold:
((?<!\s)\*\*(?!\s)(?:[^\*]+?|(?:[^\*]*(?:(?:\*[^\*]*){2})+?)+?)(?<!\s)\*\*)
Similar to italic, except the order of * pair and other character.
Together you can get:
((?<!\s)\*(?!\s)(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+?|[^\*\*]+?)(?<!\s)\*)|((?<!\s)\*\*(?!\s)(?:[^\*]+?|(?:[^\*]*(?:(?:\*[^\*]*){2})+?)+?)(?<!\s)\*\*)
See the result here: https://regex101.com/r/9gTBpj/1
You can choose the tags depending on the number of asterisks. (1 → italic, 2 → bold, 3 → bold+italic)
function simpleMarkdownTransform(markdown) {
  return markdown
    .replace(/</g, '&lt;') // disallow tags
    .replace(/>/g, '&gt;')
    .replace(
      /(\*{1,3})(.+?)\1(?!\*)/g,
      (match, { length }, text) => {
        // 1 or 3 asterisks: italic; 2 or 3 asterisks: bold
        if (length !== 2) text = text.italics()
        return length === 1 ? text : text.bold()
      }
    )
    .replace(/\n/g, '<br>') // new line
}
Example:
simpleMarkdownTransform('abcd **bold** efgh *italic* ijkl ***bold-italic*** mnop')
// "abcd <b>bold</b> efgh <i>italic</i> ijkl <b><i>bold-italic</i></b> mnop"
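String.prototype.bold() and String.prototype.italics() are deprecated legacy methods. The sketch below keeps the same regex and the same mapping from asterisk count to tags, but builds the tags explicitly. The function name wrapByAsterisks is hypothetical, and the escaping and line-break steps are left out for brevity.

```javascript
// Hypothetical variant: same pattern, explicit tag strings.
function wrapByAsterisks(markdown) {
  return markdown.replace(/(\*{1,3})(.+?)\1(?!\*)/g, (match, stars, text) => {
    if (stars.length === 1) return `<i>${text}</i>`;
    if (stars.length === 2) return `<b>${text}</b>`;
    return `<b><i>${text}</i></b>`; // three asterisks
  });
}

console.log(wrapByAsterisks('abcd **bold** efgh *italic* ijkl ***bold-italic*** mnop'));
// abcd <b>bold</b> efgh <i>italic</i> ijkl <b><i>bold-italic</i></b> mnop
```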

Pure Regex solution for getting text content from a string of HTML in an environment where I cannot rely on document.createElement?

I have strings of HTML and I want to get the text content of the elements, but the environment I'm working in doesn't allow me to create an element and then simply get innerText like:
const span = document.createElement('span');
span.innerHTML = myHtmlString;
const justTheText = span.innerText;
Is it possible to do this with only Regex? I've given it a number of attempts, but never come up with a working solution. The nested nature of the tags leads to me getting 90% working solutions, but I can't find any way to handle that aspect. (Apologies for not having an example of one of my attempts, I'm just revisiting this issue after abandoning it months ago after spending multiple days on it.)
I've also never found a workaround, regex or not, as 99.999% of the time the right answer is to use the code I posted above, and that's exactly the answer that's given.
(I'd also be open to non-regex solutions)
Edit:
Example of HTML String:
<div>
<p class="someclass">
Some plain text
<strong>
and some bold
</strong>
</p>
</div>
Getting the text from a single html element via regex is easy, but I'm not sure there's any way to handle the nesting to get the result: Some plain text and some bold - If there is a way I'm not aware of it, but some of the most advanced features of regex are still beyond my understanding.
You could always get the content of a tag.
From the content, remove the inner tags, then trim the whitespace.
In the example we're using the div tag, but you could also use
any tag with attributes, like the p tag below.
Here is a JS example:
var tag = "div";
// var tag = "p"; // <= try this; works with tags with attributes as well
var rxTagContent = new RegExp( "<" + tag + "(?:\\s*>|\\s+(?=((?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+))\\1>)((?:(?=(<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\4\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>|[\\S\\s]))\\3)*?)</" + tag + "\\s*>", "g" );
var rxRmvInnerTags =
/<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/g;
var rxWspTrim = /\s+/g;
////////////////////////////////////////////////
//
var html =
"<div>\n" +
" <p class=\"someclass\">\n" +
" Some plain text \n" +
" <strong>\n" +
" and some bold\n" +
" </strong>\n" +
" </p>\n" +
"</div>\n";
var match;
while ( match = rxTagContent.exec( html ) )
{
var cont = match[2]; // group 2 is content
var clean = cont.replace( rxRmvInnerTags, "" );
var trim = clean.replace( rxWspTrim, " " );
console.log ("content = " + cont );
console.log ("clean and trim = \n" + trim );
}
This is the expanded, readable version of the constructed Tag Content regex.
Note that this regex and the one to remove the inner tags are
slightly sophisticated. Should you need specific information on
how they work just let me know. I usually show up every few days,
sometimes a week or two depending how many of my comments are
being deleted by administrator whoever ...
Update: Modified regex to avoid matching the closing tag text
if it happens to be inside a CDATA or even if it's part of another
tag's value, or even if it's in invisible content like a script.
For example, this below will match correctly.
Note that the only thing missing is the ability to handle nesting of the tag itself.
JavaScript regexes have no recursion, so that is not possible in a single pattern. A regex can still be used to
find tags and content a piece at a time for a fully custom parse,
but that's a different story.
This though, is going to find the first open tag and the first close tag.
It still can be modified 1 step further to find an un-nested
open / close tag if needed, a simple added assertion is needed.
Also note that this doesn't prevent matching the open tag
if it happens to be inside a CDATA or others as stated above.
This can be avoided but requires expansion of the tag regex and a check within the while() loop to go past these.
Let me know if you may need this ( or I just may add that in a
day or so. I don't want it to be too out of control ), it is possible though.
<tag>
Some content
more
and more
<script>
var xyz;
var tag = "</tag>";
</script>
<![CDATA[ </tag> asdfasdf]]>
</tag>
https://regex101.com/r/Bs4ySe/1
<tag
(?:
\s* >
| \s+
(?=
( # (1 start)
(?:
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
) # (1 end)
)
\1 >
)
( # (2 start)
(?:
(?=
( # (3 start)
<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\4\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
| [\S\s]
) # (3 end)
)
\3
)*?
) # (2 end)
</tag \s* >
The regex example above is very good. Creating groups with () is the key, because then you can pick out the text by itself. I would try a slightly simpler approach that uses recursion to deal with the nesting.
An alternate approach is to use the npm package "cheerio". This is commonly used in web scraping but you could feed it any html. Then methods similar to jQuery can be used to traverse the html and pick out the content
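For simple, trusted markup like the example in the question, the "remove the tags, then collapse the whitespace" idea from the first answer can be sketched far more naively. This is only an illustration under strong assumptions: no script or style content, no comments or CDATA, and no > characters inside attribute values. The function name naiveTextContent is hypothetical.

```javascript
// Naive sketch: NOT a general HTML parser, see the assumptions above.
function naiveTextContent(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // replace every tag-like run with a space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .trim();
}

const sample =
  '<div>\n' +
  '  <p class="someclass">\n' +
  '    Some plain text\n' +
  '    <strong>\n' +
  '      and some bold\n' +
  '    </strong>\n' +
  '  </p>\n' +
  '</div>\n';

console.log(naiveTextContent(sample));
// Some plain text and some bold
```

Because every tag is removed regardless of depth, nesting is not a problem for this particular task. The edge cases listed in the assumptions are what the longer regexes above try to cover.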

regex to match only quotes that aren't in links

Can you tell me how, in JavaScript, I can use a regex to select quoted text, but not the quotes that are inside a link?
So I don't want to select the quotes that belong to the link markup (such as its href attribute);
I want to select only normal quoted text.
I used
result = content.replace(/"(.*?)"/g, "<i>$1</i>");
to replace all quoted text with italics, but it also replaces the href quotes.
Thanks :)
If you need an ad hoc regex solution, you may match and capture tags, and only replace " symbols in other contexts. Defining a tag as a <, followed by characters other than <, up to the first >, we may use
var s = '"replace this" but <div id="not-here"> "and here"</div>';
var re = /(<[^<]*?>)|"(.*?)"/g;
var result = s.replace(re, function (m,g1,g2) {
return g1? g1 : '<i>' + g2 + '</i>';
});
console.log(result);
The (<[^<]*?>)|"(.*?)" matches:
(<[^<]*?>) - Group 1 (g1 later in the callback) that captures <, 0+ symbols other than < as few as possible up to the first >
| - or
"(.*?)" - ", 0+ chars other than a newline as few as possible captured into Group 2 (g2 later) and a ".
In the callback method, Group 1 is checked for a match, and if yes, we just put the tag back into the result, else, replace with the tags.
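Applied to the case from the question, a link with a quoted href, the same pattern can be wrapped in a small function. This is a sketch: the name italicizeQuotes and the example URL are made up for illustration.

```javascript
// Hypothetical wrapper around the pattern explained above.
function italicizeQuotes(html) {
  return html.replace(/(<[^<]*?>)|"(.*?)"/g, (match, tag, quoted) =>
    tag ? tag : '<i>' + quoted + '</i>' // keep tags, italicize quoted text
  );
}

console.log(italicizeQuotes('<a href="https://example.com">link</a> says "hi"'));
// <a href="https://example.com">link</a> says <i>hi</i>
```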
The simplest answer would be to use:
/[^=]"(.*)"/
instead of
/"(.*?)"/
But that will also include quotes that have = sign before them.
Why not only work on the actual text of the element... Like:
var anchors = [],
idx;
anchors = Array.prototype.slice.call(document.getElementsByTagName("a"));
for(idx=0; idx<anchors.length; idx++) {
anchors[idx].innerHTML = anchors[idx].innerHTML.replace(/"([^"]*)"/g, '<i>$1</i>');
}
some text that contains a "quoted" part.
<br/>
more "text" that contains a "quoted" part.
Here we get all anchor elements as an array and replace the innerHTML text with an italicized version of itself.
This pattern could be what you're looking for: <.+>.*(\".+\").*</.+>
Used in JavaScript, the following matches "text":
new RegExp('<.+>.*(\".+\").*</.+>', 'g').exec('some "text"')[1]

How to create string with multiple spaces in JavaScript

By creating a variable
var a = 'something' + '         ' + 'something'
I get this value: 'something something'.
How can I create a string with multiple spaces on it in JavaScript?
In 2022 - use ES6 Template Literals for this task.
If you need IE11 Support - use a transpiler.
let a = `something       something`;
Template Literals are fast, powerful, and produce cleaner code.
If you need IE11 support and you don't have transpiler, stay strong 💪 and use \xa0 - it is a NO-BREAK SPACE char.
Following the "UTF-8 encoding table and Unicode characters" reference, you can write it as below:
var a = 'something' + '\xa0\xa0\xa0\xa0\xa0\xa0\xa0' + 'something';
in ES6:
let a = 'something' + ' '.repeat(10) + 'something'
old answer:
var a = 'something' + Array(10).fill('\xa0').join('') + 'something'
number inside Array(10) can be changed to needed number of spaces
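A short sketch of the repeat approach, with a hypothetical helper padBetween that switches between regular spaces (fine inside a JavaScript string) and no-break spaces (useful when the string is later rendered as HTML):

```javascript
// Hypothetical helper: join two strings with `count` spaces in between.
function padBetween(left, right, count, useNbsp = false) {
  const space = useNbsp ? '\xa0' : ' ';
  return left + space.repeat(count) + right;
}

console.log(JSON.stringify(padBetween('something', 'something', 5)));
// "something     something"

console.log(padBetween('something', 'something', 5, true).length);
// 23
```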
Use &nbsp;
It is the entity used to represent a non-breaking space. It is essentially a standard space, the primary difference being that a browser should not break (or wrap) a line of text at the point that this occupies.
var a = 'something' + '&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;' + 'something'
Non-breaking Space
A common character entity used in HTML is the non-breaking space (&nbsp;).
Remember that browsers will always truncate spaces in HTML pages. If you write 10 spaces in
your text, the browser will remove 9 of them. To add real spaces to your text,
you can use the &nbsp; character entity.
http://www.w3schools.com/html/html_entities.asp
Demo
var a = 'something' + '&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;' + 'something';
document.body.innerHTML = a;
With template literals, you can use multiple spaces or multi-line strings and string interpolation. Template Literals are a new ES2015 / ES6 feature that allows you to work with strings. The syntax is very simple, just use backticks instead of single or double quotes:
let a = `something       something`;
and to make multiline strings just press enter to create a new line, with no special characters:
let a = `something
something`;
The results are exactly the same as you write in the string.
In ES6 you can build strings like this:
const a = `something ${'\xa0'.repeat(10)} something`
Just add any spaces between the backticks, and print variables inside with ${var}
You can use the <pre> tag with innerHTML. The HTML <pre> element represents preformatted text which is to be presented exactly as written in the HTML file. The text is typically rendered using a non-proportional ("monospace") font. Whitespace inside this element is displayed as written. If you don't want a different font, simply add pre as a selector in your CSS file and style it as desired.
Ex:
var a = '<pre>something       something</pre>';
document.body.innerHTML = a;
I don't have this problem with the string variable itself, but only when the string is converted into html.
One can use replace and a regex to translate spaces into protected spaces replace(/ /g, '\xa0').
var a = 'something' + '         ' + 'something'
p1.innerHTML = a
p2.innerHTML = a.replace(/ /g, '\xa0')
<p id="p1"></p>
<p id="p2"></p>
BTW, if you input many spaces into contenteditable, they are translated as alternating sequences of spaces and protected spaces as you can try here:
<p contenteditable onkeyup="result.value = this.innerHTML">put many space into this editable paragraph and see the results in the textarea</p>
<textarea id="result"></textarea>
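As a variation on the replace approach above, the sketch below converts only the extra spaces in a run, so single spaces between words stay ordinary, breakable spaces. This is an assumption about what is wanted, not something the answers state, and the name preserveSpaces is hypothetical.

```javascript
// Hypothetical helper: keep the first space of each run as a normal space
// and turn the remaining ones into no-break spaces (\xa0).
function preserveSpaces(text) {
  return text.replace(/ {2,}/g, (run) => ' ' + '\xa0'.repeat(run.length - 1));
}

console.log(JSON.stringify(preserveSpaces('a   b c')));
// "a \u00a0\u00a0b c" as characters: one space, two no-break spaces
```

Since the result contains real \xa0 characters rather than entities, it can be assigned with either textContent or innerHTML.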

Regex replace text outside html tag

I'm working on an autocomplete component that highlights all occurrences of the searched text. What I do is split the input text into words, and wrap every occurrence of those words in a <strong> tag.
My code looks like this:
inputText = 'marriott st';
text = "Marriott east side";
textSearch = inputText.split(' ');
for (var i in textSearch) {
var regexSearch = new RegExp('(?!<\/?strong>)' + textSearch[i], "i");
var textReplaced = regexSearch.exec(text);
text = text.replace(regexSearch, '<strong>' + textReplaced + '</strong>');
}
For example, given the result "Marriott east side"
and the input text "marriott st",
I should get
<strong>Marriott</strong> ea<strong>st</strong> side
but I'm getting
<<strong>st</strong>rong>marriot</<strong>st </strong>rong>ea<<strong>st</strong> rong>s</strong> side
Any ideas on how I can improve my regex to avoid matching occurrences inside the HTML tags? Thanks
/(?!<\/?strong>)st/
I would process the string in one pass. You can create one regular expression out of the search string:
var search_pattern = '(' + inputText.replace(/\s+/g, '|') + ')';
// `search_pattern` is now `(marriott|st)`
text = text.replace(RegExp(search_pattern, 'gi'), '<strong>$1</strong>');
You could even split the search string first, sort the words by length and combine them, to give a higher precedence to longer matches.
You definitely should escape special regex characters inside the string: How to escape regular expression special characters using javascript?.
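Putting these three suggestions together (one pass, longer words first, escaped special characters) might look like the sketch below. The function names escapeRegExp and highlight are hypothetical, and the text is assumed to contain no HTML of its own.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap every occurrence of any query word in <strong>.
function highlight(text, query) {
  const words = query
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .sort((a, b) => b.length - a.length) // longer words take precedence
    .map(escapeRegExp);
  if (words.length === 0) return text;
  const pattern = new RegExp('(' + words.join('|') + ')', 'gi');
  return text.replace(pattern, '<strong>$1</strong>');
}

console.log(highlight('Marriott east side', 'marriott st'));
// <strong>Marriott</strong> ea<strong>st</strong> side
```

Because all words are matched in a single pass over the original text, the inserted <strong> tags are never searched themselves.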
Before each search, I suggest getting (or saving) the original search string to work on each time. For example, in your current case that means you could replace all '<strong>' and '</strong>' tags with ''. This will help keep your regEx simple, especially if you decide to add other html tags and formatting in the future.
