Regex to strip BBCode - javascript

I need a regular expression to strip out any BBCode in a string. I've got the following (and an array with tags):
new RegExp('\\[' + tags[index] + '](.*?)\\[/' + tags[index] + ']');
It picks up [tag]this[/tag] just fine, but fails when using [url=http://google.com]this[/url].
What do I need to change? Thanks a lot.

I came across this thread and it got me on the right track. Here is the one I ended up with after two hours (it's my first RegEx!). It is for JavaScript and, in my tests, it handles deeply nested and even incorrectly nested strings:
string = string.replace(/\[\/?(?:b|i|u|url|quote|code|img|color|size)*?.*?\]/img, '');
If string = "[b][color=blue][url=www.google.com]Google[/url][/color][/b]" then the new string will be "Google". Amazing.
Hope someone finds that useful, this was a top match for 'JavaScript RegEx strip BBCode' in Google ;)

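You have to allow any character other than ']' after the tag name until you find the closing ']'. In JavaScript the ']' inside the character class must be escaped, because '[^]' on its own is a valid class there.
new RegExp('\\[' + tags[index] + '[^\\]]*\\](.*?)\\[/' + tags[index] + '\\]');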
You could simplify this to the following expression.
\[[^\]]*\]([^\[]*)\[\/[^\]]*\]
The problem with that is that it will match [WrongTag]stuff[/WrongTag], too. Matching nested tags requires applying the expression multiple times.
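The attribute-aware pattern and the repeated passes needed for nesting could be sketched as follows. The `stripTag` helper and the sample string are assumptions made for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: unwraps one tag pair per pass and repeats until the
// string stops changing, so nested occurrences of the same tag are handled.
function stripTag(input, tag) {
    // [^\]]* allows an optional attribute such as =http://google.com
    var re = new RegExp('\\[' + tag + '[^\\]]*\\]([\\s\\S]*?)\\[/' + tag + '\\]', 'g');
    var previous;
    do {
        previous = input;
        input = input.replace(re, '$1');
    } while (input !== previous);
    return input;
}

var sample = '[url=http://google.com]this[/url] and [b]bold [b]inner[/b][/b]';
// Apply the helper once per tag name in the (assumed) tag list.
var result = ['url', 'b'].reduce(stripTag, sample);
console.log(result);
```

The loop is what makes nesting work: a single pass pairs the outer `[b]` with the first `[/b]` it meets, and the leftover pair is only removed on the next pass.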

You can check for balanced tags using a backreference:
new RegExp('\\[(' + tags.join('|') + ')[^\\]]*\\](.*?)\\[/\\1\\]');
The real problem is that you can't match arbitrarily nested tags with a regular expression (that's the limit of a regular language). Some languages do allow recursive regular expressions, but those are extensions (which technically make them non-regular, though that doesn't change the name most people use for them).
If you don't care about balanced tags, you can just strip out any tag you find:
new RegExp('\\[/?(?:' + tags.join('|') + ')[^\\]]*\\]');
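To see how the two patterns in this answer differ, here is a small sketch. The tag list and the sample strings are assumed values chosen for the demonstration.

```javascript
var tags = ['b', 'i', 'url'];          // assumed tag list
var alternation = tags.join('|');

// Balanced: \1 must repeat whichever tag name group 1 captured.
var balanced = new RegExp('\\[(' + alternation + ')[^\\]]*\\]([\\s\\S]*?)\\[/\\1\\]', 'g');
// Unbalanced: remove any opening or closing tag whose name starts with a listed tag.
var anyTag = new RegExp('\\[/?(?:' + alternation + ')[^\\]]*\\]', 'g');

var mismatched = '[b]text[/i]';
var matched = '[url=http://google.com]link[/url]';

var keptMismatch = mismatched.replace(balanced, '$2'); // no balanced pair, left alone
var strippedPair = matched.replace(balanced, '$2');    // pair removed, content kept
var strippedAny = mismatched.replace(anyTag, '');      // both tags removed regardless

console.log(keptMismatch, strippedPair, strippedAny);
```

One caveat with the unbalanced form: because `[^\]]*` follows the tag name directly, a short name such as `i` also matches the start of longer names like `[img]`.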

To strip out any BBCode, use something like:
var alltags = tags.join('|');
var stripbb = new RegExp('\\[/?(' + alltags + ')[^\\]]*\\]', 'g');
Replace globally with the empty string. No extra loop necessary.

I had a similar problem - in PHP not Javascript - I had to strip out BBCode [quote] tags and also the quotes within the tags. Added problem in that there is often arbitrary additional stuff inside the [quote] tag, e.g. [quote:7e3af94210="username"]
This worked for me:
$post = preg_replace('/[\r\n]+/', "\n", $post);
$post = preg_replace('/\[\s*quote.*\][^[]*\[\s*\/quote.*\]/im', '', $post);
$post = trim($post);
Lines 1 and 3 just tidy up extra newlines, including any that are left over after the regex has run.

I think
new RegExp('\\[' + tags[index] + '(=[^\\]]+)?](.*?)\\[/' + tags[index] + ']');
should do it. Instead of group 1 you have to pick group 2 then.

Remember that many (most?) regex flavours by default do not let the DOT meta character match line terminators, which causes a tag like
"[foo]dsdfs
fdsfsd[/foo]"
to fail. Either enable DOTALL (by adding "(?s)" to your regex in flavours that support it, or with the s flag in JavaScript from ES2018 on), or replace the DOT meta char in your regex with the character class [\S\s].
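A minimal sketch of the difference, using the two-line `[foo]` example from above (the variable names are made up for the illustration):

```javascript
var multiline = '[foo]dsdfs\nfdsfsd[/foo]';

// . stops at the line break, so the lazy group can never reach [/foo]
var dotPattern = /\[foo\](.*?)\[\/foo\]/;
// [\s\S] matches every character, line terminators included
var classPattern = /\[foo\]([\s\S]*?)\[\/foo\]/;

var dotMatches = dotPattern.test(multiline);
var classMatches = classPattern.test(multiline);
var inner = multiline.replace(classPattern, '$1');

console.log(dotMatches, classMatches, JSON.stringify(inner));
```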

This worked for me, for every tag name. It also supports strings like '[url="blablabla"][/url]':
str = str.replace(/\[([a-z]+)(\=[\w\d\.\,\\\/\"\'\#\,\-]*)*( *[a-z0-9]+\=.+)*\](.*?)\[\/\1\]/gi, "$4")

Related

Regex with multiple start and end characters that must be the same

I would like to be able to search for strings inside a special tag in a string in JavaScript. Strings in JavaScript can start with either " or ' character.
Here is an example to illustrate what I want to do. My custom tag is called <my-tag>. My regex is /('|")*?<my-tag>((.|\n)[^"']*?)<\/my-tag>*?('|")/g. I use this regex pattern on the following strings:
var a = '<my-tag>Hello World</my-tag>'; //is found as expected
var b = "<my-tag>Hello World" + '</my-tag>'; //is NOT found, this is good!
var c = "<my-tag>Hello World</my-tag>"; //is found as expected
var d = '<my-tag>something "special"</my-tag>'; //here the " char causes a problem
var e = "<my-tag>something 'special'</my-tag>"; //here the ' char causes a problem
It works well with a and also c, where it finds the tag with the containing text. It also does not find the text in b, which is what I want. But in cases d and e the tag with content is not found due to the occurrence of the " and ' characters. What I want is a regex where " is allowed inside the tag if the string starts with ', and vice versa.
Is it possible to achieve this with one regex, or is the only option to work with two separate regular expressions like
/(")*?<my-tag>((.|\n)[^']*?)<\/my-tag>*?(")/g and /(')*?<my-tag>((.|\n)[^"]*?)<\/my-tag>*?(')/g ?
It's not pretty, but I think this would work:
/("<my-tag>((.|\n)[^"]*?)<\/my-tag>"|'<my-tag>((.|\n)[^']*?)<\/my-tag>')/g
You should be able to take what the first group ('|") matched and reuse it as a backreference at the end. Something like the following:
/('|")<my-tag>.*?<\/my-tag>\1/g
This should make sure to match the same character at the beginning and the end.
But you really shouldn't use regex for parsing HTML.
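The backreference approach can be checked against the strings from the question. In this sketch the `source` text (the question's variable declarations joined into one string) and the extra capture group around the tag content are assumptions added for the demonstration.

```javascript
// The question's declarations, treated as plain text to search through.
var source = [
    "var a = '<my-tag>Hello World</my-tag>';",
    "var d = '<my-tag>something \"special\"</my-tag>';",
    "var e = \"<my-tag>something 'special'</my-tag>\";",
    "var b = \"<my-tag>Hello World\" + '</my-tag>';"
].join('\n');

// \1 must be the same quote character that group 1 matched at the start.
var pattern = /('|")<my-tag>(.*?)<\/my-tag>\1/g;

var found = [];
var m;
while ((m = pattern.exec(source)) !== null) {
    found.push(m[2]); // group 2 holds the text between the tags
}

console.log(found);
```

The declaration of `b` is skipped because its closing tag is followed by a different quote character than the one that opened the string.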

Splitting string with javascript using '>' character

I acknowledge that this question has probably been asked so many times before and I have tried searching all over StackOverflow for a solution, but so far nothing has worked for me.
I want to split a string but it's not working properly and spitting out individual characters as each item in an array. The string I have from my CMS uses ">" characters to separate and I am using regEx to replace the 'greater than' symbol - with a comma, which works. Sourced this solution from Regex that detects greater than ">" and less than "<" in a string
However, the arrays remain incorrectly formed, like the split() function does not even work:
var myString = "TEST Public Libraries Connect > News Blog > A new item"
var regEx = /<|>/g;
var myNewString = (myString.replace(regEx,","))
alert(myNewString);
myNewString.split(",");
alert(myNewString[0]);
alert(myNewString[1]);
alert(myNewString[2]);
I've put it up in a Fiddle as well, just confused as to why the split won't work properly. Is it because there are spaces in the string?
This should work:
var myNewString = myString.split(">");
https://jsfiddle.net/2j56cva0/3/
In your fiddle, you were splitting myNewString instead of the actual string.
myNewString.split(",");
You need to assign the result of the split to something. It does not just change the string itself into an array.
var parts = myNewString.split(",");
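Putting both answers together, a sketch that splits directly on the separator and drops the surrounding spaces might look like this. Using a regular expression as the separator is an addition here, not something either answer proposed.

```javascript
var myString = "TEST Public Libraries Connect > News Blog > A new item";

// split() returns a new array; the original string is left untouched.
// \s* on both sides swallows the spaces around each separator.
var parts = myString.split(/\s*[<>]\s*/);

console.log(parts[0]);
console.log(parts[1]);
console.log(parts[2]);
```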

Javascript Regular expression to remove unwanted <br>, &nbsp;

I have a JS string like this
<div id="grouplogo_nav"><br> <ul><br> <li><a class="group_hlfppt" target="_blank" href="http://www.hlfppt.org/">&nbsp;</a></li><br> </ul><br> </div>
I need to remove all <br> and &nbsp; that are only between > and <. I tried to write a regular expression, but didn't get it right. Does anybody have a solution?
EDIT:
Please note I want to remove only the tags between > and <.
Avoid using regex on html!
Try creating a temporary div from the string, and using the DOM to remove any br tags from it. This is much more robust than parsing html with regex, which can be harmful to your health:
var tempDiv = document.createElement('div');
tempDiv.innerHTML = mystringwithBRin;
var nodes = tempDiv.childNodes;
// Only the direct children of tempDiv are checked here.
for (var nodeId = nodes.length - 1; nodeId >= 0; --nodeId) {
    // tagName is upper-case for HTML elements (and undefined for text nodes)
    if (nodes[nodeId].tagName === 'BR') {
        tempDiv.removeChild(nodes[nodeId]);
    }
}
var newStr = tempDiv.innerHTML;
Note that we iterate in reverse over the child nodes so that the node IDs remain valid after removing a given child node.
http://jsfiddle.net/fxfrt/
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
... where /.../ denotes a regular expression, ^ denotes start of string, (&nbsp;|<br>) denotes "&nbsp; or <br>", and + denotes "one or more occurrences of the previous expression". Then simply replace that full match with an empty string.
s.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g,'$1$2');
Don't use this in production. See the answer from Phil H.
Edit: I'll try to explain it a bit and hope my English is good enough.
Basically we have two different kinds of parentheses here. The first and third pairs () are normal parentheses. They are used to remember the characters that are matched by the enclosed pattern and to group the characters together. For the second pair, we don't need to remember the characters for later use, so we disable the "remember" functionality by using the form (?:) and only group the characters to make the + work as expected. The + quantifier means "one or more occurrences", so &nbsp; or <br> must be there one or more times. The last part (\s?<) matches a whitespace character (\s), which can be missing or occur one time (?), followed by the character <. $1 and $2 are a kind of variable that is replaced by the remembered characters of the first and third parentheses.
MDN provides a nice table, which explains all the special characters.
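Applied to a shortened version of the string from the question, the capture-group replacement behaves as sketched below. This assumes the string contains the literal text `&nbsp;` (the entity, not an already decoded non-breaking space); the `class` and `target` attributes were dropped from the sample only to keep it short.

```javascript
var html = '<div id="grouplogo_nav"><br> <ul><br> <li>' +
    '<a href="http://www.hlfppt.org/">&nbsp;</a></li><br> </ul><br> </div>';

// $1 puts back the ">" and $2 puts back the optional space plus "<",
// so only the &nbsp; / <br> run between them is dropped.
var cleaned = html.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g, '$1$2');

console.log(cleaned);
```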
You need to replace globally. Also don't forget that you can have the <br> being closed: <br />. Try this:
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/g, '');
This worked for me, please note for the multi lines
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/gm, '');
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
hope this helps

Converting XML tags to uppercase using a javascript regex

I'm trying to convert XML tags to uppercase, while preserving the case of attributes and text. So for example
<Mytag Category="Parent">Value1</Mytag>
Becomes
<MYTAG Category="Parent">Value1</MYTAG>
I have a regex which matches the XML tags correctly, but the upperCase function does not seem to be working.
myXmlElement.replace(/<(\/)*([a-zA-Z_0-9]+)([^>]*)>/g,"<$1" + "$2".toUpperCase() + "$3>")
I've also tried using String.prototype.toUpperCase.apply("$2"), as well as passing a function as the replace argument
myXmlElement.replace(/<[\/]*([a-zA-Z_0-9]+)[^>]*>/g,
function($1,$2,$3){return <$1 + $2.toUpperCase() + $3>})
But this doesn't work, as $1,$2,$3 appear to refer to the entire matching elements ($1 = , $2 = )
I'm sure there is something trivial I am overlooking here, can anybody help out?
If you want to capture the characters before and after your tag name, they need to be put into capturing parentheses within the pattern:
var pattern = /<([\/]*)([a-zA-Z_0-9]+)([^>]*)>/g;
var newTag = myXmlElement.replace(pattern, function(full, before, tag, after) {
    return "<" + before + tag.toUpperCase() + after + ">";
});
The replacement function receives the full match as its first argument, which you can simply ignore here.
After that, each capturing group of your pattern is passed as a further parameter, in order.
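As a runnable sketch using the sample from the question (the `upperCaseTags` wrapper is a made-up name): the string-based attempt fails because `"$2".toUpperCase()` runs before `replace` is called and simply yields `"$2"` again, whereas a callback runs once per match with the real captured text.

```javascript
// Hypothetical wrapper around the callback-based replace.
function upperCaseTags(xml) {
    return xml.replace(/<(\/?)([a-zA-Z_0-9]+)([^>]*)>/g, function (full, slash, tag, rest) {
        // slash: "" or "/", tag: element name, rest: attributes (case preserved)
        return '<' + slash + tag.toUpperCase() + rest + '>';
    });
}

var output = upperCaseTags('<Mytag Category="Parent">Value1</Mytag>');
console.log(output);
```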

JavaScript RegExp help for BBCode

I have this RegExp expression I found a couple of weeks ago
/([\r\n])|(?:\[([a-z\*]{1,16})(?:=([^\x00-\x1F"'\(\)<>\[\]]{1,256}))?\])|(?:\[\/([a-z]{1,16})\])/ig
And it's working to find the BBCode tags such as [url] and [code].
However, if I try [url="http://www.google.com"] it won't match. I'm not very good at RegExp and I can't figure out how to keep it valid while making the ="http://www.google.com" part optional.
This also fails for [color="red"], but I figure it is the same issue the url tag is having.
This part: [^\x00-\x1F"'\(\)<>\[\]] says that after the = there must not be a ". That means your regexp matches [url=http://stackoverflow.com]. If you want to have quotes you can simply put them around your capturing group:
/([\r\n])|(?:\[([a-z\*]{1,16})(?:="([^\x00-\x1F"'\(\)<>\[\]]{1,256})")?\])|(?:\[\/([a-z]{1,16})\])/gi
I think you would benefit from explicitly enumerating all the tags you want to match, since it should allow matching the closing tag more specifically.
Here's some sample code:
var tags = [ 'url', 'code', 'b' ]; // add more tags
var regParts = tags.map(function (tag) {
return '(\\[' + tag + '(?:="[^"]*")?\\](?=.*?\\[\\/' + tag + '\\]))';
});
var re = new RegExp(regParts.join('|'), 'g');
You might notice that the regular expression is composed from a set of smaller ones, each representing a single tag with a possible attribute ((?:="[^"]*")?, see explanation below) of variable length, like [url="google.com"], and separated with the alternation operator |.
(="[^"]*")? means an = symbol, then a double quote, followed by any symbol other than double quote ([^"]) in any quantity, i.e. 0 or more, (*), followed by a closing quote. The final ? means that the whole group may not be present at all.
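To try the composed expression out, the sketch below repeats the construction and runs it on a made-up input string (the `text` value and the `openers` name are assumptions). It only demonstrates which opening tags are found.

```javascript
var tags = ['url', 'code', 'b'];
var regParts = tags.map(function (tag) {
    // opening tag with optional ="..." attribute, but only if a matching
    // closing tag appears somewhere later on the same line (lookahead)
    return '(\\[' + tag + '(?:="[^"]*")?\\](?=.*?\\[\\/' + tag + '\\]))';
});
var re = new RegExp(regParts.join('|'), 'g');

var text = '[url="http://www.google.com"]Google[/url] [b]unclosed [code]x[/code]';
var openers = text.match(re);

console.log(openers);
```

The `[b]` tag is skipped because the lookahead finds no `[/b]` after it.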
