JavaScript RegExp help for BBCode - javascript

I have this RegExp expression I found a couple of weeks ago:
/([\r\n])|(?:\[([a-z\*]{1,16})(?:=([^\x00-\x1F"'\(\)<>\[\]]{1,256}))?\])|(?:\[\/([a-z]{1,16})\])/ig
It works for finding BBCode tags such as [url] and [code].
However, if I try [url="http://www.google.com"] it won't match. I'm not very good at RegExp and I can't figure out how to keep the pattern valid while making the ="http://www.google.com" part optional.
This also fails for [color="red"], but I figure it is the same issue the url tag is having.

This part: [^\x00-\x1F"'\(\)<>\[\]] says that after the = there must not be a ". That means your regexp matches [url=http://stackoverflow.com]. If you want to have quotes, you can simply put them around your capturing group:
/([\r\n])|(?:\[([a-z\*]{1,16})(?:="([^\x00-\x1F"'\(\)<>\[\]]{1,256})")?\])|(?:\[\/([a-z]{1,16})\])/gi
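As a quick check, here is a minimal sketch comparing the pattern from the question with the quoted version from this answer. The variable names and the third variant (with "? on both sides of the value, so the quotes become optional) are assumptions added for illustration and are not part of the original answer.

```javascript
// Pattern from the question: the attribute value may not contain quotes.
var unquoted = /([\r\n])|(?:\[([a-z\*]{1,16})(?:=([^\x00-\x1F"'\(\)<>\[\]]{1,256}))?\])|(?:\[\/([a-z]{1,16})\])/i;
// Pattern from the answer: the attribute value must be wrapped in double quotes.
var quoted = /([\r\n])|(?:\[([a-z\*]{1,16})(?:="([^\x00-\x1F"'\(\)<>\[\]]{1,256})")?\])|(?:\[\/([a-z]{1,16})\])/i;
// Hypothetical variant: "? makes each quote optional, so both spellings match.
var either = /([\r\n])|(?:\[([a-z\*]{1,16})(?:="?([^\x00-\x1F"'\(\)<>\[\]]{1,256})"?)?\])|(?:\[\/([a-z]{1,16})\])/i;

var withQuotes = '[url="http://www.google.com"]';
var withoutQuotes = '[url=http://stackoverflow.com]';

console.log(unquoted.test(withQuotes));      // false
console.log(quoted.test(withQuotes));        // true
console.log(quoted.test(withoutQuotes));     // false
console.log(either.exec(withQuotes)[3]);     // http://www.google.com
console.log(either.exec(withoutQuotes)[3]);  // http://stackoverflow.com
```

Note that the optional-quote variant would also accept a value with only one of the two quotes, so it trades strictness for convenience.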

I think you would benefit from explicitly enumerating all the tags you want to match, since it should allow matching the closing tag more specifically.
Here's a sample code:
var tags = [ 'url', 'code', 'b' ]; // add more tags
var regParts = tags.map(function (tag) {
return '(\\[' + tag + '(?:="[^"]*")?\\](?=.*?\\[\\/' + tag + '\\]))';
});
var re = new RegExp(regParts.join('|'), 'g');
You might notice that the regular expression is composed from a set of smaller ones, each representing a single tag with a possible attribute ((?:="[^"]*")?, see explanation below) of variable length, like [url="google.com"], and separated with the alternation operator |.
(="[^"]*")? means an = symbol, then a double quote, followed by any symbol other than double quote ([^"]) in any quantity, i.e. 0 or more, (*), followed by a closing quote. The final ? means that the whole group may not be present at all.
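To see what the composed expression matches, here is a small sketch that runs it against a sample string. The sample text is made up for illustration; the regex construction is the one shown above.

```javascript
var tags = ['url', 'code', 'b']; // add more tags
var regParts = tags.map(function (tag) {
  // Opening tag with an optional ="..." attribute, accepted only if the
  // matching closing tag appears later on the same line (the lookahead).
  return '(\\[' + tag + '(?:="[^"]*")?\\](?=.*?\\[\\/' + tag + '\\]))';
});
var re = new RegExp(regParts.join('|'), 'g');

// Hypothetical input: [code] is never closed, so it should be skipped.
var sample = '[url="google.com"]Google[/url] [b]bold[/b] [code]never closed';
var found = sample.match(re);

console.log(found); // [ '[url="google.com"]', '[b]' ]
```

Because the lookahead uses a dot, it will not look past a line break; a closing tag on a later line would need [\s\S] instead of the dot.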

Related

javascript regex insert new element into expression

I am passing a URL to a block of code in which I need to insert a new element into the regex. Pretty sure the regex is valid and the code seems right but no matter what I can't seem to execute the match for regex!
//** Incoming url's
//** url e.g. api/223344
//** api/11aa/page/2017
//** Need to match to the following
//** dir/api/12ab/page/1999
//** Hence the need to add dir at the front
var url = req.url;
//** pass in: /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var re = myregex.toString();
//** Insert dir into regex: /^dir\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var regVar = re.substr(0, 2) + 'dir' + re.substr(2);
var matchedData = url.match(regVar);
matchedData === null ? console.log('NO') : console.log('Yay');
I hope I am just missing the obvious but can anyone see why I can't match and always returns NO?
Thanks
Let's break down your regex
^\/api\/ this matches the beginning of a string, and then exactly the string "/api/"
([a-zA-Z0-9-_~ %]+) this is a capturing group: this one specifically will capture anything inside those brackets, with the + indicating to capture 1 or more, so for example, this section will match abAB25-_ %
(?:\/page\/([a-zA-Z0-9-_~ %]+)) this groups multiple tokens together as well, but does not create a capturing group like above (the ?: makes it non-capturing). You are first matching a string exactly like "/page/" followed by a capturing group exactly like the one in the paragraph above (it matches a-z, A-Z, 0-9, etc.).
?$ is at the end: the ? makes the preceding group optional (it matches 0 or 1 times), and the $ matches the end of the string
This regex will match this string, for example: /api/abAB25-_ %/page/abAB25-_ %
You may be able to take advantage of capturing groups, however, and use something like this instead to get similar results: ^\/api\/([a-zA-Z0-9-_~ %]+)\/page\/\1?$. Here, we are using \1 to reference that first capturing group and match exactly the same tokens it is matching. EDIT: actually, this probably won't work, since the text after /api/ and the text after /page/ will most likely be different, carrying on...
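Afterwards, you are adding "dir" to the beginning of your search, so you can now match something like this: dir/api/abAB25-_ %/page/abAB25-_ %
You have also now converted the regex to a string, so, as Crayon Violent pointed out in their comment, this will break your expected functionality: toString() keeps the surrounding slashes, and .match() then treats them as part of the pattern. You can fix this by using .source on your regex, which returns the pattern text without the slashes: var re = myregex.source; With .source the pattern starts directly with ^, so dir has to be inserted after the first character rather than the second: var regVar = re.substr(0, 1) + 'dir' + re.substr(1); https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/source
Now you can properly match a string like this: dir/api/11aa/page/2017. See this example: https://repl.it/Mj8h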
As mentioned by Crayon Violent in the comments, it seems you're passing a String rather than a regular expression to the .match() function. Maybe try the following:
url.match(new RegExp(regVar, "i"));
to convert the string to a regular expression (the string must not include the surrounding slashes). The "i" is for ignore case; I don't know if that's what you want. Learn more here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
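Putting both answers together, a minimal sketch could look like the following. It assumes myregex is the RegExp object shown in the question's comment and that a modern engine is used (the .flags property is ES2015); the names source and withDir are made up for illustration.

```javascript
// The route pattern from the question.
var myregex = /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/;

// .source is the pattern text without the surrounding slashes or flags.
var source = myregex.source;

// Insert "dir" right after the leading ^ and compile the string again.
var withDir = new RegExp('^dir' + source.slice(1), myregex.flags);

var matchedData = 'dir/api/12ab/page/1999'.match(withDir);
console.log(matchedData === null ? 'NO' : 'Yay');  // Yay
console.log(matchedData[1], matchedData[2]);       // 12ab 1999
console.log(withDir.test('/api/12ab/page/1999'));  // false
```

Compiling the pattern once with new RegExp, instead of passing a string to .match() on every call, also makes it explicit which flags are in effect.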

Regex converting & to &amp;

I am developing a small character encoder generator where the user inputs their text and, on the click of a button, it outputs the encoded version.
I've defined an object of the characters that need to be encoded like so:
map = {
'©' : '&copy;',
'&' : '&amp;'
},
And here is the loop that gets the values from the map and replaces them:
Object.keys(map).forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
I am then simply outputting the result to a textarea. This all works fine, however the problem I'm facing is this.
© is replaced with &copy; however the & symbol at the beginning of this is then converted to &amp; so it ends up being &amp;copy;.
I see why this is happening, however I'm not sure how to go about ensuring that & is not replaced within character encoded strings.
Here is a JSFiddle for a live preview of what I mean:
http://jsfiddle.net/4m3nw/1/
Any help would be much appreciated
Prelude: Apart from regex, an idea worth considering is something like this JS function that already handles html entities. Now, on to the regex question.
HTML Special Characters, Negative Lookahead
In HTML, special characters can look not only like &copy; but also like &#8212;, and they can have upper-case characters.
To replace ampersands that are not immediately followed by a hash or word characters and a semicolon, you can use something like this:
&(?!(?:#[0-9]+|[a-z]+);)
See the demo.
Make sure to use the i flag to activate case-insensitive mode
& matches the literal ampersand
The negative lookahead (?!(?:#[0-9]+|[a-z]+);) asserts that it is not followed by...
(?:#[0-9]+|[a-z]+) a hash and digits, | OR letters...
then a semicolon.
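A short sketch of the lookahead in action, with a made-up input string; the variable names are assumptions for illustration only.

```javascript
// Ampersand NOT followed by "#digits;" or "letters;" (i flag for upper case).
var bareAmpersand = /&(?!(?:#[0-9]+|[a-z]+);)/gi;

// Hypothetical input mixing bare ampersands with existing entities.
var raw = 'Tom & Jerry &copy; 2014 &#8212; R&D';
var encoded = raw.replace(bareAmpersand, '&amp;');

console.log(encoded); // Tom &amp; Jerry &copy; 2014 &#8212; R&amp;D
```

Only the two bare ampersands are rewritten; the existing named and numeric entities are left alone because the lookahead rejects them.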
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
The problem is that since you process the same string, you replace the & in &copy;. If you re-order your map then that seemingly solves the problem. However, according to the ECMAScript specifications, this is not a given, so you would be relying on implementation details of the ECMAScript engine used.
What you can do to make sure it will always work is to swap the keys so that & is always processed first:
map = {
'©' : '&copy;',
'&' : '&amp;'
};
var keys = Object.keys(map);
keys[keys.indexOf('&')] = keys[0];
keys[0] = '&';
keys.forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
Obviously you need to add checks for the &'s existence if it isn't always there.
jsFiddle Demo.
Probably the simplest code change is to reorder your map by putting the ampersand on top.
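A different way to avoid depending on key order at all (not what the answers above do) is to build one alternation from all keys and replace in a single pass, so output that was just inserted is never scanned again. This is only a sketch; the map values and the encode helper name are assumptions.

```javascript
// Assumed map: each raw character and the entity it should become.
var map = {
  '©': '&copy;',
  '&': '&amp;'
};

// Escape regex metacharacters in each key, then join the keys with |.
var escapedKeys = Object.keys(map).map(function (ch) {
  return ch.replace(/[.?*+^$[\]\\(){}|-]/g, '\\$&');
});
var anyKey = new RegExp(escapedKeys.join('|'), 'g');

// Hypothetical helper: every match is looked up in the map exactly once.
function encode(raw) {
  return raw.replace(anyKey, function (match) {
    return map[match];
  });
}

console.log(encode('Tom & Jerry © 2014')); // Tom &amp; Jerry &copy; 2014
```

Since the input is traversed once, the & inside a freshly written &copy; is never seen by the regex, whatever order the keys come in.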

Skipping over tags and spaces in regex html

I'm using this regex to find a String that starts with !?, ends with ?!, and has another variable in between (in this example "a891d050"). This is what I use:
var pattern = new RegExp(/!\\?.*\s*(a891d050){1}.*\s*\\?!/);
It matches correctly against this one:
!?v8qbQ5LZDnFLsny7VmVe09HJFL1/WfGD2A:::a891d050?!
But fails when the string is broken up with html tags.
<span class="userContent"><span>!?v8qbQ5LZDnFLsny7VmVe09HJFL1/</span><wbr /><span class="word_break"></span>WfGD2A:::a891d050?!</span></div></div></div></div>
I tried adding \s and {space}*, but it still fails.
The question is: what (special?) characters do I need to account for if I want to ignore whitespace and HTML tags in my match?
edit: this is how I use the regex:
var pattern = /!\?[\s\S]*a891d050[\s\S]*\?!/;
document.body.innerHTML = document.body.innerHTML.replace(pattern,"new content");
It appears to me that when it encounters the 'plain' string, it replaces it correctly. But when faced with a string with classes around it and inside it, it makes a mess of the classes or doesn't replace at all, depending on the context. So I decided to try jquery-replacetext-plugin (as it promises to leave tags as they were) like this:
$("body *").replaceText( pattern, "new content" );
But with no success, the results are the same as before.
Maybe this:
var pattern = /!\?[\s\S]*a891d050[\s\S]*\?!/;
[\s\S] should match any character. I have also removed {1}.
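A small sketch of why [\s\S] helps where the dot does not. The HTML sample is adapted from the question, with a line break added between the tags as an assumption (the dot only fails when the markup spans several lines).

```javascript
var pattern = /!\?[\s\S]*a891d050[\s\S]*\?!/;
// Same idea with the dot, which does not match line terminators by default.
var dotPattern = /!\?.*a891d050.*\?!/;

var plain = '!?v8qbQ5LZDnFLsny7VmVe09HJFL1/WfGD2A:::a891d050?!';
// Assumed markup: the token is split by tags and by a newline.
var html = '<span class="userContent"><span>!?v8qbQ5LZDnFLsny7VmVe09HJFL1/</span><wbr />\n' +
  '<span class="word_break"></span>WfGD2A:::a891d050?!</span>';

console.log(pattern.test(plain));    // true
console.log(pattern.test(html));     // true
console.log(dotPattern.test(html));  // false
```

Both quantifiers are greedy, so on a whole page the match runs from the first !? to the last ?!; [\s\S]*? would keep each match as short as possible.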
The problem was apparently solved by using this regex:
var pattern = /(!\?)(?:<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])*?>)?(.)*?(a891d050)(?:<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])*?>)?(.)*?(\?!)/;

Javascript Regular expression to remove unwanted <br>, &nbsp;

I have a JS string like this
<div id="grouplogo_nav"><br> <ul><br> <li><a class="group_hlfppt" target="_blank" href="http://www.hlfppt.org/">&nbsp;</a></li><br> </ul><br> </div>
I need to remove all <br> and &nbsp; that are only between > and <. I tried to write a regular expression, but didn't get it right. Does anybody have a solution?
EDIT:
Please note I want to remove only the tags between > and <.
Avoid using regex on html!
Try creating a temporary div from the string, and using the DOM to remove any br tags from it. This is much more robust than parsing html with regex, which can be harmful to your health:
var tempDiv = document.createElement('div');
tempDiv.innerHTML = mystringwithBRin;
// getElementsByTagName also finds <br> elements nested inside other tags
var nodes = tempDiv.getElementsByTagName('br');
for (var nodeId = nodes.length - 1; nodeId >= 0; --nodeId) {
  // remove each <br> from whichever element contains it
  nodes[nodeId].parentNode.removeChild(nodes[nodeId]);
}
var newStr = tempDiv.innerHTML;
Note that we iterate in reverse over the nodes so that the remaining indexes stay valid after removing a given node (the collection is live and shrinks as nodes are removed).
http://jsfiddle.net/fxfrt/
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
... where /.../ denotes a regular expression, ^ denotes start of string, (&nbsp;|<br>) denotes "&nbsp; or <br>", and + denotes "one or more occurrences of the previous expression". And then simply replace that full match with an empty string.
s.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g,'$1$2');
Don't use this in production. See the answer from Phil H.
Edit: I'll try to explain it a bit and hope my English is good enough.
Basically we have two different kinds of parentheses here. The first pair and third pair () are normal parentheses. They are used to remember the characters that are matched by the enclosed pattern and group the characters together. For the second pair, we don't need to remember the characters for later use, so we disable the "remember" functionality by using the form (?:) and only group the characters to make the + work as expected. The + quantifier means "one or more occurrences", so &nbsp; or <br> must be there one or more times. The last part (\s?<) matches a whitespace character (\s), which can be missing or occur one time (?), followed by the character <. $1 and $2 are a kind of variable that is replaced by the remembered characters of the first and third parentheses.
MDN provides a nice table, which explains all the special characters.
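Here is the expression applied to the string from the question, as a sketch. It assumes the first alternative inside (?:...) is the literal text &nbsp;; the variable names are made up.

```javascript
var s = '<div id="grouplogo_nav"><br> <ul><br> <li><a class="group_hlfppt" ' +
  'target="_blank" href="http://www.hlfppt.org/">&nbsp;</a></li><br> </ul><br> </div>';

// Keep the ">" ($1) and the optional whitespace plus "<" ($2);
// drop any run of &nbsp; or <br> sitting between them.
var cleaned = s.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g, '$1$2');

console.log(cleaned.indexOf('<br>'));   // -1
console.log(cleaned.indexOf('&nbsp;')); // -1
console.log(cleaned);
```

The single space that followed each <br> survives because it is part of the second remembered group.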
You need to replace globally. Also, don't forget that the <br> tag can be self-closed, as in <br />. Try this:
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/g, '');
This worked for me. Note the m flag for multi-line input (it only changes how ^ and $ behave, so it is optional here):
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/gm, '');
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
Hope this helps.

Regex to strip BBCode

I need a regular expression to strip out any BBCode in a string. I've got the following (and an array with tags):
new RegExp('\\[' + tags[index] + '](.*?)\\[/' + tags[index] + ']');
It picks up [tag]this[/tag] just fine, but fails when using [url=http://google.com]this[/url].
What do I need to change? Thanks a lot.
I came across this thread and found it helpful for getting on the right track, but here's the ultimate one I spent two hours building (it's my first RegEx!) for JavaScript. I tested it and it works very well for crazy nests and even incorrectly nested strings:
string = string.replace(/\[\/?(?:b|i|u|url|quote|code|img|color|size)*?.*?\]/img, '');
If string = "[b][color=blue][url=www.google.com]Google[/url][/color][/b]" then the new string will be "Google". Amazing.
Hope someone finds that useful, this was a top match for 'JavaScript RegEx strip BBCode' in Google ;)
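You have to allow any character other than ']' after a tag until you find ']'. In JavaScript the ] inside the character class has to be escaped, because [^] on its own matches any character:
new RegExp('\\[' + tags[index] + '[^\\]]*](.*?)\\[/' + tags[index] + ']');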
You could simplify this to the following expression.
\[[^\]]*]([^[]*)\[\/[^\]]*]
The problem with that is that it will match [WrongTag]stuff[/WrongTag], too. Matching nested tags requires using the expression multiple times.
You can check for balanced tags using a backreference:
new RegExp('\\[(' + tags.join('|') + ')[^\\]]*](.*?)\\[/\\1]');
The real problem is that you can't match arbitrarily nested tags in a regular expression (that's the limit of a regular language). Some languages do allow for recursive regular expressions, but those are extensions (that technically make them non-regular, but that doesn't change the name that most people use for them).
If you don't care about balanced tags, you can just strip out any tag you find:
new RegExp('\\[/?(?:' + tags.join('|') + ')[^\\]]*]');
To strip out any BBCode, use something like:
var alltags = tags.join('|');
var stripbb = new RegExp('\\[/?(' + alltags + ')[^\\]]*\\]', 'g');
Replace globally with the empty string. No extra loop necessary.
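A runnable JavaScript sketch of the strip-everything approach. The tag list and the stripBBCode helper name are assumptions. Unlike the expressions above, it only allows an optional =value after the tag name, so that a short tag such as b does not also swallow unrelated bracketed words that merely start with that letter.

```javascript
// Assumed list of tags to strip; extend as needed.
var tags = ['b', 'i', 'u', 'url', 'quote', 'code', 'img', 'color', 'size'];

// "[" + optional "/" + tag name + optional "=value" + "]".
// The ] is escaped inside the class, because [^] alone means "any character" in JavaScript.
var stripbb = new RegExp('\\[/?(?:' + tags.join('|') + ')(?:=[^\\]]*)?\\]', 'gi');

// Hypothetical helper: remove every known tag in one global replace.
function stripBBCode(text) {
  return text.replace(stripbb, '');
}

console.log(stripBBCode('[b][color=blue][url=www.google.com]Google[/url][/color][/b]')); // Google
console.log(stripBBCode('[b]bold[/b] and [because]')); // bold and [because]
```

Tags with extra data before the = sign, such as [quote:7e3af94210="username"], would need a looser pattern than this one.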
I had a similar problem, in PHP rather than JavaScript: I had to strip out BBCode [quote] tags and also the quotes within the tags. An added problem is that there is often arbitrary additional stuff inside the [quote] tag, e.g. [quote:7e3af94210="username"]
This worked for me:
$post = preg_replace('/[\r\n]+/', "\n", $post);
$post = preg_replace('/\[\s*quote.*\][^[]*\[\s*\/quote.*\]/im', '', $post);
$post = trim($post);
Lines 1 and 3 are just to tidy up any extra newlines, and any that are left over as a result of the regex.
I think
new RegExp('\\[' + tags[index] + '(=[^\\]]+)?](.*?)\\[/' + tags[index] + ']');
should do it. Instead of group 1 you have to pick group 2 then.
Remember that many (most?) regex flavours by default do not let the DOT meta character match line terminators, causing a tag like
"[foo]dsdfs
fdsfsd[/foo]"
to fail. Either enable DOTALL (by adding "(?s)" to your regex in flavours that support it, or with the s flag in JavaScript since ES2018), or replace the DOT meta char in your regex by the character class [\S\s].
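The difference is easy to reproduce in JavaScript. In this sketch the variable names are made up, and the s (dotAll) flag is assumed to be available, which requires an ES2018 engine.

```javascript
var text = '[foo]dsdfs\nfdsfsd[/foo]';

// The dot stops at the line break, so the tag pair is not found.
var withDot = /\[foo\](.*?)\[\/foo\]/;
// [\s\S] matches any character, including line terminators.
var withClass = /\[foo\]([\s\S]*?)\[\/foo\]/;
// Same dot pattern with the s (dotAll) flag, ES2018 and later.
var withDotAll = new RegExp('\\[foo\\](.*?)\\[\\/foo\\]', 's');

console.log(withDot.test(text));                    // false
console.log(JSON.stringify(withClass.exec(text)[1]));  // "dsdfs\nfdsfsd"
console.log(JSON.stringify(withDotAll.exec(text)[1])); // "dsdfs\nfdsfsd"
```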
This worked for me, for every tag name. It also supports strings like '[url="blablabla"][/url]':
str = str.replace(/\[([a-z]+)(\=[\w\d\.\,\\\/\"\'\#\,\-]*)*( *[a-z0-9]+\=.+)*\](.*?)\[\/\1\]/gi, "$4")
