Regex appears to ignore multiple piped characters - javascript

Apologies for the awkward question title, I have the following JavaScript:
var wordRe = new RegExp('\\b(?:(?![<^>"])fox|hello(?![<\/">]))\\b', 'g'); // Words regex
console.log('<span>hello</span> <hello>fox</hello> fox link hello my name is fox'.replace(wordRe, 'foo'));
What I'm trying to do is replace any word that isn't nested in an HTML tag, or part of an HTML tag itself. I.e., I want to only match "plain" text. The expression seems to be ignoring the rule for the first piped match "fox", and replacing it when it shouldn't be.
Can anyone point out why this is? I think I might have organised the expression incorrectly (at least the negative lookahead).
Here is the JSFiddle.
I'd also like to add that I am aware of the implications of using regex with HTML :)

For your regex work, you want lookbehind. However, as of this writing, this feature is not supported in Javascript.
Here is a workaround:
Instead of matching what we want, we will match what we don't want and remove it from our input string. Later, we can perform the replace on the cleaned input string.
var nonWordRe = new RegExp('<([^>]+).*?>[^<]+?</\\1>', 'g');
var test = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
var cleanedTest = test.replace(nonWordRe, '');
var final = cleanedTest.replace(/fox|hello/g, 'foo'); // once trimmed, final === 'foo link foo my name is foo'
Note:
I built this workaround based on your sample. Here are some points you may need to explore if you run into them:
you may need to remove self-closing tags (<([^>]+).*?/\>) from the test string
you may need to trim the final string (final)
you may need a decent HTML parser if tags can contain other tags, as HTML allows this.
JavaScript doesn't, again as of this writing, support recursive patterns.
Demo
http://jsfiddle.net/yXd82/2/
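If the tagged content has to stay in the output instead of being removed, one possible variation is to match the parts to skip in a capture group and put them back unchanged. This is only a sketch, not the answer's method: the helper name replaceOutsideTags is hypothetical, and it assumes simple, non-nested elements and word lists that contain no regex metacharacters.

```javascript
// Hypothetical helper: replace whole words only outside simple elements and tags.
// Assumes elements are not nested and that `words` contains no regex metacharacters.
function replaceOutsideTags(input, words, replacement) {
  // Alternative 1 (group 1): a whole simple element <x>...</x>, or any lone tag.
  // Alternative 2 (uncaptured): one of the target words, as a whole word.
  var re = new RegExp(
    '(<([^\\s>/]+)[^>]*>[^<]*</\\2>|<[^>]*>)|\\b(?:' + words.join('|') + ')\\b',
    'g'
  );
  return input.replace(re, function (match, skipped) {
    // When group 1 took part in the match, put it back untouched.
    return skipped !== undefined ? skipped : replacement;
  });
}

var sample = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
var replaced = replaceOutsideTags(sample, ['fox', 'hello'], 'foo');
console.log(replaced);
```

With the sample string from the question, only the three plain-text words are replaced and both elements are left as they were.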

Related

Why would the replace with regex not work even though the regex does?

There may be a very simple answer to this, probably because of my familiarity (or possibly lack thereof) with the replace method and how it works with regex.
Let's say I have the following string: abcdefHellowxyz
I just want to strip the first six characters and the last four, to return Hello, using regex... Yes, I know there may be other ways, but I'm trying to explore the boundaries of what these methods are capable of doing...
Anyway, I've tinkered on http://regex101.com and got the following Regex worked out:
/^(.{6}).+(.{4})$/
Which seems to pass the string well and shows that abcdef is captured as group 1, and wxyz captured as group 2. But when I try to run the following:
"abcdefHellowxyz".replace(/^(.{6}).+(.{4})$/,"")
to replace those captured groups with "" I receive an empty string as my final output... Am I doing something wrong with this syntax? And if so, how does one correct it, keeping my original stance on wanting to use Regex in this manner...
Thanks so much everyone in advance...
The code below does what you want:
"abcdefHellowxyz".replace(/^.{6}(.+).{4}$/,"$1")
Use () only to capture the text you want to keep; in the second parameter of replace(), $1, $2, ... refer to group 1, group 2, and so on.
You can also pass a function as the second parameter of replace() and transform the captured text into whatever you want inside that function.
For more detail, as #Akxe recommends, see the documentation at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace.
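To make the two forms of the second parameter concrete, here is a small sketch using the string from the question. The variable names and the choice of toUpperCase() as the transformation are only illustrative.

```javascript
var input = 'abcdefHellowxyz';
var middleRe = /^.{6}(.+).{4}$/;

// String replacement: $1 stands for the first captured group.
var viaGroup = input.replace(middleRe, '$1');

// Function replacement: receives the whole match, then each captured group.
var viaFunction = input.replace(middleRe, function (match, middle) {
  return middle.toUpperCase();
});

console.log(viaGroup);    // Hello
console.log(viaFunction); // HELLO
```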
You are replacing any substring that matches /^(.{6}).+(.{4})$/, with this line of code:
"abcdefHellowxyz".replace(/^(.{6}).+(.{4})$/,"")
The regex matches the whole string "abcdefHellowxyz"; thus, the whole string is replaced. Instead, if you are strictly stripping by the lengths of the extraneous substrings, you could simply use substring or substr.
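For the length-based approach, a minimal sketch with substring could look like this. The helper name stripEnds and its parameters are hypothetical.

```javascript
// Hypothetical helper: drop `head` characters from the start and `tail` from the end.
function stripEnds(str, head, tail) {
  return str.substring(head, str.length - tail);
}

console.log(stripEnds('abcdefHellowxyz', 6, 4)); // Hello
```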
Edit
The answer you're probably looking for is capturing the middle token, instead of the outer ones:
var str = "abcdefHellowxyz";
var matches = str.match(/^.{6}(.+).{4}$/);
str = matches[1]; // index 0 is entire match
console.log(str);

JavaScript RegEx match unless wrapped with [nocode][/nocode] tags

My current code is:
var user_pattern = this.settings.tag;
user_pattern = user_pattern.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&"); // escape regex
var pattern = new RegExp(user_pattern.replace(/%USERNAME%/i, "(\\S+)"), "ig");
Where this.settings.tag is a string such as "[user=%USERNAME%]" or "#%USERNAME%". The code uses pattern.exec(str) to find any username in the corresponding tag and works perfectly fine. For example, if str = "Hello, [user=test]" then pattern.exec(str) will find test.
This works fine, but I want to be able to stop it from matching if the string is wrapped in [nocode][/nocode] tags. For example, if str = "[nocode]Hello, [user=test], how are you?[/nocode]" then pattern.exec(str) should not match anything.
I'm not quite sure where to start. I tried using a (?![nocode]) before and after the pattern, but to no avail. Any help would be great.
I would just test if the string starts with [nocode] first:
/^\[nocode\]/.test('[nocode]');
Then simply do not process it.
Maybe filter out [nocode] before trying to find the username(s)?
pattern.exec(str.replace(/\[nocode\](.*)\[\/nocode\]/g,''));
I know this isn't exactly what you asked for because now you have to use two separate regular expressions, however code readability is important too and doing it this way is definitely better in that aspect. Hope this helps 😉
JSFiddle: http://jsfiddle.net/1f485Lda/1/
It's based on this: Regular Expression to get a string between two strings in Javascript
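Putting the filtering idea together with the pattern-building code from the question, a self-contained sketch might look like the following. The function name findUsernames is hypothetical. One deliberate difference, which is my assumption and not part of the answer: the filter uses a lazy [\s\S]*? so that several [nocode] blocks in one string are removed separately instead of as one long span.

```javascript
// Hypothetical helper: collect user names, ignoring anything inside [nocode]...[/nocode].
function findUsernames(str, tag) {
  // Escape regex metacharacters in the tag template (same character class as the question).
  var escaped = tag.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, '\\$&');
  var pattern = new RegExp(escaped.replace(/%USERNAME%/i, '(\\S+)'), 'ig');

  // Remove each [nocode] block; the lazy quantifier stops at the first closing tag.
  var visible = str.replace(/\[nocode\][\s\S]*?\[\/nocode\]/gi, '');

  var names = [];
  var m;
  while ((m = pattern.exec(visible)) !== null) {
    names.push(m[1]);
  }
  return names;
}

console.log(findUsernames('Hello, [user=test]', '[user=%USERNAME%]'));
console.log(findUsernames('[nocode]Hello, [user=test], how are you?[/nocode]', '[user=%USERNAME%]'));
```

The first call finds test; the second finds nothing because the whole string sits inside a [nocode] block.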

removing phpbb tag using regex javascript

I'm trying to remove square-bracket tags (BBCode style) using JavaScript, to strip unwanted BBCode.
I tried this:
theString.replace(/\[quote[^\/]+\]*\[\/quote\]/, "")
It works with this sample string:
theString = "[quote=MyName;225]Test 123[/quote]";
but it fails with this one:
theString = "[quote=MyName;225]Test [quote]inside quotes[/quote]123[/quote]";
A solution that doesn't use regex is fine too.
The other 2 solutions simply do not work (see my comments). To solve this problem you first need to craft a regex which matches the innermost matching quote elements (which contain neither [QUOTE..] nor [/QUOTE]). Next, you need to iterate, applying this regex over and over until there are no more QUOTE elements left. This tested function does what you want:
function filterQuotes(text)
{   // Regex matches inner [QUOTE]non-quote-stuff[/quote] tag.
    var re = /\[quote[^\[]+(?:(?!\[\/?quote\b)\[[^\[]*)*\[\/quote\]/ig;
    while (text.search(re) !== -1)
    {   // Need to iterate removing QUOTEs from inside out.
        text = text.replace(re, "");
    }
    return text;
}
Note that this regex employs Jeffrey Friedl's "Unrolling the loop" efficiency technique and is not only accurate, but is quite fast to boot.
See: Mastering Regular Expressions (3rd Edition) (highly recommended).
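Since the question allows non-regex-only solutions, here is a different technique for comparison: a single pass that tokenizes the quote tags and tracks nesting depth, keeping only text at depth zero. This is a sketch under my own assumptions, not the method from the answer above: the name stripQuotes is hypothetical, a stray closing tag is simply dropped, and everything after an unclosed opening tag is discarded.

```javascript
// Hypothetical helper: remove [quote]...[/quote] blocks, including nested ones, in one pass.
function stripQuotes(text) {
  var tokenRe = /\[quote\b[^\]]*\]|\[\/quote\]/gi; // opening or closing quote tag
  var out = '';
  var depth = 0; // current nesting level
  var last = 0;  // index just past the previous tag
  var m;
  while ((m = tokenRe.exec(text)) !== null) {
    if (depth === 0) {
      out += text.slice(last, m.index); // keep text that is outside every quote
    }
    if (m[0].charAt(1) === '/') {
      if (depth > 0) depth--;           // closing tag; a stray one is dropped
    } else {
      depth++;                          // opening tag
    }
    last = tokenRe.lastIndex;
  }
  if (depth === 0) {
    out += text.slice(last);            // trailing text after the last tag
  }
  return out;
}

console.log(stripQuotes('before [quote=MyName;225]Test [quote]inside quotes[/quote]123[/quote] after'));
```

The text before and after the outer quote survives; everything between the outermost tags is removed.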
Try this one:
/\[quote[^\/]+\].*\[\/quote\]$/
The $ sign indicates that only the closing quote element at the end of the string should be used to determine the ending of the quote you're trying to remove.
I also added a "." before the asterisk so that it matches any character in between. I tested this with your two strings and it worked.
Edit: I don't know exactly how you are using this, but as an addition: if you also want the pattern to match a string with no attributes, for example:
[quote]Hello[/quote]
You should change the "+" sign into an asterisk as well like this:
/\[quote[^\/]*\].*\[\/quote\]$/
This answer has flaws, see Ridgerunner's answer for a more correct one.
Here's my crack at it.
function filterQuotes(text)
{
    return text.replace(/\[(\/)?quote([^\/]*)?\]/g,"");
}

How do I extract the title value from a string using Javascript regexp?

I have a string variable from which I would like to extract the title value of the id="resultcount" element. The output should be 2.
var str = '<table cellpadding=0 cellspacing=0 width="99%" id="addrResults"><tr></tr></table><span id="resultcount" title="2" style="display:none;">2</span><span style="font-size: 10pt">2 matching results. Please select your address to proceed, or refine your search.</span>';
I tried the following regex but it is not working:
/id=\"resultcount\" title=['\"][^'\"](+['\"][^>]*)>/
Since var str = ... is JavaScript syntax, I assume you need a JavaScript solution. As Peter Corlett said, you can't parse HTML using regular expressions, but if you are using jQuery you can take advantage of the browser's own parser with little effort:
$('#resultcount', '<div>'+str+'</div>').attr('title')
It will return undefined if resultcount is not found or if it has no title attribute.
To make sure it doesn't matter which attribute (id or title) comes first in a string, take entire html element with required id:
var tag = str.replace(/^.*(<[^<]+?id=\"resultcount\".+?\/.+?>).*$/, "$1")
Then find title from previous string:
var res = tag.replace(/^.*title=\"(\d+)\".*$/, "$1");
// res is 2
But, as people have previously mentioned, it is unreliable to use regex for parsing HTML; something as trivial as a different quote (single instead of double) or a space in the "wrong" place will break it.
Please see this earlier response, entitled "You can't parse [X]HTML with regex":
RegEx match open tags except XHTML self-contained tags
Well, since no one else is jumping in on this and I'm assuming you're just looking for a value and not trying to create a parser, I'll give you what works for me with PCRE. I'm not sure how to put it into the java format for you but I think you'll be able to do that.
span id="resultcount" title="(\d+)"
The part you're looking to get is the non-passive group $1 which is the '\d+' part. It will get one or more digits between the quote marks.
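In JavaScript, the same pattern could be applied roughly as follows. This is a sketch that assumes the attributes appear in exactly this order and with double quotes; the function name extractResultCount is hypothetical, and the sample string is a shortened version of the one in the question.

```javascript
// Hypothetical helper: pull the numeric title out of the resultcount span.
function extractResultCount(html) {
  var m = /span id="resultcount" title="(\d+)"/.exec(html);
  return m ? m[1] : undefined; // m[1] is the captured digit run
}

var str = '<table id="addrResults"><tr></tr></table>' +
          '<span id="resultcount" title="2" style="display:none;">2</span>';
console.log(extractResultCount(str)); // 2
```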

JavaScript regex: Not starting with

I want to replace all the occurrences of a string that doesn't start with "<pre>" and doesn't end in "</pre>".
So let's say I wanted to find new-line characters and replace them with "<p/>". I can get the "not followed by" part:
var revisedHtml = html.replace(/[\n](?![<][/]pre[>])/g, "<p/>");
But I don't know the "not starting with" part to put at the front.
Any help please? :)
Here's how Steve Levithan's first lookbehind-alternative can be applied to your problem:
var output = s.replace(/(<pre>[\s\S]*?<\/pre>)|\n/g, function($0, $1){
    return $1 ? $1 : '<p/>';
});
When it reaches a <pre> element, it captures the whole thing and plugs it right back into the output. It never really sees the newlines inside the element, just gobbles them up along with all other content. Thus, when the \n in the regex does match a newline, you know it's not inside a <pre> element, and should be replaced with a <p/>.
But don't make the mistake of regarding this technique as a hack or a workaround; I would recommend this approach even if lookbehinds were available. With the lookaround approach, the regex has to examine every single newline and apply the lookarounds each time to see if it should be replaced. That's a lot of unnecessary work it has to do, plus the regex is a lot more complicated and less maintainable.
As always when using regexes on HTML, I'm ignoring a lot of factors that can affect the result, like SGML comments, CDATA sections, angle brackets in attribute values, etc. You'll have to determine which among those factors you have to deal with in your case, and which ones you can ignore. When it comes to processing HTML with regexes, there's no such thing as a general solution.
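A small worked example may make the behaviour easier to see. The wrapper name markParagraphs and the sample input are hypothetical; the regex and callback are the ones shown above.

```javascript
// Hypothetical wrapper around the replace call above.
function markParagraphs(s) {
  return s.replace(/(<pre>[\s\S]*?<\/pre>)|\n/g, function ($0, $1) {
    // $1 is set only when the <pre> alternative matched; put it back unchanged.
    return $1 ? $1 : '<p/>';
  });
}

var sampleHtml = 'one\ntwo<pre>keep\nthis</pre>three\n';
console.log(markParagraphs(sampleHtml));
```

The newline inside the pre element is preserved, while the two newlines outside it become `<p/>`.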
Why not do the reverse? Look for all the substrings enclosed in <pre> tags. Then you know which parts of your string are not enclosed in <pre>.
EDIT: More elegant solution: use split() and use the <pre> HTML as the delimiters. This gives you the HTML outside the <pre> blocks.
var s = "blah blah<pre>formatted</pre>blah blah<pre>another formatted</pre>end";
var rgx = /<pre>.*?<\/pre>/g;
var nonPreStrings = s.split(rgx);
for (var idx = 0; idx < nonPreStrings.length; idx++) {
    alert(nonPreStrings[idx]);
}
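The split idea can be taken one step further to perform the replacement itself. When the delimiter regex contains a capturing group, split() also returns the captured delimiters, so even indexes hold the text outside pre blocks and odd indexes hold the pre blocks. The sketch below relies on that; the function name replaceOutsidePre is hypothetical, and [\s\S] is used instead of . so that pre blocks spanning several lines are matched.

```javascript
// Hypothetical helper: replace newlines with <p/> only outside <pre> blocks.
function replaceOutsidePre(s) {
  // The capturing group makes split() keep the <pre> blocks in the result array.
  var parts = s.split(/(<pre>[\s\S]*?<\/pre>)/);
  for (var i = 0; i < parts.length; i += 2) {
    // Even indexes are the pieces between <pre> blocks.
    parts[i] = parts[i].replace(/\n/g, '<p/>');
  }
  return parts.join('');
}

console.log(replaceOutsidePre('one\ntwo<pre>keep\nthis</pre>three\n'));
```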
