Arabic text zero width joiners not working between elements

I am trying to implement a "Smart Search" feature which highlights text matches in a div as a user types a keyword. The highlighting works by using a regular expression to match the keyword in the div and replace it with
<span class="highlight">keyword</span>
The application supports both English and Arabic text. English works just fine, but when highlighting Arabic, the span breaks the cursive connection between letters, so the word no longer renders as a single continuous word.
I'm trying to fix the issue by using three separate regular expressions and adding zero width joiners appropriately in each case:
Match at the Beginning of a word
var startsWithRegex = new RegExp("((^|\\s)" + keyword + ")", "gi");
var newSpan = "<span class='highlight'>$1‍</span>‍";
Match in the Middle of a word (Note: There can be multiple middleOf matches in a single word)
var middleOfRegex = new RegExp("([^(^|\\s)])(" + keyword + ")([^($|\\s)])", "gi");
var newSpan = "‍$1‍<span class='highlight'>‍$2‍</span>‍$3‍";
Match at the End of a word
var endsWithRegex = new RegExp("(" + keyword + "($|\\s))", "gi");
var newSpan = "‍<span class='highlight'>‍$1</span>";
Both startsWithRegex and endsWithRegex appear to work as expected, but middleOfRegex is not. For example:
للأبد
transforms into:
ل‍‍ل‍‍أ‍بد
when the keyword is:
ل
I've tried various other combinations of &zwj; (the zero width joiner) but nothing seems to work. Is this a limitation of WebKit? Is there another implementation I can use to get my desired result?
Thanks!
A few extra notes:
This is only happening for Webkit based browsers (Chrome specifically in my case) and we cannot use an alternative. I believe this bug is the root cause of the issue:
https://bugs.webkit.org/show_bug.cgi?id=6148
This question is an extension of these two Stack Overflow questions:
Inserting HTML tag in the middle of Arabic word breaks word connection (cursive)
Partially colored Arabic word in HTML

Arabic is a special case because each letter takes a different form depending on its position in the word. I remember solving a similar problem using Unicode: each positional form of a letter has its own code point.
You can find the Unicode table here
https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
You can get the Unicode value using
var code = $(selector).text().charCodeAt(0);
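As a minimal sketch of that idea, the lookup below maps a base letter and a position to its code point in the Arabic Presentation Forms-B block. The table covers only lam and beh for illustration, and the names PRESENTATION_FORMS and toPresentationForm are hypothetical; a real application would need the full table from the chart linked above.

```javascript
// Illustrative subset of Arabic Presentation Forms-B (hypothetical table name).
var PRESENTATION_FORMS = {
  '\u0644': { isolated: '\uFEDD', final: '\uFEDE', initial: '\uFEDF', medial: '\uFEE0' }, // lam
  '\u0628': { isolated: '\uFE8F', final: '\uFE90', initial: '\uFE91', medial: '\uFE92' }  // beh
};

// Return the positional form of a letter, or the letter itself if it is not in the table.
function toPresentationForm(letter, position) {
  var forms = PRESENTATION_FORMS[letter];
  return forms && forms[position] ? forms[position] : letter;
}

// Format a single UTF-16 code unit as U+XXXX for inspection.
function toHex(ch) {
  var hex = ch.charCodeAt(0).toString(16).toUpperCase();
  return 'U+' + ('0000' + hex).slice(-4);
}

var medialLam = toPresentationForm('\u0644', 'medial');
```

Because a presentation form is fixed, it keeps its joined shape even when a tag separates it from its neighbours, at the cost of no longer being the plain letter for searching or copying.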

I suggest not to separate this ligature, but to extend the <span> tag to enclose the entire lam+alif structure for highlighting.
According to http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237, ZWJ works as ZWJ+ZWNJ+ZWJ between ل(lam) and ا(alif). It should be rendered as a connected lam followed by a connected alif (ل‍‌‍ا), not like the required ligature (لا).
Seems to me most browsers/fonts adhere to this requirement.
My answer applies to other ligatures as well, if you use them in your application (non-required ones, e.g.: mim + mim).
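A rough sketch of extending the span, assuming the keyword is plain text and that only a trailing lam followed by an alef needs special handling. The function name highlightKeepingLamAlef and the list of alef variants are assumptions for illustration, not part of the original code.

```javascript
var LAM = '\u0644';
// Alef with madda, alef with hamza above, alef with hamza below, plain alef.
var ALEF_FORMS = '\u0622\u0623\u0625\u0627';

// Wrap every keyword match in a highlight span. If the keyword ends in lam,
// an immediately following alef is pulled into the same span so the
// lam+alef ligature is not split by the tag.
function highlightKeepingLamAlef(text, keyword) {
  // Escape regex metacharacters so the keyword is matched literally.
  var pattern = keyword.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  if (keyword.charAt(keyword.length - 1) === LAM) {
    pattern += '[' + ALEF_FORMS + ']?';
  }
  return text.replace(new RegExp(pattern, 'g'), "<span class='highlight'>$&</span>");
}

// The word from the question, written with escapes: lam, lam, alef-hamza, beh, dal.
var html = highlightKeepingLamAlef('\u0644\u0644\u0623\u0628\u062F', '\u0644');
```

This does not handle a keyword that starts with alef right after a lam, and it highlights one letter more than the user typed; whether that trade-off is acceptable depends on the application.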

Related

Regex appears to ignore multiple piped characters

Apologies for the awkward question title, I have the following JavaScript:
var wordRe = new RegExp('\\b(?:(?![<^>"])fox|hello(?![<\/">]))\\b', 'g'); // Words regex
console.log('<span>hello</span> <hello>fox</hello> fox link hello my name is fox'.replace(wordRe, 'foo'));
What I'm trying to do is replace any word that isn't nested in an HTML tag or part of an HTML tag itself, i.e. I want to match only "plain" text. The expression seems to ignore the rule for the first piped alternative, "fox", and replaces it when it shouldn't.
Can anyone point out why this is? I think I might have organised the expression incorrectly (at least the negative lookahead).
Here is the JSFiddle.
I'd also like to add that I am aware of the implications of using regex with HTML :)

For your regex to work, you want lookbehind. However, as of this writing, this feature is not supported in JavaScript.
Here is a workaround:
Instead of matching what we want, we will match what we don't want and remove it from our input string. Later, we can perform the replace on the cleaned input string.
var nonWordRe = new RegExp('<([^>]+).*?>[^<]+?</\\1>', 'g');
var test = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
var cleanedTest = test.replace(nonWordRe, '');
var final = cleanedTest.replace(/fox|hello/g, 'foo'); // global flag replaces every match; once trimmed, final == 'foo link foo my name is foo'
Note:
I have built this workaround based on your sample. Here are some points that may need to be explored if you run into them:
you may need to remove self-closing tags (<([^>]+).*?/\>) from the test string
you may need to trim the final string (final)
you may need a decent HTML parser if tags can contain other tags, as HTML allows this.
JavaScript does not, again as of this writing, support recursive patterns.
Demo
http://jsfiddle.net/yXd82/2/
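A variant of the same "match what we don't want" idea, sketched here as an assumption rather than the answer's own method: match whole tagged elements first in an alternation and hand them back unchanged from a replace callback, so the markup stays in the output instead of being deleted. The function name replacePlainWords is hypothetical, and like the workaround above it does not cope with nested tags.

```javascript
// First alternative: a simple non-nested element such as <span>hello</span>.
// Second alternative: the plain words we actually want to replace.
var tagOrWordRe = /<([^>\s]+)[^>]*>[^<]*<\/\1>|\b(?:fox|hello)\b/g;

function replacePlainWords(html) {
  return html.replace(tagOrWordRe, function (match) {
    // Matches that start with "<" are tagged elements: return them untouched.
    return match.charAt(0) === '<' ? match : 'foo';
  });
}

var replaced = replacePlainWords(
  '<span>hello</span> <hello>fox</hello> fox link hello my name is fox'
);
```

Because the element alternative is tried first at each position, words inside an element are consumed as part of that element and never reach the word alternative.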

Strange unselectable character in JavaScript output

I have a JavaScript object which I'm using to populate a form element (with jQuery):
var attribute = { name : '̈́Type' };
$('#container').html('<input type="text" value="'+attribute.name+'"/>');
But the output shows a strange character which is not selectable:
This character is also present when trying:
alert(attribute.name); //in Firefox
console.log(attribute.name); //in Chrome
My JavaScript file has UTF8 encoding.
What is this character and how do I make it go away?

The strange character is a unicode diacritic (\u0344) and is applied to the first single quote ' on the { name : '̈́Type' }declaration.
Just delete the offending single quote and retype it.
You have got something akin to this:
var strange_character = ' \u0344';
var attribute = { name : 'Type' };
$('#container').html('<input type="text" value="'+strange_character + attribute.name+'"/>');

This strange character has code point U+0344 and is called COMBINING GREEK DIALYTIKA TONOS.
Description:
U+0344 was added to Unicode in version 1.1. It belongs to the block
Combining Diacritical Marks in the Basic Multilingual Plane.
This character is a Nonspacing Mark and inherits its script property
from the preceding character.
The glyph is a Canonical composition of the glyphs U+0301 and U+0308. It has a
Ambiguous East Asian Width. In bidirectional context it acts as
Nonspacing Mark and is not mirrored. In text U+0344 behaves as
Combining Mark regarding line breaks. It has type Extend for sentence
and Extend for word breaks. The Grapheme Cluster Break is Extend.
REF: http://codepoints.net/U+0344
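To track down stray marks like this one programmatically, a small sketch using Unicode property escapes (\p{M} matches any combining mark and needs an engine with ES2018 support). The helper names findCombiningMarks and stripCombiningMarks are made up for this example.

```javascript
// List every combining mark in a string with its index and code point.
function findCombiningMarks(str) {
  var found = [];
  var re = /\p{M}/gu;
  var m;
  while ((m = re.exec(str)) !== null) {
    var hex = m[0].codePointAt(0).toString(16).toUpperCase();
    found.push({ index: m.index, codePoint: 'U+' + hex.padStart(4, '0') });
  }
  return found;
}

// Remove all combining marks from a string.
function stripCombiningMarks(str) {
  return str.replace(/\p{M}/gu, '');
}

var marks = findCombiningMarks('\u0344Type');
var cleaned = stripCombiningMarks('\u0344Type');
```

Stripping every combining mark is only safe when the text should contain none; it would also remove legitimate accents in decomposed text and vowel marks in Arabic.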

If you zoom in really close on your question, you will see it's already there. You just need to retype it.

JavaScript + RegEx Complications- Searching Strings Not Containing SubString

I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:
matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');
data.replace(matcher, "$1");
The strangeness around the middle ( ((\\?\\!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?
EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say what exactly the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate value of startText and endText are \\#\\#ASSET_ID\\#\\#_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).
EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!

First of all, why is there a .* Dot Asterisk in the beginning? If you have text like the following:
This is my Text
And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.
That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:
<[^>]+xxx((?!zzz).)*zzz
From there I examine what it's doing.
You are looking for an HTML opening delimiter <. You consume it.
You consume at least one character that is NOT a closing HTML delimiter, but can consume many. This is important, because if your tag is <table border=2>, then you have, at minimum, so far consumed <t, if not more.
You are now looking for a StartText. If that StartText is table, you'll never find it, because you have consumed the t. So replace that + with a *.
The regex still succeeds as long as what follows is NOT the closing text, but it starts from the VERY END of the document, because the asterisk is greedy. I suggest making it lazy by adding a ?.
When the backtracking fails, it will look for the closing text and gather it successfully.
The result of that logic:
<[^>]*xxx((?!zzz).)*?zzz
If you're going to use a dot anyway, which is okay for new regex writers but not suggested for seasoned ones, I'd go with this:
<[^>]*xxx.*?zzz
So for Javascript, your code would say:
matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');
I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
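A hedged sketch of how that final expression might be wrapped for real input. Two assumptions are added here that are not in the answer: the marker strings are escaped so characters like # or . are matched literally, and [\s\S] replaces the dot because a plain dot does not match newlines in JavaScript. The names escapeRegExp and extractRange and the sample markup are illustrative only.

```javascript
// Escape regex metacharacters so the markers are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return the text from the tag containing startText up to and including
// the first later occurrence of endText, or null if there is no match.
function extractRange(data, startText, endText) {
  var matcher = new RegExp(
    '<[^>]*' + escapeRegExp(startText) + '[\\s\\S]*?' + escapeRegExp(endText),
    'i'
  );
  var m = data.match(matcher);
  return m ? m[0] : null;
}

// Made-up sample row; the ids only imitate the ASSET_ID_<YYYY_MM_DD> shape.
var sample = '<tr><td id="A1_2014_01_01">1</td>\n' +
  '<td id="A1_2014_01_02">2</td><td id="A1_2014_01_03">3</td></tr>';
var range = extractRange(sample, 'A1_2014_01_01', 'A1_2014_01_02');
```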

Need to identify the non-matching character in regex

We use a regex to test for 'illegal' characters when a user provides a 'Version Name' before we save their content. The accepted characters are: A-Z, 0-9 and blank space. We test this using the following:
var version_name = document.getElementById('txtSaveVersionName').value;
if(version_name.search(/[^A-Za-z0-9\s]/)!= -1){
alert("Warning illegal characters have been removed etc");
version_name.replace(/[^A-Za-z0-9\s]/g,'');
document.getElementById('txtSaveVersionName').value = version_name;
}
This works fine when a user keys their version name. However the version name can also be populated from data taken from a dynamically populated select box - version names loaded in from our system.
When this occurs, the regexp throws out the space in the name. So "My Version" becomes "MyVersion"? This does not occur when the user types "My Version".
So it appears that the value taken from the select box contains a character that looks like a space but is not. I have copied this value from the text box into a Unicode converter (http://rishida.net/tools/conversion/) that identifies the characters' underlying values, and both sets are reported as 0020 (space), yet only one of them misbehaves.
Is there a way to identify what the character is that is causing this issue?
Any thoughts greatly appreciated!
Cheers
Mark

Try:
var str= getSelectBoxValue();
var rez = "";
for (var i=0;i<str.length;i++)
rez = rez+str[i]+"["+str.charCodeAt(i)+"]";
alert(rez);
It should give you the unicode values of all the characters in the string the way Javascript sees them. When you copy it from the screen, it could be the browser/OS that converts some weird UTF character into regular "0x20" character for some reason.
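Building on the loop above, a small sketch that reports only the characters that count as whitespace without being the ordinary space U+0020, for example a no-break space U+00A0. The helper names describeChars and suspiciousWhitespace are hypothetical.

```javascript
// Format every UTF-16 code unit of a string as U+XXXX.
function describeChars(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var hex = str.charCodeAt(i).toString(16).toUpperCase();
    out.push('U+' + ('0000' + hex).slice(-4));
  }
  return out;
}

// Report whitespace characters that are not the plain space U+0020.
function suspiciousWhitespace(str) {
  var codes = describeChars(str);
  var found = [];
  for (var i = 0; i < str.length; i++) {
    if (/\s/.test(str.charAt(i)) && str.charAt(i) !== ' ') {
      found.push({ index: i, code: codes[i] });
    }
  }
  return found;
}

var report = suspiciousWhitespace('My\u00A0Version');
```

Running this on the value read directly from the select box, rather than on text copied from the screen, avoids any conversion done by the clipboard.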

I noticed you have a bug in your code:
version_name.replace(/[^A-Za-z0-9\s]/g,'');
Should be
version_name = version_name.replace(/[^A-Za-z0-9\s]/g,'');
As, of course, replace creates a new string, it doesn't modify the existing string.
As you are finding that the replace sometimes works and sometimes doesn't, I would suspect that you have implemented this correctly in one place and incorrectly in another.

removing phpbb tag using regex javascript

I'm trying to remove square-bracket (BBCode-style) tags using JavaScript, in order to strip unwanted BBCode.
I tried this:
theString .replace(/\[quote[^\/]+\]*\[\/quote\]/, "")
it works with this string sample:
theString = "[quote=MyName;225]Test 123[/quote]";
but it fails with this sample:
theString = "[quote=MyName;225]Test [quote]inside quotes[/quote]123[/quote]";
If there is a solution that doesn't use regex, that's fine too.

The other 2 solutions simply do not work (see my comments). To solve this problem you first need to craft a regex which matches the innermost matching quote elements (which contain neither [QUOTE..] nor [/QUOTE]). Next, you need to iterate, applying this regex over and over until there are no more QUOTE elements left. This tested function does what you want:
function filterQuotes(text)
{ // Regex matches inner [QUOTE]non-quote-stuff[/quote] tag.
var re = /\[quote[^\[]+(?:(?!\[\/?quote\b)\[[^\[]*)*\[\/quote\]/ig;
while (text.search(re) !== -1)
{ // Need to iterate removing QUOTEs from inside out.
text = text.replace(re, "");
}
return text;
}
Note that this regex employs Jeffrey Friedl's "Unrolling the loop" efficiency technique and is not only accurate, but is quite fast to boot.
See: Mastering Regular Expressions (3rd Edition) (highly recommended).
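Since the question is open to approaches other than a single regex, here is an alternative sketch that scans the tags once and tracks nesting depth instead of looping over replace. It is not the method above: the function name stripQuotes is hypothetical, and unlike the regex version it drops everything after an unclosed opening tag.

```javascript
// Remove [quote ...]...[/quote] blocks, including nested ones, in one pass.
function stripQuotes(text) {
  var tagRe = /\[quote\b[^\]]*\]|\[\/quote\]/gi;
  var out = '';
  var depth = 0; // how many quote blocks we are currently inside
  var last = 0;  // index just past the previous tag
  var m;
  while ((m = tagRe.exec(text)) !== null) {
    // Text between tags is kept only when we are outside every quote.
    if (depth === 0) out += text.slice(last, m.index);
    if (m[0].charAt(1) === '/') {
      if (depth > 0) depth--; // a stray closing tag at depth 0 is simply dropped
    } else {
      depth++;
    }
    last = tagRe.lastIndex;
  }
  if (depth === 0) out += text.slice(last);
  return out;
}

var nested = stripQuotes('[quote=MyName;225]Test [quote]inside quotes[/quote]123[/quote]');
var mixed = stripQuotes('a [quote]b[/quote] c');
```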

Try this one:
/\[quote[^\/]+\].*\[\/quote\]$/
The $ sign anchors the match to the end of the string, so only the closing quote element at the very end is used to determine where the quote you're trying to remove ends.
I also added a "." before the asterisk so that any character in between is matched. I tested this with your two strings and it worked.
Edit: I don't know exactly how you are using this, but as an addition: if you also want the pattern to match a string where the tag has no attributes, for example:
[quote]Hello[/quote]
You should change the "+" sign into an asterisk as well like this:
/\[quote[^\/]*\].*\[\/quote\]$/

This answer has flaws, see Ridgerunner's answer for a more correct one.
Here's my crack at it.
function filterQuotes(text)
{
return text.replace(/\[(\/)?quote([^\/]*)?\]/g,"");
}
