I have a JavaScript object which I'm using to populate a form element (with jQuery):
var attribute = { name : '̈́Type' };
$('#container').html('<input type="text" value="'+attribute.name+'"/>');
But the output shows a strange character which is not selectable:
This character is also present when trying:
alert(attribute.name); //in Firefox
console.log(attribute.name); //in Chrome
My JavaScript file has UTF8 encoding.
What is this character and how do I make it go away?
The strange character is a Unicode combining diacritic (\u0344), and it is attached to the opening single quote of the name value in the attribute declaration.
Just delete the offending single quote and retype it.
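If retyping is not practical, a small helper can make the hidden character visible. The following is only a sketch: the function names describeCodePoints and stripCombiningMarks are made up for illustration, and it assumes an engine with ES2018 Unicode property escapes.

```javascript
// List every code point in a string as "U+XXXX" so invisible marks show up.
function describeCodePoints(text) {
  return Array.from(text, function (ch) {
    return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  });
}

// Remove nonspacing combining marks (general category Mn), such as U+0344.
// Note: this would also strip legitimate accents stored in decomposed form.
function stripCombiningMarks(text) {
  return text.replace(/\p{Mn}/gu, "");
}

var suspicious = "\u0344Type"; // same content as the string in the question
console.log(describeCodePoints(suspicious).join(" ")); // U+0344 U+0054 U+0079 U+0070 U+0065
console.log(stripCombiningMarks(suspicious)); // Type
```

Dumping the code points first is the safer step; stripping is only appropriate when the text is not supposed to contain combining marks at all.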
You have got something akin to this:
var strange_character = ' \u0344';
var attribute = { name : 'Type' };
$('#container').html('<input type="text" value="'+strange_character + attribute.name+'"/>');
This strange character has the code U+0344 and is called COMBINING GREEK DIALYTIKA TONOS.
Description:
U+0344 was added to Unicode in version 1.1. It belongs to the block
Combining Diacritical Marks in the Basic Multilingual Plane.
This character is a Nonspacing Mark and inherits its script property
from the preceding character.
The glyph is a Canonical composition of the glyphs U+0301 and U+0308. It has a
Ambiguous East Asian Width. In bidirectional context it acts as
Nonspacing Mark and is not mirrored. In text U+0344 behaves as
Combining Mark regarding line breaks. It has type Extend for sentence
and Extend for word breaks. The Grapheme Cluster Break is Extend.
REF: http://codepoints.net/U+0344
If you zoom in really closely on your question, you will see the character is already there. You just need to retype the quote.
Related
I am trying to implement a "Smart Search" feature which highlights text matches in a div as a user types a keyword. The highlighting works by using a regular expression to match the keyword in the div and replace it with
<span class="highlight">keyword</span>
The application supports both English and Arabic text. English works just fine, but when highlighting Arabic, the span "breaks" the cursive connection between letters, so the word no longer renders as a single continuous word.
I'm trying to fix the issue by using 3 separate Regex expressions and adding zero width joiners appropriately to each case:
Match at the Beginning of a word
var startsWithRegex = new RegExp("((^|\\s)" + keyword + ")", "gi");
var newSpan = "<span class='highlight'>$1</span>";
Match in the Middle of a word (Note: There can be multiple middleOf matches in a single word)
var middleOfRegex = new RegExp("([^(^|\\s)])(" + keyword + ")([^($|\\s)])", "gi");
var newSpan = "$1<span class='highlight'>$2</span>$3";
Match at the End of a word
var endsWithRegex = new RegExp("(" + keyword + "($|\\s))", "gi");
var newSpan = "<span class='highlight'>$1</span>";
Both startsWithRegex and endsWithRegex appear to work as expected, but middleOfRegex is not. For example:
للأبد
transforms into:
للأبد
when the keyword is:
ل
I've tried various other combinations of zero width joiners, but nothing seems to be working. Is this a limitation of WebKit? Is there another implementation I can use to get my desired result?
Thanks!
A few extra notes:
This is only happening for Webkit based browsers (Chrome specifically in my case) and we cannot use an alternative. I believe this bug is the root cause of the issue:
https://bugs.webkit.org/show_bug.cgi?id=6148
This question is an extension on these two stackoverflow questions:
Inserting HTML tag in the middle of Arabic word breaks word connection (cursive)
Partially colored Arabic word in HTML
Arabic is a special case because each letter has different forms depending on its position in the word. I remember solving a similar problem using Unicode: each form of a letter has its own code point.
You can find the Unicode table here
https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
You can get the Unicode value using
var code = $(selector).text().charCodeAt(0);
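As a rough illustration of the idea, the sketch below uses the letter lam from the question. The object name lamForms and the helper toBaseLetter are hypothetical; the code points come from the Arabic Presentation Forms-B block, and the mapping back to the base letter relies on NFKC normalization.

```javascript
// Base letter lam and its four positional presentation forms.
var lam = "\u0644";
var lamForms = {
  isolated: "\uFEDD",
  final: "\uFEDE",
  initial: "\uFEDF",
  medial: "\uFEE0"
};

// Map a presentation form back to its base letter via compatibility normalization.
function toBaseLetter(ch) {
  return ch.normalize("NFKC");
}

console.log(lam.charCodeAt(0).toString(16)); // 644
console.log(toBaseLetter(lamForms.initial) === lam); // true
```

Whether swapping in presentation forms is acceptable depends on the application, since the stored text then no longer matches what the user typed.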
I suggest not to separate this ligature, but to extend the <span> tag to enclose the entire lam+alif structure for highlighting.
According to http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237, ZWJ works as ZWJ+ZWNJ+ZWJ between ل(lam) and ا(alif). It should be rendered as a connected lam followed by a connected alif (لا), not like the required ligature (لا).
Seems to me most browsers/fonts adhere to this requirement.
My answer applies to other ligatures as well, if you use them in your application (non-required ones, e.g.: mim + mim).
I've searched Stack Overflow and found similar questions, but none of them is really what I want.
Given an xml string: "<a b=\"c\"></a>" in javascript context, I want to create a regex that will capture the attribute value including the quotation marks.
NOTE: this is similar if you're using single quotation marks.
Currently I have a regular expression tailored to the XML specification:
[_A-Za-z][\w\.\-]*(?:=\"[^\"]*\")?
[_A-Za-z][\w\.\-]* //This will match the attribute name.
(?:=\"[^\"]*\")? //This will match the attribute value.
\"[^\"]*\" //This part concerns me.
My question now is, what if the xml string looks like this:
<shout statement="Hi! \"Richeve\"."></shout>
I know this may be a dumb question, but I want to handle the rare cases where this scenario happens. (I know the coder can use single quotes here, but the attribute value changes dynamically at runtime, so we can't always know what it contains.)
So to make this clearer, the result of that using the correct regex should be:
"Hi! \"Richeve\"."
I hope my question is clear. Thanks for all the help!
PS: Note that the language context is Javascript and I know it is tempting to use lookbehinds but currently lookbehinds are not supported.
PS: I know it is really hard to parse XML, but I have an elegant solution to this :) so I just need this small problem to be solved. The only focus of this question is capturing quoted string tokens that contain quotation marks inside the token.
The standard pattern for content with matching delimiters and embedded escaped delimiters goes like this:
"[^"\\]*(?:\\.[^"\\]*)*"
Ignoring the obvious first and last characters in the pattern, here's how the rest of the pattern works:
[^"\\]*: Consume all characters until a delimiter OR backslash (matching Hi! in your example)
(?:\\.[^"\\]*)* Try to consume a single escaped character \\. followed by a series of non delimiter/backslash characters, repeatedly (matching \"Richeve first and then \". next in your example)
That's it.
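A quick way to try the pattern against the example from the question is sketched below. The variable names quoted and xml are only for illustration.

```javascript
// Opening quote, a run of ordinary characters, then any number of
// (escaped character + run of ordinary characters), then the closing quote.
var quoted = /"[^"\\]*(?:\\.[^"\\]*)*"/;

// Actual string content: <shout statement="Hi! \"Richeve\"."></shout>
var xml = '<shout statement="Hi! \\"Richeve\\"."></shout>';

var match = xml.match(quoted);
console.log(match[0]); // "Hi! \"Richeve\"."
```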
You can try to use a more generic delimiter approach using (['"]) and back references, or you can just allow for an alternate pattern with single quotes like so:
("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')
Here's another description of this technique that might also help (see the section called Strings): http://www.regular-expressions.info/examplesprogrammer.html
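For the back-reference variant mentioned above, one possible shape is sketched here. This exact pattern is an assumption rather than something taken from the linked article: it captures the opening quote and uses a negative lookahead (not a lookbehind) so that only the same quote character can close the token.

```javascript
// Group 1 remembers which quote opened the token; \1 requires the same one to close it.
var anyQuoted = /(["'])(?:\\.|(?!\1)[^\\])*\1/g;

// Actual string content: <a b="it's" d='e\'f'></a>
var tag = "<a b=\"it's\" d='e\\'f'></a>";

var tokens = Array.from(tag.matchAll(anyQuoted), function (m) { return m[0]; });
console.log(tokens); // [ '"it\'s"', "'e\\'f'" ]
```

The lookahead version is shorter than spelling out both alternatives, at the cost of being a little slower and harder to read.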
Description
I'm pretty sure embedding double quotes inside a double-quoted attribute value is not legal XML. You could use a character reference for the double quote (&#x22; or &quot;) inside the value instead.
However to answer the question, this expression will:
allow escaped quotes inside attribute values
capture the statement attribute's value
allow attributes to appear in any order inside the tag
avoid many of the edge cases which trip up pattern matching inside html text
not use lookbehinds
<shout\b(?=\s)(?=(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*?\sstatement=(['"])((?:\\['"]|.)*?)\1(?:\s|\/>|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/shout>
Example
Pretty Rubular
Ugly RegexPlanet set to Javascript
Sample Text
Note the difficult edge case in the first attribute :)
<shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
Matches
Group 0 gets the entire tag from open to close
Group 1 gets the quote surrounding the statement attribute value, this is used to match the closing quote correctly
Group 2 gets the statement attribute value which may include escaped quotes like \" but not including the surrounding quotes
[0][0] = <shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
[0][1] = "
[0][2] = Hi! \"Richeve\".
We use a regex to test for 'illegal' characters when a user provides a 'Version Name' before we save their content. The accepted characters are: A-Z, 0-9 and blank space. We test this using the following:
var version_name = document.getElementById('txtSaveVersionName').value;
if(version_name.search(/[^A-Za-z0-9\s]/)!= -1){
alert("Warning illegal characters have been removed etc");
version_name.replace(/[^A-Za-z0-9\s]/g,'');
document.getElementById('txtSaveVersionName').value = version_name;
}
This works fine when a user keys their version name. However the version name can also be populated from data taken from a dynamically populated select box - version names loaded in from our system.
When this occurs, the regexp throws out the space in the name. So "My Version" becomes "MyVersion"? This does not occur when the user types "My Version".
So it appears that the value taken from the select box contains a character that looks like a space but is not. I have copied this value from the text box into a Unicode converter (http://rishida.net/tools/conversion/) that identifies the characters' underlying values, and both sets are reported as 0020 (space), yet only one of them triggers the problem.
Is there a way to identify what the character is that is causing this issue?
Any thoughts greatly appreciated!
Cheers
Mark
Try:
var str= getSelectBoxValue();
var rez = "";
for (var i=0;i<str.length;i++)
rez = rez+str[i]+"["+str.charCodeAt(i)+"]";
alert(rez);
It should give you the Unicode values of all the characters in the string the way JavaScript sees them. When you copy the text from the screen, the browser or OS may convert some unusual Unicode character into a regular space (0x20) for some reason.
I noticed you have a bug in your code:
version_name.replace(/[^A-Za-z0-9\s]/g,'');
Should be
version_name = version_name.replace(/[^A-Za-z0-9\s]/g,'');
As, of course, replace creates a new string, it doesn't modify the existing string.
As you are finding that the replace sometimes works and sometimes doesn't, I would suspect that you have implemented this correctly in one place and incorrectly in another.
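To make the point about replace concrete, here is a minimal sketch. The function name sanitizeVersionName is hypothetical, and it works on a plain string rather than reading the txtSaveVersionName element, so it can be run outside a page.

```javascript
// Strings are immutable: replace() returns a new string and leaves the original alone.
function sanitizeVersionName(name) {
  var cleaned = name.replace(/[^A-Za-z0-9\s]/g, "");
  return { cleaned: cleaned, changed: cleaned !== name };
}

var original = "My Version!";
original.replace(/[^A-Za-z0-9\s]/g, ""); // result discarded, original unchanged
console.log(original); // My Version!

var result = sanitizeVersionName(original);
console.log(result.cleaned, result.changed); // My Version true
```

Returning the cleaned value together with a changed flag lets the caller decide whether to show the warning, instead of running the same regular expression twice.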
I've come across an error in my web app that I'm not sure how to fix.
Text boxes are sending me the long dash as part of their content (you know, the special long dash that MS Word automatically inserts sometimes). However, I can't find a way to replace it; since if I try to copy that character and put it into a JavaScript str.replace statement, it doesn't render right and it breaks the script.
How can I fix this?
The specific character that's killing it is —.
Also, if it helps, I'm passing the value as a GET parameter, and then encoding it in XML and sending it to a server.
This code might help:
text = text.replace(/\u2013|\u2014/g, "-");
It replaces all en dashes (–) and em dashes (—) with simple hyphens (-).
DEMO: http://jsfiddle.net/F953H/
That character is called an Em Dash. You can replace it like so:
str.replace('\u2014', '');
Here is an example Fiddle: http://jsfiddle.net/x67Ph/
The \u2014 is called a Unicode escape sequence. These allow you to specify a Unicode character by its code. U+2014 happens to be the Em Dash.
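One detail worth keeping in mind: when the first argument to replace is a string, only the first occurrence is replaced, and the result still has to be assigned somewhere. A small sketch with made-up variable names:

```javascript
var input = "a\u2014b\u2014c"; // a—b—c

// String pattern: only the first em dash is replaced.
var firstOnly = input.replace("\u2014", "-");

// Regular expression with the g flag: every em dash is replaced.
var all = input.replace(/\u2014/g, "-");

console.log(firstOnly); // a-b—c
console.log(all);       // a-b-c
```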
There are three unicode long-ish dashes you need to worry about: http://en.wikipedia.org/wiki/Dash
You can replace unicode characters directly by using the unicode escape:
'—my string'.replace( /[\u2012\u2013\u2014\u2015]/g, '' )
There may be more characters behaving like this, and you may want to reuse them in HTML later. A more generic way to deal with it could be to replace all 'extended characters' with their HTML-encoded equivalent. You could do that like this:
[yourstring].replace(/[\u0080-\uC350]/g,
function(a) {
return '&#'+a.charCodeAt(0)+';';
}
);
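A runnable variant of the same idea is sketched below. It deliberately differs from the snippet above in two ways, both of which are assumptions on my part: it matches everything outside ASCII instead of the \u0080-\uC350 range, and it uses the u flag with codePointAt so that characters beyond the Basic Multilingual Plane become a single numeric reference. The name encodeNonAscii is hypothetical.

```javascript
// Replace every non-ASCII code point with a decimal numeric character reference.
function encodeNonAscii(text) {
  return text.replace(/[^\x00-\x7F]/gu, function (ch) {
    return "&#" + ch.codePointAt(0) + ";";
  });
}

console.log(encodeNonAscii("\u2014my string")); // &#8212;my string
```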
With the ECMAScript 2018 standard, JavaScript RegExp now supports Unicode property (or category) classes. One of them, \p{Dash}, matches any Unicode code point that is a dash:
/\p{Dash}/gu
In ES5, the equivalent expression is:
/[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g
See the Unicode Utilities reference.
Here are some JavaScript examples:
const text = "Dashes: \uFF0D\uFE63\u058A\u1400\u1806\u2010-\u2013\uFE32\u2014\uFE58\uFE31\u2015\u2E3A\u2E3B\u2053\u2E17\u2E40\u2E5D\u301C\u30A0\u2E1A\u05BE\u2212\u207B\u208B\u3030𐺭";
const es5_dash_regex = /[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g;
console.log(text.replace(es5_dash_regex, '-')); // Normalize each dash to ASCII hyphen
// => Dashes: ----------------------------
To match one or more dashes and replace with a single char (or remove in one go):
/\p{Dash}+/gu
/(?:[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)+/g
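As a short usage sketch of the \p{Dash}+ form (the function name collapseDashes is made up, and an ES2018-capable engine is assumed):

```javascript
// Collapse each run of dash-like characters into one ASCII hyphen.
function collapseDashes(text) {
  return text.replace(/\p{Dash}+/gu, "-");
}

console.log(collapseDashes("2010\u2013\u20142020 -- ok")); // 2010-2020 - ok
```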
I'm using Fancy Upload 3 and onSelect of a file I need to run a check to make sure the user doesn't have any bad characters in the filename. I'm currently getting people uploading files with hieroglyphics and such in the names.
What I need is to check if the filename only contains:
A-Z
a-z
0-9
_ (underscore)
- (minus)
SPACE
ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü (as single and double byte)
Obviously you can see the difficult thing there. The non-english single and double byte chars.
I've seen this:
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]
And this:
[\x80-\xA5]
But neither of them fully cover the situation right.
Examples that should work:
fást.zip
abc.zip
ABC.zip
Über.zip
Examples that should NOT work:
∑∑ø∆.zip
¡wow!.zip
•§ªº¶.zip
The following is close, but I'm NO RegEx'pert, not even close.
var filenameReg = /^[A-Za-z0-9-_]|[\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF]+$/;
Thanks in advance.
Solution from Zafer mostly works, but it does not catch all of the other symbols, see below.
Uncaught:
¡£¢§¶ª«ø¨¥®´åß©¬æ÷µç
Caught:
™∞•–≠'"πˆ†∑œ∂ƒ˙∆˚…≥≤˜∫√≈Ω
Regex:
var filenameReg = /^([A-Za-z0-9\-_. ]|[\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF])+$/;
Alternation between two character classes (i.e. [abc]|[def]) can be simplified to a single character class ([abcdef]) -- the first can be read as "(a or b or c) OR (d or e or f)"; the second as "(a or b or c or d or e or f)". What probably tripped up your regular expression is the unescaped dash in the first class -- if you want a literal dash, it should be the last character in the class.
So we'll modify your expression to get it working:
var filenameReg = /^[A-Za-z0-9_\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF-]+$/;
The problem now is that you're not accounting for the file extension, but that is an easy modification (assuming you're always getting .zip files):
var filenameReg = /^[A-Za-z0-9_\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF-]+\.zip$/;
Replace zip with another pattern if the extension differs.
It looks like it is the character ranges that are causing the problem, because they include some unallowable characters in between. Since you already have the list of allowable characters, the best thing would be to just use that directly:
var filenameReg = /^[A-Za-z0-9_\-\ ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü]+$/;
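One thing worth checking in the range-based patterns: in JavaScript, \x takes exactly two hex digits, so \x00A0 is read as \x00 followed by the literal characters A and 0; four-digit code points are written with \u. The explicit list avoids that problem entirely. The sketch below builds on it with two assumptions of my own: it allows a single dot plus an alphanumeric extension, and it normalizes to NFC first so that accented letters typed as base letter plus combining mark (one reading of "single and double byte") are accepted too. The names ALLOWED_ACCENTED and isValidFilename are hypothetical.

```javascript
// Accented letters copied from the list in the question.
var ALLOWED_ACCENTED = "ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü";

// Name part: letters, digits, underscore, space, the accented letters, and a
// literal dash (placed last in the class). Then a dot and an extension.
var filenamePattern = new RegExp(
  "^[A-Za-z0-9_ " + ALLOWED_ACCENTED + "-]+\\.[A-Za-z0-9]+$"
);

function isValidFilename(name) {
  // NFC turns "u" + U+0308 into the single character "ü" before testing.
  return filenamePattern.test(name.normalize("NFC"));
}

console.log(isValidFilename("fást.zip"));  // true
console.log(isValidFilename("∑∑ø∆.zip"));  // false
```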
The following should work:
var filenameReg = /^([A-Za-z0-9\-_. ]|[\x00A0-\xD7FF\xF900-\xFDCF\xFDF0-\xFFEF])+$/;
I've put \ next to - and grouped two expressions otherwise + sign doesn't affect the first expression.
EDIT 1 :I've also put . in the expression.
We have different rules for different platforms, but I think you mean long file names in Windows. For that you can use the following RegEx:
var longFilenames = /^[^.\/:*?"<>|][^\/:*?"<>|]{0,254}$/;
NOTE: Instead of saying which characters are allowed, you need to say which ones are not allowed!
But keep in mind that this RegEx is not 100% complete. If you really want to make it complete, you have to add exceptions for reserved names as well.
You can find more information about filename rules here:
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx
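A sketch of what the reserved-name exceptions might look like in JavaScript follows. The function name isValidWindowsFilename is made up, the 255-character limit is an assumption about the target file system, and the check is not exhaustive (for example, it does not reject control characters).

```javascript
// Characters Windows does not allow in a file name.
var INVALID_CHARS = /[\\\/:*?"<>|]/;

// Reserved device names, with or without an extension (e.g. "NUL.txt").
var RESERVED_NAMES = /^(?:CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(?:\..*)?$/i;

function isValidWindowsFilename(name) {
  if (name.length === 0 || name.length > 255) return false;
  if (INVALID_CHARS.test(name)) return false;
  if (RESERVED_NAMES.test(name)) return false;
  if (/[ .]$/.test(name)) return false; // must not end with a space or a period
  return true;
}

console.log(isValidWindowsFilename("report.zip")); // true
console.log(isValidWindowsFilename("NUL.txt"));    // false
```

Splitting the rules into separate checks keeps each one readable and makes it easy to report which rule a given name violated.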