JavaScript remove ZERO WIDTH SPACE (unicode 8203) from string - javascript

I'm writing some javascript that processes website content. My efforts are being thwarted by SharePoint text editor's tendency to put the "zero width space" character in the text when the user presses backspace.
The character's unicode value is 8203, or B200 in hexadecimal. I've tried to use the default "replace" function to get rid of it. I've tried many variants, none of them worked:
var a = "o​m"; //the invisible character is between o and m
var b = a.replace(/\u8203/g,'');
= a.replace(/\uB200/g,'');
= a.replace("\\uB200",'');
and so on and so forth. I've tried quite a few variations on this theme. None of these expressions work (tested in Chrome and Firefox). The only thing that works is typing the actual character in the expression:
var b = a.replace("​",''); //it's there, believe me
This poses potential problems. The character is invisible, so that line doesn't make sense on its own. I can get around that with comments. But if the code is ever reused and the file is saved using a non-Unicode encoding (or when it's deployed to SharePoint, where there's no guarantee the encoding won't get messed up), it will stop working. Is there a way to write this using the Unicode notation instead of the character itself?
[My ramblings about the character]
In case you haven't met this character, (and you probably haven't, seeing as it's invisible to the naked eye, unless it broke your code and you discovered it while trying to locate the bug) it's a real a-hole that will cause certain types of pattern matching to malfunction. I've caged the beast for you:
[​] <- careful, don't let it escape.
If you want to see it, copy those brackets into a text editor and then iterate your cursor through them. You'll notice you'll need three steps to pass what seems like 2 characters, and your cursor will skip a step in the middle.

The number in a unicode escape should be in hex, and the hex for 8203 is 200B (which is indeed a Unicode zero-width space), so:
var b = a.replace(/\u200B/g,'');
Live Example:
var a = "o​m"; //the invisible character is between o and m
var b = a.replace(/\u200B/g,'');
console.log("a.length = " + a.length); // 3
console.log("a === 'om'? " + (a === 'om')); // false
console.log("b.length = " + b.length); // 2
console.log("b === 'om'? " + (b === 'om')); // true
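A small sketch of the same idea, assuming you also want to drop the closely related zero-width characters U+200C, U+200D and U+FEFF. That wider set and the helper name stripZeroWidth are my own assumptions, not part of the question; note that U+200C and U+200D are meaningful in some scripts, so narrow the class if that matters for your content.

```javascript
// 8203 is the decimal code point; \u escapes expect four hex digits.
const hex = (8203).toString(16).toUpperCase().padStart(4, '0'); // "200B"

// Hypothetical helper: removes ZERO WIDTH SPACE (U+200B), ZERO WIDTH
// NON-JOINER (U+200C), ZERO WIDTH JOINER (U+200D) and the BOM (U+FEFF).
function stripZeroWidth(text) {
  return text.replace(/[\u200B-\u200D\uFEFF]/g, '');
}

const dirty = 'o\u200Bm'; // written with an escape, so the source file stays ASCII
const clean = stripZeroWidth(dirty);

console.log(hex, dirty.length, clean.length, clean === 'om');
```

Because the pattern uses escapes only, the file survives being saved in a non-Unicode encoding.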

The accepted answer didn't work for my case.
But this one did:
text.replace(/(^[\s\u200b]*|[\s\u200b]*$)/g, '')
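One likely reason a plain trim is not enough here: JavaScript's \s and String.prototype.trim() do not treat U+200B as whitespace, so it survives at the edges of the string. A minimal sketch with a made-up sample string:

```javascript
// Sample input (an assumption for illustration): zero-width spaces at both
// ends and one in the middle.
const padded = '\u200B  hello\u200B world \u200B';

// trim() stops at the zero-width spaces, so nothing is removed here.
const viaTrim = padded.trim();

// Treating U+200B like whitespace at the edges removes the padding but keeps
// the zero-width space in the middle of the text.
const viaRegex = padded.replace(/^[\s\u200B]+|[\s\u200B]+$/g, '');

console.log(viaTrim.length, viaRegex.length);
```

If the inner occurrences should go as well, combine this with a global replace of \u200B.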

Related

Check strings equality without direction? [duplicate]

How can I remove non-printable unicode characters in a multi-language input?
When users with different localizations paste strings they will sometimes unintentionally embed non-printing characters. For example:
var weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC"
var displaysAs = decodeURI(weird); // Users see only "Test"
But I can't figure out how to strip the non-printing characters in a way that doesn't impact other languages like these:
encodeURI("شنط") = "%D8%B4%D9%86%D8%B7"
encodeURI("戦艦帝国") = "%E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD"
For example, the following attempt to repair the weird example above doesn't work:
var weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC";
var displaysAs = decodeURI(weird);
var stillWeird = encodeURI(displaysAs.replace(/\s/g, ""));
// value is again "%E2%80%AA%E2%80%8ETest%E2%80%AC"
console.log('before:', weird);
console.log('after:', displaysAs);
console.log('again:', stillWeird);
As noted in the comments, this is largely a specification problem. I don't have an enumeration of non-printing unicode expressions. I can only observe that one can paste a unicode string into a browser Input and not be aware that it has undisplayed characters in it. I assume that some logic in the browser determines whether each unicode character will display something. This problem would be solved if I can apply that same logic to the underlying string in order to get the "display string."
Put another way: For any two unicode strings that look identical on the browser, I need a transformation that guarantees that their values are identical.
You can use the regular expression found in this other answer.
Example with array of the three strings provided in the question:
let weird = [
"%E2%80%AA%E2%80%8ETest%E2%80%AC",
"%D8%B4%D9%86%D8%B7",
"%E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD"
];
const expr = /[\0-\x1F\x7F-\x9F\xAD\u0378\u0379\u037F-\u0383\u038B\u038D\u03A2\u0528-\u0530\u0557\u0558\u0560\u0588\u058B-\u058E\u0590\u05C8-\u05CF\u05EB-\u05EF\u05F5-\u0605\u061C\u061D\u06DD\u070E\u070F\u074B\u074C\u07B2-\u07BF\u07FB-\u07FF\u082E\u082F\u083F\u085C\u085D\u085F-\u089F\u08A1\u08AD-\u08E3\u08FF\u0978\u0980\u0984\u098D\u098E\u0991\u0992\u09A9\u09B1\u09B3-\u09B5\u09BA\u09BB\u09C5\u09C6\u09C9\u09CA\u09CF-\u09D6\u09D8-\u09DB\u09DE\u09E4\u09E5\u09FC-\u0A00\u0A04\u0A0B-\u0A0E\u0A11\u0A12\u0A29\u0A31\u0A34\u0A37\u0A3A\u0A3B\u0A3D\u0A43-\u0A46\u0A49\u0A4A\u0A4E-\u0A50\u0A52-\u0A58\u0A5D\u0A5F-\u0A65\u0A76-\u0A80\u0A84\u0A8E\u0A92\u0AA9\u0AB1\u0AB4\u0ABA\u0ABB\u0AC6\u0ACA\u0ACE\u0ACF\u0AD1-\u0ADF\u0AE4\u0AE5\u0AF2-\u0B00\u0B04\u0B0D\u0B0E\u0B11\u0B12\u0B29\u0B31\u0B34\u0B3A\u0B3B\u0B45\u0B46\u0B49\u0B4A\u0B4E-\u0B55\u0B58-\u0B5B\u0B5E\u0B64\u0B65\u0B78-\u0B81\u0B84\u0B8B-\u0B8D\u0B91\u0B96-\u0B98\u0B9B\u0B9D\u0BA0-\u0BA2\u0BA5-\u0BA7\u0BAB-\u0BAD\u0BBA-\u0BBD\u0BC3-\u0BC5\u0BC9\u0BCE\u0BCF\u0BD1-\u0BD6\u0BD8-\u0BE5\u0BFB-\u0C00\u0C04\u0C0D\u0C11\u0C29\u0C34\u0C3A-\u0C3C\u0C45\u0C49\u0C4E-\u0C54\u0C57\u0C5A-\u0C5F\u0C64\u0C65\u0C70-\u0C77\u0C80\u0C81\u0C84\u0C8D\u0C91\u0CA9\u0CB4\u0CBA\u0CBB\u0CC5\u0CC9\u0CCE-\u0CD4\u0CD7-\u0CDD\u0CDF\u0CE4\u0CE5\u0CF0\u0CF3-\u0D01\u0D04\u0D0D\u0D11\u0D3B\u0D3C\u0D45\u0D49\u0D4F-\u0D56\u0D58-\u0D5F\u0D64\u0D65\u0D76-\u0D78\u0D80\u0D81\u0D84\u0D97-\u0D99\u0DB2\u0DBC\u0DBE\u0DBF\u0DC7-\u0DC9\u0DCB-\u0DCE\u0DD5\u0DD7\u0DE0-\u0DF1\u0DF5-\u0E00\u0E3B-\u0E3E\u0E5C-\u0E80\u0E83\u0E85\u0E86\u0E89\u0E8B\u0E8C\u0E8E-\u0E93\u0E98\u0EA0\u0EA4\u0EA6\u0EA8\u0EA9\u0EAC\u0EBA\u0EBE\u0EBF\u0EC5\u0EC7\u0ECE\u0ECF\u0EDA\u0EDB\u0EE0-\u0EFF\u0F48\u0F6D-\u0F70\u0F98\u0FBD\u0FCD\u0FDB-\u0FFF\u10C6\u10C8-\u10CC\u10CE\u10CF\u1249\u124E\u124F\u1257\u1259\u125E\u125F\u1289\u128E\u128F\u12B1\u12B6\u12B7\u12BF\u12C1\u12C6\u12C7\u12D7\u1311\u1316\u1317\u135B\u135C\u137D-\u137F\u139A-\u139F\u13F5-\u13FF\u169D-\u169F\u16F1-\u16FF\u170D\u1715-\u171F\u1737-\u173F\u1754-\u175F\u176D\u1771\u1774-\u177F\u17DE\u17DF\u17EA-\u17EF\u17FA-\u17FF\u180F\u181A-\u181F\u1878-\u187F\u18AB-\u18AF\u18F6-\u18FF\u191D-\u191F\u192C-\u192F\u193C-\u193F\u1941-\u1943\u196E\u196F\u1975-\u197F\u19AC-\u19AF\u19CA-\u19CF\u19DB-\u19DD\u1A1C\u1A1D\u1A5F\u1A7D\u1A7E\u1A8A-\u1A8F\u1A9A-\u1A9F\u1AAE-\u1AFF\u1B4C-\u1B4F\u1B7D-\u1B7F\u1BF4-\u1BFB\u1C38-\u1C3A\u1C4A-\u1C4C\u1C80-\u1CBF\u1CC8-\u1CCF\u1CF7-\u1CFF\u1DE7-\u1DFB\u1F16\u1F17\u1F1E\u1F1F\u1F46\u1F47\u1F4E\u1F4F\u1F58\u1F5A\u1F5C\u1F5E\u1F7E\u1F7F\u1FB5\u1FC5\u1FD4\u1FD5\u1FDC\u1FF0\u1FF1\u1FF5\u1FFF\u200B-\u200F\u202A-\u202E\u2060-\u206F\u2072\u2073\u208F\u209D-\u209F\u20BB-\u20CF\u20F1-\u20FF\u218A-\u218F\u23F4-\u23FF\u2427-\u243F\u244B-\u245F\u2700\u2B4D-\u2B4F\u2B5A-\u2BFF\u2C2F\u2C5F\u2CF4-\u2CF8\u2D26\u2D28-\u2D2C\u2D2E\u2D2F\u2D68-\u2D6E\u2D71-\u2D7E\u2D97-\u2D9F\u2DA7\u2DAF\u2DB7\u2DBF\u2DC7\u2DCF\u2DD7\u2DDF\u2E3C-\u2E7F\u2E9A\u2EF4-\u2EFF\u2FD6-\u2FEF\u2FFC-\u2FFF\u3040\u3097\u3098\u3100-\u3104\u312E-\u3130\u318F\u31BB-\u31BF\u31E4-\u31EF\u321F\u32FF\u4DB6-\u4DBF\u9FCD-\u9FFF\uA48D-\uA48F\uA4C7-\uA4CF\uA62C-\uA63F\uA698-\uA69E\uA6F8-\uA6FF\uA78F\uA794-\uA79F\uA7AB-\uA7F7\uA82C-\uA82F\uA83A-\uA83F\uA878-\uA87F\uA8C5-\uA8CD\uA8DA-\uA8DF\uA8FC-\uA8FF\uA954-\uA95E\uA97D-\uA97F\uA9CE\uA9DA-\uA9DD\uA9E0-\uA9FF\uAA37-\uAA3F\uAA4E\uAA4F\uAA5A\uAA5B\uAA7C-\uAA7F\uAAC3-\uAADA\uAAF7-\uAB00\uAB07\uAB08\uAB0F\uAB10\uAB17-\uAB1F\uAB27\uAB2F-\uABBF\uABEE\uABEF\uABFA-\uABFF\uD7A4-\uD7AF\uD7C7-\uD7CA\uD7FC-\uF8FF\uFA6E\uFA6F\uFADA-\uFAFF\uFB07-\uFB12\uFB18-\uFB1C\uFB37\uFB3D\uFB3F\uFB42\uFB45\uFBC2-\uFBD2\uFD40-\uFD4F\uFD90\uFD91\uFDC8-\uFDEF\uFDFE\uFDFF\uFE1A-\uFE1F\uFE27-\uFE2F\uFE53\uFE67\uFE6C-\uFE6F\uFE75\uFEFD-\uFF00\uFFBF-\uFFC1\uFFC8\uFFC9\uFFD0\uFFD1\uFFD8\uFFD9\uFFDD-\uFFDF\uFFE7\uFFEF-\uFFFB\uFFFE\uFFFF]/g;
weird.map(decodeURI).forEach(el => {
let trimmed = el.replace(expr, '')
console.log(trimmed, trimmed.length);
});
If you only want to trim these non-printable characters from the beginning and/or end of the string, you'll need to assert start (^) and end ($) in the regular expression.
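An alternative sketch, assuming an engine with Unicode property escapes (ES2018, the u flag): remove the General_Category groups Cc (control) and Cf (format) instead of listing ranges by hand. This is not the regex from the linked answer and it covers a different set (for instance, it leaves unassigned code points alone). Cf also contains U+200C and U+200D, which some scripts and emoji sequences rely on, so treat it as a starting point. The names stripInvisible and trimInvisible are made up for this example.

```javascript
// Control (Cc) and format (Cf) characters anywhere in the string.
const invisibleAnywhere = /[\p{Cc}\p{Cf}]/gu;
// The same class, anchored to the start (^) and end ($) only.
const invisibleAtEdges = /^[\p{Cc}\p{Cf}]+|[\p{Cc}\p{Cf}]+$/gu;

function stripInvisible(text) {
  return text.replace(invisibleAnywhere, '');
}

function trimInvisible(text) {
  return text.replace(invisibleAtEdges, '');
}

// The three strings from the question.
const samples = [
  '%E2%80%AA%E2%80%8ETest%E2%80%AC',
  '%D8%B4%D9%86%D8%B7',
  '%E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD'
].map(decodeURI);

const cleaned = samples.map(stripInvisible);
cleaned.forEach(s => console.log(s, s.length));
```

The Arabic and Japanese samples consist of letters only, so they pass through unchanged, while the first sample loses its three directional format characters.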

JavaScript - Why does this code alert a message?

I don't know much about JavaScript, but I found this code as a part of some game engine code. I tried to inspect it, because I noticed this part of code alerts a message and I really cannot figure out how. Here is the minimal code (I reduced it and extracted from original script and I changed variable names to single letters):
var a = '͏‪͏‪‪‪‪‪͏͏‪‪‪‪͏‪͏͏‪͏͏‪‪‪͏‪͏‪‪͏‪‪͏‪‪‪‪‪‪͏͏‪͏‪‪͏‪‪͏͏‪͏‪͏͏͏͏‪‪‪͏͏͏͏͏‪‪͏‪‪͏‪͏‪‪‪͏͏͏‪͏‪‪‪͏‪‪‪͏‪‪‪͏‪͏͏͏‪‪‪‪͏‪‪͏‪‪͏‪‪‪͏͏‪‪‪‪͏‪‪͏‪‪‪‪‪͏͏͏‪‪‪‪‪͏‪͏‪‪‪‪‪͏͏͏‪‪‪‪͏‪‪͏‪‪‪͏‪͏͏͏‪‪‪‪‪͏‪͏‪‪‪‪͏͏‪͏‪‪‪͏͏͏͏͏‪‪‪‪‪͏͏͏‪‪‪‪‪͏‪͏‪‪͏‪‪͏‪͏‪‪‪͏͏͏‪͏‪‪‪͏‪‪‪͏‪‪‪‪‪͏͏͏‪‪‪‪͏‪‪͏‪͏‪‪‪͏‪͏‪͏‪‪‪͏‪͏͏‪͏‪͏͏͏͏͏‪͏‪͏͏͏͏‪‪‪͏‪͏‪͏‪‪‪͏͏͏‪͏‪‪͏‪‪‪͏͏‪‪‪͏͏‪͏͏‪‪‪‪‪͏͏͏‪‪‪‪‪͏‪͏‪‪‪‪͏͏‪͏‪‪‪͏‪‪͏͏‪‪‪‪͏‪͏͏‪‪‪͏‪‪‪͏‪͏‪‪‪͏‪͏͏‪͏‪͏‪‪͏‪‪‪͏͏͏‪͏‪‪‪͏͏‪‪͏‪‪‪͏͏‪‪͏‪‪‪‪͏‪‪͏‪‪‪͏͏‪‪͏‪‪‪‪͏‪‪͏‪‪‪‪͏‪‪͏‪‪‪͏͏‪‪͏‪‪‪‪͏‪‪͏‪‪‪‪͏‪‪͏‪‪‪‪͏‪‪͏‪‪͏‪‪͏‪͏‪‪‪‪͏‪͏͏‪‪‪͏‪‪‪͏‪͏‪‪‪͏‪͏͏‪͏‪͏͏‪͏͏‪͏‪͏͏͏͏͏‪͏‪͏͏‪͏‪‪‪‪‪‪͏͏‪͏‪‪‪͏‪͏‪͏‪‪͏‪‪͏͏‪͏‪͏͏͏͏‪‪‪͏‪͏‪͏‪͏‪‪͏‪‪͏‪‪‪‪‪͏‪͏‪‪‪‪͏͏‪͏‪͏‪‪͏‪‪͏‪‪‪‪͏͏͏͏‪͏‪‪‪͏‪͏‪‪‪‪‪͏‪͏‪͏‪‪͏‪‪͏‪‪͏‪‪‪‪͏‪͏‪‪‪͏‪͏͏‪͏‪͏͏‪͏‪‪‪͏͏͏‪͏‪‪‪͏‪‪‪͏‪‪͏‪‪‪͏͏‪‪‪‪‪͏͏͏‪‪‪‪‪͏‪͏‪‪‪͏͏‪͏͏‪͏‪‪‪͏‪͏‪͏‪‪‪͏‪͏‪‪‪‪‪͏‪͏͏‪͏‪͏‪‪͏‪‪͏‪‪‪‪͏‪‪‪‪‪͏͏͏‪‪‪‪͏͏‪͏͏‪͏͏͏͏‪͏‪‪‪͏‪͏‪͏‪‪‪͏͏͏‪͏͏‪͏‪͏‪‪͏͏‪͏͏͏͏‪͏‪‪‪‪‪͏‪͏‪‪‪‪͏‪͏͏‪͏‪‪͏‪‪͏͏‪͏͏͏͏‪͏‪‪‪‪‪͏‪͏‪‪‪‪͏‪‪͏‪‪‪‪͏͏‪͏‪͏‪‪‪͏‪͏͏‪͏‪͏‪‪͏͏‪͏‪͏͏͏͏‪‪‪‪͏͏͏͏͏‪͏͏͏͏‪͏‪‪͏‪‪‪‪͏‪‪‪͏‪‪‪͏͏‪͏‪͏‪‪͏‪͏‪‪͏‪‪͏‪‪‪‪‪͏‪͏‪͏‪‪‪͏‪͏‪‪‪͏‪‪͏͏‪‪‪‪͏͏͏͏͏‪͏‪͏‪‪͏‪͏‪‪͏‪‪͏‪‪‪‪‪‪͏͏͏‪͏‪͏͏‪͏‪‪‪͏‪͏‪͏‪‪‪‪͏͏͏͏‪‪‪‪‪͏‪͏͏‪͏‪͏͏‪͏‪‪‪͏‪͏͏͏‪‪‪‪͏͏‪͏‪‪͏‪‪‪‪͏‪͏‪‪͏‪‪͏‪‪͏‪‪‪‪͏‪͏‪‪‪͏‪͏‪‪‪‪‪͏‪͏͏‪͏‪͏͏‪͏‪͏‪‪͏‪‪͏‪‪‪͏͏‪‪͏‪͏‪‪‪͏‪͏͏‪͏‪͏‪‪͏‪‪‪‪͏‪͏͏‪‪‪͏͏‪͏͏‪‪‪‪‪͏͏͏͏‪͏͏͏͏‪͏‪‪‪‪‪͏‪͏‪‪‪‪͏‪‪͏‪‪‪‪͏‪‪͏‪‪‪‪‪͏͏͏‪‪‪‪͏‪‪͏‪‪‪͏͏͏‪͏‪͏‪‪͏‪‪͏‪‪‪‪͏‪‪͏‪‪‪͏͏͏͏͏‪͏‪‪͏‪‪͏‪‪‪‪‪‪͏͏‪‪‪‪‪͏‪͏͏‪͏‪͏͏‪͏‪‪‪‪͏‪͏͏‪‪‪‪͏‪‪͏‪‪͏‪‪‪‪͏‪͏‪‪͏‪‪͏‪͏‪‪‪͏‪͏‪‪͏‪‪‪͏͏‪‪‪‪͏͏‪͏‪‪‪͏‪͏‪͏‪‪‪‪͏‪͏͏‪‪͏‪‪͏‪͏‪‪‪‪͏͏‪͏͏‪͏͏͏͏‪͏‪‪‪͏‪‪͏͏͏‪͏͏‪‪‪͏͏‪‪‪͏‪‪͏‪‪͏͏‪‪͏͏‪‪͏‪‪‪‪͏‪‪‪͏͏‪͏͏͏‪͏‪͏͏͏͏‪‪͏‪͏͏‪͏͏‪͏͏͏͏͏͏‪‪͏‪‪‪‪͏‪‪͏͏‪‪͏͏͏‪͏͏‪‪‪͏‪‪͏‪‪͏‪͏‪‪͏‪‪‪͏͏‪‪͏‪‪‪‪͏‪‪‪͏͏͏͏͏‪‪‪͏͏͏‪͏‪‪‪͏͏‪͏͏‪‪‪͏͏‪‪͏‪‪‪͏‪͏͏͏‪‪‪͏‪͏‪͏‪‪‪͏‪‪͏͏‪‪‪͏‪‪‪͏‪‪‪‪͏͏͏͏‪‪‪‪͏͏‪͏‪‪‪‪͏‪͏͏‪‪‪‪͏‪‪͏‪‪‪‪‪͏͏͏‪‪‪‪‪͏‪͏‪‪‪‪‪‪͏͏͏‪͏͏‪‪‪͏͏‪͏‪͏͏‪͏‪‪‪͏‪‪‪͏‪‪͏‪͏͏‪͏‪‪‪͏‪͏͏͏‪‪͏‪͏͏͏͏͏‪͏‪͏͏͏͏‪͏‪‪‪‪‪͏͏‪͏‪‪‪͏͏‪‪‪͏͏‪‪͏‪‪‪͏͏͏͏͏‪‪͏‪‪͏͏͏‪‪͏‪͏͏‪͏‪‪‪͏‪͏͏͏͏‪͏‪͏͏͏͏‪‪͏‪͏͏‪͏͏‪͏‪͏͏‪͏͏‪͏‪͏͏‪͏‪͏‪‪‪‪‪͏͏‪‪‪‪͏‪͏‪‪͏‪͏‪͏͏‪‪͏‪‪‪‪͏‪‪͏‪͏͏‪͏‪‪͏‪‪‪͏͏͏‪͏‪͏͏͏͏‪‪‪͏͏͏͏͏‪‪͏‪‪‪‪͏‪‪‪͏͏͏͏͏͏‪͏‪͏͏͏͏͏‪͏‪͏͏‪͏͏‪͏‪͏͏‪͏͏‪‪‪͏‪‪͏‪‪͏͏‪͏‪͏‪‪‪͏‪‪͏͏‪‪͏͏͏͏‪͏‪‪͏‪‪͏͏͏͏‪͏‪͏͏͏͏‪͏‪‪‪‪‪͏͏‪͏‪͏͏‪';
var b = a.match(/.{8}/g);
var c = b.map(a => [...a].map(a => a == '‪' | 0));
var d = c.map(a => parseInt(a.join``, 2).toString(16));
var e = d.map(a => eval(`'\\x${a.padStart(2, 0)}'`));
var f = eval(e.join``);
I'm trying to understand how they manage to alert a message. It alerts the number 12345, but how? I see some evals here, so I suppose they are generating code on the fly, but even with the debugger I couldn't find an explanation. They are somehow generating code and executing it, but I'm still unable to see how.
I tried this code in jsFiddle and it still works. In Node.js it throws the error alert is not defined, so I am pretty sure all this code does is alert a message.
What trick did they use here? How are they building and evaling the code, and how do they manage to alert a message? Is this some sort of encryption or what?
My question has absolutely nothing to do with this question.
The code is all there, hidden in the variable a. No, it's not an empty string, it's a string consisting of 1888 invisible characters - either \u034f or \u202a to be precise. So this is in fact just a disguised binary encoding.
The code part
var b = a.match(/.{8}/g);
var c = b.map(a => [...a].map(a => a == '‪' | 0));
var d = c.map(a => parseInt(a.join``, 2).toString(16));
breaks them in chunks of 8, then converts each chunk from an array of characters to an array of booleans (or rather, the integers 0 and 1) - notice that it compares the character against the invisible \u202a, and then converts each array-of-8-booleans (oh look, an octet!) into an actual byte and gets a hex representation of it. Here's the hex string (d.join('')):
5f3d275b7e5b28706d7177747b6e7b7c7d7c7b747d79707c7d6d71777c7b5d5d282875716e727c7d79767a775d2b7173737b737b7b737b7b7b6d7a775d2928297e5d5b28755b7d795b785d7d5b6f5d2971776e7c7d725d5d7d2b6f7c792175712b217d7a5b217d7b795d2b2878216f772b5b7d5d76782b5b7e2975787d2974796f5b6f5d7d295b735d2b7a727c217d7b7b7c7b715b7b705b7e7d297a7b6f5b5d6e79757a6d792176273b666f722869206f66276d6e6f707172737475767778797a7b7c7d7e272977697468285f2e73706c6974286929295f3d6a6f696e28706f702829293b6576616c285f29
The part
d.map(a => eval(`'\\x${a.padStart(2, 0)}'`));
has each of them parsed into a character, using a backslash escape. String.fromCharCode would have been the simpler choice. Also the padStart is not even required here, given that none of the bytes is a control character with a byte value less than 16. Maybe this would've been more familiar:
"\x5f\x3d\x27\x5b\x7e\x5b\x28\x70\x6d\x71\x77\x74\x7b\x6e\x7b\x7c\x7d\x7c\x7b\x74\x7d\x79\x70\x7c\x7d\x6d\x71\x77\x7c\x7b\x5d\x5d\x28\x28\x75\x71\x6e\x72\x7c\x7d\x79\x76\x7a\x77\x5d\x2b\x71\x73\x73\x7b\x73\x7b\x7b\x73\x7b\x7b\x7b\x6d\x7a\x77\x5d\x29\x28\x29\x7e\x5d\x5b\x28\x75\x5b\x7d\x79\x5b\x78\x5d\x7d\x5b\x6f\x5d\x29\x71\x77\x6e\x7c\x7d\x72\x5d\x5d\x7d\x2b\x6f\x7c\x79\x21\x75\x71\x2b\x21\x7d\x7a\x5b\x21\x7d\x7b\x79\x5d\x2b\x28\x78\x21\x6f\x77\x2b\x5b\x7d\x5d\x76\x78\x2b\x5b\x7e\x29\x75\x78\x7d\x29\x74\x79\x6f\x5b\x6f\x5d\x7d\x29\x5b\x73\x5d\x2b\x7a\x72\x7c\x21\x7d\x7b\x7b\x7c\x7b\x71\x5b\x7b\x70\x5b\x7e\x7d\x29\x7a\x7b\x6f\x5b\x5d\x6e\x79\x75\x7a\x6d\x79\x21\x76\x27\x3b\x66\x6f\x72\x28\x69\x20\x6f\x66\x27\x6d\x6e\x6f\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x27\x29\x77\x69\x74\x68\x28\x5f\x2e\x73\x70\x6c\x69\x74\x28\x69\x29\x29\x5f\x3d\x6a\x6f\x69\x6e\x28\x70\x6f\x70\x28\x29\x29\x3b\x65\x76\x61\x6c\x28\x5f\x29"
This string is the one evaled in the last line. But surprise, the contents of that string are just
_='[~[(pmqwt{n{|}|{t}yp|}mqw|{]]((uqnr|}yvzw]+qss{s{{s{{{mzw])()~][(u[}y[x]}[o])qwn|}r]]}+o|y!uq+!}z[!}{y]+(x!ow+[}]vx+[~)ux})tyo[o]})[s]+zr|!}{{|{q[{p[~})z{o[]nyuzmy!v';for(i of'mnopqrstuvwxyz{|}~')with(_.split(i))_=join(pop());eval(_)
So what does - still obfuscated - code do?
var _='[~[(pmqwt{n{|}|{t}yp|}mqw|{]]((uqnr|}yvzw]+qss{s{{s{{{mzw])()~][(u[}y[x]}[o])qwn|}r]]}+o|y!uq+!}z[!}{y]+(x!ow+[}]vx+[~)ux})tyo[o]})[s]+zr|!}{{|{q[{p[~})z{o[]nyuzmy!v';
for (var i of 'mnopqrstuvwxyz{|}~')
with (_.split(i))
_=join(pop());
eval(_)
Removing the with magic, we get
for (var i of 'mnopqrstuvwxyz{|}~') {
let temp = _.split(i);
_ = temp.join(temp.pop());
}
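To see why split, pop and join together behave like a replacement, here is a small sketch with made-up packed strings. The function name unpack and the sample data are assumptions for illustration; only the loop body mirrors the code above.

```javascript
// The text after the LAST occurrence of a key is that key's replacement;
// every earlier occurrence of the key is a placeholder for it.
function unpack(packed, keys) {
  for (const key of keys) {
    const parts = packed.split(key);
    const replacement = parts.pop(); // tail after the last key
    packed = parts.join(replacement); // put it wherever the key appeared
  }
  return packed;
}

// One key: "X" stands for "hello ".
const single = unpack('Xworld, XmoonXhello ', 'X');

// Two keys, processed in order: "X" -> "hell", then "Y" -> "o ".
const double = unpack('XYworld, XYmoonYo Xhell', 'XY');

console.log(single);
console.log(double);
```

Both calls print hello world, hello moon. The dictionary is simply appended to the end of the string, one entry per key.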
So for all of these characters from m to ~, it splits _ by that, takes the last part out, and joins it back together, effectively
replacing m by y!v,
replacing n by yuz,
replacing o by [],
replacing p by [~})z{,
replacing q by [{,
replacing r by |!}{{|{,
replacing s by ]+z,
replacing t by y[][[]]})[,
replacing u by x}),
replacing v by x+[~),
replacing w by +[}],
replacing x by ![],
replacing y by ]+(,
replacing z by [!}{,
replacing { by +!},
replacing | by ]+(!![]})[,
replacing } by +[],
replacing ~ by ][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]
and after all that we get for _ to be evaled the code
[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]][([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+([][[]]+[])[+!+[]]+(![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[+!+[]]+([][[]]+[])[+[]]+([][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[+!+[]+[+[]]]+(!![]+[])[+!+[]]]((![]+[])[+!+[]]+(![]+[])[!+[]+!+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]+(!![]+[])[+[]]+(![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[!+[]+!+[]+[+[]]]+[+!+[]]+[!+[]+!+[]]+[!+[]+!+[]+!+[]]+[!+[]+!+[]+!+[]+!+[]]+[!+[]+!+[]+!+[]+!+[]+!+[]]+(!![]+[][(![]+[])[+[]]+([![]]+[][[]])[+!+[]+[+[]]]+(![]+[])[!+[]+!+[]]+(!![]+[])[+[]]+(!![]+[])[!+[]+!+[]+!+[]]+(!![]+[])[+!+[]]])[!+[]+!+[]+[+[]]])()
Now doesn't that look familiar? It's good old jsfuck!
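For readers who have not met it, a short sketch of the building blocks, using fragments taken from the expression above. The variable names are mine; the idea is that type coercion on [], ! and + yields strings and numbers, and indexing into those strings yields letters.

```javascript
const zero = +[];                 // 0
const one = +!+[];                // +true, which is 1
const f = (![] + [])[+[]];        // "false"[0] is "f"
const u = ([][[]] + [])[+[]];     // "undefined"[0] is "u"

// The first bracketed chunk of the expression spells a method name.
const name =
  (![] + [])[+[]] +                     // "f"
  ([![]] + [][[]])[+!+[] + [+[]]] +     // "falseundefined"["10"] is "i"
  (![] + [])[!+[] + !+[]] +             // "false"[2] is "l"
  (!![] + [])[+[]] +                    // "true"[0] is "t"
  (!![] + [])[!+[] + !+[] + !+[]] +     // "true"[3] is "e"
  (!![] + [])[+!+[]];                   // "true"[1] is "r"

// [].filter.constructor is Function, which turns a string into code.
const FunctionCtor = [][name]['constructor'];
const result = FunctionCtor('return 12345')();

console.log(zero, one, f, u, name, result);
```

The full expression does the same thing, except that it spells out constructor and the code to run letter by letter as well.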
I found this code as a part of some game engine code
I doubt it. Looks much more like a submission to a code obfuscation contest. However, it doesn't appear to be hand-crafted; more likely someone just blindly chained multiple obfuscation tools together.
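The first layer (invisible characters as bits) can be sketched without any eval. The encoder below is my own addition so the example is self-contained; it assumes every character of the payload fits in one byte, which holds for the code in the question.

```javascript
const ZERO = '\u034F'; // COMBINING GRAPHEME JOINER, stands for bit 0
const ONE = '\u202A';  // LEFT-TO-RIGHT EMBEDDING, stands for bit 1

// Hypothetical encoder: 8 invisible characters per input character.
function encodeInvisible(text) {
  return [...text]
    .map(ch => ch.charCodeAt(0).toString(2).padStart(8, '0'))
    .join('')
    .replace(/0/g, ZERO)
    .replace(/1/g, ONE);
}

// Decoder: same chunking as a.match(/.{8}/g) in the question, but using
// String.fromCharCode instead of eval on a "\x.." escape.
function decodeInvisible(hidden) {
  const octets = hidden.match(/.{8}/g) || [];
  return octets
    .map(octet => {
      const bits = [...octet].map(ch => (ch === ONE ? 1 : 0)).join('');
      return String.fromCharCode(parseInt(bits, 2));
    })
    .join('');
}

const hidden = encodeInvisible('alert(12345)');
console.log(hidden.length, decodeInvisible(hidden));
```

Running decodeInvisible on the variable a from the question should give the string shown in the answer, which is the input to the second (split/join) layer.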

Regex character sets - and what they contain

I'm working on a pretty crude sanitizer for string input in Node(express):
I have glanced at some plugins and libraries, but most of them seem either too complex or too heavy. Therefore I decided to write a couple of simple sanitizer functions on my own.
One of them is this one, for hard-sanitizing most strings (not numbers...)
function toSafeString( str ){
str = str.replace(/[^a-öA-Ö0-9\s]+/g, '');
return str;
}
I'm from Sweden, so I need the åäö letters. I have noticed that this regex also accepts other characters as well, for example á or é.
Question 1)
Is there some kind of list or similar where I can see WHICH characters are actually accepted in, say, this regex: /[^a-ö]+/g
Question 2)
I'm working in Node and Express. I'm thinking this simple function is going to stop attacks through input fields. Am I wrong?
Question 1: Find out. :)
var accepted = [];
for (var i = 0; i <= 0xFFFF /* the Unicode BMP */; i++) {
  var s = String.fromCharCode(i);
  // test each single character against the range from the question
  if (/[a-ö]/.test(s)) accepted.push(s);
}
console.log(accepted.join(""));
outputs
abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³
´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
on my system.
Question 2: What attacks are you looking to stop? Either way, the answer is "No, probably not".
Instead of mangling user data (I'm sure your, say, French or Japanese customers will have some beef with your validation), make sure to sanitize your data whenever it's going into customer view or out thereof (HTML escaping, SQL parameter escaping, etc.).
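As a rough illustration of escaping at output time instead of stripping at input time, here is a minimal HTML-escaping sketch. The helper name escapeHtml is an assumption; in a real Express app a template engine's built-in escaping would normally do this job, and SQL needs parameterized queries rather than string escaping.

```javascript
// The five characters that matter in HTML text and attribute values.
const HTML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
};

// Hypothetical helper: the user's text is stored untouched and only
// escaped when it is written into an HTML page.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, ch => HTML_ENTITIES[ch]);
}

const userInput = '<b>"\u00C5&\u00E4"</b>';
const safe = escapeHtml(userInput);
console.log(safe);
```

Letters such as Å and ä are left alone, so French or Japanese input survives intact.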
[x-y] matches characters whose unicode numbers are between that of x and that of y:
charsBetween = function(a, b) {
var a = a.charCodeAt(0), b = b.charCodeAt(0), r = "";
while(a <= b)
r += String.fromCharCode(a++);
return r
}
charsBetween("a", "ö")
> "abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö"
See character tables for the reference.
For your validation, you probably want something like this instead:
[^a-zA-Z0-9ÅÄÖåäö\s]
This matches ranges of latin letters and digits + individual characters from a list.
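A quick sketch of the difference between the two character classes. The non-ASCII letters are written as escapes so the example does not depend on file encoding; the variable names are made up.

```javascript
// Same as [a-ö]: every code unit from U+0061 up to U+00F6.
const wideRange = /^[a-\u00F6]+$/;
// a-z plus the individual letters å (U+00E5), ä (U+00E4), ö (U+00F6).
const explicitList = /^[a-z\u00E5\u00E4\u00F6]+$/;

const checks = ['\u00E9', '{', '\u00D7', '\u00E5', 'abc'].map(ch => ({
  input: ch,
  wide: wideRange.test(ch),
  explicit: explicitList.test(ch)
}));

console.log(checks);
```

The range accepts é, { and × because their code points happen to lie between a and ö, while the explicit list accepts only what was actually intended.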
There are a lot of characters that we actually have no idea about, like Japanese or Russian and many more.
So to take them into account we need to use Unicode ranges rather than ASCII ranges in regular expressions.
I came up with this regular expression that covers almost all written letters of the whole Unicode table, plus a bit more, like numbers and a few other characters for punctuation (Chinese punctuation is already included in the Unicode ranges).
It is hard to cover everything, and these ranges probably include too many characters, including "exotic" ones (symbols):
/^[\u0040-\u1FE0\u2C00-\uFFC00-9 ',.?!]+$/i
So I was using it this way to test (have to be not empty):
function validString(str) {
return str && typeof(str) == 'string' && /^[\u0040-\u1FE0\u2C00-\uFFC00-9 ',.?!]+$/i.test(str);
}
Bear in mind that this is missing characters like:
:*()&#'\-:%
And many more others.
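Where Unicode property escapes are available (ES2018, the u flag), a similar check can be written in terms of categories instead of hand-picked ranges. This is a sketch of an alternative, not the regex above: it accepts letters (L), combining marks (M), numbers (N), spaces and the same small punctuation list, and the name validStringByCategory is made up.

```javascript
// Letters, marks and numbers from any script, plus space and ' , . ? !
const allowed = /^[\p{L}\p{M}\p{N} ',.?!]+$/u;

function validStringByCategory(str) {
  // The + quantifier already rejects the empty string.
  return typeof str === 'string' && allowed.test(str);
}

console.log(validStringByCategory('R\u00E4ksm\u00F6rg\u00E5s, tack!')); // Swedish
console.log(validStringByCategory('\u6226\u8266\u5E1D\u56FD'));         // Japanese
console.log(validStringByCategory('<script>'));
```

The same caveat applies: characters such as : * ( ) & # are still rejected unless you add them to the class.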

JavaScript: how to check if character is RTL?

How can I programmatically check if the browser treats some character as RTL in JavaScript?
Maybe creating some transparent DIV and looking at where text is placed?
A bit of context. Unicode 5.2 added Avestan alphabet support. So, if the browser has Unicode 5.2 support, it treats characters like U+10B00 as RTL (currently only Firefox does). Otherwise, it treats these characters as LTR, because this is the default.
How do I programmatically check this? I'm writing an Avestan input script and I want to override the bidi direction if the browser is too dumb. But if the browser does support Unicode 5.2, bidi settings shouldn't be overridden (since this will allow mixing Avestan and Cyrillic).
I currently do this:
var ua = navigator.userAgent.toLowerCase();
if (ua.match('webkit') || ua.match('presto') || ua.match('trident')) {
var input = document.getElementById('orig');
if (input) {
input.style.direction = 'rtl';
input.style.unicodeBidi = 'bidi-override';
}
}
But, obviously, this would render script less usable after Chrome and Opera start supporting Unicode 5.2.
function isRTL(s){
var ltrChars = 'A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u0300-\u0590\u0800-\u1FFF'+'\u2C00-\uFB1C\uFDFE-\uFE6F\uFEFD-\uFFFF',
rtlChars = '\u0591-\u07FF\uFB1D-\uFDFD\uFE70-\uFEFC',
rtlDirCheck = new RegExp('^[^'+ltrChars+']*['+rtlChars+']');
return rtlDirCheck.test(s);
};
I realize this is quite a while after the original question was asked and answered but I found vsync's update to be rather useful and just wanted to add some observations. I would add this in comment to his answer but my reputation is not high enough yet.
Instead of a regular expression that searches from the start of the line zero or more non-LTR characters and then one RTL character, wouldn't it make more sense to search from the start of the line zero or more weak/neutral characters and then one RTL character? Otherwise you have the potential for matching many RTL characters unnecessarily. I would welcome a more thorough examination of my weak/neutral character group as I merely used the negation of the combined LTR and RTL character groups.
Additionally, shouldn't characters such as LTR/RTL marks, embeds, overrides be included in the appropriate character groupings?
I would think then that the final code should look something like:
function isRTL(s){
var weakChars = '\u0000-\u0040\u005B-\u0060\u007B-\u00BF\u00D7\u00F7\u02B9-\u02FF\u2000-\u2BFF\u2010-\u2029\u202C\u202F-\u2BFF',
rtlChars = '\u0591-\u07FF\u200F\u202B\u202E\uFB1D-\uFDFD\uFE70-\uFEFC',
rtlDirCheck = new RegExp('^['+weakChars+']*['+rtlChars+']');
return rtlDirCheck.test(s);
};
Update
There may be some ways to speed up the above regular expression. Using a negated character class with a lazy quantifier seems to help improve speed (tested on http://regexhero.net/tester/?id=6dab761c-2517-4d20-9652-6d801623eeec, site requires Silverlight 5)
Additionally, if the directionality of the string is unknown, my guess is that for most cases the string will be LTR instead of RTL and creating an isLTR function would return results faster if that is the case but as OP is asking for isRTL, will provide isRTL function:
function isRTL(s){
var rtlChars = '\u0591-\u07FF\u200F\u202B\u202E\uFB1D-\uFDFD\uFE70-\uFEFC',
rtlDirCheck = new RegExp('^[^'+rtlChars+']*?['+rtlChars+']');
return rtlDirCheck.test(s);
};
Testing for both Hebrew and Arabic (the only modern RTL languages/character sets I know which flow right-to-left except for any Persian-related which I've not researched):
/[\u0590-\u06FF]/.test(textarea.value)
More research suggests something along the lines of:
/[\u0590-\u07FF\u200F\u202B\u202E\uFB1D-\uFDFD\uFE70-\uFEFC]/.test(textarea.value)
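The \u ranges above only reach the Basic Multilingual Plane, so they cannot see Avestan (U+10B00 and up). A sketch of an alternative, assuming an engine with Unicode property escapes (ES2018, the u flag): test by script name. The script list is illustrative and incomplete, the function names are made up, and this only tells you which script a character belongs to, not how a particular browser will lay it out.

```javascript
// A few right-to-left scripts; extend the list as needed (assumption).
const rtlScript =
  /[\p{Script=Hebrew}\p{Script=Arabic}\p{Script=Syriac}\p{Script=Thaana}\p{Script=Avestan}]/u;

// True if the string contains any character from the listed scripts.
function hasRtlScript(text) {
  return rtlScript.test(text);
}

// Rough "first strong character" check: find the first letter of any
// script and see whether it belongs to one of the RTL scripts.
function firstLetterIsRtl(text) {
  const match = text.match(/\p{L}/u);
  return match !== null && rtlScript.test(match[0]);
}

console.log(hasRtlScript('\u{10B28}\u{10B00}'));                 // Avestan letters
console.log(firstLetterIsRtl('123 \u05E9\u05DC\u05D5\u05DD abc')); // digits, Hebrew, Latin
console.log(firstLetterIsRtl('abc \u05E9\u05DC\u05D5\u05DD'));     // Latin first
```

With the u flag the pattern works on whole code points, so astral characters are handled without surrogate-pair ranges.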
First addressing the question in the heading:
There are no tools in JavaScript as such for accessing Unicode properties of characters. You would need to find a library or service for the purpose (I’m afraid that might be difficult, if you need something reliable) or to extract the relevant information from the Unicode character “database” (a collection of text files in specific formats) and to write your own code to use it.
Then the question in message body:
This seems even more desperate. But as this would probably be something for a limited number of users who are knowledgeable and know Avestan, maybe it would not be too bad to display a string of Avestan characters along with an image of them in proper directionality and ask the user click on a button if the order is wrong. And you could save this selection in a cookie, so that the user needs to do this only once (per browser; though it should be relatively short-lived cookie, as the browser may get updated).
Thanks for your comments, but it seems I've done this myself:
function is_script_rtl(t) {
var d, s1, s2, bodies;
//If the browser doesn’t support this, it probably doesn’t support Unicode 5.2
if (!("getBoundingClientRect" in document.documentElement))
return false;
//Set up a testing DIV
d = document.createElement('div');
d.style.position = 'absolute';
d.style.visibility = 'hidden';
d.style.width = 'auto';
d.style.height = 'auto';
d.style.fontSize = '10px';
d.style.fontFamily = "'Ahuramzda'";
d.appendChild(document.createTextNode(t));
s1 = document.createElement("span");
s1.appendChild(document.createTextNode(t));
d.appendChild(s1);
s2 = document.createElement("span");
s2.appendChild(document.createTextNode(t));
d.appendChild(s2);
d.appendChild(document.createTextNode(t));
bodies = document.getElementsByTagName('body');
if (bodies) {
var body, r1, r2;
body = bodies[0];
body.appendChild(d);
var r1 = s1.getBoundingClientRect();
var r2 = s2.getBoundingClientRect();
body.removeChild(d);
return r1.left > r2.left;
}
return false;
}
Example of using:
Avestan in <script>document.write(is_script_rtl('𐬨𐬀𐬰𐬛𐬂') ? "RTL" : "LTR")</script>,
Arabic is <script>document.write(is_script_rtl('العربية') ? "RTL" : "LTR")</script>,
English is <script>document.write(is_script_rtl('English') ? "RTL" : "LTR")</script>.
It seems to work. :)
Here's another solution that is robust against minor amounts of RTL text in a primarily LTR string, or minor amounts of LTR text in a RTL string.
It works by counting the number of LTR and RTL characters, then classifies the string based on whether there are more LTR or RTL characters.
function isRTL(text) {
  // count characters in the RTL ranges and in the LTR ranges
  let rtl_count = (text.match(/[\u0591-\u07FF\uFB1D-\uFDFD\uFE70-\uFEFC]/g) || []).length;
  let ltr_count = (text.match(/[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u0300-\u0590\u0800-\u1FFF\u2C00-\uFB1C\uFDFE-\uFE6F\uFEFD-\uFFFF]/g) || []).length;
  return (rtl_count > ltr_count);
}

Replace text (change case) in a textbox using Javascript

I am trying to build a sort of intelli-sense text input box, where as the user types, the 'and' is replaced by 'AND \n' (i.e. on each 'and', the 'and' is capitalized and user goes to new line).
The Javascript I used for this is:
function Validate()
{
document.getElementById("search").value = document.getElementById("search").value.replace("and","AND \n"); //new line for AND
}
The HTML part is like this:
<textarea type="text" name="q" id="search" spellcheck="false" onkeyup='Validate();'></textarea>
Though the above script works well on Firefox and Chrome, it sort-of misbehaves on Internet Explorer (brings the cursor to the end of the text on each 'KeyUp').
Also the above code doesn't work for the other variants of 'and' like 'And', 'anD' or even 'AND' itself.
I think the actual answer here is a mix of the two previous:
onkeyup="this.value = this.value.replace(/\band\b/ig, ' AND\n')"
You need the i to make the search case insensitive and the g to make sure you replace all occurrences. This is not very efficient, as it'll replace previous matches with itself, but it'll work.
To make it a separate function:
function validate() {
  // read and write the textarea's value, not the element itself
  document.getElementById('search').value = document.getElementById('search').value.replace(/\band\b/ig, ' AND\n');
}
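A quick sketch of what the two flags and the \b anchors change, run on a made-up sample string. To keep the comparison easy to read, the replacement here is just 'AND' without the newline.

```javascript
const sample = 'and And sand AND';

// A string pattern replaces only the first exact, case-sensitive match.
const plain = sample.replace('and', 'AND');

// i: ignore case, g: replace every match, \b: whole words only.
const flagged = sample.replace(/\band\b/ig, 'AND');

// Without \b the "and" inside "sand" is replaced too.
const noBoundary = sample.replace(/and/ig, 'AND');

console.log(plain);
console.log(flagged);
console.log(noBoundary);
```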
If you alter the textarea contents while the user is typing the caret will always move to the end, even in Firefox and Chrome. Just try to edit something you already wrote and you'll understand me. You have to move the caret to the exact position where the users expects it, which also implies you have to detect text selections (it's a standard behaviour that typing when you have a selection removes the selected text).
You can find here some sample code. You might be able to use the doGetCaretPosition(), setCaretPosition() functions.
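One way to keep the caret where the user expects it is to compute the new caret position alongside the new text. The sketch below is my own, not the sample code referred to above: replaceKeepingCaret is a hypothetical pure function, and it ignores text selections. It shifts the caret by the length difference of every replacement that ends at or before it, and skips any "and" already followed by a newline so that repeated calls do not stack up blank lines.

```javascript
// Returns the new text and the caret position adjusted for replacements
// that happened before the caret.
function replaceKeepingCaret(value, caret) {
  const replacement = 'AND\n';
  let shift = 0;
  const newValue = value.replace(/\band\b(?!\n)/gi, (match, offset) => {
    if (offset + match.length <= caret) {
      shift += replacement.length - match.length;
    }
    return replacement;
  });
  return { value: newValue, caret: caret + shift };
}

const first = replaceKeepingCaret('sand and clay', 8); // caret right after "and"
const second = replaceKeepingCaret(first.value, first.caret); // next key press
console.log(first, second);

// In a browser it could be wired up roughly like this (not run here):
// const box = document.getElementById('search');
// box.addEventListener('input', () => {
//   const r = replaceKeepingCaret(box.value, box.selectionStart);
//   box.value = r.value;
//   box.setSelectionRange(r.caret, r.caret);
// });
```

Setting value still resets the selection, which is why the caret is restored explicitly with setSelectionRange afterwards.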
I tried to work around the problem and solved by using the following javascript:
function Validate() {
if( document.getElementById("search").value.search(/\band$(?!\n)/i) >= 0 ){ // for maintaining cursor position
document.getElementById("search").value = document.getElementById("search").value.replace(/\band$(?!\n)/i,"AND\n"); //new line for AND
}
}
Thin slicing the above problem and solution:
1) The function was being called on each key up, thus earlier "AND\n" was being repeated on each key up, thus inserting a blank line on each key press. I avoided the above by using the regex:
/\band$(?!\n)/i
\b = Word boundary (to avoid matching the "and" in SAND)
$ = End of line (as "and" will be replaced by "AND\n" thus it will always be end of line)
(?!\n) = Not followed by new line (to prevent repeatedly replacing "AND\n" on each key press)
i = To cover all variants of "and" including "And","anD" etc.
2) Internet Explorer was misbehaving and the cursor position was not maintained (moved to end) when the input was re-edited. This was caused (as hinted by Alvaro above) due to the replace function.
Thus I inserted an "if" statement to call replace function only when it is needed, i.e. only when there is some "and" needing replacement.
Thanks everyone for the help.
try using the following replace() statement:
replace(/\band(?!$)/ig, "And\n")
since this is being called repeatedly against the altered string you have to make sure that the "and" is not followed by a line break.
example (uses a loop and function to simulate the user typing the letters in):
function onkeyup() {
var str = this;
return this.replace(/\band(?!$)/ig, "And\n");
};
var expected = "this is some sample text And\n is separated by more text And\n stuff";
var text = "this is some sample text and is separated by more text and stuff";
var input = "";
var output = "";
for(var i = 0, len = text.length; i < len; i++ ) {
input += text.substr(i,1);
output = onkeyup.call(input);
}
var result = expected == output;
alert(result);
if( !result ) {
alert("expected: " + expected);
alert("actual: " + output);
}
you can test this out here: http://bit.ly/8kWLtr
You need to write JS code that runs in both IE and Firefox. I think this is what you need:
var s = document.getElementById('search'); // the id is case-sensitive
s.value = s.value.replace('and', 'AND \n');
I think you want your replace call to look like this:
replace(/\band\b/i,"AND \n") (see below)
That way it is not case sensitive (/i), and only takes single words that match and, so 'sand' and similar words that contain 'and' don't match, only 'and' on its own.
EDIT: I played around with it based on the comments and I think this working example is what is wanted.
Replace the onKeyUp event with onBlur:
<textarea type="text" name="q" id="search" spellcheck="false" onblur='Validate();'></textarea>
So that the validate function is only run when the user leaves the text box. You could also run it onSubmit.
I also added a global switch (g) and optional trailing whitespace (\s?) to the regex:
replace(/\band\b\s?/ig,"AND \n")
This causes input like this:
sand and clay and water
to be transformed into this when you leave the text box:
sand AND
clay AND
water
You should probably test this against a bunch more cases.
