cannot get utf8 icon with charAt js - javascript

let a=["๐Ÿ˜"] ;
let b="๐Ÿ˜"
console.log(a[0],b,b.charAt(0))
"๐Ÿ˜"
"๐Ÿ˜"
"๏ฟฝ"
this charAt prints questions mark ... can someone enlighten me how to get utf8 icon with charAt in a string

Emojis are represented using multiple bytes:
let b="๐Ÿ˜"
console.log(b.length)
With charAt you only get one part of the emoji.
With codePoint you can get the:
a non-negative integer that is the Unicode code point value at the given position.
You however need to know where the emojii starts at, because the index itself still refers to the bytes:
let b="๐Ÿ˜๐Ÿ‘†๐Ÿ‘"
console.dir(String.fromCodePoint(b.codePointAt(0)));
console.dir(String.fromCodePoint(b.codePointAt(2)));
You could split your string using the spread operator ... and then access the visible char in question using the index. This works because the iterator uses the code points and not individual bytes.
let b="๐Ÿ˜test๐Ÿ‘†test๐Ÿ‘"
function splitStringByCodePoint(str) {
return [...str]
}
console.log(splitStringByCodePoint(b)[5])
But that won't work with emojis like ๐Ÿ‘†๐Ÿฝ because those consist of ๐Ÿ‘† + byte(s) representing the variation (color).
let b="๐Ÿ‘†๐Ÿฝ"
console.log(b.length)
function splitStringByCodePoint(str) {
return [...str]
}
console.log(splitStringByCodePoint(b))
console.log(String.fromCodePoint(b.codePointAt(0)));
console.log(String.fromCodePoint(b.codePointAt(2)));
If you want to support all emojis you currently need to write your own parser or look for a library that does that.

Related

Check strings equality without direction? [duplicate]

How can I remove non-printable unicode characters in a multi-language input?
When users with different localizations paste strings they will sometimes unintentionally embed non-printing characters. For example:
var weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC"
var displaysAs = decodeURI(weird); // Users see only "Test"
But I can't figure out how to strip the non-printing characters in a way that doesn't impact other languages like these:
encodeURI("ุดู†ุท") = "%D8%B4%D9%86%D8%B7"
encodeURI("ๆˆฆ่‰ฆๅธๅ›ฝ") = "%E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD"
For example, the following attempt to repair the weird example above doesn't work:
var weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC";
var displaysAs = decodeURI(weird);
var stillWeird = encodeURI(displaysAs.replace(/\s/g, ""));
// value is again "%E2%80%AA%E2%80%8ETest%E2%80%AC"
console.log('before:', weird);
console.log('after:', displaysAs);
console.log('again:', stillWeird);
.as-console-wrapper{min-height:100%}
As noted in the comments, this is largely a specification problem. I don't have an enumeration of non-printing unicode expressions. I can only observe that one can paste a unicode string into a browser Input and not be aware that it has undisplayed characters in it. I assume that some logic in the browser determines whether each unicode character will display something. This problem would be solved if I can apply that same logic to the underlying string in order to get the "display string."
Put another way: For any two unicode strings that look identical on the browser, I need a transformation that guarantees that their values are identical.
You can use the regular expression found in this other answer.
Example with array of the three strings provided in the question:
let weird = [
"%E2%80%AA%E2%80%8ETest%E2%80%AC",
"%D8%B4%D9%86%D8%B7",
"%E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD"
];
const expr = /[\0-\x1F\x7F-\x9F\xAD\u0378\u0379\u037F-\u0383\u038B\u038D\u03A2\u0528-\u0530\u0557\u0558\u0560\u0588\u058B-\u058E\u0590\u05C8-\u05CF\u05EB-\u05EF\u05F5-\u0605\u061C\u061D\u06DD\u070E\u070F\u074B\u074C\u07B2-\u07BF\u07FB-\u07FF\u082E\u082F\u083F\u085C\u085D\u085F-\u089F\u08A1\u08AD-\u08E3\u08FF\u0978\u0980\u0984\u098D\u098E\u0991\u0992\u09A9\u09B1\u09B3-\u09B5\u09BA\u09BB\u09C5\u09C6\u09C9\u09CA\u09CF-\u09D6\u09D8-\u09DB\u09DE\u09E4\u09E5\u09FC-\u0A00\u0A04\u0A0B-\u0A0E\u0A11\u0A12\u0A29\u0A31\u0A34\u0A37\u0A3A\u0A3B\u0A3D\u0A43-\u0A46\u0A49\u0A4A\u0A4E-\u0A50\u0A52-\u0A58\u0A5D\u0A5F-\u0A65\u0A76-\u0A80\u0A84\u0A8E\u0A92\u0AA9\u0AB1\u0AB4\u0ABA\u0ABB\u0AC6\u0ACA\u0ACE\u0ACF\u0AD1-\u0ADF\u0AE4\u0AE5\u0AF2-\u0B00\u0B04\u0B0D\u0B0E\u0B11\u0B12\u0B29\u0B31\u0B34\u0B3A\u0B3B\u0B45\u0B46\u0B49\u0B4A\u0B4E-\u0B55\u0B58-\u0B5B\u0B5E\u0B64\u0B65\u0B78-\u0B81\u0B84\u0B8B-\u0B8D\u0B91\u0B96-\u0B98\u0B9B\u0B9D\u0BA0-\u0BA2\u0BA5-\u0BA7\u0BAB-\u0BAD\u0BBA-\u0BBD\u0BC3-\u0BC5\u0BC9\u0BCE\u0BCF\u0BD1-\u0BD6\u0BD8-\u0BE5\u0BFB-\u0C00\u0C04\u0C0D\u0C11\u0C29\u0C34\u0C3A-\u0C3C\u0C45\u0C49\u0C4E-\u0C54\u0C57\u0C5A-\u0C5F\u0C64\u0C65\u0C70-\u0C77\u0C80\u0C81\u0C84\u0C8D\u0C91\u0CA9\u0CB4\u0CBA\u0CBB\u0CC5\u0CC9\u0CCE-\u0CD4\u0CD7-\u0CDD\u0CDF\u0CE4\u0CE5\u0CF0\u0CF3-\u0D01\u0D04\u0D0D\u0D11\u0D3B\u0D3C\u0D45\u0D49\u0D4F-\u0D56\u0D58-\u0D5F\u0D64\u0D65\u0D76-\u0D78\u0D80\u0D81\u0D84\u0D97-\u0D99\u0DB2\u0DBC\u0DBE\u0DBF\u0DC7-\u0DC9\u0DCB-\u0DCE\u0DD5\u0DD7\u0DE0-\u0DF1\u0DF5-\u0E00\u0E3B-\u0E3E\u0E5C-\u0E80\u0E83\u0E85\u0E86\u0E89\u0E8B\u0E8C\u0E8E-\u0E93\u0E98\u0EA0\u0EA4\u0EA6\u0EA8\u0EA9\u0EAC\u0EBA\u0EBE\u0EBF\u0EC5\u0EC7\u0ECE\u0ECF\u0EDA\u0EDB\u0EE0-\u0EFF\u0F48\u0F6D-\u0F70\u0F98\u0FBD\u0FCD\u0FDB-\u0FFF\u10C6\u10C8-\u10CC\u10CE\u10CF\u1249\u124E\u124F\u1257\u1259\u125E\u125F\u1289\u128E\u128F\u12B1\u12B6\u12B7\u12BF\u12C1\u12C6\u12C7\u12D7\u1311\u1316\u1317\u135B\u135C\u137D-\u137F\u139A-\u139F\u13F5-\u13FF\u169D-\u169F\u16F1-\u16FF\u170D\u1715-\u171F\u1737-\u173F\u1754-\u175F\u176D\u1771\u1774-\u177F\u17DE\u17DF\u17EA-\u17EF\u17FA-\u17FF\u180F\u181A-\u181F\u1878-\u187F\u18AB-\u18AF\u18F6-\u18FF\u191D-\u191F\u192C-\u192F\u193C-\u193F\u1941-\u1943\u196E\u196F\u1975-\u197F\u19AC-\u19AF\u19CA-\u19CF\u19DB-\u19DD\u1A1C\u1A1D\u1A5F\u1A7D\u1A7E\u1A8A-\u1A8F\u1A9A-\u1A9F\u1AAE-\u1AFF\u1B4C-\u1B4F\u1B7D-\u1B7F\u1BF4-\u1BFB\u1C38-\u1C3A\u1C4A-\u1C4C\u1C80-\u1CBF\u1CC8-\u1CCF\u1CF7-\u1CFF\u1DE7-\u1DFB\u1F16\u1F17\u1F1E\u1F1F\u1F46\u1F47\u1F4E\u1F4F\u1F58\u1F5A\u1F5C\u1F5E\u1F7E\u1F7F\u1FB5\u1FC5\u1FD4\u1FD5\u1FDC\u1FF0\u1FF1\u1FF5\u1FFF\u200B-\u200F\u202A-\u202E\u2060-\u206F\u2072\u2073\u208F\u209D-\u209F\u20BB-\u20CF\u20F1-\u20FF\u218A-\u218F\u23F4-\u23FF\u2427-\u243F\u244B-\u245F\u2700\u2B4D-\u2B4F\u2B5A-\u2BFF\u2C2F\u2C5F\u2CF4-\u2CF8\u2D26\u2D28-\u2D2C\u2D2E\u2D2F\u2D68-\u2D6E\u2D71-\u2D7E\u2D97-\u2D9F\u2DA7\u2DAF\u2DB7\u2DBF\u2DC7\u2DCF\u2DD7\u2DDF\u2E3C-\u2E7F\u2E9A\u2EF4-\u2EFF\u2FD6-\u2FEF\u2FFC-\u2FFF\u3040\u3097\u3098\u3100-\u3104\u312E-\u3130\u318F\u31BB-\u31BF\u31E4-\u31EF\u321F\u32FF\u4DB6-\u4DBF\u9FCD-\u9FFF\uA48D-\uA48F\uA4C7-\uA4CF\uA62C-\uA63F\uA698-\uA69E\uA6F8-\uA6FF\uA78F\uA794-\uA79F\uA7AB-\uA7F7\uA82C-\uA82F\uA83A-\uA83F\uA878-\uA87F\uA8C5-\uA8CD\uA8DA-\uA8DF\uA8FC-\uA8FF\uA954-\uA95E\uA97D-\uA97F\uA9CE\uA9DA-\uA9DD\uA9E0-\uA9FF\uAA37-\uAA3F\uAA4E\uAA4F\uAA5A\uAA5B\uAA7C-\uAA7F\uAAC3-\uAADA\uAAF7-\uAB00\uAB07\uAB08\uAB0F\uAB10\uAB17-\uAB1F\uAB27\uAB2F-\uABBF\uABEE\uABEF\uABFA-\uABFF\uD7A4-\uD7AF\uD7C7-\uD7CA\uD7FC-\uF8FF\uFA6E\uFA6F\uFADA-\uFAFF\uFB07-\uFB12\uFB18-\uFB1C\uFB37\uFB3D\uFB3F\uFB42\uFB45\uFBC2-\uFBD2\uFD40-\uFD4F\uFD90\uFD91\uFDC8-\uFDEF\uFDFE\uFDFF\uFE1A-\uFE1F\uFE27-\uFE2F\uFE53\uFE67\uFE6C-\uFE6F\uFE75\uFEFD-\uFF00\uFFBF-\uFFC1\uFFC8\uFFC9\uFFD0\uFFD1\uFFD8\uFFD9\uFFDD-\uFFDF\uFFE7\uFFEF-\uFFFB\uFFFE\uFFFF]/g;
weird.map(decodeURI).forEach(el => {
let trimmed = el.replace(expr, '')
console.log(trimmed, trimmed.length);
});
If you only want to trim these non-printable characters from the beginning and/or end of the string, you'll need to assert start (^) and end ($) in the regular expression.

How to get the correct element from a unicode string?

I want to get specific letters from an unicode string using index. However, it doesn't work as expected.
Example:
var handwriting = `๐–†๐–‡๐–ˆ๐–‰๐–Š๐–‹๐–Œ๐–๐–Ž๐–๐–๐–‘๐–’๐–“๐–”๐–•๐––๐–—๐–˜๐–™๐–š๐–›๐–œ๐–๐–ž๐–Ÿ๐•ฌ๐•ญ๐•ฎ๐•ฏ๐•ฐ๐•ฑ๐•ฒ๐•ณ๐•ด๐•ต๐•ถ๐•ท๐•ธ๐•น๐•บ๐•ป๐•ผ๐•ฝ๐•พ๐•ฟ๐–€๐–๐–‚๐–ƒ๐–„๐–…1234567890`
var normal = `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890`
console.log(normal[3]) // gives 'd' but
console.log(handwriting[3]) // gives '๏ฟฝ' instead of '๐–‰'
also length doesn't work as expected normal.length gives correct value as 62 but handwriting.length gives 114.
Indexing doesn't work as expected. How can I access the elements of unicode array?
I tried this on python it works perfectly but in Javascript it is not working.
I need exact characters from the unicode string like an expected output of 'd' '๐–‰' for index 3
In Javascript, a string is a sequence of 16-bit code points. Since these characters are encoded above the Basic Multilingual Plane, it means that they are represented by a pair of code points, also known as a surrogate pair.
Reference
Unicode number of ๐–† is U+1D586. And 0x1D586 is greater than 0xFFFF (2^16). So, ๐–† is represented by a pair of code points, also known as a surrogate pair
console.log("๐–†".length)
console.log("๐–†" === "\uD835\uDD86")
One way is to create an array of characters using the spread syntax or Array.from() and then get the index you need
var handwriting = `๐–†๐–‡๐–ˆ๐–‰๐–Š๐–‹๐–Œ๐–๐–Ž๐–๐–๐–‘๐–’๐–“๐–”๐–•๐––๐–—๐–˜๐–™๐–š๐–›๐–œ๐–๐–ž๐–Ÿ๐•ฌ๐•ญ๐•ฎ๐•ฏ๐•ฐ๐•ฑ๐•ฒ๐•ณ๐•ด๐•ต๐•ถ๐•ท๐•ธ๐•น๐•บ๐•ป๐•ผ๐•ฝ๐•พ๐•ฟ๐–€๐–๐–‚๐–ƒ๐–„๐–…1234567890`
console.log([...handwriting][3])
console.log(Array.from(handwriting)[3])
A unicode character looks like '\u00E9' so if your string is longer this is normal.
To have the real length of a unicode string, you have to convert it to an array :
let charArray = [...handwriting]
console.log(charArray.length) //=62
Each item of your array is a char of your string.
charArray[3] will return you the unicode char corresponding to '๐–‰'

Extracting a complicated part of the string with plain Javascript

I have a following string:
Text
I want to extract from this string, with the use of JavaScript 'pl' or 'pl_company_com'
There are a few variables:
jan_kowalski is a name and surname it can change, and sometimes even have 3 elements
the country code (in this example 'pl') will change to other en / de / fr (this is that part of the string i want to get)
the rest of the string remains the same for every case (beginning + everything after starting with _company_com ...
Ps. I tried to do it with split, but my knowledge of JS is very basic and I cant get what i want, plase help
An alternative to Randy Casburn's solution using regex
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_(.*_company_com)')[1];
console.log(out);
Or if you want to just get that string with those country codes you specified
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_((en|de|fr|pl)_company_com)')[1];
console.log(out);
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_((en|de|fr|pl)_company_com)')[1];
console.log(out);
A proof of concept that this solution also works for other combinations
let urls = [
new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx'),
new URL('https://my.domain.com/personal/firstname_middlename_lastname_pl_company_com/Documents/Forms/All.aspx')
]
urls.forEach(url => console.log(url.href.match('.*_(en|de|fr|pl).*')[1]))
I have been very successful before with this kind of problems with regular expressions:
var string = 'Text';
var regExp = /([\w]{2})_company_com/;
find = string.match(regExp);
console.log(find); // array with found matches
console.log(find[1]); // first group of regexp = country code
First you got your given string. Second you have a regular expression, which is marked with two slashes at the beginning and at the end. A regular expression is mostly used for string searches (you can even replace complicated text in all major editors with it, which can be VERY useful).
In this case here it matches exactly two word characters [\w]{2} followed directly by _company_com (\w indicates a word character, the [] group all wanted character types, here only word characters, and the {}indicate the number of characters to be found). Now to find the wanted part string.match(regExp) has to be called to get all captured findings. It returns an array with the whole captured string followed by all capture groups within the regExp (which are denoted by ()). So in this case you get the country code with find[1], which is the first and only capture group of the regular expression.

Find longest repeating substring in JavaScript using regular expressions

I'd like to find the longest repeating string within a string, implemented in JavaScript and using a regular-expression based approach.
I have an PHP implementation that, when directly ported to JavaScript, doesn't work.
The PHP implementation is taken from an answer to the question "Find longest repeating strings?":
preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $input, $matches, PREG_SET_ORDER);
This will populate $matches[0][X] (where X is the length of $matches[0]) with the longest repeating substring to be found in $input. I have tested this with many input strings and found am confident the output is correct.
The closest direct port in JavaScript is:
var matches = /(?=((.+)(?:.*?\2)+))/.exec(input);
This doesn't give correct results
input Excepted result matches[0][X]
======================================================
inputinput input input
7inputinput input input
inputinput7 input input
7inputinput7 input 7
XXinputinputYY input XX
I'm not familiar enough with regular expressions to understand what the regular expression used here is doing.
There are certainly algorithms I could implement to find the longest repeating substring. Before I attempt to do that, I'm hoping a different regular expression will produce the correct results in JavaScript.
Can the above regular expression be modified such that the expected output is returned in JavaScript? I accept that this may not be possible in a one-liner.
Javascript matches only return the first match -- you have to loop in order to find multiple results. A little testing shows this gets the expected results:
function maxRepeat(input) {
var reg = /(?=((.+)(?:.*?\2)+))/g;
var sub = ""; //somewhere to stick temp results
var maxstr = ""; // our maximum length repeated string
reg.lastIndex = 0; // because reg previously existed, we may need to reset this
sub = reg.exec(input); // find the first repeated string
while (!(sub == null)){
if ((!(sub == null)) && (sub[2].length > maxstr.length)){
maxstr = sub[2];
}
sub = reg.exec(input);
reg.lastIndex++; // start searching from the next position
}
return maxstr;
}
// I'm logging to console for convenience
console.log(maxRepeat("aabcd")); //aa
console.log(maxRepeat("inputinput")); //input
console.log(maxRepeat("7inputinput")); //input
console.log(maxRepeat("inputinput7")); //input
console.log(maxRepeat("7inputinput7")); //input
console.log(maxRepeat("xxabcdyy")); //x
console.log(maxRepeat("XXinputinputYY")); //input
Note that for "xxabcdyy" you only get "x" back, as it returns the first string of maximum length.
It seems JS regexes are a bit weird. I don't have a complete answer, but here's what I found.
Although I thought they did the same thing re.exec() and "string".match(re) behave differently. Exec seems to only return the first match it finds, whereas match seems to return all of them (using /g in both cases).
On the other hand, exec seems to work correctly with ?= in the regex whereas match returns all empty strings. Removing the ?= leaves us with
re = /((.+)(?:.*?\2)+)/g
Using that
"XXinputinputYY".match(re);
returns
["XX", "inputinput", "YY"]
whereas
re.exec("XXinputinputYY");
returns
["XX", "XX", "X"]
So at least with match you get inputinput as one of your values. Obviously, this neither pulls out the longest, nor removes the redundancy, but maybe it helps nonetheless.
One other thing, I tested in firebug's console which threw an error about not supporting $1, so maybe there's something in the $ vars worth looking at.

JavaScript regular expressions - match a series of hexadecimal numbers

Greetings JavaScript and regular expression gurus,
I want to return all matches in an input string that are 6-digit hexadecimal numbers with any amount of white space in between. For example, "333333 e1e1e1 f4f435" should return an array:
array[0] = 333333
array[1] = e1e1e1
array[2] = f4f435
Here is what I have, but it isn't quite right-- I'm not clear how to get the optional white space in there, and I'm only getting one match.
colorValuesArray = colorValues.match(/[0-9A-Fa-f]{6}/);
Thanks for your help,
-NorthK
Use the g flag to match globally:
/[0-9A-Fa-f]{6}/g
Another good enhancement would be adding word boundaries:
/\b[0-9A-Fa-f]{6}\b/g
If you like you could also set the i flag for case insensitive matching:
/\b[0-9A-F]{6}\b/gi
Alternatively to the answer above, a more direct approach might be:
/\p{Hex_Digit}{6}/ug
You can read more about Unicode Properties here.
It depends on the situation, but I usually want to make sure my code can't silently accept (and ignore, or misinterpret) incorrect input. So I would normally do something like this.
var arr = s.split();
for (var i = 0; i < arr.length; i++) {
if (!arr[i].match(/^[0-9A-Fa-f]{6}$/)
throw new Error("unexpected junk in string: " + arr[i]);
arr[i] = parseInt(arr[i], 16);
}
try:
colorValues.match(/[0-9A-Fa-f]{6}/g);
Note the g flag to Globally match.
result = subject.match(/\b[0-9A-Fa-f]{6}\b/g);
gives you an array of all 6-digit hexadecimal numbers in the given string subject.
The \b word boundaries are necessary to avoid matching parts of longer hexadecimal numbers.
For people who are looking for hex color with alpha code, the following regex works:
/\b[0-9A-Fa-f]{6}[0-9A-Fa-f]{0,2}\b\g
The code allows both hex with or without the alpha code.

Categories

Resources