What I want is to calculate how many times the caret will move from the beginning to the end of the string.
Explanations:
Look this string "" in this fiddle: http://jsfiddle.net/RFuQ3/
If you put the caret before the first quote then push the right arrow ► you will push 3 times to arrive after the second quote (instead of 2 times for an empty string).
The first way, and the easiest to calculate the length of a string is <string>.length.
But here, it returns 2.
The second way, from JavaScript Get real length of a string (without entities) gives 2 too.
How can I get 1?
1-I thought of putting the string in a text input and then running a while loop with a try{setCaret}catch(){}
2-It's just for fun
The character in your question "" is the
Unicode Character 'LANGUAGE TAG' (U+E0001).
From the following Stack Overflow questions,
" Expressing UTF-16 unicode characters in JavaScript"
" How can I tell if a string contains multibyte characters in Javascript?"
we learn that
JavaScript strings are UCS-2 encoded but can represent Unicode code points outside the Basic Multilingual Plane (U+0000-U+D7FF and U+E000-U+FFFF) using two 16-bit numbers (a UTF-16 surrogate pair), the first of which must be in the range U+D800-U+DBFF and the second in the range U+DC00-U+DFFF.
The UTF-16 surrogate pair representing "" is U+DB40 and U+DC01. In decimal U+DB40 is 56128, and U+DC01 is 56321.
console.log("".length); // 2
console.log("".charCodeAt(0)); // 56128
console.log("".charCodeAt(1)); // 56321
console.log("\uDB40\uDC01" === ""); // true
console.log(String.fromCharCode(0xDB40, 0xDC01) === ""); // true
Adapting the code from https://stackoverflow.com/a/4885062/788324, we just need to count the number of code points to arrive at the correct answer:
var getNumCodePoints = function(str) {
    var numCodePoints = 0;
    for (var i = 0; i < str.length; i++) {
        var charCode = str.charCodeAt(i);
        // A surrogate code unit (0xD800-0xDFFF): skip its partner
        if ((charCode & 0xF800) == 0xD800) {
            i++;
        }
        numCodePoints++;
    }
    return numCodePoints;
};
console.log(getNumCodePoints("")); // 1
jsFiddle Demo
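function realLength(str) {
    var count = 0;
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        // Do not count the trailing half (low surrogate) of a surrogate pair
        if (code < 0xDC00 || code > 0xDFFF) count++;
    }
    return count;
}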
Didn't try the code, but it should work I think.
JavaScript doesn't really support Unicode: strings are sequences of 16-bit code units, so length counts code units rather than characters.
You can try
yourstring.replace(/[\uD800-\uDFFF]{2}/g, "0").length
for what it's worth
When a decimal point is not allowed at the beginning or end, it can be in the middle and there must be only one decimal point.
I am trying to write a regular expression for numeric input. Only digits and a decimal point are allowed; letters and other characters are not. A decimal point must not appear at the beginning. At most one decimal point may appear in the middle of the number, and a trailing decimal point is also allowed, whether or not there is one in the middle.
like this
(o )13.4. 13.
(x) .
However, when using my regular expression, the decimal point is used more than once, and the decimal point is also used at the beginning and end.
This is my regex:
let regex = /[^\d.]/g;
How can I fix this?
const str = '123.12';
const regex = new RegExp('^\\d+([.]\\d+)?$');
console.log(regex.test(str));
In my opinion, you may find a way to fix the regex, but a regex is not always the best solution. If possible, just write a function that enforces the rules:
1. Only digits and points are allowed.
2. A point appears at most once in the string, and not at the beginning or the end.
It should not be hard to write, and it will be fast.
Here is an example:
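function isValidNumber(str) {
    // Reject a point at the beginning or the end
    if (str[0] === '.' || str[str.length - 1] === '.') {
        return false;
    }
    let pointCount = 0;
    for (let i = 0; i < str.length; i++) {
        let uniCode = str.charCodeAt(i);
        // 46 is "."
        if (uniCode === 46) {
            pointCount++;
        }
        // Anything other than a digit (48-57) or a point is invalid
        if ((uniCode < 48 || uniCode > 57) && uniCode !== 46) {
            return false;
        }
    }
    // At most one point is allowed
    return pointCount <= 1;
}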
Let me summarize your requirements, as your message and the provided expression are not easy to understand and conflict with each other.
allowed are values, which…
in general contain nothing else than numbers and maybe dots
just number(s), eg. 1 or 12
just number(s) with trailing dot, eg. 1. or 12.
number(s) with decimal(s), eg. 1.2 or 12.3 or 12.34
number(s) with decimal(s) and trailing dot, eg. 1.2. or 12.3. or 12.34.
disallowed are values, which…
are empty
start with a dot
contain repeating dots
contain more than two dots
If that is correct, then I would go with the following expression.
/^\d+(?:\.\d+)?\.?$/
Have you considered that negative numbers could appear in your data? If so, use the following expression instead.
/^-?\d+(?:\.\d+)?\.?$/
You can remove the ?: parts if you do not mind about the capturing groups and look for better readability.
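To check the first expression against the requirements summarized above, a quick test along these lines may help. The sample values are my own picks for illustration, not taken from the question.

```javascript
// The expression from the answer: digits, optional ".digits", optional trailing "."
const pattern = /^\d+(?:\.\d+)?\.?$/;

// Hypothetical sample inputs chosen to cover each rule
const shouldMatch = ['1', '12', '1.', '12.', '1.2', '12.34', '12.34.'];
const shouldFail = ['', '.', '.5', '1..2', '1.2.3', '12a'];

const allAccepted = shouldMatch.every((s) => pattern.test(s));
const noneAccepted = shouldFail.every((s) => !pattern.test(s));

console.log(allAccepted, noneAccepted); // true true
```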
Background
I am writing an esoteric language called Jolf. It is used on the lovely site codegolf SE. If you don't already know, a lot of challenges are scored in bytes. People have made lots of languages that utilize either their own encoding or a pre-existing encoding.
On the interpreter for my language, I have a byte counter. As you might expect, it counts the number of bytes in the code. Until now, I've been using a UTF-8 en/decoder (utf8.js). I am now using the ISO 8859-7 encoding, which has Greek characters, so that counter no longer gives the right result. Nor does the text upload actually work. I need to count the actual bytes contained within an uploaded file. Also, is there a way to read the contents of said encoded file?
Question
Given a file encoded in ISO 8859-7 obtained from an <input> element on the page, is there any way to obtain the number of bytes contained in that file? And, given "plaintext" (i.e. text put directly into a <textarea>), how might I count the bytes in that as if it was encoded in ISO 8859-7?
What I've tried
The input element is called isogreek. The file resides in the <input> element. The content is ΦX族, a Greek character, a latin character (each of which should be a byte) and a Chinese character, which should be more than one byte (?).
isogreek.files[0].size; // is 3; should be more.
var reader = new FileReader();
reader.readAsBinaryString(isogreek.files[0]); // corrupts the string to `ÖX?`
reader.readAsText(isogreek.files[0]); // �X?
reader.readAsText(isogreek.files[0],"ISO 8859-7"); // �X?
Extended from this comment.
As @pvg mentioned in the comments, the string resulting from readAsBinaryString would be correct, but is corrupted for two reasons:
A. The result is encoded in ISO-8859-1. You can use a function to fix this:
function convertFrom1to7(text) {
// charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
// - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
// - If the character is a Greek char with 720 subtracted from its char code, use a ".".
// - Otherwise, use \uXXXX format.
var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
var newtext = "", newchar = "";
for (var i = 0; i < text.length; i++) {
var char = text[i];
newchar = char;
if (char.charCodeAt(0) >= 160) {
newchar = charset[char.charCodeAt(0) - 160];
if (newchar === "!") newchar = char;
if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
}
newtext += newchar;
}
return newtext;
}
B. The Chinese character isn't a part of the ISO-8859-7 charset (because the charset supports up to 256 unique chars, as the table shows). If you want to include arbitrary Unicode characters in a program, you will probably need to do one of these two things:
Count the bytes of that program in, e.g., UTF-8 or UTF-16. This can be done pretty easily with the library you linked. However, if you want this to be done automatically, you'll need a function that checks if the content of the textarea is a valid ISO-8859-7 file, like this:
function isValidISO_8859_7(text) {
var charset = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03CE]/;
var valid = true;
for (var i = 0; i < text.length; i++) {
valid = valid && charset.test(text[i]);
}
return valid;
}
Create your own, custom variant of ISO-8859-7 that uses a specific byte (or more than one) to signify that the next 2 or 3 bytes belong to a single Unicode char. This can be pretty much as simple or complex as you like, from one char signifying a 2-byte char and one signifying a 3-byter to everything between 80 and 9F setting up for the next few. Here's a basic example that uses 80 as the 2-byter and 81 as the 3-byter (assumes the text is encoded in ISO-8859-1):
function reUnicode(text) {
var newtext = "";
for (var i = 0; i < text.length; i++) {
if (text.charCodeAt(i) === 0x80) {
newtext += String.fromCharCode((text.charCodeAt(++i) << 8) + text.charCodeAt(++i));
} else if (text.charCodeAt(i) === 0x81) {
var charcode = (text.charCodeAt(++i) << 16) + (text.charCodeAt(++i) << 8) + text.charCodeAt(++i) - 65536;
newtext += String.fromCharCode(0xD800 + (charcode >> 10), 0xDC00 + (charcode & 1023)); // Convert into a UTF-16 surrogate pair
} else {
newtext += convertFrom1to7(text[i]);
}
}
return newtext;
}
I can go into either method in more detail if you desire.
The three characters you gave as an example are encoded in UTF-8 as 6 bytes: ce a6 58 e6 97 8f (0x58 = X). Also: JavaScript works with UTF-16, which results in some funny things like ("abc".length === "ΦX族".length) being true.
You most probably need to go the full length and check every single character for its length by its code value. You may also need to check two code units in some cases (surrogate pairs, for code points that do not fit in a single UTF-16 code unit). A BOM needs to be placed and checked, too, if necessary (always necessary if you work with files of unknown sources).
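Checking each character by its code value could look roughly like the sketch below, which counts the bytes a string would take in UTF-8. The function name utf8ByteLength is made up for this example, and it does not cover ISO 8859-7, where every supported character is a single byte.

```javascript
// Count the bytes a string would occupy in UTF-8, one code point at a time.
// utf8ByteLength is a hypothetical helper name, not part of any library.
function utf8ByteLength(str) {
    let bytes = 0;
    // for...of iterates by code point, so a surrogate pair arrives as one item
    for (const ch of str) {
        const codePoint = ch.codePointAt(0);
        if (codePoint < 0x80) bytes += 1;         // ASCII
        else if (codePoint < 0x800) bytes += 2;   // e.g. Greek letters
        else if (codePoint < 0x10000) bytes += 3; // rest of the BMP, e.g. CJK
        else bytes += 4;                          // supplementary planes
    }
    return bytes;
}

console.log(utf8ByteLength('ΦX族')); // 6
```

Where TextEncoder is available, new TextEncoder().encode(str).length should give the same number for well-formed strings.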
EDIT: added on request:
Characters in JavaScript strings are always encoded in UTF-16, which originally meant a two-byte representation of each character. That was all well and nice until they suddenly (ha!) found out that two bytes are not really sufficient for all of the alphabets of the world, so they expanded the Unicode range beyond 16 bits, to code points that need up to four bytes (as in UTF-32).
Well, the Unicode consortium did so but the ECMA committee did not.
It cannot be said that hell broke loose but it is quite close in some circumstances, and one of those is your case because you want to mix one-byte encodings with multiple-byte encodings, different ones even.
One byte fits well in two bytes but three or more bytes do not fit well in two bytes, so the so called surrogates were invented. These surrogates are also the reason why it is not so simple to reverse a string in JavaScript.
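To illustrate the string reversal problem, here is a small sketch. The function names are invented for the example, and it only deals with surrogate pairs; combining marks and other multi-code-point sequences would still be reordered.

```javascript
// Naive reversal works on UTF-16 code units and splits surrogate pairs apart
function reverseNaive(str) {
    return str.split('').reverse().join('');
}

// Array.from iterates by code point, so each surrogate pair stays together
function reverseByCodePoints(str) {
    return Array.from(str).reverse().join('');
}

const sample = 'a\uD83D\uDE00b'; // "a", an emoji (U+1F600), "b"
console.log(reverseByCodePoints(sample) === 'b\uD83D\uDE00a'); // true
console.log(reverseNaive(sample) === 'b\uD83D\uDE00a');        // false
```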
As I said: a large can of worms.
I came across this code in a javascript open source project.
validator.isLength = function (str, min, max) {
// match surrogate pairs in string or declare an empty array if none found in string
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
// subtract the surrogate pairs string length from main string length
var len = str.length - surrogatePairs.length;
// now compare string length with min and max ... also make sure max is defined(in other words, max param is optional for function)
return len >= min && (typeof max === 'undefined' || len <= max);
};
As far as I understand, the above code is checking the length of the string but not taking the surrogate pairs into account. So:
Is my understanding of the code correct?
What are surrogate pairs?
I have thus far only figured out that this is related to encoding.
Yes. Your understanding is correct. The function returns the length of the string in Unicode Code Points.
JavaScript uses UTF-16 to encode its strings. This means each code unit is two bytes (16 bits), and most Unicode code points are represented by a single code unit.
Now there are characters (like emoji) in Unicode whose code points are so high that they cannot be stored in 2 bytes (16 bits), so they are encoded as two UTF-16 code units (4 bytes). These are called surrogate pairs.
Try this
var len = "😀".length // There is an emoji in the string (if you don’t see it)
vs
var str = "😀"
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;
In the first example len will be 2 because the emoji consists of two UTF-16 code units. In the second example len will be 1.
You might want to read
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
For your second question:
1. What is a "surrogate pair" in Java?
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units are in two ranges known as "high surrogates" and "low surrogates", depending on whether they are allowed at the start or the end of the two-code-unit sequence.
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
Hope this helps.
Did you try to just google it?
The best description is http://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
In UTF-16 some characters are stored in 16 bits and others in 32 bits.
A surrogate pair is a character representation that takes 32 bits (two 16-bit code units).
Some character codes are reserved to be the first one in such pairs.
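For what it's worth, the arithmetic behind a surrogate pair can be sketched as follows. toSurrogatePair is a name I made up for the example; the constants are the standard UTF-16 ones.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
    if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new RangeError('code point does not need a surrogate pair');
    }
    const offset = codePoint - 0x10000;      // 20 bits remain
    const high = 0xD800 + (offset >> 10);    // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
    return [high, low];
}

const pair = toSurrogatePair(0x1F600);
console.log(pair.map((unit) => unit.toString(16))); // [ 'd83d', 'de00' ]
```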
I have a JavaScript variable called "name". If "name" contains fewer than 4 characters, I want to execute this line: msg('name','Your name must contain minimum 4 characters.');
I have tried something like this, but it was interpreted as a numeric comparison. Any ideas? Thank you.
if(name < 4 ) {
msg('name','Your name must contain minimum 4 characters.');
return false;
}
if (name.length < 4) {
...
}
You probably want to check the length of the string, not the numeric value of the string itself:
if(name.length < 4) {
// ...
}
if(name.length < 4) {
//Do something
}
You have to check the length of the variable.
length can also be used to check the length of an Array
\n (new line) is also counted as a character.
Depending on your definition of “character”, all answers posted so far are incorrect. The string.length answer is only reliable when you’re certain that only BMP Unicode symbols will be entered. For example, 'a'.length == 1, as you’d expect.
However, for supplementary (non-BMP) symbols, things are a bit different. For example, '𝌆'.length == 2, even though there’s only one Unicode symbol there. This is because JavaScript exposes UCS-2 code units as “characters”.
Luckily, it’s still possible to count the number of Unicode symbols in a JavaScript string through some hackery. You could use Punycode.js’s utility functions to convert between UCS-2 strings and Unicode code points for this:
// `String.length` replacement that only counts full Unicode characters
punycode.ucs2.decode('a').length; // 1
punycode.ucs2.decode('𝌆').length; // 1 (note that `'𝌆'.length == 2`!)
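In engines that support ES2015, the same count can probably be had without a library, because string iteration goes by code point. A minimal sketch, with countCodePoints being a name chosen for this example:

```javascript
// Count Unicode code points instead of UTF-16 code units.
// Array.from uses the string iterator, which keeps surrogate pairs together.
function countCodePoints(str) {
    return Array.from(str).length;
}

const tetragram = String.fromCodePoint(0x1D306); // the '𝌆' symbol from above
console.log(tetragram.length);           // 2
console.log(countCodePoints(tetragram)); // 1
console.log(countCodePoints('a'));       // 1
```

Note that this still counts code points, not what a user perceives as one character: a letter followed by a combining accent counts as 2.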
I have an array of phone numbers and I need to find if a particular phone number is in it.
What I tried doing at first was if(arr.indexOf(phoneNumber) != -1) { bla.. }. And it worked - sometimes.
I later discovered that since the numbers arrive from different phones and entry forms, some people use country codes (like +1-xxx-xxx-xxxx) and some don't. Some use spaces as separators and some just put in 10 digits in a row. In short - hell to compare.
What I need is an elegant solution that would allow me to compare, hopefully without having to replicate or change the original array.
In C++ you can define comparison operators. I envision my solution as something like this pseudo-code, hopefully using some smart regex:
function phoneNumberCompare(a, b) {
a = removeAllSeperators(a); //regex??
a = a.substring(a.length, a.length - 10);
b = removeAllSeperators(b); //regex??
b = b.substring(b.length, b.length - 10);
return (a < b ? -1 : (a == b ? 0 : 1)); //comparison in C++ returns -1, 0, 1
}
and use it like if(arr.indexOf(phoneNumber, phoneNumberCompare) != -1)
Now, I know a solution like this construct does not exist in JavaScript, but can someone suggest something short and elegant that achieves the desired result?
As always, thanks for your time.
PS: I know indexOf() already has a second parameter (position); the above is just meant to illustrate what I need.
You really should sanitize all the data, both at collection and in the DB.
But for now, here's what you asked for:
function bPhoneNumberInArray (targetNum, numArray) {
var targSanitized = targetNum.replace (/[^\d]/g, "")
.replace (/^.*(\d{10})$/, "$1");
//--- Choose a character that is unlikely to ever be in a valid entry.
var arraySanitized = numArray.join ('Á').replace (/[^\dÁ]/g, "") + 'Á';
//--- Only matches numbers that END with the target 10 digits.
return (new RegExp (targSanitized + 'Á') ).test (arraySanitized);
}
How it works:
The first statement removes everything but digits (0-9) from the target number and then strips out anything before the last 10 digits.
Then we convert the array to be searched into a string (very fast operation).
When joining the array, we use some character to separate each entry.
It must be a character that we are reasonably sure would never appear in the array. In this case we chose Á. It could be anything that doesn't ever appear in the array.
So, an array: [11, 22, 33] becomes a string: 11Á22Á33Á, for example.
The final regex, then searches for the target number immediately followed by our marker-character -- which signals the end of each entry. This ensures that only the last 10 digits of an array's number are checked against the 10-digit target.
Testing:
var numArray = ['0132456789', "+14568794324", "123-456-7890"];
bPhoneNumberInArray ("+1-456-879-4324", numArray) // true
bPhoneNumberInArray ("+14568794324", numArray) // true
bPhoneNumberInArray ("4568794324", numArray) // true
bPhoneNumberInArray ("+145 XXX !! 68794324", numArray) // true !
bPhoneNumberInArray ("+1-666-879-4324", numArray) // false
You should sanitize both the input and all array values, to make sure they conform to the same ruleset.
Just create a function called sanitizePhonenumber, where you strip (or add, depending on your preferences) the country code and all other characters you don't want there.
After that you can just compare them as you are doing now.
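A rough sketch of that idea, assuming (as in the question) that only the last 10 digits matter. The names normalizePhoneNumber and phoneNumberInArray are invented for the example, and the original array is left unchanged.

```javascript
// Strip everything except digits, then keep only the last 10 digits.
// Assumption: the last 10 digits identify the number (country code dropped).
function normalizePhoneNumber(raw) {
    return String(raw).replace(/\D/g, '').slice(-10);
}

// True if any entry normalizes to the same 10 digits as the target
function phoneNumberInArray(target, numbers) {
    const wanted = normalizePhoneNumber(target);
    return numbers.some((n) => normalizePhoneNumber(n) === wanted);
}

const numArray = ['0132456789', '+14568794324', '123-456-7890'];
console.log(phoneNumberInArray('+1-456-879-4324', numArray)); // true
console.log(phoneNumberInArray('+1-666-879-4324', numArray)); // false
```

If the array is large and searched often, normalizing it once into a Set and looking the target up there would avoid repeating the work on every call.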