I have seen that the sort function in JavaScript transforms every letter of a word into its ASCII code to permit comparison between words when an alphabetical sort is required. How does this function manage to find the ASCII code for every letter? Does it scan through a list for each letter?
What method does the function use to associate an ASCII code with a letter?
Thank you so much for the help :)
Abstractly, characters are numbers, in the sense that what we think of as the character "a" and the number 97 are both the byte 01100001.
It is a lot more nuanced than that, since Unicode supports multi-byte characters and numbers in JavaScript are multi-byte floating-point values, but the concept holds at a high level.
An "encoding" such as ASCII, WE8DEC, or one of the Unicode flavors is essentially a way to map a byte (or set of bytes) to what we think of as a character.
So, if numbers can be sorted, then so too can characters and thus strings.
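You can see both directions of that mapping, and its effect on the default sort, in any JavaScript console:

// 'a' and 97 are two views of the same underlying number
'a'.charCodeAt(0);        // 97
String.fromCharCode(97);  // "a"

// the default Array#sort compares strings by these code units
['b', 'a', 'C'].sort();   // ["C", "a", "b"] because "C" (67) < "a" (97)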
You might also be interested in this post that explains the native sorting rules: How does JavaScript decide sort order with characters from different character sets?
I'm trying to match a string starting from the last character, so that the match can fail as soon as possible. This way I can fail a match with a custom string cstr (see the specification below) with the least amount of operations (4th property).
From a theoretical perspective, the regex can be represented as a finite state machine, and the arrows can be flipped, creating the reversed regex.
I'm looking for an implementation of this: a library/program to which I can give the string and the pattern. cstr is implemented in Python, so if possible a Python module. (For the curious: the i-th character is not calculated until needed.) Anything else means much more work for me, because cstr's calculation is hard to port to another language.
The implementation doesn't have to cover all regex syntax. I'm looking for the basics. No lookaheads or fancy stuff. See the specification below.
I may be lacking common knowledge, so please do comment on obvious things, too.
Specification
The custom string cstr has the following properties:
The string can be calculated in finite time.
The string has finite length.
The last character is known.
Every previous character requires a costly calculation.
Until the string is calculated fully, its length is unknown.
When the string is fully calculated, I want to match it against a simple regex, which may contain the following syntax. No lookaheads or fancy stuff.
alphanumeric characters
Unicode characters
., *, +, ?, \w, \W, [], |, the escape character \, and range specification with { and }
PS: This is not a homework question. I have tried to formulate my question as clearly as possible.
OP here. Here are some thoughts:
Since I'm looking for an unoptimized regex machine, I would have to build it myself, which takes time.
Alternatively, we can define an upper bound for the length of cstr and generate all strings that match the given regex with length < upperbound. Then we put all solutions into a trie data structure and match against that (see the sketch below). This depends on the use case, and maybe a cache can be involved.
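A rough sketch of that trie idea (illustrative JavaScript rather than Python, with made-up helper names): candidates are inserted reversed, so matching consumes the cheap last character first and fails as early as possible.

// build a trie of the candidate strings, each inserted back to front
function buildSuffixTrie(candidates) {
  const root = {};
  for (const word of candidates) {
    let node = root;
    for (let i = word.length - 1; i >= 0; i--) {
      node = node[word[i]] ??= {};
    }
    node.end = true; // a complete candidate ends here
  }
  return root;
}

// charsFromEnd yields the string's characters last character first,
// mimicking how cstr is calculated lazily from the end
function matchesFromEnd(root, charsFromEnd) {
  let node = root;
  for (const ch of charsFromEnd) {
    node = node[ch];
    if (!node) return false; // fail as soon as no candidate shares this suffix
  }
  return !!node.end;
}

const trie = buildSuffixTrie(['abc', 'xbc', 'abd']); // e.g. all strings matching the regex under the bound
matchesFromEnd(trie, [...'abd'].reverse()); // true
matchesFromEnd(trie, [...'azd'].reverse()); // false, after only two characters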
What I'm going for is the Python module greenery:
from greenery import parse
pattern = parse.Pattern(...)
pattern.reversed()
...
This sometimes provides a good matching experience, sometimes not, but it is OK for me.
In JavaScript I am using NFKC normalization via String.prototype.normalize to normalize fullwidth characters to their standard ASCII halfwidth equivalents.
'１'.normalize('NFKC') === '1'
> true
However, a more obscure digit like ૫, which is the digit five in Gujarati, does not normalize:
'૫'.normalize('NFKC') === '5'
> false
What am I missing?
Unicode normalisation is meant for characters that are variants of each other, not for every set of characters that might have similar meanings.
The character ‘１’ (FULLWIDTH DIGIT ONE) is essentially just the character ‘1’ (DIGIT ONE) with slightly different styling and would not have been encoded if it was not necessary for compatibility. They are – in some contexts – completely interchangeable, so the former was assigned a decomposition mapping to the latter. The character ‘૫’ (GUJARATI DIGIT FIVE) does not have a decomposition mapping because it is not a variant of any other character; it is its own distinct thing.
You can consult the Unicode Character Database to see which characters decompose and which (i.e. most of them) don’t. The link to the tool you posted as part of your question shows you for example that ૫ does not change under any form of Unicode normalisation.
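You can verify this from JavaScript itself:

// '૫' (GUJARATI DIGIT FIVE) has no decomposition mapping, so it survives
// every normalization form unchanged
const ch = '૫';
['NFC', 'NFD', 'NFKC', 'NFKD'].every(f => ch.normalize(f) === ch); // true

// '１' (FULLWIDTH DIGIT ONE) has a compatibility mapping to '1'
'１'.normalize('NFKC'); // "1"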
You are looking at the wrong problem.
Unicode's main purpose is encoding characters (without losing information). Fonts and other programs should be able to interpret such characters and produce a glyph, according to the code point, the nearby characters, and characteristics outside the code points themselves (such as language, epoch, and font characteristics: script or non-script, uppercase, italic, etc. change how characters and ligatures are combined, and also the glyph form).
There are two main normalizations (canonical and compatibility), each with two variants: decomposed, and composed where possible. Canonical normalization removes unneeded characters (repetitions) and orders combining characters in a standard way. Compatibility normalization additionally removes "compatibility characters": characters that are in Unicode only so that converting to and from other charsets does not lose information.
Some digits (like the small superscript 2) have a compatibility mapping to a normal digit (that is a formatting question, and Unicode is not about formatting). But in all other cases, digits in different scripts are kept as different characters.
That was about normalization.
But you want the numeric value of a Unicode character (warning: it could depend on other characters, position, etc.).
The Unicode database also provides such a property.
With JavaScript, you may use the unicode-properties package, which among other things provides the function getNumericValue(codePoint). This package seems to use efficient compression of the database, so I don't know how fast it is; the database is huge.
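A minimal sketch of what that could look like, assuming the package exposes getNumericValue as described:

// assumes: npm install unicode-properties, and that getNumericValue
// behaves as the package's documentation describes
const unicode = require('unicode-properties');

unicode.getNumericValue('૫'.codePointAt(0)); // expected: 5 (U+0AEB GUJARATI DIGIT FIVE)
unicode.getNumericValue('5'.codePointAt(0)); // 5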
The NFKC you are using here stands for Compatibility Decomposition followed by Canonical Composition, which in plain English means: first break things into smaller, more commonly used symbols, and then combine them to find the equivalent simpler character. For example 𝟘 -> 0, and ﬁ -> fi (the ligature ﬁ is code point 64257).
It does not convert to ASCII; for example ख़ (2393) -> ख़ ([2326, 2364]).
Reference: https://unicode.org/reports/tr15/#Norm_Forms
For a simpler explanation: https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c
I'm working on a client-side app. Users can select a few widgets on the page and share their selection with friends by sending them the URL of the page. I'm planning on saving the user's widget selections via a query string. I'd like the URL to be as small as possible so that it's easier for people to share.
Now to my question. I have a string of eight characters that I'd like to encode so that the output of the encoding is significantly smaller. I realize that 8 characters isn't very big, but it has the potential to get larger in the future.
//using hex encoding results in a saving of 1 character
(98765432).toString(16) //"5e30a78"
example.com?q=98765432 vs example.com?q=5e30a78
Ideally I'd like the new string to be 4 characters or less. What are my options for encoding a string that will be used in URLs?
I've looked at this question: How can I quickly encode and then compress a short string containing numbers in c# but the encoded string is still too long.
A short tale about compression:
Let's say that you have an alphabet A and a set of words W(A) over alphabet A. Consider a function
f: W(A) -> W(A)
which takes a word w and maps it to a word f(w) in the same alphabet.
Now it can be shown that if this function is invertible and there is a word w1 such that
length(f(w1)) < length(w1)
(i.e. we've compressed the word), then there exists a word w2 for which the opposite holds:
length(f(w2)) > length(w2)
So this means that every compression method you've ever heard of is actually an illusion: for every method there is a file that will be larger after compression. (This is the pigeonhole principle at work: an invertible map cannot shorten every word, because there are fewer short words than long ones.) Compression works because methods make assumptions about the initial files, for example that they are text written in a natural language. They are optimized for such cases and fail for other inputs, like white noise.
Back to your problem. If you wish to compress [a-zA-Z0-9] words into words over the same alphabet, and all inputs are possible, then you are doomed.
But there are at least two things you can think about:
Find the most common [a-zA-Z0-9] words and map them onto short words. For example, say you find out that example.com?q=98765432 is the most common case among your users. Then you can map it to example.com?c=1 (note the parameter change). You will need a dictionary for such mappings. Of course, for some rare cases you will end up with a larger URL; e.g. example.com?q=abcd might be mapped to example.com?c=abcdefgh, unfortunately.
Restrict your input alphabet and enlarge your output alphabet. The bigger the difference, the bigger the real compression that is possible. Note that, unfortunately, there is a rather low upper limit on the alphabet usable in URLs, namely the 128 ASCII characters (and in practice fewer). For example, if you have the alphabets A={1,2} and B={1,2,3,4,5,6}, then you can map 1~1, 2~2, 11~3, 12~4, 21~5, 22~6, which means every word over A can be written over B in a way that roughly halves its size. A sketch of this idea follows.
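As a concrete illustration of the second point (the function names here are made up): the query value uses only digits (10 symbols), but URLs tolerate [0-9a-zA-Z] (62 symbols), so re-encoding the number in base 62 shortens it:

const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// encode a non-negative integer using all 62 URL-safe alphanumerics
function toBase62(n) {
  let out = '';
  do {
    out = ALPHABET[n % 62] + out;
    n = Math.floor(n / 62);
  } while (n > 0);
  return out;
}

function fromBase62(s) {
  let n = 0;
  for (const ch of s) n = n * 62 + ALPHABET.indexOf(ch);
  return n;
}

toBase62(98765432);  // "6GpoQ" – 5 characters instead of 8
fromBase62('6GpoQ'); // 98765432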
I use "".charCodeAt(pos) to get the Unicode number for a strange character, and then String.fromCharCode for the reverse.
But I'm having problems with characters that have a Unicode number greater than 55349, for example the blackboard bold characters. If I want lowercase blackboard bold x (𝕩), which has the Unicode number 120169, and I alert it from JavaScript:
alert(String.fromCharCode(120169));
I get another character. The same thing happens if I log the character codes of an uppercase blackboard bold X (𝕏), which has the Unicode number 120143, directly from within JavaScript:
s="𝕏";
alert(s.charCodeAt(0))
alert(s.charCodeAt(1))
Output:
55349
56655
Is there a method to work with these kinds of characters?
Internally, JavaScript stores strings in a 16-bit encoding resembling UCS-2 and UTF-16 (I say resembling because it is really neither of those two). Because the units are 16 bits, characters outside the BMP, with code points above 65535, are split into two separate code units, a so-called surrogate pair. If you store the two units separately and recombine them later, you should get the original character back without problems.
Recognizing that you have such a character can be rather tricky, though.
Mathias Bynens has written a blog post about this: JavaScript’s internal character encoding: UCS-2 or UTF-16?. It’s very interesting (though a bit arcane at times), and concludes with several references to code libraries that support the conversion from UCS-2 to UTF-16 and vice versa. You might be able to find what you need in there.
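To illustrate with the exact numbers from the question, here is a small sketch of the split-and-recombine arithmetic that UTF-16 uses (the helper names are made up; in current engines String.fromCodePoint and String.prototype.codePointAt do this for you):

// split a code point above 0xFFFF into its UTF-16 surrogate pair
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10),    // high (lead) surrogate
          0xDC00 + (offset & 0x3FF)]; // low (trail) surrogate
}

// recombine a surrogate pair into the original code point
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

toSurrogatePair(120143);           // [55349, 56655] – the two values above
String.fromCharCode(55349, 56655); // "𝕏"
fromSurrogatePair(55349, 56655);   // 120143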
In my web application I have created a small framework that binds model data to controls on the page. Each model property has rules such as string length, not-null, and a regular expression. Before the page is submitted, the framework validates every bound control against the defined rules.
Now, I want to detect which characters are allowed by each regular expression rule, as in the following examples:
"^[0-9]+$" allows only digit characters like 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allows only the characters a-z, A-Z, 0-9, - and _.
However, this function should not care about grouping or the position of the allowed characters. It should just report the possible characters.
Do you have any idea for creating this function?
PS. I know it is easy to create a specialized function, such as a numeric-only check that allows only digit characters. But I need to share/reuse the same piece of code in both the data tier (which contains all the model validators) and the UI tier without modifying anything.
Thanks
You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character; they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembers the farthest point it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that fails, match against ^pattern without the end anchor. This is more likely to give you some sort of match for the left-hand part of the string, so you can count how many characters were matched and say that the following character is ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for simple cases like [a-zA-Z0-9_]+.
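A minimal sketch of that idea (firstInvalidIndex is a made-up name):

// returns -1 if the whole string matches, otherwise an estimate of the
// index of the first offending character
function firstInvalidIndex(pattern, input) {
  if (new RegExp('^' + pattern + '$').test(input)) return -1;
  const partial = input.match(new RegExp('^' + pattern));
  return partial ? partial[0].length : 0;
}

firstInvalidIndex('[a-zA-Z0-9_]+', 'abc$def'); // 3 – the offending '$'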
I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were already quite close.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.
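A quick demonstration:

const onlyAlnum = /^[A-Za-z0-9]+$/;
onlyAlnum.test('abc123');  // true – every character is in the class
onlyAlnum.test('abc 123'); // false – the space is not allowed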