Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
Is there a reason that when sanitizing a string, the characters are converted to lowercase as opposed to uppercase?
I've seen this convention in many languages, but in terms of my current environment, we'll say Rails and/or JavaScript.
No specific reason to my knowledge, but neither uppercasing nor lowercasing is the whole story in the Unicode world.
For example, the German letter ß is exactly equivalent to ss; they're both lowercase, and a word spelled with ß can also be spelled with ss.
Conversely, in Turkish, ı (dotless i) is distinct from i (dotted i), but unless your locale is Turkish, uppercasing either one produces I (dotless ASCII I). This changes meaning too. You don't want to use the wrong one; they aren't equivalent.
Because of this, some programming languages offer more specific "case normalizing" conversions per the case folding rules in section 3.13 of the Unicode standard; Python 3.3 introduced str.casefold for that reason. It's much like .lower(), but it will also normalize things like ß to ss because they're logically equivalent (if you're uniquifying, you wouldn't want two strings that differ only in ß vs. ss to be treated as different).
If you don't have case folding available in your language, then the distinction between normalizing as upper vs. lower case is mostly by convention.
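As a rough illustration in JavaScript, which has no built-in case folding: uppercasing and then lowercasing approximates folding for the ß case mentioned above. The helper name roughFold is made up for this sketch, and it is not a full implementation of the Unicode case folding rules.

```javascript
// Hypothetical helper: approximates case folding by round-tripping
// through uppercase. This is NOT the complete Unicode algorithm.
function roughFold(text) {
  return text.toUpperCase().toLowerCase();
}

// Lowercasing alone leaves ß untouched, so it cannot match "ss".
console.log("Straße".toLowerCase() === "strasse"); // false

// Uppercasing expands ß to SS, so the round trip makes both spellings equal.
console.log(roughFold("Straße") === roughFold("STRASSE")); // true
```

For comparisons that must be correct across all scripts, a dedicated case folding implementation is the safer choice.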
JavaScript has toLowerCase() as well as toUpperCase(). You can use either!
I think the answer to your question, though, really stems from Unix systems deciding many decades ago to be case sensitive and to use all-lowercase commands. This translated to case-sensitive URLs in Apache, and to stay compatible across operating systems, we just made sure everything was always lowercase.
I guess all upper case could be and is used at times, but it's also obnoxious :)
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
I need basically all the language-specific tokens and symbols for JavaScript: all the keywords, all the identifiers, all the punctuators, all the logical operation symbols, etc.
Where can I actually find it?
The complete lexical specification for ECMAScript (aka JavaScript) can be found in Section 11 (Lexical Grammar) of ECMA 262. (New versions of ECMA-262 are released roughly annually, but the URL in that link -- correct as of ECMAScript 2020 -- should continue to work, or be easy to fix. I didn't transcribe the list of reserved words, semi-reserved words, operators and other punctuators, because those lists may well change in a future standard.)
However, precisely specifying which IdentifierNames have specific meaning is not simple. You need to carefully read subsection 11.6 (Names and Keywords), and it will probably still be a bit confusing. The difficulty is that each successive version of ECMAScript tries extraordinarily hard to stay compatible with previous versions, making it hard to introduce new keywords. So many significant symbols are not in the list of reserved words. Some, like await, are only reserved in certain contexts; many others (such as let) are only reserved in strict mode; and a few (such as async) are always usable as identifiers, but can be used syntactically in some contexts where an identifier would be a syntax error. There are various lists of symbols in section 11.6.2, which you could combine or select from, depending on what your need is.
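To make the context dependence concrete, here is a small sketch that asks the engine whether a snippet parses as a function body. The helper name parses is made up for this example; it relies on the Function constructor compiling its argument as non-strict code unless the snippet opts in with "use strict".

```javascript
// Hypothetical helper: returns true if `source` parses as a function body.
function parses(source) {
  try {
    new Function(source); // compiles the code without running it
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(parses("var async = 1;"));                            // true: async is always a valid identifier
console.log(parses("var let = 1;"));                              // true: let is allowed in non-strict code
console.log(parses('"use strict"; var let = 1;'));                // false: reserved in strict mode
console.log(parses("var await = 1;"));                            // true: fine outside async functions
console.log(parses("async function f() { var await = 1; }"));     // false: reserved inside async functions
console.log(parses("var if = 1;"));                               // false: if is always reserved
```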
Operator symbols (other than word operators) are a bit more straightforward (although there are still some oddities, so reading the fine print is still necessary). See the various lists in section 11.7.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm somewhat new to regex. I understand most of the basics but what I'm trying to do is beyond my knowledge, and may not even be possible.
I'm trying to make a regex in JavaScript that can match a series of function calls in the following pattern.
Name.Name(Params).Name(Params)
The names could be any standard java function name. I understand how to do this part. The params, though, can vary in number (currently only 0-2).
My biggest issue, however, is that the params could potentially be ANY string in single or double quotation marks, or variable names. I have added some examples below, as I need all of these to work with my regular expression (if possible).
Examples:
Func.Foo().Bar()
Foo.Bar('foo', bar).Foobar()
Foo.Bar("foo", "bar").bar(')')
Foo.Bar('/"foo/"').bar("foo(bar/")")
My main concern here is that I can't just look for an opening and closing parenthesis, or even for two quotation marks.
Is it possible to use a regex so that I can parse the function call and parameters out?
The short answer to the question in the title is yes, you can build a regex that matches any substring. But unfortunately that is not what you want. If you allow arbitrary substrings, your regex will either match many cases you don't want to match or become extremely complex (see the email regex for an example).
What you want is a tokenizer: https://medium.freecodecamp.org/how-to-build-a-math-expression-tokenizer-using-javascript-3638d4e5fbe9
Edit: regarding the solutions in the comments, the AST parser is for Java, while the author wants to use JavaScript.
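For illustration, a minimal tokenizer for call chains like the ones in the question might look as follows. This is only a sketch under assumptions: the function name tokenize and the token shapes are made up here, and it assumes quotes inside strings are escaped with a backslash.

```javascript
// Hypothetical tokenizer for input such as Foo.Bar('foo', bar).Baz().
// Assumption: quotes inside string literals are escaped with a backslash.
function tokenize(input) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    const ch = input[i];
    if (/\s/.test(ch)) { i++; continue; }           // skip whitespace
    if (".(),".includes(ch)) {                      // single-character punctuation
      tokens.push({ type: "punct", value: ch });
      i++;
      continue;
    }
    if (ch === '"' || ch === "'") {                 // quoted string literal
      let j = i + 1;
      let value = "";
      while (j < input.length && input[j] !== ch) {
        // On a backslash, skip it and keep the escaped character.
        if (input[j] === "\\" && j + 1 < input.length) j++;
        value += input[j];
        j++;
      }
      if (j >= input.length) throw new SyntaxError("Unterminated string at " + i);
      tokens.push({ type: "string", value: value });
      i = j + 1;                                    // move past the closing quote
      continue;
    }
    const match = /^[A-Za-z_$][A-Za-z0-9_$]*/.exec(input.slice(i));
    if (match) {                                    // function or variable name
      tokens.push({ type: "name", value: match[0] });
      i += match[0].length;
      continue;
    }
    throw new SyntaxError("Unexpected character '" + ch + "' at " + i);
  }
  return tokens;
}

// The parenthesis inside the quotes arrives as a string token, not punctuation.
console.log(tokenize(`Foo.Bar("foo", "bar").bar(')')`));
```

Because strings are consumed as a unit, a ")" or a quote character inside a parameter no longer confuses the matching of the real parentheses; a small parser can then walk the token list to pull out each name and its parameters.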
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I need to identify the difference between Brazilian and European Portuguese, either with character sets, Unicode or ASCII letters, or regex, or with the trigrams used to distinguish these two languages.
Most language detectors, like NTextCat and guesslanguages.js, do not distinguish between the two. Does anyone have a solution for this issue?
Thanks in Advance :)
It's no different from telling apart US English and UK English.
You must know both variants and look for very specific differences. It's a tricky and inaccurate approach. You may also need the context of the message to get the meaning of the words.
Even a native Portuguese speaker can have a hard time telling them apart, and it's even worse for short texts.
For an example, search for the same topic (say, the Clinton vs. Trump debate) on Brazilian and Portuguese news sites, read the articles, and look for the differences. You will get an idea.
Also keep in mind that if you are processing casual chat, you will need to handle slang, misspellings and region-specific expressions from each country.
After reading how Guesslanguage uses trigram analysis, I see it will have a bad time telling dialects apart. There are few words with different spellings.
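To show what a trigram-based detector has to work with, here is a rough sketch of character-trigram counting. The function name trigramCounts is made up for this example, and this is not Guesslanguage's actual code.

```javascript
// Hypothetical helper: counts overlapping character trigrams in a text.
function trigramCounts(text) {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const counts = new Map();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    const gram = normalized.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// "facto" yields the trigrams fac, act and cto, each counted once.
console.log(trigramCounts("facto"));
```

Two variants of the same language share almost all of their frequent trigrams, so the profiles a detector compares end up very close to each other.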
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I see most code-bases using bracketed if-statements in all cases including cases in which only one statement needs to be executed after the if.
Are there any clear benefits to always using bracketed if-statements or has it just become a standard over time?
What is the general consensus on this? Would it be a big mistake to allow one-line ifs through our linter?
Largely this comes down to developer preference, although many reasons are given for using the curly braces on every control block.
First, it makes the control block more readable. Consider:
    if (some_condition)
        runSomeFunction();
        runSomeOtherFunction();
Since indentation is not significant in most curly-brace languages, this will run, but it is misleading: only runSomeFunction() is part of the control block, and runSomeOtherFunction() always executes. Compare to:
    if (some_condition) {
        runSomeFunction();
    }
    runSomeOtherFunction();
Second, when you need to add something to the control block (which happens more often than not), adding the curly braces afterwards can be frustrating or easily forgotten, leading to issues like the one above.
Still, those largely come down to preference, and you can always find exceptions. For example, if (some_condition) runSomeFunction(); on a single line is much more readable than the first example above and accomplishes the same thing more concisely.
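A small runnable sketch of the pitfall described above, in JavaScript; the function names are made up for this example.

```javascript
// Hypothetical demo: records which statements actually run.
function withoutBraces(condition) {
  const calls = [];
  if (condition)
    calls.push("first");
    calls.push("second"); // indented like the body, but runs unconditionally
  return calls;
}

function withBraces(condition) {
  const calls = [];
  if (condition) {
    calls.push("first");
    calls.push("second"); // really inside the control block
  }
  return calls;
}

console.log(withoutBraces(false)); // [ 'second' ]
console.log(withBraces(false));    // []
```

If the linter in question is ESLint, its curly rule covers this choice: the "all" option requires braces everywhere, while "multi-line" permits single-line ifs without braces.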
If you have to go back to the code to add something, you might forget that you didn't open a bracket, and your code won't behave as intended once the body grows beyond one statement.
Other than that it's a matter of preference and format.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I understand that using toLowerCase() will make an entire string all lowercase. However, I'm curious about the behind-the-scenes work. I can't find an explanation anywhere of how it works. Does it basically loop through every index in the string, check what the character is, and see whether a lowercase character is available?
In general the best place to look for such information is the ECMAScript specification:
The following steps are taken:
1. Call CheckObjectCoercible passing the this value as its argument.
2. Let S be the result of calling ToString, giving it the this value as its argument.
3. Let L be a String where each character of L is either the Unicode lowercase equivalent of the corresponding character of S or the actual corresponding character of S if no Unicode lowercase equivalent exists.
4. Return L.
For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping.
The result must be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later).
Step 3 is the part you're really interested in. As you can see, the details of how "L" is produced are up to the implementation. If you're interested in going deeper the next place to look would be e.g. the V8 engine itself.
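To make the quoted steps concrete, here is a rough sketch of them in plain JavaScript. It is not how V8 or any other engine actually implements the method: the names sketchToLowerCase and lowerOfCodeUnit are made up, and the mapping only covers ASCII A-Z as a stand-in for the Unicode case-mapping tables.

```javascript
// Stand-in for the Unicode case-mapping tables (assumption: ASCII A-Z only).
function lowerOfCodeUnit(unit) {
  const isAsciiUpper = unit >= 0x41 && unit <= 0x5a;
  return isAsciiUpper ? unit + 0x20 : unit;
}

function sketchToLowerCase(thisValue) {
  // Step 1: CheckObjectCoercible rejects undefined and null.
  if (thisValue === undefined || thisValue === null) {
    throw new TypeError("Cannot convert undefined or null to object");
  }
  // Step 2: convert the value to a string (String() approximates ToString).
  const s = String(thisValue);
  // Step 3: map each 16-bit code unit to its lowercase equivalent,
  // or keep it unchanged when the table has no mapping.
  let l = "";
  for (let i = 0; i < s.length; i++) {
    l += String.fromCharCode(lowerOfCodeUnit(s.charCodeAt(i)));
  }
  // Step 4: return the result.
  return l;
}

console.log(sketchToLowerCase("Hello World 123")); // "hello world 123"
```

So yes, conceptually it is a loop over the string with a table lookup per character; a real engine swaps the tiny ASCII check for the full Unicode data and typically adds fast paths.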