Are simpleflake ids generated in different languages consistent? - javascript

I need to generate unique IDs in a distributed manner: some on the server side and others on the client side. The server-side language can be Ruby or Python, while the client side is JavaScript.
I am planning to use the simpleflake libraries for the respective languages.
Can I assume that the IDs will never collide?
Or can they collide often, due to implementation details in the different packages?
Thanks in advance.
-Amit

Python's simpleflake and Node.js's simpleflakes are actually derived from the same origin (the Python implementation). Both generate 64-bit IDs, and the IDs generated by both are compatible with each other.
Simpleflake generates an ID with the formula
flake = (int((time.time() - 946702800) * 1000) << 23) + random.SystemRandom().getrandbits(23)
As pointed out in an earlier answer, the probability of collision is really low: two IDs collide only if they share the same 41-bit millisecond timestamp and the same randomly generated 23-bit integer.
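To put a rough number on that, here is a minimal sketch using the standard birthday approximation. It assumes the 23 random bits are uniformly distributed and that k IDs are generated within the same millisecond across all generators. The function name is made up for illustration.

```javascript
// Birthday approximation: probability that at least two of k IDs generated
// within the same millisecond draw the same 23-bit random value.
// Assumption: the 23 random bits are uniformly distributed.
const RANDOM_SPACE = 2 ** 23; // 8,388,608 possible random suffixes

function collisionProbability(k) {
  // 1 - exp(-k(k-1) / (2N)); expm1 keeps precision for tiny probabilities.
  return -Math.expm1(-(k * (k - 1)) / (2 * RANDOM_SPACE));
}

for (const k of [2, 10, 100, 1000]) {
  console.log(`${k} IDs in one millisecond: ${collisionProbability(k).toExponential(2)}`);
}
```

IDs generated in different milliseconds cannot collide at all (given a shared epoch), so the figure only matters for bursts that land in the same millisecond.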
However, it is important to know the difference between the two implementations mentioned above. The simpleflakes Node.js library fixes its epoch at 2000-01-01T00:00:00.000Z, whereas the Python implementation assumes an epoch of 2000-01-01T05:00:00.000Z (the 946702800 in the formula), five hours later.
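The effect of the two epochs can be sketched as follows. This is not the API of either library: makeFlake and timestampOf are hypothetical helpers that mirror the formula above, with Node's crypto.randomInt standing in for Python's SystemRandom and BigInt used because the values exceed Number's safe integer range.

```javascript
const { randomInt } = require("crypto");

// Epochs in Unix milliseconds, taken from the values quoted above.
const PYTHON_EPOCH_MS = 946702800000n; // 2000-01-01T05:00:00.000Z
const NODE_EPOCH_MS = 946684800000n;   // 2000-01-01T00:00:00.000Z
const RANDOM_BITS = 23n;

// Hypothetical helper: (milliseconds since epoch << 23) + 23 random bits.
function makeFlake(epochMs, nowMs = BigInt(Date.now())) {
  const random = BigInt(randomInt(2 ** 23)); // 0 <= random < 2^23
  return ((nowMs - epochMs) << RANDOM_BITS) + random;
}

// Recover the Unix timestamp (ms) from a flake, given the epoch used to decode it.
function timestampOf(flake, epochMs) {
  return (flake >> RANDOM_BITS) + epochMs;
}

const now = BigInt(Date.now());
const flake = makeFlake(PYTHON_EPOCH_MS, now);
console.log(timestampOf(flake, PYTHON_EPOCH_MS) === now); // true
console.log(timestampOf(flake, NODE_EPOCH_MS) - now);     // -18000000n (five hours)
```

So an ID decoded with the other implementation's epoch yields a timestamp that is five hours off, and IDs from the two sources do not sort in true chronological order relative to each other.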

I haven't used Simpleflake itself, but have been using a similar scheme for years, though I use 128 bits instead of 64.
The key ingredient is that a large share of the bits are random. So even if your libraries choose a slightly different number of bits for the timestamp portion, or a different granularity, the likelihood of collisions is low. Of course, such mismatches lessen the speed improvement in the database, which relies on IDs being roughly time-ordered.
I imagine that some Simpleflake implementation is "standard" and the other implementations are straight ports, keeping compatibility and characteristics. If not, shame on them for using Simpleflake in their name.

Related

Store language (ISO 639) as Number

I'm working on a MongoDB database and so far have stored some information as Numbers instead of Strings because I assumed that would be more efficient. For example, I store countries following ISO 3166-1 numeric and sex following ISO/IEC 5218. But so far I have not found a similar standard for languages; ISO 639 does not appear to have a matching list of numeric codes.
What would be the right way to do this? Should I just use the String codes?
Thanks!
If you're a fan of the numbers, you can use country calling codes, although they "only" represent the ITU members (193 countries according to Wikipedia). But hey, they have Somalia and Palestine, so that's a good hint about how global this is.
However, storing everything in an encoded format (numbers here) implies a decoding step on the fly whenever any piece of data is requested (with translation tables held in RAM rather than in the database's storage). That probably happens on the server, whose CPU is precious; or you might have offloaded the issue to the client, overworking the precious, time-critical server-client link in the process.
So, back in the 90s, when a 40MB HDD was expensive, that might have been interesting. Today, the cost of storing data vs. the cost of processing data is not on the same side of 1... Not counting the time it takes you to think and implement the transformations. All being said "IMHO", I think this level of efficiency actually kills efficiency. ;)
EDIT: Oops, just realized I misthought (does that verb even exist?) the country/language issue. Countries you have sorted out already, my bad. I know no numbered list of languages. However, the second part of the post might still be relevant...
If you are after raw performance and/or want to achieve really small data sizes, I would suggest you use either the three-letter (higher granularity) or the two-letter (lower granularity) codes from ISO 639-2 or ISO 639-1, respectively.
To my knowledge, no programming language has a helper for this standard built in, so you'd need to build your own translator (code<->full name), which should be trivial.
And as others already mentioned, you have to assess the cost involved with this (e.g. not being able to simply look at the data and understand it right away anymore) for yourself. I personally do recommend keeping data sizes small since BSON parsing and string operations are horribly expensive compared to dealing with numbers (or shorter strings for that matter). When dealing with small data sets, this won't make a noticeable difference. If, however, you need to churn through millions of documents or more, optimizations like this can become mission critical.
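A translator of the kind mentioned above can indeed be very small. The sketch below uses a handful of ISO 639-1 codes; the table contents and function names are illustrative, and a real table would cover the full list.

```javascript
// A few ISO 639-1 codes with English names (illustrative subset only).
const LANGUAGE_NAMES = new Map([
  ["en", "English"],
  ["fr", "French"],
  ["de", "German"],
  ["ru", "Russian"],
  ["ar", "Arabic"],
]);

// Reverse index (lower-cased name -> code), built once from the table above.
const LANGUAGE_CODES = new Map(
  [...LANGUAGE_NAMES].map(([code, name]) => [name.toLowerCase(), code])
);

// Both lookups are case-insensitive and return undefined for unknown input.
function nameOf(code) {
  return LANGUAGE_NAMES.get(code.toLowerCase());
}

function codeOf(name) {
  return LANGUAGE_CODES.get(name.toLowerCase());
}

console.log(nameOf("fr"), codeOf("German"));
```

In JavaScript specifically, newer runtimes also expose Intl.DisplayNames, which can map a language code to a display name, so it may be worth checking for that before maintaining a table by hand.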

Which seeds are used for native random number generators in common languages?

I am interested in finding out which seeds are used for native random number generators in common languages. Primarily, that means JavaScript, Objective-C, Swift and Java.
If you want to generate unique ids in distributed systems, you want to minimise the risk of collision. One strategy is to use a UNIX timestamp concatenated with a random number. However, if UNIX timestamp is also used as the sole seed for the random number generator, there is no point in adding a random number to the timestamp. If two units calculated an id at the same time using the same pseudo-random generator, they would then return the same random number as well. Using a hardware-specific id as part of the seed would be a good strategy, I think. But how is it actually implemented already in these languages?
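The concern can be reproduced with any seedable generator. The sketch below uses mulberry32, a small public-domain PRNG, purely as a stand-in; it is not the generator any of the platforms above is known to use, and the timestamp value is arbitrary.

```javascript
// mulberry32: a small seedable PRNG, used here only for illustration.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const timestamp = 1700000000; // both units read the same clock value

// Two units that seed with nothing but the timestamp...
const unitA = mulberry32(timestamp);
const unitB = mulberry32(timestamp);

// ...produce the same "random" suffix, so the IDs are identical.
const idA = `${timestamp}-${Math.floor(unitA() * 1e9)}`;
const idB = `${timestamp}-${Math.floor(unitB() * 1e9)}`;
console.log(idA === idB); // true
```

Mixing a per-device value into the seed makes identical streams unlikely rather than impossible, which is why an operating-system entropy source is the usual choice.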
This is a platform/framework question, not a language question.
I would suggest generating a UUID on all platforms. UUIDs are designed to be completely unique. iOS/Mac OS has NSUUID. I don't know about the other platforms.
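For the JavaScript side, a minimal sketch using Node.js's built-in crypto.randomUUID (available in recent Node.js releases; browsers expose crypto.randomUUID() in secure contexts). The wrapper name newId is hypothetical.

```javascript
const { randomUUID } = require("crypto");

// Returns a random (version 4) UUID string drawn from the platform's CSPRNG.
function newId() {
  return randomUUID();
}

console.log(newId());
```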

Rough Unicode -> Language Code without CLDR?

I am writing a dictionary app. If a user types a Unicode character, I want to check which languages the character may belong to.
e.g.
字 - returns ['zh', 'ja', 'ko']
العربية - returns ['ar']
a - returns ['en', 'fr', 'de'] //and many more
й - returns ['ru', 'be', 'bg', 'uk']
I searched and found that it could be done with CLDR https://stackoverflow.com/a/6445024/41948
Or Google API Python - can I detect unicode string language code?
But in my case:
Looking up a large charmap db seems to cost a lot of storage and memory.
Calling an API is too slow, and it requires a network connection.
It doesn't need to be very accurate; about 80% correct is acceptable.
Simple and fast is the main requirement.
It's OK to cover only UCS-2 BMP characters.
Any tips?
I need to use this in Python and Javascript. Thanks!
Would it be sufficient to narrow the glyph down to language families? If so, you could create a set of ranges (language -> code range) based on the mapping of BMP like the one shown at http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane or the Scripts section of the Unicode charts page - http://www.unicode.org/charts/
Reliably determining parent language for glyphs is definitely more complicated because of the number of shared symbols. If you only need 80% accuracy, you could potentially adjust your ranges for certain languages to intentionally include/leave out certain characters if it simplifies your ranges.
Edit: I re-read through the question you referenced CLDR from and the first answer regarding code -> language mapping. I think that's definitely out of the question but the reverse seems feasible if a bit computationally expensive. With clever data structuring, you could identify language families and then drill down to the actual language ranges from there, reducing traversals through irrelevant language -> range pairs.
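A range table of the kind described above could look like the following sketch. The block boundaries follow the Unicode charts, but the language lists attached to each range are illustrative guesses, and the Latin range is deliberately coarse (it also covers some punctuation and symbols).

```javascript
// [first code point, last code point, candidate languages], sorted by first code point.
// Assumption: the language lists are illustrative, not derived from CLDR.
const SCRIPT_RANGES = [
  [0x0041, 0x024f, ["en", "fr", "de"]],       // Latin letters and extensions (coarse)
  [0x0400, 0x04ff, ["ru", "be", "bg", "uk"]], // Cyrillic
  [0x0600, 0x06ff, ["ar"]],                   // Arabic
  [0x3040, 0x30ff, ["ja"]],                   // Hiragana and Katakana
  [0x4e00, 0x9fff, ["zh", "ja", "ko"]],       // CJK Unified Ideographs
  [0xac00, 0xd7af, ["ko"]],                   // Hangul Syllables
];

function languagesFor(char) {
  const cp = char.codePointAt(0);
  if (cp === undefined) return []; // empty string
  // Binary search over the sorted, non-overlapping ranges.
  let lo = 0;
  let hi = SCRIPT_RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [first, last, langs] = SCRIPT_RANGES[mid];
    if (cp < first) hi = mid - 1;
    else if (cp > last) lo = mid + 1;
    else return langs;
  }
  return [];
}

console.log(languagesFor("字"), languagesFor("й"), languagesFor("a"));
```

The table stays tiny because it stores block boundaries rather than individual characters, which is where the accuracy trade-off comes from.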
If the number of languages is relatively small (or the number you care about is fairly small), you could use a Bloom filter for each language. Bloom filters let you do very cheap membership tests (which can have false positives) without having to store all of the members (in this case the code points) in memory. Then you build your result set by checking the code point against each language's preconstructed filter. It's tuneable - if you get too many false positives, you can use a larger size filter, at the cost of memory.
There are Bloom filter implementations for Python and Javascript. (Hey - I've met the guy who did this one! http://www.jasondavies.com/bloomfilter/)
Bloom filters: http://en.m.wikipedia.org/wiki/Bloom_filter
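For a feel of how little code a per-language filter needs, here is a minimal Bloom filter over integer code points. It is a sketch, not the linked library: the class name, the hash mixing constants, and the filter size are arbitrary choices made for illustration.

```javascript
// Minimal Bloom filter over integer code points (a sketch, not a tuned library).
class CodePointBloom {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(bitCount / 8));
  }

  // Derive hashCount bit positions by double hashing two integer mixes.
  *positions(cp) {
    const h1 = Math.imul(cp ^ (cp >>> 15), 0x85ebca6b) >>> 0;
    const h2 = (Math.imul(cp ^ (cp >>> 13), 0xc2b2ae35) | 1) >>> 0; // odd step
    for (let i = 0; i < this.hashCount; i++) {
      yield (h1 + i * h2) % this.bitCount;
    }
  }

  add(cp) {
    for (const p of this.positions(cp)) this.bits[p >> 3] |= 1 << (p & 7);
  }

  // false means "definitely not added"; true means "probably added".
  mightContain(cp) {
    for (const p of this.positions(cp)) {
      if ((this.bits[p >> 3] & (1 << (p & 7))) === 0) return false;
    }
    return true;
  }
}

const russian = new CodePointBloom(4096, 3);
for (let cp = 0x0410; cp <= 0x044f; cp++) russian.add(cp); // А..я
console.log(russian.mightContain("й".codePointAt(0))); // true (no false negatives)
console.log(russian.mightContain("a".codePointAt(0))); // usually false; false positives can occur
```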
Doing a bit more reading: if you only need the BMP (65,536 code points), you could just store a straight bit set for each language, or a 2D bit array of language × code point.
How many languages do you want to consider?
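The straight bit set variant is even simpler and has no false positives. In the sketch below each language costs 8 KiB; the class name and the ranges loaded into each set are illustrative only.

```javascript
const BMP_SIZE = 0x10000; // 65,536 code points -> 8 KiB per language

class BmpBitSet {
  constructor() {
    this.bits = new Uint8Array(BMP_SIZE / 8);
  }

  // Mark every code point in [first, last] as belonging to the language.
  addRange(first, last) {
    for (let cp = first; cp <= last; cp++) this.bits[cp >> 3] |= 1 << (cp & 7);
    return this;
  }

  has(cp) {
    return cp < BMP_SIZE && (this.bits[cp >> 3] & (1 << (cp & 7))) !== 0;
  }
}

// Illustrative coverage only; real sets would come from a character database.
const LANGUAGE_SETS = {
  ru: new BmpBitSet().addRange(0x0410, 0x044f),
  ja: new BmpBitSet().addRange(0x3040, 0x30ff).addRange(0x4e00, 0x9fff),
  zh: new BmpBitSet().addRange(0x4e00, 0x9fff),
};

function languagesFor(char) {
  const cp = char.codePointAt(0);
  return Object.keys(LANGUAGE_SETS).filter((lang) => LANGUAGE_SETS[lang].has(cp));
}

console.log(languagesFor("字"), languagesFor("й"));
```

With a hundred languages this is still under 1 MiB, so the answer to the question above largely decides whether the Bloom filter is worth the extra complexity.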

How can dates and random numbers be used for evil in Javascript?

The ADsafe subset of Javascript prohibits the use of certain things that are not safe for guest code to have access to, such as eval, window, this, with, and so on.
For some reason, it also prohibits the Date object and Math.random:
Date and Math.random
Access to these sources of non-determinism is restricted in order to make it easier to determine how widgets behave.
I still don't understand how using Date or Math.random will accommodate malevolence.
Can you come up with a code example where using either Date or Math.random is necessary to do something evil?
According to a slideshow posted by Douglas Crockford:
ADsafe does not allow access to Date or random
This is to allow human evaluation of ad content with confidence that
behavior will not change in the future. This is for ad quality and
contractual compliance, not for security.
I don't think anyone would consider them evil per se. However the crucial part of that quote is:
easier to determine how widgets behave
Obviously Math.random() introduces nondeterminism, so you can never be sure how the code will behave on each run.
What is less obvious is that Date brings similar nondeterminism. If your code somehow depends on the current date, it will (again, obviously) behave differently under some conditions.
I guess it's not surprising that these two methods/objects are not pure functions; in other words, each call may return a different result irrespective of the arguments.
In general there are ways to fight this nondeterminism: storing the initial random seed to reproduce the exact same series of random numbers (not possible with JavaScript's Math.random), and supplying client code with a sort of TimeProvider abstraction rather than letting it create Dates everywhere.
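The TimeProvider idea can be sketched as plain dependency injection: the widget logic receives its clock and its random source as arguments. Every name below (pickBanner, fixedClock, scriptedRandom, the banner lists) is hypothetical and has nothing to do with ADsafe's own API.

```javascript
// Widget logic receives its sources of nondeterminism instead of reaching
// for Date and Math.random directly.
function pickBanner(banners, clock, random) {
  const hour = new Date(clock.now()).getUTCHours();
  const pool = hour < 12 ? banners.morning : banners.evening;
  return pool[Math.floor(random() * pool.length)];
}

// Production wiring: real time, real randomness.
const systemClock = { now: () => Date.now() };

// Review/test wiring: a fixed instant and a scripted "random" sequence.
function fixedClock(isoString) {
  const t = Date.parse(isoString);
  return { now: () => t };
}

function scriptedRandom(values) {
  let i = 0;
  return () => values[i++ % values.length];
}

const banners = { morning: ["coffee", "news"], evening: ["movies", "dinner"] };

console.log(pickBanner(banners, systemClock, Math.random)); // varies per run
console.log(pickBanner(banners, fixedClock("2020-01-01T09:00:00Z"), scriptedRandom([0.9]))); // "news"
```

With the fixed wiring, a reviewer sees the same behavior on every run, which is the property the restriction is after.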
According to their website, they don't include Date or Math.random to make it easier to determine how third-party code will behave. The problem here is Math.random (using Date you can make a pseudo-random number as well): they want to know how third-party code will behave, and they can't know that if the third-party code is allowed access to random numbers.
By themselves, Date and Math.random shouldn't pose security threats.
At a minimum they allow you to write loops that can not be shown to be non-terminating, but may run for a very long time.
The quote you exhibit seems to suggest that a certain amount of static analysis is being done (or is at least contemplated), and these features make it much harder. Mind you, these restrictions aren't enough to actually prevent you from writing difficult-to-statically-analyze code.
I agree with you that it's a strange limitation.
The justification that using Date or random would make it difficult to predict widget behavior is of course nonsense. For example, implement a simple counter, compute the SHA-1 of the current number, and then act depending on the result. I don't think that is any easier to predict in the long term than random or Date... short of running it forever.
The history of math has shown that trying to classify functions on how they compute their value is a path that leads nowhere... the only sensible solution is classifying them depending on the actual results (black box approach).
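The counter-plus-SHA-1 idea from this answer can be sketched in a few lines. It is written for Node.js (using its built-in crypto module for the hash) rather than for the ADsafe subset itself, and the "ad" names are placeholders.

```javascript
const { createHash } = require("crypto");

// Deterministic "widget": no Date, no Math.random, yet hard to predict by inspection.
function makeWidget() {
  let counter = 0;
  return function step() {
    counter += 1;
    const digest = createHash("sha1").update(String(counter)).digest();
    // Act on the first byte of the digest.
    return digest[0] < 128 ? "show-ad-a" : "show-ad-b";
  };
}

const widget = makeWidget();
const behaviour = Array.from({ length: 10 }, () => widget());
console.log(behaviour.join(" "));
```

Every run prints the same sequence, so the widget is fully deterministic, but a reviewer still has to execute it to learn what step N will do.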

Is Math.random() cryptographically secure?

How good are the algorithms used for JavaScript's Math.random() in different browsers? Is it okay to use it for generating salts and one-time passwords?
How many bits from one random value can I use?
Nope; JavaScript's Math.random() function is not a cryptographically secure random number generator. You are better off using the JavaScript Crypto Library's Fortuna implementation, which is a strong pseudo-random number generator (have a look at src/js/Clipperz/Crypto/PRNG.js), or the Web Crypto API's getRandomValues.
Here is a detailed explanation: How trustworthy is javascript's random implementation in various browsers?
Here is how to generate a good crypto grade random number: Secure random numbers in javascript?
It is not secure at all, and in some cases it was so predictable that you could rebuild the internal state of the PRNG, deduce the seed, and thus use it to track people across websites even if they didn't use cookies, hid behind onion routing, etc.
2022 edit since this answer still gets upvotes: use Crypto.getRandomValues if you need a cryptographic RNG in JavaScript
http://landing2.trusteer.com/sites/default/files/Temporary_User_Tracking_in_Major_Browsers.pdf a 2008 paper exposing the user tracking possibilities of the browser weak PRNG
http://dl.packetstormsecurity.net/papers/general/Google_Chrome_3.0_Beta_Math.random_vulnerability.pdf a later (2009) Chrome vulnerability, as the problem was already well known
As of March 2013, window.crypto.getRandomValues is an "experimental technology" available since Chrome 11 and Firefox 21 that lets you get cryptographically random values. Also, see getRandomValues in the latest W3C Web Cryptography API draft.
Description:
If you provide an integer-based TypedArray (i.e. Int8Array, Uint8Array, Int16Array, Uint16Array, Int32Array, or Uint32Array), the function is going to fill the array with cryptographically random numbers. The browser is supposed to be using a strong (pseudo) random number generator. The method throws the QuotaExceededError if the requested length is greater than 65536 bytes.
Example:
var array = new Uint32Array(10);
window.crypto.getRandomValues(array);
console.log("Your lucky numbers:");
for (var i = 0; i < array.length; i++) {
    console.log(array[i]);
}
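Because of the 65536-byte limit quoted in the description, larger buffers have to be filled in chunks. A minimal sketch for Node.js, which exposes the same method as crypto.webcrypto.getRandomValues (in a browser, window.crypto would take its place); the helper name fillRandom is made up.

```javascript
const { webcrypto } = require("crypto"); // in browsers, use window.crypto instead

const MAX_BYTES_PER_CALL = 65536;

// Fill a Uint8Array of any size, staying under the per-call quota.
function fillRandom(bytes) {
  for (let offset = 0; offset < bytes.length; offset += MAX_BYTES_PER_CALL) {
    // subarray creates a view, so the original buffer is filled in place.
    webcrypto.getRandomValues(bytes.subarray(offset, offset + MAX_BYTES_PER_CALL));
  }
  return bytes;
}

const big = fillRandom(new Uint8Array(200000));
console.log(big.length);
```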
Also, an answer to How random is JavaScript's Math.random? refers to Temporary user tracking in major browsers and Cross-domain information leakage and attacks from 2008 which discusses how the JavaScript Math.random() function leaks information.
Update: For current browser support status, check out the Modern.IE Web Crypto API section, which also links to the Chrome, Firefox, and Safari bug reports.
Because you cannot know the exact implementation in the browser (except for closed user groups, such as a business intranet), I would generally consider the RNG weak.
Even if you can identify the browser, you don't know whether the browser itself has been manipulated or whether it is merely reporting another browser's agent ID. If you can, you should generate the number on the server.
Even if you include a good PRNG in your JavaScript, your server cannot know whether the request from the client originates from an unmodified script. If the number goes into your database and/or is used as a cryptographic tool, it is not a good idea to trust the data from the client at all. That is true not only for validity (you do validate all data coming from the client, don't you?) but also for general properties like randomness.
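If the server happens to run Node.js (an assumption; the answer does not name a platform), generating the salt and one-time password there could look like this sketch. The function names and default sizes are illustrative.

```javascript
const { randomBytes, randomInt } = require("crypto");

// Hex-encoded salt drawn from the operating system's CSPRNG.
function newSalt(byteLength = 16) {
  return randomBytes(byteLength).toString("hex");
}

// Numeric one-time password with leading zeros preserved.
function newOneTimePassword(digits = 6) {
  const value = randomInt(10 ** digits); // uniform over [0, 10^digits)
  return String(value).padStart(digits, "0");
}

console.log(newSalt(), newOneTimePassword());
```

crypto.randomInt avoids the modulo bias that a hand-rolled "random bytes modulo 10^n" approach would introduce.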
Math.random() is not cryptographically secure. Veracode also flags its use as
CWE-331 (Insufficient Entropy)
In Java, SecureRandom can be used to implement similar functionality:
new SecureRandom().nextDouble();
