I'm trying to include a field called myId in a MongoDB document. I am using shortid. I am wondering, in the case of big data, like millions of documents in a collection:
What's the guarantee that the shortid will be always unique and never ever be repeated for any other document?
What keeps a track of the generated ids?
What are the chances of the id being repeated?
What's the guarantee that the shortid will be always unique and never ever be repeated for any other document
To cut a long story short: these shortids are pretty much just "hashed" timestamps. Not Unix timestamps, their own breed, but nonetheless not much more than timestamps.
All the bling with randomness is pretty much just that: bling.
As long as all these shortids are generated on the same computer (in a single thread) with the same seed, collisions are impossible.
What keeps a track of the generated ids?
A counter that gets incremented when you request ids too fast, so that the same timestamp is hit more than once. This counter is reset to 0 as soon as a new timestamp is reached.
There's nothing significant in there that is really random.
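A minimal sketch of that idea (not shortid's actual code; the real library shuffles an alphabet with the seed and encodes the bytes differently, but the timestamp-plus-counter skeleton is the same):

// Conceptual sketch of a timestamp + counter id (NOT shortid's real implementation).
// If two ids are requested within the same second, the counter disambiguates them;
// it resets whenever the clock ticks forward.
let lastSecond = 0;
let counter = 0;

function toyShortId() {
  const second = Math.floor(Date.now() / 1000);
  if (second === lastSecond) {
    counter += 1;
  } else {
    lastSecond = second;
    counter = 0;
  }
  // A real implementation maps these values through a (seed-shuffled) alphabet
  // instead of plain base-36.
  return second.toString(36) + "-" + counter.toString(36);
}

console.log(toyShortId(), toyShortId(), toyShortId());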
What are the chances of the id being repeated?
During normal usage, little to nonexistent.
As far as I can tell, the only two things that may lead to a collision are
changing the seed for the PRNG (this leads to a new alphabet, so that newer dates may be encoded to ids that have already been generated under a different seed; not very likely, but possible)
generating ids on multiple threads/machines because the counter is not synced.
Summary: I'd nag about pretty much everything in that code, but even as it is, it does the job reliably. And I've told you the limitations.
Shortid generates a random 64-bit id. This is done in multiple steps, but the base of it is this pseudo-random function:
function getNextValue() {
    seed = (seed * 9301 + 49297) % 233280;
    return seed / 233280.0;
}
To generate the same id twice, this function has to return the same exact values in the same exact order in the same exact second. This is very rare, but can happen if they reset the timer (based on the comments, they do, but it's still rare).
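To see concretely that it's determinism, not randomness, doing the work here, you can run the quoted generator twice with the same seed; both instances emit the exact same sequence:

// Two instances of the quoted LCG, seeded identically, stay in lockstep forever.
function makeLcg(seed) {
  return function next() {
    seed = (seed * 9301 + 49297) % 233280;
    return seed / 233280.0;
  };
}

const a = makeLcg(42);
const b = makeLcg(42);
console.log(a(), a(), a());
console.log(b(), b(), b()); // prints the exact same three values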
Related
I am planning to implement my own very simple "hashing" formula to add a layer of security to an app with multiple users. My current plan is as follows:
User creates an account at which point an ID is generated on the backend. The ID is run through a formula (let's say ID * 57 + 8926 - 36 * 7, or something equally random). I then send back to the frontend the new user ID and the new "hashed" number and store them in localStorage.
User tries to access a secured area (let's say a settings page so they can change their own settings).
I send the backend two values: their ID and the hashed number. I run the ID through the same formula to check it matches the hashed value I've received. If the check passes, they can get in. So if someone has tried, say, changing their ID in localStorage to get access to another user's settings page, the only way they could achieve that is if they guessed what the formula was. They could easily guess a user ID, but guessing that the corresponding number is the result of ID * 57 + 8926 - 36 * 7 seems pretty unlikely.
I'm doing this because it would be quicker/cheaper than a db lookup for an actual hashed value... I think? Would it make more sense to use a package to create some kind of primary key/uuid instead of "hashing" my own value and doing a db lookup each time?
Tech stack: React on FE, Python on BE, SQL db.
I see a lot of posts saying "don't roll your own" -- is this absolute?
Yes, it is. The reason is that whenever a non-cryptographer tries their hand at developing their own algorithms, they invariably fall into a multitude of pitfalls which render the security of the algorithm next to useless.
Your particular scheme, for example, can be trivially broken given two consecutive ID and "hash" pairs. (It's a simple arithmetic sequence; deriving the formula of an arithmetic sequence given two consecutive values is grade-6 level math.)
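To illustrate (the pairs below are hypothetical, using the formula from the question, which simplifies to ID * 57 + 8674):

// An attacker who sees two (ID, "hash") pairs can recover the whole formula.
const pairs = [
  { id: 100, hash: 100 * 57 + 8926 - 36 * 7 }, // 14374
  { id: 101, hash: 101 * 57 + 8926 - 36 * 7 }, // 14431
];

const a = (pairs[1].hash - pairs[0].hash) / (pairs[1].id - pairs[0].id); // 57
const b = pairs[0].hash - a * pairs[0].id;                               // 8674

// Now any user's "hash" can be forged:
console.log(a * 9999 + b); // a valid "hash" for user id 9999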
I'm doing this because it would be quicker/cheaper than a db lookup for an actual hashed value...
The performance difference would probably be negligible. Don't worry about it.
If the information is not particularly sensitive, just assign each user a randomly generated 128 bit number. The chances of someone guessing a valid user's number are practically zero.
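For instance (the question's backend is Python, but the idea is the same in any language; this Node sketch uses the standard crypto module):

// A 128-bit cryptographically random identifier, rendered as 32 hex characters.
// Guessing another user's token is not feasible in practice.
const crypto = require("crypto");

function newUserToken() {
  return crypto.randomBytes(16).toString("hex");
}

console.log(newUserToken());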
Two properties of real hashes that you are missing with this are:
a simple change in the input causes a large change in the output
all hashes have the same length
This could be a problem if a user somehow knows their own id and hash. With your self-made hash, I could easily find the hash of any other user by reverse engineering the formula.
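For contrast, a real hash shows both properties immediately; a small sketch with Node's built-in crypto module (SHA-256 chosen here just as a familiar example):

// A one-character change in the input produces a completely different digest,
// and every digest has the same length.
const crypto = require("crypto");

const sha256 = (s) => crypto.createHash("sha256").update(s).digest("hex");

console.log(sha256("user-1000")); // 64 hex chars
console.log(sha256("user-1001")); // 64 hex chars, nothing like the previous one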
I have a small website; when you visit it, it shows you a quote.
Until today, what I was doing was this: when a user came to my website, a random quote was fetched directly from the database (by "directly" I mean a connection was made to the database, which returned a quote), but sometimes it took 1 or 2 seconds. Today I did something different: when my Node.js application starts, I grab every quote from the database and store them in an array. When someone comes to my website, I randomly choose a quote from the array, and it is so much faster than the first approach. I also made changes so that when I add a new quote to the database, the array is automatically updated.
So here is my question: is it bad to store data inside an array and serve users from it?
There will be a few different answers depending on your intentions. First of all, if the set of quotes is large, I assure you it is a very bad idea, but if you are talking about a few items, it's acceptable. However, if you are building a scalable application, it's not recommended, because you would keep a copy of the dataset on every node, etc.
If you want very fast quote storage, I would recommend Redis (an in-memory key-value store). It shares state across your nodes, which means all your nodes connect to Redis and the quotes are kept there, so you don't need to keep copies and it stays fast. Also, if you activate the disk persistence option, you can use Redis as your primary quote storage. After all, you won't update these quotes much and they won't be searched with complex queries.
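A rough sketch of that setup, assuming the node-redis v4 client (npm package "redis") and a Redis server on the default port; the quotes live in a Redis set and SRANDMEMBER picks one at random:

const { createClient } = require("redis");

const client = createClient(); // defaults to localhost:6379

async function main() {
  await client.connect();
  // Seed the set once; SADD ignores duplicates on later runs.
  await client.sAdd("quotes", ["Quote one.", "Quote two.", "Quote three."]);
  // One random quote per request, chosen by Redis itself.
  const quote = await client.sRandMember("quotes");
  console.log(quote);
}

main().catch(console.error);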
Alternatively, if your database is MySQL, PostgreSQL or MongoDB, you can activate an in-memory storage option so that you don't need to keep the data in your array but can take it directly from the db, which is much faster and still queryable.
There's the old joke: The two hard things in software engineering are naming things, caching things, and off-by-one errors.
You're caching something: your array of strings. Then you select one at random from the array each time you need one.
What is right? You get your text string from memory, and eliminate the time-delay involved in getting it from the database. It's a good optimization.
What can go wrong?
Somebody can add or remove strings from your database, which makes your cache stale.
You can have so many text strings you blow out your nodejs RAM. This seems unlikely; it's hard to imagine a list of quotes that big. The Hebrew Bible, the New Testament, and the Qur'an together comprise less than a million words. You probably won't have more text in your quotable-quotes than that. 10-20 megabytes of RAM is nothing these days.
So, what about your stale cache in RAM? What to do?
You could ignore the problem. Who cares if the cache is stale?
You could reread the cache every so often (a sketch follows below).
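A minimal sketch of that second option (loadQuotesFromDb is a stand-in for whatever query the app already runs at startup):

// Refresh the in-memory quote cache on an interval so it is never stale for
// more than a few minutes.
let quotes = [];

async function loadQuotesFromDb() {
  // Stand-in for the real DB query; replace with the actual call.
  return ["Quote one.", "Quote two.", "Quote three."];
}

async function refreshQuotes() {
  try {
    quotes = await loadQuotesFromDb();
  } catch (err) {
    console.error("quote cache refresh failed, keeping the old cache", err);
  }
}

refreshQuotes();                           // warm the cache at startup
setInterval(refreshQuotes, 5 * 60 * 1000); // and reread every 5 minutes

function randomQuote() {
  return quotes[Math.floor(Math.random() * quotes.length)];
}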
Your use of RAM for this is a good optimization. But, it adds a cache to your application. A cache adds complexity, and the potential for a bug. Is the optimization worth the trouble? Only you can guess the answer to that question.
And, it's MUCH MUCH better than doing SELECT ... ORDER BY RAND() LIMIT 1; every time you need something random. That is a notorious query-performance antipattern.
I'm using a Dictionary (associative array, hash table, any of these synonyms).
The keys used to uniquely identify values are fairly long strings. However, I know that these strings tend to differ at the tail, rather than the head.
The fastest way to find a value in a JS object is to test the existence of
object[key], but is that also the case for extremely long, largely similar keys (100+ chars), in a fairly large Dictionary (1000+ entries)?
Are there alternatives for this case, or is this a completely moot question, because accessing values by key is already insanely fast?
Long story short: it doesn't matter much. JS will internally use a hash table (as you already said yourself), so it will need to calculate a hash of your keys for insertion and (in some cases) for accessing elements.
Calculating a hash (for most reasonable hash functions) will take slightly longer for long keys than for short keys (I would guess about linearly longer), but it doesn't matter whether the changes are at the tail or at the head.
You could decide to roll your own hashes instead, cache these somehow, and use these as keys, but this would leave it up to you to deal with hash collisions. It will be very hard to do better than the default implementation, and is almost certainly not worth the trouble.
Moreover, for an associative array with only 1000 elements, probably none of this matters. Modern CPUs can process billions of instructions per second. Even just a linear search through the whole array will likely perform just fine, unless you have to do it very, very often.
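A quick back-of-the-envelope check (numbers vary by machine and engine, but the order of magnitude is the point): 1000 entries with ~120-character keys that share a long head and differ only at the tail, then a million lookups.

const table = {};
const keys = [];
const head = "x".repeat(110); // long, shared prefix
for (let i = 0; i < 1000; i++) {
  const key = head + String(i).padStart(10, "0"); // differs only at the tail
  keys.push(key);
  table[key] = i;
}

console.time("1,000,000 lookups");
let sum = 0;
for (let i = 0; i < 1000000; i++) {
  sum += table[keys[i % keys.length]];
}
console.timeEnd("1,000,000 lookups"); // typically a handful of milliseconds
console.log(sum);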
Hash tables (dictionary, map, etc.) first check the hash code, and only then, if necessary (in case of a collision, i.e. at least two keys having the same hash code), perform an equals comparison. If you experience performance problems, the first thing to check, IMHO, is hash code collisions. It may turn out (due to a bad implementation or weird keys) that the hash code is computed on, say, the first 3 chars (a wild exaggeration, of course):
"abc123".hashCode() ==
"abc456".hashCode() ==
...
"abc789".hashCode()
and so you have a lot of collisions, have to perform equals comparisons, and end up with a slow O(N) routine. In that case, you have to think up a better hash.
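As an exaggerated illustration, a hash that only looks at the first three characters collapses all of the keys above into one bucket:

// Deliberately bad hash: only the first three characters matter, so
// "abc123", "abc456" and "abc789" all collide and the lookup degrades
// to a linear scan over the colliding bucket.
function badHash(key) {
  let h = 0;
  for (const ch of key.slice(0, 3)) {
    h = (h * 31 + ch.charCodeAt(0)) | 0;
  }
  return h;
}

console.log(badHash("abc123") === badHash("abc456")); // true
console.log(badHash("abc123") === badHash("abc789")); // true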
It seems Django, or the SQLite database, is storing datetimes with microsecond precision. However, when passing a time to JavaScript, the Date object only supports milliseconds:
var stringFromDjango = "2015-08-01 01:24:58.520124+10:00";
var time = new Date(stringFromDjango);
$('#time_field').val(time.toISOString()); //"2015-07-31T15:24:58.520Z"
Note the 58.520124 is now 58.520.
This becomes an issue when, for example, I want to create a queryset for all objects with a datetime less than or equal to the time of an existing object (i.e. MyModel.objects.filter(time__lte=javascript_datetime)). By truncating the microseconds the object no longer appears in the list as the time is not equal.
How can I work around this? Is there a datetime javascript object that supports microsecond accuracy? Can I truncate times in the database to milliseconds (I'm pretty much using auto_now_add everywhere) or ask that the query be performed with reduced accuracy?
How can I work around this?
TL;DR: Store less precision, either by:
Coaxing your DB platform to store only milliseconds and discard any additional precision (difficult on SQLite, I think)
Only ever inserting values with the precision you want (difficult to ensure you've covered all cases)
Is there a datetime javascript object that supports microsecond accuracy?
If you encode your dates as Strings or Numbers you can add however much accuracy you'd like. There are other options (some discussed in this thread). Unless you actually want this accuracy though, it's probably not the best approach.
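One way to do that is to keep the original string from Django alongside the (truncated) Date, and pull the microseconds out as a plain number when you need them; a sketch:

// The Date is only used for display/arithmetic; the raw string keeps the full
// microsecond precision for anything sent back to the server.
const stringFromDjango = "2015-08-01 01:24:58.520124+10:00";

const wrapped = {
  raw: stringFromDjango,                                        // full precision
  date: new Date(stringFromDjango),                             // ms precision
  microseconds: Number(stringFromDjango.match(/\.(\d{6})/)[1]), // 520124
};

console.log(wrapped.date.toISOString()); // "2015-07-31T15:24:58.520Z"
console.log(wrapped.microseconds);       // 520124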
Can I truncate times in the database to milliseconds..
Yes, but because you're on SQLite it's a bit weird. SQLite doesn't really have dates; you're actually storing the values in either a text, real, or integer field. These underlying storage classes dictate the precision and range of the values you can store. There's a decent write-up of the differences here.
You could, for example, change your underlying storage class to integer. This would truncate dates stored in that field to a precision of 1 second. When performing your queries from JS, you could likewise truncate your dates using the Date.prototype.setMilliseconds() function. E.g.:
MyModel.objects.filter(time__lte = javascript_datetime.setMilliseconds(0))
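On the JS side the truncation would look something like this (note that setMilliseconds() mutates the Date in place and returns a timestamp, so build the string afterwards):

// Truncate a Date to whole seconds before sending it to the backend, so both
// sides compare values of the same (reduced) precision.
const when = new Date("2015-08-01 01:24:58.520124+10:00");
when.setMilliseconds(0);          // mutates in place
const param = when.toISOString(); // "2015-07-31T15:24:58.000Z"

// Send `param` as the value the Django view plugs into
// MyModel.objects.filter(time__lte=...).
console.log(param);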
A more feature-complete DB platform would handle it better. For example, in PostgreSQL you can specify the stored precision exactly. This will add a timestamp column with precision down to milliseconds (matching that of JavaScript):
alter table "my_table" add "my_timestamp" timestamp (3) with time zone
MySQL will let you do the same thing.
.. or ask that the query be performed with reduced accuracy?
Yeah, but this is usually the wrong approach.
If the criterion you're filtering by is too precise, then you're OK; you can truncate the value and then filter (like in the setMilliseconds() example above). But if the values in the DB you're checking against are too precise, you're going to have a Bad Time.
You could write a query such that the stored values are formatted or truncated to reduce their precision before being compared to your criteria but that operation is going to need to be performed for every value stored. This could be millions of values. What's more, because you're generating the values dynamically, you've just circumvented any indexes created against the stored values.
I need to create completely random numbers using a seed in JavaScript. I'm not using the built-in Math.random(), but rather something else that can take a seed and generate a random number based on it.
This solution is supposed to serve a situation in which two users log in to a website at the same time (it happens A LOT, and I'm getting a lot of users with identical IDs because of it). Math.random() isn't working for me, and I can't use timestamps because they also don't provide an accurate enough number (they're not sampled every millisecond). I also don't want to use any ajax call to get an IP or something similar.
Is there anything anyone can think of that's either unique, or might be rare enough to use to create a good seed?
EDIT: This is NOT a duplicate of the GUID generation question, because that one still uses Math.random(). I can't use that function anywhere in my code. I sometimes have thousands of hits at more or less the exact same moment, and that's what screws up the randomness. It's also the reason why I need to find some attribute I can use as a seed.