I have some Event objects that have a lifespan attribute. I have varying types of events, each with different lifespans and potentially different units (e.g. 4-10 hours, 5-8 weeks, 1 month - 2 years). I want to store these ranges as a uniform and comparable datatype but I'm not sure what my best option is. Ideally I want to be able to go through a list of Events and find all that can last for 3 hours, 1 week, 2 months, etc.
The problem you are having is comparing values that are the same "substance" (i.e., time) yet have different units. This would be a great candidate for a value object. Value objects are not a common pattern in JavaScript, but they are no less applicable and are a great place for domain logic. This article is a great look at using value objects with JavaScript.
Create a "class" by defining a JavaScript function.
In it, store a value for lifespan in the smallest units that you use (hours).
Create several internal methods that get and set this value based on the unit type (for example, multiplying weeks by 168 to get the number of hours). You may want to round to the nearest week/month/year when getting the value in those units.
Create methods to compare the value in hours with the value in hours of another instance.
This will allow you to compare events regardless of the unit their lifespan was originally expressed in.
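A minimal sketch of such a value object (the name Lifespan and the month factor of roughly 730 hours are illustrative assumptions, not a prescribed API):

// Minimal sketch of a lifespan value object; always stores hours internally.
function Lifespan(hours) {
  this.hours = hours; // smallest unit used: hours
}

Lifespan.fromWeeks = function (weeks) {
  return new Lifespan(weeks * 168); // 7 days * 24 hours
};

Lifespan.fromMonths = function (months) {
  return new Lifespan(months * 730); // roughly 30.4 days per month (assumption)
};

Lifespan.prototype.inWeeks = function () {
  return Math.round(this.hours / 168);
};

Lifespan.prototype.isAtLeast = function (other) {
  return this.hours >= other.hours;
};

// Hypothetical usage, assuming each Event stores its maximum lifespan as a Lifespan:
// events.filter(function (e) { return e.maxLifespan.isAtLeast(new Lifespan(3)); });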
I am new to machine learning and still trying to wrap my head around the concepts, so please bear with me if my question is not as concise as it should be.
I am building a TensorFlow.js model with LSTM layers for time-series prediction (RNN).
The dataset is pinged every few hundred milliseconds (at random intervals), but the values produced cover a very wide range: the majority of readings are around 20, 40, 45, etc., while occasionally a value will reach 75,000 at the extreme end.
So the data range is between 1 and 75,000.
When I normalise this data using a standard min/max method to produce a value between 0 and 1, the normalised value for the majority of readings ends up as a very small decimal with many significant places, e.g. '0.0038939328722009236'.
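For context, the scaling I'm applying is just the standard min/max formula, roughly like this (the min and max here are illustrative):

var min = 1;
var max = 75000;
// standard min/max scaling to the range 0..1
function normalise(x) { return (x - min) / (max - min); }
normalise(40);    // ≈ 0.00052 – typical readings end up very close to 0
normalise(75000); // 1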
So my question(s) are:
1) Is this min/max the best approach for normalising this type of data?
2) Will the RNN model work well with values that have so many significant decimal places and that level of precision?
3) Should I also be normalising the output label? (of which there will be 1 output)
Update
I have just discovered a very good resource, a short Google course that delves into preparing data for ML. One technique suggested is to 'clip' the data at the extremes. Thought I would add it here for reference: https://developers.google.com/machine-learning/data-prep
After doing a bit more research, I think I have a decent solution now.
I will be performing two steps, with the first being to use 'quantile bucketing' (or sometimes called 'binning' ref: https://developers.google.com/machine-learning/data-prep/transform/bucketing).
Effectively it involves splitting the range of values into smaller subset ranges and assigning an integer value to each one. e.g. an initial range of 1 to 1,000,000 could be broken down into ranges of 100k, so 1 to 100,000 would be range number 1, 100,001 to 200,000 would be range number 2, and so on.
Because my dataset is skewed, I modify the subset ranges so that each 'bucket' captures roughly the same number of samples, giving an even distribution across buckets. For example, the first range of the example above could be 1 to 1,000 instead of 1 to 100,000, the next bucket range would be 1,001 to 2,000, the third could be 2,001 to 10,000, and so on.
In my use case I ended up with 22 different bucket ranges. The next step is my own adaptation, since I don't want to have 22 different features (as seems to be suggested in the link). Instead, I apply the standard min/max scaling to these bucket ranges, resulting in the need for only 1 feature. This gives me the final result of normalised data between 0 and 1, which perfectly deals with my skewed dataset.
Now the lowest normalised value I get (other than 0) is 0.05556.
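A rough sketch of the two steps, with made-up bucket boundaries (in practice the boundaries and the number of buckets come from the quantiles of your own data):

// Step 1: quantile bucketing – boundaries chosen so each bucket holds
// roughly the same number of samples (these boundaries are illustrative).
const bucketUpperBounds = [1000, 2000, 10000, 75000];

function bucketIndex(value) {
  for (let i = 0; i < bucketUpperBounds.length; i++) {
    if (value <= bucketUpperBounds[i]) return i + 1; // buckets numbered from 1
  }
  return bucketUpperBounds.length;
}

// Step 2: min/max scale the bucket number to 0..1 so only one feature is needed.
function normalise(value) {
  const minBucket = 1;
  const maxBucket = bucketUpperBounds.length;
  return (bucketIndex(value) - minBucket) / (maxBucket - minBucket);
}

normalise(500);   // 0      (bucket 1)
normalise(1500);  // ≈ 0.33 (bucket 2)
normalise(60000); // 1      (last bucket)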
Hope this helps others.
I'm programming line charts with Flot Charts to display time series.
In order to reduce the number of points to display, I downsample by averaging all data points that fall within the same hour.
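Concretely, the averaging step looks roughly like this (assuming points are [timestampMs, value] pairs, as Flot uses, arriving in chronological order):

// Rough sketch of per-hour averaging of [timestampMs, value] pairs.
function downsampleByHour(points) {
  const buckets = new Map();
  for (const [ts, value] of points) {
    const hour = Math.floor(ts / 3600000) * 3600000; // truncate to the hour
    const b = buckets.get(hour) || { sum: 0, count: 0 };
    b.sum += value;
    b.count += 1;
    buckets.set(hour, b);
  }
  // one averaged point per hour
  return [...buckets.entries()].map(([hour, b]) => [hour, b.sum / b.count]);
}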
Recently, I however discovered the Largest-Triangle-Three-Buckets algorithm:
http://flot.base.is/
What are the differences between using such algorithm against using a simple function like average (per minute, per hour, per day, ...)?
To speed up queries over long periods, does it make sense to pre-calculate an SQL table on the server side by applying LTTB to each month of data, and let the client side apply another LTTB pass on the aggregated data?
1: The problem with averages, for my purposes, is that they nuke large differences between samples: my peaks and valleys were more important than what was happening between them. The point of the three-buckets algorithm is to try to preserve those inflection points (peaks/valleys) while not worrying about showing you all the times the data was similar or the same.
So, in my case, where the data was generally all the same (or close enough; it was temperature data) until sample X, at which point a small percentage change was important to show in the graph, the buckets algorithm was perfect.
Also, since the buckets algorithm is parameterized, you can change the values (how much data to keep), see which values nuke the most data while still looking visually near-identical, and decide how much data you can dispense with before your graph has had too much removed.
The naive approach would be decimation (removing X out of every N samples), but what happens if it's the outliers you care about and the algorithm nukes an outlier? So then you change your decimation so that if the difference is too great, it doesn't nuke that sample. LTTB is essentially a more sophisticated version of that concept.
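A rough sketch of that threshold-aware decimation (the parameter values and the [timestamp, value] point shape are arbitrary assumptions, not part of LTTB itself):

// Keep every Nth sample, but never drop a sample whose value differs
// from the previously kept one by more than a threshold.
function decimate(points, n, threshold) {
  const kept = [];
  let lastKeptValue = null;
  points.forEach(([ts, value], i) => {
    const isOutlier = lastKeptValue !== null && Math.abs(value - lastKeptValue) > threshold;
    if (i % n === 0 || isOutlier) {
      kept.push([ts, value]);
      lastKeptValue = value;
    }
  });
  return kept;
}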
2: It depends on how quickly you can compute it all, whether the data ever changes, and various other factors; that's up to you. From my perspective, once my data was in the past and a sample had been 'chosen' to represent the bucket's value, it won't change, so I can save it and never recalculate it again.
Since your question is a bit old, what'd you end up doing?
So let's say I have a sensor that's giving me a number, let's say the local temperature, or whatever really, every 1/100th of a second.
So in a second I've filled up an array with a hundred numbers.
What I want to do is, over time, create a statistical model, most likely a bell curve, of this streaming data so I can get the population standard deviation of this data.
Now on a computer with a lot of storage this won't be an issue, but on something small like a Raspberry Pi, or any microprocessor, storing all the numbers generated from multiple sensors over a period of months becomes very unrealistic.
When I looked at the math of getting the standard deviation, I thought of simply storing a few numbers:
The total running sum of all the numbers so far, the count of numbers, and lastly a running sum of (each number - the current mean)^2.
Using this, whenever I get a new number, I simply increment the count, add the number to the running sum, compute the new mean, add (new number - new mean)^2 to the other running sum, divide that by the count, and take the square root to get the new standard deviation.
There are a few problems with this approach, however:
It would take 476 years to overflow the running sum of incoming numbers, assuming the values are temperatures averaging 60 degrees Fahrenheit and streamed at 100 Hz.
The same level of confidence cannot be held for the sum of the (number - mean)^2 since it is a sum of squared numbers.
Most importantly, this approach is highly inaccurate since for each number a new mean is used, which completely obliterates the entire mathematical value of a standard deviation, especially a population standard deviation.
If you believe a population standard deviation is impossible to achieve, then how should I go about a sample standard deviation? Taking every nth number will still result in the same problems.
I also don't want to limit my data set to a time interval (i.e. a model built only from the last 24 hours of sensor data), since I want my statistical model to be representative of the sensor data over a long period of time, namely a year, and if I have to wait a year just to test and debug or even get a usable model, I won't be having fun.
Is there any sort of mathematical workaround to get a population, or at least a sample, standard deviation of an ever-increasing set of numbers without actually storing that set (since that would be impossible), while still being able to accurately detect when something is multiple standard deviations away?
The closest answer I've seen is wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm, but I have no idea what it is saying or whether it requires storing the set of numbers.
Thank you!
The link shows code, and it is clear that you need to store only three variables: the number of samples so far, the current mean, and the running sum of squared differences from the mean.
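For reference, a minimal sketch of that online update (Welford's algorithm) in JavaScript:

// Welford's online algorithm: only count, mean and M2 are stored.
function createAccumulator() {
  let count = 0;
  let mean = 0;
  let m2 = 0; // running sum of squared differences from the current mean

  return {
    add(x) {
      count += 1;
      const delta = x - mean;
      mean += delta / count;
      m2 += delta * (x - mean); // uses the updated mean
    },
    mean: () => mean,
    populationStdDev: () => Math.sqrt(m2 / count),
    sampleStdDev: () => Math.sqrt(m2 / (count - 1)),
  };
}

// const acc = createAccumulator();
// readings.forEach(x => acc.add(x));
// acc.populationStdDev();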
I want to store an array of objects, each containing a property event_timestamp that holds a Date object with the timestamp of the event.
I'd like to provide a from Date object and a to Date object and collect all the objects within that range.
I don't want to iterate over the whole array; I'd like something faster, for example O(log n) or anything else.
Is there something like that implemented in JavaScript? Is there a third-party library that provides O(log n) search or any other fast method?
Any information regarding the issue would be greatly appreciated.
Thanks!
Update
Thanks for your responses so far.
My scenario is that I have a dashboard that contains graphs. A user can search for, say, the results of the last 15 minutes. The server then provides rows with the relevant results, and each row contains a property called event_timestamp that holds a Date object.
Now, if after 5 minutes the user searches for the last 15 minutes again, the first 10 minutes have already been queried before, so I want to cache that data and request only the last 5 minutes from the server.
So whenever the user queries the server, I take the response and parse it using the following steps:
get event_time on first row
get event_time of last row
append to an array the following object:
{
rows: rows,
fromDate: firstRowDate,
toDate: lastRowDate
}
Now, in the previous example the user first queries for 15 minutes, and after 5 minutes queries for 15 minutes again, which means I already have 10 valuable minutes of data in the object.
The main issue here is how to iterate over the rows quickly to find the relevant ones to use for my graph.
The simple way is to iterate through all the rows and find the ones in the range fromDate to toDate, but if I'm dealing with a few million rows that's going to be a hassle for the client.
I use Google FlatBuffers to transfer the data and then build the rows myself, so I can save them in any other format.
Hope this helps explain my needs. Thanks!
With your current way of storing the data there is no faster way than to iterate over the whole array, so your complexity will be O(n) just for traversing it.
The information you need is stored inside objects which are themselves stored in the array; how else would you access it than by looking at each single object? Even in other languages with different array implementations, there wouldn't be a faster way.
The only way you could make this faster is by changing the way you store your data.
For example, you could implement a BST (binary search tree) keyed by date. As it would be sorted by date, you wouldn't have to traverse the whole tree and would need fewer operations to find the nodes in a range. I'm afraid you would have to build it yourself, though.
Update:
This was related to your initial question. Your update goes in a whole new direction.
I don't think your approach is a good idea. You won't be able to handle such a large amount of data, at least not in a performant way. Why not query the server more often, based on what the client really needs? You could even adjust the resolution, as you might not need all the points, depending on how large the chosen range is.
Nonetheless, if we can assume the array you got is already sorted by date, you can find your matching values faster with a Binary Search Algorithm.
This is how it basically works: you start in the middle of your array; if the value found is higher than the one you are looking for, you inspect the left half of the array, and if it is lower, you inspect the right half. Now you inspect the middle of your new section. You continue with this pattern until you've found your value.
[Image: visualization of the binary search algorithm where 4 is the target value.]
The average performance of a Binary Search Algorithm is O(log n), so this would definitely help.
This is just a start to give you an idea. You will need to modify it a bit, to work with your range.
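A minimal sketch of that idea, assuming rows is already sorted ascending by event_timestamp (the function names are just illustrative):

// Binary search for the first index whose event_timestamp is >= target.
function lowerBound(rows, target) {
  let lo = 0;
  let hi = rows.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (rows[mid].event_timestamp < target) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

// All rows with fromDate <= event_timestamp <= toDate, in O(log n) plus the size of the result.
function rowsInRange(rows, fromDate, toDate) {
  const start = lowerBound(rows, fromDate);
  const end = lowerBound(rows, new Date(toDate.getTime() + 1)); // inclusive upper bound
  return rows.slice(start, end);
}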
It seems Django, or the SQLite database, is storing datetimes with microsecond precision. However, when passing a time to JavaScript, the Date object only supports milliseconds:
var stringFromDjango = "2015-08-01 01:24:58.520124+10:00";
var time = new Date(stringFromDjango);
$('#time_field').val(time.toISOString()); //"2015-07-31T15:24:58.520Z"
Note the 58.520124 is now 58.520.
This becomes an issue when, for example, I want to create a queryset for all objects with a datetime less than or equal to the time of an existing object (i.e. MyModel.objects.filter(time__lte=javascript_datetime)). By truncating the microseconds the object no longer appears in the list as the time is not equal.
How can I work around this? Is there a datetime JavaScript object that supports microsecond accuracy? Can I truncate times in the database to milliseconds (I'm pretty much using auto_now_add everywhere) or ask that the query be performed with reduced accuracy?
How can I work around this?
TL;DR: Store less precision, either by:
Coaxing your DB platform to store only milliseconds and discard any additional precision (difficult on SQLite, I think)
Only ever inserting values with the precision you want (difficult to ensure you've covered all cases)
Is there a datetime JavaScript object that supports microsecond accuracy?
If you encode your dates as Strings or Numbers you can add however much accuracy you'd like. There are other options (some discussed in this thread). Unless you actually want this accuracy though, it's probably not the best approach.
Can I truncate times in the database to milliseconds..
Yes, but because you're on SQLite it's a bit weird. SQLite doesn't really have dates; you're actually storing the values in either a text, real or integer field. These underlying storage classes dictate the precision and range of the values you can store. There's a decent write up of the differences here.
You could, for example, change your underlying storage class to integer. This would truncate dates stored in that field to a precision of 1 second. When performing your queries from JS, you could likewise truncate your dates using the Date.prototype.setMilliseconds() function, e.g.:
MyModel.objects.filter(time__lte = javascript_datetime.setMilliseconds(0))
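On the JavaScript side that truncation might look like this (note that setMilliseconds() mutates the Date in place and returns the epoch value, so truncate a copy if you still need the original; the query above is pseudo-code):

var jsDatetime = new Date("2015-08-01 01:24:58.520124+10:00"); // the string from the question
var truncated = new Date(jsDatetime.getTime()); // copy so the original is untouched
truncated.setMilliseconds(0);
truncated.toISOString(); // "2015-07-31T15:24:58.000Z" – safe to compare against second-precision values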
A more feature-complete DB platform would handle it better. For example, in PostgreSQL you can specify the stored precision more exactly. This will add a timestamp column with precision down to milliseconds (matching that of JavaScript):
alter table "my_table" add "my_timestamp" timestamp (3) with time zone
MySQL will let you do the same thing.
.. or ask that the query be performed with reduced accuracy?
Yeah but this is usually the wrong approach.
If the criteria you're filtering by are too precise then you're OK; you can truncate the value and then filter (like in the setMilliseconds() example above). But if the values in the DB you're checking against are too precise, you're going to have a Bad Time.
You could write a query such that the stored values are formatted or truncated to reduce their precision before being compared to your criteria but that operation is going to need to be performed for every value stored. This could be millions of values. What's more, because you're generating the values dynamically, you've just circumvented any indexes created against the stored values.