Get the Population Standard Deviation of streaming input data - javascript

So let's say I have a sensor that's giving me a number, for example the local temperature, or whatever really, every 1/100th of a second.
So in a second I've filled up an array with a hundred numbers.
What I want to do is, over time, create a statistical model, most likely a bell curve, of this streaming data so I can get the population standard deviation of this data.
Now on a computer with a lot of storage, this won't be an issue, but on something small like a Raspberry Pi, or any microprocessor, storing all the numbers generated from multiple sensors over a period of months becomes very unrealistic.
When I looked at the math of getting the standard deviation, I thought of simply storing a few numbers:
The total running sum of all the numbers so far, the count of numbers, and lastly a running sum of (each number - the current mean)^2.
Using this, whenever I get a new number, I would simply add one to the count, add the number to the running sum, compute the new mean, add (new number - new mean)^2 to the second running sum, divide that by the count, and take the square root to get the new standard deviation.
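A rough sketch of that naive update, for illustration only (see the answer below for the numerically sound version):
// The running-sum approach described above: three stored numbers, but each
// squared term uses a different mean, which is what makes it inaccurate.
let count = 0;
let runningSum = 0;
let sumSquares = 0;

function naivePush(x) {
  count += 1;
  runningSum += x;
  const mean = runningSum / count;
  sumSquares += (x - mean) ** 2;
  return Math.sqrt(sumSquares / count); // "population" standard deviation
}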
There are a few problems with this approach, however:
Assuming the data is temperature, the average reading is around 60 degrees Fahrenheit, and the numbers stream in at 100 Hz, it would take 476 years to overflow the running sum.
The same level of confidence cannot be held for the sum of the (number - mean)^2 since it is a sum of squared numbers.
Most importantly, this approach is highly inaccurate since for each number a new mean is used, which completely obliterates the entire mathematical value of a standard deviation, especially a population standard deviation.
If you believe a population standard deviation is impossible to achieve, then how should I go about a sample standard deviation? Taking every nth number will still result in the same problems.
I also don't want to limit my data set to a time interval (i.e. a model built from only the last 24 hours of sensor data), since I want my statistical model to be representative of the sensor data over a long period of time, namely a year. And if I have to wait a year to do testing and debugging, or even to get a usable model, I won't be having fun.
Is there any sort of mathematical work around to get a population, or at least a sample standard deviation of an ever increasing set of numbers, without actually storing that set since that would be impossible, and still being able to accurately detect when something is multiple standard deviations away?
The closest answer I've seen is wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm, but I have no idea what it is saying or whether it requires storing the set of numbers.
Thank you!

The link shows code, and it makes clear that you only need to store 3 variables: the number of samples so far, the current mean, and the running sum of squared differences from the mean.
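A minimal sketch of that online (Welford-style) update in JavaScript, assuming readings arrive one at a time and nothing else is stored:
// Online (Welford) variance: only the sample count, the running mean and the
// sum of squared differences from the current mean (m2) are kept in memory.
function createRunningStats() {
  let count = 0;
  let mean = 0;
  let m2 = 0;

  return {
    push(x) {
      count += 1;
      const delta = x - mean;
      mean += delta / count;
      const delta2 = x - mean; // difference from the *updated* mean
      m2 += delta * delta2;
    },
    getMean() {
      return mean;
    },
    populationStdDev() {
      return count > 0 ? Math.sqrt(m2 / count) : NaN;
    },
    sampleStdDev() {
      return count > 1 ? Math.sqrt(m2 / (count - 1)) : NaN;
    },
  };
}

// Usage: feed each sensor reading as it arrives, 100 times a second.
const stats = createRunningStats();
[59.8, 60.1, 60.4, 59.9].forEach((t) => stats.push(t));
console.log(stats.populationStdDev());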

Related

Imitate high CPU and memory usage in NodeJS

I want to check the autoscaling behavior of our NestJS/NodeJS application inside a cluster. I want to generate CPU and/or memory usage >70% based on requests count per second.
I have tried to accumulate multiplications of random numbers on every request for 1 second, but it seems like requests are processed one by one and never generate much load.
What would be your suggestion?
If you are trying to basically drain the computer using nodejs, the following works for me:
Generate way too many random numbers and multiply them with a VERY LARGE random number
Put all of the numbers above in an array and sort it
Parse the list into integers
Filter the list to have only primes remaining
That should be enough to take up memory (as the lists aren't recycled by the GC so quickly), and the sorting and prime checks should lag out the process. If it's too much, just scale down the array size. :)
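A rough sketch of those steps (the array size and the multiplier are arbitrary knobs; tune them until CPU and memory cross your autoscaling threshold):
// Builds a big array of scaled random numbers, sorts it, floors it to
// integers and filters primes by trial division. Sizes here are arbitrary.
function isPrime(n) {
  if (n < 2) return false;
  for (let i = 2; i * i <= n; i++) {
    if (n % i === 0) return false;
  }
  return true;
}

function burn(arraySize) {
  const big = Math.random() * 1e7; // raise this for a much heavier prime check
  const numbers = Array.from({ length: arraySize }, () => Math.random() * big);
  numbers.sort((a, b) => a - b);        // CPU-bound sort
  const ints = numbers.map(Math.floor); // extra allocations for the GC to chase
  return ints.filter(isPrime).length;   // trial division drags the event loop
}

// Call burn() from each request handler; run several worker processes if you
// need to saturate every core, since one Node process uses a single thread.
console.log(burn(2e6));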

What would be the best approach to normalise data for an LSTM model (using Tensorflow) with this wide range of values?

I am new to machine learning so still trying to wrap my head around concepts, please bear this in mind if my question may not be as concise as needed.
I am building a Tensorflow JS model with LSTM layers for time-series prediction (RNN).
The dataset used is pinged every few hundred milliseconds (at random intervals). However, the data produced spans a very wide range: the majority of values received will be around 20, 40, 45, etc., but sometimes a value will reach 75,000 at the extreme end.
So the data range is between 1 to 75,000.
When I normalise this data using a standard min/max method to produce a value between 0 and 1, the normalised value for the majority of requests ends up as a very small decimal with many significant places, e.g. '0.0038939328722009236'.
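For illustration, plain min/max scaling over the full 1 to 75,000 range squeezes the typical values into a tiny sliver near zero:
// Standard min/max scaling over the observed range.
const min = 1;
const max = 75000;
const scale = (v) => (v - min) / (max - min);

console.log(scale(40));    // ~0.00052 - typical values land near zero
console.log(scale(75000)); // 1 - only the rare extremes use the upper range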
So my question(s) are:
1) Is this min/max the best approach for normalising this type of data?
2) Will the RNN model work well with so many significant decimal places and precision?
3) Should I also be normalising the output label? (of which there will be 1 output)
Update
I have just discovered a very good resource, a Google course that delves into preparing data for ML. One technique suggested is to 'clip' the data at the extremes. Thought I would add it here for reference: https://developers.google.com/machine-learning/data-prep
After doing a bit more research, I think I have a decent solution now.
I will be performing two steps, the first being 'quantile bucketing' (sometimes called 'binning'; ref: https://developers.google.com/machine-learning/data-prep/transform/bucketing).
Effectively it involves splitting the range of values into smaller subset ranges and assigning an integer value to each one. e.g. An initial range of 1 to 1,000,000 could be broken down into ranges of 100k, so 1 to 100,000 would be range number 1, and 100,001 to 200,000 would be range number 2.
In order to have an even distribution of samples within each bucket, given the skewed dataset I have, I modify the subset ranges so they capture roughly the same number of samples in each 'bucket'. For example, the first range of the example above could be 1 to 1,000 instead of 1 to 100,000. The next bucket range would be 1,001 to 2,000, the third could be 2,001 to 10,000, and so on.
In my use case I ended up with 22 different bucket ranges. The next step is my own adaptation, since I don't want to have 22 different features (as seems to be suggested in the link). Instead, I apply the standard min/max scaling to these bucket ranges, resulting in the need for only 1 feature. This gives me the final result of normalised data between 0 and 1, which perfectly deals with my skewed dataset.
Now the lowest normalised value I get (other than 0) is 0.05556.
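A rough sketch of the two-step idea in code (the bucket boundaries and count below are made up; the real 22 ranges and the exact scaling used above differ):
// Quantile bucketing followed by min/max scaling over the bucket indices.
// Replace these hypothetical upper bounds with boundaries derived from your
// own data so each bucket holds roughly the same number of samples.
const bucketUpperBounds = [1000, 2000, 10000, 75000];

function bucketIndex(value) {
  for (let i = 0; i < bucketUpperBounds.length; i++) {
    if (value <= bucketUpperBounds[i]) return i;
  }
  return bucketUpperBounds.length - 1; // anything above the last bound
}

function normalise(value) {
  // One feature in the 0..1 range instead of one feature per bucket.
  return bucketIndex(value) / (bucketUpperBounds.length - 1);
}

console.log(normalise(40));    // first bucket  -> 0
console.log(normalise(5000));  // third bucket  -> ~0.667
console.log(normalise(75000)); // last bucket   -> 1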
Hope this helps others.

Downsampling Time Series: Average vs Largest-Triangle-Three-Buckets

I'm programming line charts with Flot Charts to display time series.
In order to reduce the number of points to display, I downsample by averaging all the data points that fall within the same hour.
Recently, I however discovered the Largest-Triangle-Three-Buckets algorithm:
http://flot.base.is/
What are the differences between using such an algorithm and using a simple average (per minute, per hour, per day, ...)?
To speed up queries over long periods, does it make sense to pre-calculate an SQL table on the server side by applying LTTB to each month of data, and let the client side apply another LTTB pass on the aggregated data?
1: The problem with averages, for my purposes, is that they nuke large differences between samples: my peaks and valleys were more important than what was happening between them. The point of the three-buckets algorithm is to try to preserve those inflection points (peaks/valleys) while not worrying about showing you all the times the data was similar or the same.
So, in my case, where the data was generally all the same (or close enough; it was temperature data) until sample X, at which point a small % change was important to show in the graph, the buckets algorithm was perfect.
Also, since the buckets algorithm is parameterized, you can change the values (how much data to keep) and see which values nuke the most data while looking visually nearly identical, and decide how much data you can dispense with before your graph has had too much removed.
The naive approach would be decimation (removing X out of N samples), but what happens if it's the outliers you care about and the algorithm nukes an outlier? So then you change your decimation so that if the difference is too great, it doesn't nuke that sample. LTTB is a more sophisticated version of that concept.
2: It depends on how quickly you can compute it all, whether the data ever changes, and various other factors. That's up to you. From my perspective, once my data was in the past and a sample was 'chosen' to represent a bucket's value, it won't change, so I can save it and never recalculate it again.
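For reference, here is a compact sketch of the LTTB idea in JavaScript (not the library's exact code): data is an array of [x, y] pairs and threshold is how many points to keep.
// Largest-Triangle-Three-Buckets: keep the first and last points and, for each
// bucket in between, the point forming the largest triangle with the previously
// kept point and the average of the next bucket.
function lttb(data, threshold) {
  if (threshold >= data.length || threshold < 3) return data.slice();

  const bucketSize = (data.length - 2) / (threshold - 2);
  const sampled = [data[0]];
  let a = 0; // index of the most recently kept point

  for (let i = 0; i < threshold - 2; i++) {
    const rangeStart = Math.floor(i * bucketSize) + 1;
    const rangeEnd = Math.min(Math.floor((i + 1) * bucketSize) + 1, data.length - 1);
    const nextStart = Math.min(Math.floor((i + 1) * bucketSize) + 1, data.length - 1);
    const nextEnd = Math.min(Math.floor((i + 2) * bucketSize) + 1, data.length);

    // Average of the next bucket, used as the third corner of the triangle.
    let avgX = 0, avgY = 0;
    for (let j = nextStart; j < nextEnd; j++) { avgX += data[j][0]; avgY += data[j][1]; }
    avgX /= nextEnd - nextStart;
    avgY /= nextEnd - nextStart;

    // Pick the point in the current bucket with the largest triangle area.
    let maxArea = -1;
    let maxIndex = rangeStart;
    for (let j = rangeStart; j < rangeEnd; j++) {
      const area = Math.abs(
        (data[a][0] - avgX) * (data[j][1] - data[a][1]) -
        (data[a][0] - data[j][0]) * (avgY - data[a][1])
      ) / 2;
      if (area > maxArea) { maxArea = area; maxIndex = j; }
    }
    sampled.push(data[maxIndex]);
    a = maxIndex;
  }

  sampled.push(data[data.length - 1]);
  return sampled;
}

// Usage: reduce a long series to 500 points before handing it to Flot, e.g.
// const downsampled = lttb(points, 500);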
Since your question is a bit old, what'd you end up doing?

Keeping microsecond precision when passing datetime between django and javascript

It seems Django, or the SQLite database, stores datetimes with microsecond precision. However, when passing a time to JavaScript, the Date object only supports milliseconds:
var stringFromDjango = "2015-08-01 01:24:58.520124+10:00";
var time = new Date(stringFromDjango);
$('#time_field').val(time.toISOString()); //"2015-07-31T15:24:58.520Z"
Note the 58.520124 is now 58.520.
This becomes an issue when, for example, I want to create a queryset for all objects with a datetime less than or equal to the time of an existing object (i.e. MyModel.objects.filter(time__lte=javascript_datetime)). By truncating the microseconds the object no longer appears in the list as the time is not equal.
How can I work around this? Is there a datetime javascript object that supports microsecond accuracy? Can I truncate times in the database to milliseconds (I'm pretty much using auto_now_add everywhere) or ask that the query be performed with reduced accuracy?
How can I work around this?
TL;DR: Store less precision, either by:
Coaxing your DB platform to store only milliseconds and discard any additional precision (difficult on SQLite, I think)
Only ever inserting values with the precision you want (difficult to ensure you've covered all cases)
Is there a datetime javascript object that supports microsecond accuracy?
If you encode your dates as Strings or Numbers you can add however much accuracy you'd like. There are other options (some discussed in this thread). Unless you actually want this accuracy though, it's probably not the best approach.
Can I truncate times in the database to milliseconds..
Yes, but because you're on SQLite it's a bit weird. SQLite doesn't really have dates; you're actually storing the values in either a text, real or integer field. These underlying storage classes dictate the precision and range of the values you can store. There's a decent write up of the differences here.
You could, for example, change your underlying storage class to integer. This would truncate dates stored in that field to a precision of 1 second. When performing your queries from JS, you could likewise truncate your dates using the Date.prototype.setMilliseconds() function. For example:
MyModel.objects.filter(time__lte = javascript_datetime.setMilliseconds(0))
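On the JavaScript side, that truncation might look something like this (names are illustrative; the ORM filter above still runs in your Django view):
// Truncate a Date to whole seconds before sending it to the server, so the
// comparison precision matches an integer-backed timestamp column.
const when = new Date("2015-08-01T01:24:58.520Z");
when.setMilliseconds(0);               // mutates the Date in place
const isoSeconds = when.toISOString(); // "2015-08-01T01:24:58.000Z"

// Send isoSeconds as the query parameter; the Django view parses it and runs
// the time__lte filter against values stored at second precision.
console.log(isoSeconds);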
A more feature-complete DB platform would handle it better. For example, in PostgreSQL you can specify the stored precision exactly. This will add a timestamp column with precision down to milliseconds (matching that of JavaScript):
alter table "my_table" add "my_timestamp" timestamp (3) with time zone
MySQL will let you do the same thing.
.. or ask that the query be performed with reduced accuracy?
Yeah but this is usually the wrong approach.
If the criteria you're filtering by is too precise, then you're OK; you can truncate the value and then filter (like in the ..setMilliseconds() example above). But if the values in the DB you're checking against are too precise, you're going to have a Bad Time.
You could write a query such that the stored values are formatted or truncated to reduce their precision before being compared to your criteria, but that operation is going to need to be performed for every value stored. This could be millions of values. What's more, because you're generating the values dynamically, you've just circumvented any indexes created against the stored values.

Representing Ranges of Time with Different Units in a Comparable Way

I have some Event objects that have a lifespan attribute. I have varying types of events, each with different lifespans and potentially different units (e.g. 4-10 hours, 5-8 weeks, 1 month - 2 years). I want to store these ranges as a uniform and comparable datatype but I'm not sure what my best option is. Ideally I want to be able to go through a list of Events and find all that can last for 3 hours, 1 week, 2 months, etc.
The problem you are having is comparing values that are the same "substance" (i.e., time) yet have different units. This would be a great candidate for a value object. Value objects are not a common pattern in JavaScript, but they are no less applicable and are a great place for domain logic. This article is a great look at using value objects with JavaScript.
Create a "class" by defining a JavaScript function.
In it, store a value for lifespan in the smallest units that you use (hours).
Create several internal methods that will get and set this value based on the unit type (for example, multiplying weeks by 168 to get the number of hours). You may want to round to the nearest week/month/year when getting in those units.
Create methods to compare the value in hours with the value in hours of another instance.
This lets you compare events without having to store the unit of time their lifespan was originally expressed in.
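A minimal sketch of such a value object in JavaScript (the class, the method names and the month/year conversion factors are illustrative approximations):
// Value object storing a lifespan range in the smallest unit used (hours).
const HOURS_PER = { hour: 1, day: 24, week: 168, month: 730, year: 8766 };

function Lifespan(minAmount, minUnit, maxAmount, maxUnit) {
  this.minHours = minAmount * HOURS_PER[minUnit];
  this.maxHours = maxAmount * HOURS_PER[maxUnit];
}

// Getter in an arbitrary unit, rounded as suggested above.
Lifespan.prototype.maxIn = function (unit) {
  return Math.round(this.maxHours / HOURS_PER[unit]);
};

// Comparison: can this event last at least the given duration?
Lifespan.prototype.canLast = function (amount, unit) {
  return this.maxHours >= amount * HOURS_PER[unit];
};

// Usage: find all events that can last for at least 2 months.
const events = [
  new Lifespan(4, "hour", 10, "hour"),   // 4-10 hours
  new Lifespan(5, "week", 8, "week"),    // 5-8 weeks
  new Lifespan(1, "month", 2, "year"),   // 1 month - 2 years
];
console.log(events.filter((e) => e.canLast(2, "month")).length); // 1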
