I'm building line charts with Flot Charts to display time series.
To reduce the number of points displayed, I downsample by averaging all the data points that fall within the same hour.
Recently, however, I discovered the Largest-Triangle-Three-Buckets (LTTB) algorithm:
http://flot.base.is/
What are the differences between using such an algorithm and using a simple function like an average (per minute, per hour, per day, ...)?
To speed up queries over long periods, does it make sense to precalculate an SQL table on the server side by applying LTTB to each month of data, and then let the client side apply another LTTB pass to the aggregated data?
1: The problem with averages, for my purposes, is that they nuke large differences between samples. My peaks and valleys were more important than what was happening between them. The point of the three-buckets algorithm is to preserve those turning points (peaks/valleys) while not worrying about showing you all the times the data was similar or the same.
So, in my case, where the data (temperature data) was generally all the same or close enough until sample X, at which point a small % change was important to show in the graph, the buckets algorithm was perfect.
Also, since the buckets algorithm is parameterized, you can change how much data to keep, see which settings nuke the most data while still looking visually nearly identical, and decide how much data you can dispense with before your graph has lost too much.
The naive approach would be decimation (removing X out of N samples), but what happens if it's the outliers you care about and the algorithm nukes an outlier? So then you change your decimation so that it doesn't nuke a sample if the difference is too great. The buckets algorithm is kind of a more sophisticated version of that concept.
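To make the comparison with averaging and decimation concrete, here is a minimal Python sketch of Largest-Triangle-Three-Buckets. It is an illustration rather than the Flot plugin's code; the function name `lttb` and the `(x, y)` tuple layout are assumptions made for this example.

```python
def lttb(points, threshold):
    """Downsample a list of (x, y) tuples to `threshold` points (LTTB sketch)."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)  # nothing to downsample

    every = (n - 2) / (threshold - 2)  # bucket width, first/last points excluded
    sampled = [points[0]]              # the first point is always kept
    a = 0                              # index of the most recently kept point

    for i in range(threshold - 2):
        # The average of the NEXT bucket is the third corner of the triangle.
        nxt_start = int((i + 1) * every) + 1
        nxt_end = min(int((i + 2) * every) + 1, n)
        count = nxt_end - nxt_start
        avg_x = sum(p[0] for p in points[nxt_start:nxt_end]) / count
        avg_y = sum(p[1] for p in points[nxt_start:nxt_end]) / count

        # Candidates come from the current bucket.
        cur_start = int(i * every) + 1
        cur_end = int((i + 1) * every) + 1

        ax, ay = points[a]
        best_area, best_idx = -1.0, cur_start
        for j in range(cur_start, cur_end):
            px, py = points[j]
            # Area of the triangle (kept point, candidate, next-bucket average).
            area = abs((ax - avg_x) * (py - ay) - (ax - px) * (avg_y - ay)) / 2
            if area > best_area:
                best_area, best_idx = area, j

        sampled.append(points[best_idx])
        a = best_idx

    sampled.append(points[-1])  # the last point is always kept
    return sampled
```

On flat data with a single spike, the spike forms the largest triangle in its bucket and so survives, whereas an hourly average would flatten it.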
2: That depends on how quickly you can compute it all, whether the data ever changes, and various other factors. That's up to you. From my perspective, once my data was in the past and a sample had been 'chosen' to represent the bucket's value, it wouldn't change, so I could save it and never recalculate it again.
Since your question is a bit old, what'd you end up doing?
Related
I am new to machine learning so still trying to wrap my head around concepts, please bear this in mind if my question may not be as concise as needed.
I am building a Tensorflow JS model with LSTM layers for time-series prediction (RNN).
The data is sampled every few hundred milliseconds (at random intervals). However, the values span a very wide range: the majority will be values like 20, 40, 45, etc., but occasionally the value reaches 75,000 at the extreme end.
So the data ranges from 1 to 75,000.
When I normalise this data using a standard min/max method to produce a value between 0 and 1, the majority of normalised values are tiny numbers with many decimal places, e.g. '0.0038939328722009236'.
So my question(s) are:
1) Is this min/max the best approach for normalising this type of data?
2) Will the RNN model work well with so many significant decimal places and precision?
3) Should I also be normalising the output label? (of which there will be 1 output)
Update
I have just discovered a very good resource in a Google course that delves into preparing data for ML. One suggested technique is to 'clip' the data at the extremes. I thought I would add it here for reference: https://developers.google.com/machine-learning/data-prep
After doing a bit more research I think I have a decent solution now;
I will be performing two steps, with the first being to use 'quantile bucketing' (or sometimes called 'binning' ref: https://developers.google.com/machine-learning/data-prep/transform/bucketing).
Effectively it involves splitting the range of values into smaller subranges and assigning an integer to each subrange. E.g. an initial range of 1 to 1,000,000 could be broken down into ranges of 100k, so 1 to 100,000 would be range number 1 and 100,001 to 200,000 would be range number 2.
In order to have an even distribution of samples within each bucket range, given my skewed dataset, I modify the subranges so that each 'bucket' captures roughly the same number of samples. For example, the first range in the example above could be 1 to 1,000 instead of 1 to 100,000. The next bucket range would be 1,001 to 2,000. The third could be 2,001 to 10,000, and so on.
In my use case I ended up with 22 different bucket ranges. The next step is my own adaptation, since I don't want to have 22 different features (as seems to be suggested in the link). Instead, I apply the standard min/max scaling to these bucket ranges, resulting in the need for only 1 feature. This gives me the final result of normalised data between 0 and 1, which perfectly deals with my skewed dataset.
Now the lowest normalised value I get (other than 0) is 0.05556.
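The two steps described above (quantile bucketing, then min/max scaling of the bucket index) can be sketched in Python as follows. The function names and the way bucket edges are picked are assumptions for illustration, not the exact procedure used in this answer; `n_buckets` must be at least 2.

```python
from bisect import bisect_right

def quantile_edges(values, n_buckets):
    """Return n_buckets - 1 boundaries so each bucket holds roughly equal counts."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_buckets] for k in range(1, n_buckets)]

def bucket_and_scale(value, edges):
    """Map a raw value to its bucket index, then min/max scale that index to [0, 1]."""
    index = bisect_right(edges, value)  # bucket index in 0 .. len(edges)
    return index / len(edges)           # lowest bucket -> 0.0, highest -> 1.0

# Skewed sample: mostly small readings plus one extreme value.
readings = [20, 40, 45, 30, 25, 50, 35, 75000]
edges = quantile_edges(readings, 4)
scaled = [bucket_and_scale(v, edges) for v in readings]
```

Because the extreme value only occupies the top bucket, it no longer compresses the ordinary readings towards zero; the trade-off is that differences between values inside the same bucket are lost.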
Hope this helps others.
There are two similar time series data sets that share a common measurement but come from two completely different sources:
one is a classic GPS receiver (position, accurate time, somewhat accurate speed once per second) on a tractor
the other is a logger device on the same tractor that has an internal real time clock and measures speed/distance by measuring wheel rotation. That device also tracks other sensor data and stores these values once per second.
I'd like to merge these two data sources so that I can (as accurately as possible) match GPS position with the additional logger sensor data.
One aspect that might help here is that there will be some significant speed variations during the measurement, as the tractor usually has to do 180-degree turns after 100-200 meters (i.e. good detail for better matching).
By plotting the speed data into two charts respectively (X-axis is time, Y-axis is speed), I think a human could pretty easily align the charts in a way so that they match nicely.
How can this be done in software? I expect there are known algorithms that solve this but it's hard to search for it if you don't know the right terms...
Anyway, the algorithm probably has to deal with these problems:
the length of the data won't be equal (the two devices won't start/stop exactly at the same time - I expect at least 80% to overlap)
the GPS clock is of course perfectly accurate, but the realtime clock of the logger may be way off (it has no synchronized time source), so that I can't simply match the time
there might be slight variations in the measured speed due to the different measurement methods
A simple solution would probably find two matching extremes (allowing the data in between to be interpolated); a better solution might be more flexible and even correct some drift.
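One commonly used technique for this kind of alignment, which the question does not name, is cross-correlation: slide one speed series against the other and keep the shift where they correlate best. The Python sketch below assumes both series are already sampled once per second and only estimates a constant clock offset; `best_lag`, `max_lag`, and `min_overlap` are hypothetical names chosen for this example.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def best_lag(ref, other, max_lag, min_overlap=10):
    """Return the lag (in samples) such that other[i + lag] best matches ref[i]."""
    best_score, best = float("-inf"), 0
    for lag in range(-max_lag, max_lag + 1):
        # Indices i of ref for which other[i + lag] also exists.
        start = max(0, -lag)
        stop = min(len(ref), len(other) - lag)
        if stop - start < min_overlap:
            continue  # too little overlap to trust the score
        score = pearson(ref[start:stop], other[start + lag:stop + lag])
        if score > best_score:
            best_score, best = score, lag
    return best
```

Using correlation rather than raw differences tolerates a constant scale error between the two speed measurements. Clock drift is not handled here; one possible extension would be to estimate the lag separately in several time windows and interpolate between them.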
I'd like to find the local maxima for a set of data.
I have a log of flight data from a sounding rocket payload, and I'd like to find the approximate times for the staging based on accelerometer data. I should be able to get the times I want based on a visual inspection of the data on a graph, but how would I go about finding the points programmatically in Javascript?
If it's only necessary to know approximate times, probably it's good enough to use some heuristic such as: run the data through a smoothing filter and then look for jumps.
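As a rough illustration of the "smooth, then look for jumps" heuristic, here is a Python sketch (the question asks for JavaScript, but the logic ports directly). It compares the mean of a short window before each sample with the mean of the window after it, so the averaging acts as the smoothing filter; the `window` and `threshold` defaults are arbitrary assumptions that would need tuning against real accelerometer data.

```python
def find_jumps(times, values, window=5, threshold=5.0):
    """Return the times where the windowed mean changes abruptly."""
    scores = []
    for i in range(window, len(values) - window + 1):
        before = sum(values[i - window:i]) / window  # smoothed level before i
        after = sum(values[i:i + window]) / window   # smoothed level from i on
        scores.append((i, abs(after - before)))

    jumps = []
    for k, (i, score) in enumerate(scores):
        left = scores[k - 1][1] if k > 0 else 0.0
        right = scores[k + 1][1] if k + 1 < len(scores) else 0.0
        # Keep only local peaks of the jump score that exceed the threshold.
        if score >= threshold and score > left and score >= right:
            jumps.append(times[i])
    return jumps
```

Each reported time is the first sample after the jump, so the accuracy is limited to roughly one sample interval plus whatever the noise does to the peak position.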
If it's important to find the staging times accurately, my advice is to construct a piecewise continuous model and fit that to the data, and then derive the staging times from that. For example, a one-stage model might be: for 0 < t < t_1, acceleration is f(t) - g; for t > t_1, acceleration is - g, where g is gravitational acceleration. I don't know what f(t) might be here but presumably it's well-known in rocket engineering. The difficulty of fitting such a model is due to the presence of the cut-off point t_1, which makes it nondifferentiable, but it's not really too difficult; in a relatively simple case like this, you can loop over the possible cut-off points and compute the least-squares solution for the rest of the parameters, then take the cut-off point or points which have the least error.
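The loop-over-cut-off-points idea can be sketched in Python as below. Since f(t) is left unspecified above, this example assumes the simplest possible model, a constant acceleration on each side of the cut-off, for which the least-squares fit of each segment is just its mean. The name `fit_one_breakpoint` and the `min_seg` parameter are made up for this illustration.

```python
def fit_one_breakpoint(times, accel, min_seg=3):
    """Fit a two-level piecewise-constant model; return (t_1, level_before, level_after)."""
    if len(accel) < 2 * min_seg:
        raise ValueError("not enough samples for two segments")

    def sse(segment):
        # Sum of squared errors around the segment mean (its least-squares fit).
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    best_err, best_k = None, None
    # Try every admissible cut-off index and keep the one with the least error.
    for k in range(min_seg, len(accel) - min_seg + 1):
        err = sse(accel[:k]) + sse(accel[k:])
        if best_err is None or err < best_err:
            best_err, best_k = err, k

    before = sum(accel[:best_k]) / best_k
    after = sum(accel[best_k:]) / (len(accel) - best_k)
    return times[best_k], before, after
```

This brute-force version recomputes each segment error from scratch, so it is quadratic in the number of samples; for a multi-stage flight the same search would be repeated per stage or extended to several cut-off points.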
See Seber and Wild, "Nonlinear Regression"; there is a chapter about such models.
I am trying to generate a latency vs. time dataset for graphing.
The dataset has thousands to a couple of million entries that look like this:
[{"time_stamp":,"latency_axis":},...]
Querying and generating graphs performs poorly with the full dataset, for obvious reasons. An analogy is loading a large JPEG: first render a rough, fast preview, then render the full image.
How can this be achieved efficiently with a MongoDB query while providing a rough but accurate representation of the dataset with respect to time? My idea is to organize documents into time buckets, each covering a constant range of time, and return the 1st, 2nd, 3rd results iteratively until all results finish loading. However, there is no trivial or intuitive way to implement this prototype efficiently.
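To pin down what the time-bucket idea would return, here is an in-memory Python sketch; it is not a MongoDB query. It reads the idea as "pass k contains the k-th document of every bucket" and assumes `time_stamp` is a number of seconds. The function name `progressive_passes` is hypothetical.

```python
from collections import defaultdict

def progressive_passes(docs, bucket_seconds):
    """Yield successive passes; pass k holds the k-th document of every time bucket."""
    buckets = defaultdict(list)
    for doc in sorted(docs, key=lambda d: d["time_stamp"]):
        # Documents in the same fixed-width time range share a bucket key.
        buckets[doc["time_stamp"] // bucket_seconds].append(doc)

    ordered = [bucket for _, bucket in sorted(buckets.items())]
    depth = 0
    while True:
        batch = [bucket[depth] for bucket in ordered if depth < len(bucket)]
        if not batch:
            return  # every bucket is exhausted
        yield batch
        depth += 1
```

The first pass already spans the whole time range with one point per bucket, and each later pass refines it. On the database side, the equivalent would presumably need the bucket key (and possibly the position within the bucket) stored or indexed on each document so that one pass can be fetched without scanning everything.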
Can anyone provide guidance, a solution, or a suggestion for this situation?
Thanks
Notes for zamnuts:
A quick sample with very small dataset of what we are trying to achieve is: http://baker221.eecg.utoronto.ca:3001/pages/latencyOverTimeGraph.html
We would like to dynamically focus on specific regions. Our goal is to get a good representation while avoiding loading the whole dataset, and at the same time have reasonable performance.
DB: MongoDB; Server: Node.js; Client: Angular + nvd3 directive (proof of concept, will likely switch to d3.js for more control).
I'm just starting to approach this problem. I want to allow users to arbitrarily select ranges and filters so they can graph large data sets (realistically it should never be more than 10 million data points) on a web page. I use Elasticsearch for storing and aggregating the data, along with Redis for keeping track of summary data, and d3.js is my graphing library.
My thought on the best solution is to have precalculated summaries in different groupings that can be used to graph from. So if the data points span several years, I can have groupings by month and day (which I would be doing anyway), but then also groupings of, say, half day, quarter day, hour, half hour, etc. Then, before I query for graph data, I do a quick calculation to see which of these groupings will give me some ideal number of data points (say 1000).
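The "quick calculation" could look something like the Python sketch below. The set of groupings and their lengths in seconds (with a month approximated as 30 days) are assumptions for illustration, as is the name `pick_grouping`.

```python
# Precalculated grouping sizes in seconds (a month is approximated as 30 days).
GROUPINGS = {
    "half hour": 1800,
    "hour": 3600,
    "quarter day": 21600,
    "half day": 43200,
    "day": 86400,
    "month": 2592000,
}

def pick_grouping(range_seconds, target_points=1000):
    """Return the grouping whose bucket count is closest to target_points."""
    return min(
        GROUPINGS,
        key=lambda name: abs(range_seconds / GROUPINGS[name] - target_points),
    )
```

For a one-year selection this picks "half day" (730 buckets), and for a single day it falls back to the finest grouping available.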
Is this a reasonable way to approach the problem? Is there a better way?
You should reconsider the amount of data...
Even in desktop plotting apps it is uncommon to show that many points per plot; e.g. Origin prints a warning that it will show only a subset for performance reasons. You could, for example, throw away every 3rd point to reduce the count.
You should give the user the ability to zoom in or navigate around to explore the data, in a pagination-like style.
Grouping (or faceting, as it is called in the Lucene community) is of course possible with that many documents, but make sure you have enough RAM and CPU.
You can't graph (typically) more points than you have dots on your screen. So to graph 1M points you'd need a really good monitor.