What are good strategies for graphing large datasets (1M+)? - javascript

I'm just starting to approach this problem. I want to allow users to arbitrarily select ranges and filters so they can graph large datasets (realistically never more than 10 million data points) on a web page. I use Elasticsearch to store and aggregate the data, along with Redis for keeping track of summary data, and d3.js is my graphing library.
My current thinking is to have precalculated summaries at different granularities to graph from. So if the data points span several years, I can have groupings by month and by day (which I would be doing anyway), but also by half day, quarter day, hour, half hour, and so on. Then, before I query for graph data, I do a quick calculation to see which of these groupings gives me close to some ideal number of data points (say 1,000).
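A minimal sketch of that grouping-selection step, assuming a hypothetical list of candidate bucket sizes and a target of roughly 1,000 points:

    // Hypothetical candidate bucket sizes, in milliseconds (an assumption, not a fixed API).
    const BUCKETS_MS = [
      30 * 60 * 1000,           // half hour
      60 * 60 * 1000,           // hour
      6 * 60 * 60 * 1000,       // quarter day
      12 * 60 * 60 * 1000,      // half day
      24 * 60 * 60 * 1000,      // day
      30 * 24 * 60 * 60 * 1000  // ~month
    ];

    // Pick the smallest bucket whose resulting point count stays under the target.
    function pickBucket(rangeStartMs, rangeEndMs, targetPoints = 1000) {
      const span = rangeEndMs - rangeStartMs;
      for (const bucket of BUCKETS_MS) {
        if (span / bucket <= targetPoints) return bucket;
      }
      return BUCKETS_MS[BUCKETS_MS.length - 1]; // fall back to the coarsest grouping
    }

The chosen bucket could then map to whichever precalculated summary (or Elasticsearch date_histogram interval) gets queried.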
Is this a reasonable way to approach the problem? Is there a better way?

You should reconsider the amount of data...
Even in desktop plotting apps it is uncommon to show that many points per plot - e.g. Origin prints a warning that it will only show a subset for performance reasons. You could, for example, throw away every 3rd point to thin the data out (see the sketch below).
You should give the user the ability to zoom in or navigate around to explore the data, in a pagination-like style...
Grouping - or faceting, as it is called in the Lucene community - is of course possible with that many documents, but make sure you have enough RAM and CPU.
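A minimal sketch of that kind of decimation (keeping every Nth point), purely as an illustration:

    // Keep every `step`-th point; a crude but cheap way to thin out a series.
    function decimate(points, step = 3) {
      return points.filter((_, i) => i % step === 0);
    }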

You typically can't graph more points than you have pixels on your screen, so to plot 1M distinct points you'd need a really good monitor.

Related

Efficient/Performant way to visualise a lot of data in javascript + D3/mapbox

I am currently looking at an efficient way to visualise a lot of data in javascript. The data is geospatial and I have approximately 2 million data points.
Now I know that I cannot hand that many data points to the browser directly, otherwise it would just crash most of the time (or the response time would be very slow anyway).
I was thinking of having a JavaScript window communicate with a Python backend, which would do all the operations on the data and stream JSON back to the JavaScript app.
My idea was to have the JavaScript window send the bounding box of the map in real time (lat and lng of the north-east and south-west points) so that the Python script could go through all the entries and send back JSON containing only the viewable objects.
I just wrote a very simple script that does this. It basically:
Reads the whole CSV and stores the data in a list with lat, lng, and a few other attributes (2 or 3).
Does a naive check of whether each point is within the bounding box sent by the JavaScript (sketched below).
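The check itself is just a bounding-box comparison; a sketch of the same logic in JavaScript terms (the lat/lng field names and bounds object are assumptions):

    // Naive linear scan: keep only the points inside the viewport bounding box.
    function pointsInBounds(points, { south, west, north, east }) {
      return points.filter(p =>
        p.lat >= south && p.lat <= north &&
        p.lng >= west && p.lng <= east
      );
    }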
Currently, going through all the data points takes approximately 15 seconds, which is way too long, since I also have to transform them into a GeoJSON object before streaming them to my JavaScript application.
Now of course, I could first sort my points in ascending order of lat and lng so that the function checking whether a point is within the bounding box would be an order of magnitude faster. However, the processing time would still be too slow.
But even admitting that it is not, I still have the problem that at very low zoom levels I would get too many points. Constraining the min_zoom_level is not really an option for me, so I was thinking I should probably try to cluster the data points.
My question is therefore: do you think this approach is the right one? If so, how does one compute the clusters? It seems to me that I would have to generate a lot of possible clusters (different zoom levels, different places on the map, ...) and I am not sure whether that is an efficient and smart way to do it.
I would very much like to have your input on that, with possible adjustments or completely different solutions if you have some.
This is almost language agnostic, but I will tag it as python since my server currently runs a Python script and I believe Python is quite efficient for large datasets.
Final note:
I know that it is possible to pre-compute tiles that I could just feed to my JavaScript visualization, but since I want interactive control over what is displayed, this is not really an option for me.
Edit:
I know that, for instance, Mapbox provides clustering of data points to make it possible to display something like a million data points.
However, I think (and this is related to an open question here) that while I can easily display clusters of points, I cannot make a data-driven style for my clusters.
For instance, to take the now famous example of ethnicity maps: if I use Mapbox to cluster data points and a cluster gathers 50 people, I cannot make the cluster the color of the most represented ethnicity in that sample of 50 people.
Edit 2:
I have also learned about supercluster, but I am quite unsure whether that tool could handle multiple millions of data points without crashing either.
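For what it's worth, supercluster is built for exactly this kind of viewport-driven clustering, and, as far as I understand its map/reduce options, custom properties can be carried into each cluster, which is what a data-driven cluster style needs. A sketch under those assumptions (the ethnicity property name and viewport variables are placeholders):

    import Supercluster from 'supercluster';

    // points: GeoJSON Feature<Point>[] with e.g. properties.ethnicity (field name assumed)
    const index = new Supercluster({
      radius: 60,
      maxZoom: 16,
      // Carry per-category counts through the clustering so the cluster style can be data-driven.
      map: props => ({ counts: { [props.ethnicity]: 1 } }),
      reduce: (acc, props) => {
        // Reassign rather than mutate the nested object, so higher-zoom merges stay independent.
        const merged = { ...acc.counts };
        for (const key in props.counts) {
          merged[key] = (merged[key] || 0) + props.counts[key];
        }
        acc.counts = merged;
      }
    });

    index.load(points);

    // Ask for clusters inside the current viewport at the current zoom level.
    const clusters = index.getClusters([west, south, east, north], zoom);
    // Each cluster's properties.counts can now drive its color (e.g. the majority key).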

Downsampling Time Series: Average vs Largest-Triangle-Three-Buckets

I'm building line charts with Flot Charts to display time series.
To reduce the number of points displayed, I downsample by averaging all the data points that fall within the same hour.
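A minimal sketch of that hourly averaging, assuming the series is an array of [timestampMs, value] pairs:

    // Group points into hour buckets and average each bucket.
    function averagePerHour(points) {
      const HOUR_MS = 60 * 60 * 1000;
      const buckets = new Map();
      for (const [t, v] of points) {
        const bucket = Math.floor(t / HOUR_MS) * HOUR_MS;
        const b = buckets.get(bucket) || { sum: 0, count: 0 };
        b.sum += v;
        b.count += 1;
        buckets.set(bucket, b);
      }
      return [...buckets.entries()]
        .sort((a, b) => a[0] - b[0])
        .map(([t, { sum, count }]) => [t, sum / count]);
    }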
However, I recently discovered the Largest-Triangle-Three-Buckets algorithm:
http://flot.base.is/
What are the differences between using such an algorithm and using a simple function like an average (per minute, per hour, per day, ...)?
To speed up queries over long periods, does it make sense to pre-calculate an SQL table on the server side by applying LTTB to each month of data, and then let the client apply another LTTB pass on the aggregated data?
1: The problem with averages, for my purposes, is that they nuke large differences between samples - my peaks and valleys were more important than what was happening between them. The point of the three-buckets algorithm is to try to preserve those inflection points (peaks/valleys) while not worrying about showing you all the times the data was similar or the same.
So in my case, where the data was generally all the same (or close enough - temperature data) until sample X, at which point a small percentage change was important to show in the graph, the buckets algorithm was perfect.
Also, since the buckets algorithm is parameterized, you can change the values (how much data to keep), see which values nuke the most data while still looking visually nearly identical, and decide how much data you can dispense with before your graph has had too much removed.
The naive approach would be decimation (removing X out of every N samples), but what happens if it's the outliers you care about and the algorithm nukes an outlier? So then you change your decimation so that if the difference is too great, it doesn't drop that sample. LTTB (sketched below) is essentially a more sophisticated version of that concept.
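For reference, a sketch of the LTTB selection itself, following Sveinn Steinarsson's description (data as [x, y] pairs, threshold = number of points to keep):

    function largestTriangleThreeBuckets(data, threshold) {
      const n = data.length;
      if (threshold >= n || threshold < 3) return data.slice();

      const sampled = [data[0]];                  // always keep the first point
      const bucketSize = (n - 2) / (threshold - 2);
      let a = 0;                                  // index of the last selected point

      for (let i = 0; i < threshold - 2; i++) {
        // Average of the *next* bucket, used as the third triangle vertex.
        const avgStart = Math.floor((i + 1) * bucketSize) + 1;
        const avgEnd = Math.min(Math.floor((i + 2) * bucketSize) + 1, n);
        let avgX = 0, avgY = 0;
        for (let j = avgStart; j < avgEnd; j++) { avgX += data[j][0]; avgY += data[j][1]; }
        avgX /= avgEnd - avgStart;
        avgY /= avgEnd - avgStart;

        // In the current bucket, keep the point forming the largest triangle
        // with the previously selected point and the next bucket's average.
        const rangeStart = Math.floor(i * bucketSize) + 1;
        const rangeEnd = Math.floor((i + 1) * bucketSize) + 1;
        let maxArea = -1;
        let maxIndex = rangeStart;
        for (let j = rangeStart; j < rangeEnd; j++) {
          const area = Math.abs(
            (data[a][0] - avgX) * (data[j][1] - data[a][1]) -
            (data[a][0] - data[j][0]) * (avgY - data[a][1])
          ) / 2;
          if (area > maxArea) { maxArea = area; maxIndex = j; }
        }
        sampled.push(data[maxIndex]);
        a = maxIndex;
      }

      sampled.push(data[n - 1]);                  // always keep the last point
      return sampled;
    }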
2: It depends on how quickly you can compute it all, whether the data ever changes, and various other factors. That's up to you. From my perspective, once my data was in the past and a sample was 'chosen' to represent a bucket's value, it won't change, so I can save it and never recalculate it again.
Since your question is a bit old, what'd you end up doing?

How to approximately align imprecise time series data / curves?

There are two similar time series data sets that share a common measurement, which however comes from two completely different sources:
one is a classic GPS receiver (position, accurate time, somewhat accurate speed once per second) on a tractor
the other is a logger device on the same tractor that has an internal real time clock and measures speed/distance by measuring wheel rotation. That device also tracks other sensor data and stores these values once per second.
I'd like to combine these two data sources so that I can (as accurately as possible) combine GPS position and the additional logger sensor data.
One aspect that might help here is that there will be some significant speed variations during the measurement, as the tractor usually has to do 180-degree turns after 100-200 meters (i.e. good detail for a better match).
By plotting the speed data into two charts respectively (X-axis is time, Y-axis is speed), I think a human could pretty easily align the charts in a way so that they match nicely.
How can this be done in software? I expect there are known algorithms that solve this, but it's hard to search for them if you don't know the right terms...
Anyway, the algorithm probably has to deal with these problems:
the length of the data won't be equal (the two devices won't start/stop exactly at the same time - I expect at least 80% to overlap)
the GPS clock is of course perfectly accurate, but the real-time clock of the logger may be way off (it has no synchronized time source), so I can't simply match by time
there might be slight variations in the measured speed due to the different measurement methods
A simple solution would probably find two matching extremes (allowing the data in between to be interpolated); a better solution might be more flexible and even correct some "drift"...
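One simple way to automate the "slide the charts until they line up" idea is a brute-force search over candidate time offsets, scoring each offset by how well the two speed series agree. A sketch, assuming both series have already been resampled to one sample per second (it does not handle clock drift):

    // speedsA, speedsB: arrays of speed samples at 1 Hz.
    // Returns the offset (in seconds) of B relative to A that minimizes
    // the mean squared difference over the overlapping region.
    function findBestOffset(speedsA, speedsB, maxOffset = 600) {
      let best = { offset: 0, error: Infinity };
      for (let offset = -maxOffset; offset <= maxOffset; offset++) {
        let sum = 0, count = 0;
        for (let i = 0; i < speedsA.length; i++) {
          const j = i + offset;
          if (j < 0 || j >= speedsB.length) continue;
          const d = speedsA[i] - speedsB[j];
          sum += d * d;
          count++;
        }
        // Require a reasonable overlap so tiny overlaps don't win by accident.
        if (count > 0.5 * Math.min(speedsA.length, speedsB.length)) {
          const error = sum / count;
          if (error < best.error) best = { offset, error };
        }
      }
      return best;
    }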

Finding local maxima

I'd like to find the local maxima for a set of data.
I have a log of flight data from a sounding rocket payload, and I'd like to find the approximate times for the staging based on accelerometer data. I should be able to get the times I want from a visual inspection of the data on a graph, but how would I go about finding the points programmatically in JavaScript?
If it's only necessary to know approximate times, it's probably good enough to use some heuristic such as: run the data through a smoothing filter and then look for jumps.
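A minimal sketch of that heuristic (moving-average smoothing, then flagging samples that are higher than both neighbors and above a threshold; the window size and threshold are values to tune):

    // Simple centered moving average.
    function smooth(values, window = 5) {
      const half = Math.floor(window / 2);
      return values.map((_, i) => {
        const start = Math.max(0, i - half);
        const end = Math.min(values.length, i + half + 1);
        let sum = 0;
        for (let j = start; j < end; j++) sum += values[j];
        return sum / (end - start);
      });
    }

    // Indices of local maxima in the smoothed series that exceed `threshold`.
    function localMaxima(values, threshold = 0) {
      const s = smooth(values);
      const peaks = [];
      for (let i = 1; i < s.length - 1; i++) {
        if (s[i] > s[i - 1] && s[i] > s[i + 1] && s[i] > threshold) peaks.push(i);
      }
      return peaks;
    }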
If it's important to find the staging times accurately, my advice is to construct a piecewise continuous model and fit that to the data, and then derive the staging times from that. For example, a one-stage model might be: for 0 < t < t_1, acceleration is f(t) - g; for t > t_1, acceleration is - g, where g is gravitational acceleration. I don't know what f(t) might be here but presumably it's well-known in rocket engineering. The difficulty of fitting such a model is due to the presence of the cut-off point t_1, which makes it nondifferentiable, but it's not really too difficult; in a relatively simple case like this, you can loop over the possible cut-off points and compute the least-squares solution for the rest of the parameters, then take the cut-off point or points which have the least error.
See Seber and Wild, "Nonlinear Regression"; there is a chapter about such models.

mongodb: How to sample large data-set in a controlled way

I am trying to generate a latency-vs-time dataset for graphing.
The dataset has thousands to a couple million entries that look like this:
[{"time_stamp":,"latency_axis":},...]
Querying and generating graphs has poor performance with the full dataset, for obvious reasons. An analogy is loading a large JPEG: first show a rough, fast image, then render the full image.
How can this be achieved efficiently with a MongoDB query while still providing a rough but accurate representation of the dataset with respect to time? My idea is to organize documents into time buckets, each covering a constant range of time, and return the 1st, 2nd, 3rd, ... results iteratively until all results have loaded. However, there is no trivial or intuitive way to implement this prototype efficiently.
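A sketch of that bucketing idea as a MongoDB aggregation, using the field names from the document shape above (the collection name, bucket size, and range variables are assumptions, and time_stamp is assumed to be stored as a numeric epoch value):

    // Average latency per fixed-size time bucket, e.g. one point per minute.
    const bucketMs = 60 * 1000; // assumed bucket size

    db.latencies.aggregate([
      { $match: { time_stamp: { $gte: rangeStart, $lt: rangeEnd } } },
      {
        $group: {
          // Truncate the timestamp down to the start of its bucket.
          _id: { $subtract: ["$time_stamp", { $mod: ["$time_stamp", bucketMs] }] },
          avgLatency: { $avg: "$latency_axis" },
          count: { $sum: 1 }
        }
      },
      { $sort: { _id: 1 } }
    ]);

Running the same pipeline with progressively smaller bucketMs values would give the "rough first, detail later" loading described above.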
Can anyone provide guide/solution/suggestion regarding this situation?
Thanks
Notes for zamnuts:
A quick sample with very small dataset of what we are trying to achieve is: http://baker221.eecg.utoronto.ca:3001/pages/latencyOverTimeGraph.html
We'd like to dynamically focus on specific regions. Our goal is to get a good representation while avoiding loading the whole dataset, and at the same time have reasonable performance.
DB: mongodb Server: node.js Client: angular + nvd3 directive (proof of concept, will likely switch to d3.js for more control).
