mongodb: How to sample a large dataset in a controlled way - javascript

I am trying to generate a latency-vs-time dataset for graphing.
The dataset has thousands to a couple million entries that look like this:
[{"time_stamp": ..., "latency_axis": ...}, ...]
Querying and generating graphs have poor performance with the full dataset, for obvious reasons. An analogy is loading a large JPEG: first sample a rough, fast preview, then render the full image.
How can this be achieved efficiently with a MongoDB query, while still providing a rough but accurate representation of the dataset with respect to time? My idea is to organize documents into time buckets, each covering a constant range of time, and return the 1st, 2nd, 3rd result from each bucket iteratively until all results finish loading. However, I haven't found a trivial or intuitive way to implement this efficiently.
Can anyone provide a guide, solution, or suggestion for this situation?
Thanks
Notes for zamnuts:
A quick sample with very small dataset of what we are trying to achieve is: http://baker221.eecg.utoronto.ca:3001/pages/latencyOverTimeGraph.html
We'd like to dynamically focus on specific regions. Our goal is to get a good representation of the data without loading the entire dataset, while keeping performance reasonable.
DB: MongoDB. Server: Node.js. Client: Angular + the nvd3 directive (proof of concept; will likely switch to d3.js for more control).
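For reference, here is a minimal sketch of the bucket-and-return-first-N idea (not a tested solution): it assumes a collection named latencies in a db named metrics, that time_stamp is stored as epoch milliseconds, and it uses the aggregation pipeline to group documents into fixed-width time buckets and keep only the first few samples per bucket for the rough first pass.

```javascript
// Sketch only: db/collection names and the numeric time_stamp format are assumptions.
const { MongoClient } = require('mongodb');

async function coarseSample(bucketMs = 60 * 1000, perBucket = 3) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const col = client.db('metrics').collection('latencies');

  const buckets = await col.aggregate([
    { $sort: { time_stamp: 1 } }, // so $push below sees documents in time order
    // Group into fixed-width buckets: bucket id = start of the bucket's time window.
    { $group: {
        _id: { $subtract: ['$time_stamp', { $mod: ['$time_stamp', bucketMs] }] },
        samples: { $push: { t: '$time_stamp', y: '$latency_axis' } }
    } },
    // Keep only the first N samples per bucket for the rough pass.
    { $project: { samples: { $slice: ['$samples', perBucket] } } },
    { $sort: { _id: 1 } }
  ]).toArray();

  await client.close();
  return buckets;
}
```

Subsequent passes could raise perBucket (or shrink bucketMs) and merge the extra points into the already-rendered series, which roughly mirrors the progressive-JPEG analogy.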

Related

Downsampling Time Series: Average vs Largest-Triangle-Three-Buckets

I'm programming line charts with Flot Charts to display time series.
To reduce the number of points to display, I downsample by averaging all data points that fall in the same hour.
However, I recently discovered the Largest-Triangle-Three-Buckets algorithm:
http://flot.base.is/
What are the differences between using such algorithm against using a simple function like average (per minute, per hour, per day, ...)?
To speed up long-period queries, does it make sense to pre-calculate an SQL table on the server side by applying LTTB to each month of data, and let the client side apply another LTTB pass on the aggregated data?
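For concreteness, a minimal sketch of the per-hour averaging described above, assuming Flot-style [timestampMs, value] points (the exact point shape is an assumption):

```javascript
// Replace each hour bucket with a single averaged point.
// Points are assumed to be Flot-style [timestampMs, value] pairs.
function averagePerHour(points) {
  const HOUR = 60 * 60 * 1000;
  const buckets = new Map(); // hourStart -> { sum, count }
  for (const [t, v] of points) {
    const hourStart = t - (t % HOUR);
    const b = buckets.get(hourStart) || { sum: 0, count: 0 };
    b.sum += v;
    b.count += 1;
    buckets.set(hourStart, b);
  }
  return [...buckets.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([hourStart, b]) => [hourStart, b.sum / b.count]);
}
```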
1: The problem with averages, for my purposes, is that they nuke large differences between samples; my peaks and valleys were more important than what was happening between them. The point of the three-buckets algorithm is to try to preserve those inflection points (peaks/valleys) while not worrying about showing you all the times the data was similar or the same.
So, in my case, where the data was generally all the same (or close enough; it was temperature data) until sample X, at which point a small % change was important to show in the graph, the buckets algorithm was perfect.
Also, since the buckets algorithm is parameterized, you can change the values (how much data to keep), see which values nuke the most data while looking visually nearly identical, and decide how much data you can dispense with before your graph has had too much data removed.
The naive approach would be decimation (removing X out of N samples), but what happens if it's the outliers you care about and the algorithm nukes an outlier? So then you change your decimation so that if the difference is too great, it doesn't nuke that sample. This is kind of a more sophisticated version of that concept.
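As a rough illustration of that refined decimation (the stride and threshold values here are made up, not from the answer), using the same [timestampMs, value] point shape as above:

```javascript
// Keep every Nth sample, but also keep any sample whose value differs from
// the last kept value by more than a threshold, so outliers survive.
function decimateKeepingOutliers(points, stride = 10, threshold = 5) {
  const kept = [];
  let lastKeptValue = null;
  points.forEach(([t, v], i) => {
    const isNth = i % stride === 0;
    const isOutlier = lastKeptValue !== null && Math.abs(v - lastKeptValue) > threshold;
    if (isNth || isOutlier) {
      kept.push([t, v]);
      lastKeptValue = v;
    }
  });
  return kept;
}
```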
2: It depends on how quickly you can compute it all, whether the data ever changes, and various other factors; that's up to you. From my perspective, once my data was in the past and a sample was 'chosen' to represent the bucket's value, it won't change, so I can save it and never recalculate again.
Since your question is a bit old, what'd you end up doing?

ReactJS manipulating large array of objects

I am relatively new to ReactJS and the web scene, so I am turning to you, SO.
Our app uses Firebase to store data. Recently I created a little stress test for our app that loads about 14k entries from the database and saves them all in an array. To be exact, we are downloading data that we will use to render different kinds of charts and display some statistics about them. The 14k entry count is just there to test the speed of everything; we don't actually expect that many entries per customer (at least not yet). Anyway, I decided to think ahead about how I should handle this kind of problem.
I did some rough calculations about loading times:
4-6 s loading from Firebase into the array
3-4 s generating dates for the chart, checking the fetched data against the generated dates, and calculating the data
Everything, from saving the results from the database onward, is done using array.map. I have also tried while / for loops, since I read some articles saying .map is not very good when it comes to performance; the result was the same.
So what I am asking here is: what kind of "system" should I use if I want to be prepared for larger amounts of data in arrays? Can .map handle that much data? What are the best practices?
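This is not from the question itself, but since the slow step is "checking for matching dates", one common pattern is to index the fetched entries in a Map keyed by day, so each generated date becomes a single lookup instead of a scan over all 14k entries. The field names (timestamp in ms, value) and the "YYYY-MM-DD" day keys below are assumptions:

```javascript
// Bucket entries by calendar day once, then look up each generated day in O(1).
function bucketByDay(entries) {
  const byDay = new Map();
  for (const e of entries) {
    const day = new Date(e.timestamp).toISOString().slice(0, 10); // "YYYY-MM-DD"
    if (!byDay.has(day)) byDay.set(day, []);
    byDay.get(day).push(e.value);
  }
  return byDay;
}

// generatedDays is assumed to be an array of "YYYY-MM-DD" strings.
function seriesForChart(generatedDays, entries) {
  const byDay = bucketByDay(entries);
  return generatedDays.map(day => {
    const values = byDay.get(day) || [];
    const sum = values.reduce((a, b) => a + b, 0);
    return { day, count: values.length, avg: values.length ? sum / values.length : null };
  });
}
```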

Aggregating tabular data in Javascript

I have a lot of data that is date-based in nature but has irregular reporting intervals. What I was hoping to do is have my PHP backend send the data to the JS/jQuery frontend to display, but in order to report effectively I need to be able to report by "week", "month", etc. I could do the SQL on the backend and build a different data series for each interval, but that would be CPU-intensive on the backend, and I'd like the frontend to be dynamic in moving between these intervals ("show me monthly, no wait, show me weekly", etc.).
What I'm looking for, I think, is a JS/jQuery library that will help me take tabular data and group/aggregate it based on date conditions. If need be, I could add a column to the tabular structure that specifies the week number (and thereby simplify the date math on the frontend). In any case, I'm flexible at the moment and just hoping to hear of good resources or approaches that have been tried before for this kind of problem.
Note: I am using jqGrid for tabular views on the frontend and Highcharts for graphical presentation. This is potentially not important but it might also open some creative alternatives.
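As one possible approach (a sketch, not a library recommendation), the grouping can be done client-side with a plain reduce; the row shape { date, value } and the week/month keying below are assumptions:

```javascript
// Build a grouping key for a row's date: either "YYYY-MM" or the Monday (UTC)
// that starts its week, so the interval can be switched without a new backend query.
function periodKey(dateStr, interval) {
  const d = new Date(dateStr);
  if (interval === 'month') {
    return d.getUTCFullYear() + '-' + String(d.getUTCMonth() + 1).padStart(2, '0');
  }
  const monday = new Date(d);
  monday.setUTCDate(d.getUTCDate() - ((d.getUTCDay() + 6) % 7));
  return monday.toISOString().slice(0, 10);
}

function aggregate(rows, interval) {
  const groups = rows.reduce((acc, row) => {
    const key = periodKey(row.date, interval);
    (acc[key] = acc[key] || []).push(row.value);
    return acc;
  }, {});
  // Sum per period; swap in an average, max, etc. as needed for the chart.
  return Object.keys(groups).sort().map(key => ({
    period: key,
    total: groups[key].reduce((a, b) => a + b, 0)
  }));
}
```

The resulting array can be fed to Highcharts as a series or back into jqGrid for the tabular view.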

Processing a large (12K+ rows) array in JavaScript

The project requirements are odd for this one, but I'm looking to get some insight...
I have a CSV file with about 12,000 rows of data, approximately 12-15 columns. I'm converting that to a JSON array and loading it via JSONP (it has to run client-side). It takes many seconds to do any kind of querying on the dataset to return a smaller, filtered dataset. I'm currently using JLINQ to do the filtering, but I'm essentially just looping through the array and returning a smaller set based on conditions.
Would webdb or indexeddb allow me to do this filtering significantly faster? Any tutorials/articles out there that you know of that tackles this particular type of issue?
http://square.github.com/crossfilter/ (no longer maintained, see https://github.com/crossfilter/crossfilter for a newer fork.)
Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records...
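A minimal sketch of what that looks like for this kind of filtering, assuming the CSV rows have already been parsed into objects with, say, date and amount fields (both names are assumptions):

```javascript
// In the browser this would come from a <script> tag; the npm name below is
// the maintained fork mentioned above.
const crossfilter = require('crossfilter2');

const cf = crossfilter(rows);               // rows: the ~12K parsed records
const byDate = cf.dimension(r => r.date);   // assumed Date field
const byAmount = cf.dimension(r => r.amount); // assumed numeric field

// Restrict to a date window and an amount threshold; filters on different
// dimensions compose automatically.
byDate.filterRange([new Date('2013-01-01'), new Date('2013-02-01')]);
byAmount.filterFunction(v => v >= 100);

// All records passing every active filter, in descending date order.
const filtered = byDate.top(Infinity);
```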
This reminds me of an article John Resig wrote about dictionary lookups (a real dictionary, not a programming construct).
http://ejohn.org/blog/dictionary-lookups-in-javascript/
He starts with server side implementations, and then works on a client side solution. It should give you some ideas for ways to improve what you are doing right now:
Caching
Local Storage
Memory Considerations
If you require loading an entire data object into memory before you apply some transform on it, I would leave IndexedDB and WebSQL out of the mix as they typically both add to complexity and reduce the performance of apps.
For this type of filtering, a library like Crossfilter will go a long way.
Where IndexedDB and WebSQL can come into play in terms of filtering is when you don't need to load, or don't want to load, an entire dataset into memory. These databases are best utilized for their ability to index rows (WebSQL) and attributes (IndexedDB).
With in-browser databases, you can stream data into a database one record at a time and then cursor through it, one record at a time. The benefit for filtering is that this means you can leave your data on "disk" (a .leveldb in Chrome and a .sqlite database in Firefox) and filter out unnecessary records either as a pre-filtering step or as the filter itself.
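A minimal sketch of that cursor-based filtering with IndexedDB (the database and store names are hypothetical):

```javascript
// Walk an object store one record at a time and keep only rows that pass a
// predicate, without ever holding the whole dataset in memory.
function filterFromIndexedDB(predicate, done) {
  const open = indexedDB.open('rowsDB', 1);
  open.onupgradeneeded = () => {
    open.result.createObjectStore('rows', { autoIncrement: true });
  };
  open.onsuccess = () => {
    const db = open.result;
    const matches = [];
    const cursorReq = db.transaction('rows', 'readonly')
                        .objectStore('rows')
                        .openCursor();
    cursorReq.onsuccess = () => {
      const cursor = cursorReq.result;
      if (cursor) {
        if (predicate(cursor.value)) matches.push(cursor.value);
        cursor.continue();   // advance to the next record
      } else {
        done(matches);       // cursor exhausted
      }
    };
  };
}
```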

What are good strategies for graphing large datasets (1M +)?

I'm just starting to approach this problem. I want to allow users to arbitrarily select ranges and filters so they can graph large datasets (realistically never more than 10 million data points) on a web page. I use Elasticsearch to store and aggregate the data, along with Redis for keeping track of summary data, and d3.js is my graphing library.
My thought on the best solution is to have precalculated summaries in different groupings that can be used for graphing. So if the data points span several years, I can have groupings by month and day (which I would be doing anyway), but then also groupings of, say, half day, quarter day, hour, half hour, etc. Then, before I query for graph data, I do a quick calculation to see which of these groupings will give me some ideal number of data points (say 1000).
Is this a reasonable way to approach the problem? Is there a better way?
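A small sketch of the "which grouping gives ~1000 points" calculation described above; the candidate intervals are assumptions and should match whatever summaries are actually precalculated:

```javascript
// Precalculated summary granularities, smallest to largest, in milliseconds.
const INTERVALS_MS = [
  30 * 60 * 1000,            // half hour
  60 * 60 * 1000,            // hour
  6 * 60 * 60 * 1000,        // quarter day
  12 * 60 * 60 * 1000,       // half day
  24 * 60 * 60 * 1000,       // day
  30 * 24 * 60 * 60 * 1000   // ~month
];

// Pick the finest interval that keeps the point count at or under the target.
function pickInterval(rangeStartMs, rangeEndMs, targetPoints = 1000) {
  const span = rangeEndMs - rangeStartMs;
  for (const interval of INTERVALS_MS) {
    if (span / interval <= targetPoints) return interval;
  }
  return INTERVALS_MS[INTERVALS_MS.length - 1];
}
```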
You should reconsider the amount of data...
Even in desktop plotting apps it is uncommon to show that many points per plot; e.g. Origin prints a warning that it will show only a subset for performance reasons. You could, for example, throw away every 3rd point to reduce the count.
You should give the user the ability to zoom in or navigate around to explore the data, in a pagination-like style...
Grouping (or faceting, as it is called in the Lucene community) is of course possible with that many documents, but be sure you have enough RAM and CPU.
You can't graph (typically) more points than you have dots on your screen. So to graph 1M points you'd need a really good monitor.
