I have a very simple dataset, consisting of
a date (object) for every day over multiple years and
a value that belongs to a date.
I need to look at that data on the level of days, months and years (and rolling time windows, e.g. 30 days). The data is obtained from a CSV file.
What would you suggest is the best way to (conceptually) prepare the data? Should I, when reading the data, immediately nest it (year > month > day - value) and also precompute everything I need to plot (averages and the like) and store it with my data (e.g. data.year.monthXY.avg = z)?
Or should I leave the data as is (at its most basic form, a day and its value) and do all the calculations later in the script?
I have little experience in handling data and with D3's best practices, and I would appreciate any advice you have for me on that topic.
I feel like there is no need to prep the data, because all the information is already contained in the date object and the value, and all the calculations can be done on the fly. (I have to admit, though, that I don't know exactly how I can tell D3 to, say, calculate the average of a month without first creating a new dataset that includes just that month. That might be another question, depending on how you suggest I sort my data.) On the other hand, nesting the data where possible seems like a good idea, in particular for making use of D3's strengths in that regard and its benefits for other/future visual representations.
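To illustrate what I mean by calculating on the fly: a rough sketch of grouping the flat day/value rows by month and averaging them in plain JavaScript (the column names date and value stand in for my real CSV columns; D3's nest() presumably offers similar grouping):

```javascript
// Group flat { date, value } rows by "YYYY-MM" and average each group.
// The column names `date` (ISO "YYYY-MM-DD" string) and `value` are
// placeholders for the real CSV columns.
function monthlyAverages(rows) {
  const groups = new Map(); // "YYYY-MM" -> { sum, count }
  for (const row of rows) {
    const [year, month] = row.date.split("-");
    const key = `${year}-${month}`;
    const g = groups.get(key) || { sum: 0, count: 0 };
    g.sum += row.value;
    g.count += 1;
    groups.set(key, g);
  }
  const averages = new Map();
  for (const [key, g] of groups) averages.set(key, g.sum / g.count);
  return averages;
}
```

Keeping the data flat and deriving aggregates like this only when needed would also keep the raw dataset reusable for the rolling-window views.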
Related
I'm programming line charts with Flot Charts to display time series.
To reduce the number of points to display, I downsample by applying an average function to all data points within the same hour.
Recently, I however discovered the Largest-Triangle-Three-Buckets algorithm:
http://flot.base.is/
What are the differences between using such an algorithm and using a simple function like an average (per minute, per hour, per day, ...)?
To speed up queries over long periods, does it make sense to pre-calculate an SQL table on the server side by applying LTTB to each month of data, and let the client side apply another LTTB pass on the aggregated data?
1: The problem with averages, for my purposes, is that they nuke large differences between samples; my peaks and valleys were more important than what was happening between them. The point of the three-buckets algorithm is to try to preserve those inflection points (peaks/valleys) while not worrying about showing you all the times the data was similar or the same.
So in my case, where the data was generally all the same (or close enough; temperature data) until sample X, at which point a small % change was important to show in the graph, the buckets algorithm was perfect.
Also, since the buckets algorithm is parameterized, you can change the values (how much data to keep), see which values nuke the most data while looking visually nearly identical, and decide how much data you can dispense with before your graph has had too much removed.
The naive approach would be decimation (removing X out of N samples), but what happens if it's the outliers you care about and the algorithm nukes an outlier? So then you change your decimation so that if the difference is too great, it doesn't nuke that sample. LTTB is a more sophisticated version of that concept.
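That threshold-aware decimation can be sketched in a few lines. This is only an illustration of the concept, not LTTB itself, and the parameter names are made up:

```javascript
// Keep every nth sample, plus any sample that jumps by more than
// `threshold` from the previously kept one, so outliers survive
// the decimation instead of being dropped.
function decimate(values, n, threshold) {
  const kept = [];
  let last = null;
  values.forEach((v, i) => {
    if (i % n === 0 || (last !== null && Math.abs(v - last) > threshold)) {
      kept.push(v);
      last = v;
    }
  });
  return kept;
}
```

With plain every-nth decimation a lone spike can vanish entirely; the threshold test is what rescues it.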
2: It depends on how quickly you can compute it all, whether the data ever changes, and various other factors. That's up to you. From my perspective, once my data was in the past and a sample was 'chosen' to represent a bucket's value, it won't change, so I can save it and never recalculate it again.
Since your question is a bit old, what'd you end up doing?
I want to store an array of objects, each of which has a property event_timestamp containing a Date() object with the timestamp of the event.
I'd like to provide a from Date object and a to Date object and collect all the objects within that range.
I don't want to iterate over the whole array; I'd like something faster, for example O(log N).
Is there something like that implemented in JavaScript? Is there a 3rd-party library that provides search in O(log N) or some other fast method?
Any information regarding the issue would be greatly appreciated.
Thanks!
Update
Thanks for your responses so far.
My scenario is a dashboard that contains graphs. A user can search for, say, the results of the last 15 minutes. The server then provides rows with the relevant results, and each row in the result contains a property called event_timestamp holding a Date object.
Now, if after 5 minutes the user searches for the last 15 minutes again, the first 10 minutes have already been queried, so I want to cache them and send a request to the server only for the last 5 minutes.
So whenever the user queries the server, I take the response and parse it using the following steps:
get event_time on first row
get event_time of last row
append to an array the following object:
{
rows: rows,
fromDate: firstRowDate,
toDate: lastRowDate
}
Now, in the previous example, the user first queries for 15 minutes and then queries for 15 minutes again 5 minutes later, which means I have 10 valuable minutes of data already in the object.
The main issue here is how I iterate over the rows fast to find the relevant rows that I need for my graph.
Naively I can just iterate through all the rows and find the ones in the range fromDate to toDate, but if I'm dealing with a few million rows that's going to be a hassle for the client.
I use Google FlatBuffers to transfer the data and then build the rows myself, so I can store them in any other format.
Hope this helps clarify my needs. Thanks!
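For reference, the bookkeeping I have in mind for deciding what to fetch looks roughly like this (a sketch with invented names; timestamps as milliseconds, and it only handles the common dashboard case where the new request extends past the cached end):

```javascript
// Given a cached interval and a requested interval (timestamps in ms),
// return the sub-range that still has to be fetched from the server,
// or null if the cache already covers the request. If the request
// starts before the cached range, we just re-fetch everything.
function missingRange(cache, request) {
  if (request.to <= cache.toDate && request.from >= cache.fromDate) {
    return null; // fully covered by the cache
  }
  if (request.from >= cache.fromDate && request.from <= cache.toDate) {
    return { from: cache.toDate, to: request.to }; // only fetch the tail
  }
  return { from: request.from, to: request.to }; // no usable overlap
}
```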
With your current way of storing the data, there is no faster way than to iterate over the whole array, so your complexity will be O(n) just for traversing it.
The information you need is stored inside objects which are themselves stored in an unsorted array; there is no way to access it other than to look at each single object. Even in other languages with different array implementations, there wouldn't be a faster way.
The only way you could make this faster is by changing the way you store your data.
For example you could implement a BST with the date as the index. As this would be sorted by date, you wouldn't have to traverse the whole tree and you would need fewer operations to find the nodes in range. I'm afraid you would have to build it yourself though.
Update:
This was related to your initial question. Your update goes in a whole new direction.
I don't think your approach is a good idea. You won't be able to handle such a big amount of data, at least not in a performant way. Why not query the server more often, based on what the client really needs? You could even adjust the resolution, since you might not need all the points, depending on how large the chosen range is.
Nonetheless, if we can assume the array you got is already sorted by date, you can find your matching values faster with a Binary Search Algorithm.
This is how it basically works: you start in the middle of your sorted array; if the value found is higher than the one you are looking for, you inspect the left half of the array, and if it is lower, the right half. Then you inspect the middle of the new section, and you continue with this pattern until you've found your value.
(Image: visualization of the binary search algorithm, where 4 is the target value.)
The average performance of a Binary Search Algorithm is O(log n), so this would definitely help.
This is just a start to give you an idea. You will need to modify it a bit, to work with your range.
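To make that concrete for the range case: two lower-bound binary searches, one for fromDate and one for just past toDate, give you the slice you need. A sketch, assuming the rows are sorted and the timestamp is stored as a plain number t (milliseconds) rather than a Date:

```javascript
// First index i such that arr[i].t >= target (lower bound).
function lowerBound(arr, target) {
  let lo = 0, hi = arr.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (arr[mid].t < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// All rows with fromDate <= t <= toDate, found in O(log n)
// plus the size of the result.
function rowsInRange(sortedRows, fromDate, toDate) {
  const start = lowerBound(sortedRows, fromDate);
  const end = lowerBound(sortedRows, toDate + 1); // first index past toDate
  return sortedRows.slice(start, end);
}
```

With Date objects you'd compare getTime() values instead; the structure of the search is the same.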
I am trying to generate a latency-vs-time dataset for graphing.
The dataset has thousands to a couple million entries that look like this:
[{"time_stamp": ..., "latency_axis": ...}, ...]
Querying and generating graphs performs poorly with the full dataset, for obvious reasons. An analogy is loading a large JPEG: first sample a rough, fast image, then render the full image.
How can this be achieved efficiently with a MongoDB query while providing a rough but accurate representation of the dataset with respect to time? My idea is to organize documents into time buckets, each covering a constant range of time, and return the 1st, 2nd, 3rd results iteratively until all results have loaded. However, there is no trivial or intuitive way to implement this prototype efficiently.
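The bucket idea, sketched client-side in plain JavaScript to show the reduction I have in mind (in MongoDB this should map to a $group on a truncated timestamp with $avg, though I haven't worked out the exact pipeline):

```javascript
// Reduce raw points to one averaged point per fixed-width time bucket,
// mirroring a server-side group on (time_stamp - time_stamp % bucketMs)
// that averages latency_axis within each bucket.
function bucketAverages(points, bucketMs) {
  const buckets = new Map();
  for (const p of points) {
    const key = p.time_stamp - (p.time_stamp % bucketMs);
    const b = buckets.get(key) || { sum: 0, count: 0 };
    b.sum += p.latency_axis;
    b.count += 1;
    buckets.set(key, b);
  }
  return [...buckets]
    .sort((a, b) => a[0] - b[0])
    .map(([t, { sum, count }]) => ({ time_stamp: t, latency_axis: sum / count }));
}
```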
Can anyone provide guidance, a solution, or suggestions for this situation?
Thanks
Notes for zamnuts:
A quick sample with very small dataset of what we are trying to achieve is: http://baker221.eecg.utoronto.ca:3001/pages/latencyOverTimeGraph.html
We'd like to dynamically focus on specific regions. Our goal is a good representation that avoids loading the entire dataset while keeping reasonable performance.
DB: mongodb Server: node.js Client: angular + nvd3 directive (proof of concept, will likely switch to d3.js for more control).
I have a Google Chart's ColumnChart in a Rails project. This is generated and populated in JavaScript, by calling a Rails controller action which renders JSON.
The chart displays a month's worth of information for a customer.
Above the chart I have next and previous arrows which allow a customer to change the month displayed on the chart. These don't have any functionality as it stands.
My question is: what is the best way to save the state of the chart, in terms of its current month, for a customer viewing the chart?
Here is how I was thinking of doing the workflow:
One of the arrows is selected.
This event is captured in JavaScript.
Another request to the Rails action rendering JSON is performed, with an additional GET parameter based on a data attribute of the arrow button (either + or -).
The chart is re-rendered using the new JSON response.
Would the logic around incrementing or decrementing the graph's current date be performed on the server side, with the chart's date stored in the session, defaulting to the current date on first load?
On the other hand would it make sense to save the chart state on the client side within the JavaScript code or in cookie, then manipulate the date before it's sent to the Rails controller?
I've been developing with Rails for about 6 months and feel comfortable with it, but have only just recently started developing with JavaScript, using AJAX. My experience tying JS code together with Rails is somewhat limited at this point, so I'm looking for some advice/best practices on how to approach this.
Any advice is much appreciated.
I'm going to go through a couple of options, some good, some bad.
First, what you definitely don't want to do is keep track of the current month in cookies or any other form of persistent server-side storage. Certainly server state is sometimes necessary, but it shouldn't be used when there are easy alternatives. Part of REST (which Rails is largely built around) is representing a request with explicit attributes rather than letting its state be spread around like that.
From here, most solutions are probably acceptable, and opinion plays a greater role. One thing you could do is calculate a month from the +/- sign using the current month and send that to the server, which will return the information for the month requested.
I'm not a huge fan of this, though, as you'd have to write JavaScript capable of creating valid date ranges, and most of this functionality will probably be on the server already. Just passing a +/- and the current month to the server would work as well; you'd just have to do a bit of additional routing and logic to resolve the sign into a different month.
While either of these would work, my preferred solution would instead have the initial request for a month also generate valid representations of the neighbouring months and return them to the client. Then, when you update the graph with the requested data, you also replace the forward/backward links on the graph with the ones provided by the server. This is a nice fusion of the benefits of the prior two solutions: no additional routing on the server, and no substantive addition to the client-side code. You also gain the ability to grey out transitions to months where no data was collected for the client (i.e. before they were a customer, or in the future). Without this, you'd have to create separate logic to handle client requests for information that doesn't exist, which is extra work for you and more confusion for the customer.
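To illustrate that last option, the JSON for one month might look like this. The field names are invented for the sketch, and a null link is how the server would signal a greyed-out transition:

```javascript
// Hypothetical shape of the controller's JSON response: the chart data
// plus ready-made links for the previous/next month, so the client
// never has to do its own date arithmetic.
const monthResponse = {
  month: "2014-03",
  rows: [ /* chart data points */ ],
  prevMonth: "2014-02", // null if before the customer existed
  nextMonth: null       // null here because 2014-04 is in the future
};
```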
As part of a small project I'm working on, I need to be able to parse a string into a custom object, which represents an action, date and a few other properties. The tricky part is that the input string can come in a variety of flavors that all need to be properly parsed.
Input strings may be in the following formats:
Go to work tomorrow at 9am
Wash my car on Monday, at 3 pm
Call the doctor next Tuesday at 10am
Fill out the rebate form in 3 days at 2:30pm
Wake me up every day at 7:00am
And the output object would look something like this:
{
  "Action": "Wash my car",
  "DateTime": "2011-12-26 3:00PM", // Format is irrelevant at this point
  "Recurring": false,
  "RecurrenceType": ""
}
At first I thought of constructing some sort of tree to represent different states (On, In, Every, etc.) with different outcomes and further states (candidate for a state machine, right?). However, the more I thought about this, the more it started looking like a grammar parsing problem. Due to a (limited) number of ways the sentence could be formed, it looks like some sort of grammar parsing algorithm would need to be implemented.
In addition, I'm doing this on the front end, so JavaScript is the language of choice here. Back end will be written in Python and could be used by calling AJAX methods, if necessary, but I'd prefer to keep it all in JavaScript. (To be honest, I don't think the language is a big issue here).
So, am I in way over my head? I have a strong JavaScript background, but nothing beyond school courses when it comes to language design, parsing, etc. Is there a better way to solve this problem? Any suggestions are greatly appreciated.
I don't know a lot about grammar parsing, but something here might help.
My first thought is that your sentence syntax seems pretty consistent: the first 3-4 words are generally VERB text NOUN, followed by some form of time. If the forms the sentence can take are limited like this, you can hard-code some parsing rules.
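For example, a few hard-coded rules in JavaScript go surprisingly far on the sample sentences. This is only a sketch of that approach, not a real grammar; the regexes cover just the patterns shown in the question and would need extending for anything else:

```javascript
// Tiny rule-based parse of "ACTION <when> at <time>" strings.
// The regexes cover only the question's example sentences.
function parseReminder(text) {
  const recurring = /\bevery (day|week|month)\b/i.test(text);
  const timeMatch = text.match(/\bat (\d{1,2}(?::\d{2})?\s*(?:am|pm))\b/i);
  const dateMatch = text.match(
    /\b(tomorrow|today|(?:next )?(?:mon|tues|wednes|thurs|fri|satur|sun)day|in \d+ days?|every (?:day|week|month))\b/i
  );
  // The action is everything before the first date/time phrase.
  const cutAt = Math.min(
    dateMatch ? dateMatch.index : text.length,
    timeMatch ? timeMatch.index : text.length
  );
  const action = text
    .slice(0, cutAt)
    .replace(/\b(?:on|in)\s*$/i, "") // drop a dangling preposition
    .replace(/[,\s]+$/, "")
    .trim();
  return {
    action,
    when: dateMatch ? dateMatch[1] : null,
    time: timeMatch ? timeMatch[1] : null,
    recurring,
  };
}
```

Mapping the when/time strings to an actual DateTime is the part where a real parser (or a date library) starts to pay off.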
I also ran across a couple of js grammar parsers that might get you somewhere:
http://jscc.jmksf.com/
http://pegjs.majda.cz/
http://www.corion.net/perl-dev/Javascript-Grammar.html
This is an interesting problem you have. Please update this with your solutions later.