Linear regression with time series data

I am trying to draw a visualization showing some trend over time.
In my line plot, I have the date as the X variable and some other number as the Y variable. I used a d3 time scale for X and a d3 linear scale for Y. The line plot is fine.
Then I tried to draw a linear regression line, but I failed because the X data is not numerical. I searched and searched. This post has nicely adaptable regression code, but that's for numerical data; this post has a graph similar to what I'm shooting for, but it uses an ordinal scale. I am wondering if there is a simpler way to make the linear regression code reusable for my time series data (e.g., "09-Mar-2016"). Any advice?

I don't know anything about JavaScript, but I am quite familiar with this problem. One solution: convert those date-times into a number of units elapsed since a known date-time, whether seconds, hours, or days.
If your dataset supports it, take the max and min of your date-times, and subtract the min from each. If you are just stuck with text, you may have to parse the values and do your own calendar math. However, there must be a library to handle this.
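
As a rough illustration of the approach above, here is a minimal JavaScript sketch. It assumes the dates look like "09-Mar-2016" (the format from the question); the sample data, the `date`/`value` field names, and the helper names are made up for the example. It converts each date to days since the earliest date, fits an ordinary least-squares line, and maps the line's endpoints back to Date objects so they can be drawn with a time scale.

```javascript
// Parse dates such as "09-Mar-2016" into UTC milliseconds.
// The DD-Mon-YYYY format is an assumption based on the question's example.
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

function parseDate(text) {
  const [day, mon, year] = text.split("-");
  return Date.UTC(Number(year), MONTHS.indexOf(mon), Number(day));
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Ordinary least squares for y = slope * x + intercept.
function linearRegression(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
  }
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Hypothetical sample data.
const data = [
  { date: "09-Mar-2016", value: 10 },
  { date: "10-Mar-2016", value: 12 },
  { date: "12-Mar-2016", value: 16 },
];

// Subtract the minimum so x becomes "days since the first date".
const times = data.map(d => parseDate(d.date));
const t0 = Math.min(...times);
const days = times.map(t => (t - t0) / MS_PER_DAY);
const fit = linearRegression(days, data.map(d => d.value));

// Endpoints of the regression line, converted back to dates for plotting.
const lineEnds = [Math.min(...days), Math.max(...days)].map(d => ({
  date: new Date(t0 + d * MS_PER_DAY),
  value: fit.slope * d + fit.intercept,
}));
```

Working in days rather than raw milliseconds keeps the slope readable (units of Y per day), but any consistent unit should give the same line.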

Related

Simple Regression Prediction Algorithm in JavaScript

I am trying to do a simple forecast of an organization's future profit based on past records, using regression. I am following this link. For testing purposes, I changed the sample data and it produced these results:
My actual data will be date and profit, and the values will go up and down rather than increase steadily. I realized that the method above works for sample data that keeps increasing, where the prediction is quite accurate. However, when I changed the data to the one in the screenshot, which goes up and down wildly, the prediction is no longer accurate.
Is there any way to improve the accuracy of the regression, given that my data will be going up and down?
Thanks!

When you do a regression you are fitting a model to the data. In other words you are saying "here is an equation that describes roughly how the data behaves". In the linear regression case the model / equation is:
y = a * x + b
Where x is the input and y is the output. By doing a linear regression you are saying "my data follows a straight line, here is my data, what are the parameters a and b that best fit the data?".
Obviously if your data does not follow a straight line this will work badly. For instance look at this image I found on Google Images.
Clearly you can see the data has some kind of complex wavy shape - it goes up and down and then up again. The linear model is not complex enough to express this shape (it can only do straight lines). So it doesn't fit well.
Since you need a more complex model, you have to choose one. There are dozens of standard ones and you can make up your own. A model is simply an equation with some parameters that can be adjusted so that the equation fits your data.
I suggest you play around with the trend line options in Excel or Google Sheets to get a feel for this. See the Trendline Types bit here for some common models.
Note that none of those will work well for monthly profit because none of them are really cyclical. You probably want a model that is a combination of some repeating multipliers to capture month-to-month variations, and then a linear or polynomial component to capture the fact that yearly profit is increasing or decreasing over time.
You don't want a model that is too expressive however, otherwise you will overfit the data (basically it will see patterns in the noise).
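
To make the trend-plus-multipliers idea a bit more concrete, here is a minimal JavaScript sketch. It is one simple way to build such a model, not a standard library routine: it fits a straight line to the whole series, then averages the ratio of actual value to trend value for each calendar month. The function names, the `month`/`value` fields, and the sample history are all hypothetical, and the sketch assumes one record per month and a trend that stays positive (the ratios break down otherwise).

```javascript
// Least-squares fit of y = a * x + b.
function fitLine(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
  }
  const a = sxy / sxx;
  return { a, b: meanY - a * meanX };
}

// series: chronological array of { month: 0..11, value }, one entry per month.
// Returns a function predicting the value at index i (months since the
// first record) for the given calendar month.
function fitSeasonalTrend(series) {
  const { a, b } = fitLine(series.map((_, i) => i), series.map(d => d.value));

  // Average ratio of actual value to trend value, per calendar month.
  const sums = new Array(12).fill(0);
  const counts = new Array(12).fill(0);
  series.forEach((d, i) => {
    sums[d.month] += d.value / (a * i + b);
    counts[d.month] += 1;
  });
  const multipliers = sums.map((s, m) => (counts[m] ? s / counts[m] : 1));

  return (i, month) => (a * i + b) * multipliers[month];
}

// Hypothetical two-year history: rising trend, stronger odd-numbered months.
const history = Array.from({ length: 24 }, (_, i) => ({
  month: i % 12,
  value: (100 + 2 * i) * (i % 2 === 0 ? 0.8 : 1.2),
}));

const predict = fitSeasonalTrend(history);
const nextMonthForecast = predict(history.length, history.length % 12);
```

With only 12 multipliers and 2 trend parameters this stays fairly inexpressive, which helps against the overfitting problem mentioned above, but it still needs at least a couple of years of data to estimate the monthly effects.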

D3 timeScale remove month boundary tick

I'm trying to build a box plot representing stats of a lot of data (x-time, y-time). The idea is to use d3.extent() to get the domain and then feed it to timeScale to get ticks at nicely chosen intervals. Then I calculate stats for the data points between the ticks. Example below:
The approach works well most of the time, except for the 1st day of the month, which creates an additional, unevenly spaced tick (see above "Wed 31, June"). I'm currently calculating bar widths and positions manually, assuming that the ticks are spaced evenly, so the whole chart breaks.
The data can span any range of time from a week to a couple of years, so dynamic ticks are required.
Any ideas how to remove the additional tick or make all the ticks spread evenly?
EDIT. Not sure if removing the tick is going to do the trick... probably even spacing is more important here. I looked up other people's questions about similar problems but I'm still confused how to solve my problem.
I've just started using D3 so forgive me if the question is trivial. I'll be happy if you just point me in the right direction.

Reduce the size of a large data set by sampling/interpolation to improve chart performance

I have a large set (>2000) of time series data that I'd like to display using d3 in the browser. D3 is working great for displaying a subset of the data (~100 points) to the user, but I also want a "context" view (like this) to show the entire data set and allow users to select a subregion to view in detail.
However, performance is abysmal when trying to display that many points in d3. I feel like a good solution would be to select a sample of the data and then use some kind of interpolation (spline, polynomial, etc., this is the part I know how to do) to draw a curve that is reasonably similar to the actual data.
However, it's not clear to me how I ought to go about selecting the subset. The data (shown below) has rather flat regions where fewer samples would be needed for a decent interpolation, and other regions where the absolute derivative is quite high, where more frequent sampling is needed.
To further complicate matters, the data has gaps (where the sensor generating it was failing or out of range), and I'd like to keep these gaps in the chart rather than interpolating through them. Detection of the gaps is fairly simple though, and simply clipping them out after drawing the entire data set with the interpolation seems like a reasonable solution.
I'm doing this in JavaScript, but a solution in any language or a mathematical answer to the problem would do.

You could use the d3fc-sample module, which provides a number of different algorithms for sampling data. Here's what the API looks like:
// Create the sampler
var sampler = fc_sample.largestTriangleThreeBucket();
// Configure the x / y value accessors
sampler.x(function (d) { return d.x; })
    .y(function (d) { return d.y; });
// Configure the size of the buckets used to downsample the data.
sampler.bucketSize(10);
// Run the sampler
var sampledData = sampler(data);
You can see an example of it running on the website:
https://d3fc.io/examples/sample/
The largest-triangle three-buckets algorithm works quite well on data that is 'patchy'. It doesn't vary the bucket size, but does ensure that peaks / troughs are included, which results in a good representation of the sampled data.
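
For anyone curious what the algorithm does internally, here is a minimal standalone JavaScript sketch of largest-triangle three-buckets. It is not the d3fc-sample source, and unlike the API above it takes a target number of output points (`threshold`) rather than a bucket size; the function name and the `{x, y}` point shape are assumptions for the example. It always keeps the first and last points, and from each bucket in between it keeps the point that forms the largest triangle with the previously kept point and the average of the next bucket.

```javascript
// data: array of { x, y } sorted by x. threshold: desired number of points.
function largestTriangleThreeBuckets(data, threshold) {
  const n = data.length;
  if (threshold >= n || threshold < 3) return data.slice();

  const sampled = [data[0]];                  // always keep the first point
  const every = (n - 2) / (threshold - 2);    // average bucket width
  let a = 0;                                  // index of the last kept point

  for (let i = 0; i < threshold - 2; i++) {
    // Average of the NEXT bucket, used as the third triangle vertex.
    const avgStart = Math.floor((i + 1) * every) + 1;
    const avgEnd = Math.min(Math.floor((i + 2) * every) + 1, n);
    let avgX = 0;
    let avgY = 0;
    for (let j = avgStart; j < avgEnd; j++) {
      avgX += data[j].x;
      avgY += data[j].y;
    }
    avgX /= avgEnd - avgStart;
    avgY /= avgEnd - avgStart;

    // Pick the point in the CURRENT bucket that gives the largest triangle.
    const start = Math.floor(i * every) + 1;
    const end = Math.floor((i + 1) * every) + 1;
    let maxArea = -1;
    let chosen = start;
    for (let j = start; j < end; j++) {
      const area = Math.abs(
        (data[a].x - avgX) * (data[j].y - data[a].y) -
        (data[a].x - data[j].x) * (avgY - data[a].y)
      ) / 2;
      if (area > maxArea) {
        maxArea = area;
        chosen = j;
      }
    }
    sampled.push(data[chosen]);
    a = chosen;
  }

  sampled.push(data[n - 1]);                  // always keep the last point
  return sampled;
}
```

Because the sketch knows nothing about gaps, one option is to split the series at each gap first and downsample every segment separately.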

I know this doesn't answer your question entirely, but this library might help you to simplify your line during rendering. Not sure if they handle data gaps though.
http://mourner.github.io/simplify-js/

My advice is to average (not subsample) over longer or shorter time intervals and plot those average values as horizontal bars. I think that's very comprehensible to the user -- if you try something fancier, you might give up the ability to explain exactly what's going on. I'm assuming you can let the user choose to zoom in or out so as to show more or less detail.
You might be able to get the database engine to compute averages over intervals for you, so that's a potential speed-up too.
As to the time intervals to pick, you could try either (1) fixed intervals such as 1 second, 15 seconds, 1 minute, 15 minutes, hours, days, or whatever; that might be easier for the user to understand, or (2) choose the interval to make a fixed number of units across the whole time range, e.g. if you decide to display 7 hours of data in 100 units, then each unit = 252 seconds.
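
A minimal JavaScript sketch of this averaging, covering both ways of picking the interval. It assumes each point has a timestamp `t` in milliseconds and a value `y`; those field names and the function names are made up for the example.

```javascript
// points: array of { t: milliseconds, y: number }, in any order.
// intervalMs: width of each averaging interval, e.g. 252000 for 252 seconds.
// Returns one { start, end, mean } entry per interval that contains data.
function averageByInterval(points, intervalMs) {
  const buckets = new Map();
  for (const { t, y } of points) {
    const key = Math.floor(t / intervalMs);
    const bucket = buckets.get(key) || { sum: 0, count: 0 };
    bucket.sum += y;
    bucket.count += 1;
    buckets.set(key, bucket);
  }
  return [...buckets.entries()]
    .sort((p, q) => p[0] - q[0])
    .map(([key, bucket]) => ({
      start: key * intervalMs,
      end: (key + 1) * intervalMs,
      mean: bucket.sum / bucket.count,
    }));
}

// Option (2): choose the interval so the whole range spans a fixed
// number of units.
function intervalForUnits(points, units) {
  let min = Infinity;
  let max = -Infinity;
  for (const { t } of points) {
    if (t < min) min = t;
    if (t > max) max = t;
  }
  return (max - min) / units;
}
```

Intervals that contain no points produce no entry, so gaps in the sensor data show up as missing bars instead of being averaged across.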

Chart library which can scale the x axis correctly

I am looking for a chart library in JavaScript.
It has to support lines (I suppose all charting libraries do that).
It has to support zooming, due to the large amount of data.
The problem I have found while using other libraries is scaling the x axis.
I get data by strings:
y=[43,56,34,63....]
x=[24/04/12 22:47,...]
But the number of lines and the interval are specified by the user, meaning that I can have 50 data points or 500. The problem comes when I input these dates and times. I can't find a library that will look at the length of the string and then show maybe 4-5 of them when zoomed out, and show more detail when zoomed in.
Money is not a problem, but it needs to have a trial version.
Edit: I have tried libraries which allow me to set a start date and then the interval between points. But my intervals are not constant, so that can't be used either.

Try amCharts. This library supports dates as series. Your job will be to convert your date strings into JavaScript Date objects, which is quite a simple task. Here is an example of a chart with date-based data:
http://amcharts.com/javascript/line-chart-with-date-based-data/
Another one with date/time based data:
http://amcharts.com/javascript/area-chart-with-time-based-data/
You can download and try this library.
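
For the date conversion step, a minimal JavaScript sketch might look like the following. It assumes the strings are in day/month/two-digit-year order with a 24-hour time, as "24/04/12 22:47" suggests, and that two-digit years mean 20YY. The function name, the second timestamp, and the `date`/`value` field names are made up for the example and are not part of the amCharts API.

```javascript
// Parse "DD/MM/YY HH:mm" into a Date in local time.
// Returns null when the text does not match the expected pattern.
function parseTimestamp(text) {
  const match = /^(\d{2})\/(\d{2})\/(\d{2}) (\d{2}):(\d{2})$/.exec(text.trim());
  if (!match) return null;
  const [, day, month, year, hours, minutes] = match.map(Number);
  // JavaScript months are zero-based; two-digit years are assumed to be 20YY.
  return new Date(2000 + year, month - 1, day, hours, minutes);
}

// Pair each parsed date with its value. The second timestamp is hypothetical.
const x = ["24/04/12 22:47", "24/04/12 23:05"];
const y = [43, 56];
const points = x.map((text, i) => ({ date: parseTimestamp(text), value: y[i] }));
```

Once the x values are real Date objects, the spacing between points reflects the actual time between samples, so non-constant intervals are no longer a problem for a date-based axis.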

Format Y axis tick mark numbers

I am using the annotated timeline visualization and I am wondering if it is possible to format the numbers that are on the side of the Y axis (on the Y-axis 'tick marks').
What I need explicitly is a way for these numbers to be truncated to whole integers, and not display the decimal parts, so they do not take as much space.
I've gone through the API documentation and there doesn't seem to be a way to do this. Am I wrong?
Thank you,
