I am trying to do a simple forecast of future profit of an organization based on the past records by using regression. I am following this link. For testing purpose, I have changed the sample data and it produced these results:
My actual data will be date and profit, and they will be going up and down rather than in a contiguous increment way. I realized that the method above works for sample data which keep on increasing as the prediction is quite accurate. However, when I changed the data to the one in the screenshot which goes up and down crazily, the prediction is not so accurate anymore.
Just wondering if there is any way to increase the accuracy for the regression as my data will be going up and down.
Thanks!
When you do a regression you are fitting a model to the data. In other words you are saying "here is an equation that describes roughly how the data behaves". In the linear regression case the model / equation is:
y = a * x + b
Where x is the input and y is the output. By doing a linear regression you are saying "my data follows a straight line, here is my data, what are the parameters a and b that best fit the data?".
Obviously if your data does not follow a straight line this will work badly. For instance look at this image I found on Google Images.
Clearly you can see the data has some kind of complex wavy shape - it goes up and down and then up again. The linear model is not complex enough to express this shape (it can only do straight lines). So it doesn't fit well.
Since you need a more complex model you have to choose one. There are dozens of standard ones and you can make up your own. All the model is is an equation with some fixed parameters that can be adjusted so that the equation fits your data.
I suggest you play around with the trend line options in Excel or Google Sheets to get a feel for this. See the Trendline Types bit here for some common models.
Note that none of those will work well for monthly profit because none of them are really cyclical. You probably want a model that is a combination of some repeating multipliers to capture month-to-month variations, and then a linear or polynomial component to capture the fact that yearly profit is increasing or decreasing over time.
You don't want a model that is too expressive however, otherwise you will overfit the data (basically it will see patterns in the noise).
Related
I am trying to draw a visualization showing some trend over time.
In my line plot, I have date as the X variable, some other number as the Y variable. I used d3.time scale for x and d3.linear scale for Y. The line plot is fine.
Then I tried to draw a linear regression line, but I failed, because the data for x is not numerical. I searched and searched. This post has a nice adaptable regression code, but that's for numerical data; this post has a graph similar to what I'm shooting, but it uses ordinal scale. I am wondering if there is any simpler way to make the linear regression code reusable for my time series data (e.g., "09-Mar-2016"). Any advice?
I don't know anything about javascript, but I am quite familiar with this problem. One solution: convert those date-times into units from a known date-time, whether seconds, or hours, or days.
If your dataset supports it, take the max and min of your date-times, and subtract the min from each. If you are just stuck with text, you may have to parse the values and do your own calendar math. However, there must be a library to handle this.
I have a large set (>2000) of time series data that I'd like to display using d3 in the browser. D3 is working great for displaying a subset of the data (~100 points) to the user, but I also want a "context" view (like this) to show the entire data set and allow users to select as subregion to view in detail.
However, performance is abysmal when trying to display that many points in d3. I feel like a good solution would be to select a sample of the data and then use some kind of interpolation (spline, polynomial, etc., this is the part I know how to do) to draw a curve that is reasonably similar to the actual data.
However, it's not clear to me how I ought to go about selecting the subset. The data (shown below) has rather flat regions where fewer samples would be needed for a decent interpolation, and other regions where the absolute derivative is quite high, where more frequent sampling is needed.
To further complicate matters, the data has gaps (where the sensor generating it was failing or out of range), and I'd like to keep these gaps in the chart rather than interpolating through them. Detection of the gaps is fairly simple though, and simply clipping them out after drawing the entire data set with the interpolation seems like a reasonable solution.
I'm doing this in JavaScript, but a solution in any language or a mathematical answer to the problem would do.
You could use the d3fc-sample module, which provides a number of different algorithms for sampling data. Here's what the API looks like:
// Create the sampler
var sampler = fc_sample.largestTriangleThreeBucket();
// Configure the x / y value accessors
sampler.x(function (d) { return d.x; })
.y(function (d) { return d.y; });
// Configure the size of the buckets used to downsample the data.
sampler.bucketSize(10);
// Run the sampler
var sampledData = sampler(data);
You can see an example of it running on the website:
https://d3fc.io/examples/sample/
The largest-triangle three-buckets algorithm works quite well on data that is 'patchy'. It doesn't vary the bucket size, but does ensure that peaks / troughs are included, which results in a good representation of the sampled data.
I know this doesn't answer your question entirely, but this library might help you to simplify your line during rendering. Not sure if they handle data gaps though.
http://mourner.github.io/simplify-js/
My advice is to average (not subsample) over longer or shorter time intervals and plot those average values as horizontal bars. I think that's very comprehensible to the user -- if you try something fancier, you might give up the ability to explain exactly what's going on. I'm assuming you can let the user choose to zoom in or out so as to show more or less detail.
You might be able to get the database engine to compute averages over intervals for you, so that's a potential speed-up too.
As to the time intervals to pick, you could try either (1) fixed intervals such as 1 second, 15 seconds, 1 minute, 15 minutes, hours, days, or whatever; that might be easier for the user to understand, or (2) choose the interval to make a fixed number of units across the whole time range, e.g. if you decide to display 7 hours of data in 100 units, then each unit = 252 seconds.
I have phasor information (polar vector data pairs, each with magnitude and angle, representing voltage and current measurements) that I would like to display using Javascript. They should look something like the image linked below (my rep isn't high enough to directly post it) which I stole from Jesse's question about MatPlotLib. I would also like to easily change which phasors are displayed by a simple mechanic like clicking on the legend entry.
See a phasor diagram example here.
While I have inspected several code sets, I have yet to find a chart package that is built to handle polar vectors like this. Is my Google-fu lacking or do I need to create everything from scratch?
I feel like this is a cheap workaround, but here's what I ended up doing:
I used the polar chart from jqWidgets and with the series type set to "column" and the flip property switched to "true." I put the data in an array with 0 entries for each possible angle except for where I wanted the phasor displayed. Each phasor gets a dedicated series so the legend lists them all. It's not perfect and the array is much larger than it really should need to be, but it's passable.
While it's not surprising that no power system display package is publicly available for Javascript, I'm sure one has to be out there for educational sites if nothing else.
I have a geojson object defining Neighborhoods in Los Angeles using lon/lat polygons. In my web application, the client has to process a live stream of spatial events, basically a list of lon/lat coordinates. How can I classify these coordinates into neighborhoods using Javascript on the client (in the browser)?
I am willing to assume neighborhoods are exclusive. So once a coordinate as been classified as neighborhood X, there is no need to further test it for other neighborhoods.
There's a great set of answers here on how to solve the general problem of determining whether a point is contained by a polygon. The two options there that sound the most interesting in your case:
As #Bubbles mentioned, do a bounding box check first. This is very fast, and I believe should work fine with either projected or unprotected coordinates. If you have SVG paths for the neighborhoods, you can use the native .getBBox() method to quickly get the bounding box.
the next thing I'd try for complex polygons, especially if you can use D3 v3, is rendering to an off-screen canvas and checking pixel color. D3 v3 offers a geo path helper that can produce canvas paths as well as SVG paths, and I suspect if you can pre-render the neighborhoods this could be very fast indeed.
Update: I thought this was an interesting problem, so I came up with a generalized raster-based plugin here: http://bl.ocks.org/4246925
This works with D3 and a canvas element to do raster-based geocoding. Once the features are drawn to the canvas, the actual geocoding is O(1), so it should be very fast - a quick in-browser test could geocode 1000 points in ~0.5 sec. If you were using this in practice, you'd need to deal with edge-cases better than I do here.
If you're not working in a browser, you may still be able to do this with node-canvas.
I've seen a few libraries out there that do this, but most of them are canvas libraries that may rely on approximations more than you'd want, and might be hard to adapt to a project which has no direct need to rely on them for intersections.
The only other half-decent option I can think of is implementing ray casting in javascript. This algorithm isn't technically perfect since it's for Euclidean geometry and lat/long coordinates are not (as they denote points on a curved surface), but for areas as small as a neighbourhood in a city I doubt this will matter.
Here's a google maps extension that essentially does this algorithm. You'd have to adapt it a bit, but the principles are quite similar. The big thing is you'd have to preprocess your coordinates into paths of just two coordinates, but that should be doable.*
This is by no means cheap - for every point you have to classify, you must test every line segment in the neighborhood polygons. If you expect a user to be reusing the same coordinates over and over between sessions, I'd be tempted to store their neighborhood as part of it's data. Otherwise, if you are testing against many, many neighborhoods, there are a few simple timesavers you can implement. For example, you can preprocess every neighborhoods extreme coordinates (get their northmost, eastmost, southmost, and westmost points), and use these to define a rectangle that inscribes the town. Then, you can first check the points for candidate neighborhoods by checking if it lies inside the rectangle, then run the full ray casting algorithm.
*If you decide to go this route and have any trouble adapting this code, I'd be happy to help
Any ideas on where to even begin with making a violin chart using d3? Does it exist already?
I've looked around and have figured out how to do it using ggplot2 and was hoping there'd be a ready-made example that I could learn from but haven't found one yet.
I suppose I could do a really painful process of making various size bars on top of each other, or taking a distribution, rotating it and mirroring it. But surely there's a better way.
I needed that for myself so here it is: violin plot
As far as I know, nobody has done this before, but it shouldn't be too hard. I would start as if I was making a line chart (or boxed instead of lines) for one half of a violin. That is, create the appropriate x and y scales and add the data in. The result of this I would rotate and translate to the correct position. Then do the same thing again and mirror it as well to get the other half of the violin.
This may sound complex, but SVG has built-in support for these operations (rotating and mirroring). You should be able to approach this pretty much like drawing a line graph of the distribution with 2-3 simple operations on top of that. Wrap everything in a function and you've got something you can call to create a violin.
It of course also depends in what form you have the data to make the plot. A line plot might not be feasible because of too few data points, but then you can easily use bars instead.