How much data (how many records) can Vega handle without any noticeable delay in response to signals created for interactivity?
I couldn't use the URL because of cross-origin problems, so I created JSON for 60,000 records and pasted it into my Vega specification.
I have 4 signals in total: 3 for single-value filtering on mouse click and 1 for range filtering using click-and-drag selection. The resulting dashboard responds to each signal trigger only after nearly 30 seconds.
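For context, the single-value click signals follow a pattern roughly like this sketch (Vega 3-style syntax; the signal, field, and mark names here are just placeholders, not my actual spec):

"signals": [
  {
    "name": "selectedCategory",
    "value": null,
    "on": [{"events": "rect:click", "update": "datum.category"}]
  }
],
"data": [
  {
    "name": "table",
    "values": [],
    "transform": [
      {"type": "filter", "expr": "!selectedCategory || datum.category == selectedCategory"}
    ]
  }
]

The 60,000 records are pasted inline where the empty values array is shown above.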
So I wanted to know the maximum amount of data that can be used in Vega, and also about any alternatives, such as interfacing Vega with something else to speed up the process. Any help will be appreciated.
Related
I'm looking for suggestions on how to handle the following use case with the Python Django framework; I'm also open to using JavaScript libraries/AJAX.
I'm working with a pre-existing table/model called revenue_code with over 600 million rows of data.
The user needs to search three fields within one search (code, description, room) and be able to select multiple search results, similar to the Kendo multi-select control. I first started by combining the codes with django-filter as shown in the link below, but my application became unresponsive; after waiting 10-15 minutes I was able to view the search results, but I couldn't select anything.
https://simpleisbetterthancomplex.com/tutorial/2016/11/28/how-to-filter-querysets-dynamically.html
I've also tried Kendo controls, Select2, and Chosen, because I need the user to be able to select as many revenue codes as they need (upward of 10-20), but all gave the same unresponsive page when attempting to load the data into the control/multi-select.
Essentially, what I'm looking for is something like the link below, which lets the user make multiple selections and handles a massive amount of data without becoming unresponsive. Ideally I'd like to be able to run my search query without displaying all the data.
https://petercuret.com/add-ajax-to-django-without-writing-javascript/
Is the Django framework meant to handle this type of volume? Would it be better to export this data into a file and read the file? I'm not looking for code, just some pointers on how to handle this use case.
What is the basic mechanism behind "searching 600 million rows"? Basically, what a database does is build an index before search time, general enough for different types of queries, and then at search time it searches only the index, which is much smaller (so it can fit in memory) and faster. But no matter what, "searching" by its nature has no "pagination" concept, and if 600 million records cannot fit in memory at the same time, then parts of those records must be repeatedly swapped in and out of memory - the more parts, the slower the operation. All of this is hidden behind the algorithms in databases like MySQL.
There are very compact representations, like bitmap indexes, which let you search data such as male/female very fast - or any data where you can use one bit per piece of information.
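To illustrate the idea (this is just a sketch of the concept, not how MySQL implements it): one bit per row means 600 million rows fit in roughly 75 MB, small enough to keep in memory and scan with cheap bitwise operations.

function buildBitmap(rows, predicate) {
  // one bit per row: bit i is set when predicate(rows[i]) is true
  var bits = new Uint8Array(Math.ceil(rows.length / 8));
  for (var i = 0; i < rows.length; i++) {
    if (predicate(rows[i])) bits[i >> 3] |= (1 << (i & 7));
  }
  return bits;
}

function testBit(bits, i) {
  // check whether row i matched the predicate
  return (bits[i >> 3] & (1 << (i & 7))) !== 0;
}

// e.g. var femaleIndex = buildBitmap(rows, function (r) { return r.gender === 'F'; });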
So whether you use Django or not does not really matter. What matters is the tuning of the database, the design of the tables to facilitate the queries (the types of indexes), and the total amount of memory on the server end to keep the data in memory.
Check this out:
https://dba.stackexchange.com/questions/20335/can-mysql-reasonably-perform-queries-on-billions-of-rows
https://serverfault.com/questions/168247/mysql-working-with-192-trillion-records-yes-192-trillion
How many rows are 'too many' for a MySQL table?
You can't load all the data into your page at once. 600 million records is too many.
Since you mentioned select2, have a look at their example with pagination.
The trick is to limit your SQL results to maybe 100 or so at a time. When the user scrolls to the bottom of the list, it can automatically load in more.
Send the search query to the server, and do the filtering in SQL (or NoSQL or whatever you use). Database engines are built for that. Don't try filtering/sorting in JS with that many records.
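For example, a rough sketch of that setup with Select2's ajax option (the /revenue-codes/search/ endpoint and the response shape are assumptions about your backend, not something Select2 provides):

$('#rev-codes').select2({
  multiple: true,
  minimumInputLength: 2,             // don't hit the server until the user types something
  ajax: {
    url: '/revenue-codes/search/',   // hypothetical Django view that filters in SQL
    dataType: 'json',
    delay: 250,                      // debounce keystrokes
    data: function (params) {
      // send the search term and the page number; the server returns ~100 rows per page
      return { q: params.term, page: params.page || 1 };
    },
    processResults: function (data, params) {
      params.page = params.page || 1;
      return {
        results: data.items,                    // expected shape: [{id: ..., text: ...}, ...]
        pagination: { more: data.has_more }     // true enables "load more" on scroll
      };
    }
  }
});

On the server side, the view behind that (assumed) URL would take q and page, filter the queryset on code/description/room, and return only that one slice, so the browser never holds more than a page of rows at a time.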
This question is kind of a two-parter. I would appreciate it if I could get answers for both parts.
Part 1 - Wider date range shows fewer visits than narrower range
I am having trouble with a query of mine. These are the dimensions, metrics, and filters I am using:
metrics: ga:visits,ga:visitBounceRate,ga:goalCompletionsAll,ga:goalConversionRateAll
dimensions: ga:landingPagePath,ga:medium
filters: ga:landingPagePath==/online-access/benefits-online-account-video.html;ga:visits>1;ga:medium==organic,ga:hasSocialSourceReferral==Yes,ga:medium==cpc,ga:medium==CPC
If start-date is 2015-01-01 and end-date is 2015-05-30, the query returns 0 for all fields, but if I change the start date to 2015-02-01 and the end date to 2015-02-28, I get 17 visits, and the other metric fields have values that look more correct. After investigating it a little bit, I thought maybe it had something to do with a dimensions-metric mismatch, but at this point I really have no idea.
Part 2 - Matching Dimensions and Metrics
I am a little confused about how you can combine metrics and dimensions. For example, the Dimensions and Query Explorer seems to suggest that ga:sessions can only be combined with ga:sessionDurationBucket or the deprecated ga:visitLength. But this sample code from the GA API reference shows ga:sessions combined with ga:source and ga:keyword, neither of which is listed as a compatible dimension in the explorer.
Conclusion
Can someone make sense of this for me? I'm pretty new to Google Analytics in general, let alone accessing it through the API, so I may not have yet picked up the fundamentals needed to really understand what is happening here.
Edit: The reports in the actual Google Analytics interface say they used 100 percent of sessions, so I believe that means it is not using sampled data. Also, the graphs show sessions up in the millions for the wider date range, and if I narrow the date range to just February, the interface shows 1,189,675 sessions. (I was going to say that 17 sessions is ridiculously wrong anyway, but I just realized that figure is not filtered for the specific page I'm filtering on in the query.)
(I know I'm using some deprecated ga: values, but it really should work the same, no?)
Part 1 -- you are likely running into a sampling issue. Although your edit says you are getting unsampled data, it is not clear if you are looking at the same report (and date range) as your query. Some reports are unsampled, but when you add custom filters, they tend to increase the probability that GA will sample. Try to rebuild the report in your GA user interface first, and determine if it's sampled or not. (Click on the Customization tab in this image.)
Part 2 -- I'm not sure where you are getting 'compatible' dimensions and metrics here. The real conflict in Google Analytics is whether you're filtering by users or by sessions. If you're using sessions, then you can extract any dimension (think x-axis on a chart) that gives information about a session: where it came from (i.e. source and medium), when it occurred (i.e. date), which session duration bucket it falls into. The dimension is the grouping parameter for your metric (think y-axis).
So in the sample code you linked:
Get apiQuery = analytics.data().ga()
    .get(tableId,        // Table Id.
         "2012-01-01",   // Start date.
         "2012-01-15",   // End date.
         "ga:sessions")  // Metrics.
    .setDimensions("ga:source,ga:keyword")
    .setSort("-ga:sessions,ga:source")
    .setFilters("ga:medium==organic")
    .setMaxResults(25);
This is going to pull the number of sessions (your metric) that occurred from 2012-01-01 to 2012-01-15, grouped by source and keyword (your dimensions), for the organic medium only, then sorted by sessions and source for readability. It's the same information that shows up in your Google Analytics -> Acquisition reporting menu, so it's kosher to query the API the same way.
I have been working with dc.js and crossfilter.js, and I currently have a large dataset with 550,000 rows (a 60 MB CSV). I am facing a lot of issues with it, like browser crashes, etc.
So I'm trying to understand how dc and crossfilter deal with large datasets.
http://dc-js.github.io/dc.js/
The example on their main site runs very smoothly, and watching Timeline -> Memory (in the console) it goes to a maximum of 34 MB and slowly decreases over time.
My project takes up memory in the range of 300-500 MB per dropdown selection, when it loads a JSON file and renders the entire visualization.
So, two questions:
What is the backend for the dc site example? Is it possible to find out the exact backend file?
How can I reduce the data overload on my RAM from my application, which is running very slowly and eventually crashing?
Hi, you can try loading the data and filtering it on the server. I faced a similar problem when my dataset was too big for the browser to handle.
I posted a question a few weeks back about implementing the same thing: "Using dc.js on the clientside with crossfilter on the server".
Here is an overview of how to go about it.
On the client side, you'd want to create fake dimensions and fake groups that have the basic functionality that dc.js expects (https://github.com/dc-js/dc.js/wiki/FAQ#filter-the-data-before-its-charted). You create your dc.js charts on the client side and plug in the fake dimensions and groups wherever required.
Now on the server side you have crossfilter running (https://www.npmjs.org/package/crossfilter). You create your actual dimensions and groups here.
The fake dimensions have a .filter() function that basically sends an AJAX request to the server to perform the actual filtering. The filtering information could be encoded in the form of a query string. You'd also need an .all() function on your fake group to return the results of the filtering.
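A rough sketch of what such a fake dimension and fake group could look like (the /xfilter endpoint, its query-string parameters, and the response shape are all assumptions; real charts may expect a few more methods, as the dc.js FAQ entry above explains):

function makeFakeGroup() {
  var rows = [];                        // latest grouped rows returned by the server
  return {
    all: function () { return rows; },  // dc.js reads the group through all() when drawing
    update: function (newRows) { rows = newRows; }
  };
}

function makeFakeDimension(name, group) {
  return {
    filter: function (value) {
      // Ship the filter to the server as a query string instead of filtering
      // 550,000 rows in the browser; redraw once the regrouped data comes back.
      $.getJSON('/xfilter', { dimension: name, filter: JSON.stringify(value) })
        .done(function (data) {
          group.update(data);           // e.g. [{key: '2014-01', value: 123}, ...]
          dc.redrawAll();
        });
      return this;
    },
    filterAll: function () { return this.filter(null); },
    dispose: function () {}             // dc.js may call this when a chart is removed
  };
}

// Plugged in like a normal dimension/group:
// var group = makeFakeGroup();
// var dim = makeFakeDimension('date', group);
// dc.barChart('#chart').dimension(dim).group(group);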
I'm just starting to approach this problem: I want to allow users to arbitrarily select ranges and filters so they can graph large data sets (realistically it should never be more than 10 million data points) on a web page. I use Elasticsearch for storing and aggregating the data, along with Redis for keeping track of summary data, and d3.js is my graphing library.
My thought on the best solution is to have precalculated summaries in different groupings that can be graphed from. So if the data points span several years, I can have groupings by month and day (which I would be doing anyway), but then also groupings of, say, half day, quarter day, hour, half hour, etc. Then, before I query for graph data, I do a quick calculation to see which of these groupings will give me some ideal number of data points (say 1,000).
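For illustration, the "quick calculation" I have in mind would be something like this (the bucket names and sizes are just examples):

var BUCKETS = [                          // coarsest to finest, sizes in milliseconds
  { name: 'day',         ms: 24 * 3600 * 1000 },
  { name: 'half_day',    ms: 12 * 3600 * 1000 },
  { name: 'quarter_day', ms:  6 * 3600 * 1000 },
  { name: 'hour',        ms:       3600 * 1000 },
  { name: 'half_hour',   ms:       1800 * 1000 }
];

function pickBucket(startMs, endMs, targetPoints) {
  var span = endMs - startMs;
  // use the finest grouping that still keeps the point count at or below the target
  for (var i = BUCKETS.length - 1; i >= 0; i--) {
    if (span / BUCKETS[i].ms <= targetPoints) return BUCKETS[i].name;
  }
  return BUCKETS[0].name;                // range too wide for the target: fall back to the coarsest grouping
}

// pickBucket(Date.parse('2015-01-01'), Date.parse('2015-01-10'), 1000) -> 'half_hour'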
Is this a reasonable way to approach the problem? Is there a better way?
You should reconsider the amount of data...
Even in desktop plotting apps it is uncommon to show that many points per plot - e.g., Origin prints a warning that it will show only a subset, for performance reasons. You could, for example, throw away every 3rd point to reduce the count.
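For example, a trivial thinning helper along those lines (purely illustrative; keeping every Nth point is just one crude way to subsample):

function thin(points, step) {
  // keep indices 0, step, 2*step, ... and drop the rest
  return points.filter(function (p, i) { return i % step === 0; });
}

// thin(series, 3) keeps roughly a third of the points before they reach the chart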
You should give the user the ability to zoom in or navigate around to explore the data, in a pagination-like style.
Grouping - or "faceting", as it is called in the Lucene community - is of course possible with that many documents, but be sure you have enough RAM and CPU.
You typically can't graph more points than you have dots (pixels) on your screen. So to graph 1M points you'd need a really good monitor.
I'm using MIT's Simile to display thumbnails and links with faceted filtering. It works great, but large data sets (greater than 500 elements) start to slow down significantly. My user base will tolerate seconds, but not tens of seconds, and certainly not minutes, while the page renders.
Is it the volume of data in the JSON structure?
Is it Simile's method of parsing?
Too slow compared to what? It's probably faster than XML and easier to implement than your own custom binary format.
Exhibit version 3 (http://simile-widgets.org/exhibit) provides good interaction with up to 100,000 items. Displaying them all can take some time if the individual items' lens template is complicated, but if you use pagination then loading, filtering, and display are all pretty quick.