Fusion Tables query speed - javascript

I'm trying to implement the autocomplete logic available in the Fusion Tables interface using just client-side JavaScript.
So far I found this, which works great: https://developers.google.com/fusiontables/docs/samples/autocomplete
It allows me to retrieve all the values for a property, grouped together, so I can autocomplete them. The issue is that it's extremely slow. The query
"SELECT 'Store Name', COUNT() " +
"FROM " + tableId + " GROUP BY 'Store Name'"
takes up to 10 seconds to run, each time. This is because my table is quite big, with more than 150,000 rows.
However, the native Fusion Tables interface is dead fast. I tried looking into the code to see what type of query they were making (maybe they have a cache of these results), but I couldn't find anything to lead me to a solution.
Any ideas? My thinking is that if the Google native interface is doing it, then there most definitely is a way for me to do it as well... I want to avoid having to use a third-party server to cache these results; that would be an easy fix, but it's not a solution to my problem.

I think they use something like a nested set plus a trie data structure on the server side. A nested set is fast for queries but not for insertion, and a trie is also fast for text retrieval. I think you can combine the two to make a fast look-up.
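With only ~150k distinct store names, you may not need the server at all after the first load: run the GROUP BY query once, cache the values, and answer every keystroke from a client-side prefix trie. A minimal sketch in plain JavaScript (the names here are mine, not from the Fusion Tables sample):

// Minimal prefix trie for client-side autocomplete.
function TrieNode() {
  this.children = {};
  this.isWord = false;
}

function Trie() {
  this.root = new TrieNode();
}

Trie.prototype.insert = function (word) {
  var node = this.root;
  for (var i = 0; i < word.length; i++) {
    var ch = word[i];
    if (!node.children[ch]) node.children[ch] = new TrieNode();
    node = node.children[ch];
  }
  node.isWord = true;
};

Trie.prototype.withPrefix = function (prefix, limit) {
  var node = this.root;
  for (var i = 0; i < prefix.length; i++) {
    node = node.children[prefix[i]];
    if (!node) return [];
  }
  var results = [];
  (function collect(n, word) {
    if (results.length >= limit) return;
    if (n.isWord) results.push(word);
    for (var ch in n.children) {
      if (results.length >= limit) return;
      collect(n.children[ch], word + ch);
    }
  })(node, prefix);
  return results;
};

// Usage: build once from the slow GROUP BY result, then look up as the user types.
// var trie = new Trie();
// storeNames.forEach(function (name) { trie.insert(name.toLowerCase()); });
// trie.withPrefix('wal', 10);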

Related

How to handle an extremely big table in a search?

I'm looking for suggestions on how to handle the following use case with the Python Django framework; I'm also open to using JavaScript libraries/AJAX.
I'm working with a pre-existing table/model called revenue_code with over 600 million rows of data.
The user will need to search three fields within one search (code, description, room) and be able to select multiple search results, similar to Kendo's multi-select control. I first started off by combining the fields with django-filter as shown in the link below, but my application became unresponsive; after waiting 10-15 minutes I was able to view the search results but couldn't select anything.
https://simpleisbetterthancomplex.com/tutorial/2016/11/28/how-to-filter-querysets-dynamically.html
I've also tried Kendo controls, select2, and Chosen because I need the user to be able to select as many revenue codes as they need (upwards of 10-20), but all gave the same unresponsive page when attempting to load the data into the control/multi-select.
Essentially what I'm looking for is something like the example below, which lets the user make multiple selections and handles a massive amount of data without becoming unresponsive. Ideally I'd like to be able to run my search without displaying all the data.
https://petercuret.com/add-ajax-to-django-without-writing-javascript/
Is the Django framework meant to handle this type of volume? Would it be better to export this data into a file and read the file? I'm not looking for code, just some pointers on how to handle this use case.
What is the basic mechanism of "searching 600 million rows"? What a database does is build an index before search time, one general enough to serve different types of query; at search time you then search only the index, which is much smaller (small enough to keep in memory) and faster. But no matter what, searching by its nature has no built-in "pagination" concept, and if 600 million records cannot fit into memory at the same time, parts of those records have to be swapped in and out repeatedly; the more parts, the slower the operation. These details are hidden behind the algorithms in databases like MySQL.
There are very compact representations, like bitmap indexes, which let you search data like male/female very quickly, or any data where you can use one bit per piece of information.
So whether you use Django or not does not really matter. What matters is the tuning of the database, the design of the tables to facilitate the queries (the types of indexes), and the total amount of memory on the server side to keep the data in memory.
Check this out:
https://dba.stackexchange.com/questions/20335/can-mysql-reasonably-perform-queries-on-billions-of-rows
https://serverfault.com/questions/168247/mysql-working-with-192-trillion-records-yes-192-trillion
How many rows are 'too many' for a MySQL table?
You can't load all the data into your page at once. 600 million records is too many.
Since you mentioned select2, have a look at their example with pagination.
The trick is to limit your SQL results to maybe 100 or so at a time. When the user scrolls to the bottom of the list, it can automatically load in more.
Send the search query to the server, and do the filtering in SQL (or NoSQL or whatever you use). Database engines are built for that. Don't try filtering/sorting in JS with that many records.
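As a rough sketch of the select2 AJAX pagination setup (the /api/revenue_codes/ endpoint, its q/page parameters, and the { results, more } response shape are assumptions; wire them up to whatever your Django view returns):

// Attach to a <select multiple> element. select2 sends the typed term and a
// page number; the server returns one page of matches plus a "more" flag.
$('#rev-codes').select2({
  multiple: true,
  minimumInputLength: 2,          // don't hit the server until the user types something
  ajax: {
    url: '/api/revenue_codes/',   // hypothetical endpoint doing the filtering in SQL
    dataType: 'json',
    delay: 250,                   // debounce keystrokes
    data: function (params) {
      return { q: params.term, page: params.page || 1 };
    },
    processResults: function (data, params) {
      params.page = params.page || 1;
      return {
        results: data.results,              // [{ id: ..., text: ... }, ...]
        pagination: { more: data.more }     // lets select2 load the next page on scroll
      };
    }
  }
});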

OData CRM query: Any open opportunities with a given target?

Our organization has a CRM installation on which we've done extensive customization. Right now I'm trying to implement a solution to enforce a business rule: prevent users from updating a program to inactive when the program is used in an opportunity designation on an open opportunity.
I know how to prevent the update: return false from OnSave() in the JavaScript. What I haven't been able to find out is how to detect when that's the case. The best idea I've come up with is to query the OData endpoint in CRM, but I've come across a sticking point at the last step. (If you've got a better idea I'm totally open to it.)
Here's what I've got. I can get the program in question:
programset(guid'thisone')
.../OrganizationData.svc/uwkc_programSet(guid'F4D75E9D-3A79-E611-80DA-C4346BACAAC0')
I can get the associated designations:
programset(guid'thisone')/program-desig
.../OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation
and the associated Opportunities to those:
programset(guid'thisone')/program-desig?$expand=desig-opportunity
...OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation?$expand=uwkc_opportunity_uwkc_opportunitydesignation
... but now I get a little stuck.
I can filter on a primitive value on the Opportunity (link + field)
...$filter=opp-oppdesig/EstimatedCloseDate gt DateTime('2016-07-01')
...OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation?$expand=uwkc_opportunity_uwkc_opportunitydesignation&$filter=uwkc_opportunity_uwkc_opportunitydesignation/EstimatedCloseDate%20gt%20DateTime%272016-07-01%27
and I can filter on a complex value on the Designation (field + value)
...$filter=statecode/Value gt 0
...OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation?$expand=uwkc_opportunity_uwkc_opportunitydesignation&$filter=statecode/Value%20gt%200
but I can't make a filter work on a complex value on the Opportunity (connection + field + value)
...$filter=opp-oppdesig/statecode/Value gt 0
...OrganizationData.svc/uwkc_programSet(guid'F0D75E9D-3A79-E611-80DA-C4346BACAAC0')/uwkc_uwkc_program_uwkc_opportunitydesignation?$expand=uwkc_opportunity_uwkc_opportunitydesignation&$filter=uwkc_opportunity_uwkc_opportunitydesignation/statecode/Value%20gt%200
No property 'statecode' exists in type 'Microsoft.Xrm.Sdk.Entity' at position 45.
How can I filter on the state of an entity two away from what I'm looking at? Or, if there's a better way, what's the best way to prevent in-use programs from being deactivated?
The first issue is that you should be using the schema name of the attribute (StateCode), not the logical name (statecode).
However, I believe it will then just return another error message:
<error xmlns="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
<code>-2147220989</code>
<message xml:lang="en-US">attributeName</message>
</error>
For some reason, it seems that filtering on complex types in an expanded entity does not work properly on the OData endpoint. And from what I have tested, the new Web API does not support this kind of depth in a query yet either.
One solution to your problem is to just fetch all the results, and then perform the filtering manually in your code. This of course works best if you can assume that there are not too many related entities retrieved in this kind of query. Also be sure to use $select to retrieve only the necessary attributes, as it greatly reduces the time it takes for a query to finish.
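If you go the fetch-then-filter route, a rough sketch in form JavaScript could look like the following (the URL and relationship names are copied from the question, StateCode 0 is assumed to mean an open opportunity, and the endpoint path and JSON shape should be verified against your org):

// Pull the designations with their expanded opportunities, then test the
// opportunity state in code since the two-hop $filter is rejected.
function hasOpenOpportunity(programId, callback) {
  var url = Xrm.Page.context.getClientUrl() +
    "/XRMServices/2011/OrganizationData.svc/uwkc_programSet(guid'" + programId + "')" +
    "/uwkc_uwkc_program_uwkc_opportunitydesignation" +
    "?$select=uwkc_opportunity_uwkc_opportunitydesignation" +   // narrow this further if you can
    "&$expand=uwkc_opportunity_uwkc_opportunitydesignation";

  var req = new XMLHttpRequest();
  req.open("GET", url, true);
  req.setRequestHeader("Accept", "application/json");
  req.onreadystatechange = function () {
    if (req.readyState !== 4 || req.status !== 200) return;
    var designations = JSON.parse(req.responseText).d.results;
    var open = designations.some(function (desig) {
      var opp = desig.uwkc_opportunity_uwkc_opportunitydesignation;
      return opp && opp.StateCode && opp.StateCode.Value === 0;  // assumption: 0 = open
    });
    callback(open);
  };
  req.send();
}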
Another solution is to perform the query using FetchXML instead. This can be done either via the Web API, or as a SOAP request you construct yourself.
A third solution is to split your query into multiple queries, so you don't have to filter on a state two entities away in a query.

Performance issues with EmberJS and Rails 4 API

I have an EmberJS application powered by a Rails 4 REST API. The application works fine as it is; however, it is becoming very sluggish because of the kind of queries being performed.
Currently the API output is as follows:
"projects": [{
  "id": 1,
  "builds": [1, 2, 3, 4]
}]
The problem arises when a user has lots of projects with lots of builds split between them. EmberJS currently looks at builds key then makes a request to /builds?ids[]=1&ids[]=2 which is the kind of behaviour I want.
This question could have one of two solutions.
Update Rails to load the build_ids more efficiently
Update EmberJS to support different queries for builds
Option 1: Update Rails
I have tried various solutions regarding eager loading and manually grabbing the IDs using custom methods on the serializer. Both of these solutions add a lot of extra code that I'd rather not write, and they still do individual queries per project.
By default, Rails also does SELECT *-style queries when resolving has_many, and I can't figure out how to override this at the serializer layer. I also wrote a horrible solution that got the whole thing down to one fast query, but it involved writing raw SQL, which I know isn't the Rails way of doing things, and I'd rather not have such a huge, complex, untestable query as the default scope.
Option 2: Make Ember use different queries
Instead of requesting /builds?ids[]=1&ids[]=2 I would rather not include the builds key at all on the project and make a request to /builds?project_id=1 when I access that variable within Ember. I think I can do this manually on a per field basis by using something similar to this:
builds: function () {
  return this.store.find('build', { project_id: this.get('id') });
}.property()
instead of the current:
builds: DS.hasMany('build', { async: true })
It's also worth mentioning that this doesn't only apply to "builds". There are 4 other keys on the project object that do the same thing so that's 4 queries per project.
Have you made sure that you have properly added indexes to your database? Adding an index on project_id in the builds table will make it a lot faster.
Alternatively you should use the links attribute to load your records.
{"projects": [{
  "id": 1,
  "links": {
    "builds": "/projects/1/builds"
  }
}]}
This means that the builds table will only be queried when the relationship is accessed.
Things you can try:
Make sure your rails controller only selects the columns needed for JSON serialization.
Ensure you have indexes on the columns present in your where and join clauses, unless the column is boolean or has a low number of distinct values. Always ensure you have indexes on foreign key columns.
Be very, very careful with how you are using ActiveRecord joins vs. includes vs. preload vs. eager_load and references. This area is fraught with problems when composing scopes together; subtle things can alter the generated SQL, the number of queries issued, and even which results are actually returned. I noticed minor point releases of AR 4 yielding different query results because of the join strategy AR would choose.
Often you want to reduce the number of SQL statements issued to the database, but joining tables is not always the best solution. You will need to benchmark and use EXPLAIN to see what works better for your queries. Sometimes subqueries/sub-selects can be more efficient.
Querying by parent_id is a good option if you can get Ember Data to perform the request that way as the database has a simpler query.
You could consider using Ember-Model instead of Ember-Data. I am using it currently as it's much simpler and easier to adapt to my needs, and it supports multi-fetch to avoid 1+N request problems.
You may be able to use embedded models or side-loaded models so your server can reduce the number of web requests AND the number of SQLs and return what the client needs in one request / one SQL. Ember-Model supports both embedded and side-loaded models, so Ember-Data being more ambitious may also.
Although it appears from your question that Ember-Data is doing a multi-fetch, make sure you are using a SQL IN clause for those IDs instead of separate queries.
Make sure that the SQL on your Rails side is not fanning out in a 1+N pattern. Using the includes option to trigger eager loading on AR relations may help avoid 1+N queries, or it may unnecessarily load models, depending on the results needed in your response.
I also found that the Ruby JSON serializer libraries are less than optimal. I created a gem, ToJson, that speeds up JSON serialization many times over the existing solutions. You can try it and benchmark for yourself.
I found that ActiveRecord (including AR 4) didn't work well for me, and I moved to Sequel in the end because it gave me so much more control over join types, join conditions, query composition, and tactical eager loading. It was also just faster, has wider support for standard SQL features, and has excellent support for Postgres features and extensions. These things can make a huge difference to the way you design your database schema and to the performance and types of queries you can achieve.
Using Sequel and ToJson I can serve around 30-50 times more requests than I could with ActiveRecord + JBuilder for most of my queries, and in some instances it's hundreds of times better than what I was achieving with AR (especially for creates/updates). Besides Sequel being faster at instantiating models from the DB, it also has a Postgres streaming adapter which makes it even faster for large results.
Changing your data access/ORM layer and JSON serialization can achieve 30-50 times better performance, or alternatively let you manage 30-50 times fewer servers for the same load. That's nothing to sneeze at.

Best Practices for displaying large lists

Are there any best practices for returning large lists of orders to users?
Let me try to outline the problem we are trying to solve. We have a list of customers that have 1-5,000+ orders associated with each. We pull these orders directly from the database and present them to the user in a paginated grid. The view we have is a very simple "select columns from orders", which worked fine when we were first starting, but as we grow it's causing performance/contention problems. It seems like there are a million and one ways to skin this cat (return only a page's worth of data, only return the last 6 months of data, etc.), but as I said, I'm just wondering if there are any resources out there that provide a little more hand-holding on how to solve this problem.
We use SQL Server as our transaction database and select the data out in XML format. We then use a mixture of XSLT and JavaScript to create our grid. We aren't married to the presentation solution, but we are married to the database solution.
My experience.
Always set default values in the UI for the user that are reasonable. You don't want them clicking "Retrieve" and getting everything.
Set a limit to the number of records that can be returned.
Only return from the database the records you are going to display.
If forward/backward consistency is important, store the entire result set from the query in a temp table and return just the page you need to display. When paging up/down, retrieve the next set from the temp table.
Make sure your indexes are covering your queries.
Use different queries for different purposes. Think "Open Orders" vs. "Closed Orders". These might perform much better as separate queries instead of one generic query.
Set parameter defaults in the stored procedures. Protect your query from a UI that is not setting reasonable limits.
I wish we did all these things.
I'd recommend doing some profiling to find the actual bottlenecks. Perhaps you have access to Visual Studio Profiler? http://msdn.microsoft.com/en-us/magazine/cc337887.aspx There are plenty of good profilers out there.
Otherwise, my first stop would be pagination to bring back fewer records from the db, which is easier on the connection and the memory footprint. Take a look at this (I'm assuming you're on SQL Server >= 2005):
http://www.15seconds.com/issue/070628.htm
I'm not sure from the question exactly what UI problem you are trying to solve.
If it's that the customer can't work with a table that is just one big amorphous blob, then let him sort on the fields: order date, order number, your SKU number, maybe his SKU number, and I guess others, too. He might find it handy to do a multi-column stable sort as well.
If it's that the table headers scroll up and disappear when he scrolls down through his orders, that's more difficult. Read the SO discussion to see if the method there gives a solution you can use.
There is also a jQuery mechanism for keeping the header within the viewport.
HTH
EDIT: plus I'll second @Iain's answer: do some profiling.
Another EDIT: @Scott Bruns's answer reminded me that when we started designing the UI, the biggest issue by far was limiting the number of records the user had to look at. So yes, I agree with Scott that you should give the user some way to see only a limited number of records right from the start; that is, before he ever sees a table, he has told you a lot about what he wants to see.
Stupid question, but have you asked the users of your application for input on what records that they would like to see initially?

Handling large grid datasets in JavaScript

What are some of the better solutions for handling large datasets (100K rows) on the client with JavaScript? In particular, if you have multi-column sort and search capabilities, how do you handle fetching (and pre-fetching) the data, client-side model binding (for display), and caching the data?
I would imagine a good solution would involve doing some thoughtful work in the background. For instance, if the table initially displays N items, it might fetch 2N items, return the data to the user, and then go fetch the next 2N items in the background (even if the user hasn't requested them). As the user makes search/sort changes, it would throw that cache out (or maybe keep the initial base case cached) and do the same thing for the new query.
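Something along these lines is what I have in mind, as a rough sketch only (fetchRows(offset, limit, query) is a hypothetical server call returning a Promise of row objects; it assumes pages are visited roughly in order):

var cache = { query: null, rows: [], pending: null };

function loadPage(page, pageSize, query) {
  if (query !== cache.query) {
    // Search/sort changed: throw the old cache away and start over.
    cache = { query: query, rows: [], pending: null };
  }
  var c = cache;                        // capture so a stale fetch can't pollute a newer cache
  var wanted = (page + 2) * pageSize;   // current page plus one page ahead
  if (c.rows.length < wanted && !c.pending) {
    c.pending = fetchRows(c.rows.length, wanted - c.rows.length, query)
      .then(function (rows) {
        c.rows = c.rows.concat(rows);
        c.pending = null;
      });
  }
  var haveCurrentPage = c.rows.length >= (page + 1) * pageSize;
  return (haveCurrentPage ? Promise.resolve() : c.pending).then(function () {
    return c.rows.slice(page * pageSize, (page + 1) * pageSize);
  });
}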
Can you share the best solutions you have seen?
Thanks
Use a jQuery table plugin like DataTables: http://datatables.net/
It supports server-side processing for sorting, filtering, and paging. And it includes pipelining support to prefetch the next x pages of records: http://www.datatables.net/examples/server_side/pipeline.html
Actually the DataTables plugin works 4 different ways:
1. With an HTML table, so you could send down a bunch of HTML and then have all the sorting, filtering, and paging work client-side.
2. With a JavaScript array, so you could send down a 2D array and let it create the table from there.
3. Ajax source - which is not really applicable to you.
4. Server-side, where you send data in JSON format to an empty table and let DataTables take it from there.
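Mode 4 (server-side processing) is the one that fits large datasets. A rough sketch of the client-side setup, assuming a placeholder /api/rows endpoint that understands DataTables' standard request parameters and returns draw/recordsTotal/recordsFiltered/data:

$('#grid').DataTable({
  serverSide: true,      // sorting, filtering and paging are done by the server
  processing: true,      // show the "Processing..." indicator during requests
  pageLength: 50,
  ajax: {
    url: '/api/rows',    // placeholder endpoint
    type: 'GET'
  },
  columns: [             // placeholder column names
    { data: 'id' },
    { data: 'name' },
    { data: 'amount' }
  ]
});

The pipelining example linked above wraps the ajax option so each request fetches several pages' worth of rows and serves the next few page turns from memory.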
SlickGrid does exactly what you're looking for. (Demo)
Using the AJAX data store, SlickGrid can handle millions of rows without flinching.
Since you tagged this with Ext JS, I'll point you to Ext.ux.LiveGrid if you haven't already seen it. The source is available, so you might have a look and see how they've addressed this issue. This is a popular and widely-used extension in the Ext JS world.
With that said, I personally think (virtually) loading that much data is useless as a user experience. Manually pulling a scrollbar around (jumping hundreds of records per pixel) is a far inferior experience to simply typing what you want. I'd much prefer some robust filtering/searching instead of presenting that much data to the user.
What if you went to Google and instead of a search box, it just loaded the entire internet into one long virtual list that you had to scroll through to find your site... :)
It depends on how the data will be used.
For a large dataset, where the browser's Find function was adequate, just returning a straight HTML table was effective. It takes a while to load, but the display is responsive on older, slower clients, and you never have to worry about it breaking.
When the client does the sorting and searching, and you're not showing the entire table at once, I had the server send tab-delimited tables through XMLHttpRequest, parsed them in the browser with rows = responseText.split('\n'), and updated the display with repeated calls to $('node').innerHTML = 'blah'. The JS engine can store long strings pretty efficiently. That ran a lot faster on the client than showing, hiding, and rearranging DOM nodes. Creating and destroying DOM nodes on the fly turned out to be really slow. Splitting each line into fields on demand seems to work; I haven't experimented with that degree of freedom.
I've never tried the obvious pre-fetch & background trick, because these other methods worked well enough.
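For reference, a bare-bones sketch of the tab-delimited approach above (the endpoint and element IDs are made up, and it assumes the fields contain no HTML special characters):

var xhr = new XMLHttpRequest();
xhr.open('GET', '/orders.tsv', true);   // hypothetical endpoint: one row per line, tab-separated fields
xhr.onload = function () {
  var rows = xhr.responseText.split('\n');   // keep rows as strings; split fields on demand
  var html = [];
  for (var i = 0; i < rows.length; i++) {
    if (!rows[i]) continue;                  // skip blank trailing line
    var fields = rows[i].split('\t');
    html.push('<tr><td>' + fields.join('</td><td>') + '</td></tr>');
  }
  // One innerHTML assignment is much faster than creating DOM nodes per cell.
  document.getElementById('grid-body').innerHTML = html.join('');
};
xhr.send();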
Check out this comprehensive list of data grids and spreadsheets.
For filtering/sorting/pagination purposes you may be interested in the excellent Handsontable, or DataTables as a free alternative.
If you simply need to display a huge list without any additional features, Clusterize.js should be sufficient.
