How to handle an extremely big table in a search?

I'm looking for suggestions on how to handle the following use case with the Python Django framework; I'm also open to using JavaScript libraries or Ajax.
I'm working with a pre-existing table/model called revenue_code with over 600 million rows of data.
The user needs to search three fields in one search (code, description, room) and be able to select multiple search results, similar to the Kendo controls' multi-select. I started by combining the fields in django-filters as shown below, but my application became unresponsive; after waiting 10-15 minutes I was able to view the search results but couldn't select anything.
https://simpleisbetterthancomplex.com/tutorial/2016/11/28/how-to-filter-querysets-dynamically.html
I've also tried Kendo controls, Select2, and Chosen, because I need the user to be able to select as many rev codes as they need, up to 10-20, but all of them gave the same unresponsive page when they attempted to load the data into the control/multi-select.
Essentially, what I'm looking for is something like the example below, which lets the user make multiple selections and handles a massive amount of data without becoming unresponsive. Ideally I'd like to be able to query my search without displaying all the data.
https://petercuret.com/add-ajax-to-django-without-writing-javascript/
Is the Django framework meant to handle this type of volume? Would it be better to export this data into a file and read the file? I'm not looking for code, just some pointers on how to handle this use case.

What is the basic mechanism behind "searching 600 million rows"? A database builds an index ahead of search time, general enough to serve different types of query, and at search time it searches the index, which is much smaller (so it can fit into memory) and faster. But searching by its nature has no concept of "pagination": if the 600 million records cannot fit into memory at the same time, parts of the data have to be swapped in and out repeatedly, and the more parts there are, the slower the operation. All of this is hidden behind the algorithms in databases such as MySQL.
There are also very compact representations such as bitmap indexes, which let you search very quickly on data like male/female, or any data where one bit per piece of information is enough.
So whether you use Django or not does not really matter. What matters is the tuning of the database, the design of the tables to support the queries (the types of indexes), and the total amount of memory on the server available to keep the data in memory.
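To make the point about indexes concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the real database. The table layout and the index name idx_revenue_code_code are assumptions for illustration; only the table name revenue_code and the three searched fields come from the question.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's human-readable plan for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan text

# Small in-memory stand-in for the real table (column layout is assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue_code (id INTEGER PRIMARY KEY, code TEXT, description TEXT, room TEXT)"
)
conn.executemany(
    "INSERT INTO revenue_code (code, description, room) VALUES (?, ?, ?)",
    [(f"{i:04d}", f"description {i}", f"room {i % 50}") for i in range(10_000)],
)

lookup = "SELECT id, description FROM revenue_code WHERE code = ?"

# Without an index the engine has to scan every row.
plan_before = query_plan(conn, lookup, ("0042",))

# With an index on the searched column it can jump straight to the matches.
conn.execute("CREATE INDEX idx_revenue_code_code ON revenue_code (code)")
plan_after = query_plan(conn, lookup, ("0042",))

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, and MySQL has its own EXPLAIN, but the idea carries over: check the plan to confirm the search is served by an index rather than a full scan.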
Check this out:
https://dba.stackexchange.com/questions/20335/can-mysql-reasonably-perform-queries-on-billions-of-rows
https://serverfault.com/questions/168247/mysql-working-with-192-trillion-records-yes-192-trillion
How many rows are 'too many' for a MySQL table?

You can't load all the data into your page at once. 600 million records is too many.
Since you mentioned select2, have a look at their example with pagination.
The trick is to limit your SQL results to maybe 100 or so at a time. When the user scrolls to the bottom of the list, it can automatically load in more.
Send the search query to the server, and do the filtering in SQL (or NoSQL or whatever you use). Database engines are built for that. Don't try filtering/sorting in JS with that many records.
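A rough sketch of that idea, assuming a Python backend and using sqlite3 in place of the real database: the server returns at most one page of matches plus a cursor the client sends back to fetch the next page. The function name search_codes, the column layout, and the sample data are made up for illustration.

```python
import sqlite3

PAGE_SIZE = 100  # rows per request; an arbitrary choice for this sketch

def search_codes(conn, term, after_id=0, page_size=PAGE_SIZE):
    """Return one page of prefix matches and the cursor for the next page (or None)."""
    pattern = term + "%"  # prefix match; a real app would also escape % and _ in term
    rows = conn.execute(
        "SELECT id, code, description, room FROM revenue_code "
        "WHERE id > ? AND (code LIKE ? OR description LIKE ? OR room LIKE ?) "
        "ORDER BY id LIMIT ?",
        (after_id, pattern, pattern, pattern, page_size),
    ).fetchall()
    # A full page suggests there may be more; hand back the last id as the cursor.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor

# Tiny stand-in for the real table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue_code (id INTEGER PRIMARY KEY, code TEXT, description TEXT, room TEXT)"
)
conn.executemany(
    "INSERT INTO revenue_code (code, description, room) VALUES (?, ?, ?)",
    [
        (f"{i:04d}", ("lab test" if i % 2 == 0 else "xray") + f" {i}", f"room {i % 5}")
        for i in range(250)
    ],
)

first_page, cursor = search_codes(conn, "lab")
second_page, cursor_after_second = search_codes(conn, "lab", after_id=cursor)
```

Paging on "id > last seen id" instead of OFFSET keeps later pages as cheap as the first one, because the database does not have to skip over all the earlier rows. The multi-select control then only ever holds the pages the user actually scrolled through.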

Related

Call SQL "function" (stored procedure?) every time a database column is selected

I am running MySQL 5.6. I have a number of various "name" columns in the database (in various tables). These get imported every year by each customer as a CSV data dump. There are a number of places that these names are displayed throughout this website. The issue is, the names have almost no formatting (and to this point, no sanitization existed upon importation):
Phil Eaton, PHIL EATON, Phil EATON, etc.
Thus, the website sometimes looks like a mess when these names are involved. There are a number of ways that I can think of to do this, but none that are particularly appealing.
First, I can have a filter in JavaScript. However, as I said, these names exist in a number of places throughout this (large) site. I may end up missing a page. The names do not already sit within nice "name"-classed divs/spans, etc.
Second, I could filter in PHP (the backend). This seems about as effective as doing it in JavaScript. I could do it in the API, but there is still no central method for pulling names from the database, so I could still miss an API call anyway.
Finally, the obvious "best" way is to sanitize the existing data in place for each name column. Then at the same time, immediately start sanitizing all names that get imported each time we add a customer. The issue with the first part of this is that there are hundreds of millions of rows of names in the database. Updating these could take a long amount of time and be disruptive to the clients' daily routines.
So, the most appealing way to correct this in the short-term is to invoke a function every time a column is selected. In this way I could "decorate" every name column with a formatting function so the names will appear uniform on the frontend. So ultimately, my question is: is it possible to invoke a specific function in SQL to format each row every time a specific column is selected? In other words, maybe can I call a stored procedure every time a column is selected? (Point being, I'm trying to keep the formatting in SQL to avoid the propagation of usage.)

In MySQL you can't trigger something on SELECT, but I have an idea (it's only an idea; I don't have time to try it right now, sorry).
You can probably create a VIEW on this table with the same structure, but with a formatting function applied to the name fields (it would have to be a stored function, since a stored procedure cannot be called inside a SELECT), and select from this view in your PHP.
But it has two drawbacks:
You have to modify all the SELECT statements in your PHP code.
The server will call that function on every select. Maybe you can store the formatted values and check for them first (cache them).
On the other hand, I agree with HLGEM: I also suggest formatting the data on import, because it's very bad practice to import something you don't check into a DB (SQL injection?). The batch task is also a good idea for cleaning up the mess.

I presume names are selected frequently, so invoking a sanitization function every time they are selected could severely slow down your system. Further, there is no simple setting that gives you this; you would have to change every bit of SQL code that includes names.
Personally, I would fix the imports so that they store a sanitized version of new names. It is a bad idea to put any data directly into a database without some sort of staging and cleanup.
Then I would tackle the old names and fix them in batches in a nightly run, scheduled for when the fewest people are using the system. You would have to do some testing on dev to determine how big a batch you can run without interfering with other things the database is doing. The larger the batch, the sooner you get through all the names. Even though this will take time, it is the surest method of getting the data cleaned up, and over time the data will look better to the users. If the design of your database lets you identify the more active names (such as an is_active flag for a customer, or an order in the last year), I would prioritize the update by that. Alternatively, you could clean up one client at a time, starting with whichever one has noticed the problem and is driving this change.
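The batch approach could look roughly like the sketch below. It uses Python with sqlite3 as a stand-in for MySQL, and the table name customers, the column name, and the helper names format_name and clean_batch are all hypothetical. The formatting rule is deliberately naive (it would turn "O'BRIEN" into "O'brien"), so treat it as a placeholder for whatever rule you settle on.

```python
import sqlite3

def format_name(raw):
    """Collapse whitespace and capitalize each word, e.g. 'PHIL  EATON' -> 'Phil Eaton'."""
    return " ".join(part.capitalize() for part in raw.split())

def clean_batch(conn, last_id, batch_size):
    """Normalize up to batch_size rows with id > last_id.

    Returns the last id processed, or None when no rows are left.
    """
    rows = conn.execute(
        "SELECT id, name FROM customers WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    if not rows:
        return None
    # Only write rows that actually change, to keep each batch light.
    changes = [(format_name(name), row_id) for row_id, name in rows if format_name(name) != name]
    conn.executemany("UPDATE customers SET name = ? WHERE id = ?", changes)
    conn.commit()  # one short transaction per batch
    return rows[-1][0]

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (name) VALUES (?)",
    [("Phil Eaton",), ("PHIL EATON",), ("Phil EATON",), ("  jane   doe ",)],
)

last_id = 0
while last_id is not None:
    last_id = clean_batch(conn, last_id, batch_size=2)

cleaned = [row[0] for row in conn.execute("SELECT name FROM customers ORDER BY id")]
```

Because each batch remembers the last id it handled, the nightly job can stop at any point and resume from there the next night, and the batch size is the one knob to tune on dev.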

The earlier answers give some possible solutions. But the short answer to the specific option you are asking about is: no. There is no such thing as a "SELECT statement trigger", let alone one for a single column. Triggers come close to what you want, but they exist only for INSERT, UPDATE and DELETE operations.

Faster search String in Mysql Database

I have a large DB with more than 20,000 rows. I have two tables, songs and albums:
the songs table contains songid, albumid and songname, and the albums table contains albumid and albumname.
Currently, when the user searches for a song, I show results instantly as soon as he starts typing, just like Google Instant.
What I am doing is this: every time the user types, I send the query string to my backend PHP file, where I run the query against my DB like this:
SELECT * FROM songs, albums WHERE songs.albumid = albums.albumid AND songs.songname LIKE '%{$query_string}%';
But it's very inefficient to hit the DB on every keystroke, and it's not scalable, as my DB is growing every day.
Therefore I want the same feature, but faster, more efficient and scalable.
Also, I don't want it to be exact pattern matching. For example:
if the user types "Rihana" instead of "Rihanna", it should still return the results related to Rihanna.
Thanks.

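You should index the songname column of the songs table on its first n characters, say 6, to get better performance for the query. Note that such an index is only used when the pattern is anchored at the start (LIKE 'abc%'); a leading wildcard, as in LIKE '%abc%', still forces a full scan.
Trigger the search only after n characters have been typed, say 3 (jQuery autocomplete has this option, for example).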
You may also consider an in-memory DB if performance is truly crucial (sounds like it is) and the amount of data will not consume too much resident memory.
Google, btw, does not use a legacy RDBMS to perform its absurdly fast searches (continually amazed...)

First of all you should find MySQL's FULLTEXT search support far faster than your current approach.
Given the speed you'd like from this solution, and the need to handle misspelled words, I suspect you'd be better off investigating a more fully featured full-text search engine. These include:
Sphinx Search
Solr
ElasticSearch
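To show what "support for misspelled words" means in practice, here is a small in-process illustration using Python's standard difflib module. This is not how Sphinx, Solr or ElasticSearch implement fuzzy matching, and it would not scale to a large catalogue; the function name suggest and the artist list are made up for the example.

```python
import difflib

def suggest(query, known_names, limit=3, cutoff=0.6):
    """Return known names that are similar to the query, best match first."""
    # Compare in lower case, but hand back the original spelling.
    lowered = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=cutoff)
    return [lowered[hit] for hit in hits]

artists = ["Rihanna", "Madonna", "Adele", "Shakira"]  # hypothetical catalogue
matches = suggest("Rihana", artists)
print(matches)
```

The cutoff controls how forgiving the match is: a lower value tolerates more typos but returns more unrelated names. A dedicated search engine gives you the same kind of knob, backed by an index instead of a linear scan.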

Try full text search.
Before MySQL 5.6, FULLTEXT indexing requires MyISAM tables though; InnoDB supports it from 5.6 onward.
If you need ACID and full text search, use PostgreSQL.

Best Practices for displaying large lists

Are there any best practices for returning large lists of orders to users?
Let me try to outline the problem we are trying to solve. We have a list of customers, each with anywhere from 1 to 5,000+ orders associated with them. We pull these orders directly from the database and present them to the user in a paginated grid. The view we have is a very simple "select columns from orders", which worked fine when we were first starting, but as we grow it's causing performance/contention problems. It seems there are a million and one ways to skin this cat (return only a page's worth of data, return only the last 6 months of data, etc.), but as I said, I'm wondering whether there are any resources out there that provide a little more hand-holding on how to solve this problem.
We use SQL Server as our transaction database and select the data out in XML format. We then use a mixture of XSLT and JavaScript to create our grid. We aren't married to the presentation solution but are married to the database solution.

My experience.
Always set default values in the UI for the user that are reasonable. You don't want them clicking "Retrieve" and getting everything.
Set a limit to the number of records that can be returned.
Only return from the database the records you are going to display.
If forward/backward consistency is important, store the entire result set from the query in a temp table and return just the page you need to display. When paging up/down, retrieve the next set from the temp table.
Make sure your indexes cover your queries.
Use different queries for different purposes. Think "Open Orders" vs "Closed Orders". These might perform much better as separate queries than as one generic query.
Set parameter defaults in the stored procedures. Protect your query from a UI that is not setting reasonable limits.
I wish we did all these things.
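As a sketch of the "protect your query from the UI" points above: whatever the client asks for, the server falls back to a default and enforces a hard cap before anything reaches the database. The constant values and the function name clamp_page_size are assumptions for illustration, and the same logic could equally live in the stored procedure's parameter defaults.

```python
DEFAULT_PAGE_SIZE = 50   # used when the UI sends nothing usable
MAX_PAGE_SIZE = 500      # hard upper bound, whatever the UI asks for

def clamp_page_size(requested):
    """Turn an untrusted page-size value into a safe row limit."""
    try:
        size = int(requested)
    except (TypeError, ValueError):
        return DEFAULT_PAGE_SIZE  # missing or non-numeric input
    if size < 1:
        return DEFAULT_PAGE_SIZE  # zero or negative makes no sense
    return min(size, MAX_PAGE_SIZE)

print(clamp_page_size(None), clamp_page_size("25"), clamp_page_size(10**9))
```

The clamped value is then what gets passed as the row limit of the query, so a buggy or malicious client can never turn "one page" into "everything".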

I'd recommend doing some profiling to find the actual bottlenecks. Perhaps you have access to Visual Studio Profiler? http://msdn.microsoft.com/en-us/magazine/cc337887.aspx There are plenty of good profilers out there.
Otherwise, my first stop would be pagination to bring back fewer records from the DB, which is easier on the connection and the memory footprint. Take a look at this (I'm assuming you're on SQL Server >= 2005):
http://www.15seconds.com/issue/070628.htm

I'm not sure from the question exactly what UI problem you are trying to solve.
If it's that the customer can't work with a table that is just one big amorphous blob, then let him sort on the fields: order date, order number, your SKU number, maybe his SKU number, and I guess others, too. He might find a multi-column stable sort handy as well.
If it's that the table headers scroll up and disappear when he scrolls down through his orders, that's more difficult. Read the SO discussion to see if the method there gives a solution you can use.
There is also a JQuery mechanism for keeping the header within the viewport.
HTH
EDIT: plus I'll second @Iain's answer: do some profiling.
Another EDIT: @Scott Bruns's answer reminded me that when we started designing the UI, the biggest issue by far was limiting the number of records the user had to look at. So yes, I agree with Scott that you should give the user some way to see only a limited number of records right from the start; that is, before he ever sees a table, he has already told you a lot about what he wants to see.

Stupid question, but have you asked the users of your application which records they would like to see initially?

Handling large grid datasets in JavaScript

What are some of the better solutions for handling large datasets (100K) on the client with JavaScript? In particular, if you have multi-column sort and search capabilities, how do you handle fetching (and pre-fetching) the data, client-side model binding (for display), and caching the data?
I would imagine a good solution would be doing some thoughtful work in the background. For instance, initially, if the table was displaying N items, it might fetch 2N items, return the data for the user, and then go fetch the next 2N items in the background (even if the user hasn't requested this). As the user made search/sort changes, it would throw out (or maybe even cache the initial base case), and do similar functionality.
Can you share the best solutions you have seen?
Thanks

Use a jQuery table plugin like DataTables: http://datatables.net/
It supports server-side processing for sorting, filtering, and paging. And it includes pipelining support to prefetch the next x pages of records: http://www.datatables.net/examples/server_side/pipeline.html
Actually the DataTables plugin works 4 different ways:
1. With an HTML table, so you could send down a bunch of HTML and then have all the sorting, filtering, and paging work client-side.
2. With a JavaScript array, so you could send down a 2D array and let it create the table from there.
3. Ajax source - which is not really applicable to you.
4. Server-side, where you send data in JSON format to an empty table and let DataTables take it from there.
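For the server-side mode, the backend has to answer each request with just the slice being displayed plus a few counters. The sketch below assumes the request and response field names used by DataTables 1.10 and later (draw, start, length, search[value], recordsTotal, recordsFiltered, data); older versions use different names. An in-memory list stands in for the database, and the function name and the cap of 100 rows are made up for illustration.

```python
MAX_LENGTH = 100  # assumed cap on rows per request

def datatables_response(params, rows):
    """Build a server-side processing reply from request params and a row list."""
    draw = int(params.get("draw", 0))
    start = max(int(params.get("start", 0)), 0)
    length = int(params.get("length", 10))
    if length < 1 or length > MAX_LENGTH:
        length = MAX_LENGTH  # DataTables sends -1 for "all rows"; cap it instead
    needle = params.get("search[value]", "").lower()

    if needle:
        # Naive substring filter; in a real app this would be a WHERE clause.
        filtered = [r for r in rows if needle in " ".join(map(str, r)).lower()]
    else:
        filtered = rows

    return {
        "draw": draw,                      # echoed so the client can match replies
        "recordsTotal": len(rows),         # rows before filtering
        "recordsFiltered": len(filtered),  # rows after filtering
        "data": filtered[start:start + length],
    }

orders = [[i, f"order {i}"] for i in range(1, 1001)]  # hypothetical data
reply = datatables_response(
    {"draw": "3", "start": "20", "length": "10", "search[value]": "order 9"}, orders
)
```

With 100K rows the filtering, sorting and slicing would be pushed down into SQL, but the shape of the reply stays the same, and the browser never holds more than one page.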

SlickGrid does exactly what you're looking for. (Demo)
Using the AJAX data store, SlickGrid can handle millions of rows without flinching.

Since you tagged this with Ext JS, I'll point you to Ext.ux.LiveGrid if you haven't already seen it. The source is available, so you might have a look and see how they've addressed this issue. This is a popular and widely-used extension in the Ext JS world.
With that said, I personally think (virtually) loading that much data is useless as a user experience. Manually pulling a scrollbar around (jumping hundreds of records per pixel) is a far inferior experience to simply typing what you want. I'd much prefer some robust filtering/searching instead of presenting that much data to the user.
What if you went to Google and instead of a search box, it just loaded the entire internet into one long virtual list that you had to scroll through to find your site... :)

It depends on how the data will be used.
For a large dataset, where the browser's Find function was adequate, just returning a straight HTML table was effective. It takes a while to load, but the display is responsive on older, slower clients, and you never have to worry about it breaking.
When the client did the sorting and search, and you're not showing the entire table at once, I had the server send tab-delimited tables through XMLHTTPRequest, parsed them in the browser with list = String.split('\n'), and updated the display with repeated calls to $('node').innerHTML = 'blah'. The JS engine can store long strings pretty efficiently. That ran a lot faster on the client than showing, hiding, and rearranging DOM nodes. Creating and destroying new DOM nodes on the fly turned out to be really slow. Splitting each line into fields on-demand seems to work; I haven't experimented with that degree of freedom.
I've never tried the obvious pre-fetch & background trick, because these other methods worked well enough.
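The "split lines up front, split fields on demand" idea from the second case can be sketched as follows. The original was done in browser JavaScript; this is a transliteration of the idea into Python, and the class name LazyTable and the sample payload are invented for the example.

```python
class LazyTable:
    """Hold a tab-delimited payload; split rows into fields only when asked."""

    def __init__(self, payload):
        self._lines = payload.split("\n")  # one cheap pass over the text
        self._fields = {}                  # row index -> list of fields, filled lazily

    def __len__(self):
        return len(self._lines)

    def row(self, index):
        """Return the fields of one row, splitting it the first time it is needed."""
        if index not in self._fields:
            self._fields[index] = self._lines[index].split("\t")
        return self._fields[index]

payload = "1\tWidget\t9.99\n2\tGadget\t19.50"  # hypothetical server response
table = LazyTable(payload)
second = table.row(1)
```

Only the rows that are actually displayed or searched get split, so the cost of a large payload is mostly the single pass that finds the line breaks.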

Check out this comprehensive list of data grids and spreadsheets.
For filtering/sorting/pagination purposes you may be interested in the great Handsontable, or DataTables as a free alternative.
If you simply need to display a huge list without any additional features, Clusterize.js should be sufficient.

Dynamic filtering, am I doing it wrong?

So I have a content-managed Umbraco site with a number of products in it, and I need to search/filter this dataset on the front end based on 5 criteria.
I'd estimate I will have 300 products. I need to filter this data very fast and hide/show options that are no longer relevant based on the previous selections.
I'm currently building a web service and a jQuery implementation using AJAX.
Is the best way to do this to load it into a javascript data structure and operate on it there or will AJAX calls be fast enough? Obviously this will mean duplicating the functionality on the server side for non-javascript users.

If you need to filter the data "very fast" then I imagine the best way is to preload all the data then manipulate it client side. If you're waiting for an Ajax response every time the user needs to filter the data then it's not going to be as fast as filtering it on the client (assuming they haven't got an ancient computer running IE6).
It would depend on the complexity of your filtering. If all you're doing is showing results where, for example, the product's price is greater than $10, then filtering on the client will definitely be much faster. If you're going to be doing complex searches, then it's possible that it could be faster to process server-side. The other question is how much data is stored for each product: preloading a few hundred products with a lot of data each may take some time.
As always, the only way you'll truly be able to answer this question is by profiling the two solutions.
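For a sense of how little work the preloaded approach involves at this size, here is a sketch of filtering a product list and working out which options are still relevant after each selection. It is written in Python to show the logic only; in the browser the same few lines would be JavaScript. The product fields and function names are assumptions, not anything from Umbraco.

```python
def filter_products(products, selections):
    """Keep products whose attributes match every selected criterion."""
    return [p for p in products if all(p.get(k) == v for k, v in selections.items())]

def remaining_options(products, selections, criteria):
    """For each criterion, list the values still present among the matching products."""
    matches = filter_products(products, selections)
    return {c: sorted({p[c] for p in matches}) for c in criteria}

# Hypothetical preloaded data set.
products = [
    {"name": "Desk A", "colour": "oak", "size": "large"},
    {"name": "Desk B", "colour": "oak", "size": "small"},
    {"name": "Chair C", "colour": "black", "size": "small"},
]

oak_only = filter_products(products, {"colour": "oak"})
options_after_small = remaining_options(products, {"size": "small"}, ["colour", "size"])
```

With around 300 products and 5 criteria this is a few thousand comparisons per selection, so the cost is negligible next to a network round trip. The same functions can run on the server for non-JavaScript users, which avoids maintaining two different filtering rules.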
