Performance issues with EmberJS and Rails 4 API - javascript

I have an EmberJS application powered by a Rails 4 REST API. The application works fine as it is; however, it is becoming very sluggish because of the kind of queries being performed.
Currently the API output is as follows:
"projects": [{
"id": 1,
"builds": [1, 2, 3, 4]
}]
The problem arises when a user has lots of projects with lots of builds split between them. Ember currently looks at the builds key and then makes a request to /builds?ids[]=1&ids[]=2, which is the kind of behaviour I want.
This question could have one of two solutions.
Update Rails to load the build_ids more efficiently
Update EmberJS to support different queries for builds
Option 1: Update Rails
I have tried various solutions involving eager loading and manually grabbing the IDs with custom methods on the serializer. Both of these solutions add a lot of extra code that I'd rather avoid, and they still perform individual queries per project.
By default Rails also does SELECT *-style queries for has_many, and I can't figure out how to override this at the serializer layer. I also wrote a horrible solution that got the whole thing down to one fast query, but it involved writing raw SQL, which I know isn't the Rails way of doing things, and I'd rather not have such a huge, complex, untestable query as the default scope.
Option 2: Make Ember use different queries
Instead of requesting /builds?ids[]=1&ids[]=2, I would rather not include the builds key on the project at all and instead make a request to /builds?project_id=1 when I access that property within Ember. I think I can do this manually on a per-field basis by using something similar to this:
builds: function () {
  return this.store.find('builds', { project_id: this.get('id') });
}.property()
instead of the current:
builds: DS.hasMany('build', { async: true })
It's also worth mentioning that this doesn't only apply to "builds". There are 4 other keys on the project object that do the same thing so that's 4 queries per project.

Have you made sure that you have properly added indexes to your database? Adding an index on project_id in the builds table will make it a lot faster.
Alternatively you should use the links attribute to load your records.
{"projects": [{
"id": 1,
"links": {
"builds": "/projects/1/builds"
}
}]}
This means that the builds table will only be queried when the relationship is accessed.
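For illustration, a minimal sketch of the Ember side under this approach, assuming an Ember Data version from the same era as the question (App.Project and the project variable are stand-ins, not code from the question): the model definition stays the same, and when the payload contains a links entry for the relationship, Ember Data fetches that URL lazily instead of issuing /builds?ids[]=... requests.

// Model stays unchanged; the "links" key in the payload drives lazy loading.
App.Project = DS.Model.extend({
  builds: DS.hasMany('build', { async: true })
});

// Given some `project` record already in the store, accessing the relationship
// triggers a single request to /projects/1/builds only when it is actually needed.
project.get('builds').then(function (builds) {
  console.log(builds.get('length'));
});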

Things you can try:
Make sure your Rails controller only selects the columns needed for JSON serialization.
Ensure you have indexes on the columns present in your where and join clauses, unless the column is boolean or has a low number of distinct values. Always ensure you have indexes on foreign key columns.
Be VERY careful with how you use ActiveRecord joins vs includes vs preload vs eager_load, and references. This area is fraught with problems when composing scopes together: subtle changes can alter the SQL generated, the number of queries issued, and even the actual results returned. I noticed minor point releases of AR 4 yielding different query results because of the join strategy AR would choose.
Often you want to reduce the number of SQL queries issued to the database, but joining tables is not always the best solution. You will need to benchmark and use EXPLAIN to see what works best for your queries. Sometimes subqueries/sub-selects can be more efficient.
Querying by parent_id is a good option if you can get Ember Data to perform the request that way as the database has a simpler query.
You could consider using Ember-Model instead of Ember-Data. I am using it currently as it's much simpler and easier to adapt to my needs, and it supports multi-fetch to avoid 1+N request problems.
You may be able to use embedded models or side-loaded models so your server can reduce both the number of web requests AND the number of SQL queries, returning what the client needs in one request / one query. Ember-Model supports both embedded and side-loaded models, and Ember-Data, being more ambitious, may as well.
Although it appears from your question that Ember-Data is doing a multi-fetch, make sure you are using a SQL IN clause for those IDs instead of separate queries.
Make sure that the SQL on your Rails side is not fanning out in a 1+N pattern. Using the includes option to perform eager loading on AR relations may help avoid 1+N queries, or it may unnecessarily load models, depending on the results needed in your response.
I also found that the Ruby JSON serializer libraries are less than optimal. I created a gem, ToJson, that speeds up JSON serialization many times over the existing solutions. You can try it and benchmark for yourself.
I found that ActiveRecord (including AR 4) didn't work well for me, and I moved to Sequel in the end because it gave me much more control over join types, join conditions, query composition, and tactical eager loading. It was also just faster, with wider support for standard SQL features and excellent support for Postgres features and extensions. These things can make a huge difference to the way you design your database schema and to the performance and types of queries you can achieve.
Using Sequel and ToJson I can serve around 30-50 times more requests than I could with ActiveRecord + JBuilder for most of my queries, and in some instances it's hundreds of times better than what I was achieving with AR (especially for creates/updates). Besides Sequel being faster at instantiating models from the DB, it also has a Postgres streaming adapter which makes it even faster for large result sets.
Changing your data access/ORM layer and JSON serialisation can achieve 30-50 times faster performance, or alternatively mean managing 30-50 fewer servers for the same load. That's nothing to sneeze at.

Related

NodeJS/Mongoose - Logical separation of same schema + multi-tenancy

I have 2 requirements in my application:
I have multiple clients, which should be completely separated
Each client can have multiple subsidiaries that they should be able to switch between without re-authenticating, but the data should be separated (e.g. vendors in subsidiary 1 should not be shown in subsidiary 2).
As for the first requirement, I'm thinking of using a multi-tenancy architecture. That is, there will be one API instance, one frontend instance per customer, and one database per customer. Each request from the frontend will include a tenant ID, from which the API decides which database it needs to connect to / use. I would use mongoose's useDb method for this (sketched below).
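A minimal sketch of what that could look like with mongoose's useDb, assuming a recent mongoose version and an Express-style API; the tenant header name and database naming scheme are made up for the example:

const mongoose = require("mongoose");

// One base connection; useDb() hands out per-tenant handles that share the
// underlying connection pool instead of opening a new connection each time.
const baseConnection = mongoose.createConnection("mongodb://localhost:27017/admin");

function tenantDb(req, res, next) {
  const tenantId = req.get("X-Tenant-Id");          // tenant ID sent by the frontend
  req.db = baseConnection.useDb(`tenant_${tenantId}`, { useCache: true });
  next();
}

// app.use(tenantDb); models are then compiled against req.db per request.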
Question 1: is this method a good approach and/or are there any known drawbacks performance wise? I'm using this article as a reference.
As for the second requirement, I would need to somehow logically separate certain schemas. E.g., I have my mongoose vendorSchema, but I would need to somehow separate the entries per subsidiary. I can only imagine adding a field to each of these "shared schemas", e.g.
const vendorSchema = new mongoose.Schema({
  /* other fields... */
  subsidiary: {
    type: mongoose.Schema.Types.ObjectId,
    ref: "Subsidiary",
    required: true
  }
})
and then passing this subsidiary in every request to the API so it can be used in the mongoose query to find the right data. That seems like a bad architectural decision and an overhead, though, and it doesn't seem very scalable.
Question 2: Is there a better approach to achieve this logical separation as per subsidiary for every "shared" schema?
Thanks in advance for any help!
To maybe answer part of your question..
A multi-tenant application is, well, normal. I honestly don't know of any web app that would be single-tenant, unless it's just a personal app.
With that said, the architecture you have will work, but as noted in my comments there is no need to have a separate DB for each user; that would be a bit of overkill, and it's the reason why SQL and Mongo queries exist.
Performance-wise, database servers are in general very performant; that's what they are designed for. But this will depend on many factors:
Number of requests
Size of requests
DB optimization
Query optimization
Resources of DB server
I'm sure there are many more I didn't list, but you get the idea.
To your second question: yes, you could add a subsidiary field, which would be, say, the subsidiary ID. Then when you query Mongo you filter with where subsidiary = id, which will return only the items for that subsidiary (see the sketch below).
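As a rough sketch of that query (the Express route, the mongoose Vendor model, and the parameter names are assumptions, not part of your code):

// Return only the vendors belonging to the requested subsidiary.
app.get("/vendors", async (req, res) => {
  const vendors = await Vendor.find({ subsidiary: req.query.subsidiary }).lean();
  res.json(vendors);
});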
As for making multiple requests to Mongo for each API call: yes, you want to try to limit the number of calls each time, but that's where caching comes in, using something like Redis to store the responses for x minutes, etc. Then the response is mostly served by Redis, but again this is going to depend a lot on the size and frequency of the responses (a rough sketch follows).
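A rough illustration of that caching idea with node-redis (the key name and the 5-minute TTL are arbitrary choices for the example, and the Vendor model is assumed from above):

const { createClient } = require("redis");
const redis = createClient();
// call `await redis.connect()` once at startup

async function cachedVendors(subsidiaryId) {
  const key = `vendors:${subsidiaryId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);                       // served straight from Redis

  const vendors = await Vendor.find({ subsidiary: subsidiaryId }).lean();
  await redis.set(key, JSON.stringify(vendors), { EX: 300 });  // cache for 5 minutes
  return vendors;
}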
But this actually leads into why I was asking about DB choices. Mongo works really well for frequently changing schemas with little to no relation between records. We use Mongo for a chat application and it works really well for that, because it's more or less just a JSON store for us with simple querying for chats. But the second you need data to relate to each other, it can start to get tricky and end up costing you more time and resources trying to hack around Mongo to do the same task.
I would say it could be worth doing an exercise where you look at your current data structure, where it is today and where it might go in the future. If you can foresee having your data related in any way in the future, or maybe even needing encryption (yes, Mongo does have this, but only in the enterprise version), then it may be something to look at.

How to handle an extremely big table in a search?

I'm looking for suggestions on how to go about handling the following use case with the Python Django framework; I'm also open to using JavaScript libraries/AJAX.
I'm working with a pre-existing table/model called revenue_code with over 600 million rows of data.
The user will need to search three fields within one search (code, description, room) and be able to select multiple search results, similar to Kendo's multi-select control. I first started off by combining the fields with django-filter as shown below, but my application became unresponsive; after waiting 10-15 minutes I was able to view the search results but couldn't select anything.
https://simpleisbetterthancomplex.com/tutorial/2016/11/28/how-to-filter-querysets-dynamically.html
I've also tried Kendo controls, select2, and Chosen, because I need the user to be able to select as many rev codes as they need (upwards of 10-20), but all gave the same unresponsive page when attempting to load the data into the control/multi-select.
Essentially, what I'm looking for is something like the example below, which allows the user to make multiple selections and will handle a massive amount of data without becoming unresponsive. Ideally I'd like to be able to run my search without displaying all the data.
https://petercuret.com/add-ajax-to-django-without-writing-javascript/
Is the Django framework meant to handle this type of volume? Would it be better to export this data into a file and read the file? I'm not looking for code, just some pointers on how to handle this use case.
What is the basic mechanism of "searching 600 million rows"? The way a database does it is to build an index before search time, general enough for different types of query, and then at search time it searches only the index, which is much smaller (so it can fit in memory) and faster. But no matter what, "searching" by its nature has no "pagination" concept: if 600 million records cannot fit in memory at the same time, then parts of those records have to be repeatedly swapped in and out, and the more parts, the slower the operation. These details are hidden behind the algorithms of databases like MySQL, etc.
There are very compact representations, like bitmap indexes, which allow you to search data such as male/female very fast, or any data where you can use one bit per piece of information.
So whether you use Django or not does not really matter. What matters is the tuning of the database, the design of the tables to facilitate the queries (types of indexes), and the total amount of memory on the server to keep the data in memory.
Check this out:
https://dba.stackexchange.com/questions/20335/can-mysql-reasonably-perform-queries-on-billions-of-rows
https://serverfault.com/questions/168247/mysql-working-with-192-trillion-records-yes-192-trillion
How many rows are 'too many' for a MySQL table?
You can't load all the data into your page at once. 600 million records is too many.
Since you mentioned select2, have a look at their example with pagination.
The trick is to limit your SQL results to maybe 100 or so at a time. When the user scrolls to the bottom of the list, it can automatically load in more.
Send the search query to the server, and do the filtering in SQL (or NoSQL or whatever you use). Database engines are built for that. Don't try filtering/sorting in JS with that many records.
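For illustration, a hedged select2 configuration along those lines, assuming jQuery and select2 are loaded and a hypothetical /revenue-codes/search endpoint that applies the LIMIT/OFFSET server-side and reports whether more pages exist:

$('#rev-codes').select2({
  multiple: true,
  ajax: {
    url: '/revenue-codes/search',
    dataType: 'json',
    delay: 250,                                   // don't hit the server on every keystroke
    data: function (params) {
      return { q: params.term, page: params.page || 1 };
    },
    processResults: function (data) {
      return {
        results: data.items,                      // array of { id, text }, limited to ~100 rows
        pagination: { more: data.has_more }       // lets select2 request the next page on scroll
      };
    }
  }
});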

Established methods for websites to serve database results

Are there established methods or frameworks for serving database rows to web clients? So far I just have them submit a JSON object, e.g.
{
  "Query": "SELECT",
  "Schema": "icecream",
  "Table": "cones",
  "Fields": ["price", "flavor"],
  "Filters": [
    {
      "Comparison": "=",
      "Field": "flavor",
      "Value": "chocolate"
    }
  ]
}
I verify that the fields mentioned are authorized/correct, then construct a prepared MySQL statement, but are there any frameworks or standard methods for implementing this?
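For concreteness, a hedged sketch of the whitelist-then-parameterize approach described above, written for Node with the mysql2 library; the whitelist contents and helper names are made up for illustration:

const ALLOWED = {
  "icecream.cones": new Set(["price", "flavor"])
};

function buildSelect(q) {
  const table = `${q.Schema}.${q.Table}`;
  const allowed = ALLOWED[table];
  if (!allowed) throw new Error("table not allowed");

  const fields = q.Fields.filter(f => allowed.has(f));
  if (fields.length === 0) throw new Error("no allowed fields");

  const where = [];
  const params = [];
  for (const f of q.Filters || []) {
    if (!allowed.has(f.Field) || f.Comparison !== "=") throw new Error("bad filter");
    where.push(`\`${f.Field}\` = ?`);      // value goes through a placeholder, never the SQL string
    params.push(f.Value);
  }

  const sql = `SELECT ${fields.map(f => `\`${f}\``).join(", ")} FROM ${table}` +
    (where.length ? ` WHERE ${where.join(" AND ")}` : "");
  return { sql, params };                  // e.g. pass to mysql2's connection.execute(sql, params)
}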
Truth be told, what you are doing is very unusual. You are effectively giving your web client full access to build queries for your database. You mention that you are validating the information before building the query, which is good, because the risk of SQL injection is very high. You are asking a very broad question, so I will respond with an equally broad answer:
No, there are no frameworks or standards to implementing this. The reason is because (unless you have a very specific reason for doing this), very few web applications are setup to give the web client such extensive control over the queries being built. Normally your backend APIs would intentionally be much more limited. You are effectively implementing an API method that says:
Give me the details of your query and I will build and execute it for you and return the result.
Normally standard operating procedure otherwise is to have much more specific and limited API methods. Rather than having a generic query builder you would have an API for each specific thing that has to happen:
Tell me how many records you want and a search value on this small handful of fields and I will return a list of matching users
Tell me how many records you want and which of these fields you want to sort by and I will return a list of matching books
That is not to say that there aren't perfectly valid reasons to do it the way you are trying to do it, but unless there is a reason why you specifically want to give users full control over the query building process, I think the first step is to refactor in a way that gives the web client less control, and your server-side application more.
It looks like you're reinventing a query language!
Why not allow your users to type SQL queries directly? Or mongodb queries, or whatever DBMS you use. There is much less overhead.
When it comes to security, a good practice is to setup a clone of your database (a read-only replica), and have your clients hit the read-only replica instead of the main database node.
Your main node and your read-only replica can stay in sync using replication. Any good DBMS supports it.
A good example of this is the Stack Exchange Data Explorer

Processing a large (12K+ rows) array in JavaScript

The project requirements are odd for this one, but I'm looking to get some insight...
I have a CSV file with about 12,000 rows of data and approximately 12-15 columns. I'm converting that to a JSON array and loading it via JSONP (it has to run client-side). It takes many seconds to do any kind of querying on the data set to return a smaller, filtered data set. I'm currently using JLINQ to do the filtering, but I'm essentially just looping through the array and returning a smaller set based on conditions.
Would webdb or indexeddb allow me to do this filtering significantly faster? Any tutorials/articles out there that you know of that tackles this particular type of issue?
http://square.github.com/crossfilter/ (no longer maintained, see https://github.com/crossfilter/crossfilter for a newer fork.)
Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records...
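For example, a small Crossfilter sketch for a data set of that size (the field names are placeholders for whatever columns the CSV actually has):

var cf = crossfilter(rows);                        // rows: the ~12K-row array parsed from the CSV
var byCategory = cf.dimension(function (d) { return d.category; });
var byAmount   = cf.dimension(function (d) { return d.amount; });

byCategory.filter('A');                            // constrain one dimension to a single value
var top50 = byAmount.top(50);                      // 50 matching rows, ordered by amount

byCategory.filterAll();                            // clear the filter again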
This reminds me of an article John Resig wrote about dictionary lookups (a real dictionary, not a programming construct).
http://ejohn.org/blog/dictionary-lookups-in-javascript/
He starts with server side implementations, and then works on a client side solution. It should give you some ideas for ways to improve what you are doing right now:
Caching
Local Storage
Memory Considerations
If you need to load an entire data object into memory before you apply some transform to it, I would leave IndexedDB and WebSQL out of the mix, as they both typically add complexity and reduce app performance.
For this type of filtering, a library like Crossfilter will go a long way.
Where IndexedDB and WebSQL can come into play in terms of filtering is when you don't need to load, or don't want to load, an entire dataset into memory. These databases are best utilized for their ability to index rows (WebSQL) and attributes (IndexedDB).
With in-browser databases, you can stream data into a database one record at a time and then cursor through it, one record at a time. The benefit for filtering is that this means you can leave your data on "disk" (a .leveldb in Chrome and a .sqlite database in FF) and filter out unnecessary records either as a pre-filtering step or as the filter itself.
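As a rough sketch of that cursor-based filtering with IndexedDB (the database, store, and field names are invented for the example):

var req = indexedDB.open('csv-cache', 1);
req.onupgradeneeded = function (e) {
  e.target.result.createObjectStore('rows', { autoIncrement: true });
};
req.onsuccess = function (e) {
  var db = e.target.result;
  var matches = [];
  db.transaction('rows').objectStore('rows').openCursor().onsuccess = function (ev) {
    var cursor = ev.target.result;
    if (!cursor) {                                               // end of the store
      console.log(matches.length + ' rows matched');
      return;
    }
    if (cursor.value.amount > 100) matches.push(cursor.value);   // the filter step
    cursor.continue();
  };
};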

Handling large grid datasets in JavaScript

What are some of the better solutions for handling large datasets (100K rows) on the client with JavaScript? In particular, if you have multi-column sort and search capabilities, how do you handle fetching (and pre-fetching) the data, client-side model binding (for display), and caching the data?
I would imagine a good solution would do some thoughtful work in the background. For instance, if the table is initially displaying N items, it might fetch 2N items, return the data to the user, and then go fetch the next 2N items in the background (even if the user hasn't requested them). As the user makes search/sort changes, it would throw out the prefetched data (or maybe even cache the initial base case) and do the same thing again.
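For what it's worth, a rough sketch of that fetch-then-prefetch idea (the /rows endpoint, page size, and render stand-in are placeholders):

const PAGE = 50;
let cache = [];                         // rows fetched so far, in order
let loading = Promise.resolve();        // serialise fetches so offsets stay consistent

async function fetchRows(offset, limit) {
  const res = await fetch(`/rows?offset=${offset}&limit=${limit}`);
  return res.json();                    // assumed to return an array of row objects
}

function ensure(count) {
  loading = loading.then(async () => {
    while (cache.length < count) {
      const more = await fetchRows(cache.length, PAGE);
      if (!more.length) break;          // server has no more rows
      cache = cache.concat(more);
    }
  });
  return loading;
}

function render(rows) { console.table(rows); }    // stand-in for the real grid rendering

async function showPage(page) {
  await ensure((page + 1) * PAGE);                // rows needed for the visible page
  render(cache.slice(page * PAGE, (page + 1) * PAGE));
  ensure((page + 2) * PAGE);                      // prefetch one page ahead in the background
}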
Can you share the best solutions you have seen?
Thanks
Use a jQuery table plugin like DataTables: http://datatables.net/
It supports server-side processing for sorting, filtering, and paging. And it includes pipelining support to prefetch the next x pages of records: http://www.datatables.net/examples/server_side/pipeline.html
Actually the DataTables plugin works 4 different ways:
1. With an HTML table, so you could send down a bunch of HTML and then have all the sorting, filtering, and paging work client-side.
2. With a JavaScript array, so you could send down a 2D array and let it create the table from there.
3. Ajax source - which is not really applicable to you.
4. Server-side, where you send data in JSON format to an empty table and let DataTables take it from there (a minimal configuration sketch follows).
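A minimal configuration sketch for that server-side mode (the /grid-data endpoint is an assumption; DataTables sends the sort, filter, and paging parameters, and the server returns only the current page of rows):

$('#grid').DataTable({
  serverSide: true,        // delegate sorting/filtering/paging to the server
  processing: true,        // show a "processing" indicator during requests
  ajax: '/grid-data',
  pageLength: 50
});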
SlickGrid does exactly what you're looking for. (Demo)
Using the AJAX data store, SlickGrid can handle millions of rows without flinching.
Since you tagged this with Ext JS, I'll point you to Ext.ux.LiveGrid if you haven't already seen it. The source is available, so you might have a look and see how they've addressed this issue. This is a popular and widely-used extension in the Ext JS world.
With that said, I personally think (virtually) loading that much data is useless as a user experience. Manually pulling a scrollbar around (jumping hundreds of records per pixel) is a far inferior experience to simply typing what you want. I'd much prefer some robust filtering/searching instead of presenting that much data to the user.
What if you went to Google and instead of a search box, it just loaded the entire internet into one long virtual list that you had to scroll through to find your site... :)
It depends on how the data will be used.
For a large dataset, where the browser's Find function was adequate, just returning a straight HTML table was effective. It takes a while to load, but the display is responsive on older, slower clients, and you never have to worry about it breaking.
When the client did the sorting and searching, and we weren't showing the entire table at once, I had the server send tab-delimited tables through XMLHttpRequest, parsed them in the browser with list = text.split('\n'), and updated the display with repeated calls to $('node').innerHTML = 'blah'. The JS engine can store long strings pretty efficiently. That ran a lot faster on the client than showing, hiding, and rearranging DOM nodes. Creating and destroying DOM nodes on the fly turned out to be really slow. Splitting each line into fields on demand seems to work; I haven't experimented with that degree of freedom.
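A rough sketch of that tab-delimited approach (the URL and element ID are placeholders, and no HTML escaping is shown):

var xhr = new XMLHttpRequest();
xhr.open('GET', '/table.tsv');
xhr.onload = function () {
  var rows = xhr.responseText.split('\n');               // one long string -> array of lines
  var html = '';
  for (var i = 0; i < rows.length; i++) {
    var fields = rows[i].split('\t');                    // split each line into fields on demand
    html += '<tr><td>' + fields.join('</td><td>') + '</td></tr>';
  }
  document.getElementById('grid-body').innerHTML = html; // one big DOM update
};
xhr.send();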
I've never tried the obvious pre-fetch & background trick, because these other methods worked well enough.
Check out this comprehensive list of data grids and spreadsheets.
For filtering/sorting/pagination purposes you may be interested in the excellent Handsontable, or DataTables as a free alternative.
If you simply need to display a huge list without any additional features, Clusterize.js should be sufficient.
