When a cached value has expired, or the cache has to be regenerated for any reason, and we have heavy traffic while no cache exists, there is a heavy load on MongoDB and the response time increases significantly. This is typically called the "dog-pile effect". Everything works well once the cache has been created.
I know that it's a very common problem which applies to all web applications using a database & cache system.
What should one do to avoid the dog-pile effect on a Node.js & MongoDB & Redis stack? What are best practices and common mistakes?
One fairly proven way to keep the dogs from piling up is to keep a "lock" (e.g. in Redis) that prevents the cache-populating logic from firing more than once. The first time the fetcher is called for a given piece of content, the lock is acquired for it and set to expire (e.g. with SET ... NX EX 60). Any subsequent invocation of the fetcher for that content fails to get the lock, so only one dog gets to the pile.
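A minimal sketch of that lock, assuming ioredis; fetchFromMongo and the key names are placeholders for your own fetcher:

const Redis = require('ioredis')
const redis = new Redis()

// fetchFromMongo() stands in for the expensive query we want to protect.
async function getCached(cacheKey, lockKey, fetchFromMongo) {
  for (let attempt = 0; attempt < 50; attempt++) {
    const cached = await redis.get(cacheKey)
    if (cached) return JSON.parse(cached)

    // SET ... NX EX: only one caller acquires the lock, and it auto-expires
    // in case this process dies before releasing it.
    const gotLock = await redis.set(lockKey, '1', 'NX', 'EX', 60)
    if (gotLock === 'OK') {
      try {
        const fresh = await fetchFromMongo()
        await redis.set(cacheKey, JSON.stringify(fresh), 'EX', 300)
        return fresh
      } finally {
        await redis.del(lockKey)
      }
    }

    // Lost the race: wait briefly and re-check the cache instead of hitting MongoDB.
    await new Promise((resolve) => setTimeout(resolve, 100))
  }
  return fetchFromMongo() // give up on waiting; still only a sketch
}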
The other thing you may want to put in place is some kind of rate limiting on the fetcher, regardless of the content. That's also quite easily doable with Redis - feel free to look it up or ask another question :)
I'd just serve the expired content until the new content is done caching, so that the database won't get stampeded.
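A sketch of that idea, where the cached entry carries its own soft expiry (the real Redis TTL is longer) and refreshCache is a hypothetical function that repopulates the key:

async function getWithStale(redis, key, refreshCache) {
  const raw = await redis.get(key)
  if (!raw) return refreshCache(key) // nothing cached yet at all

  const entry = JSON.parse(raw) // { data, softExpiry } stored by refreshCache
  if (Date.now() > entry.softExpiry) {
    // Serve the stale data right away and refresh in the background.
    refreshCache(key).catch((err) => console.error('cache refresh failed', err))
  }
  return entry.data
}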
In a web app using PouchDB, I have a slow-running function that finishes by updating a document in the DB. I want to move it off the main UI thread and into a web worker. However, we have lots of other code using PouchDB still in the main thread (e.g. the change event listener, but also code that deals with other documents). (For reference, the database size is on the order of 100MB; Vue 2 is used, so, in general, the UI can update when the data changes.)
This is where I seem to come unstuck immediately:
Shared memory is basically out, as all the browsers disable it by default
Even if it weren't, PouchDB is a class, and cannot be transferred(?).
Isolating all the db code, including the changes handler, into one web worker is a huge refactor; and then we still have the issue of having to pass huge chunks of data in and out of that web worker.
Moving all the code that uses the data into the web worker too, and just having the UI thread pass messages back and forth, is an even bigger refactor, and I've not thought through how it might interfere with Vue.
That seems to leave us with a choice of two extremes. Either rewrite the whole app from the ground up, possibly dropping Vue, or just do the slow, complex calculation in a web worker, then have it pass back the result, and continue to do the db.put() in the main UI thread.
Is it really an all or nothing situation? Are there any PouchDB "tricks" that allow working with web workers, and if so will we need to implement locking?
You're missing an option that I would choose in your situation. Write a simple adapter that allows your worker code to query the DB in the main thread via messages. Get your data, process it in the worker, and send it back.
You only need to "wrap" the methods that you need in the worker. I recommend writing a class or a set of async functions in your worker to keep the code readable.
You don't need to worry about the amount of data passed. Serialization and de-serialization are quite fast, and the transfer is basically a memcpy, so it does not take any appreciable time.
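A minimal sketch of such an adapter; the message shape and method names are illustrative, and db is the PouchDB instance you already have on the main thread:

// main thread: answer query messages from the worker using the existing db
const worker = new Worker('worker.js')
worker.onmessage = async (event) => {
  const { id, method, args } = event.data
  try {
    const result = await db[method](...args) // e.g. db.allDocs({ include_docs: true })
    worker.postMessage({ id, result })
  } catch (error) {
    worker.postMessage({ id, error: error.message })
  }
}

// worker.js: async wrappers that forward DB calls to the main thread
let nextId = 0
const pending = new Map()

self.onmessage = (event) => {
  const { id, result, error } = event.data
  const { resolve, reject } = pending.get(id)
  pending.delete(id)
  error ? reject(new Error(error)) : resolve(result)
}

function callDb(method, ...args) {
  return new Promise((resolve, reject) => {
    const id = nextId++
    pending.set(id, { resolve, reject })
    self.postMessage({ id, method, args })
  })
}

// usage inside the worker:
// const docs = await callDb('allDocs', { include_docs: true })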
I found this adapter plugin, which I guess counts as the "PouchDB trick" I was after: https://github.com/pouchdb-community/worker-pouch
It was trivial to add (see below), and has been used in production for 6-7 weeks, and appears to have fixed the problems we saw. (I say appears, as it is quite hard to see it having any effect, and we didn't have a good way to reproduce the slowdown problems users were seeing.)
const PouchDB = require('pouchdb-browser').default
const pouchdbWorker = require('worker-pouch')
PouchDB.adapter('worker', pouchdbWorker)
The real code is like this, but usePouchDBWorker has always been kept as true:
const PouchDB = require('pouchdb-browser').default
// const pouchdbDebug = require('pouchdb-debug')
if (usePouchDBWorker) {
  const pouchdbWorker = require('worker-pouch')
  PouchDB.adapter('worker', pouchdbWorker)
}
This code is used in both web app and Electron builds. The web app is never used with older web browsers, so read the github site if that might be a concern for your own use case.
Is it possible to detect updates on the server, that is, updates to the page (HTML) or to the styles (CSS), and make a request to the server to get the updated data? If so, how?
In short:
When your service worker script changes, it is set to replace the old one the user has as soon as the user comes online with your PWA.
You should reflect your changes in the array of items cached by your SW.
You should change the name of the cache, and remove the old cache.
As a result, your users will always have the latest version of your app.
In detail:
1) Change your SW to replace the one the user currently has.
Even the slightest change in your SW is enough for it to be considered a new version, so it will take over the old one. This quote from Google Web Developers explains it:
Update your service worker JavaScript file. When the user navigates to your site, the browser tries to redownload the script file that defined the service worker in the background. If there is even a byte's difference in the service worker file compared to what it currently has, it considers it new.
Best practice would be keeping a version number somewhere in your SW and updating it programmatically as the content changes. It can even be just a quoted string; it does not matter, it will still work. Remember: even a byte's difference is enough.
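For example (a sketch; the names are arbitrary):

// sw.js - bump this string whenever your content changes; even a one-byte
// difference makes the browser treat the whole SW as a new version.
const CACHE_VERSION = 'v42'
const CACHE_NAME = 'my-pwa-' + CACHE_VERSION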
2) Reflect your changes to the list of items to be cached.
You keep a list of assets to be cached in your SW, so reflect added/updated/removed assets in that list. Best practice would be setting up a cache buster for your assets, because the SW sits a layer behind the browser cache, so to speak.
In other words: fetch requests from the SW go through the browser cache, so the SW may pick up your assets from the browser cache instead of the server, and keep those cached until the SW changes again.
In that case you will end up with half of your users running your PWA with the new assets while the other half suffers from inexplicable bugs. And you will have a wonderful time with over-exposure to complaints and the frustration of being unable to find the cause or a way to reproduce them.
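One way to sidestep the browser cache while precaching is to request each asset with cache: 'reload'; a sketch, reusing CACHE_NAME from above and an illustrative asset list:

const ASSETS = ['/index.html', '/css/style.css?v=42', '/js/app.js?v=42']

self.addEventListener('install', (event) => {
  event.waitUntil(
    caches.open(CACHE_NAME).then((cache) =>
      // cache: 'reload' tells fetch to bypass the HTTP cache for these requests
      cache.addAll(ASSETS.map((url) => new Request(url, { cache: 'reload' })))
    )
  )
})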
3) Replace your old cache, do not merge it.
If you do not change the name of the cache for your updated list of assets, the new assets will be merged with the old ones.
The merge happens in such a way that old and removed assets are kept, new ones are added, and changed ones are replaced. While things may seem to work fine this way, you will be accumulating old assets on your users' devices. And the storage space you get is not infinite.
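A sketch of the usual cleanup in the activate handler, deleting every cache whose name is not the current CACHE_NAME:

self.addEventListener('activate', (event) => {
  event.waitUntil(
    caches.keys().then((names) =>
      Promise.all(
        names
          .filter((name) => name !== CACHE_NAME) // keep only the current cache
          .map((name) => caches.delete(name))
      )
    )
  )
})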
4) Everyone is happy
It may look tedious to implement and keep track of all the things mentioned, but I assure you that otherwise you will have far greater and far more unpleasant things to deal with. And unsatisfied users will be a nice plus.
So I encourage you to design your SW once and for good.
Yes, it is possible, by invalidating the cache in your service worker:
https://developers.google.com/web/fundamentals/getting-started/primers/service-workers#update-a-service-worker
Note also that at the moment there is an open issue, Service worker JavaScript update frequency (every 24 hours?), as the service worker itself may not be updated because of the browser cache.
However, you may also want to take a look at sw-precache, which does this for you (for example through a gulp task).
Have a look at LiveJS to dynamically update the css.
I believe the solution they use is to add a GET parameter with a timestamp to the CSS or HTML page, e.g. /css/style.css?time=1234, and calculate a hash of the result. If the hash has changed since the last check, update the CSS; otherwise, keep polling.
A similar implementation could be built for HTML, but I have not seen any similar projects for it. You should have a look at Ajax if you want to automatically update data in your page.
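A rough sketch of that polling idea for CSS (the URL, interval and hashing via crypto.subtle are illustrative; crypto.subtle needs a secure context):

let lastHash = null

async function checkCss() {
  const response = await fetch('/css/style.css?time=' + Date.now())
  const text = await response.text()
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text))
  const hash = Array.from(new Uint8Array(digest)).join(',')

  if (lastHash && hash !== lastHash) {
    // Reload the stylesheet by pointing its href at a cache-busted URL.
    const link = document.querySelector('link[rel="stylesheet"]')
    if (link) link.href = '/css/style.css?time=' + Date.now()
  }
  lastHash = hash
}

setInterval(() => checkCss().catch(console.error), 5000)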
I have a UI autosuggest component that performs an AJAX request as the user types. For example, if the user types mel, the response could be:
{
  suggestions: [{
    id: 18,
    suggestion: 'Melbourne'
  }, {
    id: 7,
    suggestion: 'East Melbourne'
  }, {
    id: 123,
    suggestion: 'North Melbourne'
  }]
}
The UI component implements client-side caching. So, if the user now types b (results for melb are retrieved) and then hits Backspace, the browser already has the results for mel in memory, so they are immediately available. In other words, every client makes at most one AJAX call for any given input.
Now I'd like to add server-side caching on top of this. So, if one client performs an AJAX call for mel, and let's say there is some heavy computation going on to prepare the response, other clients would get the results without executing this heavy computation again.
I could simply keep a hash of queries and results, but I'm not sure that is the best way to achieve this (memory concerns). There are ~20,000 suggestions in the data set.
What would be the best way to implement the server side caching?
You could implement a simple cache with an LRU (least recently used) discard algorithm. Basically, set a few thresholds (for example: 100,000 items, 1 GB) and then discard the least recently used item (i.e., the item that is in cache but was last accessed longer ago than any of the other ones). This actually works pretty well, and I'm sure you can use an existing Node.js package out there.
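A minimal sketch of that idea on top of a Map, which keeps insertion order (an existing package such as lru-cache does the same with more care):

class LruCache {
  constructor(maxItems) {
    this.maxItems = maxItems
    this.map = new Map()
  }

  get(key) {
    if (!this.map.has(key)) return undefined
    const value = this.map.get(key)
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key)
    this.map.set(key, value)
    return value
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key)
    this.map.set(key, value)
    if (this.map.size > this.maxItems) {
      // Evict the least recently used entry (the first key in insertion order).
      this.map.delete(this.map.keys().next().value)
    }
  }
}

const suggestionCache = new LruCache(100000)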
If you're going to be building a service that has multiple frontend servers, it might be easier and simpler to just set up memcached on a server (or even put it on a frontend server if you have a relatively low load). It's got an extremely simple TCP/IP protocol, and there are memcached clients available for Node.js.
Memcached is easy to set up and will scale for a very long time. Keeping the cache on separate servers also has the potential benefit of speeding up requests for all frontend instances, even the ones that have not received a particular request before.
No matter what you choose to do, I would recommend keeping the caching out of the process that serves the requests. That makes it easy to just kill the cache if you have caching issues or need to free up memory for some reason.
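If you go the memcached route, a sketch using the memcached client for Node.js (assuming that package; the key scheme, lifetime and computeSuggestions are placeholders):

const Memcached = require('memcached')
const memcached = new Memcached('localhost:11211')

function getSuggestions(prefix, computeSuggestions, callback) {
  const key = 'suggest:' + prefix
  memcached.get(key, (err, cached) => {
    if (!err && cached) return callback(null, cached)
    const result = computeSuggestions(prefix) // the heavy computation
    // cache for 10 minutes so other frontends can reuse it
    memcached.set(key, result, 600, () => callback(null, result))
  })
}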
(memory concerns). There are ~20000 suggestions in the data set.
20,000 results? Have you thought about how much memory that will actually take? My response assumes you're talking about 20,000 short strings, as presented in the example. I feel like you're optimizing for a problem you don't have yet.
If you're talking about a reasonably static piece of data, just keep it in memory. Even if you want to store it in a database, just keep it in memory. Refresh it periodically if you must.
If it's not static, just try and read it from the database on every request first. Databases have query caches and will chew through a 100KB table for breakfast.
Once you're actually getting enough hits for this to become a real issue, don't cache it yourself. I have found that if you actually have a real need for a cache, other people have written it better than you would have. But if you really need one, go for an external one like Memcached or even something like Redis. Keeping that stuff external can make testing and scaling a heap easier.
But you'll know when you actually need a cache.
I have a Node server for loading certain scripts that can be written by anyone. I understand that when I fire up my Node server, modules load for the first time in the global scope. When someone requests a page, it gets handled by the "start server" callback, and I can use all the already-loaded modules per request. But I haven't encountered a script where global variables get changed at request time and affect every other instance in the process (maybe there is one).
My question is, how safe is it, in terms of server crashes, to alter global data? Also, suppose that I have written a proper locking mechanism that will "pause" the server for all instances for a very short amount of time until the proper data is loaded.
Node.js is single-threaded, so it's impossible for two separate requests to alter a global variable simultaneously. In theory, then, it's safe.
However, if you're doing things like keeping user A's data temporarily in a variable and then, when user A later submits another request, using that variable, be aware that user B may make a request in between, potentially altering user A's data.
For such cases, keeping global values in arrays or objects is one way of separating user data. Another strategy is to use a closure, which is a common practice in callback-intensive or event/promise-oriented libraries such as socket.io.
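A small sketch of that difference, with hypothetical names:

// Risky: one global that every request shares - user B's request can
// overwrite user A's data between A's two requests.
let pendingUpload = null

// Safer: keep per-user state keyed by something from the request
// (session id, user id, ...), so concurrent users cannot clobber each other.
const pendingUploads = new Map()

function startUpload(userId, data) {
  pendingUploads.set(userId, data)
}

function finishUpload(userId) {
  const data = pendingUploads.get(userId)
  pendingUploads.delete(userId)
  return data
}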
When it comes to multithreading or multiprocessing, message-passing-style APIs like Node's built-in cluster module have the same guarantee of not clobbering globals, since each process has its own globals. There are several multithreading modules that are implemented similarly - one Node instance per thread. However, shared-memory-style APIs can't make such guarantees, since each thread is now a real OS thread which may preempt another and clobber its memory. So if you ever decide to try out one of the multithreading modules, be aware of this issue.
It is possible to implement fake shared memory using message passing, though - sort of like how we do it with Ajax or socket.io. So I'd personally avoid shared-memory-style multithreading unless I really, really need to work cooperatively on a very large dataset that would bog down a message-passing architecture.
Then again, remember: the web is a giant message-passing architecture, with the messages being HTML, XML and JSON. So message passing scales to Google size.
I've been getting more and more into high-level application development with JavaScript/jQuery. I've been trying to learn more about the JavaScript language and dive into some of its more advanced features. I was just reading an article on memory leaks when I read this section:
JavaScript is a garbage collected language, meaning that memory is allocated to objects upon their creation and reclaimed by the browser when there are no more references to them. While there is nothing wrong with JavaScript's garbage collection mechanism, it is at odds with the way some browsers handle the allocation and recovery of memory for DOM objects.
This got me thinking about some of my coding habits. For some time now I have been very focused on minimizing the number of requests I send to the server, which I feel is just a good practice. But I'm wondering if sometimes I don't go too far. I am very unaware of any kind of efficiency issues/bottlenecks that come with the JavaScript language.
Example
I recently built an impound management application for a towing company. I used the jQuery UI dialog widget and populated a datagrid with specific ticket data. Now, this sounds very simple on the surface... but there is a LOT of data being passed around here.
(and now for the question... drumroll please...)
I'm wondering what the pros/cons are for each of the following options.
1) Make only one request for a given ticket and store it permanently in the DOM, simply showing/hiding the modal window; this means only one request is sent out per ticket.
2) Make a request every time a ticket is opened and destroy it when it's closed.
My natural inclination was to store the tickets in the DOM - but I'm concerned that this will eventually start to hog a ton of memory if the application goes a long time without being reset (which it will).
I'm really just looking for pros/cons for both of those two options (or something neat I haven't even heard of =P).
The solution here depends on the specifics of your problem, as the 'right' answer will vary based on length of time the page is left open, size of DOM elements, and request latency. Here are a few more things to consider:
Keep only the newest n items in the cache. This works well if you are only likely to redisplay items in a short period of time.
Store the data for each element instead of the DOM element, and reconstruct the DOM on each display.
Use HTML5 Storage to store the data instead of DOM or variable storage. This has the added advantage that data can be stored across page requests.
Any caching strategy will need to consider when to invalidate the cache and re-request updated data. Depending on your strategy, you will need to handle conflicts that result from multiple editors.
The best way is to get started using the simplest method, and add complexity to improve speed only where necessary.
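For instance, a sketch combining the first two ideas, caching only the newest n tickets' data and rebuilding the DOM on each display (fetchTicket and renderTicket are hypothetical helpers):

const ticketCache = new Map() // insertion-ordered, so the oldest entry is first
const MAX_TICKETS = 20

async function showTicket(ticketId) {
  let data = ticketCache.get(ticketId)
  if (!data) {
    data = await fetchTicket(ticketId) // AJAX call to the server
    ticketCache.set(ticketId, data)
    if (ticketCache.size > MAX_TICKETS) {
      ticketCache.delete(ticketCache.keys().next().value) // drop the oldest
    }
  }
  // Build the dialog contents fresh each time instead of keeping nodes around.
  $('#ticket-dialog').html(renderTicket(data)).dialog('open')
}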
The third path would be to store the data associated with a ticket in JS, and create and destroy DOM nodes as the modal window is summoned/dismissed (jQuery templates might be a natural solution here.)
That said, the primary reason you avoid network traffic seems to be user experience (the network is slower than RAM, always). But that experience might not actually be degraded by making a request every time, if it's something the user intuits involves loading data.
I would say number 2 would be best, because that way, if the ticket changes after you open it, that change will appear the next time the ticket is opened.
One important factor is the number of redraws/reflows that are triggered by DOM manipulation. It's much more efficient to build up your content changes and insert them in one go than to do it incrementally, since each increment can cause a redraw/reflow.
See: http://www.youtube.com/watch?v=AKZ2fj8155I to better understand this.
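To illustrate the difference (the row data and markup are made up):

const rows = [{ name: 'Ticket 1' }, { name: 'Ticket 2' }]

// Incremental: each append can trigger another redraw/reflow.
rows.forEach((row) => {
  $('#grid').append('<tr><td>' + row.name + '</td></tr>')
})

// Batched: build the markup first, touch the DOM once.
const html = rows.map((row) => '<tr><td>' + row.name + '</td></tr>').join('')
$('#grid').append(html)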