Node server randomly spikes to 100% then crashes. How to diagnose? - javascript

I'm making an online browser game with WebSockets and a Node server. With around 20-30 players, CPU usage is usually around 2% and RAM at 10-15%. I'm just using a cheap DigitalOcean droplet to host it.
However, every 20-30 minutes or so, the server's CPU usage will spike to 100% for about 10 seconds, and then the process crashes. Up until that moment, the CPU is usually hovering around 2% and the game is running very smoothly.
I can't tell for the life of me what is triggering this, as there are no errors in the logs and nothing in the game that I can see causes it. It just seems to be a random event that brings the server down.
There are also some smaller spikes that don't bring the server down but soon resolve themselves.
I don't think I'm blocking the event loop anywhere, and I don't have any execution paths that seem to be long-running. The traffic to and from the server is usually two packets per second per user, so not much bandwidth is used at all. And the server is mostly just a relay, with little processing of packets other than validation, so I'm not sure what code path could be so intensive.
What can I do to profile this and find out where to begin investigating what is causing these spikes? I'd like to imagine there's some code path I forgot about that is surprisingly slow under load, or maybe I'm missing a Node flag that would resolve it, but I don't know.
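From what I've read, V8's built-in sampling profiler can be left running until a spike happens (a sketch; app.js stands in for my entry point):

    # run the server with V8's sampling profiler enabled
    node --prof app.js

    # after a spike, turn the isolate-*.log it writes into a readable summary
    node --prof-process isolate-0x*.log > processed.txt

The processed output ranks functions by CPU ticks, which I assume would at least name the hot code path.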

I think I might have figured it out.
I'm mostly using websockets for my game, and while running htop I noticed that if someone sends large packets (performing a ton of actions in a short amount of time), the CPU spikes to 100%. I was wondering why that was when I remembered I was using a binary packer to reduce bandwidth usage.
I tried changing the parser to JSON instead, so the packets are not compressed and packed, and regardless of how large the packets were, the CPU usage stayed at 2% the entire time.
So I think what was causing the crash was one player sending a lot of data in a short amount of time, overwhelming the server with having to pack all of it and send it out in time.
This may not be the actual answer, but it's at least something that needed to be fixed. Thankfully the game uses very little bandwidth as it is, and bandwidth is not the bottleneck, so I may just leave it as JSON.
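The change itself was tiny; roughly this, where packer stands in for the binary-packing library I was using and ws is the WebSocket connection:

    // before: binary packing on every outgoing message (where the CPU spiked)
    // ws.send(packer.pack(message));

    // after: plain JSON, which stayed at ~2% CPU regardless of packet size
    ws.send(JSON.stringify(message));

    ws.on('message', function (data) {
      var message = JSON.parse(data);
      // validate and relay as before...
    });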
The only problem is that with JSON encoding, users can read the packets in the Chrome developer console's network tab, which I don't like. It makes it a lot easier to figure out how the game works and potentially find cheats/exploits.

Related

Node.js and fragmentation

Background: I come from the Microsoft world, where I used to host websites on IIS. Experience taught me to recycle my application pool once a day in order to eliminate weird problems due to fragmentation. Recycling the app pool basically means restarting your application without restarting all of IIS. I also watched a lecture explaining how Microsoft had greatly reduced fragmentation in .NET 4.5.
Now I'm deploying a Node.js application to a production environment, and I have to make sure it works flawlessly all the time. I originally planned to have my app restart once a day. Then I did some research looking for clues about fragmentation problems in Node.js. The only thing I've found is a short paragraph from an article describing GC in V8:
"To ensure fast object allocation, short garbage collection pauses, and no memory fragmentation, V8 employs a stop-the-world, generational, accurate garbage collector."
This statement is really not enough for me to give up on building a restart mechanism for my app, but on the other hand I don't want to do the work if there is no problem.
So my question is:
Should or shouldn't I restart my app every now and then in order to prevent fragmentation?
Implementing a server restart before you know that memory consumption is indeed a problem is premature optimization. As such, I don't think you should do it until you actually find that it is a problem. You will likely find more important issues to optimize for than memory consumption.
To figure out if you need a server restart, I recommend doing the following:
Set up a monitoring tool like https://newrelic.com/ that lets you monitor your performance.
Monitor your memory continuously. Try to see whether there is a steady increase in the amount of memory consumed, or whether it levels off (see the sketch after this list).
Decide upon an acceptable threshold before you need to act. For example, once your app consumes 60% of system memory, you need to start thinking about a server restart and decide upon the restart interval.
Decide whether you are OK with having "downtime" while restarting the server or not. If you don't want downtime, you may need to build a proxy layer to direct traffic.
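If you want a dependency-free starting point while you evaluate tools, a periodic log line is enough to see the trend (a sketch; the 60% threshold is just the example above):

    var os = require('os');

    // log heap and system memory usage once a minute
    setInterval(function () {
      var mem = process.memoryUsage();
      var systemUsed = 1 - os.freemem() / os.totalmem();

      console.log('rss=' + mem.rss + ' heapUsed=' + mem.heapUsed +
                  ' system=' + (systemUsed * 100).toFixed(1) + '%');

      // act once you cross whatever threshold you decided on
      if (systemUsed > 0.6) {
        console.warn('memory threshold exceeded; consider a scheduled restart');
      }
    }, 60 * 1000);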
In general, I'd recommend server restarts for all dynamic, garbage-collected languages. This is fairly common in large applications of this type. It is almost inevitable that a small mistake somewhere in your code base, or in one of the libraries you depend on, will leak memory. Even if you fix one leak, you'll get another one eventually. This may frustrate your team, which basically leads to a server restart policy and a definition of what is acceptable with regard to memory consumption for your application.
I agree with @Parris. You should probably figure out whether you actually need a restart policy first. I would suggest using pm2 (docs here). Even if you don't want to sign up for Keymetrics, it's a pretty good little process manager and really quick to set up. You can get a report of memory usage from the command line. It looks something like this:
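(Roughly these commands; my_app and the count of 8 instances are placeholders for your app name and core count:)

    # start the app in cluster mode, one instance per core
    pm2 start app.js -i 8 --name my_app

    # show each process with its CPU and memory usage
    pm2 list

    # live CPU/memory dashboard in the terminal
    pm2 monit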
Also, if you start in cluster mode like above, you can call pm2 restart my_app, and the first instance will probably be back up before the last one is taken offline (this is an added benefit; the real reason for having 8 processes is to utilize all 8 cores). If you are adamant about avoiding downtime, you could restart them one by one according to id.
I agree with @Parris that this seems like premature optimization. Also, restarting is not a solution to the underlying problem; it's a treatment for the symptoms.
If memory errors are a prevalent issue for your Node application, then some thought as to why this fragmentation occurs in your program in the first place could be a valuable effort. Understanding why memory errors occur after a program has been running for a long period of time, and refactoring the architecture of your program to solve the root of the problem, is a better solution in my eyes than just addressing the symptoms.
I believe two things will benefit you.
Immutable objects will help a lot. They are much more predictable than mutable objects and are not affected by how long the project has been live. Also, since immutable objects are read-only blocks of memory, they are faster than mutable objects, where the runtime has to spend resources deciding whether to read or write the memory block that stores the object. I currently use the library called Immutable.js and it works well for me. There are other ones as well, like deep-freeze; however, I have never used it.
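For example, the basic Immutable.js pattern looks like this (a sketch; updates return new objects instead of mutating in place):

    var Immutable = require('immutable');

    // application state as an immutable map
    var state = Immutable.Map({ score: 0 });

    // 'updating' returns a new map; the old one is untouched and can be GC'd
    var next = state.set('score', 10);

    console.log(state.get('score')); // 0
    console.log(next.get('score'));  // 10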
Make sure to manage your application's processes correctly; memory leaks are the second big contributor to this problem that I have experienced. Again, this is solved by thinking about how your application is structured and how user events are handled: make sure that once something is no longer being used by the client, it is properly removed from the heap. If it is not, the heap keeps growing until all memory is consumed, causing the application to crash. (Node itself is a C++ program driven by Google's V8 JavaScript engine, and the heap lives inside V8's memory scheme.)
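A common concrete case is per-client event listeners that never get removed (a sketch; gameEvents is a hypothetical shared emitter, and socket stands for whatever per-client connection object you have):

    // register a per-connection handler...
    function onTick(data) {
      socket.send(JSON.stringify(data));
    }
    gameEvents.on('tick', onTick);

    // ...and remove it when the client disconnects; otherwise the emitter
    // keeps the closure (and everything it references) alive forever
    socket.on('close', function () {
      gameEvents.removeListener('tick', onTick);
    });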
You can use Node.js's process.memoryUsage() to monitor memory usage. As for how V8 manages the heap, it offers two collectors: Scavenge, which is very quick but incomplete, and Mark-Sweep, which is slower but frees up all non-referenced memory.
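If you want to watch those two collectors in action, V8 can log every collection (run against your own entry point):

    node --trace-gc app.js

Each line of output shows the heap size before and after a Scavenge or Mark-Sweep pass, so heap growth that never gets reclaimed stands out quickly.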
Refer to this blog post for more info on how to manage your heap and your memory in V8, which runs Node.js.
So the responsible approach is to keep a close eye on open processes, develop a deep understanding of the heap, and know how to free non-referenced memory blocks. Building your project with this in mind also makes it a lot more scalable.

Report browser crash to server

I want to add a feature to my web app that may cause the browser to run out of memory and crash on computers low on RAM. Is there any way I can report to the server when this happens, to determine how many users it affects?
EDIT:
The specific problem is that the amount of memory used scales with how far the user is zoomed in. To cope with this, we will limit how much users can zoom in. We have an idea of how much to limit it based on in-house tests, but we want to know whether the browser tab crashes for any users after we release this.
We could reduce the memory consumption of zooming, but doing so would entail a tremendous amount of refactoring and would probably incur negative, possibly unacceptable, performance losses.
Note that this is not "we know this happens, but we want to know how much", and more of "we don't think this will happen, but we want to know if it does."
There is no real way to get this information. If a server stops receiving data (sent over a websocket or the like), it has no way of telling what caused the stop: a browser tab being purposefully closed, or a crash.
You could potentially measure the speed of a loop and report it to the server, such as the frame rate in a requestAnimationFrame callback, and check for a downward trend before the connection closes. But this is laden with pitfalls and potentially adds work to the client that may only add to your already intensive code.
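A rough sketch of that idea, for what it's worth (the /heartbeat endpoint and the 5-second report interval are placeholders):

    var lastTime = performance.now();
    var lastReport = 0;

    function tick(now) {
      var frameMs = now - lastTime;
      lastTime = now;

      // report a heartbeat every ~5 seconds with the latest frame time;
      // heartbeats that stop after a run of very long frames hint at a
      // crash rather than a deliberate close
      if (now - lastReport > 5000) {
        lastReport = now;
        navigator.sendBeacon('/heartbeat', JSON.stringify({ frameMs: frameMs }));
      }

      requestAnimationFrame(tick);
    }

    requestAnimationFrame(tick);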

Why does web worker performance sharply decline after 30 seconds?

I'm trying to improve the performance of a script when executed in a web worker. It's designed to parse large text files in the browser without crashing. Everything works pretty well, but I notice a severe difference in performance for large files when using a web worker.
So I conducted a simple experiment. I ran the script on the same input twice. The first run executed the script in the main thread of the page (no web workers). Naturally, this causes the page to freeze and become unresponsive. For the second run, I executed the script in a web worker.
Script being executed
Test runner page
For small files in this experiment (< ~100 MB), the performance difference is negligible. However, on large files, parsing takes about 20x longer in the worker thread.
The main-thread run is as expected: it takes only about 11 seconds to parse the file, and the performance is fairly steady.
The worker-thread run is much more surprising: throughput is jagged for the first 30 seconds, which is normal (the jag is caused by the slight delay in sending the results to the main thread after every chunk of the file is parsed). However, parsing slows down rather abruptly at 30 seconds. (Note that I'm only ever using a single web worker for the job; never more than one worker thread at a time.)
I've confirmed that the delay is not in sending the results to the main thread with postMessage(). The slowdown is in the tight loop of the parser, which is entirely synchronous. For reasons I can't explain, that loop is drastically slowed down and it gets slower with time after 30 seconds.
But this only happens in a web worker. As you've seen above, the same code in the main thread runs very smoothly and quickly.
Why is this happening? What can I do to improve performance? (I don't expect anyone to fully understand all 1,200+ lines of code in that file. If you do, that's awesome, but I get the feeling this is more related to web workers than my code, since it runs fine in the main thread.)
System: I'm running Chrome 35 on Mac OS 10.9.4 with 16 GB memory; quad-core 2.7 GHz Intel Core i7 with 256 KB L2 cache (per core) and L3 Cache of 6 MB. The file chunks are about 10 MB in size.
Update: I just tried it on Firefox 30, and it did not experience the same slowdown in a worker thread (though it was slower than Chrome when run in the main thread). However, trying the same experiment with an even larger file (about 1 GB) yielded a significant slowdown after about 35-40 seconds (it seems).
Tyler Ault suggested one possibility on Google+ that turned out to be very helpful.
He speculated that using FileReaderSync in the worker thread (instead of the plain ol' async FileReader) was not providing an opportunity for garbage collection to happen.
Changing the worker thread to use FileReader asynchronously (which intuitively seems like a performance step backwards) accelerated the process back up to just 37 seconds, right where I would expect it to be.
I haven't heard back from Tyler yet and I'm not entirely sure I understand why garbage collection would be the culprit, but something about FileReaderSync was drastically slowing down the code.
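For anyone curious, the change amounted to something like this inside the worker (a sketch; parse() is the existing synchronous parsing loop, and the chunk size matches the 10 MB mentioned above):

    var CHUNK_SIZE = 10 * 1024 * 1024;

    function parseChunks(file, offset) {
      if (offset >= file.size) return;

      var reader = new FileReader(); // async, unlike FileReaderSync
      reader.onload = function (e) {
        parse(e.target.result);            // the existing synchronous parser
        postMessage({ progress: offset }); // report back to the main thread
        parseChunks(file, offset + CHUNK_SIZE);
      };
      reader.readAsText(file.slice(offset, offset + CHUNK_SIZE));
    }

    parseChunks(file, 0);

Returning to the worker's event loop between chunks is presumably what gives the garbage collector a chance to run.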
What hardware are you running on? You may be running into cache-thrashing problems with your CPU. For example, if the CPU cache is 1 MB per core (just an example) and you keep working with data that continually replaces the cache (cache misses), then you will suffer slowdowns; this is quite common in multithreaded systems, and in I/O transfers too. These systems also tend to have some OS overhead for the thread contexts, so if lots of threads are being spawned, you may be spending more time managing the contexts than the threads spend 'doing work'. I haven't yet looked at your code, so I could be way off, but my guess is a memory issue, just due to what your application is doing. :)
Oh, and how to fix it: try making the blocks of execution into small, single chunks that match the hardware. Minimize the number of threads in use at once; try to keep them to 2-3x the number of cores in the hardware (this really depends on what sort of hardware you have). Hope that helps.

WebRTC - remove/reduce latency between devices that are sharing their videos stream?

I'm sorry for not posting any code, but I'm trying to learn more about latency and WebRTC. What is the best way to remove latency between two or more devices that are sharing a video stream?
Or, at least, to reduce latency as much as possible?
Thinking about it, I imagined just setting the devices' clocks to the same time and delaying the requests from the server. Is that the real trick?
Latency is a function of the number of steps on the path between the source (microphone, camera) and the output (speakers, screen).
Changing clocks will have zero impact on latency.
The delays you have include:
device internal delays - waiting for screen vsync, etc...; nothing much to be done here
device interface delays - a shorter cable will save you some time, but nothing measurable
software delays - your operating system and browser; you might be able to do something here, but you probably don't want to, trust that your browser maker is doing what they can
encoding delays - a faster processor helps a little here, but the biggest concern is for things like audio, the encoder has to wait for a certain amount of audio to accumulate before it can even start encoding; by default, this is 20ms, so not much; eventually, you will be able to request shorter ptimes (what the control is called) for Opus, but it's early days yet
decoding delays - again, a faster processor helps a little, but there's not much to be done
jitter buffer delay - audio in particular requires a little bit of extra delay at a receiver so that any network jitter (fluctuations in the rate at which packets are delivered) doesn't cause gaps in audio; this is usually outside of your control, but that doesn't mean that it's completely impossible
resynchronization delays - when you are playing synchronized audio and video, if one gets delayed for any reason, playback of the other might be delayed so that the two can be played out together; this should be fairly small, and is likely outside of your control
network delays - this is where you can help, probably a lot, depending on your network configuration
You can't change the physics of the situation when it comes to the distance between two peers, but there are a lot of network characteristics that can change the actual latency:
direct path - if, for any reason, you are using a relay server, then your latency will suffer, potentially a lot, because every packet doesn't travel directly, it takes a detour via the relay server
size of pipe - trying to cram high resolution video down a small pipe can work, but getting big intra-frames down a tiny pipe can add some extra delays
TCP - if UDP is disabled, falling back to TCP can have a pretty dire impact on latency, mainly due to a special exposure to packet loss, which requires retransmissions and causes subsequent packets to be delayed until the retransmission completes (this is called head of line blocking); this can also happen in certain perverse NAT scenarios, even if UDP is enabled in theory, though most likely this will result in a relay server being used
larger network issues - maybe your peers are on different ISPs and those ISPs have poor peering arrangements, so that packets take a suboptimal route between peers; traceroute might help you identify where things are going
congestion - maybe you have some congestion on the network; congestion tends to cause additional delay, particularly when it is caused by TCP, because TCP tends to fill up tail-drop queues in routers, forcing your packets to wait extra time; you might also be causing self-congestion if you are using data channels, the congestion control algorithm there is the same one that TCP uses by default, which is bad for real-time latency
I'm sure that's not a complete taxonomy, but this should give you a start.
I don't think you can do much to improve latency besides being on a better network with higher bandwidth and lower latency. If you're on the same network or Wi-Fi, there should be very little latency.
Latency is also higher when your devices have little processing power, since they don't decode the video as quickly, but there's not much you can do about that: it all happens in the browser.
What you could do is try different codecs. To do that, you have to manipulate the SDP before it is sent out and reorder or remove the codecs in the m=audio or m=video line. (There's not much to choose from in video codecs, though; just VP8.)
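A minimal sketch of that munging, assuming a preferCodec() helper you would write to reorder the payload types on the m= line:

    peerConnection.createOffer().then(function (offer) {
      // preferCodec is hypothetical: it rewrites the SDP string so the
      // preferred codec's payload type comes first on the m=audio line
      var sdp = preferCodec(offer.sdp, 'audio', 'opus');
      return peerConnection.setLocalDescription({ type: 'offer', sdp: sdp });
    }).then(function () {
      // send peerConnection.localDescription to the remote peer
      // over your signaling channel as usual
    });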
You can view the performance of the codecs and the network with the tool that ships with Chrome; just type this into the URL bar:
chrome://webrtc-internals/

Best practice: very long running polling processes in Javascript?

I have a touch screen kiosk application that I'm developing that will be deployed on the latest version of Chrome.
The application will need to make AJAX calls to a web service every 10 minutes or so to pull through any updated content.
As it's a kiosk application, the page is unlikely to be reloaded very often and theoretically, unless the kiosk is turned off, the application could run for days at a time.
I guess my concern is memory usage, and whether or not a very long running setTimeout loop would chew through a large amount of memory given sufficient time.
I'm currently considering the use of Web Workers and I'm also going to look into Web Sockets but I was wondering if anyone had any experience with this type of thing?
Cheers,
Terry
The browser has a garbage collector, so no problems there, as long as you don't introduce memory leaks through bad code. Here's an article and another article about memory leak patterns. That should get you started on how to program efficiently and shoot down leaky code.
Also, you have to consider the DOM. Someone on SO once said that "things that are not on screen should be removed and not just hidden". This not only removes the element from a viewing perspective, but actually removes it from the DOM, removes its handlers, and frees the memory it used.
As for the setTimeout, lengthen the interval between calls. Too fast and you will chew up memory quickly (and render the page quite... laggy). I just tested code for a timer-based "hashchange" detection, and even on Chrome it makes the page rather slow.
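For a 10-minute poll like the one described, a self-rescheduling setTimeout is the usual pattern (a sketch; the URL and update handler are placeholders):

    var POLL_INTERVAL = 10 * 60 * 1000; // 10 minutes

    function poll() {
      var xhr = new XMLHttpRequest();
      xhr.open('GET', '/api/content', true);
      xhr.onload = function () {
        applyUpdates(JSON.parse(xhr.responseText)); // your update handler
      };
      xhr.onloadend = function () {
        // reschedule only after the request finishes, so calls never overlap
        setTimeout(poll, POLL_INTERVAL);
      };
      xhr.send();
    }

    poll();

Each timeout is created only after the previous request completes, so there is no growing queue of pending callbacks to leak.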
Also, research known Chrome bugs and keep up to date on them.
