I'm trying to improve the performance of a script when executed in a web worker. It's designed to parse large text files in the browser without crashing. Everything works pretty well, but I notice a severe difference in performance for large files when using a web worker.
So I conducted a simple experiment. I ran the script on the same input twice. The first run executed the script in the main thread of the page (no web workers). Naturally, this causes the page to freeze and become unresponsive. For the second run, I executed the script in a web worker.
Script being executed
Test runner page
For small files in this experiment (< ~100 MB), the performance difference is negligible. However, on large files, parsing takes about 20x longer in the worker thread:
The blue line (parsing in the main thread) is what I expected: it should only take about 11 seconds to parse the file, and the performance is fairly steady:
The red line is the performance inside the web worker. It is much more surprising:
The jagged line for the first 30 seconds is normal (the jag is caused by the slight delay in sending the results to the main thread after every chunk of the file is parsed). However, parsing slows down rather abruptly at 30 seconds. (Note that I'm only ever using a single web worker for the job; never more than one worker thread at a time.)
I've confirmed that the delay is not in sending the results to the main thread with postMessage(). The slowdown is in the tight loop of the parser, which is entirely synchronous. For reasons I can't explain, that loop slows down drastically after 30 seconds and keeps getting slower over time.
But this only happens in a web worker. Running the same code in the main thread, as you've seen above, runs very smoothly and quickly.
Why is this happening? What can I do to improve performance? (I don't expect anyone to fully understand all 1,200+ lines of code in that file. If you do, that's awesome, but I get the feeling this is more related to web workers than my code, since it runs fine in the main thread.)
System: I'm running Chrome 35 on Mac OS 10.9.4 with 16 GB of memory; quad-core 2.7 GHz Intel Core i7 with 256 KB of L2 cache per core and a 6 MB L3 cache. The file chunks are about 10 MB in size.
Update: Just tried it on Firefox 30 and it did not experience the same slowdown in a worker thread (but it was slower than Chrome when run in the main thread). However, trying the same experiment with an even larger file (about 1 GB) yielded a significant slowdown after about 35-40 seconds, it seems.
Tyler Ault suggested one possibility on Google+ that turned out to be very helpful.
He speculated that using FileReaderSync in the worker thread (instead of the plain ol' async FileReader) was not providing an opportunity for garbage collection to happen.
Changing the worker thread to use FileReader asynchronously (which intuitively seems like a step backwards for performance) brought the process back down to just 37 seconds, right where I would expect it to be.
I haven't heard back from Tyler yet and I'm not entirely sure I understand why garbage collection would be the culprit, but something about FileReaderSync was drastically slowing down the code.
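For reference, here is a minimal sketch of the async-FileReader pattern inside the worker that replaced FileReaderSync. It is not the actual 1,200-line parser; parseChunk() and the message shape are just stand-ins for illustration:

```javascript
// worker.js — sketch only; parseChunk() is a placeholder for the real parsing logic.
var CHUNK_SIZE = 10 * 1024 * 1024; // ~10 MB chunks, as in the question

self.onmessage = function (e) {
  var file = e.data.file; // File object posted from the main thread
  var offset = 0;

  function readNextChunk() {
    if (offset >= file.size) {
      self.postMessage({ done: true });
      return;
    }
    // The async FileReader yields to the worker's event loop between chunks,
    // giving the engine an opportunity to garbage-collect.
    var reader = new FileReader();
    reader.onload = function () {
      var results = parseChunk(reader.result); // placeholder for the parser
      self.postMessage({ results: results });
      offset += CHUNK_SIZE;
      readNextChunk();
    };
    reader.readAsText(file.slice(offset, offset + CHUNK_SIZE));
  }

  readNextChunk();
};
```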
What hardware are you running on? You may be running into cache-thrashing problems with your CPU. For example, if the CPU cache is 1 MB per core (just an example) and you keep working with data that continually replaces what's in the cache (cache misses), then you will suffer slowdowns - this is quite common with multithreaded systems, and in IO transfers too. These systems also tend to have some OS overhead for the thread contexts, so if lots of threads are being spawned you may be spending more time managing the contexts than actually 'doing work'. I haven't yet looked at your code, so I could be way off - but my guess is a memory issue, just due to what your application is doing. :)
Oh, and how to fix it: try making the blocks of execution small, single chunks that match the hardware. Minimize the number of threads in use at once - try to keep them to 2-3x the number of cores you have (this really depends on what sort of hardware you have). Hope that helps.
Related
I'm making an online browser game with websockets and a Node server, and with around 20-30 players the CPU is usually around 2% and RAM at 10-15%. I'm just using a cheap Digital Ocean droplet to host it.
However, every 20-30 minutes it seems, the server CPU usage will spike to 100% for 10 seconds, and then the server finally crashes. Up until that moment, the CPU is usually hovering around 2% and the game is running very smoothly.
I can't tell for the life of me what is triggering this as there are no errors in the logs and nothing in the game that I can see causes it. Just seems to be a random event that brings the server down.
There are also some smaller spikes as well that don't bring the server down, but soon resolve themselves. Here's an image:
I don't think I'm blocking the event loop anywhere and I don't have any execution paths that seem to be long running. The packets to and from the server are usually two per second per user, so not much bandwidth used at all. And the server is mostly just a relay with little processing of packets other than validation so I'm not sure what code path could be so intensive.
What can I do to profile this and figure out where to begin investigating what is causing these spikes? I'd like to imagine there's some code path I forgot about that is surprisingly slow under load, or maybe I'm missing a Node flag that would resolve it, but I don't know.
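One way to start investigating, rather than guessing, is to capture a CPU profile from inside the running process with Node's built-in inspector module and open the resulting .cpuprofile file in Chrome DevTools. This is only a sketch: it assumes Node 8+, and the sampling interval, 80% threshold, and file name are illustrative.

```javascript
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

// Capture a CPU profile for `durationMs` and write it to disk.
function captureProfile(durationMs, done) {
  session.post('Profiler.enable', () => {
    session.post('Profiler.start', () => {
      setTimeout(() => {
        session.post('Profiler.stop', (err, result) => {
          if (!err) {
            fs.writeFileSync('spike-' + Date.now() + '.cpuprofile',
                             JSON.stringify(result.profile));
          }
          session.post('Profiler.disable', done);
        });
      }, durationMs);
    });
  });
}

// Example trigger: check CPU usage every 5 seconds and grab a 10-second
// profile when it looks abnormally high.
let profiling = false;
let last = process.cpuUsage();
setInterval(() => {
  const delta = process.cpuUsage(last); // microseconds since last sample
  last = process.cpuUsage();
  const percent = (delta.user + delta.system) / (5000 * 1000) * 100;
  if (percent > 80 && !profiling) {
    profiling = true;
    captureProfile(10000, () => { profiling = false; });
  }
}, 5000);
```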
I think I might have figured it out.
I'm using mostly websockets for my game and I was running htop and noticed that if someone sends large packets (performing a ton of actions in a short amount of time) then the CPU spikes to 100%. I was wondering why that was when I remembered I was using a binary-packer to reduce bandwidth usage.
I tried changing the parser to plain JSON instead, so the packets are no longer compressed and packed, and regardless of how large the packets were, the CPU usage stayed at 2% the entire time.
So I think what was causing the crash was one player sending a lot of data in a short amount of time, and the server being overwhelmed by having to pack all of it and send it out in time.
This may not be the actual answer but it's at least something that needs to be fixed. Thankfully the game uses very little bandwidth as it is and bandwidth is not the bottleneck so I may just leave it as JSON.
The only problem is that with JSON encoding, users can read the packets in the Chrome developer console's network tab, which I don't like. It makes it a lot easier to figure out how the game works and potentially find cheats/exploits.
I have a problem using PhantomJS with the web server module in a multi-threaded way, with concurrent requests.
I am using PhantomJS 2.0 to create highstock graphs on the server-side with Java, as explained here (and the code here).
It works well, and when testing graphs of several sizes, I got results that are pretty consistent, about 0.4 seconds to create a graph.
The code that I linked to was originally published by the highcharts team, and it is also used in their export server at http://export.highcharts.com/. In order to support concurrent requests, it keeps a pool of spawned PhantomJS processes, and basically its model is one phantomjs instance per concurrent request.
I saw that the web server module supports up to 10 concurrent requests (explained here), so I thought I could tap into that to keep a smaller number of PhantomJS processes in my pool. However, when I tried to use more threads, I experienced a linear slowdown, as if PhantomJS were using only one CPU. The slowdown looks like this (for a single PhantomJS instance):
1 client thread, average request time 0.44 seconds.
2 client threads, average request time 0.76 seconds.
4 client threads, average request time 1.5 seconds.
Is this a known limitation of PhantomJS? Is there a way around it?
(question also posted here)
Is this a known limitation of PhantomJS?
Yes, it is an expected limitation, because PhantomJS uses the same WebKit engine for everything, and since JavaScript execution is single-threaded, every request will effectively be handled one after the other (possibly interleaved), but never at the same time. The average overall time will increase linearly with each additional client.
The documentation says:
There is currently a limit of 10 concurrent requests; any other requests will be queued up.
There is a difference between the notions of concurrent and parallel requests. Concurrent simply means that multiple tasks are in progress over the same period and finish in a non-deterministic order. It doesn't mean that the instructions the tasks are made of are executed in parallel on different (virtual) cores.
Is there a way around it?
Other than running your server tasks through child_process, no. The way JavaScript supports multi-threading is by using Web Workers, but a worker is sandboxed and has no access to require and therefore cannot create pages to do stuff.
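As a rough illustration of the child_process route, here is a sketch of queuing render jobs across a small pool of PhantomJS processes from Node. It is not the Highcharts export-server code; render.js, its arguments, and the pool size are assumptions, and phantomjs is assumed to be on the PATH.

```javascript
var execFile = require('child_process').execFile;

var POOL_SIZE = 4; // roughly one PhantomJS process per CPU core
var queue = [];
var busy = 0;

// render.js is a hypothetical PhantomJS script that reads a chart config
// and writes the rendered chart to outputPath.
function render(configPath, outputPath, callback) {
  queue.push({ configPath: configPath, outputPath: outputPath, callback: callback });
  drain();
}

function drain() {
  if (busy >= POOL_SIZE || queue.length === 0) return;
  var job = queue.shift();
  busy++;
  execFile('phantomjs', ['render.js', job.configPath, job.outputPath], function (err) {
    busy--;
    job.callback(err, job.outputPath);
    drain(); // pick up the next queued job, if any
  });
}
```

Spawning a fresh process per job trades some startup cost for true parallelism across cores; keeping longer-lived processes around (as the export server does) avoids the startup cost but takes more code.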
I am in the process of profiling a javascript library I wrote, looking for memory leaks. The library provides an API and service to a back-end. It does not do any html or dom manipulation. It does not load any resources (images, etc). The only thing it does is make xhr requests (using jquery), including one long poll, and it passes and receives data to and from the UI via events (using a Backbone event bus).
I have tested this library running it overnight for 16 hours. The page that loads it does nothing but load the library and sends a login request to start the service. There were no html, css, or other dom changes over the course of the test.
All that happened over the course of the test was the library sent a heartbeat (xhr request) to the server every 15 seconds, and received a heartbeat via the long poll every 30 seconds.
I ran the test with the chrome task manager open, and with the chrome debugger open, in order to force GCs from the timeline.
At the start of the tests, after I forced an initial GC - these were the stats from the chrome task manager:
Memory - 11.7 MB
JavaScript memory - 6.9 MB / 2.6 MB live
Shared memory - 21.4 MB
Private memory - 19 MB
After 16hrs I forced a GC - these were the new stats:
Memory - 53.8 MB
JavaScript memory - 6.9 MB / 2.8 MB live
Shared memory - 21.7 MB
Private memory - 60.9 MB
As you can see, the JS heap grew by only about 200 KB.
Private memory grew by about 42 MB!
Can anyone provide possible explanations of what caused the private memory growth? Does having the chrome debugger open cause or affect the memory growth?
One other thought I had is that forcing the GC from the timeline debugger only reclaimed memory from the JS heap - so other memory was not reclaimed. Therefore this may not be a 'leak' per se, as it may eventually be collected - though I'm not sure how to confirm this. Especially without knowing what this memory represents.
Lastly, I did read that xhr results are also stored in private memory. Does anyone know if this is true? If so, the app performed approx 5,700 xhr requests over this timeframe. If most of the 42 MB was due to this, that would mean approx 7 KB was allocated per request (42 MB / 5,700 ≈ 7.4 KB) - which seems high, considering the payloads were very small. If this is the case: when would this memory be released, is there anything I can do to cause it to be released, and would having the Chrome debugger open impact this?
I was unable to find information on precisely when and what goes to private memory versus javascript memory. However I can answer this: "would having the chrome debugger open impact this" ... YES.
Having the developer tools open causes the browser to gather/retain/display a lot more information about each and every XHR that is being made. This data is not gathered/retained when the developer tools are closed (as most know, since it is sometimes bloody annoying when you open the dev tools too late and they missed that one request you cared about).
You can open the dev tools, trigger the GC, then close the dev tools, opening them again only when you want to trigger the GC and take the metric. You can also use this sucker chrome://memory-redirect/ to keep track of the growth without opening the dev tools.
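Another option for a long soak test like this is the non-standard, Chrome-only performance.memory API, which can be logged periodically without DevTools open. Note it only reports the JS heap, not private memory, so it complements rather than replaces the task-manager numbers. A minimal sketch:

```javascript
// Chrome-only, non-standard API; values are quantized unless Chrome is
// launched with --enable-precise-memory-info.
setInterval(function () {
  if (window.performance && performance.memory) {
    var m = performance.memory;
    console.log(
      new Date().toISOString(),
      'used:', (m.usedJSHeapSize / 1048576).toFixed(1) + ' MB',
      'total:', (m.totalJSHeapSize / 1048576).toFixed(1) + ' MB'
    );
  }
}, 60000); // once a minute over the 16-hour run
```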
I'm considering using Web Workers for batch image processing and am wondering what to expect in terms of performance gains.
My current strategy is to process each image in sequence and only start a new process after the current process is over. If I have 10 images that take 10 seconds each to process, the batch will complete in ~100 seconds.
If I use 10 Web Workers at once, I doubt I will complete the entire job in 10 seconds. But will it be lower than 100 seconds? If not, is there an optimal size to a pool of concurrently running Web Workers?
I would imagine that your performance gains will depend heavily on the number of cores you have in your computer. If I were to guess, I'd say a good sweet spot might be four web workers (corresponding to quad-core machines), but the only way to know for sure is to try it out.
Structure your code in such a way that you can simply change a constant to change the number of workers, and then set it to a value that seems to work optimally.
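For example, here is a minimal sketch of that structure, where a single constant controls the pool size. The worker script name and message format are assumptions, and results arrive in completion order:

```javascript
var WORKER_COUNT = navigator.hardwareConcurrency || 4; // the constant to tune

function processImages(images, onAllDone) {
  if (images.length === 0) { onAllDone([]); return; }
  var queue = images.slice();
  var results = [];
  var active = 0;

  function next(worker) {
    if (queue.length === 0) {
      worker.terminate();
      if (--active === 0) onAllDone(results);
      return;
    }
    worker.postMessage({ image: queue.shift() });
  }

  for (var i = 0; i < Math.min(WORKER_COUNT, images.length); i++) {
    var worker = new Worker('process-image.js'); // hypothetical worker script
    worker.onmessage = function (e) {
      results.push(e.data); // processed result for one image
      next(e.target);       // hand this worker the next image in the queue
    };
    active++;
    next(worker); // give each worker its first image
  }
}
```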
You can try experimenting with this example: Ray-tracing with Web Workers. There certainly seems to be a huge performance gain in using 16 vs. 4 web workers there.
I have a touch-screen kiosk application that I'm developing, which will be deployed on the latest version of Chrome.
The application will need to make AJAX calls to a web service every 10 minutes or so to pull through any updated content.
As it's a kiosk application, the page is unlikely to be reloaded very often and theoretically, unless the kiosk is turned off, the application could run for days at a time.
I guess my concern is memory usage, and whether or not a very long-running setTimeout loop would chew through a large amount of memory given sufficient time.
I'm currently considering the use of Web Workers and I'm also going to look into Web Sockets but I was wondering if anyone had any experience with this type of thing?
Cheers,
Terry
The browser has a garbage collector, so no problems on that front, as long as you don't introduce memory leaks through bad code. Here's an article and another article about memory-leak patterns. That should get you started on how to program efficiently and shoot down that leaky code.
Also, you have to consider the DOM. A person on SO once said that "things that are not on screen should be removed and not just hidden" - this not only removes the entity from a viewing perspective, it actually removes it from the DOM, removes its handlers, and frees the memory it used.
As for the setTimeout, lengthen the interval between calls. Too fast and you will chew up memory quickly (and render the page quite... laggy). I just tested code for a timer-based "hashchange" detection, and even on Chrome it makes the page rather slow.
Research known Chrome bugs and keep the browser updated as well.
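On the polling side, here is a minimal sketch of a setTimeout loop that re-arms itself only after each AJAX call completes, so timers and pending requests can never pile up. The URL, the jQuery usage, and applyUpdates() are assumptions for illustration:

```javascript
var POLL_INTERVAL = 10 * 60 * 1000; // every 10 minutes, as in the question

function poll() {
  $.getJSON('/api/content-updates')
    .done(function (data) {
      applyUpdates(data); // hypothetical function that swaps in the new content
    })
    .always(function () {
      setTimeout(poll, POLL_INTERVAL); // schedule the next poll only once this one is done
    });
}

poll();
```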