Friday, December 26, 2014

Node.js event loop and I/O threads

Everything runs parallel except your code

Node.js has an asynchronous event loop running on a single thread.  It is woken up by the OS for incoming requests, farms out I/O to a thread pool in a non-blocking style, and the responses are delivered back to the event loop via callbacks.  Thus the event loop never blocks on I/O.

Below is a diagram Aaron Stannard used to illustrate the Nodejs processing model.

Node.js' event loop is built on libuv's default loop (uv_default_loop()).  libuv started as an abstraction over libev (Unix) and IOCP (Windows), providing a libev-style API to talk to the kernel's event notification mechanisms; since node v0.9.0, libev has been removed.  It is developed for use by Node.js as a multi-platform support library with a focus on async I/O.

libuv's GitHub page states that it supports:
1. event loop backed by epoll, kqueue, IOCP, event ports
2. async tcp & udp sockets
3. async dns resolution
4. async file & file system operations (using fs.openSync is very wrong!)
5. file system events
6. IPC with socket sharing
7. child processes
8. signal handling

As you can see, all the I/O runs on the worker threads and does not block the main thread (the event loop), while all your code runs inside the event loop.  And since there is only one event loop, everything runs parallel in Node.js except your code.

That's why it's important for the application to be I/O bound rather than CPU bound.  A heavily CPU-bound function (and we're not even talking about implementing fibonacci as Ted did) will block the event loop and stop it from accepting the next request.  Incoming requests get queued (by the OS, I presume), and if the blockage lasts long enough they will time out instead of performance degrading gracefully with load.  It works like cooperative multitasking: a request's handler keeps running until it cedes control back to the event loop.
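The blocking behaviour is easy to demonstrate: a timer due in 10 ms cannot fire while a CPU-bound loop holds the event loop (a sketch with a ~100 ms spin standing in for a heavy handler):

```javascript
var start = Date.now();
var delay;

setTimeout(function () {
  delay = Date.now() - start;
  // Fires well over 100 ms late, despite being due at 10 ms.
  console.log('timer due at 10 ms fired after ~' + delay + ' ms');
}, 10);

// Simulate a CPU-bound handler holding the event loop for ~100 ms.
while (Date.now() - start < 100) { /* spin */ }
```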

For CPU-intensive tasks, we can:

1. yield to the event loop using process.nextTick().  We have to break the work into smaller chunks so that after processing one chunk, the next chunk runs on the next turn of the event loop, typically before any other I/O event fires.  However, it's still possible to starve the event loop by queueing too many callbacks with process.nextTick(); process.maxTickDepth limits how many process.nextTick() callbacks are evaluated before other I/O events.

2. leverage the node-webworker-threads library to offload work to threads if it's pure JavaScript, or create a pool of child processes for it.

3. implement a custom event and extension to libuv, though we'll have to build the task in C++.

Since almost everything runs on that single thread (even though I/O is offloaded to a thread pool, for most applications that's only a small portion of CPU time), Node.js will use a single core no matter how many cores our server has.  One solution is to start multiple Node.js instances, but that adds maintenance overhead and each instance has to listen on a different port.  Node.js provides a clustering solution in which the master process listens on the port and forwards requests in a round-robin fashion to the child processes.

Note that clustering is not a solution for CPU-intensive tasks.  Each child process has its own event loop, and if that event loop is blocked, the master doesn't know it's blocked and the next request will still be queued to it.

Also, if we are using Mesos with Docker, we just have to declare that the application requires 1 CPU (or 2?) and we can easily run multiple instances of the application on the same host.

Another solution is to try out JXCore, a multithreaded port of Node.js.
It even claims it "performs even better with application requiring more CPU power".

By default, libuv has 4 I/O threads, which is good enough most of the time.  If your application has slow I/O, however, all 4 I/O threads can be tied up.  We can set the environment variable UV_THREADPOOL_SIZE to increase the number of I/O threads if the I/O is inherently slow.  But if it's abnormal for that I/O to be slow (due to heavy load or some other failure), we should set a timeout instead.

Note: our application uses request heavily to call other services; however, from the documentation, it appears to only have a timeout for the response, not a connection timeout.

Finally, we'll need a tool to monitor the event loop and I/O threads to better tune our application.  How do we know whether the little things we do between I/O block the event loop enough that requests start queuing up while the other CPUs sit idle?  Or that all 4 I/O threads are blocked, so requests are accepted but nothing gets processed?  (Our application uses clustering, and we once had an issue after turning on debug logging: the disk was probably too slow to write all those logs, all the I/O threads were used up, and the cluster killed and restarted the child process!)  StrongLoop has a monitoring solution that covers the event loop, GC, CPU, and memory usage.

Bert Belder - libuv -