Friday, December 26, 2014

Reactive Microservice Part 3

Resilient

The system stays responsive in the face of failure. This applies not only to highly-available, mission critical systems — any system that is not resilient will be unresponsive after a failure. Resilience is achieved by replication, containment, isolation and delegation. Failures are contained within each component, isolating components from each other and thereby ensuring that parts of the system can fail and recover without compromising the system as a whole. Recovery of each component is delegated to another (external) component and high-availability is ensured by replication where necessary. The client of a component is not burdened with handling its failures.
Since each service has its own single responsibility and is independent. Failure is contained and isolated in one service. If one of the service is down, it won’t affect the rest of the site.
Failure is treated as a first class citizen.
  • Redundancy for machine/ cluster/ data center failures
  • Load-balancing and flow control for service invocations
  • Client protects itself with standard failure management patterns: timeouts, retries, circuit breakers.
Resiliency has to be tested too. Resiliency test or game day should be execute on the production system, making sure the servers response well under failure.

Api gateway (netflix)/ api facades (500px)

Api gateway/facades owned by web team which basically a proxy to the various backend systems. Since its own by the web team, there’s no need idle time to wait for the service team to build the service. Also, it functions as an abstract concurrency, which can pull data from various services and flatten them.

Isolate/access

Persistence between services should not be shared. This break the independency in both development and deployment of a service. However, this is one of the aspect of microservice that practitioners most struggled with, esp those with existing monolithic server which share the same database.

Loadbalancer

Using smart client loadbalancer. Netflix has opensourced Ribbon for this purpose. It can avoid the whole zone if it’s go bad.

Resiliency/Availability

Microservice does not automatically mean better availability. If a service failed, the consumer service can go into a loop and soon all available thread will be consumed. Employ circuit breaker, retries, bulk heading and fallbacks. Hystrix from Netflix provides supports for these patterns.

Test service for resiliency

Test for resiliency as failure in production would happens. Need to test for latency/error (Simian Army), dependency service unavailability and network errors. The system should be optimized for responding to failures.

Rest.li + deco

Rest.li is the RESTful services framework from LinkedIn which also give type-safe bindings, async, non-blocking I/O. It also provides dynamic service discovery and client side load balancing using zookeeper.
All domain model in LinkedIn is normalized, the link between two domain model is by using URN, a fully quantified foreign key (e.g. urn:li:member:123). Deco is a URN resolution library. Services can query Deco by what data it needs (not how) and Deco can look into the URN and fetch the data for the service.

Dynamic load balancer

Mailgun implemented a dynamic loadbalancer which basically includes service registration/discovery, load balancing, circuit breaker (retries) and canary deployment into one. New service can simply start and will register to the load balancer and other services can start consuming it. Multiple version of services can be run at the same time, thus enabling canary deployment. Without disrupting the existing services, we can start new version of the service and then start the consumer with will use the new version, so we can retire old service gracefully.


Elastic

The system stays responsive under varying workload. Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated to service these inputs. This implies designs that have no contention points or central bottlenecks, resulting in the ability to shard or replicate components and distribute inputs among them. Reactive Systems support predictive, as well as Reactive, scaling algorithms by providing relevant live performance measures. They achieve elasticity in a cost-effective way on commodity hardware and software platforms.
Goal of an elastic system is to:
  • Scale up and down service instances according to load
  • Gracefully handle spiky workloads
  • Predictive and reactive scaling

Resource contention

It happens when there is a conflict over access to a shared resource.
  • Message-driven/async. Service should only hold on to a resource when it’s using it. The traditional BIO will block the thread when it’s waiting for some I/O operations, so the resource is not release and can cause contention.
  • Reactive all the way down. However, in some case it’s not possible. The best example is the use of JDBC. There’s no async driver (except for an experimental driver for Oracle and mysql.) This is also one of the reason moving to NoSQL. One possible way to deal with that is to use Actor model.
  • Beside idle threads, shared/distributed state also constitute contention point. It should be moved to the edge, like the client or external datastore, to reduce contention.

Fan out

One of the challenges of implementing microservice is fan out, both on the UI and among services. Fan out on the UI can be resolved by using API gateway which push the fan out problem to the server (with lower network latency) by returning a materialized view instead. Server caching can be used to reduce the amount of load to the services and instead of caching a service, we can cache the composite/materialized view.

Bottlenecks/hotspots

Services like user service might be required by multiple services in the same flow. Those services become the hotspot of the system. To reduce the dependency, we can pass data through header instead of having all individual services depends on it.

Message driven

Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation, location transparency, and provides the means to delegate errors as messages. Employing explicit message-passing enables load management, elasticity, and flow control by shaping and monitoring the message queues in the system and applying back-pressure when necessary. Location transparent messaging as a means of communication makes it possible for the management of failure to work with the same constructs and semantics across a cluster or within a single host. Non-blocking communication allows recipients to only consume resources while active, leading to less system overhead.
Message-driven is
  • Asynchronous message-passing over synchronous request-response
  • Often custom protocol over TCP/UDP or WebSockets over HTTP

Backpressure

An example of backpressure is user uploading file. Traditional, the web server will save it to a local file, and then upload to the storage (express is doing this.) With streaming, the web server can pipe the file to the storage while the user is uploading it. But what If the user upload faster than the server can save to the storage, backpressure is needed to control the flow. Stream can be throttled the upload from the user, so it’ll be the same speed to store the file. So backpressure is
  • Stream-oriented
  • Sensible behavior on resource contention
  • Avoids unnecessary buffering
Back to Rx. There’s 2 kinds of observable, hot & cold.

Hot observable

Typically we’ll use temporal operator – sample (take a sample of the stream in a time window), throttleFirst (take the first value throttled in a time window), debounce (wait until it pause a bit), buffer (base on time) or window (by time or count). Or combining buffer and debounce.

Reactive pull

Dynamic push-pull. Push (reactive) when consumer keeps up with producer & switch to pull (interactive) when consumer is slow. Bound all* queues, vertically not horizontally. Backpressure of a given stream not all stream together.
Cold stream support pull by definition as we can control the speed. Hot receives signal to support pull.
stream.onBackpress(buffer|drop|sample|scaleHorizontally).observeOn(aScheduler)

So on backpressure we can use the backpressure signal to choose different strategy. We can buffer on memory, on Kafka or on disk. We can drop or sample the stream or to scale it horizontally (Netflix can stand a new server on Mesos).