Resilient
The system stays responsive in the face
of failure.
This applies not only to highly-available, mission critical systems —
any system that is not resilient will be unresponsive after a
failure. Resilience is achieved by replication,
containment, isolation and delegation.
Failures are contained within each component,
isolating components from each other and thereby ensuring that parts
of the system can fail and recover without compromising the system as
a whole. Recovery of each component is delegated to another
(external) component and high-availability is ensured by replication
where necessary. The client of a component is not burdened with
handling its failures.
Since each service has its own single responsibility and is independent, failure is contained and isolated within one service. If one of the services goes down, it won't affect the rest of the site. Failure is treated as a first-class citizen.
- Redundancy for machine/cluster/data center failures
- Load balancing and flow control for service invocations
- The client protects itself with standard failure management patterns: timeouts, retries, circuit breakers (a timeout-and-retry sketch follows this list)
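As a first illustration, here is a minimal sketch of a client guarding a remote call with a timeout and bounded retries; the class name and the remote call are hypothetical stand-ins, and a Hystrix-based circuit breaker appears later in this post.

import java.util.concurrent.*;

// Hypothetical client wrapping a remote call with a timeout and bounded retries.
public class ResilientClient {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Stand-in for the real service invocation (an assumption).
    private String callRemote() {
        return "response";
    }

    public String callWithRetries(int maxAttempts, long timeoutMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<String> future = pool.submit(this::callRemote);
            try {
                // Fail fast instead of blocking a thread indefinitely.
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true);
                last = e;
            }
        }
        throw last; // All attempts exhausted; a circuit breaker would now open.
    }
}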
Resiliency has to be tested too. Resiliency tests or game days should be executed on the production system, making sure the servers respond well under failure.
API gateway (Netflix) / API facades (500px)
The API gateway/facade is owned by the web team and is basically a proxy to the various backend systems. Since it is owned by the web team, there is no idle time waiting for the service team to build the service. It also abstracts concurrency: it can pull data from various services and flatten the results.
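A minimal sketch of that flattening, assuming RxJava 1.x and stand-in backend clients:

import rx.Observable;

public class Gateway {
    // Stand-ins for async backend clients (assumptions, not real services).
    static Observable<String> profile(long id)  { return Observable.just("profile:" + id); }
    static Observable<String> activity(long id) { return Observable.just("activity:" + id); }

    public static void main(String[] args) {
        // Fan out to both services and flatten into one payload,
        // so the client makes a single request to the gateway.
        Observable.zip(profile(123), activity(123),
                       (p, a) -> p + " + " + a)
                  .subscribe(System.out::println);
    }
}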
Isolated persistence
Persistence should not be shared between services. Sharing it breaks the independence of both development and deployment of a service. However, this is one of the aspects of microservices that practitioners struggle with most, especially those with an existing monolithic server where services share the same database.
Load balancer
Use a smart client-side load balancer. Netflix has open-sourced Ribbon for this purpose. It can avoid a whole zone if the zone goes bad.
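A minimal sketch of the idea only, not Ribbon's actual API: a client-side balancer that round-robins across servers and skips zones marked bad.

import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative smart client-side load balancer; names and structure are
// assumptions for the sketch, not Ribbon's real classes.
class ZoneAwareBalancer {
    record Server(String host, String zone) {}

    private final List<Server> servers;
    private final Set<String> badZones;           // e.g. marked bad by health checks
    private final AtomicInteger next = new AtomicInteger();

    ZoneAwareBalancer(List<Server> servers, Set<String> badZones) {
        this.servers = servers;
        this.badZones = badZones;
    }

    Server choose() {
        // Round-robin, skipping every server in a zone known to be bad.
        for (int i = 0; i < servers.size(); i++) {
            Server s = servers.get(Math.abs(next.getAndIncrement()) % servers.size());
            if (!badZones.contains(s.zone())) return s;
        }
        throw new IllegalStateException("no healthy zone available");
    }
}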
Resiliency/Availability
Microservices do not automatically mean better availability. If a service fails, the consuming service can go into a loop and soon all available threads will be consumed. Employ circuit breakers, retries, bulkheading, and fallbacks. Hystrix from Netflix provides support for these patterns.
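For example, a Hystrix command wraps a dependency call with a circuit breaker, a bulkheaded thread pool, and a fallback; the remote call here is a stand-in:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps a dependency call so failures trip the circuit breaker
// and callers get the fallback instead of piling up on dead threads.
public class UserNameCommand extends HystrixCommand<String> {
    private final long userId;

    public UserNameCommand(long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        // Stand-in for the real remote call (an assumption).
        return remoteUserService(userId);
    }

    @Override
    protected String getFallback() {
        return "anonymous"; // degraded but responsive
    }

    private String remoteUserService(long id) { return "user-" + id; }
}

// Usage: new UserNameCommand(42).execute() runs on a bulkheaded thread pool.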
Test service for resiliency
Test for resiliency, as failures in production will happen. We need to test for latency/errors (Simian Army), dependency service unavailability, and network errors. The system should be optimized for responding to failures.
Rest.li + Deco
Rest.li is the RESTful services framework from LinkedIn, which also gives type-safe bindings and async, non-blocking I/O. It also provides dynamic service discovery and client-side load balancing using ZooKeeper.
All domain models in LinkedIn are normalized; the link between two domain models is a URN, a fully qualified foreign key (e.g. urn:li:member:123). Deco is a URN resolution library. Services tell Deco what data they need (not how to get it), and Deco can look into the URN and fetch the data for the service.
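To make the URN idea concrete, a toy parser for the urn:li:member:123 shape; the resolver interface is a hypothetical stand-in, not Deco's API:

// Toy URN in the urn:li:<entity>:<id> shape from the example above.
record Urn(String entity, long id) {
    static Urn parse(String raw) {
        String[] parts = raw.split(":");           // ["urn", "li", "member", "123"]
        if (parts.length != 4 || !parts[0].equals("urn") || !parts[1].equals("li"))
            throw new IllegalArgumentException("not a li URN: " + raw);
        return new Urn(parts[2], Long.parseLong(parts[3]));
    }
}

// A Deco-like resolver would map the entity type to the owning service
// and fetch the record; this interface is an assumption for illustration.
interface UrnResolver {
    Object resolve(Urn urn);
}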
Dynamic load balancer
Mailgun implemented a dynamic load balancer which combines service registration/discovery, load balancing, circuit breaking (retries), and canary deployment into one. A new service can simply start and register with the load balancer, and other services can start consuming it. Multiple versions of a service can run at the same time, which enables canary deployment: without disrupting the existing services, we can start a new version of the service, then start consumers that use the new version, and retire the old service gracefully.
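A minimal sketch of the registration side with versioned entries, so two versions can serve at once during a canary; all names are assumptions:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative in-memory registry keyed by service name; each entry keeps
// its version so old and new instances can coexist during a canary.
class ServiceRegistry {
    record Instance(String address, String version) {}

    private final Map<String, List<Instance>> services = new ConcurrentHashMap<>();

    void register(String name, Instance instance) {
        services.computeIfAbsent(name, k -> new CopyOnWriteArrayList<>()).add(instance);
    }

    void deregister(String name, Instance instance) {
        List<Instance> list = services.get(name);
        if (list != null) list.remove(instance);
    }

    // Consumers pin a version explicitly, so a new consumer can opt into
    // the canary while old consumers keep the stable version.
    List<Instance> lookup(String name, String version) {
        return services.getOrDefault(name, List.of()).stream()
                .filter(i -> i.version().equals(version))
                .toList();
    }
}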
Elastic
The system stays responsive under varying
workload. Reactive Systems can react to changes in the input rate by
increasing or decreasing the resources allocated
to service these inputs. This implies designs that have no
contention points or central bottlenecks, resulting in the
ability to shard or replicate components and distribute inputs
among them. Reactive Systems support predictive, as well as Reactive,
scaling algorithms by providing relevant live performance
measures. They achieve elasticity in
a cost-effective way on commodity hardware and software platforms.
The goal of an elastic system is to:
- Scale service instances up and down according to load
- Gracefully handle spiky workloads
- Scale both predictively and reactively (a toy reactive scaler is sketched after this list)
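A toy reactive scaler, under the assumption of a periodic CPU measure and a platform scale hook; thresholds and limits are made up for illustration:

// Toy reactive scaler: adjusts the desired instance count from a load metric.
// Thresholds, the metric source, and the scale() hook are all assumptions.
class ReactiveScaler {
    private int instances = 2;

    void tick(double avgCpu) {              // called periodically with a live measure
        if (avgCpu > 0.75 && instances < 20) scale(++instances);      // scale out
        else if (avgCpu < 0.25 && instances > 2) scale(--instances);  // scale in
    }

    private void scale(int target) {
        System.out.println("desired instances -> " + target); // call the platform here
    }
}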
Resource contention
It happens when
there is a conflict over access to a shared resource.
- Message-driven/async. A service should only hold on to a resource while it is actually using it. Traditional blocking I/O (BIO) blocks the thread while it waits for an I/O operation, so the resource is not released and can cause contention.
- Reactive all the way down. However, in some cases this is not possible. The best example is JDBC: there is no async driver (except experimental drivers for Oracle and MySQL). This is also one of the reasons for moving to NoSQL. One possible way to deal with it is to use the actor model (see the sketch after this list).
- Besides idle threads, shared/distributed state also constitutes a contention point. It should be moved to the edge, like the client or an external datastore, to reduce contention.
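In the same spirit as the actor suggestion above, a minimal sketch that confines blocking JDBC work to a dedicated thread pool behind an async interface, so request threads never block; the class and the blocking DAO call are assumptions, not a real driver:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Confines blocking JDBC work to its own small pool so request threads
// stay non-blocked; a simplified stand-in for actor-style isolation.
class AsyncUserRepository {
    private final ExecutorService jdbcPool = Executors.newFixedThreadPool(8);

    CompletableFuture<String> findUserAsync(long id) {
        return CompletableFuture.supplyAsync(() -> findUserById(id), jdbcPool);
    }

    private String findUserById(long id) {
        // Imagine a blocking JDBC query here (hypothetical).
        return "user-" + id;
    }
}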
Fan out
One of the challenges of implementing microservices is fan-out, both on the UI and among services. Fan-out on the UI can be resolved by using an API gateway, which pushes the fan-out problem to the server side (with lower network latency) by returning a materialized view instead. Server caching can be used to reduce the load on the services; instead of caching each service, we can cache the composite/materialized view (a caching sketch follows).
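A minimal sketch of caching the composed view rather than each upstream service, assuming Caffeine as the cache and a hypothetical buildView() doing the fan-out:

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

// Cache the materialized view the gateway returns, not the individual
// service responses; buildView() (a stand-in) does the fan-out once.
class ViewCache {
    private final LoadingCache<Long, String> views = Caffeine.newBuilder()
            .maximumSize(100_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            .build(this::buildView);

    String get(long userId) {
        return views.get(userId); // one cached entry replaces N upstream calls
    }

    private String buildView(long userId) {
        // Fan out to backend services and flatten, as in the gateway sketch.
        return "view-for-" + userId;
    }
}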
Bottlenecks/hotspots
Services like a user service might be required by multiple services in the same flow; those services become the hotspots of the system. To reduce the dependency, we can pass the data through a header instead of having every individual service depend on the hotspot, as sketched below.
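A minimal sketch of the header-passing idea; the header name, payload shape, and URL are assumptions. The edge service resolves the user once and downstream hops read the header instead of calling the user service:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class HeaderPassing {
    // Forward the already-resolved user in a header so downstream hops
    // don't all call the hotspot user service.
    static String callDownstream(String userJson) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://orders.internal/api/orders")) // hypothetical
                .header("X-User-Context", userJson)                    // assumed header
                .GET()
                .build();
        return HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}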
Message driven
Reactive Systems rely
on asynchronous message-passing
to establish a boundary between components that ensures loose
coupling, isolation, location
transparency, and provides the means to delegate errors
as messages. Employing explicit message-passing enables load
management, elasticity, and flow control by shaping and monitoring
the message queues in the system and applying back-pressure when
necessary. Location transparent messaging as a means of communication
makes it possible for the management of failure to work with the same
constructs and semantics across a cluster or within a single
host. Non-blocking
communication allows recipients to only consume resources while
active, leading to less system overhead.
Message-driven means:
- Asynchronous message-passing over synchronous request-response
- Often a custom protocol over TCP/UDP, or WebSockets over HTTP
Backpressure
An example of backpressure is a user uploading a file. Traditionally, the web server saves it to a local file and then uploads it to storage (Express does this). With streaming, the web server can pipe the file to storage while the user is uploading it. But what if the user uploads faster than the server can save to storage? Backpressure is needed to control the flow: the stream can throttle the upload from the user so it matches the speed at which the file is stored. So backpressure is
- Stream-oriented
- Sensible behavior on resource contention
- Avoiding unnecessary buffering
Back to Rx: there are two kinds of observables, hot and cold.
Hot observable
Typically we'll use a temporal operator: sample (take a sample of the stream in a time window), throttleFirst (take the first value in a time window), debounce (wait until the stream pauses a bit), buffer (based on time), or window (by time or count), or a combination of buffer and debounce. An example follows.
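A small RxJava 1.x illustration of these operators; the interval source is a stand-in for a real hot observable such as mouse events:

import java.util.concurrent.TimeUnit;
import rx.Observable;

public class TemporalOps {
    public static void main(String[] args) throws InterruptedException {
        // A stand-in hot source (real ones: mouse events, ticker feeds).
        Observable<Long> hot = Observable.interval(10, TimeUnit.MILLISECONDS).share();

        hot.sample(1, TimeUnit.SECONDS)          // latest value per 1s window
           .subscribe(v -> System.out.println("sample: " + v));
        hot.throttleFirst(1, TimeUnit.SECONDS)   // first value per 1s window
           .subscribe(v -> System.out.println("throttleFirst: " + v));
        hot.debounce(200, TimeUnit.MILLISECONDS) // emits only after a pause
           .subscribe(v -> System.out.println("debounce: " + v)); // never fires for this steady source
        hot.buffer(1, TimeUnit.SECONDS)          // a List<Long> per 1s window
           .subscribe(list -> System.out.println("buffer: " + list.size()));

        Thread.sleep(3000); // let the async streams run briefly
    }
}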
Reactive pull
Dynamic push-pull: push (reactive) when the consumer keeps up with the producer, and switch to pull (interactive) when the consumer is slow. Bound all* queues vertically, not horizontally: apply backpressure to a given stream, not to all streams together.
Cold streams support pull by definition, as we can control the speed. Hot streams receive a signal to support pull.
stream.onBackpressure(buffer | drop | sample | scaleHorizontally).observeOn(aScheduler)
So on backpressure we can use the backpressure signal to choose a different strategy. We can buffer in memory, in Kafka, or on disk. We can drop or sample the stream, or scale it horizontally (Netflix can stand up a new server on Mesos).
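In RxJava 1.x terms, the drop strategy from the pseudocode above might look like the following sketch; the fast source and the slow consumer are stand-ins:

import java.util.concurrent.TimeUnit;
import rx.Observable;
import rx.schedulers.Schedulers;

public class BackpressureDemo {
    public static void main(String[] args) throws InterruptedException {
        Observable<Long> fast = Observable.interval(1, TimeUnit.MILLISECONDS);

        // Alternatives: fast.onBackpressureBuffer() to buffer in memory,
        // or sample()/throttleFirst() to degrade the stream instead.
        fast.onBackpressureDrop(v -> System.out.println("dropped " + v))
            .observeOn(Schedulers.computation())  // bounded queue at this boundary
            .subscribe(BackpressureDemo::slowConsume);

        Thread.sleep(2000);
    }

    static void slowConsume(Long v) {
        try { Thread.sleep(100); } catch (InterruptedException ignored) {} // simulate slow work
        System.out.println("processed " + v);
    }
}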