Architecting Kubernetes for High Availability, Fault Tolerance and Business Continuity

Kubernetes takes care of many things and solves many problems, except the ones it doesn't know about, such as region failures and human error.

In this post I want to compare and contrast two setups: the very common single cluster spread across multiple Availability Zones, and multiple clusters spread across different regions. Hopefully by the end of this post you will have a clearer idea of when to use which setup, no matter which cloud provider you are using, be it AWS, Azure, or GCP.

Single Cluster Setup:

In this setup the Kubernetes nodes and their storage are distributed across multiple Availability Zones (AZs). This model ensures the nodes are physically separated from each other, so an outage in one AZ will not take the entire cluster out of service. At the same time, communication between the nodes goes over a private connection and is not routed over the internet, no matter which cloud provider you use.

The following image, which I took from the official Azure documentation, illustrates what a cross-AZ cluster looks like, but it's pretty much the same if you are on AWS (EKS) or GCP (GKE).

On Azure, setting up a cluster that spreads across multiple AZs is almost the same as setting up a cluster within one AZ. The only thing you need to do is tick the AZ check-boxes, and that's it. Azure will take care of the rest of the configuration for you.

AWS also recommends running an EKS cluster spread across three or more Availability Zones. Setting up an EKS cluster that spreads across AZs requires minimal configuration, and the same goes for Google GKE.
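To make the single-cluster model a bit more concrete, here is a minimal sketch in Python of what an even spread across AZs buys you. It is only a toy model, not how any scheduler actually works: the zone names and the helper functions `spread_replicas` and `surviving_replicas` are made up for illustration, and I assume replicas are placed round-robin across zones.

```python
from collections import Counter
from itertools import cycle


def spread_replicas(replicas: int, zones: list[str]) -> Counter:
    """Assign replicas to zones round-robin (an idealised even spread)."""
    placement = Counter({zone: 0 for zone in zones})
    for _, zone in zip(range(replicas), cycle(zones)):
        placement[zone] += 1
    return placement


def surviving_replicas(placement: Counter, failed_zone: str) -> int:
    """Count the replicas still running after one zone goes down."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


# Hypothetical zone names; real names depend on the provider and region.
zones = ["az-1", "az-2", "az-3"]
placement = spread_replicas(6, zones)

print(dict(placement))                         # {'az-1': 2, 'az-2': 2, 'az-3': 2}
print(surviving_replicas(placement, "az-1"))   # 4
```

With six replicas over three zones, losing one zone still leaves four replicas serving traffic, which is the whole point of spreading the cluster across AZs.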

Multi-Cluster Setup:

Let's switch gears and talk about the second approach. In the multi-cluster setup, instead of having a single cluster spread across several Availability Zones, we have multiple independent clusters in the same or different regions, each of which can itself run across several AZs. In this model, each cluster runs completely independently of the others, and they are not necessarily aware of each other's existence. You then have to use a traffic router at the DNS level, such as Route 53 (for AWS), Azure Traffic Manager (for Azure), or Cloud DNS (for GCP), to load balance across the clusters you have provisioned.

The image above, which I took from the official Azure docs, demonstrates how a multi-cluster setup works. The setup would be pretty much the same on AWS and GCP.
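The idea of a traffic router in front of several independent clusters can be sketched in a few lines of Python. This is only an illustration of weighted routing with health checks, not the API of Route 53, Traffic Manager, or Cloud DNS: the `Cluster` class, the `pick_cluster` function, the cluster names, and the `example.com` endpoints are all assumptions of mine.

```python
import random
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str             # hypothetical cluster name
    endpoint: str         # hypothetical ingress hostname
    weight: int           # relative share of traffic
    healthy: bool = True  # result of the router's health check


def pick_cluster(clusters: list[Cluster], rng: random.Random) -> Cluster:
    """Pick a healthy cluster at random, proportionally to its weight."""
    candidates = [c for c in clusters if c.healthy and c.weight > 0]
    if not candidates:
        raise RuntimeError("no healthy cluster available")
    weights = [c.weight for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


clusters = [
    Cluster("cluster-a", "a.example.com", weight=50),
    Cluster("cluster-b", "b.example.com", weight=50),
]
rng = random.Random(42)

# Simulate losing cluster-a: all traffic now lands on cluster-b.
clusters[0].healthy = False
print(pick_cluster(clusters, rng).name)  # cluster-b
```

Note that the clusters never talk to each other here. Only the router knows that more than one of them exists, which matches the model described above.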

Advantages of Multi-Cluster Setup:

This model increases the availability and fault tolerance of your system compared to the first model, because even if you lose an entire Kubernetes cluster in one region, you can still operate normally by routing the traffic to the healthy cluster running in another region. You might ask yourself how often a region actually fails on any cloud provider. Not that often, I know, but the risk is not limited to a cloud provider's region failure. There is an even greater risk that you lose your cluster as a result of the cluster administrator applying a bad configuration. This model allows for business continuity in that case, whereas with a single cluster you would immediately lose all the running apps until you figure out what you did wrong!
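A rough back-of-the-envelope calculation shows why a second cluster helps. The sketch below assumes that the clusters fail independently and that failover between them is perfect, and the 99% figure is a number I picked purely for illustration, not a provider SLA. The independence assumption is the weak spot: if the same bad configuration is pushed to every cluster at once, they fail together.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def combined_availability(availabilities: list[float]) -> float:
    """Availability of the whole system if it is up whenever at least one
    cluster is up, assuming the clusters fail independently."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.99  # hypothetical availability of one cluster
dual = combined_availability([single, single])

print(f"one cluster:  {single:.4%}, about {downtime_hours_per_year(single):.1f} h/year down")
print(f"two clusters: {dual:.4%}, about {downtime_hours_per_year(dual):.1f} h/year down")
```

Under these assumptions, going from one cluster to two takes the expected downtime from roughly 87.6 hours a year to under one hour.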

The advantage of the multi-cluster setup goes beyond just high availability. It can also be leveraged to reduce latency. Let's say your application has visitors from all around the world. In the single-cluster setup, you are forced to process every request within the one cluster you have, no matter where the request originates. In the multi-region setup, however, you can easily introduce a routing policy in the cloud DNS router to send traffic to a specific cluster depending on the request origin. For example, a request from the U.S. goes to the cluster deployed in a U.S. region, and a request originating from Europe gets processed by the cluster running in a European region. This can help reduce network latency and the impact of network partitions.
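Here is a minimal sketch of such an origin-based routing policy, again in Python and again only as an illustration. The `GEO_POLICY` table, the cluster names, the default cluster, and the `route` function are hypothetical. A real DNS router has its own way of expressing the same idea.

```python
# Hypothetical mapping from the request's country code to a cluster.
GEO_POLICY = {
    "US": "cluster-us",
    "CA": "cluster-us",
    "DE": "cluster-eu",
    "FR": "cluster-eu",
}
DEFAULT_CLUSTER = "cluster-us"  # used for origins not listed above


def route(country_code: str, healthy: set[str]) -> str:
    """Return the cluster that should serve a request from this country.

    Prefers the geographically matching cluster and falls back to any
    other healthy cluster if the preferred one is down.
    """
    preferred = GEO_POLICY.get(country_code.upper(), DEFAULT_CLUSTER)
    if preferred in healthy:
        return preferred
    if not healthy:
        raise RuntimeError("no healthy cluster available")
    return min(healthy)  # deterministic fallback: first name alphabetically


print(route("DE", {"cluster-us", "cluster-eu"}))  # cluster-eu
print(route("DE", {"cluster-us"}))                # cluster-us
```

The second call shows how the latency benefit and the availability benefit combine: a European visitor is normally served from Europe, but still gets an answer from the U.S. cluster if the European one is down.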

Conclusion

So which setup should you use? Well, that's the million-dollar question, but I would say that if you are not running a mission-critical system and you can afford some potential downtime, then a single cluster is more hassle-free. But if business continuity is one of your top priorities, then go with multi-region. Keep in mind that going multi-region increases both the cost and the maintenance effort. It also makes your CI/CD pipeline more complicated, as you need to deploy the same app to more than one cluster.
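To give a feel for the extra pipeline work, here is a minimal sketch of a deploy step that applies the same manifest to several clusters. It assumes each cluster is reachable through its own kubectl context. The context names, the manifest file name, and the helper functions are made up, and by default the script only prints the commands instead of running them.

```python
import subprocess


def build_deploy_commands(contexts: list[str], manifest: str) -> list[list[str]]:
    """Build one `kubectl apply` command per cluster context."""
    return [["kubectl", "--context", ctx, "apply", "-f", manifest] for ctx in contexts]


def deploy(contexts: list[str], manifest: str, dry_run: bool = True) -> dict[str, bool]:
    """Apply the manifest to every cluster and report success per context.

    With dry_run=True the commands are only printed, never executed.
    """
    results = {}
    for cmd in build_deploy_commands(contexts, manifest):
        ctx = cmd[2]
        if dry_run:
            print(" ".join(cmd))
            results[ctx] = True
            continue
        # check=False so that one failing cluster does not abort the loop.
        results[ctx] = subprocess.run(cmd, check=False).returncode == 0
    return results


# Hypothetical kubectl context names and manifest file.
deploy(["prod-us", "prod-eu"], "app.yaml")
```

Even in this toy form you can see the new questions a multi-cluster pipeline has to answer, for example what to do when the deploy succeeds in one cluster and fails in the other.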

Benyamin's playground