Architectural Design Patterns: Cluster Immune System
- Posted by Daitan Innovation Team
- On February 26, 2022
- Design & Architecture, Software Development, Zero Downtime Deployment
With new architectural design patterns and strategies, software deployment is becoming increasingly complex. This complexity, combined with a large number of users, brings more edge cases and problems.
Imagine that you developed a new super cool service, which uses distributed computing techniques to deal with thousands of simultaneous users. A week passes and you receive feedback and decide to incorporate it into your service. You work on the changes and decide to deploy the new version, which leads you to two options:
- Destroy all nodes and deploy the new version
- Progressively deploy nodes with the new version, shifting users to them.
Since you don’t want to annoy your users with downtime, you choose the latter and start the process immediately. Your first node comes up and runs correctly, so you decide to take a walk while the process continues.
Upon returning to your desk, you are bewildered by hundreds of emails and missed calls, only to discover that you made a small mistake and all nodes are crashing every couple of minutes, losing data and making your customers angry. Murphy’s law at its best.
This situation could be avoided if the deployment tool had access to logs and other information to halt the process when necessary. But there are even more advanced techniques, like the one we are going to talk about today: the Cluster Immune System.
If a new version of an application does not meet performance targets, or if a problem is detected, the release process is halted without human intervention, and code is automatically rolled back to the previous version. The deployment remains locked while the problem is investigated.
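The halt, rollback, and lock behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Rollout` class, its `steps` list of traffic percentages, and the `is_healthy` probe are hypothetical names, not part of any real deployment tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rollout:
    """Hypothetical progressive rollout guarded by a health check."""
    steps: List[int]                   # traffic percentages to walk through
    is_healthy: Callable[[int], bool]  # assumed probe: True if metrics look fine
    locked: bool = False               # set once a rollback has happened
    history: List[str] = field(default_factory=list)

    def run(self) -> int:
        """Return the percentage of users left on the new version."""
        if self.locked:
            raise RuntimeError("deployment is locked pending investigation")
        for percent in self.steps:
            self.history.append(f"shift {percent}%")
            if not self.is_healthy(percent):
                # Halt without human intervention, roll back, and lock.
                self.history.append("rollback to previous version")
                self.locked = True
                return 0
        return self.steps[-1] if self.steps else 0
```

A rollout whose probe fails at the 25% step ends with 0% of users on the new version, and any further call to `run()` raises until someone clears the lock.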
How it Works
This technique has two parts: the first is the monitoring infrastructure, and the second is its connection to the deployment procedure. This means that it will work differently depending on the deployment technique being used.
Assuming that the relevant metrics are already being collected, the remaining step is to define the expected values for those metrics, on which deployment decisions will be based. These could be error rates in APIs, latency, usage statistics (e.g. number of logged-in users, transaction rate), or any other metric that matters to a stakeholder. Errors caused by previous updates are good candidates.
In the case of Canary Deployments or Rolling Updates, the cluster immune system is used to monitor the new version and control the percentage of users being migrated to it. The following decision table illustrates how the measurements could impact the deployment process:
| Signal | Measured Value | Continue Rollout | Alert Ops | Rollback |
Depending on the maturity of the deployment process, other decisions could be linked to the metrics, such as update rate or the algorithm to select the set of users receiving the new version.
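One way to picture the mapping from measured values to deployment decisions is a small threshold lookup. The sketch below is an assumption-laden illustration: the signal names, the numeric thresholds, and the `decide` function are all invented for the example, and it assumes that higher values are worse for every signal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Decision(Enum):
    CONTINUE = "continue rollout"
    ALERT = "alert ops"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Threshold:
    warn: float      # at or above this value: alert ops
    critical: float  # at or above this value: rollback


# Hypothetical signals and limits, chosen only for illustration.
THRESHOLDS: Dict[str, Threshold] = {
    "api_error_rate": Threshold(warn=0.01, critical=0.05),
    "p95_latency_ms": Threshold(warn=300.0, critical=800.0),
}


def decide(measured: Dict[str, float]) -> Decision:
    """Return the most severe decision triggered by any measured signal."""
    worst = Decision.CONTINUE
    for signal, value in measured.items():
        limit = THRESHOLDS[signal]
        if value >= limit.critical:
            return Decision.ROLLBACK  # one critical signal is enough
        if value >= limit.warn:
            worst = Decision.ALERT
    return worst
```

Signals where a drop is the bad sign (for example, the number of logged-in users) would need the comparison inverted; that is left out to keep the sketch short.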
- A good, reliable set of metrics is necessary to use this technique
- Problems with the monitoring solution might affect the rollout process
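Because the rollout depends on the monitoring system, one possible safeguard is to treat missing or stale measurements as a reason to pause rather than as a sign of health. The sketch below assumes a hypothetical `gate` function and an arbitrary 60-second freshness limit; neither comes from a specific tool.

```python
from typing import Optional

MAX_SAMPLE_AGE_S = 60.0  # assumed freshness limit, for illustration only


def gate(value: Optional[float], sample_age_s: float, limit: float) -> str:
    """Decide the next rollout step from a single metric sample."""
    if value is None or sample_age_s > MAX_SAMPLE_AGE_S:
        # No trustworthy data: neither continue nor roll back blindly.
        return "pause"
    return "rollback" if value >= limit else "continue"
```

Pausing keeps a monitoring outage from either pushing a bad version to more users or rolling back a healthy one.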
When to Use
This technique is usable whenever an automated deployment strategy is being used and a reliable monitoring system is in place. The level of automation (regarding the deployment control) can be selected according to the maturity level of the deployment process. Rollbacks are especially difficult to handle when data schema changes are part of the update, so triggering them automatically could be risky.
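The risk around schema changes could be handled by downgrading an automatic rollback to a human decision when the release carries a migration that cannot be reverted. The `Release` fields and the `rollback_action` function below are hypothetical names used only to illustrate that guard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical release descriptor; field names are assumptions."""
    version: str
    has_schema_change: bool
    schema_change_reversible: bool = False


def rollback_action(release: Release) -> str:
    """Pick how to react when the immune system asks for a rollback."""
    if release.has_schema_change and not release.schema_change_reversible:
        # Reverting code alone could leave data in an inconsistent state.
        return "halt and page operator"
    return "automatic rollback"
```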
Adopting in a Greenfield
The first step in adopting this technique is to adopt one of the required deployment processes, such as Canary Deployment or Rolling Updates.
It’s also important to define, during the design of every feature, which signals will be monitored and acted upon in an update. In addition, the deployment process for every component must support, as a requirement, all the automatic actions that the immune system can trigger.
The monitoring system design also needs to be robust enough so that the immune system can take actions on error signals.
Adopting in a Brownfield
The first step in adopting this technique is to make sure the deployment process is mature enough so that it can be automated. This can be done by incrementally introducing the immune system with the following steps:
- Select candidate metrics from the monitoring system. You can use historical failure-after-update data to get good candidates.
- Track the quality of the candidate metrics emitted by the monitoring system, fixing false alarms and adding alarms that are missing.
- Implement a dry-run mode for the immune system, so that decisions can be tracked without actually being automatically executed.
- Implement a manual gatekeeper for the actions, so that the operations staff can review and approve (apply) the actions.
- Apply this mechanism in a pre-production environment, where you have the same configuration and a controlled environment to validate each new iteration of the process evolution.
- Promote stable/reliable actions to be fully automated.
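The progression from dry-run mode to a manual gatekeeper to full automation can be sketched as a single runner with three modes. Everything here is an illustrative assumption: the `Mode` enum, the `ActionRunner` class, and the `approve` callback stand in for whatever the real deployment tooling provides.

```python
from enum import Enum
from typing import Callable, List


class Mode(Enum):
    DRY_RUN = 1  # record decisions only
    MANUAL = 2   # execute only after an operator approves
    AUTO = 3     # execute immediately


class ActionRunner:
    """Hypothetical executor for actions proposed by the immune system."""

    def __init__(self, mode: Mode,
                 approve: Callable[[str], bool] = lambda action: False):
        self.mode = mode
        self.approve = approve         # assumed operator approval hook
        self.log: List[str] = []       # every proposed action, in any mode
        self.executed: List[str] = []  # actions that were actually applied

    def submit(self, action: str) -> bool:
        """Record the action and return True if it was executed."""
        self.log.append(f"{self.mode.name}: {action}")
        if self.mode is Mode.DRY_RUN:
            return False
        if self.mode is Mode.MANUAL and not self.approve(action):
            return False
        self.executed.append(action)
        return True
```

Since every mode writes to the same log, the decisions recorded in dry-run mode can be compared later with what operators approved, which gives some evidence for deciding which actions to promote to `AUTO`.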
- Good metrics are essential. Take your time to select the best ones for this technique.
- Rollbacks are especially difficult if you have changes in data structure. Make sure that you can revert changes in your database schema, for example.
- This technique works best with small but continuous changes. If you are planning to make a major one, check other options (or make sure that you have taken all precautions).
This article was written by João Pedro São Gregório Silva, from Daitan Innovation Team and co-authored by Isac Sacchi Souza, Principal DevOps Specialist, Systems Architect & member of the Daitan Technology Council. Thanks to João Augusto Caleffi and the SRE/DevOps Community of Practice for reviews and insights.