What Is High Availability?
Highly available (HA) systems are reliable: they continue business operations even when critical components fail. They are also resilient: they can handle failure without service disruption or data loss, and recover from that failure easily.
High availability is usually measured as a percentage of uptime, and the number of “nines” is commonly used to indicate the degree of availability. A system that is up 99.99% of the time is down for only about 52.6 minutes over an entire year.
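As a quick illustration of the “nines” math, here is a minimal sketch (the function name is our own) that converts an availability percentage into the annual downtime it allows:

```python
# Convert an availability percentage ("nines") into allowed annual downtime.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime allows {annual_downtime_minutes(pct):.1f} minutes down per year")
```

At 99.99%, that works out to roughly 52.6 minutes per year, as noted above.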
Let's talk about a few of the best practices we can use to design our systems for high availability.
- Fault Isolation Zones: Fault isolation zones take the form of Availability Zones and regions. If one particular data center fails, then by leveraging multiple data centers across multiple Availability Zones, that failure is isolated to the one zone, and our application can continue running in the others.
- Redundancy: The more redundant components we have, the less impact any single component's failure has on the overall application.
- Recovery-Oriented Computing: We can also leverage a design pattern called Recovery-Oriented Computing. Instead of spending a lot of time, energy, and effort maintaining the underlying infrastructure, we treat resources as disposable: when one fails, we simply throw it away and let automated processes, such as auto-scaling, user data bootstrapping, and configuration management, bring a replacement back into a known state. And if we need that state to change, we go back to our source of truth: we update our machine images to create the new preconfigured state, or we update our configuration management or user data bootstrapping scripts to create the state we need when a machine is recovered or replaced.
- Microservices: We can also leverage a microservice architecture. First, the risk of change is much lower: if deploying one small service fails, we can roll back more easily. Microservices have many other benefits, and of course we can design the services so that if any one of them fails, the others continue running.
- Automate Everything: We should automate everything so that we don't rely on humans, who work slowly and make mistakes. The more we automate, the more we avoid the errors humans might inject, and the faster and more efficiently we can recover from any kind of failure.
- Scale when needed: Scalability is key: we should scale when we need to scale. We can design systems that don't require human intervention to scale; with auto-scaling, they scale automatically.
- Monitor Everything: We should also monitor everything: not just metrics like CPU, memory, and network I/O, but everything within our application, including logs.
- Build Stateless Applications: If an application stores state, we have more dependencies to rely on, and if it stores state locally, we have to ask what happens when that state is gone. If our applications are stateless, it is much easier to scale them and to decouple other components.
- Apply appropriate design patterns: We should use patterns like circuit breakers or request throttling where appropriate, so that we build more resilience into the application itself and don't just rely on the resilience of our infrastructure.
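The redundancy point above has a simple probabilistic intuition: if each component is independently available with probability *a*, then at least one of *n* copies is up with probability 1 − (1 − *a*)ⁿ. A minimal sketch (assuming fully independent failures, which is an idealization):

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of a system that works if at least one of `copies`
    independent, identical components is up."""
    return 1 - (1 - component_availability) ** copies

# e.g. a single component at 99% gives two nines;
# two such components in parallel give roughly 99.99% ("four nines").
print(combined_availability(0.99, 1))
print(combined_availability(0.99, 2))
```

Real components rarely fail independently (a power or network event can take out several at once), which is exactly why fault isolation zones matter: they make the independence assumption more believable.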
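The “scale when needed” idea can be made concrete with a toy scaling policy. This is only a sketch of the decision logic behind auto-scaling; the thresholds and names here are invented for illustration, not any cloud provider's API:

```python
def desired_capacity(current: int, avg_cpu: float,
                     high: float = 70.0, low: float = 30.0,
                     minimum: int = 2, maximum: int = 10) -> int:
    """Instance count a simple threshold policy would target,
    clamped to a [minimum, maximum] range."""
    if avg_cpu > high:
        target = current + 1   # scale out under load
    elif avg_cpu < low:
        target = current - 1   # scale in when idle
    else:
        target = current       # stay put inside the band
    return max(minimum, min(maximum, target))

print(desired_capacity(current=3, avg_cpu=85.0))  # scales out to 4
```

A real policy would also smooth the metric over a window and enforce cooldown periods so the fleet doesn't oscillate between scaling out and in.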
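The circuit-breaker pattern mentioned in the last bullet can be sketched minimally: after a run of consecutive failures, the breaker “opens” and fails fast instead of continuing to hammer a struggling dependency. This toy version (our own, with no half-open state, timers, or thread safety) shows just the core idea:

```python
class CircuitOpenError(Exception):
    """Raised when calls are rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0  # consecutive failures seen so far

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, func, *args, **kwargs):
        if self.is_open:
            # Fail fast: don't pile more load onto a failing dependency.
            raise CircuitOpenError("circuit open; dependency presumed down")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result
```

A production implementation would also re-close the breaker after a cool-down by letting a single probe request through (the “half-open” state), so the system recovers automatically once the dependency is healthy again.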