After almost 20 years of working in IT, I’m still amazed by how often people confuse High Availability (HA) with Disaster Recovery (DR). And I’m not talking only about people in their first job, but also about professionals who have been in the industry for quite some years now…
Some people tend to minimize the difference, but in my opinion that is not wise, as the HA and DR options you choose have a direct impact on both the architecture of the solution and its cost.
So first things first: what is HA and what is DR? High Availability refers to a solution that allows a system to stay up and running when one or more of its components fail. Generally speaking, there is a disruption that tends to last a few seconds, and some degradation might be noticed, but overall the system keeps working. HA should not require any manual intervention.
Disaster Recovery, on the other hand, kicks in when all components fail. At that moment, the systems need to be recovered and a disruption is observed that could last from minutes to days, depending on the RTO (Recovery Time Objective) of the affected systems. DR solutions often involve manual operations.
One example of HA is a web site that resides on three web servers that connect to a clustered database. If one or two web servers go down, the remaining ones will still function. Operations in progress on the failed web servers will be terminated and will need to be resubmitted. Users might see an error message and, after some retry attempts, will be able to keep using the web site. Another example is a server with two power cords connected to different electrical phases in a data center. If one phase has a problem, the other keeps delivering electricity to the server, avoiding a power outage.
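The web server example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the servers are simulated as plain functions, and the names `make_server`, `send_with_failover` and `ServerDown` are hypothetical. In a real environment this job is normally done by a load balancer, not by client code.

```python
class ServerDown(Exception):
    """Raised by a simulated web server that is offline."""


def make_server(name, healthy):
    """Build a fake web server: a function that serves or fails."""
    def handle(request):
        if not healthy:
            raise ServerDown(name)
        return f"{name} served {request}"
    return handle


def send_with_failover(servers, request):
    """Resubmit the request to the next server until one answers."""
    failed = []
    for server in servers:
        try:
            return server(request)
        except ServerDown as exc:
            failed.append(str(exc))
    # Every component failed: HA cannot help any more, this is DR territory.
    raise RuntimeError(f"all servers down: {failed}")


# Two of the three web servers are down, the site still answers.
pool = [
    make_server("web1", healthy=False),
    make_server("web2", healthy=False),
    make_server("web3", healthy=True),
]
print(send_with_failover(pool, "GET /"))
```

With two servers down, the request is still served by the third one after a couple of retries. Only when every server in the pool fails does the function give up, which is the point where HA ends and DR begins.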
For Disaster Recovery examples, consider scenarios like these, building on the examples above: all three web servers go down, or the database cluster goes down, or both power phases fail and bring the server offline, or there’s a flood in the data center and everything goes down. When this happens, you need to declare DR and start the recovery at a remote location. There are different alternatives for recovering, like recalling tapes from a storage facility and restoring them on standby hardware. If replication is the method for DR, then you will need to bring the system up at the remote location, etc.
DR is associated with two indicators: RTO (Recovery Time Objective) and RPO (Recovery Point Objective). RTO is how long the company is willing to wait for the system to be operational again. For instance, an RTO of one hour means that, once DR is declared, the system should be operational again in one hour or less. RPO is how much data the company is willing to lose. For instance, an RPO of 15 minutes means the company accepts losing the last 15 minutes of business. Taken together, you need to bring the system up in one hour or less while losing 15 minutes of data or less.
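To make the two indicators concrete, here is a small Python sketch that checks one recovery against an RTO of one hour and an RPO of 15 minutes. The function name `meets_objectives` and all the timestamps are made up for the example. It assumes that the RTO clock starts when DR is declared, and that data loss is measured from the last recoverable copy (backup or replicated transaction) to the moment of the failure.

```python
from datetime import datetime, timedelta


def meets_objectives(failed_at, declared_at, restored_at,
                     last_recovery_point, rto, rpo):
    """Return (rto_met, rpo_met) for a single recovery."""
    downtime = restored_at - declared_at         # counted from the DR declaration
    data_loss = failed_at - last_recovery_point  # work done after the last good copy
    return downtime <= rto, data_loss <= rpo


# Hypothetical incident timeline.
rto_met, rpo_met = meets_objectives(
    failed_at=datetime(2024, 1, 10, 10, 0),
    declared_at=datetime(2024, 1, 10, 10, 10),
    restored_at=datetime(2024, 1, 10, 11, 5),
    last_recovery_point=datetime(2024, 1, 10, 9, 50),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=15),
)
print(f"RTO met: {rto_met}, RPO met: {rpo_met}")
```

In this timeline the system is back 55 minutes after the declaration and 10 minutes of data are lost, so both objectives are met. Move the last recovery point to 09:30 and the RPO is missed, even though the system came back on time.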
HA is associated with one indicator: SLA (Service Level Agreement). When you set your SLA to 99.9%, that means that, outside your maintenance window, you will be up and running 99.9% of the time. For instance, if you have a maintenance window of two hours a month, an SLA of 99.9% means that you can be down approximately 8.7 hours a year (at 99.99%, it is less than one hour a year).
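The arithmetic behind those numbers is simple enough to script. The sketch below assumes a 365-day year and excludes the maintenance window from the time the SLA applies to. The function name `allowed_downtime_hours` is just an illustrative choice.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent, maintenance_hours_per_month=0.0):
    """Unplanned downtime per year that an SLA still tolerates."""
    # The SLA only covers the time outside the maintenance window.
    covered_hours = HOURS_PER_YEAR - 12 * maintenance_hours_per_month
    return covered_hours * (1 - sla_percent / 100)


for sla in (99.9, 99.99):
    hours = allowed_downtime_hours(sla, maintenance_hours_per_month=2)
    print(f"{sla}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

For 99.9% this gives a bit more than 8.7 hours a year, and for 99.99% a bit less than one hour, so every extra "nine" cuts the tolerated downtime by a factor of ten.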
If you have an SLA of 99.9%, you need to invest in robust HA and DR solutions, which translates into higher costs: more servers in the environment, more time designing and testing the solution before Go-Live, and regular DR tests to ensure you meet your metrics. HA needs to be treated differently from DR, as the goals they serve are different…