Highly available computing infrastructure is the norm in the industry today. This is especially true of cloud platforms, where it is the key feature that enables the workloads running on them to be highly available.
High availability, also known as HA, is the ability of a system to stay online despite failures at the infrastructure level as they occur.
The sole mission of a highly available system is to stay online and stay connected. A very basic example is having backup generators to ensure a continuous power supply during a power outage.
In the industry, HA is often expressed as a percentage. For instance, a system with 99.99999% availability is expected to be up for 99.99999% of its total hosting time. You will often see this figure in the SLAs (Service Level Agreements) of cloud platforms.
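To make the percentage concrete, here is a minimal sketch that converts an availability figure into the downtime it allows over a period. The function name `allowed_downtime_seconds` and the 365-day year are assumptions made for illustration. Real SLAs define their own measurement windows and exclusions.

```python
def allowed_downtime_seconds(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in seconds, implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    # The unavailable fraction of the period is the downtime budget.
    return period_seconds * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.99999):
        seconds = allowed_downtime_seconds(pct)
        print(f"{pct}% availability allows {seconds:,.2f} s of downtime per year")
```

Under these assumptions, 99% availability still permits more than three and a half days of downtime per year, while 99.99999% leaves a budget of only about three seconds.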
It might not impact businesses that much if social applications go down for a bit and then bounce back. However, there are mission-critical systems, such as aircraft systems, spacecraft, mining machines, hospital servers and stock market systems, that just cannot afford to go down at any time. After all, lives depend on them.
To meet high availability requirements, systems are designed to be fault-tolerant and their components are made redundant.
Note that a fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
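The distinction can be sketched in code. The example below is only an illustration, not a production failover mechanism: the names `call_with_failover`, `ServiceFailure`, `faulty_replica` and `healthy_replica` are hypothetical. Each exception from a single replica stands for a fault, and a failure is reported only when every redundant replica has faulted.

```python
from typing import Callable, List, Sequence


class ServiceFailure(Exception):
    """Raised when the system as a whole can no longer serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in order and return the first successful result."""
    faults: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a fault in one component
            faults.append(exc)
    # Only when every replica has faulted does the fault become a failure.
    raise ServiceFailure(f"all {len(replicas)} replicas faulted: {faults}")


def faulty_replica() -> str:
    """Hypothetical replica that always faults."""
    raise ConnectionError("replica unreachable")


def healthy_replica() -> str:
    """Hypothetical replica that always answers."""
    return "ok"


if __name__ == "__main__":
    # The first replica faults, but the user still gets a response.
    print(call_with_failover([faulty_replica, healthy_replica]))
```

With one healthy replica in the list the caller never sees the fault. With none, the fault surfaces as a `ServiceFailure`, which is the outcome redundancy is meant to make rare.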
Bugs are typically systematic and hard to deal with.
Software crashes are one reason for system faults: corrupt software files, the OS crashing, memory-hogging unresponsive processes. Remember the BSOD (blue screen of death) in Windows? Likewise, software running on cloud nodes can crash unpredictably and take down the entire node with it.
Another reason for system faults is hardware failure: overloaded CPUs, RAM and hard disk failures, nodes going down, and network outages.
Flawed configurations and similar mistakes are the biggest reason for system failures.
Besides unplanned crashes, there is planned downtime, which involves routine maintenance operations, software patching, hardware upgrades, etc.