One recommendation to counteract SPOFs is to build redundancies. Several instances of a critical component (e.g., power supply, network connection, DNS server) are operated in parallel. If one fails, the system continues to operate without loss of performance.
Redundancy also prevents many SPOFs on the software side. One example is the popular microservice architecture compared with the software monolith. A system of microservices is decoupled and less complex, which makes it more robust against SPOFs, and since microservices are launched as containers, building redundancy is easier.
But how exactly does redundancy protect a system? Let’s use the estimation of reliability of a system known as “Lusser’s law” to illustrate. Here’s a thought example:
Assume a system has two independent, parallel connections to a power supply. Let us further assume that the probability of each connection failing within a given period is 1 percent. Then the probability of complete failure of the power link is the product of the individual probabilities:
- Probability of failure of one instance: 1% = 1 / 100 = 1 / 10^2 = 0.01
- Probability of both instances failing within the same period: 1% × 1% = (1 / 10^2)^2 = 1 / 10^4 = 0.0001
As you can see, with two instances the probability of a total failure isn’t halved but reduced by two orders of magnitude. That’s a considerable improvement. With three instances running in parallel, the same calculation takes it down by another two orders of magnitude, and a failure of the entire system should be almost impossible.
Unfortunately, redundancy is no panacea. It protects a system from SPOFs only under certain assumptions. First, the probability of one instance failing must be independent of the probability of the redundant instance(s) failing. That’s not the case when a failure is caused by an external event: if a data center is on fire, redundant components fail together.
In addition to redundancy of deployed components, the distribution of certain components is critical for mitigating SPOFs. Geographic distribution of data storage and computing infrastructure protects against environmental disasters. Further, it pays to strive for some heterogeneity or diversity among critical system components: diversity reduces the probability that redundant instances fail together.
Let’s illustrate the advantage of diversity with an example from cybersecurity. Imagine a data center with redundant load balancers of exactly the same design. A security vulnerability in one of the load balancers is also present in the redundant instances; in the worst case, an attack paralyzes all of them at once. By using different models, the overall system stands a better chance of continuing to operate, if at reduced performance.
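The following Python script is an added example, not code from the original article. It carries the product-rule calculation above through for any number of parallel instances, and then drops the independence assumption using a simple "beta-factor" split that I assume here for illustration: a fraction `beta` of an instance's failure probability is attributed to a cause shared by every instance of the same design (a firmware bug, a vulnerability in one load balancer model), the rest to independent faults. Instances of different designs share no common cause, which is what the diversity argument relies on. The design labels `"A"` and `"B"` and the sample values are constructed.

```python
"""Parallel redundancy under the independence assumption, and what happens
to it when identical instances share a common cause of failure.

Added example for the article's SPOF discussion. The beta-factor split
below is an assumed, simplified model, not something the article specifies.
"""
import math
from collections import Counter
from typing import Sequence


def parallel_failure_probability(p_instance: float, n_instances: int) -> float:
    """Probability that all n independent parallel instances fail in the same
    period: the product rule from the article (two at 1% -> 0.01 ** 2 = 0.0001)."""
    if not 0.0 <= p_instance <= 1.0:
        raise ValueError("a probability must lie in [0, 1]")
    if n_instances < 1:
        raise ValueError("need at least one instance")
    return p_instance ** n_instances


def group_failure_probability(p_instance: float, k: int, beta: float) -> float:
    """Probability that all k instances of ONE design fail in the same period.

    Assumed beta-factor model: with probability beta * p a common-cause event
    takes out the whole group at once; otherwise each instance fails
    independently with the remaining, conditional probability q. With beta = 0
    this collapses to the product rule; with beta = 1 redundancy buys nothing.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    p_common = beta * p_instance
    # Independent failure probability given that no common-cause event occurred,
    # chosen so that a single instance still fails with probability exactly p.
    q = 0.0 if p_common >= 1.0 else (1.0 - beta) * p_instance / (1.0 - p_common)
    return p_common + (1.0 - p_common) * q ** k


def system_failure_probability(p_instance: float, designs: Sequence[str],
                               beta: float) -> float:
    """Probability that every instance in a parallel set fails.

    `designs` lists the design of each instance, e.g. ["A", "A"] for two
    identical load balancers or ["A", "B"] for two different models. Groups of
    different designs are assumed to fail independently of each other, so
    the system fails only if every design group fails.
    """
    result = 1.0
    for k in Counter(designs).values():
        result *= group_failure_probability(p_instance, k, beta)
    return result


def orders_of_magnitude_gained(p_single: float, p_system: float) -> float:
    """How many powers of ten redundancy removed from the failure probability."""
    return math.log10(p_single / p_system)


if __name__ == "__main__":
    p = 0.01          # constructed: 1% failure probability per instance and period
    beta = 0.10       # constructed: 10% of failures share a cause within a design
    print(f"per-instance failure probability: {p}")
    for n in (1, 2, 3):
        ind = parallel_failure_probability(p, n)
        print(f"{n} independent instance(s): {ind:.2e} "
              f"({orders_of_magnitude_gained(p, ind):.1f} orders of magnitude gained)")
    same = system_failure_probability(p, ["A", "A"], beta)
    diverse = system_failure_probability(p, ["A", "B"], beta)
    print(f"2 identical instances, beta={beta}: {same:.2e}")
    print(f"2 diverse instances,   beta={beta}: {diverse:.2e}")
    print(f"1 instance, fully correlated (beta=1): "
          f"{system_failure_probability(p, ['A', 'A'], 1.0):.2e}")
```

Running it shows the article's numbers for the independent case (0.01, 0.0001, 0.000001) and then the cost of the hidden assumption: with only 10 percent of failures traced to a shared cause, two identical instances fail together roughly ten times as often as the product rule predicts, while two instances of different designs recover the full 10^-4.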