A single point of failure (SPOF) describes a system vul­ner­a­bil­i­ty in the form of a single component. If the component fails, the entire system fails. What are the different types of SPOF and how can you minimize the risk of SPOFs happening?

What is a single point of failure?

A single point of failure (SPOF) describes a type of vul­ner­a­bil­i­ty within a system. A SPOF exists when the mal­func­tion of a single component causes the failure of the entire system. Several “failure modes” exist. These can be broadly dis­tin­guished into three cat­e­gories:

  1. Achilles’ heel or “weakest link in the chain”: The loss of one component leads to a sudden loss of function of the entire system.
  2. Chain reaction or “domino effect”: The failure of one component causes the suc­ces­sive failure of other com­po­nents leading to the failure of the entire system.
  3. Bot­tle­neck: A component acts as a limiting factor of the overall system. If the limiting component is impaired, the per­for­mance of the system is reduced to the capacity of the component.
Note

A single point of failure doesn’t nec­es­sar­i­ly describe a technical component. One of the most frequent cases is human error.

Where do single points of failure occur most often?

SPOFs are common in complex systems with in­ter­con­nect­ed com­po­nents in multiple layers. Depending on the structure of the system, the failure of one critical component causes the failure of the whole system. The single point of failure takes the form of a critical component.

The complexity of a multi-layered system can make it difficult to detect SPOFs. There’s no easy way to identify the interactions of individual components, so faults or issues are hard to spot. In principle, every non-redundant component critical for operation should be treated as a single point of failure.

Take the human body, for example. We’ve only got one heart or brain – the critical organs are not designed re­dun­dant­ly. If one of these organs fails, the entire body fails. Heart and brain are SPOFs. By contrast, eyes, ears, lungs, and kidneys are du­pli­cat­ed. If necessary, the body com­pen­sates for the failure of one and continues operating at reduced ef­fi­cien­cy.

In a data center, all com­po­nents critical to operation are potential SPOFs. Therefore, servers are usually equipped with redundant con­nec­tions to the power grid and network. Mass storage is provided re­dun­dant­ly via RAIDs or similar tech­nolo­gies. The aim is to ensure the system continues to operate should a single, critical component fail.

What are some classic SPOF examples?

There are many different types of single points of failure (SPOFs). After all, SPOFs don’t just affect information systems. Let’s take a look at some examples.

Death Star destroyed by single point of failure

In the popular “Star Wars” movies, a single point of failure leads to the de­struc­tion of the dreaded “Death Star”. A single proton torpedo fired by the pro­tag­o­nist hits a critical spot on the reactor. The explosion causes a cat­a­stroph­ic chain reaction that destroys the entire Death Star.

Suez Canal paralyzed by single point of failure

In 2021, the container ship “Ever Given” got stuck in the Suez Canal. The ship ran aground in a critical section where the canal consists of a single waterway. The blockage paralyzed shipping traffic along the entire canal. The single point of failure was the non-redundant waterway.

Boeing 737 MAX crashed by SPOF

In 2018 and 2019 there were two crashes of the “Boeing 737 MAX” aircraft causing the loss of hundreds of lives. The cause of the crashes was a single sensor feeding erroneous data. Based on the sensor data, the automatic flight control system didn’t perform correctly and ul­ti­mate­ly brought down the planes. Several errors came together, but the single point of failure was the sensor.

High-avail­abil­i­ty systems taken offline by SPOF

Even systems designed for high avail­abil­i­ty aren’t fully protected from SPOFs. In recent years, major cloud services have re­peat­ed­ly ex­pe­ri­enced serious failures. In most cases, the single point of failure was human. The wrong con­fig­u­ra­tion data can quickly paralyze an entire pro­duc­tion system, even if its com­po­nents are designed re­dun­dant­ly.

DNS as single point of failure in computing systems

Your device is connected to Wi-Fi, but the web browser isn’t working. Then the clock starts au­to­mat­i­cal­ly adjusting the time. Sound familiar? It’s enough to make you tear your hair out, but the answer is simple:

Quote

“It’s always DNS.” – Source: https://talesofatech.com/2017/03/rule-1-its-always-dns/

The catchphrase “It’s always DNS” sounds like a joke but is a serious description of the error potential of the Domain Name System (DNS). After all, when DNS servers don’t answer, websites and services can fail in a variety of ways. The effect is similar to having your connection to the Internet cut. However, packet traffic between IP addresses still works in this case.

DNS errors usually occur on the user side if the DNS servers stored in the system are not ac­ces­si­ble. It’s therefore best practice to store two name server addresses. If the first server is un­avail­able, the second is used. This creates re­dun­dan­cy and resolves the single point of failure.

Often, both DNS servers belong to the same or­ga­ni­za­tion. If one of them fails, there’s a high prob­a­bil­i­ty that the other is also affected. To be safe you can store the addresses of two name­servers from different or­ga­ni­za­tions. A popular com­bi­na­tion is 1.1.1.1 and 9.9.9.9 from Cloud­flare and Quad9 as primary and secondary DNS servers.
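The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `resolve_with_fallback`, `ResolutionError`, and the `query` callable are hypothetical names, and `fake_query` stands in for a real DNS lookup so that the example runs without network access.

```python
from typing import Callable, Iterable, Optional


class ResolutionError(Exception):
    """Raised when no configured name server returns an answer."""


def resolve_with_fallback(
    hostname: str,
    nameservers: Iterable[str],
    query: Callable[[str, str], Optional[str]],
) -> str:
    """Ask each name server in order and return the first answer.

    `query(nameserver, hostname)` is a hypothetical lookup function supplied
    by the caller. It returns an IP address, returns None, or raises OSError.
    """
    errors = []
    for server in nameservers:
        try:
            answer = query(server, hostname)
        except OSError as exc:  # server unreachable or timed out
            errors.append(f"{server}: {exc}")
            continue
        if answer is not None:
            return answer
        errors.append(f"{server}: no answer")
    raise ResolutionError(f"all name servers failed for {hostname}: {errors}")


# Resolvers operated by two different organizations (Cloudflare and Quad9).
NAMESERVERS = ["1.1.1.1", "9.9.9.9"]


def fake_query(server: str, hostname: str) -> Optional[str]:
    """Stand-in for a real DNS query: the first resolver is simulated as down."""
    if server == "1.1.1.1":
        raise OSError("timed out")
    return "192.0.2.10"  # address from a range reserved for documentation


print(resolve_with_fallback("example.com", NAMESERVERS, fake_query))
```

With only one name server in the list, the same simulated outage would raise `ResolutionError`, which is exactly the single point of failure the second entry removes.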

Java logging library as single point of failure

At the end of 2021, a large number of Java-based web services were affected by the Log4Shell vulnerability. The single point of failure was a Java logging library called Log4j. In the worst case, an attack led to the takeover of the entire vulnerable system.

How to avoid SPOFs?

Generally, prevention is the best strategy against SPOFs. By definition, a single point of failure leads to the loss of function of the entire system. Once that happens, it’s often too late, and limiting the damage may be your only option.

That’s why pre­ven­tive measures and planning for emer­gen­cies are a better strategy. You can act out credible failure scenarios and analyze risks and possible pro­tec­tive measures. Different types of single points of failure can be prevented by certain features in a system:

System feature | Protects against | Description | Example
Redundancy | Achilles’ heel, bottleneck | System can continue to operate without performance degradation in the event of failure | Multiple DNS servers stored in network device
Diversity | Chain reaction | Lowers risk of redundant components being affected by failure | Linux computers not vulnerable to Windows Trojans
Distribution | Chain reaction, Achilles’ heel, bottleneck | Lowers risk of redundant components being affected by failure | Head of state doesn’t travel on the same plane as the deputy
Isolation | Chain reaction | Disrupts domino effect | Fuse protects power grid from overload
Buffer | Bottleneck | Absorbs load peaks occurring before bottlenecks | Queue in front of check-in counter at airport
Graceful degradation | Achilles’ heel, chain reaction | Allows for continued operation of the system without catastrophic result in case individual components fail | When losing one eye, vision is not entirely lost but depth perception is disrupted
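The “Isolation” entry has a common software analogue, often called a circuit breaker: like a fuse, it cuts off a failing component so the fault cannot spread. The following is a minimal sketch, not a production implementation. The class and parameter names are illustrative, and real circuit breakers usually also re-close automatically after a timeout, which is omitted here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Stops calling a component after too many consecutive failures."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, func: Callable[[], T]) -> T:
        if self.is_open:
            # Fail fast instead of piling more load onto a broken component.
            raise CircuitOpenError("downstream component isolated")
        try:
            result = func()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the consecutive-failure count
        return result

    def reset(self) -> None:
        """Close the breaker again, e.g. after the component was repaired."""
        self.failures = 0
```

Callers wrap each request to the risky component in `breaker.call(...)`. Once the breaker is open, they receive `CircuitOpenError` immediately and can fall back to a degraded mode instead of waiting on a component that is known to be down.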

Well-prepared, critical systems are subjected to con­tin­u­ous mon­i­tor­ing to detect errors as early as possible and correct them if necessary.

Minimize single points of failure through re­dun­dan­cy

One rec­om­men­da­tion to coun­ter­act SPOFs is to build re­dun­dan­cies. Several instances of a critical component (e.g., power supply, network con­nec­tion, DNS server) are operated in parallel. If one fails, the system continues to operate without loss of per­for­mance.

Redundancy also prevents many SPOFs on the software side. One example is a microservice architecture compared to a software monolith. A system of microservices is decoupled, and each service is less complex, which makes the system more robust against SPOFs. Since microservices are typically launched as containers, it’s also easier to build redundancies.

But how exactly does redundancy protect a system? Let’s use the product rule behind the reliability estimate known as “Lusser’s law” to illustrate. Here’s a thought experiment:

Assume a system has two independent, parallel connections to a power supply. Let us further assume that the probability of each connection failing within a given period is 1 percent. The probability of complete failure of the power link can then be calculated as the product of the individual probabilities:

  1. Probability of failure of one instance:

1% = 1 / 100 = 1 / 10 ^ 2 = 0.01

  2. Probability of both instances failing in the same period:

1% * 1% = (1 / 10 ^ 2) ^ 2 = 1 / 10 ^ 4 = 0.0001

As you can see, running two instances doesn’t merely halve the probability of total failure; it reduces it by two orders of magnitude. That’s a considerable improvement. With three instances running in parallel, the probability drops to one in a million, so a failure of the entire system should be almost impossible.
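The same arithmetic can be written as a small helper for any number of instances. This is a sketch that assumes all instances fail independently and with the same probability; the function name is illustrative.

```python
def total_failure_probability(p_single: float, instances: int) -> float:
    """Probability that all independent instances fail in the same period."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return p_single ** instances


# Each additional instance lowers the result by two orders of magnitude
# when a single instance fails with a probability of 1 percent.
for n in (1, 2, 3):
    print(n, total_failure_probability(0.01, n))
```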

Unfortunately, redundancy is no panacea. It protects a system from SPOFs only under certain assumptions. Most importantly, the probability of failure of one instance must be independent of the probability of failure of the redundant instance(s). That’s not the case when a failure is caused by an external event. If a data center is on fire, redundant components fail together.
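To see why independence matters, consider a deliberately simplified model, introduced here purely as an assumption: with probability `p_common`, a shared event such as a fire takes out every instance at once; otherwise the instances fail independently. The function name and the example numbers are illustrative.

```python
def failure_with_common_cause(
    p_single: float, instances: int, p_common: float
) -> float:
    """Total failure probability under a simple common-cause model.

    Either a shared event disables all instances (probability p_common),
    or the instances fail independently of each other.
    """
    independent = p_single ** instances
    return p_common + (1.0 - p_common) * independent


# With a 0.1 percent chance of a shared event, adding instances stops
# helping: the result never drops below p_common.
for n in (1, 2, 3, 4):
    print(n, failure_with_common_cause(0.01, n, 0.001))
```

In this model the shared event sets a floor on the failure probability that no number of additional instances in the same location can remove.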

In addition to re­dun­dan­cy of deployed com­po­nents, dis­tri­b­u­tion of certain com­po­nents is critical to mitigate SPOFs. Ge­o­graph­ic dis­tri­b­u­tion of data storage and computing in­fra­struc­ture protects from en­vi­ron­men­tal disasters. Further, it pays to strive for some het­ero­gene­ity or diversity of critical system com­po­nents. Diversity reduces the prob­a­bil­i­ty of redundant instances failing.

Let’s illustrate the advantage of diversity using the example of cybersecurity. Imagine a data center with redundant load balancers of the exact same design. A security vulnerability in one of the load balancers is also present in the redundant instances. In the worst case, an attack will paralyze all instances. By using different models, the overall system stands a better chance of continuing to operate at reduced performance.

Other strate­gies to minimize SPOF

Re­dun­dan­cy isn’t always suf­fi­cient to prevent SPOFs. And some com­po­nents cannot be designed re­dun­dant­ly. When creating re­dun­dan­cy isn’t an option, other strate­gies come into play.

The “defense in depth” approach is well-known from cyber security. This involves drawing multiple, in­de­pen­dent rings of pro­tec­tion around a system. These must be breached one after another to bring about system failure. The like­li­hood of the entire system failing because of a single component is lower.

With respect to digital systems, special pro­gram­ming languages with a built-in fault tolerance exist. The best-known example is the “Erlang” language developed at the end of the 1980s. Together with the as­so­ci­at­ed runtime en­vi­ron­ment, the language is suitable for creating highly available, fault-tolerant systems.

The global triumph of the World Wide Web and the spread of web development presented a new challenge. Programmers were forced to develop websites that would work on a variety of devices. The basic approach used in this process is known as “graceful degradation”. If a browser or device doesn’t support a particular technology on a page, the page is rendered as well as possible. This is a “fail-soft” approach:

System status | Description
go | System operates safely and correctly
fail-operational | System operates fault-tolerantly without performance degradation
fail-soft | System operation ensured, but performance reduced
fail-safe | No operation possible, system security still guaranteed
fail-unsafe | Unpredictable system behavior
fail-badly | Predictably poor to catastrophic system behavior
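The difference between “go” and “fail-soft” can be sketched in code. In this hypothetical example, a shop page asks a recommendation service for personalized items and falls back to a static list when the service is unreachable. All names are made up for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical static fallback shown when personalization is unavailable.
DEFAULT_ITEMS = ["bestseller-1", "bestseller-2"]


def recommendations_fail_soft(
    fetch_personalized: Callable[[str], List[str]],
    user_id: str,
) -> Tuple[str, List[str]]:
    """Return (system status, items) and degrade instead of failing."""
    try:
        return "go", fetch_personalized(user_id)
    except (ConnectionError, TimeoutError):
        # The page still renders, only with less relevant content.
        return "fail-soft", DEFAULT_ITEMS


def broken_service(user_id: str) -> List[str]:
    """Stand-in for a recommendation service that is currently down."""
    raise ConnectionError("recommendation service unreachable")


print(recommendations_fail_soft(broken_service, "user-42"))
```

The page keeps working with reduced quality, which corresponds to the “fail-soft” row. Without the `except` branch, the failure of the recommendation service would take down the whole page.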

How to find a SPOF in your IT?

Don’t wait until the system fails to identify its single points of failure. Proceed proactively as part of a risk management strategy. Methods from reliability engineering, such as fault tree analysis or event tree analysis, are used for this. Failure Mode and Effects Analysis (FMEA) is used to identify the most critical sources of failure.

The general approach to iden­ti­fy­ing single points of failure in en­ter­prise IT is to perform a risk as­sess­ment of the three main di­men­sions:

  • Hardware
  • Software/services/provider
  • Personnel

First, create a SPOF analysis checklist to show the general areas for further analysis. Then perform a detailed SPOF analysis of the individual areas, looking for:

  • Un­mon­i­tored devices in the network
  • Non-redundant software or hardware systems
  • Staff and providers who cannot be replaced in an emergency
  • Any data not included in backups

For each system component, the negative effect of failure is analyzed. Fur­ther­more, the ap­prox­i­mate prob­a­bil­i­ty of a cat­a­stroph­ic failure is estimated. The results are in­cor­po­rat­ed into an over­ar­ch­ing “disaster recovery” plan to ensure data center security.
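Analyzing the effect of each component’s failure can be partly automated if the infrastructure is modeled as a graph. The sketch below is a simplification under stated assumptions: the topology and component names are invented, and a component counts as a SPOF if removing it alone cuts every path between the entry point and the target.

```python
from collections import deque
from typing import Dict, List, Set


def reachable(
    graph: Dict[str, Set[str]], start: str, goal: str, removed: str
) -> bool:
    """Breadth-first search from start to goal that skips the removed node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, set()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def find_spofs(graph: Dict[str, Set[str]], start: str, goal: str) -> List[str]:
    """Return components whose individual failure disconnects start from goal."""
    return sorted(
        node
        for node in graph
        if node not in (start, goal) and not reachable(graph, start, goal, node)
    )


# Hypothetical topology: load balancers and app servers are duplicated,
# but all traffic passes through a single switch.
TOPOLOGY = {
    "internet": {"lb-1", "lb-2"},
    "lb-1": {"internet", "switch"},
    "lb-2": {"internet", "switch"},
    "switch": {"lb-1", "lb-2", "app-1", "app-2"},
    "app-1": {"switch", "db"},
    "app-2": {"switch", "db"},
    "db": {"app-1", "app-2"},
}

print(find_spofs(TOPOLOGY, "internet", "db"))  # prints ['switch']
```

The start and goal nodes are excluded from the result, so a non-redundant target such as the single database in this example still has to be assessed separately.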

As a basic measure to avoid SPOFs, re­dun­dan­cy of data storage and computing power should be ensured at three levels:

  • At the server level through redundant hardware com­po­nents
  • At the system level through clus­ter­ing, i.e. the use of multiple servers
  • At the data center level by using geographically distributed operating sites