When it comes to online crime, businesses think primarily of economic espionage, the theft of sensitive business data, and data protection violations. But increasing digitalization has taken online attacks to a new level. More and more businesses depend on IT systems that link companies to public networks and so provide hackers with opportunities to attack. If a cyber attack causes a system failure, the resulting interruptions can prove costly: it only takes a few minutes of server failure to cause thousands of dollars' worth of damage. Companies whose servers host shopping software or provide a central database stand to experience especially large losses. However, server failures aren't only caused by external sources; internal risks can also threaten operations.

In addition to protecting against external threats and following standard disaster recovery procedures, a solid security concept also includes organizational and staffing measures. Countermeasures are generally based on compensation: technically, this means providing redundant hardware in the context of high availability, or bypassing downtime with standby systems. Data security can be ensured by backup and recovery software as well as by redundant storage architectures. The financial consequences of a server failure can potentially be covered by insurance.

Failure scenarios at a glance

Security experts divide the causes of server failure into internal and external threats. Internal threats include all scenarios in which failures are caused by a company's own IT infrastructure, utilities, or employee error. External threats, on the other hand, are deliberate outside attacks or unpredictable events such as accidents or disasters.

Internal hazards:

  • Fire in the computer center
  • Power outage in the computer center
  • Hardware failure (hard drive crash, overload, overheating)
  • Software error (database failure)
  • Network issues
  • Human error

External hazards:

  • Infiltration (man-in-the-middle attack, phishing, social engineering)
  • Sabotage (attacks on SCADA systems)
  • Viruses, Trojans, and worms
  • Distributed denial of service (DDoS) attacks
  • Hardware theft
  • Natural disasters (earth­quakes, lightning, flooding)
  • Accidents (plane crash)
  • Attacks

As a rule, companies find it easier to prepare for internal security risks than external threats. The reason for this is that hackers continually adapt their attack patterns to current security standards and attack corporate networks with ever-new malicious programs and infiltration strategies. Companies can reduce internal risks, on the other hand, through an uninterruptible power supply, fire protection measures, highly available servers, and comprehensive security training.

Consequences of system failure

The financial cost of a server failure depends on several factors, starting with what kind of server is down: an e-mail server, a web server, or an analytics server? How long the server was down also plays a part. If it was only a few minutes, it might not be worth calculating the loss, but for longer outages it makes sense to work it out. If the server was being used by employees, work out how much they were paid while technically unable to do anything, which obviously depends on their salaries. If the culprit is an e-commerce server, calculate how many orders couldn't be placed while the server was down: look at the time period (e.g. Wednesday, 5-7pm) and compare it with how many orders you normally receive during that window. If an e-mail server was down, the cost depends on how much your company relies on e-mail traffic. Customers who are used to quick answers to their queries may be annoyed not to receive them, and this could be enough for some to stop using your service or buying your products.

Don’t forget the actual cost of fixing the server. It is of course always a good idea to have proper backups at the ready in case the server does go down.

Whether, and to what extent, a server failure interrupts service depends on the respective industry and business model. To waste as little money as possible, you could start on other tasks when a server failure prevents you from doing your regular work: call meetings, make phone calls, or bring customer appointments forward. If your central processes rely entirely on IT, this will prove more difficult. It costs the company an exceptional amount of money if customers aren't able to place orders, or if a failure of the SCADA (supervisory control and data acquisition) system paralyzes the production line.

When calculating the cost of a service interruption, in addition to taking into account employees' hourly salaries and losses due to fewer or no customer orders, you might also face contractual fines due to delayed delivery times. Your reputation could also be at risk, but this factor is almost impossible to quantify.
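The rough estimate described above (idle labor plus lost orders plus contractual fines) can be sketched in a few lines of Python. All figures in the example are illustrative assumptions, not real data:

```python
def downtime_cost(hours_down, employees_idle, hourly_wage,
                  lost_orders, avg_order_value, contractual_fine=0.0):
    """Rough downtime-cost estimate: idle labor + lost revenue + penalties."""
    labor = hours_down * employees_idle * hourly_wage      # wages paid for no output
    revenue = lost_orders * avg_order_value                # orders that couldn't be placed
    return labor + revenue + contractual_fine

# Hypothetical example: e-commerce server down Wednesday 5-7pm (2 hours),
# 15 idle employees at $30/hour, 80 missed orders averaging $45 each.
cost = downtime_cost(hours_down=2, employees_idle=15, hourly_wage=30.0,
                     lost_orders=80, avg_order_value=45.0)
print(cost)  # 900.0 in wages + 3600.0 in lost orders = 4500.0
```

Reputational damage is deliberately left out, since, as noted above, it is almost impossible to quantify.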

Countermeasures

In order to counteract server failures, you need to implement some preventative measures. This usually refers to a series of infrastructural and organizational measures taken when selecting and designing the server room. A lot of helpful information on how to avoid as well as recover from server failures can be found on Oracle's website.

Fire protection and utilities

In order to prevent server failures due to physical influences such as fires, floods, power failures, or hardware sabotage, server rooms and data centers need to be configured accordingly. The first step is to decide where the server should be located. Basements are definitely not recommended, since they are in danger of flooding during storms or natural disasters. In addition, access to the premises should be restricted to specialist personnel and controlled by a security check. Server rooms should not be used as permanent workplaces.

Fire damage can often be prevented by installing fire protection and extinguishing systems. This includes installing fire protection doors, fire detection devices, hand-held fire extinguishers, and automatic extinguishing systems (e.g. gas extinguishing systems). Further preventative measures include fire protection requirements for storing combustible materials correctly, using fire-resistant sealing on cables, and using suitable insulation materials for thermal insulation or soundproofing.

Technical equipment converts electrical energy into heat, and the temperature in the server room can also rise due to sunlight entering it. In order to prevent server failures and data errors due to overheating and high humidity, you should use powerful ventilation and cooling systems. Optimal conditions for long-term storage media are a temperature between 68°F and 72°F and around 40% humidity.

The basic prerequisite for smooth server operation is a constant power supply. Interruptions as short as 10 ms can lead to IT malfunctions. Supply gaps and longer-lasting failures can be bridged using standby generators. These enable self-sufficient operation that is independent of the public electricity network, thereby helping to avoid interruptions to normal operation.

Failure safety

Medium-sized companies in particular underestimate the impact that IT interruptions have on business operations. One reason for this is the high reliability of the standard components now used in corporate IT, whose availability is generally estimated at 99.9%. This number might seem high, but for a system operating 24 hours a day year-round, it still permits a maximum downtime of almost 9 hours per year. If that downtime happens to fall exactly during the peak sales period, even a relatively short server failure can prove costly. Highly available IT systems with an availability of 99.99% are now the standard for supplying business-critical data and applications; here, a maximum downtime of around 52 minutes per year is guaranteed. Some IT experts even consider 99.999% availability possible, which would mean no more than around 5 minutes of downtime annually. The problem with these availability figures is that they refer only to the reliability of the server hardware. According to the IEEE (Institute of Electrical and Electronics Engineers), a system is highly available if it can ensure the availability of its IT resources despite the failure of individual server components: 'High Availability (HA for short) refers to the availability of resources in a computer system, in the wake of component failures in the system.' This is achieved, for example, by servers that are completely redundant: all operating components – especially processors, memory chips, and I/O units – are present twice. This prevents a defective component from paralyzing the server, but high availability doesn't protect against fire in the data center, targeted attacks by malicious software, DDoS attacks, sabotage, or a takeover by hackers.
In real operations, entrepreneurs should therefore expect significantly longer downtimes and take appropriate measures to prevent and limit damage. Other strategies to compensate for the failure of server resources in the data center are based on standby systems and high-availability clusters. Both approaches rely on networks of two or more servers, which together provide more hardware resources than are needed for normal operation. A standby system is a second server that safeguards the primary system by taking over as soon as the primary fails due to a hardware or software error. This service takeover is known as failover and is initiated automatically by cluster manager software, without any administrator intervention. A structure like this, consisting of an active and a passive server node, can be considered an asymmetric high-availability cluster. If all nodes in the cluster provide services in normal operation, the structure is known as symmetric. Since a time delay occurs when a service migrates from one server to another, short-term disruption cannot be completely prevented with either standby systems or high-availability clusters.
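The downtime figures quoted above follow directly from the availability percentages. A short sketch of the arithmetic:

```python
def max_downtime_minutes_per_year(availability_percent):
    """Convert an availability figure (e.g. 99.9) into the maximum
    yearly downtime it permits, in minutes (365-day year assumed)."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)

for availability in (99.9, 99.99, 99.999):
    minutes = max_downtime_minutes_per_year(availability)
    print(f"{availability}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Running this confirms the numbers in the text: 99.9% allows roughly 525.6 minutes (almost 9 hours), 99.99% roughly 52.6 minutes, and 99.999% roughly 5.3 minutes of downtime per year.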

Defense systems

To counter hackers' damaging influence, administrators use various software and hardware solutions designed to detect, log, and fend off attacks. In order to protect servers against unauthorized access, critical systems are closed off from public networks with firewalls and demilitarized zones (DMZs).

Intrusion detection systems (IDS) enable automated monitoring of servers and networks, alerting users as soon as manual break-in attempts or automated attacks by malicious software are detected: a process based on pattern recognition and statistical analysis. If intrusion prevention systems (IPS) are used, automated countermeasures follow the alert. An IPS is commonly coupled with a firewall so that data packets can be discarded or suspicious connections interrupted.
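The pattern-recognition part of an IDS can be illustrated with a minimal, signature-based sketch. The signatures and log lines below are hypothetical toy examples; a real IDS such as Snort or Suricata uses a far richer rule language plus statistical anomaly detection:

```python
import re

# Toy attack signatures (illustrative only).
SIGNATURES = {
    "sql_injection": re.compile(r"('|%27)\s*(or|OR)\s+1=1"),
    "path_traversal": re.compile(r"\.\./\.\./"),
}

def scan(log_lines):
    """Return (line_number, signature_name) for every matching log line."""
    alerts = []
    for i, line in enumerate(log_lines, start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                alerts.append((i, name))
    return alerts

logs = [
    "GET /index.html HTTP/1.1",
    "GET /login?user=' OR 1=1-- HTTP/1.1",
    "GET /../../etc/passwd HTTP/1.1",
]
print(scan(logs))  # [(2, 'sql_injection'), (3, 'path_traversal')]
```

An IPS extends this loop by acting on each alert, for example by telling the firewall to drop the offending connection.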

In order to keep hackers away from business-critical IT systems, administrators also use honeypots. These appear to hackers as attractive targets, but run in isolation from the productive system and therefore have no influence on its functionality. Honeypots are constantly monitored, enabling admins to respond quickly to break-in attempts and to analyze the attack patterns and strategies used.

Data backup and recovery

In order to be able to quickly restore business-critical data in the event of a server failure, it's advisable to develop a data backup concept in accordance with international industry standards such as ISO 27001. This regulates who is responsible for the data backup and names the decision makers who can authorize data recovery. Furthermore, the data backup concept determines when backups have to be created, how many generations should be saved, which storage medium should be used, and whether specific transport modalities such as encryption are required. In addition, the type of data backup is defined:

  • Full data backup: if all the data that you wish to back up is copied to an additional storage system at a certain time, this is referred to as a full data backup. Whether the data has changed since the last backup is not taken into account. A full backup therefore takes a long time and has high storage requirements, which matters particularly when several generations are kept in parallel. This type of backup offers simple and fast data recovery, since only the last saved state has to be reconstructed. However, companies lose this advantage when backups aren't carried out regularly enough; in that case, a lot of effort is required to bring subsequently modified files up to the current state.
  • Incremental data backup: if companies opt for an incremental backup, the backup includes only those files that have changed since the last backup. This reduces the time needed to perform a backup, and the storage requirements for different generations are also significantly lower than with full backups. An incremental backup requires at least one preceding full backup, so in practice the two storage strategies are often combined: several incremental backups are generated between two full backups. For data recovery, the last full backup is used as the basis and supplemented by the data from the incremental storage cycles; as a rule, several backups must be applied one after the other.
  • Differential data backup: a differential backup is also based on a preceding full backup. All data that has changed since the last full backup is backed up. In contrast to incremental backups, differential backups aren't linked to one another. For data recovery, it's enough to combine the last full backup with the most recent differential backup.
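The difference between the three backup types above comes down to which change set each one copies. A small sketch, with purely illustrative file names:

```python
def files_to_back_up(strategy, all_files, changed_since_full, changed_since_last):
    """Which files each backup type copies.
    changed_since_full: files modified since the last full backup
    changed_since_last: files modified since the most recent backup of any kind
    """
    if strategy == "full":
        return set(all_files)               # everything, every time
    if strategy == "differential":
        return set(changed_since_full)      # everything since the last full backup
    if strategy == "incremental":
        return set(changed_since_last)      # only changes since the last backup
    raise ValueError(f"unknown strategy: {strategy}")

all_files = {"a.db", "b.doc", "c.cfg"}
# b.doc changed after the full backup; c.cfg changed after the last incremental.
print(files_to_back_up("full", all_files, {"b.doc", "c.cfg"}, {"c.cfg"}))
print(files_to_back_up("differential", all_files, {"b.doc", "c.cfg"}, {"c.cfg"}))
print(files_to_back_up("incremental", all_files, {"b.doc", "c.cfg"}, {"c.cfg"}))
```

This also makes the recovery trade-off visible: restoring from differentials needs only two backup sets, while restoring from incrementals means replaying every backup since the last full one.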

The storage strategy a company uses depends on the availability required as well as on economic considerations. Central influencing factors are the tolerable recovery times, the frequency and timing of the backups, and the relationship between the change volume and the total data volume. If the change volume is small compared to the total data volume, incremental or differential methods can save considerable storage space.

Training

Information security measures can only be established throughout the company if all employees recognize and accept that they are partly responsible for the company's economic success. Security awareness can be raised and maintained if the company provides regular training courses aimed at familiarizing employees with internal and external risks and possible failure scenarios.

The basis of systematic training courses are the rules and regulations for handling security-relevant devices, as well as a disaster recovery plan that provides employees with instructions on which steps to take to restore normal operation as quickly as possible. A structured approach to creating appropriate concepts is provided by business continuity management.

Business continuity management (BCM)

In order to minimize damage caused by server failures, companies increasingly invest in preventative measures, with the focus on business continuity management (BCM). In the IT sector, BCM strategies aim to counteract server failures in critical business areas and to ensure immediate recovery if an interruption occurs. A business impact analysis (BIA) is the prerequisite for appropriate emergency management: it helps companies identify critical business processes. A process is classed as critical when a server failure would have a significant impact on operations. The BIA concentrates on the consequences of concrete damage scenarios; the causes of server failures, the probability that possible threats will occur, and the corresponding countermeasures are recorded in the risk analysis. How the BIA and risk analysis can be implemented methodically within the framework of BCM is the subject of various standards and frameworks; the BSI Standard 100-4 is recommended as a detailed guide.

Business Impact Analysis (BIA)

The first step towards comprehensive business continuity management is the business impact analysis. Key questions for this analysis are: which systems are the most important for maintaining the core business, and what does it mean for operations if these systems fail? It is advisable to identify the company's most important products and services as well as the underlying IT infrastructure. If a company primarily relies on internet sales, the servers that supply the online store and its associated databases definitely need to be protected; a call center, on the other hand, would classify its telephone system as business-critical. The BIA includes a prioritization of the systems to be protected, a way of calculating losses, and a statement of which resources are required for system recovery.

Risk analysis

A risk analysis within the scope of emergency management enables you to identify internal and external risks that could cause a server to fail and consequently interrupt operations. The aim is to make security risks and their causes known and to develop appropriate countermeasures in order to reduce the potential danger. Risks can be assessed on the basis of the expected damage and the likelihood that it will occur; an example of such a risk classification can be found in the BSI Standard 100-4.
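The assessment described above (expected damage combined with likelihood) is often expressed as a simple likelihood-by-impact matrix. The category names and thresholds below are illustrative assumptions, not the exact classification used in BSI Standard 100-4:

```python
# Illustrative three-level scale; real standards may use more levels.
LEVELS = {"low": 1, "medium": 2, "high": 3}

def classify(likelihood, impact):
    """Combine likelihood and expected damage into a risk class."""
    score = LEVELS[likelihood] * LEVELS[impact]
    if score >= 6:
        return "high risk"
    if score >= 3:
        return "medium risk"
    return "low risk"

# A likely, damaging scenario (e.g. hardware failure) ranks highest;
# an unlikely scenario with moderate damage ranks lowest.
print(classify("high", "high"))      # high risk
print(classify("low", "medium"))     # low risk
print(classify("medium", "medium"))  # medium risk
```

The value of such a matrix is prioritization: countermeasure budgets go first to the scenarios in the high-risk cells.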

Recording the current state

Once the risks and damage potential of a server failure scenario have been determined within the framework of the BIA and risk analysis, the third step on the continuation strategy journey is to record the current state. Emergency measures that have already been established, as well as current recovery times, are important for this step. Recording the current state enables companies to estimate the need for action in the case of serious security risks, along with the associated investment costs.

Selecting the continuation strategy

As a rule, there are various strategies for dealing with the different internal and external hazards, which allow the operation to continue running despite disruptions, or at least promise a speedy recovery. In business continuity management, it is therefore the decision makers' responsibility to choose the continuation strategy to be used in an emergency. The decision is based on a cost-benefit analysis that includes key factors such as the financial resources required, how reliable the solution is, and the estimated recovery time.

There are several options for a continuation strategy to deal with a data center fire: minimal solutions include insurance that compensates for operational failures and a replacement data center with a hosting service provider. It would be more costly to convert the existing server room so that it complies with modern fire protection standards. If larger investments are available, consequential damage can be reduced by building a second, redundant server room.

Continuation strategies that have already been established are defined in the emergency security concept, which contains specific instructions for all relevant emergency scenarios.
