Diagnose and Replace a Defective Hard Drive (Linux Dedicated Server with Hardware RAID)

In this article, you will learn how to identify a defective hard drive and prepare your server to replace the defective drive.

This article assumes that you have basic knowledge of Linux server administration. If you have any questions or need help with the replacement of the defective hard drive, please contact 1&1 IONOS Customer Support.

To ensure the greatest possible reliability of your drives, you need to monitor the hardware RAID of your dedicated server. If you discover that a hard drive is defective, or you receive a notification email about a defective hard drive, you must contact customer service to arrange for the hard drive replacement. Before doing so, identify the defective hard drive and prepare the server for the drive exchange.

RAID systems enable greater reliability and/or higher speeds. However, they are not a substitute for regular backups. To avoid data loss, we recommend that you back up your data regularly. Also, make sure that you back up your data before performing the following steps to ensure the security of your data.

For more information on creating backups, see the following articles:

Backup Solutions

Backing Up Server Data (Linux)

Hardware RAID Controllers: General Information

A hardware RAID controller is a physical controller that is built into the server as a hardware component. This controller has its own processor for the calculation of RAID operations. This processor organizes and manages the storage space, so the CPU of the server is not burdened by RAID calculations. With hardware RAID controllers, the RAID functionality is independent of the operating system. They are managed by special Command Line Interface (CLI) programs, which vary depending on the manufacturer and model.

Diagnosing Hard Drive Errors

In order to detect hard drive errors, we recommend that you use the smartctl program.

Smartctl is a command line program for monitoring drives using SMART (Self-Monitoring, Analysis and Reporting Technology). With this program you can check whether a hard drive is defective. It is a component of smartmontools, which is available as a package for many Linux distributions.

In some cases, a hard drive defect cannot be detected from the SMART values. We therefore recommend that you also analyze the system log file /var/log/messages (on Ubuntu, the system log is typically /var/log/syslog).
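
As a rough illustration of such a log analysis, the following sketch counts log lines that hint at drive problems. The function name count_disk_errors and the search patterns are assumptions chosen for this example; they are not a complete list of the messages a failing drive can produce.

```shell
# Count log lines that hint at disk trouble.
# The patterns below are illustrative assumptions, not a complete list.
# $1: path to a log file such as /var/log/messages
count_disk_errors() {
    grep -c -i -E 'I/O error|medium error|unrecovered read error' "$1" || true
}

# Example call (requires read access to the log file):
# count_disk_errors /var/log/messages
```

A result greater than zero is only a hint. Review the matching lines themselves before drawing conclusions.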

Installing Smartctl

To install Smartctl, type the following command:

CentOS:

yum install smartmontools

Ubuntu:

sudo apt-get install smartmontools
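
To confirm that the installation worked, you can check whether the smartctl binary is found in your PATH. The following is a minimal sketch; the helper name have_cmd is chosen for illustration only.

```shell
# Report whether a command is available in the current PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd smartctl; then
    echo "smartctl is installed"
else
    echo "smartctl not found; install the smartmontools package"
fi
```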

Determining the Hardware Controller Type

To check which hardware controller is installed in your server, you can use the lshw program. This program displays detailed information about hardware components.

To install the program, enter the following command:

CentOS:

yum install lshw

Ubuntu:

sudo apt-get install lshw

Displaying the Hardware Information

To display a summary of the hardware information, type the following command:

lshw -short

To output the hardware information as a text file, type the following command:

lshw > lshw_edition.txt
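
Because the full listing is long, it can help to reduce it to the storage-related rows. The sketch below filters saved or piped "lshw -short" output; the function name storage_rows is an assumption for this example, and the filter simply looks for the words storage or disk as a separate column, so an unusual description text could also match. lshw itself also accepts a -class option (for example lshw -short -class storage -class disk) that should give a similar result.

```shell
# Keep the two header lines and the storage/disk rows of "lshw -short" output.
# Reads from standard input.
storage_rows() {
    awk 'NR <= 2 || $0 ~ /[[:space:]](storage|disk)[[:space:]]/'
}

# Example call:
# lshw -short | storage_rows
```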

In the following example, a PERC H330 hardware controller is installed in the server:

root@829F6DF:~# lshw -short
H/W path             Device     Class          Description
==========================================================
                                system         PowerEdge R230 (SKU=NotProvided;ModelName=PowerEdge R230)
/0                              bus            0DWX9P
/0/0                            memory         64KiB BIOS
/0/400                          processor      Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz
/0/400/700                      memory         256KiB L1 cache
/0/400/701                      memory         1MiB L2 cache
/0/400/702                      memory         8MiB L3 cache
/0/1000                         memory         32GiB System Memory
/0/1000/0                       memory         16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
/0/1000/1                       memory         16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
/0/1000/2                       memory         [empty]
/0/1000/3                       memory         [empty]
/0/100                          bridge         Intel Corporation
/0/100/1                        bridge         Skylake PCIe Controller (x16)
/0/100/1/0           scsi0      storage        MegaRAID SAS-3 3008 [Fury]
/0/100/1/0/2.0.0     /dev/sda   disk           799GB PERC H330 Adp
/0/100/1/0/2.0.0/1   /dev/sda1  volume         2047KiB BIOS Boot partition
/0/100/1/0/2.0.0/2   /dev/sda2  volume         27GiB EXT3 volume
/0/100/1/0/2.0.0/3   /dev/sda3  volume         9536MiB Linux swap volume
/0/100/1/0/2.0.0/4   /dev/sda4  volume         707GiB LVM Physical Volume
/0/100/1.1                      bridge         Skylake PCIe Controller (x8)
/0/100/14                       bus            Sunrise Point-H USB 3.0 xHCI Controller
/0/100/14/0          usb1       bus            xHCI Host Controller
/0/100/14/0/3                   bus            Gadget USB HUB
/0/100/14/1          usb2       bus            xHCI Host Controller
/0/100/14.2                     generic        Sunrise Point-H Thermal subsystem
/0/100/16                       communication  Sunrise Point-H CSME HECI #1
/0/100/16.1                     communication  Sunrise Point-H CSME HECI #2
/0/100/17                       storage        Sunrise Point-H SATA controller [AHCI mode]
/0/100/1d                       bridge         Sunrise Point-H PCI Express Root Port #9
/0/100/1d/0          eth0       network        NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/100/1d/0.1        eth1       network        NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/100/1d.2                     bridge         Sunrise Point-H PCI Express Root Port #11
/0/100/1d.2/0                   bridge         SH7758 PCIe Switch [PS]
/0/100/1d.2/0/0                 bridge         SH7758 PCIe Switch [PS]
/0/100/1d.2/0/0/0               bridge         SH7758 PCIe-PCI Bridge [PPB]
/0/100/1d.2/0/0/0/0             display        G200eR2
/0/100/1f                       bridge         Sunrise Point-H LPC Controller
/0/100/1f.2                     memory         Memory controller
/0/100/1f.4                     bus            Sunrise Point-H SMBus

Viewing Hard Drive Information

To use Smartctl to access hard drive information, you must always specify the appropriate command in combination with an option and a target device. The target device depends on the controller manufacturer.

Use the commands listed below to display the information required for diagnosing the hard drive:

Manufacturer   Hard disk   Command
ARECA          1           smartctl -iHAl error /dev/sg1 -d areca,1
ARECA          2           smartctl -iHAl error /dev/sg1 -d areca,2
LSI / 3Ware    1           smartctl -iHAl error /dev/twe0 -d 3ware,0
LSI / 3Ware    2           smartctl -iHAl error /dev/twe0 -d 3ware,1
Adaptec        1           smartctl -iHAl error /dev/sg2 -d sat
Adaptec        2           smartctl -iHAl error /dev/sg3 -d sat
Adaptec        (3)         smartctl -iHAl error /dev/sg4 -d sat
Adaptec        (4)         smartctl -iHAl error /dev/sg5 -d sat
Dell           1           smartctl -iHAl error -d sat+megaraid,0 /dev/sda
Dell           2           smartctl -iHAl error -d sat+megaraid,1 /dev/sda
Broadcom       1           smartctl -iHAl error -d sat+megaraid,0 /dev/sda
Broadcom       2           smartctl -iHAl error -d sat+megaraid,1 /dev/sda

Additional commands for supported hardware controllers can be found on the following page:

https://www.smartmontools.org/wiki/Supported_RAID-Controllers
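
For Dell and Broadcom controllers, the commands differ only in the drive ID after megaraid. The following sketch prints (but does not run) one command per drive so that you can review them first. The function name megaraid_cmds is an assumption for this example, and it assumes that the drive IDs start at 0 and are consecutive, which is not guaranteed on every controller.

```shell
# Print (do not run) one smartctl call per drive ID behind a
# MegaRAID-based controller.
# Assumption: drive IDs start at 0 and are consecutive.
# $1: number of drives, $2: block device of the RAID volume (default /dev/sda)
megaraid_cmds() {
    count=$1
    dev=${2:-/dev/sda}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "smartctl -iHAl error -d sat+megaraid,$i $dev"
        i=$((i + 1))
    done
}

megaraid_cmds 2
```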

Example:

[root@localhost ~]# smartctl -iHAl error /dev/sg1 -d areca,1

smartctl 7.0 2018-12-30 r4883 [x86_64-w64-mingw32-2016] (sf-7.0-1)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K2
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6M0JAUEV8
LU WWN Device Id: 5 0014ee 00482c2ec
Firmware Version: RAGNWA07
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jan 17 06:17:05 2019 CAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   141   140   021    Pre-fail  Always       -       3933
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       34
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
 16 Gas_Gauge               0x0022   000   200   000    Old_age   Always       -       1822115874
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   113   109   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

Interpreting the Data

Look through the detailed information you pulled up. The first section lists information that you can use to identify the hard drive. For example, this section displays the device model, serial number, and size of the hard drive under test.

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K2
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6M0JAUEV8
LU WWN Device Id: 5 0014ee 00482c2ec
Firmware Version: RAGNWA07
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jan 17 06:17:05 2019 CAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

In the second section, Smartctl evaluates the current state of the hard drive. If, for example, the value FAILED or UNKNOWN is displayed instead of the value PASSED, you should replace the hard drive as soon as possible.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
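
If you want to check this result in a script, the verdict can be read from the output. The following is a minimal sketch; the function name health_verdict is an assumption, and it only handles the ATA-style result line shown above.

```shell
# Read smartctl output on standard input and print the overall health verdict.
# Prints UNKNOWN if no result line is found.
health_verdict() {
    awk -F': ' '/overall-health self-assessment test result/ { print $2; found = 1 }
                END { if (!found) print "UNKNOWN" }'
}

# Example call:
# smartctl -H -d sat+megaraid,0 /dev/sda | health_verdict
```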

In the third section, the SMART values determined are listed in detail. Next to each current normalized value (VALUE), the worst value ever measured (WORST) and the respective limit value (THRESH) are listed. Higher normalized values are better: if the current value (VALUE) or the worst value ever measured (WORST) falls to or below the limit value (THRESH), a SMART warning is displayed in the WHEN_FAILED column (e.g. FAILING_NOW).

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   141   140   021    Pre-fail  Always       -       3933
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       34
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
 16 Gas_Gauge               0x0022   000   200   000    Old_age   Always       -       1822115874
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   113   109   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
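
The comparison between VALUE, WORST and THRESH can also be made by a small filter. The sketch below reports attributes whose normalized value has dropped to or below the threshold. The function name check_thresholds is an assumption, and rows with a threshold of 000 are skipped on the assumption that such a threshold never triggers a warning (as with Gas_Gauge above).

```shell
# Read a SMART attribute table on standard input and list attributes whose
# normalized VALUE or WORST has dropped to or below THRESH.
# Assumption: a THRESH of 000 never triggers, so those rows are skipped.
check_thresholds() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
        if (thresh == 0) next
        if (value <= thresh)
            print $2 ": FAILING_NOW (VALUE " $4 " <= THRESH " $6 ")"
        else if (worst <= thresh)
            print $2 ": failed in the past (WORST " $5 " <= THRESH " $6 ")"
    }'
}

# Example call:
# smartctl -A -d sat+megaraid,0 /dev/sda | check_thresholds
```

For the healthy drive shown above, this filter prints nothing.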

The following parameters can indicate an imminent hard drive failure before a SMART warning is displayed:

Reallocated_Sector_Ct: Specifies the number of sectors reassigned due to read errors. If a sector can no longer be read, written to or checked correctly, a replacement sector is automatically assigned to it. The faulty sector is permanently marked as unreadable. This is a clear warning sign for incipient surface problems. If this value is not equal to zero, a hard drive failure is often imminent. This value is the most important indicator for a hard drive replacement.

Current_Pending_Sector: Specifies the number of unstable sectors waiting for remapping. If a sector cannot be read or written correctly, it first receives the status Current Pending Sector. The sector is not reassigned in this state, since the data in the sector are unknown. Only after several unsuccessful read or write attempts is a replacement sector assigned and the faulty sector permanently marked as unreadable. The value Current_Pending_Sector is an important indicator for a hard drive replacement. If this value is not equal to zero, a hard drive failure is often imminent.

Offline_Uncorrectable: Specifies the number of uncorrectable write and read sector errors.
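
To spot these three parameters quickly, you can filter the attribute table for rows whose raw value is not zero. The following is a sketch only; the function name early_warnings is an assumption for this example.

```shell
# Read a SMART attribute table on standard input and print the three
# early-warning attributes whose RAW_VALUE (last column) is not zero.
early_warnings() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" {
        if ($10 + 0 > 0) print $2 " = " $10
    }'
}

# Example call:
# smartctl -A -d sat+megaraid,0 /dev/sda | early_warnings
```

No output means that all three raw values are zero.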

The last section shows the internal error log of the hard drive. Errors are recorded here if the hard drive could not correctly process commands sent by the server. If this section lists ten or more errors, you should replace the hard drive as soon as possible.

SMART Error Log Version: 1
No Errors Logged
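
The number of logged errors can be read from this section as well. The sketch below assumes that smartctl reports logged errors on a line beginning with "ATA Error Count:", which is not shown in the example above; the function name error_log_count is also an assumption.

```shell
# Read smartctl output on standard input and print the number of logged errors.
# Assumption: logged errors appear on a line starting with "ATA Error Count:".
error_log_count() {
    awk '/^No Errors Logged/ { print 0; found = 1; exit }
         /^ATA Error Count:/ { print $4; found = 1; exit }
         END { if (!found) print "unknown" }'
}

# Example call:
# smartctl -l error -d sat+megaraid,0 /dev/sda | error_log_count
```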

Preparing Hard Drive Replacement

Viewing Detailed Information for Drive Replacement

The following information is required in order to replace the defective hard drive:

  • Name of the hard drive in the RAID

  • Serial number

  • Model

  • Log file (optional)
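
The model and serial number can be taken from the information section of the smartctl output. The following sketch extracts both fields; the function name drive_identity is an assumption for this example.

```shell
# Read smartctl output on standard input and print the device model and
# serial number.
drive_identity() {
    awk -F': *' '/^Device Model:/  { model = $2 }
                 /^Serial Number:/ { serial = $2 }
                 END { print "Model: " model; print "Serial number: " serial }'
}

# Example call:
# smartctl -i -d sat+megaraid,0 /dev/sda | drive_identity
```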

Creating a SMART Log

Use the commands listed below to generate a complete SMART log:

Manufacturer   Hard disk   Command
ARECA          1           smartctl -x /dev/sg1 -d areca,1
ARECA          2           smartctl -x /dev/sg1 -d areca,2
LSI / 3Ware    1           smartctl -x /dev/twe0 -d 3ware,0
LSI / 3Ware    2           smartctl -x /dev/twe0 -d 3ware,1
Adaptec        1           smartctl -x /dev/sg2 -d sat
Adaptec        2           smartctl -x /dev/sg3 -d sat
Adaptec        (3)         smartctl -x /dev/sg4 -d sat
Adaptec        (4)         smartctl -x /dev/sg5 -d sat
Dell           1           smartctl -x -d sat+megaraid,0 /dev/sda
Dell           2           smartctl -x -d sat+megaraid,1 /dev/sda
Broadcom       1           smartctl -x -d sat+megaraid,0 /dev/sda
Broadcom       2           smartctl -x -d sat+megaraid,1 /dev/sda

  • If the SMART log was created as described above, it contains sufficient information for the replacement. You can then have the defective hard drive replaced by 1&1 IONOS Customer Support.

  • If you cannot find the serial number of the defective hard drive using smartctl, you can alternatively provide Customer Service with the serial number of the functioning hard drive(s).

  • If you are unable to determine the information required for the replacement and wish to replace the hard drive, the hardware must be checked before replacing it. During this check, the server is usually temporarily unavailable. If a defect in the hard drive is detected during this test, it will be replaced.
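
To pass the SMART log on, it is convenient to write it to a file. The following sketch stores the complete output of a command in a file; the function name save_log and the file name smart_disk1.log are assumptions for this example.

```shell
# Run a command and store its complete output (stdout and stderr) in a file.
# $1: output file; remaining arguments: the command to run
save_log() {
    out=$1
    shift
    "$@" > "$out" 2>&1
}

# Example call:
# save_log smart_disk1.log smartctl -x -d sat+megaraid,0 /dev/sda
```

Note that smartctl can return a non-zero exit status when it finds problems on a drive, so a non-zero status does not necessarily mean that the log was not written.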

Arranging Hard Drive Replacement

Once you have gathered this information, please contact 1&1 IONOS Customer Support to have the defective hard drive replaced.

Steps to Take After Replacing the Hard Drive

After the defective hard drive has been replaced, the RAID system usually starts rebuilding automatically. Please check that the rebuild has started and that it completes successfully.