Diagnose and Replace a Defective Hard Drive (Linux Dedicated Server with Software RAID)

In this article, we'll show you how to identify a defective hard disk on a Linux Dedicated Server with software RAID and prepare the server for the replacement of the defective disk.

Please Note

This article assumes you have basic knowledge of server administration with Linux. If you have any questions regarding the replacement of a defective hard disk or need assistance, please contact IONOS Customer Service.

In order to ensure the highest possible reliability, it is necessary that you monitor the software RAID of your Dedicated Server. If you discover that a hard disk is defective or you receive a notification email about a defective hard disk, you must contact IONOS Customer Service to arrange for the hard disk to be replaced. This requires that you identify the defective hard disk and prepare the server to replace the defective disk.

Attention

RAID systems allow for greater fail-safety and/or speed. However, they are not a substitute for regular backups. To avoid data loss, we recommend that you back up regularly. Also, be sure to back up before performing the steps below to ensure the safety of your data.

Checking the Status of the Software RAID

To check the status of the software RAID, enter the following command in the shell:

[root@host ~]# cat /proc/mdstat

If both disks are present and mounted correctly, the following message is displayed:

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 sda3[1] sdb3[0]
262016 blocks [2/2] [UU]

md1 : active raid1 sda2[1] sdb2[0]
119684160 blocks [2/2] [UU]

md0 : active raid1 sda1[1] sdb1[0]
102208 blocks [2/2] [UU]

unused devices: <none>

The above example shows three multiple devices or logical drives (md0, md1, md2). For each of these logical drives, it is indicated which partitions they are composed of and on which drives these partitions are located.

Example: The logical drive md0 is composed of the partitions sda1 and sdb1.

In the line listed below the respective logical drive, the state of the individual partitions is shown at the end of the line in the square brackets. A U means that the respective partition is active (up) in the RAID. An underscore (_) means that the partition is missing or has failed.

In the following example, each logical drive contains only one active partition, which is located on the sda hard disk. The corresponding partition on the second hard disk, sdb, is missing from the RAID. This is also indicated by the entry [U_]. The missing partitions of the hard disk sdb indicate that there is an error or a defect with this hard disk.

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 sda1[1]
102208 blocks [2/1] [U_]

md1 : active raid1 sda2[1]
119684160 blocks [2/1] [U_]

md2 : active raid1 sda3[1]
262016 blocks [2/1] [U_]

unused devices: <none>

In the following example, a defective disk is still mounted in the RAID:

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2](F)
439553856 blocks super 1.0 [2/1] [U_]
bitmap: 1/4 pages [4KB], 65536KB chunk

md1 : active raid1 sdb1[2](F) sda1[0]
19529600 blocks super 1.0 [2/1] [U_]

unused devices: <none>

The entry (F) in this example shows that the partition is marked as faulty.
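
The checks described above can also be scripted. The following is a minimal sketch, not an official tool: the function name degraded_arrays is hypothetical, and it assumes the /proc/mdstat layout shown in the examples above (an "mdN :" line followed by a status line ending in [UU], [U_], and so on).

```shell
# List md arrays that look degraded in mdstat-style output.
# An array is reported if its status field (e.g. [U_]) contains "_"
# or if one of its members carries the (F) flag.
# Reads /proc/mdstat unless another file is given as the first argument.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ {
            name = $1
            order[++n] = name
            if ($0 ~ /\(F\)/) bad[name] = 1
            next
        }
        # Status line, e.g. "102208 blocks [2/1] [U_]"
        name != "" && $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { bad[name] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in bad) print order[i]
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays without arguments checks the running system. If nothing is printed, no array matched either condition.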

Error Diagnosis and Finding the Necessary Data for Hard Disk Replacement

To detect hard disk errors, we recommend that you do the following:

Install the Smartctl program, which is a command-line program to monitor disks using SMART (Self-Monitoring, Analysis and Reporting Technology). With this program you can check if a disk is defective. It is a part of Smartmontools. The Smartmontools are available as packages for many Linux distributions.

Please Note

In some cases, a hard disk defect may not be detected by means of the smart values. Therefore, we recommend that you also analyze the /var/log/messages log file.
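
As a rough illustration of such a log analysis, the sketch below filters a log file for lines that mention a given disk together with a typical error phrase. The function name disk_log_errors is hypothetical, and the phrases in the pattern are examples of common kernel messages, not a complete list. On some distributions the log may be stored elsewhere (for example /var/log/syslog), so the file can be passed as a second argument.

```shell
# Print log lines that mention the given disk (or one of its partitions)
# together with a typical error phrase.
# Usage: disk_log_errors DISK [LOGFILE]   (LOGFILE defaults to /var/log/messages)
disk_log_errors() {
    disk=$1
    logfile=${2:-/var/log/messages}
    # First grep: the disk name as a whole word, optionally followed by a
    # partition number. Second grep: example error phrases (case-insensitive).
    grep -E "(^|[^[:alnum:]])${disk}[0-9]*([^[:alnum:]]|\$)" "$logfile" |
        grep -E -i 'I/O error|medium error|unrecovered read error'
}
```

For example, disk_log_errors sdb lists matching lines for the second hard disk. An empty result does not prove that the disk is healthy.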

Install Smartctl

To install Smartctl, enter the following command:

CentOS

yum install smartmontools

Ubuntu

sudo apt-get install smartmontools

Get information about the hard disk

To access a list of disks, enter the following command:

smartctl --scan

Example:

[root@8E8885C ~]# smartctl --scan

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device

To access detailed information for error diagnostics, enter the following command:

smartctl -iHAl error [DEVICE_NAME]

Please Note

Device interfaces must be specified in the following format:

SCSI / SATA devices:

smartctl -iHAl error /dev/sd[a-z]

Example:

[root@localhost ~]# smartctl -iHAl error /dev/sda

After entering the command, the following information is displayed, for example:

[root@8E8885C ~]# smartctl -iHAl error /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.14.4.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6N0K2RW66
LU WWN Device Id: 5 0014ee 004722db0
Firmware Version: RAGNWA07
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri May  3 07:45:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED     WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always      -           0
  3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always      -           3833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always      -           9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always      -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always      -           0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always      -           2560
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always      -           0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always      -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always      -           9
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always      -           26802171994
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always      -           0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always      -           4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always      -           67
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always      -           31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always      -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always      -           0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline     -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always      -           0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline     -           0

SMART Error Log Version: 1
No Errors Logged
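
If the server has several disks, the scan and the diagnostic command can be combined in a loop. The following is a minimal sketch under the assumption that every line of the smartctl --scan output starts with the device path, as in the example above. The function names scanned_devices and check_all_disks are hypothetical.

```shell
# Extract device paths from `smartctl --scan` output read on standard input.
# Each scan line starts with the device path, e.g. "/dev/sda -d scsi # ...".
scanned_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# Run the diagnostic command from this article on every scanned device.
check_all_disks() {
    smartctl --scan | scanned_devices | while read -r dev; do
        echo "=== $dev ==="
        smartctl -iHAl error "$dev"
    done
}
```

Running check_all_disks as root prints the same report as shown above, once per disk.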

Interpretation of Parameters and Fault Diagnosis

Analyze the detailed information that you retrieved with the command smartctl -iHAl error [DEVICE_NAME]. The first section lists information that you can use to identify the hard disk:

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6N0K2RW66
LU WWN Device Id: 5 0014ee 004722db0
Firmware Version: RAGNWA07
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri May  3 07:45:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

This section displays, among other things, the device model and serial number of the checked hard disk.

In the second section, the current state of the hard disk is assessed by Smartctl. If the value "PASSED" is not displayed but, for example, the value "FAILED" or "UNKNOWN", you should arrange for the hard disk in question to be replaced as soon as possible.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
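
For automated monitoring, the result line can be extracted from the output. This is a minimal sketch that assumes the exact wording of the line shown above. Other device types may report their health in a different format, which this sketch does not cover. The function names smart_health and smart_health_ok are hypothetical.

```shell
# Read smartctl output on standard input and print the overall health
# result (for example PASSED). Prints nothing if the line is absent.
smart_health() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Return success only if the result is exactly PASSED.
smart_health_ok() {
    [ "$(smart_health)" = "PASSED" ]
}
```

A possible use is: smartctl -H /dev/sda | smart_health_ok || echo "check /dev/sda". Any result other than PASSED, including a missing result line, is treated as a reason to look closer.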

In the third section, the determined SMART values are listed in detail. Next to each current normalized value (VALUE), the worst value ever measured (WORST) and the respective limit value (THRESH) are listed. If the current value (VALUE) or the worst value ever measured (WORST) falls to or below the limit value (THRESH), a SMART warning is displayed in the WHEN_FAILED column (e.g. FAILING_NOW).

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED     WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always      -           0
  3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always      -           3833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always      -           9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always      -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always      -           0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always      -           2560
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always      -           0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always      -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always      -           9
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always      -           26802171994
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always      -           0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always      -           4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always      -           67
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always      -           31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always      -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always      -           0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline     -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always      -           0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline     -           0

The following parameters can indicate an impending hard disk failure before a SMART warning is displayed:

Reallocated_Sector_Ct: Indicates the number of sectors that have been reallocated due to read errors. If a sector can no longer be read, written to or checked correctly, a replacement sector is automatically allocated to it. The faulty sector is permanently marked as unreadable. This is a clear warning sign of incipient surface problems. If this value is not zero, a hard disk failure is often imminent. This value is the most important indicator for a hard disk replacement.

Current_Pending_Sector: Indicates the number of unstable sectors waiting to be remapped. If a sector cannot be read or written to correctly, it initially receives the status Current Pending Sector. The sector is not reallocated in this state because the data located on the sector is unknown. Only after several unsuccessful read or write attempts is a replacement sector allocated and the faulty sector permanently marked as unreadable. The Current_Pending_Sector value is an important indicator for a hard disk replacement. If this value is not zero, a hard disk failure is often imminent.

Offline_Uncorrectable: Indicates the number of uncorrectable errors during read and write access to sectors.
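
The three parameters above can be checked with a short filter. The following is a minimal sketch: the function name nonzero_sector_attrs is hypothetical, and it assumes that the raw value of these attributes is a plain number in the last column of the attribute table.

```shell
# Read `smartctl -A` output on standard input and print every watched
# attribute whose raw value (taken as the last field on the line) is not zero.
# Assumption: for these three attributes the raw value is a plain number.
nonzero_sector_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($NF + 0 != 0) print $2 "=" $NF
        }
    '
}
```

For example, smartctl -A /dev/sda | nonzero_sector_attrs prints nothing for the sample output in this article, because all three raw values are 0.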

The last section deals with the internal error log of the hard disk. Errors are recorded here if requests from the server to the hard disk were not processed properly. If an error count with two or more digits is displayed in this section, you should arrange for the hard disk to be replaced as soon as possible.

SMART Error Log Version: 1
No Errors Logged

Required Information for Hard Disk Replacement

The following information is required to initiate the replacement of the defective hard disk:

Designation of the hard disk in the RAID (e.g. sda)
Serial number
Model
Log file (optional)
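
The model and serial number can be pulled out of the information section with a small filter. This is a minimal sketch that assumes the "Device Model:" and "Serial Number:" labels shown in the sample output above. The function name disk_identity is hypothetical.

```shell
# Read `smartctl -i` output on standard input and print the model and
# serial number, one per line, as needed for the replacement request.
disk_identity() {
    awk -F': *' '
        $1 == "Device Model"  { print "Model: "  $2 }
        $1 == "Serial Number" { print "Serial: " $2 }
    '
}
```

For example, smartctl -i /dev/sdb | disk_identity prints the two values for the disk sdb.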

Creating a SMART Log

To create a full SMART log, enter the following command:

smartctl -x [DEVICE_NAME]

Example:

[root@localhost ~]# smartctl -x /dev/sda

If the hard disk can no longer be accessed using Smartctl, you can use the hdparm program to retrieve the necessary information. To install hdparm, enter the following command:

CentOS

yum -y install hdparm

Ubuntu/Debian

sudo apt-get update
sudo apt-get install hdparm

Then enter the following command to retrieve the information required for disk replacement:

hdparm -i /dev/sda
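
The hdparm output can be reduced to the two values needed for the replacement request. The sketch below assumes that hdparm -i prints a line of the form "Model=..., FwRev=..., SerialNo=...". This format is an assumption about the installed hdparm version and should be checked against the actual output. The function name hdparm_identity is hypothetical.

```shell
# Read `hdparm -i` output on standard input and print model and serial number.
# Assumed input line: " Model=<model>, FwRev=<firmware>, SerialNo=<serial>"
hdparm_identity() {
    awk -F', *' '
        /Model=/ && /SerialNo=/ {
            for (i = 1; i <= NF; i++) {
                f = $i
                sub(/^ +/, "", f)    # drop leading blanks
                sub(/ +$/, "", f)    # drop trailing blanks
                if (f ~ /^Model=/)    { sub(/^Model=/, "", f);    print "Model: " f }
                if (f ~ /^SerialNo=/) { sub(/^SerialNo=/, "", f); print "Serial: " f }
            }
        }
    '
}
```

For example: hdparm -i /dev/sda | hdparm_identity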

Notes

If the SMART log was created as described above, this is sufficient information. You can then arrange for the defective hard disk to be replaced. Please contact IONOS Customer Service for this.
If you cannot retrieve the serial number of the defective hard disk using Smartctl, you can alternatively provide the serial number of the working hard disk(s) to IONOS Customer Service.

Preparing a Server for Hard Disk Replacement

The following example assumes that the second hard disk (sdb) is to be replaced. For example, the following status of the software RAID is displayed during the status check:

[root@host ~]# cat /proc/mdstat

Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2]
439553856 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb1[2] sda1[0]
19529600 blocks super 1.0 [2/2] [UU]

unused devices: <none>

The second hard disk (sdb) is still mounted in the RAID in this example and is therefore still in use.

Manually mark raid device as "faulty" to remove it from RAID

To mark the defective disk as "faulty" so that it can be removed from RAID, enter the following command:

[root@host ~]# mdadm PATH_OF_RAID_ARRAY -f PATH_OF_PARTITION

In the examples below, the partitions sdb3 and sdb1 are marked as faulty:

[root@host ~]# mdadm /dev/md3 -f /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md3

[root@host ~]# mdadm /dev/md1 -f /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md1

After entering the command, the RAID has the following status:

[root@host ~]# cat /proc/mdstat

Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2](F)
439553856 blocks super 1.0 [2/1] [U_]

md1 : active raid1 sdb1[2](F) sda1[0]
19529600 blocks super 1.0 [2/1] [U_]

unused devices: <none>

Remove partition from the multiple device

To remove a partition from the Multiple Device, issue the following command:

[root@host ~]# mdadm -r PATH_OF_RAID_ARRAY PATH_OF_PARTITION

In the examples below, the partitions sdb3 and sdb1 are removed from the multiple devices md3 and md1, respectively:

[root@host ~]# mdadm -r /dev/md3 /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md3

[root@host ~]# mdadm -r /dev/md1 /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md1

Then check the status of the RAID. In this example, the RAID that was prepared for disk replacement has the following final state:

[root@host ~]# cat /proc/mdstat

Personalities : [raid1]
md3 : active raid1 sda3[0]
439553856 blocks super 1.0 [2/1] [U_]

md1 : active raid1 sda1[0]
19529600 blocks super 1.0 [2/1] [U_]

unused devices: <none>
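
When a disk has partitions in several arrays, the commands for marking and removing them can be derived from /proc/mdstat. The following is a minimal sketch that only prints the commands for review and does not run them. The function name plan_disk_removal is hypothetical. It assumes partition names of the form disk name plus number (sdb1, sdb3), so it does not cover naming schemes such as nvme0n1p1.

```shell
# Print the mdadm commands needed to mark as faulty and remove every
# partition of DISK that is a member of an md array.
# Usage: plan_disk_removal DISK [MDSTAT_FILE]   (default: /proc/mdstat)
plan_disk_removal() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*/, "", member)    # strip "[2]" and "(F)" suffixes
                if (member ~ ("^" disk "[0-9]+$")) {
                    printf "mdadm /dev/%s -f /dev/%s\n", $1, member
                    printf "mdadm /dev/%s -r /dev/%s\n", $1, member
                }
            }
        }
    ' "${2:-/proc/mdstat}"
}
```

For example, plan_disk_removal sdb prints one -f and one -r command per affected array. Check the list carefully before entering the commands, because applying them to the wrong disk removes the working half of the RAID.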

Check which swap partitions are used

Check which swap partitions are used by the operating system. To do this, type the following command:

[root@host ~]# cat /proc/swaps

Filename Type Size Used Priority
/dev/sda2 partition 9765884 0 -1
/dev/sdb2 partition 9765884 0 -2

Alternatively, you can check which swap partitions are defined in fstab by entering the following command:

[root@host ~]# grep swap /etc/fstab
/dev/sda2 none swap sw
/dev/sdb2 none swap sw

Disable swap partition on the defective device

Disable the swap partition on the defective disk so that the disk can be replaced. To do this, type the following command:

[root@host ~]# swapoff PATH_OF_SWAP_PARTITION

Example:

[root@host ~]# swapoff /dev/sdb2
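
To find the active swap partitions on the defective disk without reading the list by hand, /proc/swaps can be filtered by disk name. This is a minimal sketch that assumes the column layout shown above and partition names of the form disk name plus number. The function name swap_on_disk is hypothetical.

```shell
# Print the active swap partitions located on DISK.
# Usage: swap_on_disk DISK [SWAPS_FILE]   (default: /proc/swaps)
swap_on_disk() {
    awk -v disk="$1" '
        # Skip the header line, then match /dev/<disk><number> in column 1.
        NR > 1 && $1 ~ ("^/dev/" disk "[0-9]+$") { print $1 }
    ' "${2:-/proc/swaps}"
}
```

A possible use is: swap_on_disk sdb | while read -r part; do swapoff "$part"; done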

Please Note

If the swap partition on the defective disk is not deactivated before the disk is replaced, the swap partition is shown with the status (deleted) in /proc/swaps.

Arranging for Hard Disk Replacement

Now the replacement of the defective hard disk can be arranged. For this purpose please contact IONOS Customer Service.

Required Steps After Replacing the Hard Disk

After the defective hard disk has been replaced, you must rebuild the software RAID. For more information about rebuilding a software RAID, see the article Rebuild Software RAID (Linux).
