Diagnose and Replace a Defective Hard Drive (Windows Dedicated Server with Hardware RAID)

In this article, you will learn how to identify a defective hard drive and prepare your server for the replacement.

Prerequisite

This article has been created for customers who have at least a basic knowledge of Windows server administration. If you have any questions or need help with the drive replacement, please contact Customer Service.

To ensure the best performance of your dedicated server, monitor its hardware RAID regularly. If you find that a hard drive is defective, or if you receive a notification email about a defective hard drive, contact Customer Service to arrange for the replacement. Before you do so, identify the defective hard drive and prepare the server for the exchange.

Proceed with caution!

RAID systems enable greater reliability and/or higher speeds. However, they are not a substitute for regular backups. To avoid data loss, we recommend that you back up your data regularly. Also back up your data before performing the following steps.

For more information on creating backups, click here:

Backup Solutions

Hardware RAID Controllers: General Information

A hardware RAID controller is a physical controller that is built into the server as a hardware component. This controller has its own processor, which performs the RAID calculations and organizes and manages the storage space. Accordingly, the CPU of the server is not burdened by RAID calculations. With hardware RAID controllers, the RAID functionality is also independent of the operating system. The controllers are managed by special Command Line Interface (CLI) programs, which vary depending on the manufacturer and model.

Diagnosing Hard Drive Errors

In order to detect hard drive errors, we recommend that you use the smartctl program.

Smartctl is a command line program for monitoring drives using SMART (Self-Monitoring, Analysis and Reporting Technology). With this program, you can check whether a hard drive is defective. It is part of the Smartmontools package.

A list of supported hardware controllers can be found here:

https://www.smartmontools.org/wiki/Supported_RAID-Controllers

 

Install Smartctl

You can download the Smartmontools on the following page:

https://www.smartmontools.org/wiki/Download#InstalltheWindowspackage

Identifying Hardware RAID Controllers

How to check which hardware RAID controller is built into your server:

  • Open the Control Panel.

  • Click Hardware > Devices and Printers > Device Manager.

  • In the Storage controllers section, check which controller is installed in the server.

 

Checking the Status of the Hardware RAID

Information on checking the status of the hardware RAID can be found here:

Hardware RAID Monitoring / Rebuilding (Windows)

If a disk is missing from the RAID array, it may be faulty or broken. A defective RAID could look like this:

CLI> rsf info
# Name Disks TotalCap FreeCap DiskChannels State
====================================================================================================================================================================================================
1 Raid Set # 00 3 2250.

In the above example, disk 2 has the status Incomplete. This indicates a defect.

Viewing Hard Drive Information

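Smartctl accepts the same options in Windows and Linux, so you can largely use the same commands. Device names can differ between systems; the command smartctl --scan lists the devices that Smartctl detects on your server. To use Smartctl for troubleshooting, open the command prompt and change to the directory where the Smartmontools are located.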

To use Smartctl to access hard drive information, you must always specify the appropriate command in combination with an option and a target device. The target device depends on the controller manufacturer.

Use the commands listed below to retrieve the information required to diagnose the hard drive:

Manufacturer   Hard disk   Command
ARECA          1           smartctl -iHAl error /dev/sg1 -d areca,1
ARECA          2           smartctl -iHAl error /dev/sg1 -d areca,2
LSI / 3Ware    1           smartctl -iHAl error /dev/twe0 -d 3ware,0
LSI / 3Ware    2           smartctl -iHAl error /dev/twe0 -d 3ware,1
Adaptec        1           smartctl -iHAl error /dev/sg2 -d sat
Adaptec        2           smartctl -iHAl error /dev/sg3 -d sat
Adaptec        (3)         smartctl -iHAl error /dev/sg4 -d sat
Adaptec        (4)         smartctl -iHAl error /dev/sg5 -d sat
Dell           1           smartctl -iHAl error -d sat+megaraid,0 /dev/sda
Dell           2           smartctl -iHAl error -d sat+megaraid,1 /dev/sda
Broadcom       1           smartctl -iHAl error -d sat+megaraid,0 /dev/sda
Broadcom       2           smartctl -iHAl error -d sat+megaraid,1 /dev/sda
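If you check several drives regularly, you may prefer to build these commands from a small script. The following Python sketch simply mirrors the table above; the names SMARTCTL_TARGETS and build_diagnosis_command are illustrative and not part of Smartmontools, and the device arguments may need to be adjusted to your server.

```python
# Device arguments copied from the table above (illustrative lookup table;
# adjust the entries if your devices are named differently).
SMARTCTL_TARGETS = {
    ("ARECA", 1): ["/dev/sg1", "-d", "areca,1"],
    ("ARECA", 2): ["/dev/sg1", "-d", "areca,2"],
    ("LSI / 3Ware", 1): ["/dev/twe0", "-d", "3ware,0"],
    ("LSI / 3Ware", 2): ["/dev/twe0", "-d", "3ware,1"],
    ("Adaptec", 1): ["/dev/sg2", "-d", "sat"],
    ("Adaptec", 2): ["/dev/sg3", "-d", "sat"],
    ("Adaptec", 3): ["/dev/sg4", "-d", "sat"],
    ("Adaptec", 4): ["/dev/sg5", "-d", "sat"],
    ("Dell", 1): ["-d", "sat+megaraid,0", "/dev/sda"],
    ("Dell", 2): ["-d", "sat+megaraid,1", "/dev/sda"],
    ("Broadcom", 1): ["-d", "sat+megaraid,0", "/dev/sda"],
    ("Broadcom", 2): ["-d", "sat+megaraid,1", "/dev/sda"],
}


def build_diagnosis_command(manufacturer, disk):
    """Return the smartctl diagnosis command for one disk as an argument list."""
    try:
        target = SMARTCTL_TARGETS[(manufacturer, disk)]
    except KeyError:
        raise ValueError(f"no table entry for {manufacturer} disk {disk}") from None
    return ["smartctl", "-iHAl", "error", *target]


# Show the command for the second disk on an ARECA controller.
print(" ".join(build_diagnosis_command("ARECA", 2)))
# prints: smartctl -iHAl error /dev/sg1 -d areca,2
```

Returning a list rather than a single string makes it easy to pass the command to a process launcher later without worrying about quoting.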

Additional commands for supported hardware controllers can be found on the following page:

https://www.smartmontools.org/wiki/Supported_RAID-Controllers

Example:

C:\Program Files\smartmontools\bin>smartctl -iHAl error /dev/sg1 -d areca,1

smartctl 7.0 2018-12-30 r4883 [x86_64-w64-mingw32-2016] (sf-7.0-1)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K2
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6M0JAUEV8
LU WWN Device Id: 5 0014ee 00482c2ec
Firmware Version: RAGNWA07
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jan 17 06:17:05 2019 CAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   141   140   021    Pre-fail  Always       -       3933
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       34
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
 16 Gas_Gauge               0x0022   000   200   000    Old_age   Always       -       1822115874
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   113   109   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

Interpreting the Data

The first section lists characteristic information about the hard drive. In this section, you will find the device model, the serial number and the size of the tested hard disk:

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K2
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6M0JAUEV8
LU WWN Device Id: 5 0014ee 00482c2ec
Firmware Version: RAGNWA07
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jan 17 06:17:05 2019 CAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

In the second section, Smartctl reports its overall assessment of the hard disk. If a value such as FAILED or UNKNOWN is displayed instead of PASSED, you should replace the hard disk as soon as possible.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
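If you save the Smartctl output to a text file, this check can be automated. The following Python sketch assumes the report contains the line shown above; the function names are illustrative. As a cautious assumption, it treats every result other than PASSED, including a missing result line, as a reason to take a closer look.

```python
import re


def overall_health(report):
    """Return the overall-health result (e.g. 'PASSED'), or None if the line is missing."""
    match = re.search(r"self-assessment test result:\s*(\S+)", report)
    return match.group(1) if match else None


def needs_attention(report):
    """Treat anything other than PASSED as a reason to inspect the drive."""
    return overall_health(report) != "PASSED"


SAMPLE = """\
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
"""

print(overall_health(SAMPLE))   # prints: PASSED
print(needs_attention(SAMPLE))  # prints: False
```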

In the third section, the SMART values determined are listed in detail. Next to each current normalized value (VALUE), the worst value ever measured (WORST) and the respective limit value (THRESH) are listed. For these normalized values, higher is better. If the current value (VALUE) or the worst value ever measured (WORST) falls to or below the limit value (THRESH), a SMART warning is displayed in the WHEN_FAILED column (e.g. FAILING_NOW).

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   141   140   021    Pre-fail  Always       -       3933
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       34
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
 16 Gas_Gauge               0x0022   000   200   000    Old_age   Always       -       1822115874
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   113   109   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
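The comparison of VALUE against THRESH can also be scripted. The following Python sketch parses the attribute table from a saved report and lists attributes that either carry a WHEN_FAILED warning or have a normalized value at or below a non-zero threshold. The names Attribute, parse_attributes and failing_attributes are illustrative, and the parser assumes the ten-column layout shown above.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    id: int
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


def parse_attributes(report):
    """Parse the rows that follow the 'ID# ATTRIBUTE_NAME ...' header line."""
    attributes = []
    in_table = False
    for line in report.splitlines():
        if line.startswith("ID#"):
            in_table = True
            continue
        if in_table:
            fields = line.split()
            # A blank or non-numeric line marks the end of the table.
            if len(fields) < 10 or not fields[0].isdigit():
                break
            attributes.append(Attribute(
                int(fields[0]), fields[1], int(fields[3]), int(fields[4]),
                int(fields[5]), fields[8], " ".join(fields[9:])))
    return attributes


def failing_attributes(attributes):
    """Attributes with a warning, or with VALUE at or below a non-zero THRESH."""
    return [a for a in attributes
            if a.when_failed != "-" or (a.thresh > 0 and a.value <= a.thresh)]


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
"""

print(failing_attributes(parse_attributes(SAMPLE)))  # prints: []
```

Attributes with a threshold of 000 are skipped in the comparison because they have no failure limit to fall below.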

The following parameters can indicate an imminent hard drive failure before a SMART warning is displayed:

Reallocated_Sector_Ct: Specifies the number of sectors reassigned due to read errors. If a sector can no longer be read, written to or checked correctly, a replacement sector is automatically assigned to it. The faulty sector is permanently marked as unreadable. This is a clear warning sign for incipient surface problems. If this value is not equal to zero, a hard drive failure is often imminent. This value is the most important indicator for a hard drive replacement.

Current_Pending_Sector: Specifies the number of unstable sectors waiting for remapping. If a sector cannot be read and written correctly, it first receives the status Current Pending Sector. The sector is not reassigned in this state, since the data in the sector are unknown. Only after several unsuccessful read or write attempts is a replacement sector assigned and the defective sector permanently marked as unreadable. The value Current_Pending_Sector is an important indicator for a hard drive replacement. If this value is not equal to zero, a hard drive failure is often imminent.

Offline_Uncorrectable: Specifies the number of uncorrectable write and read sector errors.
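To watch these three parameters over time, you can extract their raw values from a saved report. The following Python sketch returns every watched attribute whose RAW_VALUE is greater than zero. The function name is illustrative, and the sample line uses a made-up raw value of 8 purely for demonstration.

```python
# Attribute names as they appear in the ATTRIBUTE_NAME column of the report.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def early_warnings(report):
    """Return {attribute name: raw value} for watched attributes with a non-zero raw value."""
    warnings = {}
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                warnings[fields[1]] = int(raw)
    return warnings


# Hypothetical report excerpt: the raw value 8 is invented for this example.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

print(early_warnings(SAMPLE))  # prints: {'Reallocated_Sector_Ct': 8}
```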

The last section shows the internal error log of the hard drive. Errors are recorded here if the hard drive did not process commands from the server correctly. If the number of errors in this section has at least two digits, you should replace the hard drive as soon as possible.

SMART Error Log Version: 1
No Errors Logged
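The two-digit rule can be checked with a short script as well. The following Python sketch assumes that, for ATA drives with logged errors, the report contains a line of the form "ATA Error Count: N" instead of "No Errors Logged"; if your controller or drive type prints the count differently, the pattern needs to be adapted. The function names are illustrative.

```python
import re


def logged_error_count(report):
    """Return the number of logged errors, or None if it cannot be determined."""
    if "No Errors Logged" in report:
        return 0
    # Assumption: ATA drives report the total as 'ATA Error Count: N'.
    match = re.search(r"ATA Error Count:\s*(\d+)", report)
    return int(match.group(1)) if match else None


def error_log_is_critical(report):
    """Apply the rule from this article: ten or more logged errors."""
    count = logged_error_count(report)
    return count is not None and count >= 10


SAMPLE = """\
SMART Error Log Version: 1
No Errors Logged
"""

print(logged_error_count(SAMPLE))     # prints: 0
print(error_log_is_critical(SAMPLE))  # prints: False
```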

Preparing for a Hard Drive Replacement

Viewing Detailed Information for Drive Replacement

The following information is required in order to replace the defective hard drive:

  • Name of the hard drive in the RAID

  • Serial number

  • Model

  • Log file (optional)
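The model and serial number appear in the information section of the Smartctl output. As a convenience, the following Python sketch extracts both from a saved report; the function name drive_identity is illustrative, and it assumes the "Device Model" and "Serial Number" labels shown in the example output of this article.

```python
def drive_identity(report):
    """Return the model and serial number found in the information section."""
    wanted = {"Device Model": "model", "Serial Number": "serial"}
    info = {}
    for line in report.splitlines():
        # Split each 'Label: value' line at the first colon.
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            info[wanted[key.strip()]] = value.strip()
    return info


SAMPLE = """\
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K2
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6M0JAUEV8
"""

print(drive_identity(SAMPLE))
# prints: {'model': 'HGST HUS722T1TALA604', 'serial': 'WMC6M0JAUEV8'}
```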

 

Creating a SMART Log

Use the commands listed below to generate a complete SMART log:

Manufacturer   Hard disk   Command
ARECA          1           smartctl -x /dev/sg1 -d areca,1
ARECA          2           smartctl -x /dev/sg1 -d areca,2
LSI / 3Ware    1           smartctl -x /dev/twe0 -d 3ware,0
LSI / 3Ware    2           smartctl -x /dev/twe0 -d 3ware,1
Adaptec        1           smartctl -x /dev/sg2 -d sat
Adaptec        2           smartctl -x /dev/sg3 -d sat
Adaptec        (3)         smartctl -x /dev/sg4 -d sat
Adaptec        (4)         smartctl -x /dev/sg5 -d sat
Dell           1           smartctl -x -d sat+megaraid,0 /dev/sda
Dell           2           smartctl -x -d sat+megaraid,1 /dev/sda
Broadcom       1           smartctl -x -d sat+megaraid,0 /dev/sda
Broadcom       2           smartctl -x -d sat+megaraid,1 /dev/sda
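To keep the log as a file that you can pass on, you can capture the command output with a small script. The following Python sketch runs a command given as an argument list and writes its standard output to a file; the function name save_smart_log and the file name in the usage note are illustrative.

```python
import subprocess


def save_smart_log(command, log_path):
    """Run the command, write its standard output to log_path, return the exit status."""
    result = subprocess.run(command, capture_output=True, text=True)
    with open(log_path, "w", encoding="utf-8") as log_file:
        log_file.write(result.stdout)
    return result.returncode
```

For example, calling save_smart_log(["smartctl", "-x", "/dev/sg1", "-d", "areca,1"], "smart_disk1.log") from the Smartmontools directory would store the full log for the first ARECA disk. Smartctl reports problems through a non-zero exit status, so the script returns the status instead of raising an error.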

If the SMART log was created as described above, it will contain all of the information you need to request the replacement.

If you cannot find the serial number of the defective hard drive using Smartctl, you can alternatively provide Customer Service with the serial number of the functioning hard drive(s).

If you cannot determine the information required for the replacement but still want the hard drive replaced, the hardware must be checked first. The server is usually temporarily unavailable during this check. If the check detects a defect in the hard drive, the drive will be replaced.

Arranging for a Hard Drive Replacement

You can then have the defective hard drive replaced. To do this, please contact 1&1 IONOS Customer Support.

Steps to Take After Replacing the Hard Drive

After the defective hard drive has been replaced, the RAID system has to be rebuilt, which usually starts automatically. Please make sure that the rebuild of the RAID system starts and completes successfully.