Preempting Hard Drive Failure with S.M.A.R.T.
Almost every modern hard drive comes with a feature that can help your staff preempt many device failures. Self-Monitoring, Analysis and Reporting Technology, or S.M.A.R.T., is a handy tool for keeping an eye on hard drive health in both data center and terminal environments. S.M.A.R.T. collects data on the performance of the motors, disk platters, read/write heads, and other device electronics. The HDD then compares that data against the degradation trends that indicate imminent device failure.
What Dead Drives Mean to You
A failed hard drive is at minimum an inconvenience and at worst a catastrophe, whether it sits in a data center server or an individual computer. HDDs don’t necessarily stop working all at once when they fail; they often start with performance problems that signal hardware failure is on the way. For example, an end user trying to read data from a bad disk sector may have to wait several minutes instead of seconds to get the same work done. In a worst-case scenario, a failed HDD means losing all the data stored on it.
The Metrics That Matter
Experts are often critical of the S.M.A.R.T. concept because it tracks a lot of data that isn’t necessarily relevant to detecting HDD failure. Even so, a handful of the tracked metrics are genuine red flags for deterioration. According to Google’s study “Failure Trends in a Large Disk Population,” scan errors, probational counts, reallocation counts, and offline reallocation counts are useful indicators of impending failure. Additionally, reported uncorrectable errors, command timeouts, the current pending sector count, and the uncorrectable sector count provide insight into potential device failure.
According to Backblaze, there is a strong correlation between uncorrected reads and HDD failure. Most drives report zero uncorrected reads for their entire service life, so the first ones to appear are an early indicator that the device is progressing towards failure. The more errors a drive reports, the more likely it is that failure is imminent.
However, HDDs can still fail without any tell-tale signs. S.M.A.R.T. works best as a way to identify many failing drives, not as insurance that catches every case. Critics also point out that the tracked metrics vary between manufacturers. Despite that inconsistency, the Google study did not find statistically significant differences in accuracy rates between manufacturers.
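As a rough illustration of how the metrics above might be watched, here is a minimal Python sketch. It assumes the raw counts have already been read from the drive by some other tool and placed in a dictionary. The attribute names and the warning_signs function are hypothetical labels chosen for this example, not names from any particular utility, and treating every nonzero count as a warning is a simplification.

```python
# Hypothetical attribute names; real tools label these differently per vendor.
WATCHED_ATTRIBUTES = (
    "reallocated_sector_count",
    "reported_uncorrectable_errors",
    "command_timeout",
    "current_pending_sector_count",
    "offline_uncorrectable_sector_count",
)


def warning_signs(raw_counts):
    """Return the watched attributes whose raw count is above zero.

    raw_counts maps attribute names to raw integer counts. Attributes
    that are missing from the mapping are treated as zero.
    """
    return {
        name: raw_counts[name]
        for name in WATCHED_ATTRIBUTES
        if raw_counts.get(name, 0) > 0
    }


# Example: a drive that has started reporting uncorrectable errors.
sample = {
    "reallocated_sector_count": 0,
    "reported_uncorrectable_errors": 2,
    "spin_up_time": 410,  # not on the watch list, so it is ignored
}
flags = warning_signs(sample)
if flags:
    print("Back up and investigate:", flags)
else:
    print("No watched attribute is above zero")
```

An empty result is not a clean bill of health, since drives can fail without warning. It only means none of the watched counters has moved yet.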
Take Action When Warned
Both the operating system and BIOS can access S.M.A.R.T. data and may warn you automatically if a problem is detected. You can also view the S.M.A.R.T. diagnostic information with a range of desktop applications, including the easy-to-use SpeedFan and CrystalDiskInfo monitoring tools. When you pull up the data, the program will display a range of attributes, showing each one’s current value and worst recorded value alongside the “Threshold,” the failure limit designated by the manufacturer for the device. Getting a S.M.A.R.T. alert means it’s time to take immediate action. Back up the data on the drive and replace it as soon as possible. You may be able to save the data if you take the drive out of server use immediately; continued use may destroy it, potentially costing your business even more time and money.
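The comparison against the “Threshold” can be sketched in a few lines of Python. This is an illustration only, under two assumptions that are not stated above: that values are normalized so lower means worse, and that an attribute has reached its failure limit when its value is at or below the threshold. The Attribute class and attribute_status function are hypothetical names for this example; check your monitoring tool’s documentation for the exact rules it applies.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    current: int    # current normalized value
    worst: int      # worst normalized value recorded so far
    threshold: int  # failure limit set by the manufacturer


def attribute_status(attr):
    """Classify one attribute against its threshold.

    Assumption: lower normalized values are worse, and a threshold of
    zero means the attribute is informational and never trips.
    """
    if attr.threshold == 0:
        return "ok"
    if attr.current <= attr.threshold:
        return "failing now"
    if attr.worst <= attr.threshold:
        return "failed in the past"
    return "ok"


# Example readings (made-up numbers for illustration).
readings = [
    Attribute("reallocated_sector_count", current=100, worst=100, threshold=36),
    Attribute("seek_error_rate", current=28, worst=28, threshold=30),
]
for attr in readings:
    print(attr.name, "->", attribute_status(attr))
```

Any result other than "ok" is a cue to back up the drive and plan its replacement.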