傳承科技資料救援中心 ~DescenDant Technical~: Statistical Causes of Hard Drive Failure (Failure Trends in a Large Disk Drive Population)

Monday, October 12, 2009

Statistical Causes of Hard Drive Failure (Failure Trends in a Large Disk Drive Population)

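This post summarizes reference statistics on hard drive lifetime, FYI.
The study covers 100,000 PATA/SATA drives, 5400 to 7200 rpm, 80 to 400 GB.
The study period ran from December 2005 to August 2006.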
Vendor MTBF and Google AFR
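Mean Time Between Failure (MTBF) is the lifetime reference figure that drive vendors publish. Read loosely, a 300,000-hour MTBF spec means that in a large population of identical drives about half will have failed before 300,000 hours of use (strictly, MTBF is a mean, not a median). What MTBF does not tell us is how long the remaining drives will keep running.
Ideally, under that reading, 600,000 drives rated at 300,000 hours MTBF would lose one drive every hour (300,000 failures spread over 300,000 hours).
Over a year that is 8,760 failed drives, which converts to an Annual Failure Rate (AFR) of 1.46% (8,760 / 600,000). Under the usual constant-failure-rate definition, the same spec gives roughly 8,760 / 300,000 ≈ 2.9% per year.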
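The arithmetic above can be checked with a short Python sketch. The function names are made up for illustration. `afr_half_population` follows the post's reading (half the drives fail, evenly spread, within the MTBF), while `afr_constant_rate` assumes a textbook exponential failure model that the post itself does not use.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def afr_half_population(population, mtbf_hours):
    """AFR under the post's reading: half of the population fails,
    evenly spread, within `mtbf_hours`."""
    failures_per_hour = (population / 2) / mtbf_hours
    failures_per_year = failures_per_hour * HOURS_PER_YEAR
    return failures_per_year / population


def afr_constant_rate(mtbf_hours):
    """AFR assuming a constant failure rate of 1/MTBF (exponential model).
    This assumption is not taken from the post."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


print(f"{afr_half_population(600_000, 300_000):.2%}")  # 1.46%
print(f"{afr_constant_rate(300_000):.2%}")             # 2.88%
```

The two results differ by about a factor of two because the first spreads only half of the population's failures over the MTBF period.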

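Manufacturer's MTBF specs
In practice, vendor MTBF figures differ somewhat from the real world, so drives in the field often do not last as long as the lab numbers suggest.
The cause lies in how vendors test. First, the lab environment cannot fully reflect real-world conditions. Second, the lab data rely on drive error reports, which are only one subset of the available feedback. A drive that reports no problem is not necessarily working properly, because drives can fail in many different ways.
MTBF should therefore be read as a floor on real-world failure rates, that is, a best case.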
How smart is SMART?

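SMART (Self-Monitoring, Analysis, and Reporting Technology) is designed to detect whether a drive is healthy. Four SMART readings are commonly considered to correlate strongly with drive failure:
» scan errors
» reallocation counts
» offline reallocation counts
» probational counts
Google found the strongest signal in the first item: after a drive's first scan error, it is 39 times more likely to fail within the next 60 days than a drive without such errors. The other three also correlate with failure, but less strikingly.
SMART can therefore warn about only some problems, and it should not be relied on too heavily. Regular backups remain the best protection, and if SMART raises any of these warnings, replace the drive.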
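As a rough illustration of that advice, the Python sketch below flags a drive when any of the four counters is non-zero. The counter names and the dictionary layout are hypothetical. Real SMART tools expose these values under vendor-specific attribute names, and this sketch does not read anything from a drive.

```python
# Hypothetical names for the four signals listed above; real SMART tools
# report them under vendor-specific attribute names and IDs.
WATCHED_COUNTERS = (
    "scan_errors",
    "reallocation_count",
    "offline_reallocation",
    "probational_count",
)


def smart_warnings(counters):
    """Return the watched counters that are non-zero in `counters` (a dict)."""
    return [name for name in WATCHED_COUNTERS if counters.get(name, 0) > 0]


def should_replace(counters):
    """Follow the post's advice: any warning at all means replace the drive."""
    return bool(smart_warnings(counters))


healthy = {"scan_errors": 0, "reallocation_count": 0}
suspect = {"scan_errors": 1, "reallocation_count": 0, "probational_count": 3}

print(should_replace(healthy))   # False
print(smart_warnings(suspect))   # ['scan_errors', 'probational_count']
```

A result of `False` here only means that no watched counter has fired. As noted above, it is not proof that the drive is healthy, so backups are still needed.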
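Overwork = early death?

It is commonly assumed that drives kept busy with reads and writes fail more often. Google found that this does not always hold.
After the first year of use, heavily used drives fail at most moderately more often than lightly used ones. Among three-year-old drives, the lightly used ones were actually the most likely to fail.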

Sudden heat death?
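Temperature is also commonly regarded as one of the main killers of hard drives. Google found, however, that temperatures that are too low are not good either. On average, 25 to 35 °C is the best range, and for drives in service for less than a year the best range is 35 to 45 °C.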

The original abstract follows:
Failure Trends in a Large Disk Drive Population
Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso
Abstract

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.
We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.
Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007), San Jose, CA, February 2007

傳承科技資料救援中心

Tel: (02) 2885-2078

Mobile: 0937-140949, 0937-093874 Business ID (統一編號): 29072143

E-mail:Radius@Livemail.tw http://www.radius.tw
