Hard disks appear to fail far more often than manufacturers' mean time to failure (MTTF) estimates predict. A study of 100,000 drives, conducted by Carnegie Mellon University and presented at the 5th Usenix Conference last month, examined factors that could make a single MTTF figure too simplistic.
Drives of all types were examined, drawn from high-performance computing sites and Internet service sites. Vendor data sheets gave these drives an MTTF of 1 million to 1.5 million hours, which implies an annual failure rate (AFR) of at most 0.88 percent. In practice, typical annual replacement rates were two to four percent.
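The 0.88 percent figure follows from simple arithmetic. The sketch below reproduces it under two assumptions that are not spelled out in the article: that drives are powered on around the clock (8,760 hours per year) and that the failure rate is constant over time. The function name nominal_afr is made up for illustration.

```python
# Nominal annual failure rate (AFR) implied by a data-sheet MTTF.
# Assumptions (ours, for illustration): 24x7 operation and a constant
# failure rate, so AFR is roughly hours-per-year divided by MTTF.
HOURS_PER_YEAR = 24 * 365  # 8,760


def nominal_afr(mttf_hours: float) -> float:
    """Return the implied AFR as a fraction (0.01 means 1 percent)."""
    return HOURS_PER_YEAR / mttf_hours


for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h -> AFR {nominal_afr(mttf) * 100:.2f}%")
# MTTF 1,000,000 h -> AFR 0.88%
# MTTF 1,500,000 h -> AFR 0.58%
```

Against these nominal values, observed replacement rates of two to four percent are roughly two to seven times higher.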
There was little difference in failure rates among the various kinds of drives: SCSI, Fibre Channel and SATA.
A key finding was that failure rates are not constant over a drive's life: "One aspect of disk failures that single-value metrics such as MTTF and AFR cannot capture is that in real life failure rates are not constant ... Failure rates of hardware products typically follow a 'bathtub curve' with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle."
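To make the "bathtub curve" idea concrete, here is a minimal sketch of a textbook bathtub-shaped hazard rate, built by adding a decreasing term (infant mortality), a constant term, and an increasing term (wear-out). The Weibull form and every parameter value are illustrative assumptions chosen only to produce the shape; none of them come from the study or from any real drive population.

```python
# Illustrative bathtub-shaped hazard (failures per drive-year).
# All shapes, scales and the constant term are made-up numbers.


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    infant = weibull_hazard(t_years, shape=0.5, scale=50.0)   # falls with age
    steady = 0.01                                             # constant background
    wear_out = weibull_hazard(t_years, shape=4.0, scale=8.0)  # rises with age
    return infant + steady + wear_out


for age in (0.1, 3.0, 7.0):
    print(f"age {age:>4} years: hazard {bathtub_hazard(age):.3f} per year")
```

A single MTTF or AFR number corresponds to treating this curve as a flat line, which is exactly the simplification the quoted passage criticizes.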
One important conclusion of the study concerned where single-value metrics go most wrong: "The common concern that MTTFs underrepresent infant mortality has led to the proposal of new standards that incorporate infant mortality .... Our findings suggest that the underrepresentation of the early onset of wear-out is a much more serious factor than underrepresentation of infant mortality and recommend to include this in new standards."
The full technical report is available at the Usenix site.