SSDs more reliable than HDDs? Study shows similar failure rate

Latest statistics from Backblaze suggest SSD reliability may not be significantly better than hard drives at all

In May, we reported on a study by Backblaze, which publishes statistics on the failure rates of the hard drives operated by its service. Backblaze has also started reporting failure statistics for the SSDs it uses as system drives, and the results at first seemed to be orders of magnitude better. But the new data now paints a much less promising picture: it looks like SSD failure rates may ultimately not be so different from those of HDDs.

At first glance, it may seem that SSDs are a lot more reliable, as a previous report from Backblaze (and here’s our article about it) suggested. Backblaze reports that the overall annualized failure rate of the HDDs the company runs in its servers as boot drives is 6.41% per year. So, during one year, around 6.5% of the installed drives fail. For SSDs used as boot drives, the statistics so far show a failure rate of only 1.05% per year.

Backblaze has 1,666 system SSDs and 1,607 system HDDs in its data centers, so the ratio is virtually fifty-fifty. But there’s a methodological problem that can skew the SSD failure rate and make it look much rosier than it actually is. The company only started deploying SSDs late, sometime in 2018, when small-capacity (240–256 GB) drives became very cheap. Therefore, the average age of the SSDs is only 14.2 months (the oldest are 33 months old), while the HDDs are all from previous years and average 52.4 months, or almost 4.5 years (meaning many are even older). Even the youngest HDDs are at least 27 months old.

Annualized HDD and SSD failure rate over the lifetime of the system (Source: Backblaze)

The mortality rate of HDDs is high at first glance: the 6.41% per year means a total of 619 drives were discarded during the time covered by the statistics. However, this is over 3,523,610 drive-days of service. The SSD failure count at Backblaze so far is only 17 units, which looks like an awfully good number in comparison, but this is over just 591,501 drive-days. While the shorter runtime is already accounted for in the annualized mortality figure, the comparison still ignores an important point: for HDDs, the failure rate increases with age, as the drives wear out from continuous server operation (for this task, Backblaze did not use drive models designed for 24/7 server operation).
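As a sanity check, the annualized rates above follow directly from the raw counts: failures divided by total drive-days of service, scaled to a full year. A minimal sketch in Python (the function name is ours; the figures are Backblaze’s):

```python
# Annualized failure rate (AFR): failures per drive-day, scaled to one year.
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Return the AFR as a percentage."""
    return failures / drive_days * 365 * 100

# Boot-drive figures quoted above (source: Backblaze)
hdd_afr = annualized_failure_rate(619, 3_523_610)
ssd_afr = annualized_failure_rate(17, 591_501)
print(f"HDD AFR: {hdd_afr:.2f}%  SSD AFR: {ssd_afr:.2f}%")  # 6.41% vs 1.05%
```

Note that dividing by drive-days rather than drive count is what lets fleets of different sizes and ages be compared at all, which is also why the remaining age bias is easy to overlook.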

Having mostly old HDDs but relatively new SSDs in the statistics can cloud the comparison a lot, as you will see. Incidentally, Backblaze says it will also release a study sometime in the future on how HDD reliability degrades with age, which will probably make for very informative reading.

SSDs and HDDs of the same age have similar failure rates

Anyway, for a better comparison, Backblaze tried to examine the HDD failure data only up to the point when the HDDs’ average age was similar to that of the SSDs now. That point was Q4 2016, when the system HDDs reached an average age of 14.3 months; there were 1,297 of them in the fleet, with a combined 659,526 drive-days of service. And for this period, a completely different reliability characteristic emerges: only 25 drives died, which works out to a failure rate of just 1.38% per year.
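Plugging the age-matched counts into the same annualized-rate calculation (the helper function is our own sketch; the counts are from Backblaze) reproduces the 1.38% figure and shows how close it lands to the SSD rate:

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Failures per drive-day, scaled to one year, as a percentage."""
    return failures / drive_days * 365 * 100

# HDDs through Q4 2016, when their average age matched today's SSDs
hdd_young_afr = annualized_failure_rate(25, 659_526)
ssd_afr = 1.05  # current SSD AFR from the same report
print(f"Age-matched HDD AFR: {hdd_young_afr:.2f}%")   # about 1.38%
print(f"Ratio vs. SSD: {hdd_young_afr / ssd_afr:.2f}x")
```

The ratio comes out to roughly 1.3, i.e. about a third more HDD failures, rather than the six-fold gap the naive comparison suggested.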

HDD and SSD failure rates when comparing drives of roughly the same age (Source: Backblaze)

So suddenly the numbers are actually on the same order of magnitude: if HDDs and SSDs are the same age and within the first one to three years of their life, they die at pretty much the same rates. While the 1.38% failure rate means about a third more HDD failures than the 1.05% per year for SSDs, there is one thing that speaks a bit in favor of HDDs. An HDD failure in this statistic means either sudden total bricking (where you lose all your data) or pre-emptive retirement, which Backblaze performs based on warning signals in the SMART statistics (bad/reallocated sectors, among others). In the latter case the HDD is still functional and at least most of the data can be read, which is a less severe form of failure.

For SSDs, the company does not yet remove drives from service pre-emptively, due to a lack of experience with their behavior, so all recorded SSD failures are the worst case of total sudden death (where the data could not be read from the SSD). By the way, according to Backblaze, none of the SSD failures in this statistic were due to NAND write-cycle exhaustion.

Failure rates increase significantly for HDDs that are several years old

To illustrate how much age matters, Backblaze has the following graph, where HDD mortality over time in use is plotted in blue and SSD mortality in orange. As you can see, aside from perhaps 33% better SSD reliability (possible statistical error and the previous paragraph aside), the curves are very similar to begin with. If the HDDs hadn’t kept being used despite their increasing age, they wouldn’t have fared so much worse in 2014 to 2017 than the SSDs did in 2018 to 2021. And it looks like SSDs might be starting to see their failure rate increase a bit with age as well. It remains to be seen whether SSDs could actually exhaust their erase/rewrite cycle life under high load when used for several years.

Evolution of HDD and SSD failure rates with age (Source: Backblaze)

So the Backblaze data cannot yet be used to say that SSDs are orders of magnitude (or at least 6×, as the number quoted in the introduction implied) more reliable. The big question is how SSD failure rates will evolve going forward. After all, Backblaze’s statistics end right where the knee in the HDD curve bends upward. For H1 2021 (not a full year yet), we see some deterioration in the SSDs, but it may not be a significant change: it could be either statistical noise or the warning start of a trend of worsening failure rates. Only the next year or two will tell whether the curve will indeed replicate the significantly worsening prospects of HDDs.

Lesson: SSD is no guarantee of anything, always back up important data

What do we take from this now? Personally, I’d probably be a mild optimist and not assume that SSD failure rates will get exactly as bad with age as the blue HDD curve. It’s still true that SSDs are less susceptible to certain problems thanks to the lack of mechanical parts, and while there is the threat of limited write life, the combined risk from the various failure modes is probably lower overall with these storage devices. It’s also important to note that there is a much larger number of SSD manufacturers, including a lot of flaky brands from China, while the mere three remaining HDD manufacturers probably all have higher quality standards than many low-tier SSD makers. Many virtually “no name” SSDs can be so poorly designed and manufactured (both firmware- and hardware-wise) that they have extremely degraded failure rates or shortened lifetimes.

But we simply don’t yet know for sure what the real failure rate of SSDs that are several years old will turn out to be in this statistic. It’s quite possible that instead of 6× better reliability, SSDs will only turn out to be maybe three or even just two times as reliable as HDDs. Of course, that will perhaps be complicated by the fact that when an SSD fails, it usually turns into a “brick” with 100% data loss, whereas with HDDs it’s quite common that you throw the drive away when bad sectors start to spread or you notice a warning SMART status, but in those scenarios you still have time to save your data.

The conclusion is that you should not overestimate the reliability of SSDs. Don’t think these storage devices can’t just suddenly stop working. So even with these drives, make sure you always have your data safely backed up. Anything important that you can’t or don’t want to lose must be stored independently somehow (look into backup guidelines and best practices, and remember that a “redundant” RAID 1 array, for example, can be destroyed by a hardware or software failure all at once, so having data on a RAID 1 volume does not remove the need for backups).

Obviously, when you have a lot of data, it may not be economically viable to preserve a second copy of everything somewhere. But even if you are willing (or forced) to accept the risk of losing some things, you should always determine which parts of your data are really important or emotionally valuable and back up that selection in some way. (With that, we wish that disk failures avoid you and you never have to deal with data loss.)

Source: Backblaze

English translation and edit by Jozef Dudáš, original text by Jan Olšan, editor for Cnews.cz




