SSDs do die, as Linus Torvalds just discovered
- 12 September, 2013 18:41
Linus Torvalds found out the hard way that solid-state drives (SSDs) aren't invincible -- and when they do fail, they can die without warning and at inconvenient times.
The creator of the Linux kernel blogged this week that the SSD in his workstation simply stopped working, interrupting his work on the Linux 3.12 kernel.
"The timing absolutely sucks, but it looks like the SSD in my main workstation just died on me," Torvalds wrote. "I had pushed out most of my pulls today, so realistically I didn't lose a lot of work."
While SSDs are vastly better performers than hard disk drives and are considered more reliable for mobile devices because they have no mechanical parts to break, they do have a limited lifespan. With some early SSDs, that lifespan ended up being less than a year, depending on the quality and use of the drive.
As an investigation into SSD reliability performed by Tom's Hardware noted: "We know that SSDs still fail.... All it takes is 10 minutes of flipping through customer reviews on Newegg's listings."
While there are no moving parts in an SSD, the semiconductor components can fail. For example, a NAND die, the SSD controller, capacitors, or other passive components can -- and do -- slowly wear out or fail entirely.
While Torvalds didn't name the SSD's manufacturer in his blog post, he wrote in a 2008 post that he'd purchased an 80GB Intel SSD, likely the X25, which has become something of an industry standard for SSD reliability. The early X25s were built on the highest-quality NAND flash chips available at the time.
Anecdotally, various editors at Computerworld have experienced failures related to OCZ-brand SSDs. But considering Torvalds' SSD may have been six years old, it would be hard to criticize its endurance -- especially when it was being used in a workstation.
"I think the best way to describe SSD reliability is that thanks to controller maturation, average product endurance is improving and the standard deviation is falling," said Ryan Chien, an SSDs and Storage analyst with IHS's Electronics & Media division.
"Although most client drives outlast their three-to-five year warranties, if Torvalds was subjecting such a drive to heavier workstation-type workloads, which happens a fair bit in enterprise, the lifespan likely will not meet expectations," Chien said.
Multiple factors affect SSD reliability, according to Jeff Janukowicz, research director for SSD and Enabling Technologies at IDC.
The NAND flash media plays a key role, as its quality differs between manufacturers. And earlier generations of NAND flash have lower endurance characteristics related to bit errors -- when electrons leak through cell walls -- and program disturbs. A program disturb is the unintentional programming of a memory cell. Do it enough, and endurance suffers.
Also, there are several flavors of NAND flash: single-level cell NAND writes just one bit per transistor, giving it innately greater performance and endurance; multi-level cell flash writes two bits per cell, which wears memory out more quickly; and most recently, 3-bit or triple-level cell flash has added yet another bit to the equation, further degrading native endurance.
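One way to see why each added bit costs endurance: a cell that stores n bits has to distinguish 2^n charge levels, so the margin between neighboring levels narrows as bits are added. A small Python sketch of that arithmetic follows; the names `CELL_TYPES` and `charge_levels` are illustrative and do not come from any vendor.

```python
# Bits stored per cell for each NAND flavor described above.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3}


def charge_levels(bits_per_cell: int) -> int:
    """Return how many distinct states one cell must distinguish."""
    return 2 ** bits_per_cell


for name, bits in CELL_TYPES.items():
    print(f"{name}: {bits} bit(s) per cell, {charge_levels(bits)} charge levels")
```

A TLC cell has to tell eight levels apart where an SLC cell needs only two, which is part of why the controller and its error correction matter more as bits per cell increase.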
For example, Samsung's 840 EVO SSD uses TLC memory, yet because of the sophistication of the controller chip and its software, it will outlast any other component of the laptop or desktop it's in, according to Chris Geiser, senior product manager of Samsung's Memory and Storage Division.
"If I'm writing 10GB a day to a 120GB SSD, it will last over 10 years," Geiser said.
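Geiser's figures can be turned into a rough count of how many times the whole drive would be rewritten. The Python sketch below assumes perfectly even wear and ignores write amplification (the extra internal writes a controller performs), so it is a back-of-the-envelope estimate rather than a vendor rating; `lifetime_writes` is a hypothetical helper name.

```python
def lifetime_writes(daily_writes_gb: float, capacity_gb: float, years: float):
    """Return (total GB written, equivalent number of full-drive writes).

    Assumes writes are spread evenly and ignores write amplification.
    """
    total_gb = daily_writes_gb * 365 * years
    full_drive_writes = total_gb / capacity_gb
    return total_gb, full_drive_writes


total_gb, drive_writes = lifetime_writes(daily_writes_gb=10, capacity_gb=120, years=10)
print(f"{total_gb / 1000:.1f} TB written, about {drive_writes:.0f} full-drive writes")
# prints "36.5 TB written, about 304 full-drive writes"
```

Under these assumptions, ten years at 10GB a day rewrites the 120GB drive only about 300 times.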
Unlike hard disk drives, all SSDs slow down after initial use: once a sufficient amount of data has been written to them, the processor in the drive begins to move data around -- a function known as the read-modify-erase-write (erase-write) cycle. Each time new data is written to the SSD, old data must first be marked for deletion and erased before the new data can be written. Over time, the cells or transistors in NAND flash wear out from these erase-write cycles.
SSD makers have increased the sophistication of error correction and 'wear leveling' software, which works to more evenly spread data writes across a drive so as to not "wear out" any block of cells more quickly than another. But, eventually they all wear out.
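The idea behind wear leveling can be shown with a toy allocator that always sends the next write to the least-erased block. This is a minimal Python sketch, not how any particular controller works; real firmware also has to handle address mapping, garbage collection, and rarely changed data. The class name `WearLeveler` is made up for illustration.

```python
class WearLeveler:
    """Toy allocator that always writes to the least-erased block."""

    def __init__(self, num_blocks: int) -> None:
        # One erase counter per block; every block starts unworn.
        self.erase_counts = [0] * num_blocks

    def write(self) -> int:
        """Erase and program the least-worn block; return its index."""
        block = min(range(len(self.erase_counts)),
                    key=self.erase_counts.__getitem__)
        self.erase_counts[block] += 1
        return block


leveler = WearLeveler(num_blocks=4)
for _ in range(10):
    leveler.write()
print(leveler.erase_counts)  # prints [3, 3, 2, 2]
```

No block ever gets more than one erase ahead of any other, which is the "evenly spread" behavior described above.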
The SSD controller and the firmware are where error correction code (ECC) and wear leveling take place. In general, Janukowicz said, the controller and firmware are what differentiate an SSD's performance and reliability from those of a USB thumb drive, which also uses NAND flash.
The sophistication of the SSD controller, such as the digital signal processing algorithms used and the level of ECC, helps mitigate some of the intrinsic challenges of NAND, Janukowicz said.
As the NAND process shrinks -- that is, as the transistors become smaller and smaller to accommodate greater density and capacity -- firmware must compensate for the increase in errors. (The smaller cells or transistors get, the more likely data errors will occur.) NAND flash process technology has shrunk from 35 nanometers (nm) a few years ago to under 19nm today.
"From the data I've seen, client SSD annual failure rates under warranty tend to be around 1.5%, while HDDs are near 5%," Chien said.
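To put Chien's annual rates in perspective, they can be compounded over a three-year warranty. The Python sketch below assumes the failure rate is constant and independent from year to year, which is a simplification that real drives do not strictly follow; `survival_probability` is a hypothetical helper.

```python
def survival_probability(annual_failure_rate: float, years: int) -> float:
    """Chance a drive is still working after `years`, assuming a constant,
    independent failure rate each year (a simplifying assumption)."""
    return (1 - annual_failure_rate) ** years


for label, afr in (("SSD", 0.015), ("HDD", 0.05)):
    p = survival_probability(afr, 3)
    print(f"{label}: {p:.1%} chance of surviving 3 years")
# prints "SSD: 95.6% chance of surviving 3 years"
# prints "HDD: 85.7% chance of surviving 3 years"
```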
So the bottom line is that SSDs will fail -- even if you're Linus Torvalds -- but they are still more reliable and much faster than hard disk drives.
Lucas Mearian covers storage, disaster recovery and business continuity, financial services infrastructure and health care IT for Computerworld. Follow Lucas on Twitter at @lucasmearian or subscribe to Lucas's RSS feed. His e-mail address is email@example.com.