Have you ever been working when the power goes poof and then you’ve lost all the data since you last saved? The day may be coming when you never have to worry about losing data again if a computer crashes thanks to a new “file system that could lead to computers guaranteed never to lose your data.” Big brains at MIT have come up with “the first file system that is mathematically guaranteed not to lose track of data during crashes.”
MIT's Electrical Engineering and Computer Science (EECS) Department announced that MIT researchers will present their new research on “Using Crash Hoare logic for Certifying the FSCQ File System” at the 25th ACM Symposium on Operating Systems Principles (SOSP'15) in October.
FSCQ is a novel file system with a machine-checkable proof (using the Coq Proof Assistant) that its implementation meets its specification, even under crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. FSCQ's theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover the file system correctly without losing data. To state FSCQ's theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash predicate, a recovery function, and name spaces (which allow specifying disk states at different abstraction levels in the file system). To help prove FSCQ's theorems, we chose a simple design for FSCQ. However, experiments with FSCQ running as a user-level file system show that even this simple design is sufficient to run Unix applications with usable performance.MIT explained, “Formal verification involves mathematically describing the acceptable bounds of operation for a computer program and then proving that the program will never exceed them. It’s a complicated process, so it’s generally applied only to very high-level schematic representations of a program’s functionality. Translating those high-level schema into working code, however, can introduce myriad complications that the proofs don’t address.”
Although high-level schema existed, MIT was the first to “prove properties of the file system’s final code.” Theirs was “not some whiteboard idealization that has no formal connection to the code,” Chlipala said.
According to Daniel Ziegler, an undergraduate in EECS who took part in the 2014 Chaos Communication Congress lecture ‘Now I sprinkle thee with crypto dust,’ “All these paper proofs about other file systems may actually be correct, but there’s no file system that we can be sure represents what the proof is about.”
“What many people worry about is building these file systems to be reliable, both when they’re operating normally but also in the case of crashes, power failure, software bugs, hardware errors, what have you. Making sure that the file system can recover from a crash at any point is tricky because there are so many different places that you could crash. You literally have to consider every instruction or every disk operation and think, ‘Well, what if I crash now? What now? What now?’ And so empirically, people have found lots of bugs in file systems that have to do with crash recovery, and they keep finding them, even in very well tested file systems, because it’s just so hard to do.”“In the course of writing the file system, they repeatedly went back and retooled the system specifications, and vice versa,” reported MIT. They “rewrote the file system ‘probably 10 times,’ Zeldovich said. “We’ve written file systems many times over, so we know exactly what it’s going to look like. Whereas with all these logics and proofs, there are so many ways to write them down, and each one of them has subtle implications down the line that we didn’t really understand.”
Frans Kaashoek, a Charles Piper Professor in MIT's EECS, estimated that the research team spent “90% of their time on the definitions of the system components and the relationships between them and on the proof.” He added, “No one had done it. It’s not like you could look up a paper that says, ‘This is the way to do it.’ But now you can read our paper and presumably do it a lot faster.”
Ulfar Erlingsson, lead manager for security research at Google, said, “It’s not like people haven’t proven things in the past. But usually the methods and technologies, the formalisms that were developed for creating the proofs, were so esoteric and so specific to the problem that there was basically hardly any chance that there would be repeat work that built up on it. But I can say for certain that Adam’s stuff with Coq, and separation logic, this is stuff that’s going to get built on and applied in many different domains. That’s what’s so exciting.”
Related research to read until your head explodes
Until their research is published, trying to get a handle on this deep subject is tough; it might include reading “Specifying Crash Safety for Storage Systems” (pdf) which includes numerous links to GitHub. In a nutshell, “Specifying crash-safe storage systems is challenging.”
Then there’s “Building a file system with FSCQ infrastructure” (pdf), a paper by Haogang Chen, a graduate student in the Parallel & Distributed Operating Systems (PDOS) Group at MIT’s Computer Science and Artificial Intelligence Laboratory (MIT CSAIL). According to Chen’s paper, “Although FSCQ isn’t as complete and high-performance as today’s high-end file systems, our results demonstrate that this is largely due to FSCQ’s simple design, and not any inherent limitations of certified software.”
You could read until your head explodes, but Chen, Chlipala, Kaashoek, Zeldovich and Ziegler’s research could change the course of history by making data loss after a crash an extinct problem. That won’t do away with the need to backup however as surely no file system, no matter how advanced, can stop hard drives from failing.
Join the CIO Australia group on LinkedIn. The group is open to CIOs, IT Directors, COOs, CTOs and senior IT managers.