The Choice of Fault Tolerance Designs

There are several approaches to providing fault tolerance in parallel file systems. One simple way is to stripe data across multiple RAIDs attached to different cluster nodes. However, this approach provides only moderate reliability, since it cannot tolerate the crash of any cluster node.

Another possible approach is to provide redundancy by computing parity in the same way as RAID-5. RAID-5 can tolerate any single self-identifying device failure, but self-identification may not be available in a typical cluster environment. In addition, a small RAID-5 write involves four I/Os: two to pre-read the old data and parity, and two to write the new data and parity [11]. In a loosely coupled system such as a cluster, these four I/Os cause a large delay. Finally, in a distributed system, the parity calculation should not be performed by any single node, to avoid a severe performance bottleneck; instead, it should be distributed across nodes. However, this distribution complicates concurrency control, since multiple nodes may need to read or update shared parity blocks simultaneously.
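The four I/Os of a small RAID-5 write can be illustrated with a minimal in-memory sketch. It is not taken from CEFT-PVFS or from any RAID implementation; the class `ToyStripe`, its methods, and the `io_count` counter are hypothetical names introduced only for illustration, and each simulated block access is counted as one I/O.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


class ToyStripe:
    """Hypothetical in-memory stripe: several data blocks plus one parity block."""

    def __init__(self, data_blocks):
        self.data = [bytes(b) for b in data_blocks]
        # Parity is the XOR of all data blocks in the stripe.
        self.parity = bytes(len(self.data[0]))
        for block in self.data:
            self.parity = xor_blocks(self.parity, block)
        self.io_count = 0  # simulated I/O counter (an assumption of this sketch)

    def small_write(self, index: int, new_data: bytes) -> None:
        """Read-modify-write update of a single data block."""
        old_data = self.data[index]      # I/O 1: pre-read old data
        self.io_count += 1
        old_parity = self.parity         # I/O 2: pre-read old parity
        self.io_count += 1
        # new parity = old parity XOR old data XOR new data
        new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
        self.data[index] = bytes(new_data)  # I/O 3: write new data
        self.io_count += 1
        self.parity = new_parity            # I/O 4: write new parity
        self.io_count += 1

    def reconstruct(self, lost_index: int) -> bytes:
        """Rebuild one lost data block from the parity and the surviving blocks."""
        result = self.parity
        for i, block in enumerate(self.data):
            if i != lost_index:
                result = xor_blocks(result, block)
        return result


stripe = ToyStripe([b"\x01\x02", b"\x10\x20", b"\xff\x00"])
stripe.small_write(1, b"\xaa\xbb")
```

In a cluster, the data block and the parity block would typically reside on different nodes, so each of the four steps would become a network round trip, and two writers updating different data blocks of the same stripe would both need to modify the same parity block.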

Still another possible approach is to use erasure coding, such as Rabin's Information Dispersal Algorithm (IDA) [12] and Reed-Solomon coding [13], to disperse a file into a set of pieces such that any sufficiently large subset allows reconstruction. This approach is usually more space-efficient and reliable than RAID-5 and mirroring. Although erasure coding is extensively used in P2P systems [14], it is not suitable for GB/s-scale cluster file systems, since dispersal and reconstruction require matrix multiplications and multiple disk accesses, which generate a potentially large computational and I/O overhead.

In CEFT-PVFS, we choose mirroring to improve reliability. As storage capacity increases exponentially, storage cost decreases rapidly. By August 2003, the average price of commodity IDE disks had dropped below 0.5 US$/GB. Therefore, it makes sense to "trade" 50% of the storage space for performance and reliability. Compared with parity-based and erasure-coding-based parallel systems, our approach adds the smallest operational overhead, and its recovery process and concurrency control are much simpler. Another benefit of mirroring, which the other redundancy approaches cannot achieve, is that the aggregate read performance can be doubled by doubling the degree of parallelism, that is, by reading data from two mirroring groups simultaneously [15].
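The idea of reading from two mirroring groups at once can be sketched as follows. This is only an illustration under simplifying assumptions, not the scheme CEFT-PVFS actually uses (see [15] for that): each mirroring group is modeled as an in-memory `bytes` object holding a full copy of the file, and the function names `read_range` and `mirrored_read` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def read_range(replica: bytes, offset: int, length: int) -> bytes:
    """Stand-in for a read request served by one mirroring group."""
    return replica[offset:offset + length]


def mirrored_read(primary: bytes, mirror: bytes, offset: int, length: int) -> bytes:
    """Serve one read by fetching each half from a different copy in parallel."""
    half = length // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        # First half comes from the primary group, second half from the mirror.
        first = pool.submit(read_range, primary, offset, half)
        second = pool.submit(read_range, mirror, offset + half, length - half)
        return first.result() + second.result()


data = bytes(range(256))
copy = bytes(data)  # the mirror holds an identical copy
chunk = mirrored_read(data, copy, 10, 101)
```

Because both copies hold the same bytes, the two halves can be served by disjoint sets of servers, which is where the doubled degree of parallelism comes from. Writes gain no such benefit, since every write must reach both copies.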

Yifeng Zhu 2003-10-16