An attractive approach to alleviating the I/O bottleneck in clusters is to utilize the commodity disks that already exist as an integral part of each cluster node. In a large cluster, such disks collectively provide terabyte-scale capacity, which may satisfy the high I/O requirements of many data-intensive scientific applications if access to it is efficient and reliable. In fact, with the emergence of high-bandwidth, low-latency networks such as Myrinet and Gigabit Ethernet, these independent storage devices can be connected together to potentially deliver high-performance and scalable storage.
Parallel file systems can, in theory, significantly improve the performance of I/O operations in clusters. One major concern with this approach is fault tolerance, or the lack thereof. Assume that the Mean Time To Failure (MTTF) of a disk is three years and that all other components of a cluster, such as the network, memory, processors and software, are fault-free. If the failures of storage nodes are independent, the MTTF of a parallel file system with 128 server nodes is then reduced to around nine days. The MTTF is reduced further when failures of the other components are considered. As with disk arrays [1], parallel file systems without fault tolerance are too unreliable to be useful.
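The nine-day figure can be reproduced with a short calculation. The sketch below is illustrative only and rests on assumptions that the text does not state explicitly: disk lifetimes are taken to be exponentially distributed (constant failure rate), each server node is taken to hold one disk, and the file system is taken to fail as soon as any one disk fails, so that the system failure rate is the sum of the per-disk rates. The function name `system_mttf_days` and the 365-day year are choices made for this example.

```python
def system_mttf_days(disk_mttf_years: float, num_servers: int,
                     days_per_year: float = 365.0) -> float:
    """MTTF (in days) of a system that fails when any one of its disks fails.

    Assumes independent, exponentially distributed disk lifetimes, so the
    system failure rate is num_servers times the single-disk failure rate.
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")
    return disk_mttf_years * days_per_year / num_servers


if __name__ == "__main__":
    # Three-year disk MTTF, 128 server nodes, as in the text.
    print(f"{system_mttf_days(3.0, 128):.2f} days")
```

Under these assumptions the result is about 8.55 days, which is consistent with the figure of around nine days quoted above.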
In this paper, we incorporate fault tolerance into a parallel file system by mirroring. More specifically, we present the design, implementation and performance evaluation of a RAID-10 style, cost-effective and fault-tolerant parallel virtual file system (CEFT-PVFS), an extension of PVFS [2]. Analytical modelling results based on a Markov process show that CEFT-PVFS can improve the reliability of PVFS by a factor of more than 40. Because mirroring degrades write performance by doubling the data flow, four mirroring protocols with different write access schemes are designed to achieve different tradeoffs between reliability gain and performance degradation. The write bandwidths of these four protocols are measured in a 128-node cluster, while their reliability is evaluated by a new analytical model developed in this paper. Finally, a hybrid mirroring protocol is proposed to optimize the balance between write performance (bandwidth) and reliability.
The rest of this paper is organized as follows. We first discuss related work in Section II. The design and implementation of CEFT-PVFS are then presented in detail in Section III. Section IV describes the four mirroring protocols, and Section V shows the write performance of these protocols under a microbenchmark. In Section VI, a Markov-chain model is constructed to analyze the reliability and availability of these protocols. Finally, Section VII presents our conclusions and describes possible future work.