Next: Acknowledgments Up: Design, Implementation and Performance Previous: Optimization of the Tradeoffs

Conclusion and Future Work

In this paper, we demonstrate that it is feasible and scalable to build a highly reliable storage system with multi-terabyte capacity in a cluster at no additional hardware cost, while maintaining I/O performance high enough to alleviate the I/O bottleneck for parallel applications.

We design four duplication protocols to provide fault tolerance in the parallel file system. Our experimental results and analysis show that these protocols improve reliability over the original PVFS by a factor of 40, while degrading peak write performance by around $33\%$ in the best case and around $58\%$ in the worst case, compared with PVFS using the same total number of servers. The protocols strike different balances between reliability and write performance: a protocol with higher write bandwidth generally offers lower reliability. Between the server-driven protocols, the asynchronous one achieves write performance $27.7\%$ higher than the synchronous one, at the expense of an average $5\%$ degradation in reliability. Similarly, between the client-driven protocols, the asynchronous one achieves write performance $14.7\%$ higher than the synchronous one, at the cost of an average $3.3\%$ reduction in reliability. We also propose a hybrid protocol that optimizes the tradeoff between reliability and write performance. In this hybrid protocol, if the total number of jobs of a data-intensive application is less than the number of servers in one storage group, synchronous server duplication is used to mirror the data; otherwise, asynchronous client duplication is preferred.
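The selection rule of the hybrid protocol can be sketched in a few lines. This is only an illustration of the rule as stated in the text, not the implementation used in the system; the names `Protocol`, `choose_protocol`, `num_jobs`, and `servers_per_group` are hypothetical and chosen for this example.

```python
from enum import Enum


class Protocol(Enum):
    # Hypothetical labels for the two protocols the hybrid scheme switches between.
    SYNC_SERVER = "synchronous server duplication"
    ASYNC_CLIENT = "asynchronous client duplication"


def choose_protocol(num_jobs: int, servers_per_group: int) -> Protocol:
    """Pick a duplication protocol following the hybrid rule in the text.

    num_jobs: total number of jobs of the data-intensive application.
    servers_per_group: number of servers in one storage group.
    """
    if num_jobs < 0 or servers_per_group <= 0:
        raise ValueError("num_jobs must be >= 0 and servers_per_group must be > 0")
    # Fewer jobs than servers in one group: mirror with synchronous
    # server duplication.
    if num_jobs < servers_per_group:
        return Protocol.SYNC_SERVER
    # Otherwise: asynchronous client duplication is preferred.
    return Protocol.ASYNC_CLIENT
```

For example, under these assumptions `choose_protocol(4, 8)` selects synchronous server duplication, while `choose_protocol(8, 8)` and `choose_protocol(32, 8)` select asynchronous client duplication.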

None of the proposed protocols employs costly but more reliable techniques such as ``forced writes'' to the disks; each can therefore lose data if a disk or node fails while data is being copied from the I/O buffer (cache) on the processor to the disk. In future work, we will investigate how ``forced writes'' affect the tradeoff between reliability and write performance.
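To make the notion of a forced write concrete, the sketch below shows one common way an application can force buffered data to stable storage on a POSIX-like system, using `fsync`. It is an assumption that this corresponds to what a forced write would mean here; the helper name `forced_write` is hypothetical and is not part of PVFS or of the protocols described in this paper.

```python
import os


def forced_write(path: str, data: bytes) -> None:
    """Write data and block until the OS reports it has reached the device.

    Illustrative only: a plain write may return while the data is still
    in a memory buffer, which is the window of possible loss described above.
    """
    with open(path, "wb") as f:
        f.write(data)
        # Push Python's user-space buffer down to the operating system.
        f.flush()
        # Ask the OS to flush its cache for this file to the storage device.
        os.fsync(f.fileno())
```

The extra synchronization is what makes this technique costly: each call waits for the device rather than returning as soon as the data is buffered.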

Yifeng Zhu 2003-10-16