
Data Recovery

After a failed node reboots, all the data on that node must be recovered. The recovery process in CEFT-PVFS is simple and fast, since all data written after the checkpoint can be read directly from the mirrored server without any calculation. Consistency must nevertheless be carefully enforced to eliminate any discrepancy between the primary and the backup caused by client write requests that arrive during the duplication process. A simple recovery method is to lock the primary server until the duplication has finished. However, this makes the I/O service unavailable for write requests (though still available for read requests) during the recovery.

In the current implementation of CEFT-PVFS, the recovery process instead uses "copy-on-write" on-line backup techniques [19]. The functional server records the destination file name of every I/O write request that occurs during the recovery period and puts it on a waiting list. After it has finished duplicating the data written since the checkpoint to the rebooted server, the functional server duplicates the files on the waiting list to the rebooted server again, eliminating any inconsistency introduced during the recovery. The recovery process completes once no files remain on the waiting list. On the functional server, the recovery process holds a higher priority than the I/O service process to guarantee that the recovery will eventually finish.
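The waiting-list procedure described above can be sketched in a few lines of Python. This is only an illustrative, in-memory model and not the CEFT-PVFS code: the class `FunctionalServer`, its methods `write` and `recover`, the `on_copy` hook, and the use of a plain dictionary for the rebooted server are all assumptions made for the example. The sketch also assumes that the re-duplication is repeated until the waiting list is empty, and it does not model process priorities.

```python
import threading


class FunctionalServer:
    """Hypothetical in-memory stand-in for the surviving (functional) server."""

    def __init__(self, files):
        self.files = dict(files)    # file name -> contents
        self.waiting = set()        # names written while recovery is running
        self.recovering = False
        self.lock = threading.Lock()

    def write(self, name, data):
        """Serve a client write; remember the file name if recovery is running."""
        with self.lock:
            self.files[name] = data
            if self.recovering:
                self.waiting.add(name)

    def recover(self, rebooted, changed_since_checkpoint, on_copy=None):
        """Copy files to `rebooted` (a dict) until the waiting list is empty.

        `on_copy` is an assumed test hook, called after each file is copied,
        that lets a caller simulate client writes arriving mid-recovery.
        Returns the number of duplication passes that were needed.
        """
        with self.lock:
            self.recovering = True
        pending = set(changed_since_checkpoint)
        passes = 0
        while True:
            for name in sorted(pending):
                with self.lock:
                    data = self.files[name]
                rebooted[name] = data
                if on_copy is not None:
                    on_copy(name)
            passes += 1
            with self.lock:
                if not self.waiting:
                    # Checked under the same lock as write(), so no write
                    # can slip in between the check and the end of recovery.
                    self.recovering = False
                    return passes
                # Files touched during this pass must be duplicated again.
                pending, self.waiting = self.waiting, set()
```

In this model, a write that arrives while a file is being duplicated only adds the file name to the waiting list, so clients are never blocked for the whole recovery. The cost is that recovery may need more than one pass, which is why the text gives the recovery process a higher priority than the I/O service.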


Yifeng Zhu 2003-10-16