After a failed node reboots, all the data on that node
must be recovered. The recovery process in CEFT-PVFS is simple
and fast since all the data after the checkpoint can be read
directly from its mirror server without any computation. However,
consistency must be carefully enforced to eliminate any
discrepancy between the primary and the backup caused by write
requests from clients during the duplication process. A simple
recovery method is to lock the primary server until the
duplication has finished. However, this makes the I/O service
unavailable for write requests (though still available for read
requests) during the recovery. In the current implementation of
CEFT-PVFS, the recovery process instead uses ``copy-on-write''
on-line backup techniques [19]. The functional server
will record the destination file names of every I/O write request
that happens during the recovery period and put them in a waiting
list. After finishing the duplication of the data after the
checkpoint from the functional server to the rebooted server, the
functional server will duplicate the files on the waiting list
again to the rebooted server to eliminate the possible
inconsistency caused during the recovery. This step repeats
until the waiting list is empty, at which point the recovery
process completes. On the functional server, the recovery process
holds a higher priority than the I/O service process to guarantee
that the recovery will eventually finish.
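
The waiting-list scheme described above can be sketched as a small simulation. This is only an illustrative sketch, not the CEFT-PVFS code: the class `FunctionalServer`, the function `recover`, the `on_copy` hook, and the use of in-memory dictionaries in place of real files are all assumptions made for the example.

```python
class FunctionalServer:
    """Hypothetical in-memory stand-in for the surviving (functional) server."""

    def __init__(self, files):
        self.files = dict(files)    # file name -> contents
        self.recovering = False
        self.waiting_list = set()   # names written while recovery is running

    def write(self, name, data):
        """Serve a client write; record the file name if recovery is running."""
        self.files[name] = data
        if self.recovering:
            self.waiting_list.add(name)


def recover(functional, rebooted_files, changed_since_checkpoint, on_copy=None):
    """Duplicate files to the rebooted server until the waiting list is empty.

    `on_copy` is a simulation hook (an assumption of this sketch) that is
    called after each file copy so that a test can inject client writes.
    Returns the number of duplication passes that were needed.
    """
    functional.recovering = True
    pending = set(changed_since_checkpoint)
    passes = 0
    while pending:
        passes += 1
        for name in sorted(pending):
            rebooted_files[name] = functional.files[name]
            if on_copy is not None:
                on_copy(name)
        # Files written during this pass must be duplicated again.
        pending = functional.waiting_list
        functional.waiting_list = set()
    functional.recovering = False
    return passes


# Example: files "a" and "b" changed after the checkpoint; a client then
# writes "a" and "c" while the first duplication pass is in progress.
server = FunctionalServer({"a": "a1", "b": "b1", "c": "c0"})
rebooted = {"a": "a0", "b": "b0", "c": "c0"}
fired = []


def client_write(name):
    # Inject the client writes once, right after "a" is first copied.
    if name == "a" and not fired:
        fired.append(name)
        server.write("a", "a2")
        server.write("c", "c1")


passes = recover(server, rebooted, ["a", "b"], on_copy=client_write)
print(passes, rebooted == server.files)
```

In this sketch the loop ends only when a pass completes without any new writes being recorded. In a real system that would rely on the recovery process outpacing incoming writes, which is the role the text assigns to its higher priority on the functional server.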
Yifeng Zhu
2003-10-16