Reliability and Availability Analysis

In this section, a Markov-chain model is constructed to analyze the reliability and availability of the four duplication protocols, and to compare their reliability with that of the PVFS.

Markov models have been used to analyze the reliability of RAID-1 in Ref. [26] [27] [28] [29] [30]. However, none of these models distinguishes the primary disk failures from the backup disk failures, i.e., they assume that all the data on a failure disk can be recovered from its mirror disk. This assumption holds true in a tightly coupled array of disks, such as RAID, because data on primary and backup disks is always kept consistent with the help of hardware. However, this assumption may not be true in our loosely coupled distributed system, such as clusters, in which the failure of a primary server and a backup server have different implications. For example, in Protocol 1, if a primary server fails before the completion of duplication, the backup server will lose the data that has not been duplicated. But the system does not lose any data if only a backup node fails. Therefore, in our system, the primary and the backup server nodes are not symmetrical in terms of their failure implications and the classic RAID model can not be used. In addition to being able to reflect the asymmetry, our model should be general enough so that the reliability of all four protocols can be derived directly. In the following sections, we take Protocol 1 as an example to show how the Markov-chain model is developed and how it can be applied to other protocols by appropriately changing some relevant definitions.

Tab. III presents some basic notations, while others will be introduced appropriately during the discussion.

**Table III:** Notation
	total number of nodes in one group
	total number of Markovian states
,	index of Markovian states, $1\leq i, j\leq S$
,	number of failed nodes, $0 \leq m, n \leq N$
$\lambda$	failure rate per node
$\lambda_s$	failure rate of the network switch
$\lambda_w$	arrival rate of write requests per server
$\mu$	repair rate per node
$\mu_d$	duplication rate
$MTTF_{node} = \frac{1}{\lambda}$	mean time to failure per node
$MTTF_{switch} = \frac{1}{\lambda_s}$	mean time to failure per switch
$MTTW = \frac{1}{\lambda_w}$	mean time to write
$MTTR_{node} = \frac{1}{\mu}$	mean time to repair per node
$MTTD = \frac{1}{\mu_d}$	mean time to duplicate
	mean time to data loss
	Markovian fundamental matrix
$Q=[q_{ij}]_{S \times S}$	Markovian truncated matrix
	probability that a primary node is consistent with its mirror node
$\hat{P}(mPnB)$	probability of the system being still functional when primary nodes and backup nodes have failed
${n \choose k} = \frac{n!}{(n-k)!k!}$	binomial coefficient