Estimating system availability and reliability
Abstract
Methods for constructing and solving large Markov chain models of computer system availability and reliability are addressed. A set of powerful high-level modeling constructs is discussed that can be used to represent the failure and repair behavior of the components that constitute a system, including important component interactions. If time-independent failure and repair rates are assumed, then a time-homogeneous continuous-time Markov chain can be constructed automatically from the modeling constructs used to describe the system. Since the size of the Markov chain grows exponentially with the number of components modeled, simulation appears to be a practical way to solve models of large systems. However, standard simulation requires very long runs to estimate availability and reliability measures because system failure is a rare event. Therefore, variance reduction techniques that can aid in computing rare-event probabilities quickly are of interest. Among these, the importance sampling technique has been found to be most useful. The modeling language and the simulation methods discussed have been implemented in a program package called the System Availability Estimator (SAVE).
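
To make the rare-event difficulty concrete, the following is a minimal sketch of importance sampling on a toy model. It is not the SAVE implementation or its modeling language. Everything in it is an assumption made for illustration: a system of n identical components with failure rate lam each, a single repair facility with rate mu, system failure when all n components are down, and a change of measure that makes a failure as likely as a repair at every step (the parameter bias). The quantity estimated is the probability that, starting with one component failed, the system fails before it returns to the all-up state. The function names fail_prob, exact_gamma, and is_estimate are hypothetical.

```python
import random


def fail_prob(i, n, lam, mu):
    """Probability that the next event with i components down is a failure.

    Assumed toy model: (n - i) working components each fail at rate lam,
    and one repair facility restores components at rate mu.
    """
    f = (n - i) * lam
    return f / (f + mu)


def exact_gamma(n, lam, mu):
    """Exact P(all n down before all up | start with 1 down).

    Uses the standard hitting-probability formula for a birth-death chain.
    """
    total, prod = 1.0, 1.0
    for j in range(1, n):
        p = fail_prob(j, n, lam, mu)
        prod *= (1.0 - p) / p
        total += prod
    return 1.0 / total


def is_estimate(n, lam, mu, bias, runs, seed=0):
    """Importance sampling estimate of the same probability.

    Each step samples a failure with probability `bias` instead of the true
    (very small) probability, and the likelihood ratio `weight` corrects
    for the change of measure so that the estimate stays unbiased.
    """
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(runs):
        state, weight = 1, 1.0
        while 0 < state < n:
            p = fail_prob(state, n, lam, mu)
            if rng.random() < bias:
                weight *= p / bias                  # sampled a failure
                state += 1
            else:
                weight *= (1.0 - p) / (1.0 - bias)  # sampled a repair
                state -= 1
        if state == n:                              # system failure reached
            acc += weight
    return acc / runs


if __name__ == "__main__":
    n, lam, mu = 4, 0.001, 1.0
    print("exact              :", exact_gamma(n, lam, mu))
    print("importance sampling:", is_estimate(n, lam, mu, bias=0.5, runs=50000))
```

With these assumed parameters the target probability is on the order of 10^-9, so standard simulation with the same number of runs would almost surely observe no system failures at all, whereas the biased runs reach the failure state often and the weights restore the correct scale.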