Abstract
We propose a two-phase methodology for quantifying the performability (performance + availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to measure the impact of faults on the server’s performance. In the second phase, evaluators use an analytical model to combine an expected fault load with the measurements from the first phase to assess the server’s performability. Using this model, evaluators can study the server’s sensitivity to different design decisions, fault rates, and other environmental factors. To demonstrate our methodology, we study the performability of four versions of the PRESS web server against five classes of faults. We use Mendosus, a new fault-injection and network emulation infrastructure, to carry out the first phase of our methodology. We then use our model to quantify the gain or loss in performability as PRESS was modified to increase performance. We also use our model to study the impact on PRESS’s performability of reducing live operator support and of adding RAIDs.