Abstract
Much research has shown that cluster-based servers can substantially increase performance when nodes cooperate to share and globally manage their resources. In this paper, we apply a quantification methodology to show that this performance increase carries a correspondingly substantial cost in availability. Specifically, we show that a sophisticated cluster-based web server gains a factor of 3 in performance when nodes cooperate to balance load and jointly manage their memories, but also suffers a factor of 10 increase in unavailability. We then show how this web server can be augmented with Commercial Off-The-Shelf (COTS) components embodying a small set of high-availability techniques to regain the lost availability. Among other interesting observations, we show that applying multiple high-availability techniques, each implemented independently in its own subsystem, can lead to inconsistent recovery actions. We also show that a novel technique called Fault Model Enforcement can be used to resolve such inconsistencies. Augmenting the server with these techniques led to a final predicted availability of close to 99.99%.
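To make the availability figures above concrete, the following minimal sketch relates an availability percentage to unavailability and to expected downtime per year. It is an illustration only, not part of the quantification methodology: the function names are hypothetical, a year is assumed to average 365.25 days, and the 99.9% baseline in the usage lines is a made-up value chosen purely to show the effect of a factor of 10, not a measurement from this work.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def unavailability(availability_percent: float) -> float:
    """Fraction of time the service is down, given availability in percent."""
    return 1.0 - availability_percent / 100.0


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected minutes of downtime per year at the given availability."""
    return unavailability(availability_percent) * MINUTES_PER_YEAR


def scale_unavailability(availability_percent: float, factor: float) -> float:
    """Availability (in percent) after unavailability is multiplied by `factor`."""
    return 100.0 * (1.0 - factor * unavailability(availability_percent))


# 99.99% availability corresponds to under an hour of downtime per year.
print(round(downtime_minutes_per_year(99.99), 1))

# Hypothetical baseline of 99.9%: a factor of 10 in unavailability costs a full "nine".
print(round(scale_unavailability(99.9, 10), 3))
```

The point of working in unavailability rather than availability is that factors such as the factor of 10 above act multiplicatively on downtime, whereas they appear only as small changes in the availability percentage.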