Abstract
We consider the impact of different communication architectures on the performability (performance + availability) of cluster-based servers. In particular, we use a combination of fault-injection experiments and analytic modeling to evaluate the performability of two popular communication protocols, TCP and VIA, as the intra-cluster communication substrate of a sophisticated Web server. Our analysis leads to several interesting conclusions, the most surprising of which is that, under the same fault load, VIA-based servers deliver greater availability than TCP-based servers. If we assume higher fault rates for VIA-based servers because the underlying technology is less mature and the programming model more complex, we find that packet errors or application faults would have to occur in VIA-based servers at approximately 4 times the rate in TCP-based servers before the performability of the two becomes the same. We also use results from the study to make suggestions for the design of a high-performance and robust communication layer for highly available cluster-based servers. More specifically, we argue that such a layer should use messaging (not a byte stream), single-copy transfers, and pre-allocated channel resources, and that it should match the network fabric's fault model.