Towards mistake-aware systems

Abstract

The complexity of today’s enterprise computer systems poses a major challenge to system administrators, with a multitude of inter-related software components distributed in non-obvious ways across multiple computers. Not surprisingly, several studies have shown that human mistakes are an important source of outages and incorrect system behavior. To make matters worse, as computers permeate all aspects of our lives, higher demands are placed on the availability and correct operation of many computer systems. Given this state of affairs, we envisioned that systems must gracefully tolerate human mistakes made during system administration and operation. To realize our vision, we first studied human operator behavior and mistakes by means of live experiments with volunteers and a survey of database administrators. The results of this study led us to investigate a few techniques for dealing with mistakes, namely, validation of operator actions and model-based validation. Our research efforts culminate in a radically different approach, which we call mistake-aware systems management. We evaluate the effectiveness of validation of operator actions applied to databases, model-based validation, and mistake-aware systems management through a combination of live operator experiments, operator-emulation experiments, and mistake-injection experiments in a realistic prototype three-tier Internet service.

Ph.D.
Includes bibliographical references
Includes vita
by Fábio Abreu Dias de Oliveir
