AutoMon. Automatic monitoring and problem detection for distributed systems

Abstract

Detecting faults in distributed systems can be difficult, as abnormalities are not necessarily made evident by warnings or system crashes. This is especially true of subtle faults, such as variations in the performance of a running program. Such a variation is not necessarily the program's own fault; it may be caused by a different source somewhere in the cluster that uses a lot of resources (CPU, I/O, etc.), causing other programs to perform worse than in earlier executions. These types of problems will not necessarily be detected by regular cluster monitoring tools, as these only look at cluster metrics, or by distributed debuggers, as these only monitor specific programs and thus will not find the cause of the degraded performance if it comes from a different source. As distributed systems become more common among users without intimate knowledge of them, being able to quickly inform the user about faults or abnormalities would greatly improve their efficient use of the system. It would also help developers, who could easily obtain their programs' performance data without implementing specific procedures for the task, thus simplifying the development of new distributed software. This thesis investigates whether the system and process information attainable from each node's operating system is enough to detect abnormal operation. This is approached by creating a prototype system that collects this information from the cluster and analyzes the data at runtime to check for faults. The resulting system is capable of collecting large amounts of data from the cluster, storing it, and performing some rudimentary analysis on it, while leaving most of the cluster's resources free for its computations.
This shows that it is possible to create a low-resource cluster monitoring tool that collects large amounts of system data at high frequency from each of the nodes and analyzes the data.

    Similar works