AutoMon. Automatic monitoring and problem detection for distributed systems

Wikstad, Magnus

Authors: Magnus Wikstad
Publication date: 1 January 2016
Publisher: UiT Norges arktiske universitet

Abstract

Detecting faults in distributed systems can be difficult, because abnormalities are not necessarily made evident by warnings or system crashes. This is especially true of subtle faults, such as variations in the performance of a running program. The program itself is not necessarily at fault: the cause may be a different source elsewhere in the cluster that uses a lot of resources (CPU, I/O, etc.) and thereby makes other programs perform worse than in earlier executions. Problems of this kind will not necessarily be detected by regular cluster monitoring tools, as these only look at cluster metrics, or by distributed debuggers, as these only monitor specific programs and thus will not find the cause of degraded performance if it comes from a different source. As distributed systems are increasingly used by people without intimate knowledge of them, being able to quickly inform the user about faults or abnormalities would greatly improve their efficient use of the system. It would also help developers, who could easily obtain their programs' performance data without implementing specific procedures for the task, thus simplifying the development of new distributed software. This thesis investigates whether the system and process information obtainable from each node's operating system is enough to detect abnormal operation. The approach is to build a prototype system that collects this information from the cluster and analyzes the data at runtime to check for faults. The resulting system is capable of collecting large amounts of data from the cluster, storing it, and doing some rudimentary analysis on the data, while leaving most of the cluster's resources free for its computations.
This shows that it is possible to create a low-resource cluster monitoring tool that collects large amounts of system data, at high frequency, from each of the nodes, and analyzes the data.
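
The idea of detecting abnormal operation from operating system information alone can be illustrated with a minimal sketch. It is not taken from the thesis: it assumes a Linux node whose aggregate CPU counters are read from /proc/stat, and it swaps in a simple z-score test against earlier samples as a stand-in for the "rudimentary analysis" the abstract mentions. The function names (parse_cpu_line, cpu_utilization, is_abnormal) and the threshold of 3 standard deviations are hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of Linux /proc/stat into (busy, total) jiffies.

    Assumption: the field order is user, nice, system, idle, iowait, irq,
    softirq, steal, as documented for Linux. Guest fields are skipped because
    they are already included in user and nice.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    total = sum(values)
    return total - idle, total


def cpu_utilization(prev, curr):
    """Fraction of CPU time spent busy between two (busy, total) samples."""
    busy_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    return busy_delta / total_delta if total_delta > 0 else 0.0


def is_abnormal(history, value, threshold=3.0):
    """Flag a sample lying more than `threshold` standard deviations
    from the mean of earlier samples (hypothetical detection rule)."""
    if len(history) < 2:
        return False  # not enough earlier executions to compare against
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Two hand-written samples in /proc/stat format, standing in for real readings.
prev = parse_cpu_line("cpu  100 0 50 850 0 0 0 0 0 0")
curr = parse_cpu_line("cpu  200 0 100 1700 0 0 0 0 0 0")
util = cpu_utilization(prev, curr)

# Utilization observed in earlier intervals on the same node (made-up values).
history = [0.10, 0.12, 0.11, 0.09, 0.10]
spike_flagged = is_abnormal(history, 0.95)
normal_flagged = is_abnormal(history, 0.11)
```

On a real node the two sample lines would come from reading the first line of /proc/stat at two points in time. A per-process variant would need other sources, such as the per-process files under /proc, which this sketch does not cover.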

Available Versions

NORA - Norwegian Open Research Archives

oai:munin.uit.no:10037/9359

Last time updated on 16/12/2017

Munin - Open Research Archive

oai:munin.uit.no:10037/9359

Last time updated on 26/03/2021