Anomaly detection in supercomputers is a very difficult problem due to the
big scale of the systems and the high number of components. The current state
of the art for automated anomaly detection employs Machine Learning methods or
statistical regression models in a supervised fashion, meaning that the
detection tool is trained to distinguish among a fixed set of behaviour classes
(healthy and unhealthy states).
We propose a novel approach for anomaly detection in High Performance
Computing systems based on a Machine (Deep) Learning technique, namely a type
of neural network called autoencoder. The key idea is to train a set of
autoencoders to learn the normal (healthy) behaviour of the supercomputer nodes
and, after training, use them to identify abnormal conditions. This is
different from previous approaches which where based on learning the abnormal
condition, for which there are much smaller datasets (since it is very hard to
identify them to begin with).
We test our approach on a real supercomputer equipped with a fine-grained,
scalable monitoring infrastructure that can provide large amount of data to
characterize the system behaviour. The results are extremely promising: after
the training phase to learn the normal system behaviour, our method is capable
of detecting anomalies that have never been seen before with a very good
accuracy (values ranging between 88% and 96%).Comment: 9 pages, 3 figure