Online Fault Classification in HPC Systems through Machine Learning
As High-Performance Computing (HPC) systems strive towards the exascale goal,
studies suggest that they will experience excessive failure rates. For this
reason, detecting and classifying faults in HPC systems as they occur and
initiating corrective actions before they can transform into failures will be
essential for continued operation. In this paper, we propose a fault
classification method for HPC systems based on machine learning that has been
designed specifically to operate with live streamed data. We cast the problem
and its solution within realistic operating constraints of online use. Our
results show that almost perfect classification accuracy can be reached for
different fault types with low computational overhead and minimal delay. We
have based our study on a local dataset, which we make publicly available, that
was acquired by injecting faults into an in-house experimental HPC system.
Comment: Accepted for publication at the Euro-Par 2019 conference
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
approach for large-scale data analysis. DL algorithms are computationally
expensive - even distributed DL implementations which use MPI require days of
training (model learning) time on commonly studied datasets. Long running DL
applications become susceptible to faults - requiring development of a fault
tolerant system infrastructure, in addition to fault tolerant DL algorithms.
This raises an important question: What is needed from MPI for designing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data and hybrid); a need (or lack thereof) for check-pointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation
using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault tolerant DL implementation
using OpenMPI-based ULFM.
Anomaly Detection using Autoencoders in High Performance Computing Systems
Anomaly detection in supercomputers is a very difficult problem due to the
large scale of the systems and the high number of components. The current state
of the art for automated anomaly detection employs Machine Learning methods or
statistical regression models in a supervised fashion, meaning that the
detection tool is trained to distinguish among a fixed set of behaviour classes
(healthy and unhealthy states).
We propose a novel approach for anomaly detection in High Performance
Computing systems based on a Machine (Deep) Learning technique, namely a type
of neural network called autoencoder. The key idea is to train a set of
autoencoders to learn the normal (healthy) behaviour of the supercomputer nodes
and, after training, use them to identify abnormal conditions. This is
different from previous approaches which were based on learning the abnormal
condition, for which there are much smaller datasets (since it is very hard to
identify them to begin with).
We test our approach on a real supercomputer equipped with a fine-grained,
scalable monitoring infrastructure that can provide a large amount of data to
characterize the system behaviour. The results are extremely promising: after
the training phase to learn the normal system behaviour, our method is capable
of detecting anomalies that have never been seen before with very good
accuracy (values ranging between 88% and 96%).
Comment: 9 pages, 3 figures
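The idea in the abstract above (learn only healthy behaviour, then flag samples that reconstruct poorly) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single linear autoencoder fitted in closed form via SVD (equivalent to PCA) in place of the set of neural autoencoders the abstract describes, and it uses synthetic data as a stand-in for node metrics. All function names, the 6-feature layout, and the 0.99 threshold quantile are hypothetical choices made for illustration.

```python
import numpy as np


def fit_linear_autoencoder(X_healthy, n_hidden):
    """Fit encoder weights on healthy samples only (closed form via SVD)."""
    mean = X_healthy.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_healthy - mean, full_matrices=False)
    W = Vt[:n_hidden].T  # shape (n_features, n_hidden); the decoder is W.T
    return mean, W


def reconstruction_error(X, mean, W):
    """Per-sample mean squared error between the input and its reconstruction."""
    Xc = X - mean
    X_hat = (Xc @ W) @ W.T  # encode, then decode
    return np.mean((Xc - X_hat) ** 2, axis=1)


def fit_threshold(X_healthy, mean, W, quantile=0.99):
    """Pick the alarm threshold from the error distribution of healthy data."""
    return float(np.quantile(reconstruction_error(X_healthy, mean, W), quantile))


def is_anomalous(X, mean, W, threshold):
    """Flag samples whose reconstruction error exceeds the healthy threshold."""
    return reconstruction_error(X, mean, W) > threshold


# Synthetic stand-in for node metrics (assumption): healthy samples have
# 6 correlated features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 6))


def sample_healthy(n):
    return rng.normal(size=(n, 2)) @ mixing + 0.05 * rng.normal(size=(n, 6))


X_train = sample_healthy(2000)
X_test_healthy = sample_healthy(500)
X_test_faulty = rng.normal(size=(500, 6))  # feature correlations are broken

mean, W = fit_linear_autoencoder(X_train, n_hidden=2)
threshold = fit_threshold(X_train, mean, W)

print("false-alarm rate:", is_anomalous(X_test_healthy, mean, W, threshold).mean())
print("detection rate:  ", is_anomalous(X_test_faulty, mean, W, threshold).mean())
```

No faulty samples are used during fitting, which mirrors the abstract's point that labeled abnormal data is scarce. A nonlinear autoencoder would replace the SVD step with a trained network, while the thresholding logic could stay the same.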
An Explainable Model for Fault Detection in HPC Systems
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwanted ways. Identifying broken components is a daunting task for system administrators. Hence an automated tool would be a boon for the system's resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that takes advantage of holistic data centre monitoring, system administrator node status labeling and an explainable model for fault detection in supercomputing nodes. The proposed model aims at classifying the different states of the computing nodes thanks to the labeled data describing the supercomputer behaviour, data which is typically collected by system administrators but not integrated into holistic monitoring infrastructure for data centre automation. In comparison with other methods, the one proposed here is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.
- …