5 research outputs found
Lessons learned from spatial and temporal correlation of node failures in high performance computers
In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month period. We identify possible types of correlation between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of the correlations between node failures, and enabling failure detection as early as possible. The results of this study are aimed at helping system administrators minimize (or even prevent) the destructive effects of correlated node failures.
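To illustrate what a temporal-correlation analysis of node failures might involve, the following is a minimal sketch in Python. The windowed co-occurrence count, the window size, and all names here are illustrative assumptions, not the paper's actual correlation measure.

```python
from collections import defaultdict

def cooccurring_failures(failures, window=300):
    """Count, for each pair of nodes, failures occurring within
    `window` seconds of each other -- a simple proxy for temporal
    correlation between node failures.

    failures: list of (timestamp_seconds, node_name) tuples
    Returns a dict mapping frozenset({a, b}) -> co-occurrence count.
    """
    events = sorted(failures)  # sort by timestamp
    counts = defaultdict(int)
    for i, (t1, n1) in enumerate(events):
        for t2, n2 in events[i + 1:]:
            if t2 - t1 > window:
                break  # events are sorted; no later event can be closer
            if n1 != n2:
                counts[frozenset((n1, n2))] += 1
    return dict(counts)
```

Pairs of nodes with high co-occurrence counts over many windows would then be candidates for direct correlation.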
Assessing Data Usefulness for Failure Analysis in Anonymized System Logs
System logs are a valuable source of information for analyzing and
understanding system behavior in order to improve performance. Such logs
contain various types of information, including sensitive information.
Sensitive information can either be extracted directly from individual log
entries, obtained by correlating several log entries, or inferred by combining
the (non-sensitive) information contained within system logs with other logs
and/or additional datasets. The analysis of system logs containing sensitive
information compromises data privacy. Therefore, various anonymization
techniques, such as generalization and suppression, have been employed over
the years by data and computing centers to protect the privacy of their users,
their data, and the system as a whole.
Anonymization via generalization and suppression, however, may significantly
decrease data usefulness, thus hindering the intended analysis for
understanding system behavior.
Maintaining a balance between data usefulness and privacy preservation,
therefore, remains an open and important challenge. Irreversible encoding of
system logs using collision-resistant hashing algorithms, such as SHAKE-128, is
a novel approach previously introduced by the authors to mitigate data privacy
concerns. The present work describes a study of the applicability of the
encoding approach from earlier work on the system logs of a production high
performance computing system. Moreover, a metric is introduced to assess the
data usefulness of the anonymized system logs to detect and identify the
failures encountered in the system.Comment: 11 pages, 3 figures, submitted to 17th IEEE International Symposium
on Parallel and Distributed Computin
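For context, irreversible encoding of log tokens with SHAKE-128 can be sketched as follows, using Python's standard hashlib. The field-based tokenization, the digest length, and the function names are assumptions for illustration, not the authors' exact scheme.

```python
import hashlib

def anonymize_token(token: str, digest_bytes: int = 8) -> str:
    """Irreversibly encode a single log token with SHAKE-128.

    SHAKE-128 is an extendable-output function, so digest_bytes controls
    the output length (a trade-off between collision resistance and log
    size). The 8-byte default here is illustrative.
    """
    return hashlib.shake_128(token.encode("utf-8")).hexdigest(digest_bytes)

def anonymize_log_line(line: str, sensitive_fields: set) -> str:
    """Replace whitespace-separated fields at the given positions with
    their SHAKE-128 digests, keeping the rest of the line intact so
    failure patterns remain detectable."""
    fields = line.split()
    encoded = [
        anonymize_token(f) if i in sensitive_fields else f
        for i, f in enumerate(fields)
    ]
    return " ".join(encoded)
```

Because hashing is deterministic, identical tokens map to identical digests, which is what preserves correlations (and hence data usefulness for failure analysis) while hiding the original values.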
Anomaly Detection in High Performance Computers: A Vicinity Perspective
In response to the demand for higher computational power, the number of
computing nodes in high performance computers (HPC) increases rapidly.
Exascale HPC systems are expected to arrive by 2020. With the drastic increase
in the number of HPC system components, a sharp increase in the number of
failures is expected, which in turn threatens the continuous operation of HPC
systems. Detecting failures as early as possible and, ideally, predicting them
is a necessary step toward avoiding interruptions in HPC system operation.
Anomaly detection is a well-known general-purpose approach to failure
detection in computing systems. The majority of existing methods are designed
for specific architectures, require adjustments to the computing system's
hardware and software, need excessive information, or pose a threat to users'
and systems' privacy. This work proposes a node failure detection
mechanism based on a vicinity-based statistical anomaly detection approach
using passively collected and anonymized system log entries. Application of the
proposed approach to system logs collected over 8 months indicates an anomaly
detection precision between 62% and 81%. Comment: 9 pages, submitted to the
18th IEEE International Symposium on Parallel and Distributed Computing.
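The abstract does not detail the statistical method, but a vicinity-based detector can be sketched as follows: each node's log-message count is compared against the counts of its neighbors, and nodes that deviate strongly are flagged. The z-score criterion, the threshold, and all names are assumptions, not the paper's actual mechanism.

```python
import statistics

def detect_anomalous_nodes(counts, vicinity, threshold=3.0):
    """Flag nodes whose log-message count deviates strongly from
    their vicinity.

    counts:    dict mapping node name -> message count in a time window
    vicinity:  dict mapping node name -> list of neighboring node names
    threshold: z-score above which a node is flagged as anomalous
    """
    anomalies = []
    for node, count in counts.items():
        neighbors = [counts[n] for n in vicinity.get(node, []) if n in counts]
        if len(neighbors) < 2:
            continue  # not enough context to judge this node
        mean = statistics.mean(neighbors)
        stdev = statistics.stdev(neighbors)
        if stdev == 0:
            if count != mean:
                anomalies.append(node)  # any deviation from a uniform vicinity
            continue
        if abs(count - mean) / stdev > threshold:
            anomalies.append(node)
    return anomalies
```

Working only on passively collected, anonymized message counts is what keeps such an approach architecture-agnostic and privacy-preserving, in the spirit of the abstract.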