ALBADross: active learning based anomaly diagnosis for production HPC systems
Sandia National Laboratories. Accepted manuscript.
Anomaly Detection using Autoencoders in High Performance Computing Systems
Anomaly detection in supercomputers is a very difficult problem due to the large scale of the systems and the high number of components. The current state
of the art for automated anomaly detection employs Machine Learning methods or
statistical regression models in a supervised fashion, meaning that the
detection tool is trained to distinguish among a fixed set of behaviour classes
(healthy and unhealthy states).
We propose a novel approach for anomaly detection in High Performance
Computing systems based on a Machine (Deep) Learning technique, namely a type
of neural network called autoencoder. The key idea is to train a set of
autoencoders to learn the normal (healthy) behaviour of the supercomputer nodes
and, after training, use them to identify abnormal conditions. This is
different from previous approaches, which were based on learning the abnormal
condition, for which there are much smaller datasets (since it is very hard to
identify them to begin with).
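The abstract does not come with code, but the core idea (fit a model to healthy telemetry only, then flag samples it reconstructs poorly) can be sketched briefly. The example below is a hypothetical illustration, not the paper's implementation: it uses scikit-learn's MLPRegressor as a stand-in autoencoder on synthetic per-node metrics, and the 99th-percentile threshold on healthy reconstruction error is an assumed choice.

```python
# Minimal sketch (not the paper's code): train a small autoencoder on
# "healthy" node telemetry and flag samples whose reconstruction error
# exceeds a threshold learned from the healthy data alone.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical telemetry: rows = samples, cols = per-node metrics
# (e.g. CPU load, memory usage, temperature). Real data would come
# from the monitoring infrastructure.
healthy = rng.normal(0.0, 1.0, size=(5000, 16))
suspect = rng.normal(0.0, 1.0, size=(100, 16))
suspect[::10] += 4.0          # inject a few obvious outliers

scaler = StandardScaler().fit(healthy)
X = scaler.transform(healthy)

# An MLP trained to reproduce its input acts as a simple autoencoder;
# the narrow hidden layer forces a compressed representation.
ae = MLPRegressor(hidden_layer_sizes=(8, 4, 8), max_iter=500, random_state=0)
ae.fit(X, X)

def reconstruction_error(model, data):
    recon = model.predict(data)
    return np.mean((data - recon) ** 2, axis=1)

# Threshold chosen on healthy data only (assumption: 99th percentile).
threshold = np.percentile(reconstruction_error(ae, X), 99)
errors = reconstruction_error(ae, scaler.transform(suspect))
print("anomalous samples:", int(np.sum(errors > threshold)))
```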
We test our approach on a real supercomputer equipped with a fine-grained,
scalable monitoring infrastructure that can provide a large amount of data to
characterize the system behaviour. The results are extremely promising: after
the training phase to learn the normal system behaviour, our method is capable
of detecting anomalies that have never been seen before with very good accuracy (values ranging between 88% and 96%).
Improving efficiency and resilience in large-scale computing systems through analytics and data-driven management
Applications running in large-scale computing systems such as high performance computing (HPC) or cloud data centers are essential to many aspects of modern society, from weather forecasting to financial services. As the number and size of data centers increase with the growing computing demand, scalable and efficient management becomes crucial. However, data center management is a challenging task due to the complex interactions between applications, middleware, and hardware layers such as processors, network, and cooling units.
This thesis claims that to improve robustness and efficiency of large-scale computing systems, significantly higher levels of automated support than what is available in today's systems are needed, and this automation should leverage the data continuously collected from various system layers. Towards this claim, we propose novel methodologies to automatically diagnose the root causes of performance and configuration problems and to improve efficiency through data-driven system management.
We first propose a framework to diagnose software and hardware anomalies that cause undesired performance variations in large-scale computing systems. We show that by training machine learning models on resource usage and performance data collected from servers, our approach successfully diagnoses 98% of the injected anomalies at runtime in real-world HPC clusters with negligible computational overhead.
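As a rough illustration of this kind of data-driven diagnosis (not the thesis framework itself), the sketch below extracts simple statistical features from resource-usage windows and trains a random forest to distinguish hypothetical anomaly classes; the feature set, the class names, and the synthetic data are all assumptions.

```python
# Illustrative sketch only: classify the type of anomaly affecting a node
# from statistical features of its resource-usage time series, using a
# random forest. Anomaly classes ("drift", "contention") are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def window_features(ts):
    """Summarize one telemetry window into a fixed-length feature vector."""
    return np.array([ts.mean(), ts.std(), ts.min(), ts.max(),
                     np.percentile(ts, 95)])

# Synthetic stand-in data: 600 windows of 60 samples each, three classes.
labels = rng.integers(0, 3, size=600)
windows = rng.normal(0.0, 1.0, size=(600, 60))
windows[labels == 1] += np.linspace(0, 3, 60)   # drifting usage ("drift")
windows[labels == 2] *= 3.0                     # high variance ("contention")

X = np.array([window_features(w) for w in windows])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```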
We then introduce an analytics framework to address another major source of performance anomalies in cloud data centers: software misconfigurations. Our framework discovers and extracts configuration information from cloud instances such as containers or virtual machines. This is the first framework to provide comprehensive visibility into software configurations in multi-tenant cloud platforms, enabling systematic analysis for validating the correctness of software configurations.
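A very small sketch of the discovery step described above, under assumed conventions: it walks a mounted instance filesystem, treats files with common configuration extensions as configuration files, and extracts key/value pairs for later validation. The paths, extensions, and parsing rules are illustrative guesses, not the framework's actual logic.

```python
# Hypothetical sketch (not the thesis framework): walk a container's mounted
# filesystem, collect files that look like configuration, and extract
# key=value pairs for later validation.
import os
import re

CONFIG_PATTERN = re.compile(r"^\s*([A-Za-z_][\w.-]*)\s*[=:]\s*(.+?)\s*$")

def extract_configs(root):
    """Return {file_path: {key: value}} for config-like files under root."""
    found = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".conf", ".cfg", ".ini", ".properties")):
                continue
            path = os.path.join(dirpath, name)
            entries = {}
            try:
                with open(path, errors="ignore") as fh:
                    for line in fh:
                        if line.lstrip().startswith(("#", ";")):
                            continue  # skip comments
                        m = CONFIG_PATTERN.match(line)
                        if m:
                            entries[m.group(1)] = m.group(2)
            except OSError:
                continue
            if entries:
                found[path] = entries
    return found

if __name__ == "__main__":
    # e.g. an instance filesystem mounted at /mnt/instance (assumption)
    for path, kv in extract_configs("/mnt/instance/etc").items():
        print(path, len(kv), "settings")
```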
This thesis also contributes to the design of robust and efficient system management methods that leverage continuously monitored resource usage data. To improve performance under power constraints, we propose a workload- and cooling-aware power budgeting algorithm that distributes the available power among servers and cooling units in a data center, achieving up to 21% improvement in throughput per Watt compared to the state-of-the-art. Additionally, we design a network- and communication-aware HPC workload placement policy that reduces communication overhead by up to 30% in terms of hop-bytes compared to existing policies.
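For readers unfamiliar with the hop-bytes metric mentioned above, the following sketch computes it for two hypothetical placements: each pair of communicating tasks contributes the bytes it exchanges multiplied by the hop distance between the nodes the tasks are placed on. The chain topology and traffic matrix are made-up examples, not data from the thesis.

```python
# Small sketch of the hop-bytes metric: for every pair of communicating
# tasks, multiply the bytes they exchange by the number of network hops
# between the nodes they are placed on, and sum.

def hops(node_a, node_b):
    """Hop distance on a hypothetical linear (chain) network topology."""
    return abs(node_a - node_b)

def hop_bytes(placement, traffic):
    """placement: task -> node; traffic: {(task_i, task_j): bytes}."""
    return sum(b * hops(placement[i], placement[j])
               for (i, j), b in traffic.items())

traffic = {(0, 1): 10_000, (1, 2): 5_000, (0, 3): 2_000}
compact = {0: 0, 1: 0, 2: 1, 3: 1}     # tasks packed on nearby nodes
scattered = {0: 0, 1: 3, 2: 6, 3: 9}   # tasks spread across the machine

print("compact placement:  ", hop_bytes(compact, traffic), "hop-bytes")
print("scattered placement:", hop_bytes(scattered, traffic), "hop-bytes")
```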
An Explainable Model for Fault Detection in HPC Systems
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for system resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that combines holistic data centre monitoring, node status labels provided by system administrators, and an explainable model for fault detection in supercomputing nodes. The proposed model classifies the different states of the computing nodes using labeled data describing the supercomputer's behaviour, data which is typically collected by system administrators but not integrated into the holistic monitoring infrastructure used for data centre automation. In comparison with other methods, the one proposed here is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.
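The abstract does not name the classifier, so the following is only a generic illustration of an explainable fault-detection model: a shallow decision tree trained on synthetic, administrator-style node labels, whose learned rules can be printed and inspected. Feature and class names are hypothetical.

```python
# Generic illustration of an explainable node-state classifier (not the
# paper's model): a shallow decision tree whose decision rules are readable.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
features = ["cpu_load", "mem_used", "fan_speed", "core_temp"]

# Synthetic stand-in for administrator-labeled node states.
X = rng.normal(0.0, 1.0, size=(1000, len(features)))
y = np.where(X[:, 3] > 1.0, "overheating",
             np.where(X[:, 1] > 1.5, "memory_pressure", "healthy"))

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned rules themselves serve as the explanation of each prediction.
print(export_text(tree, feature_names=features))
```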
Anomaly Detection and Anticipation in High Performance Computing Systems
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly becoming larger and more complex, and so are the issues concerning their maintenance. Luckily, many current HPC systems are endowed with data monitoring infrastructures that characterize the system state, and whose data can be used to train Deep Learning (DL) anomaly detection models, a very popular research area. However, the lack of labels describing the state of the system is a widespread issue, as annotating data is a costly task that generally falls on human system administrators and thus does not scale toward exascale. In this article we investigate the possibility of extracting labels from a service monitoring tool (Nagios) currently used by HPC system administrators to flag nodes undergoing maintenance operations. This allows data collected by a fine-grained monitoring infrastructure to be annotated automatically; the labelled data is then used to train and validate a DL model for anomaly detection. We conduct the experimental evaluation on a tier-0 production supercomputer hosted at CINECA, Bologna, Italy. The results reveal that the DL model can accurately detect real failures and, moreover, can predict the onset of anomalies by systematically anticipating the actual labels (i.e., the moment when system administrators realize that an anomalous event has happened); the average advance time computed on historical traces is around 45 minutes. The proposed technology can be easily scaled toward exascale systems to ease their maintenance.
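A minimal sketch of the labelling idea, assuming a simplified representation of the monitoring tool's downtime records (the real Nagios export format is not shown in the abstract): each telemetry sample is marked anomalous if its timestamp falls inside a maintenance window flagged for its node.

```python
# Minimal sketch: label each telemetry sample of a node as "anomaly" if its
# timestamp falls inside a downtime window flagged by the service monitoring
# tool, and "normal" otherwise. Node names and windows are made up.
from datetime import datetime

# Hypothetical downtime windows extracted from the service monitoring tool.
downtimes = {
    "node042": [(datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 12, 30))],
}

def label_sample(node, timestamp):
    """Label one telemetry sample using the downtime windows of its node."""
    for start, end in downtimes.get(node, []):
        if start <= timestamp <= end:
            return "anomaly"
    return "normal"

samples = [
    ("node042", datetime(2021, 3, 1, 9, 55)),
    ("node042", datetime(2021, 3, 1, 11, 15)),
    ("node007", datetime(2021, 3, 1, 11, 15)),
]
for node, ts in samples:
    print(node, ts.isoformat(), "->", label_sample(node, ts))
```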
Automating telemetry- and trace-based analytics on large-scale distributed systems
Large-scale distributed systems---such as supercomputers, cloud computing platforms,
and distributed applications---routinely suffer from slowdowns and crashes due to
software and hardware problems, resulting in reduced efficiency and wasted
resources. These large-scale systems typically deploy monitoring or tracing
systems that gather a variety of statistics about the state of the hardware
and the software. State-of-the-art methods either analyze this data manually,
or design unique automated methods for each specific problem. This thesis
builds on the vision that generalized automated analytics methods on the data
sets collected from these complex computing systems provide critical
information about the causes of the problems, and this analysis can then enable
proactive management to improve performance, resilience, efficiency, or security
significantly beyond current limits.
This thesis seeks to design scalable, automated analytics methods and frameworks
for large-scale distributed systems that minimize dependency on expert
knowledge, automate parts of the solution process, and help make systems more
resilient. In addition to analyzing data that is already collected from systems,
our frameworks also identify what to collect from where in the system, such that
the collected data would be concise and useful for manual analytics. We focus on
two data sources for conducting analytics: numeric telemetry data, which is
typically collected from operating system or hardware counters, and end-to-end
traces collected from distributed applications.
This thesis makes the following contributions in large-scale distributed
systems: (1) Designing a framework for accurately diagnosing previously
encountered performance variations, (2) designing a technique for detecting
(unwanted) applications running on the systems, (3) developing a suite for
reproducing performance variations that can be used to systematically develop
analytics methods, (4) designing a method to explain predictions of black-box
machine learning frameworks, and (5) constructing an end-to-end tracing
framework that can dynamically adjust instrumentation for effective diagnosis of
performance problems.