8 research outputs found

    Online Anomaly Detection in HPC Systems

    Get PDF
Reliability is a pressing problem as High Performance Computing systems and data centers evolve. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. This approach clearly does not scale to large supercomputers and facilities: automated methods to detect faults and unhealthy conditions are needed. Our method uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production HPC system, and is deployed at the edge of each computing node. We obtain very good accuracy (values ranging between 90% and 95%) and we also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the performance of the computing units. Borghesi A.; Libri A.; Benini L.; Bartolini A.
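The core idea — train an autoencoder on healthy telemetry only, then flag samples whose reconstruction error exceeds a threshold — can be sketched in miniature. This is an illustrative numpy-only sketch with synthetic two-feature "telemetry", not the paper's actual model or data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" node telemetry: two strongly correlated features
normal = rng.normal(0, 1, size=(500, 1)) @ np.array([[1.0, 0.8]]) \
         + rng.normal(0, 0.05, size=(500, 2))

# Tiny linear autoencoder (2 -> 1 -> 2) trained by gradient descent
W1 = rng.normal(0, 0.1, (2, 1))   # encoder weights
W2 = rng.normal(0, 0.1, (1, 2))   # decoder weights
lr, n = 0.05, len(normal)
for _ in range(3000):
    H = normal @ W1               # encode
    E = H @ W2 - normal           # reconstruction error
    g2 = H.T @ E / n              # gradient w.r.t. decoder
    g1 = normal.T @ (E @ W2.T) / n  # gradient w.r.t. encoder
    W2 -= lr * g2
    W1 -= lr * g1

def recon_error(x):
    """Mean squared reconstruction error per sample."""
    return np.mean((x @ W1 @ W2 - x) ** 2, axis=1)

# Threshold from the training (healthy) data, e.g. the 99th percentile
thr = np.percentile(recon_error(normal), 99)

healthy = np.array([[1.0, 0.8]])   # follows the learned correlation
faulty  = np.array([[1.0, -2.0]])  # breaks it -> large error, anomaly
print(recon_error(healthy)[0] < thr, recon_error(faulty)[0] > thr)
```

The same logic carries over to the paper's deeper autoencoder: only the architecture and the feature set change, not the threshold-on-reconstruction-error decision rule.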

    An Explainable Model for Fault Detection in HPC Systems

    Get PDF
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwanted manners. Identifying broken components is a daunting task for system administrators; hence an automated tool would be a boon for system resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that takes advantage of holistic data-centre monitoring, system-administrator node-status labeling, and an explainable model for fault detection in supercomputing nodes. The proposed model classifies the different states of the computing nodes using the labeled data describing the supercomputer's behaviour, data which is typically collected by system administrators but not integrated into the holistic monitoring infrastructure for data-center automation. In comparison with other methods, the one proposed here is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production
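"Explainable" here means a human can read the decision rule back out of the model. The simplest instance of that idea is a one-rule classifier (a decision stump) fit to labeled node telemetry; the features and data below are hypothetical, chosen only to illustrate the principle, and this is not the paper's actual model:

```python
import numpy as np

# Hypothetical labeled node telemetry: columns = [cpu_temp, fan_rpm],
# label 1 = faulty (as a system administrator might tag nodes)
X = np.array([[55, 3000], [58, 3100], [60, 2900],   # healthy
              [90, 1200], [88, 1100], [95,  900]],  # faulty
             dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

def best_stump(X, y):
    """Exhaustively pick the (feature, threshold) split with the
    fewest misclassifications -- a fully explainable one-rule model."""
    best = (None, None, len(y) + 1)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = (X[:, f] > t).astype(int)
            # allow either polarity of the rule
            err = min(np.sum(pred != y), np.sum((1 - pred) != y))
            if err < best[2]:
                best = (f, float(t), int(err))
    return best

feat, split, err = best_stump(X, y)
print(f"rule: feature {feat} > {split}  (training errors: {err})")
```

A real deployment would use a richer but still interpretable model, yet the payoff is the same: the prediction comes with a rule an administrator can inspect and trust.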

    pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

    Full text link
The increasing use of Internet-of-Things (IoT) devices for monitoring a wide spectrum of applications, along with the challenges of the "big data" streaming support they often require for data analysis, is nowadays pushing for increased attention to the emerging edge computing paradigm. In particular, smart approaches to manage and analyze data directly on the network edge are increasingly investigated, and Artificial Intelligence (AI) powered edge computing is envisaged to be a promising direction. In this paper, we focus on Data Centers (DCs) and Supercomputers (SCs), where a new generation of high-resolution monitoring systems is being deployed, opening new opportunities for analysis such as anomaly detection and security, but introducing new challenges in handling the vast amount of data they produce. In detail, we report on a novel lightweight and scalable approach to increase the security of DCs/SCs that involves AI-powered edge computing on high-resolution power consumption data. The method -- called pAElla -- targets real-time Malware Detection (MD), runs on an out-of-band IoT-based monitoring system for DCs/SCs, and involves the Power Spectral Density of power measurements, along with AutoEncoders. Results are promising, with an F1-score close to 1, and a False Alarm and Malware Miss rate close to 0%. We compare our method with state-of-the-art MD techniques and show that, in the context of DCs/SCs, pAElla can cover a wider range of malware, significantly outperforming SoA approaches in terms of accuracy. Moreover, we propose a methodology for online training suitable for DCs/SCs in production, and release an open dataset and code
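The feature-extraction step pAElla builds on — turning a power trace into a Power Spectral Density — can be illustrated with a plain periodogram. The sampling rate and the injected 50 Hz component below are invented for the example; the paper's monitoring system and exact PSD estimator may differ:

```python
import numpy as np

# Hypothetical 1 kHz node power trace: a constant baseline plus a
# 50 Hz oscillation (e.g. periodic activity malware might introduce)
fs = 1000                         # sampling rate in Hz (assumed)
t = np.arange(0, 1, 1 / fs)
trace = 100 + 5 * np.sin(2 * np.pi * 50 * t)

# Periodogram: squared magnitude of the one-sided FFT, after removing
# the DC offset so the baseline does not dominate the spectrum
spec = np.abs(np.fft.rfft(trace - trace.mean())) ** 2 / len(trace)
freqs = np.fft.rfftfreq(len(trace), 1 / fs)

peak = freqs[np.argmax(spec)]
print(peak)  # dominant frequency in the PSD feature vector
```

Vectors like `spec` are what the autoencoder consumes: malware that perturbs the power signature shows up as energy at frequencies the model never saw during training on healthy nodes.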

    Design and implementation of a telemetry platform for high-performance computing environments

    Get PDF
A new generation of high-performance and distributed computing applications and services rely on adaptive and dynamic architectures and execution strategies to run efficiently, resiliently, and at scale in today’s HPC environments. These architectures require insights into their execution behaviour and the state of their execution environment at various levels of detail in order to make context-aware decisions. HPC telemetry provides this information: it describes the continuous stream of time-series and event data that is generated on HPC systems by the hardware, operating systems, services, runtime systems, and applications. Current HPC ecosystems do not provide the conceptual models, infrastructure, and interfaces to collect, store, analyse, and integrate telemetry in a structured and efficient way. Consequently, applications and services largely depend on one-off solutions and custom-built technologies to achieve these goals, introducing significant development overheads that inhibit portability and mobility. To facilitate a broader mix of applications, more efficient application development, and swift adoption of adaptive architectures in production, a comprehensive framework for telemetry management and analysis must be provided as part of future HPC ecosystem designs. This thesis provides the blueprint for such a framework: it proposes a new approach to telemetry management in HPC, the Telemetry Platform concept. Departing from the observation that telemetry data and the corresponding analysis and integration patterns on modern multi-tenant HPC systems have a lot of similarities to the patterns observed in large-scale data analytics or “Big Data” platforms, the telemetry platform concept takes the data platform paradigm and architectural approach and applies them to HPC telemetry. 
The result is the blueprint for a system that provides services for storing, searching, analysing, and integrating telemetry data in HPC applications and other HPC system services. It allows users to create and share telemetry data-driven insights using everything from simple time-series analysis to complex statistical and machine learning models, while at the same time hiding many of the inherent complexities of data management such as data transport, clean-up, storage, cataloguing, and access management, and providing appropriate and scalable analytics and integration capabilities. The main contributions of this research are (1) the application of the data platform concept to HPC telemetry data management and usage; (2) a graph-based, time-variant telemetry data model that captures structures and properties of the platform and applications and in which telemetry data can be organized; (3) an architecture blueprint and prototype of a concrete implementation and integration architecture of the telemetry platform; and (4) a proposal for decoupled HPC application architectures, separating telemetry data management and feedback-control-loop logic from the core application code. First experimental results with the prototype implementation suggest that the telemetry platform paradigm can reduce overhead and redundancy in the development of telemetry-based application architectures, and lower the barrier for HPC systems research and the provisioning of new, innovative HPC system services
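Contribution (2), the graph-based, time-variant telemetry data model, can be sketched as entities (compute nodes, jobs, sensors) forming vertices, timestamped relations forming edges, and each vertex carrying its own time series. All names below (`TelemetryGraph`, `runs_on`, `power_w`) are illustrative placeholders, not the thesis's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    kind: str                                   # e.g. "compute_node", "job"
    series: dict = field(default_factory=dict)  # metric -> [(t, value), ...]

class TelemetryGraph:
    """Minimal sketch of a graph-based, time-variant telemetry model."""

    def __init__(self):
        self.vertices, self.edges = {}, []

    def add(self, vid, kind):
        self.vertices[vid] = Vertex(kind)

    def link(self, src, dst, relation, t):
        # time-variant edge: the relation is valid from timestamp t
        self.edges.append((src, dst, relation, t))

    def record(self, vid, metric, t, value):
        self.vertices[vid].series.setdefault(metric, []).append((t, value))

g = TelemetryGraph()
g.add("node42", "compute_node")
g.add("job7", "job")
g.link("job7", "node42", "runs_on", t=0)       # job placement over time
g.record("node42", "power_w", t=1, value=310.5)
print(g.vertices["node42"].series["power_w"])
```

Organizing telemetry this way lets a query walk from a job to the nodes it ran on and pull the matching slices of each node's time series, which is exactly the kind of integration the platform is meant to offer as a service.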

    Detecting anomalies in modern IT systems through the inference of structure and the detection of novelties in system logs

    Get PDF
Anomalies in the logs of information systems are often a sign of faults or vulnerabilities. Their automatic detection is challenging due to the lack of structure in logs and the complexity of the anomalies. Existing structure-inference methods are not very flexible: they are not parametric, or they rely on strong syntactic assumptions which sometimes prove inadequate. Anomaly-detection methods, for their part, adopt a data representation that neglects the time elapsed between log entries, and are therefore unsuitable for detecting temporal anomalies. The contribution of this thesis is twofold. We first propose METING, a parametric and modular structure-inference method. METING does not rely on any strong syntactic assumption, but is based on the mining of frequent patterns through the study of log n-grams. We show experimentally that METING surpasses existing methods, with important improvements on some datasets. We also show that the sensitivity of our method to its hyper-parameters allows it to adapt to the heterogeneity of datasets. Finally, we propose an extension of METING to stemming in text mining, and show that our approach provides a stemming solution that is multilingual, rule-free, and more efficient than Porter's method, the state-of-the-art reference. We also present NoTIL, a novelty-detection method based on deep learning. NoTIL uses a data representation capable of detecting temporal irregularities in logs; our method learns an intermediate prediction task to model the nominal behavior of logs. 
We compare our method to state-of-the-art references and conclude that NoTIL is the method capable of handling the greatest variety of anomalies, thanks to its choice of data representation
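The structure-inference idea — separate the fixed template of a log line from its variable parts by mining frequent token patterns — can be shown in miniature. This toy uses single-token frequencies rather than METING's actual n-gram mining, and the log lines are invented:

```python
from collections import Counter

# Hypothetical log lines sharing one template with a variable field
logs = [
    "connection from 10.0.0.1 closed",
    "connection from 10.0.0.7 closed",
    "connection from 10.0.0.9 closed",
]

# Frequent-pattern inference in miniature: tokens appearing in every
# line are kept as the fixed template; rare tokens become wildcards
counts = Counter(tok for line in logs for tok in line.split())
template = [tok if counts[tok] == len(logs) else "<*>"
            for tok in logs[0].split()]
print(" ".join(template))
```

Recovering such templates turns free-text logs into structured events, which is the prerequisite for the downstream novelty detection that NoTIL performs.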
