8 research outputs found

    Online Anomaly Detection in HPC Systems

    Get PDF
Reliability is a pressing problem as High Performance Computing systems and data centers evolve. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. This approach clearly does not scale to large supercomputers and facilities: automated methods to detect faults and unhealthy conditions are needed. Our method uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production HPC system, and is deployed at the edge of each computing node. We obtain very good accuracy (values ranging between 90% and 95%) and we also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the performance of the computing units. Borghesi A.; Libri A.; Benini L.; Bartolini A.
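The core idea — train an autoencoder on healthy telemetry only, then flag samples whose reconstruction error exceeds a threshold — can be sketched in miniature. This is an illustrative numpy-only sketch with synthetic two-feature "telemetry", not the paper's actual model or data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" node telemetry: two strongly correlated features
normal = rng.normal(0, 1, size=(500, 1)) @ np.array([[1.0, 0.8]]) \
         + rng.normal(0, 0.05, size=(500, 2))

# Tiny linear autoencoder (2 -> 1 -> 2) trained by gradient descent
W1 = rng.normal(0, 0.1, (2, 1))   # encoder weights
W2 = rng.normal(0, 0.1, (1, 2))   # decoder weights
lr, n = 0.05, len(normal)
for _ in range(3000):
    H = normal @ W1               # encode
    E = H @ W2 - normal           # reconstruction error
    g2 = H.T @ E / n              # gradient w.r.t. decoder
    g1 = normal.T @ (E @ W2.T) / n  # gradient w.r.t. encoder
    W2 -= lr * g2
    W1 -= lr * g1

def recon_error(x):
    """Mean squared reconstruction error per sample."""
    return np.mean((x @ W1 @ W2 - x) ** 2, axis=1)

# Threshold from the training (healthy) data, e.g. the 99th percentile
thr = np.percentile(recon_error(normal), 99)

healthy = np.array([[1.0, 0.8]])   # follows the learned correlation
faulty  = np.array([[1.0, -2.0]])  # breaks it -> large error, anomaly
print(recon_error(healthy)[0] < thr, recon_error(faulty)[0] > thr)
```

The same logic carries over to the paper's deeper autoencoder: only the architecture and the feature set change, not the threshold-on-reconstruction-error decision rule.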

    An Explainable Model for Fault Detection in HPC Systems

    Get PDF
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwanted manners. Identifying broken components is a daunting task for system administrators; hence an automated tool would be a boon for system resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that takes advantage of holistic data-centre monitoring, system-administrator node-status labeling, and an explainable model for fault detection in supercomputing nodes. The proposed model classifies the different states of the computing nodes using the labeled data describing the supercomputer's behaviour, data which is typically collected by system administrators but not integrated into the holistic monitoring infrastructure for data-center automation. In comparison with other methods, the one proposed here is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production
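"Explainable" here means a human can read the decision rule back out of the model. The simplest instance of that idea is a one-rule classifier (a decision stump) fit to labeled node telemetry; the features and data below are hypothetical, chosen only to illustrate the principle, and this is not the paper's actual model:

```python
import numpy as np

# Hypothetical labeled node telemetry: columns = [cpu_temp, fan_rpm],
# label 1 = faulty (as a system administrator might tag nodes)
X = np.array([[55, 3000], [58, 3100], [60, 2900],   # healthy
              [90, 1200], [88, 1100], [95,  900]],  # faulty
             dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

def best_stump(X, y):
    """Exhaustively pick the (feature, threshold) split with the
    fewest misclassifications -- a fully explainable one-rule model."""
    best = (None, None, len(y) + 1)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = (X[:, f] > t).astype(int)
            # allow either polarity of the rule
            err = min(np.sum(pred != y), np.sum((1 - pred) != y))
            if err < best[2]:
                best = (f, float(t), int(err))
    return best

feat, split, err = best_stump(X, y)
print(f"rule: feature {feat} > {split}  (training errors: {err})")
```

A real deployment would use a richer but still interpretable model, yet the payoff is the same: the prediction comes with a rule an administrator can inspect and trust.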

    pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

    Full text link
The increasing use of Internet-of-Things (IoT) devices for monitoring a wide spectrum of applications, along with the challenges of the "big data" streaming support they often require for data analysis, is nowadays pushing for increased attention to the emerging edge computing paradigm. In particular, smart approaches to manage and analyze data directly on the network edge are increasingly investigated, and Artificial Intelligence (AI) powered edge computing is envisaged to be a promising direction. In this paper, we focus on Data Centers (DCs) and Supercomputers (SCs), where a new generation of high-resolution monitoring systems is being deployed, opening new opportunities for analysis such as anomaly detection and security, but introducing new challenges in handling the vast amount of data they produce. In detail, we report on a novel lightweight and scalable approach to increase the security of DCs/SCs that involves AI-powered edge computing on high-resolution power consumption data. The method -- called pAElla -- targets real-time Malware Detection (MD), runs on an out-of-band IoT-based monitoring system for DCs/SCs, and involves the Power Spectral Density of power measurements, along with AutoEncoders. Results are promising, with an F1-score close to 1, and a False Alarm and Malware Miss rate close to 0%. We compare our method with state-of-the-art MD techniques and show that, in the context of DCs/SCs, pAElla can cover a wider range of malware, significantly outperforming SoA approaches in terms of accuracy. Moreover, we propose a methodology for online training suitable for DCs/SCs in production, and release an open dataset and code
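The feature-extraction step pAElla builds on — turning a power trace into a Power Spectral Density — can be illustrated with a plain periodogram. The sampling rate and the injected 50 Hz component below are invented for the example; the paper's monitoring system and exact PSD estimator may differ:

```python
import numpy as np

# Hypothetical 1 kHz node power trace: a constant baseline plus a
# 50 Hz oscillation (e.g. periodic activity malware might introduce)
fs = 1000                         # sampling rate in Hz (assumed)
t = np.arange(0, 1, 1 / fs)
trace = 100 + 5 * np.sin(2 * np.pi * 50 * t)

# Periodogram: squared magnitude of the one-sided FFT, after removing
# the DC offset so the baseline does not dominate the spectrum
spec = np.abs(np.fft.rfft(trace - trace.mean())) ** 2 / len(trace)
freqs = np.fft.rfftfreq(len(trace), 1 / fs)

peak = freqs[np.argmax(spec)]
print(peak)  # dominant frequency in the PSD feature vector
```

Vectors like `spec` are what the autoencoder consumes: malware that perturbs the power signature shows up as energy at frequencies the model never saw during training on healthy nodes.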

    Design and implementation of a telemetry platform for high-performance computing environments

    Get PDF
A new generation of high-performance and distributed computing applications and services rely on adaptive and dynamic architectures and execution strategies to run efficiently, resiliently, and at scale in today’s HPC environments. These architectures require insights into their execution behaviour and the state of their execution environment at various levels of detail in order to make context-aware decisions. HPC telemetry provides this information: it describes the continuous stream of time-series and event data that is generated on HPC systems by the hardware, operating systems, services, runtime systems, and applications. Current HPC ecosystems do not provide the conceptual models, infrastructure, and interfaces to collect, store, analyse, and integrate telemetry in a structured and efficient way. Consequently, applications and services largely depend on one-off solutions and custom-built technologies to achieve these goals, introducing significant development overheads that inhibit portability and mobility. To facilitate a broader mix of applications, more efficient application development, and swift adoption of adaptive architectures in production, a comprehensive framework for telemetry management and analysis must be provided as part of future HPC ecosystem designs. This thesis provides the blueprint for such a framework: it proposes a new approach to telemetry management in HPC, the Telemetry Platform concept. Departing from the observation that telemetry data and the corresponding analysis and integration patterns on modern multi-tenant HPC systems have a lot of similarities to the patterns observed in large-scale data analytics or “Big Data” platforms, the telemetry platform concept takes the data platform paradigm and architectural approach and applies them to HPC telemetry. 
The result is the blueprint for a system that provides services for storing, searching, analysing, and integrating telemetry data in HPC applications and other HPC system services. It allows users to create and share telemetry data-driven insights using everything from simple time-series analysis to complex statistical and machine learning models, while at the same time hiding many of the inherent complexities of data management such as data transport, clean-up, storage, cataloguing, and access management, and providing appropriate and scalable analytics and integration capabilities. The main contributions of this research are (1) the application of the data platform concept to HPC telemetry data management and usage; (2) a graph-based, time-variant telemetry data model that captures structures and properties of the platform and applications and in which telemetry data can be organized; (3) an architecture blueprint and prototype of a concrete implementation and integration architecture of the telemetry platform; and (4) a proposal for decoupled HPC application architectures, separating telemetry data management and feedback-control-loop logic from the core application code. First experimental results with the prototype implementation suggest that the telemetry platform paradigm can reduce overhead and redundancy in the development of telemetry-based application architectures, and lower the barrier for HPC systems research and the provisioning of new, innovative HPC system services
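Contribution (2), the graph-based, time-variant telemetry data model, can be sketched as entities (compute nodes, jobs, sensors) forming vertices, timestamped relations forming edges, and each vertex carrying its own time series. All names below (`TelemetryGraph`, `runs_on`, `power_w`) are illustrative placeholders, not the thesis's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    kind: str                                   # e.g. "compute_node", "job"
    series: dict = field(default_factory=dict)  # metric -> [(t, value), ...]

class TelemetryGraph:
    """Minimal sketch of a graph-based, time-variant telemetry model."""

    def __init__(self):
        self.vertices, self.edges = {}, []

    def add(self, vid, kind):
        self.vertices[vid] = Vertex(kind)

    def link(self, src, dst, relation, t):
        # time-variant edge: the relation is valid from timestamp t
        self.edges.append((src, dst, relation, t))

    def record(self, vid, metric, t, value):
        self.vertices[vid].series.setdefault(metric, []).append((t, value))

g = TelemetryGraph()
g.add("node42", "compute_node")
g.add("job7", "job")
g.link("job7", "node42", "runs_on", t=0)       # job placement over time
g.record("node42", "power_w", t=1, value=310.5)
print(g.vertices["node42"].series["power_w"])
```

Organizing telemetry this way lets a query walk from a job to the nodes it ran on and pull the matching slices of each node's time series, which is exactly the kind of integration the platform is meant to offer as a service.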

    Detecting anomalies in modern IT systems through the inference of structure and the detection of novelties in system logs

    Get PDF
Anomalies in the logs of information systems are often a sign of faults or vulnerabilities. Their automatic detection is challenging due to the lack of structure in logs and the complexity of the anomalies. Existing structure-inference methods are not very flexible: they are not parametric, or they rely on strong syntactic assumptions which sometimes prove inadequate. Anomaly-detection methods, for their part, adopt a data representation that neglects the time elapsed between log entries, and are therefore unsuitable for detecting temporal anomalies. The contribution of this thesis is twofold. We first propose METING, a parametric and modular structure-inference method. METING does not rely on any strong syntactic assumption, but is based on the mining of frequent patterns through the study of log n-grams. We show experimentally that METING surpasses existing methods, with important improvements on some datasets. We also show that the sensitivity of our method to its hyper-parameters allows it to adapt to the heterogeneity of datasets. Finally, we propose an extension of METING to stemming in text mining, and show that our approach provides a stemming solution that is multilingual, rule-free, and more efficient than Porter's method, the state-of-the-art reference. We also present NoTIL, a novelty-detection method based on deep learning. NoTIL uses a data representation capable of detecting temporal irregularities in logs; our method learns an intermediate prediction task to model the nominal behavior of logs. 
We compare our method to state-of-the-art references and conclude that NoTIL is the method capable of handling the greatest variety of anomalies, thanks to its choice of data representation
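The structure-inference idea — separate the fixed template of a log line from its variable parts by mining frequent token patterns — can be shown in miniature. This toy uses single-token frequencies rather than METING's actual n-gram mining, and the log lines are invented:

```python
from collections import Counter

# Hypothetical log lines sharing one template with a variable field
logs = [
    "connection from 10.0.0.1 closed",
    "connection from 10.0.0.7 closed",
    "connection from 10.0.0.9 closed",
]

# Frequent-pattern inference in miniature: tokens appearing in every
# line are kept as the fixed template; rare tokens become wildcards
counts = Counter(tok for line in logs for tok in line.split())
template = [tok if counts[tok] == len(logs) else "<*>"
            for tok in logs[0].split()]
print(" ".join(template))
```

Recovering such templates turns free-text logs into structured events, which is the prerequisite for the downstream novelty detection that NoTIL performs.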
