67 research outputs found

    Data Driven Device Failure Prediction

    As society becomes more dependent upon computer systems to perform increasingly critical tasks, ensuring those systems do not fail also becomes more important. Many organizations depend heavily on desktop computers for day-to-day operations. Unfortunately, the software that runs on these computers is still written by humans and, as such, is still subject to human error and consequent failure. A natural solution is to use statistical machine learning to predict failure. However, since failure is still a relatively rare event, obtaining labeled training data to train these models is not trivial. This work presents new simulated fault loads with an automated framework to predict failure in the Microsoft enterprise authentication service and Apache web server in an effort to increase uptime and improve mission effectiveness. These new fault loads were successful in creating realistic failure conditions that are accurately identified by statistical learning models.

    Understanding error log event sequence for failure analysis

    As large-scale parallel systems have evolved, they are now mostly employed for mission-critical applications. Anticipating and accommodating failures is crucial to their design. Failure is commonplace in these large-scale systems and cannot be treated as an exception. The system state is mostly captured through logs, which contain the “health” information of the system, so a proper understanding of these error logs is extremely important for failure analysis. In this paper we design an approach that seeks to find similarities in the patterns of log events that lead to failures. Our experiment shows that several root causes of soft lockup failures could be traced through the logs. We capture the behavior of failure-inducing patterns and find that the log patterns of failures and non-failures are dissimilar. Keywords: Failure Sequences; Cluster; Error Logs; HPC; Similarit

    Clairvoyant : a log-based transformer-decoder for failure prediction in large-scale systems

    System failures are expected to be frequent in the exascale era, as they are in current petascale systems. The health of such systems is usually determined through challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model to predict node failures in HPC systems, based on a recent deep learning approach called the transformer-decoder and the self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying whether a failure is part of that sequence. We carefully evaluate Clairvoyant and another state-of-the-art failure prediction approach, Desh, on two real-world system log datasets. Experiments show that Clairvoyant is significantly better: e.g., it can predict node failures with average BLEU, ROUGE, and MCC scores of 0.90, 0.78, and 0.65 respectively, while Desh scores only 0.58, 0.58, and 0.25. More importantly, this improvement is achieved with faster training and prediction times, with Clairvoyant being about 25× and 15× faster than Desh respectively.
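
As a rough illustration of step (ii) and of the MCC metric quoted in this abstract, the sketch below flags a failure whenever a predicted event sequence contains a failure event, then scores the flags with the standard Matthews correlation coefficient. It is not the authors' code: the event label `NODE_DOWN`, the function names, and the sample sequences are all hypothetical, and the sequence-prediction model itself (step (i)) is not shown.

```python
import math

# Hypothetical failure event label; the paper's actual event vocabulary is not given here.
FAILURE_EVENT = "NODE_DOWN"


def predicts_failure(predicted_events):
    """Step (ii): report a failure if the predicted sequence contains a failure event."""
    return FAILURE_EVENT in predicted_events


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for two equal-length lists of booleans."""
    pairs = [(bool(t), bool(p)) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Made-up predicted sequences for four nodes, plus made-up ground truth.
predicted = [
    ["WARN_ECC", "NODE_DOWN"],
    ["WARN_ECC", "INFO_BOOT"],
    ["INFO_BOOT"],
    ["INFO_JOB_START", "INFO_JOB_END"],
]
truth = [True, True, False, False]
flags = [predicts_failure(seq) for seq in predicted]
score = mcc(truth, flags)
```

MCC ranges from -1 to 1 and stays informative when failures are rare, which may be why it is reported alongside the sequence-overlap metrics.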

    Towards efficient error detection in large-scale HPC systems

    The need for computer systems to be reliable has become increasingly important as users' dependence on their accurate functioning increases. The failure of these systems could be very costly in terms of time and money. Although system designers try to design fault-free systems, it is practically impossible to build such systems, as many different factors can affect them. To achieve system reliability, fault tolerance methods are usually deployed; these methods help the system produce acceptable results even in the presence of faults. Root cause analysis, a dependability method in which the causes of failures are diagnosed in order to correct them or prevent future occurrences, is less efficient: it is reactive and does not prevent the first failure from occurring. For this reason, methods with predictive capabilities are preferred; failure prediction methods are employed to predict potential failures so that preventive measures can be applied. Most predictive methods have been supervised, requiring accurate knowledge of the system's failures, errors and faults. However, with changing system components and system updates, supervised methods are ineffective. Error detection methods allow error patterns to be detected early so that preventive methods can be applied. Performing this detection in an unsupervised way could be more effective, as system changes or updates would affect such a solution less. In this thesis, we introduce an unsupervised approach to detecting error patterns in a system using its data. More specifically, the thesis investigates the use of both event logs and resource utilization data to detect error patterns. It addresses both the spatial and temporal aspects of achieving system dependability. The proposed unsupervised error detection method has been applied to real data from two different production systems. The results are positive, showing an average detection F-measure of about 75%.

    The terminator : an AI-based framework to handle dependability threats in large-scale distributed systems

    With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by the scale of the resources they possess, containing a large number of sophisticated hardware components that are tightly integrated. This scale and design complexity inherently contribute to sources of uncertainty, i.e., there are dependability threats that perturb the system during application execution. During execution, these HPC systems generate a massive number of log messages that capture the health status of the various components. Several previous works have leveraged those systems’ logs for dependability purposes, such as failure prediction, with varying results. In this work, three novel AI-based techniques are proposed to address two major dependability problems, those of (i) error detection and (ii) failure prediction. The proposed error detection technique leverages the sentiments embedded in log messages in a novel way, making the approach HPC system-independent, i.e., the technique can be used to detect errors in any HPC system. For failure prediction, two novel self-supervised transformer neural networks are developed, thereby obviating the need for labels, which are notoriously difficult to obtain in HPC systems. The first transformer technique, called Clairvoyant, accurately predicts the location of the failure, while the second technique, called Time Machine, extends Clairvoyant by also accurately predicting the lead time to failure (LTTF). Time Machine addresses the typical regression problem of LTTF as a novel multi-class classification problem, using a novel oversampling method for online time-based task training. Results from six real-world HPC clusters’ datasets show that our approaches significantly outperform the state-of-the-art methods on various metrics.

    Analysis of Gemini interconnect recovery mechanisms: methods and observations

    This thesis focuses on the resilience of network components and the recovery capabilities of extreme-scale high-performance computing (HPC) systems, specifically petaflop-level supercomputers, aimed at solving complex science, engineering, and business problems that require high bandwidth, enhanced networking, and high compute capabilities. The resilience of the network is critical for ensuring successful execution of applications and overall system availability. Failures of interconnect components such as links, routers, and power supplies pose a threat to the resilience of the interconnect network, causing application failures and, in the worst case, system-wide failure. An extreme-scale system is designed to manage these failures and recover from them automatically to ensure successful application execution and avoid system-wide failure. Thus, in this thesis, we characterize the success probability of the recovery procedures as well as their impact on applications. We developed an interconnect recovery mechanisms analysis tool (I-RAT), a plugin built on top of LogDiver, to characterize and assess the impact of recovery mechanisms. The tool was used to analyze more than two years of network/system logs from Blue Waters, a supercomputer operated by the NCSA at the University of Illinois. Our analyses show that interconnect recovery mechanisms are frequently triggered (the mean time between triggers is as short as 36 hours for link failovers), and the initiated recovery fails with relatively high probability (as much as 0.25 for link failover). Furthermore, the analyses show that system resilience does not equate to application resilience, since executing applications can fail with non-negligible probability during (or just after) a successful recovery. We also show that as many as 20% of the executing applications fail during the recovery phase.

    Features correlation-based workflows for high-performance computing systems diagnosis

    Analysing failures to improve the reliability of high-performance computing systems and data centres is important. The primary source of information for diagnosing system failures is the system logs, and it is widely known that a diagnosis of the cause of a system failure based only on system logs is incomplete. Resource utilisation data, recently made available, is another potentially useful source of information for failure analysis. However, large High-Performance Computing (HPC) systems generate a lot of data, and processing this huge amount of data presents a significant challenge for online failure diagnosis. Most work on failure diagnosis has studied only errors that lead to system failures; little work studies, on real data, errors that lead to either a system failure or a recovery. In this thesis, we design, implement and evaluate two failure diagnostics frameworks, which we name CORRMEXT and EXERMEST. We implement the Data Type Extraction, Feature Extraction, Correlation and Time-bin Extraction modules. CORRMEXT integrates the Data Type Extraction, Correlation and Time-bin Extraction modules. It identifies error cases that occur frequently and reports the success and failure of error recovery protocols. EXERMEST integrates the Feature Extraction and Correlation modules. It extracts significant errors and resource use counters and identifies error cases that are rare. We apply the diagnostics frameworks to the resource use data and system logs of three HPC systems operated by the Texas Advanced Computing Center (TACC). Our results show that: (i) multiple correlation methods are required for identifying more dates of groups of correlated resource use counters and groups of correlated errors, (ii) the earliest hour of change in system behaviour can only be identified by using the correlated resource use counters and correlated errors, (iii) multiple feature extraction methods are required for identifying the rare error cases, and (iv) time-bins of multiple granularities are necessary for identifying the rare error cases. CORRMEXT and EXERMEST are publicly available to support system administrators in failure diagnosis.

    Big data analytics towards predictive maintenance at the INFN-CNAF computing centre

    High Energy Physics (HEP) has long been among the pioneers in managing and processing enormous scientific datasets and in operating some of the largest data centres for scientific applications. HEP developed a computational grid (Grid) for computing at the Large Hadron Collider (LHC) at CERN in Geneva, which currently coordinates daily computing operations on more than 800k processors in 170 computing centres and manages half an exabyte of data on disk distributed across 5 continents. In the next phases of the LHC, especially in view of Run-4, the amount of data handled by the computing centres will increase considerably. In this context, the HEP Software Foundation has drafted a Community White Paper (CWP) that indicates the path to follow in the evolution of modern software and computing models in preparation for the so-called High Luminosity phase of the LHC. That work identified Big Data Analytics techniques as having enormous potential for addressing the future challenges of HEP. One of the developments concerns so-called Operation Intelligence, i.e., the pursuit of a higher level of automation within workflows. Approaches of this kind could lead to a transition from a reactive maintenance system to a more advanced system of predictive or even prescriptive maintenance. This thesis presents work done in collaboration with the INFN-CNAF computing centre to introduce a system for ingesting, organising and processing the centre's logs on a unified Big Data Analytics platform, in order to prototype a predictive maintenance model for the centre. The thesis contributes to this project by developing a log message clustering algorithm based on similarity measures between text fields, to overcome the limitation associated with the verbosity and heterogeneity of the logs collected from the various services operating 24/7 at the centre.
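
To give a feel for clustering log messages by textual similarity, here is a minimal sketch that greedily groups messages whose similarity to a cluster's first message exceeds a threshold. It is a generic stand-in, not the thesis's algorithm: the similarity measure (`difflib.SequenceMatcher.ratio` from the Python standard library), the threshold of 0.7, the function names, and the sample messages are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two strings (difflib's matching-blocks ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_messages(messages, threshold=0.7):
    """Greedy one-pass clustering: join the first cluster whose representative
    (its first message) is similar enough, otherwise start a new cluster."""
    clusters = []
    for msg in messages:
        for cluster in clusters:
            if similarity(msg, cluster[0]) >= threshold:
                cluster.append(msg)
                break
        else:
            clusters.append([msg])
    return clusters


# Made-up log lines; variable fields (addresses, devices, percentages) differ.
logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Disk /dev/sda1 usage at 91%",
    "Disk /dev/sdb1 usage at 87%",
]
clusters = cluster_messages(logs)
```

A one-pass scheme like this depends on message order and on the chosen threshold; it is shown only because it keeps the idea of grouping verbose, heterogeneous messages by text similarity easy to follow.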