
    Data-driven Models for Remaining Useful Life Estimation of Aircraft Engines and Hard Disk Drives

    Failure of physical devices can cause inconvenience, loss of money, and sometimes even deaths. To improve the reliability of these devices, we need to know the remaining useful life (RUL) of a device at a given point in time. Data-driven approaches use data from a physical device to build a model that can estimate the RUL. They have shown great performance and are often simpler than traditional model-based approaches. Typical statistical and machine learning approaches are often not suited for sequential data prediction. Recurrent Neural Networks are designed to work with sequential data but suffer from the vanishing gradient problem over time. Therefore, I explore the use of Long Short-Term Memory (LSTM) networks for RUL prediction. I perform two experiments. First, I train bidirectional LSTM networks on the Backblaze hard-disk drive dataset, achieving an accuracy of 96.4% on a 60-day time window, which is state-of-the-art performance. Additionally, I use a standardization method that standardizes each hard-drive instance independently and explore the benefits and downsides of this approach. Second, I train LSTM models on the NASA N-CMAPSS dataset to predict aircraft engine remaining useful life. I train models on each of the eight sub-datasets, achieving an RMSE of 6.304 on one of them, the second-best result in the current literature. I also compare an LSTM network's performance to that of a Random Forest and a Temporal Convolutional Neural Network model, demonstrating the LSTM network's superior performance. I find that LSTM networks are capable predictors of device remaining useful life, and I present a thorough model development process that can be reproduced to develop LSTM models for various RUL prediction tasks. These models can improve the reliability of devices such as aircraft engines and hard-disk drives.
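    A minimal sketch of the setup described above: per-drive standardization of the SMART sequences followed by a bidirectional LSTM classifier over a 60-day window. The window length, feature count, layer sizes, and the Keras API are assumptions for illustration, not the author's actual code.

```python
import numpy as np
from tensorflow import keras

def standardize_per_drive(drive_sequences):
    """Standardize each drive's sequence with its own mean/std
    (the per-instance scheme mentioned in the abstract)."""
    out = []
    for seq in drive_sequences:                      # seq: (timesteps, n_features)
        mu, sigma = seq.mean(axis=0), seq.std(axis=0) + 1e-8
        out.append((seq - mu) / sigma)
    return np.stack(out)

# Hypothetical shapes: 60-day windows of SMART attributes, binary "fails soon" label.
n_timesteps, n_features = 60, 24
model = keras.Sequential([
    keras.Input(shape=(n_timesteps, n_features)),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(standardize_per_drive(train_seqs), y_train, epochs=20, validation_split=0.1)
```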

    Large-scale End-of-Life Prediction of Hard Disks in Distributed Datacenters

    On a daily basis, data centers process huge volumes of data backed by the proliferation of inexpensive hard disks. Data stored on these disks serves a range of critical functional needs, from finance and healthcare to aerospace. As such, premature disk failure and the consequent loss of data can be catastrophic. To mitigate the risk of failures, cloud storage providers perform condition-based monitoring and replace hard disks before they fail. By estimating the remaining useful life of hard disk drives, one can predict the time-to-failure of a particular device and replace it at the right time, ensuring maximum utilization while reducing operational costs. In this work, large-scale predictive analyses are performed on severely skewed health-statistics data by incorporating customized feature engineering and a suite of sequence learners. Past work suggests that LSTMs are an excellent approach to predicting remaining useful life. To this end, we present an encoder-decoder LSTM model in which the context gained from understanding health-statistics sequences aids in predicting an output sequence of the number of days remaining before a disk potentially fails. The models developed in this work are trained and tested across all 10 years of S.M.A.R.T. health data in circulation from Backblaze and on a wide variety of disk instances. This closes the knowledge gap on what full-scale training achieves on thousands of devices and advances the state of the art by providing tangible metrics for evaluation and generalization for practitioners looking to extend their workflow to all years of health data in circulation across disk manufacturers. The encoder-decoder LSTM posted an RMSE of 0.83 during training and 0.86 during testing over the full 10-year dataset while generalizing competitively over other drives from the Seagate family.
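    A rough sketch of an encoder-decoder LSTM of the kind described above: the encoder compresses a window of health statistics into a context vector, and the decoder unrolls it into a sequence of remaining-days estimates. The window lengths, layer widths, and the Keras API are assumptions, not the authors' implementation.

```python
from tensorflow import keras

# Hypothetical dimensions: 30 days of SMART history in, a 30-step RUL sequence out.
n_in, n_out, n_features = 30, 30, 18

inputs = keras.Input(shape=(n_in, n_features))
context = keras.layers.LSTM(128)(inputs)              # encoder: sequence -> context vector
x = keras.layers.RepeatVector(n_out)(context)         # feed the context to every decoder step
x = keras.layers.LSTM(128, return_sequences=True)(x)  # decoder: context -> output sequence
outputs = keras.layers.TimeDistributed(keras.layers.Dense(1))(x)  # days remaining per step

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse",
              metrics=[keras.metrics.RootMeanSquaredError()])
```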

    Examining the impact of critical attributes on hard drive failure times: Multi-state models for left-truncated and right-censored semi-competing risks data

    The ability to predict failures in hard disk drives (HDDs) is a major objective of HDD manufacturers, since avoiding unexpected failures may prevent data loss, improve service reliability, and reduce data center downtime. Most HDDs are equipped with a threshold-based monitoring system named self-monitoring, analysis and reporting technology (SMART). The system collects several performance metrics, called SMART attributes, and detects anomalies that may indicate incipient failures. SMART works as a nascent-failure detection method and does not estimate the HDDs' remaining useful life. We define critical attributes and critical states for hard drives using SMART attributes and fit multi-state models to the resulting semi-competing risks data. The multi-state models provide a coherent and novel way to model the failure time of a hard drive and allow us to examine the impact of critical attributes on that failure time. We derive dynamic predictions of conditional survival probabilities, which adapt to the state of the drive. Using a dataset of HDDs equipped with SMART, we find that drives are more likely to fail after entering critical states. We evaluate the predictive accuracy of the proposed models with a case study of HDDs equipped with SMART, using the time-dependent area under the receiver operating characteristic curve (AUC) and the expected prediction error (PE). The results suggest that accounting for changes in the critical attributes improves the accuracy of dynamic predictions.
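    To make the idea of state-adaptive dynamic prediction concrete, here is a toy, time-homogeneous Markov version of a healthy/critical/failed multi-state model; the intensities are invented numbers, and the paper's actual models (with left truncation, right censoring, and covariates) are far richer.

```python
import numpy as np
from scipy.linalg import expm

# States: 0 = healthy, 1 = critical (a SMART attribute past its threshold), 2 = failed.
# Transition intensity matrix with made-up rates, purely for illustration.
Q = np.array([[-0.012,  0.010, 0.002],   # healthy -> critical / failed
              [ 0.000, -0.030, 0.030],   # critical -> failed
              [ 0.000,  0.000, 0.000]])  # failed is absorbing

def conditional_survival(state, horizon_days):
    """P(drive has not failed within `horizon_days` | currently in `state`)."""
    P = expm(Q * horizon_days)            # transition probabilities over the horizon
    return 1.0 - P[state, 2]

print(conditional_survival(0, 60))   # prediction from the healthy state
print(conditional_survival(1, 60))   # lower survival once the critical state is entered
```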

    Towards Data-Driven Autonomics in Data Centers

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing machine state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether machines will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88%, with precision varying between 50% and 72%. We discuss the practicality of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for the BigQuery and classification analyses are publicly available from the authors' website.
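    A hedged sketch of the kind of ensemble described above: several Random Forests, each trained on a class-balanced resample of the skewed failure data, with their failure probabilities averaged. The resampling scheme and hyperparameters are assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_rf_ensemble(X, y, n_members=10, seed=0):
    """Fit an ensemble of Random Forests on balanced resamples of (X, y)."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    members = []
    for _ in range(n_members):
        # Downsample the (much larger) healthy class so each member sees a balanced set.
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
        clf = RandomForestClassifier(n_estimators=200, n_jobs=-1,
                                     random_state=int(rng.integers(1 << 31)))
        members.append(clf.fit(X[idx], y[idx]))
    return members

def ensemble_scores(members, X):
    """Average the per-forest failure probabilities into one score."""
    return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
```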

    Model-Augmented Estimation of Conditional Mutual Information for Feature Selection

    Markov blanket feature selection, while theoretically optimal, is generally challenging to implement. This is due to the shortcomings of existing approaches to conditional independence (CI) testing, which tend to struggle either with the curse of dimensionality or with computational complexity. We propose a novel two-step approach which facilitates Markov blanket feature selection in high dimensions. First, neural networks are used to map features to low-dimensional representations. In the second step, CI testing is performed by applying the k-NN conditional mutual information estimator to the learned feature maps. The mappings are designed to ensure that mapped samples both preserve information and share similar information about the target variable if and only if they are close in Euclidean distance. We show that these properties boost the performance of the k-NN estimator in the second step. The performance of the proposed method is evaluated on both synthetic and real data.
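    For the second step, here is a sketch of a k-NN conditional mutual information estimator (a Frenzel-Pompe/KSG-style construction) that could be applied to the learned feature maps; the estimator variant and hyperparameters the paper actually uses are not specified here, so treat this as an assumption-laden illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_cmi(x, y, z, k=5):
    """k-NN estimate of I(X; Y | Z) from (n_samples, d) arrays."""
    x, y, z = (np.asarray(a, float).reshape(len(a), -1) for a in (x, y, z))
    n = len(x)
    xyz, xz, yz = np.hstack([x, y, z]), np.hstack([x, z]), np.hstack([y, z])

    # Distance to the k-th nearest neighbour in the joint space (max-norm), excluding self.
    eps = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]

    tree_xz, tree_yz, tree_z = cKDTree(xz), cKDTree(yz), cKDTree(z)
    total = 0.0
    for i in range(n):
        r = eps[i] - 1e-12  # count points strictly inside the k-NN ball
        n_xz = len(tree_xz.query_ball_point(xz[i], r, p=np.inf)) - 1
        n_yz = len(tree_yz.query_ball_point(yz[i], r, p=np.inf)) - 1
        n_z = len(tree_z.query_ball_point(z[i], r, p=np.inf)) - 1
        total += digamma(k) - digamma(n_xz + 1) - digamma(n_yz + 1) + digamma(n_z + 1)
    return max(total / n, 0.0)
```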

    Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether nodes will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88%, with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of jobs' executions (by redirecting them to other nodes when a failure of the present node is predicted) that would otherwise have been wasted due to failures. [...]
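    As a sketch of the evaluation protocol quoted above, the snippet below picks the score threshold that maximizes the true positive rate subject to a 5% false-positive budget and reports TPR and precision at that operating point; function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_score, recall_score

def operating_point(y_true, scores, max_fpr=0.05):
    """Largest-TPR threshold whose false-positive rate stays within `max_fpr`."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    ok = fpr <= max_fpr                          # candidate operating points
    thr = thresholds[ok][np.argmax(tpr[ok])]
    y_pred = (scores >= thr).astype(int)
    return {"threshold": float(thr),
            "tpr": recall_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred)}
```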

    Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring

    Maintaining the health of IT infrastructure components for improved reliability and availability has been a research and innovation topic for many years. Identification and handling of failures are crucial and challenging due to the complexity of IT infrastructure. System logs are the primary source of information for diagnosing and fixing failures. In this work, we address three essential research dimensions about failures: the need for failure handling in IT infrastructure, the contribution of system-generated logs to failure detection, and the reactive and proactive approaches used to deal with failure situations. This study performs a comprehensive analysis of existing literature by considering three prominent aspects: log preprocessing, anomaly and failure detection, and failure prevention. With this coherent review, we (1) establish the need for IT infrastructure monitoring to avoid downtime, (2) examine three types of approaches for anomaly and failure detection, namely rule-based, correlation-based, and classification-based methods, and (3) formulate recommendations for researchers on further research directions. To the best of the authors' knowledge, this is the first comprehensive literature review on IT infrastructure monitoring techniques. The review has been conducted with the help of a meta-analysis and a comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in the area of IT infrastructure failure detection and will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem.