7,585 research outputs found

    Predicting Hard Disk Failures in Data Centers Using Temporal Convolutional Neural Networks

    In modern data centers, storage system failures are major contributors to downtime and maintenance costs. Predicting these failures by collecting measurements from disks and analyzing them with machine learning techniques can effectively reduce their impact, enabling timely maintenance. While there is a vast literature on this subject, most approaches attempt to predict hard disk failures using either classic machine learning solutions, such as Random Forests (RFs), or deep Recurrent Neural Networks (RNNs). In this work, we address hard disk failure prediction using Temporal Convolutional Networks (TCNs), a novel type of deep neural network for time series analysis. Using a real-world dataset, we show that TCNs outperform both RFs and RNNs. Specifically, we improve the Fault Detection Rate (FDR) by ≈ 7.5% (FDR = 89.1%) compared to the state of the art, while simultaneously reducing the False Alarm Rate (FAR = 0.052%). Moreover, we explore the network architecture design space, showing that TCNs are consistently superior to RNNs for a given model size and complexity, and that even relatively small TCNs can reach satisfactory performance. All the code to reproduce the results presented in this paper is available at https://github.com/ABurrello/tcn-hard-disk-failure-prediction
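The abstract reports FDR and FAR without defining them. As a minimal sketch, assuming the definitions commonly used in disk failure prediction (FDR is the fraction of failed drives that are flagged, FAR is the fraction of healthy drives that are wrongly flagged), the two metrics can be computed as follows. The function name `fdr_far` and the label encoding (1 = failed, 0 = healthy) are illustrative assumptions, not taken from the paper or its repository.

```python
def fdr_far(y_true, y_pred):
    """Return (FDR, FAR) for binary labels, assuming 1 = failed and 0 = healthy.

    FDR = TP / (TP + FN): share of failed drives that were detected.
    FAR = FP / (FP + TN): share of healthy drives that raised a false alarm.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty classes so the sketch never divides by zero.
    fdr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return fdr, far


if __name__ == "__main__":
    # Toy labels: 4 failed drives (3 detected), 6 healthy drives (1 false alarm).
    truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    preds = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    print(fdr_far(truth, preds))
```

Because healthy drives vastly outnumber failed ones in practice, even a small FAR can correspond to many false alarms, which is why the two metrics are usually reported together.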

    Diffusion-based Time Series Data Imputation for Microsoft 365

    Reliability is extremely important for large-scale cloud systems like Microsoft 365. Cloud failures such as disk failures and node failures threaten service reliability, resulting in online service interruptions and economic loss. Existing works focus on predicting cloud failures and proactively taking action before failures happen. However, they suffer from poor data quality, such as missing data in model training and prediction, which limits their performance. In this paper, we focus on enhancing data quality through data imputation with the proposed Diffusion+, a sample-efficient diffusion model that imputes missing data efficiently based on the observed data. Our experiments and application practice show that our model contributes to improving the performance of the downstream failure prediction task.

    IU PTI/UITS Research Technologies Annual Report: FY 2014

    This Fiscal Year 2014 (FY2014) report outlines IU community accomplishments using IU's cyberinfrastructure, as they relate to several IU Bicentennial Strategic Plan goals and ongoing principles of excellence. The report includes research and discovery highlights.

    Parameters selection for information storage reliability assessment and prediction by absolute values

    © 2018, Institute of Advanced Scientific Research, Inc. All rights reserved. The problem of choosing parameters for estimating and predicting the reliability of an information storage device is considered. The difficulty is that manufacturers of hard disk drives do not always fill SMART parameters unambiguously with corresponding values for different models. In addition, some parameters are sometimes empty, while others contain only zero values. The scientific task of the research is to define a set of parameters that allows estimating and predicting the reliability of each individual storage device, of any model from any manufacturer, so that it can be replaced in time. For this purpose, drives were grouped separately into normally operating, early-decommissioned, and failed. The scale of values for each parameter was divided into ranges, and the number of storage devices falling within each range was counted. The distribution of storage devices was studied in absolute values for each parameter under consideration. The following conditions were used to select parameters suitable for estimating and predicting reliability based on their values: 1) the number of normally operating drives whose reliability parameter value lies in the range of large values should always be less than the number of failed drives in that range; 2) for large values of reliability parameters, the number of drives should increase monotonically across the series: normally operating, early removed, and failed; 3) the first two conditions must hold both in general and in particular, for example for the drives of each manufacturer separately. Nine parameters were selected, based on the study of absolute values, as suitable for evaluating and predicting the reliability of data storage devices: 1 Raw read error rate, 5 Reallocated sectors count, 7 Seek error rate, 10 Spin-up retry count, 184 End-to-end error, 187 Reported uncorrectable errors, 196 Reallocation event count, 197 Current pending sector count, 198 Uncorrectable sector count.
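The three selection conditions in this abstract can be illustrated with a short sketch. It is a simplified reading, not the authors' procedure: it assumes a single hypothetical `threshold` marks where the "range of large values" begins, treats condition 2 as a non-strict increase, and uses made-up names (`count_large`, `parameter_is_suitable`, `suitable_everywhere`) and toy data.

```python
def count_large(values, threshold):
    """Count drives whose parameter value falls in the 'large values' range."""
    return sum(1 for v in values if v >= threshold)


def parameter_is_suitable(normal, early_removed, failed, threshold):
    """Check conditions 1 and 2 for one SMART parameter on one set of drives."""
    n = count_large(normal, threshold)
    e = count_large(early_removed, threshold)
    f = count_large(failed, threshold)
    cond1 = n < f        # fewer normally operating drives than failed ones
    cond2 = n <= e <= f  # counts increase: normal, early removed, failed
    return cond1 and cond2


def suitable_everywhere(groups_by_vendor, threshold):
    """Condition 3: conditions 1 and 2 hold overall and for every manufacturer.

    groups_by_vendor maps a manufacturer name to a tuple of three lists:
    (normal, early_removed, failed) parameter values.
    """
    pooled = ([], [], [])
    for groups in groups_by_vendor.values():
        for target, values in zip(pooled, groups):
            target.extend(values)
    overall = parameter_is_suitable(*pooled, threshold)
    per_vendor = all(
        parameter_is_suitable(n, e, f, threshold)
        for n, e, f in groups_by_vendor.values()
    )
    return overall and per_vendor


if __name__ == "__main__":
    # Toy raw values of one parameter for two hypothetical manufacturers.
    data = {
        "vendor_a": ([0, 0, 1], [0, 5, 6], [7, 8, 9]),
        "vendor_b": ([0, 0, 0], [0, 0, 4], [3, 10, 0]),
    }
    print(suitable_everywhere(data, threshold=3))
```

Working with absolute counts, as the abstract describes, means the result depends on group sizes; a variant based on per-group fractions would be a different method.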