
    Redundancy and Aging of Efficient Multidimensional MDS-Parity Protected Distributed Storage Systems

    The effect of redundancy on the aging of an efficient Maximum Distance Separable (MDS) parity-protected distributed storage system consisting of multidimensional arrays of storage units is explored. In light of experimental evidence and survey data, this paper develops generalized expressions for the reliability of array storage systems based on more realistic time-to-failure distributions such as the Weibull distribution. Specifically, a distributed disk array system is considered in which the array components are dispersed across the network and subject to independent failure rates. Based on this model, generalized closed-form hazard rate expressions are derived. These expressions are extended to estimate the asymptotic reliability behavior of large-scale storage networks equipped with MDS parity-based protection. Unlike previous studies, a generic hazard rate function is assumed, a generic MDS code is used for parity generation, and the implications of an adjustable redundancy level for an efficient distributed storage system are evaluated. The results of this study are applicable to any erasure correction code, provided it is accompanied by a suitable structure and an appropriate encoding/decoding algorithm such that the MDS property is maintained.
    Comment: 11 pages, 6 figures. Accepted for publication in IEEE Transactions on Device and Materials Reliability (TDMR), Nov. 201
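
    The paper's generalized closed-form expressions are not reproduced here, but the standard building blocks they generalize can be sketched. The snippet below is a minimal illustration rather than the paper's derivation: it assumes independent, identically Weibull-distributed unit lifetimes and a single (n, k) MDS-protected group that survives as long as at most n - k units have failed; all function names and parameter values are illustrative.

        # Minimal illustration (not the paper's derivation): reliability of one
        # (n, k) MDS-parity-protected group of n storage units with independent
        # Weibull lifetimes. The group survives while at most n - k units have failed.
        from math import comb, exp

        def weibull_reliability(t, shape, scale):
            # R(t) = exp(-(t / scale)**shape); shape > 1 gives an increasing hazard (aging).
            return exp(-((t / scale) ** shape))

        def weibull_hazard(t, shape, scale):
            # h(t) = (shape / scale) * (t / scale)**(shape - 1)
            return (shape / scale) * (t / scale) ** (shape - 1)

        def mds_group_reliability(t, n, k, shape, scale):
            # P(at most n - k of the n i.i.d. units have failed by time t), binomial sum.
            r = weibull_reliability(t, shape, scale)
            return sum(comb(n, i) * (1 - r) ** i * r ** (n - i) for i in range(n - k + 1))

        # Example: a (14, 10) group with a 5-year characteristic life, evaluated at 2 years.
        print(mds_group_reliability(t=2.0, n=14, k=10, shape=1.2, scale=5.0))

    In the paper's more general setting, the fixed-shape Weibull hazard used here would be replaced by a generic hazard rate function, and the single group by multidimensional arrays of such groups.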

    Scalable File Systems for High Performance Computing Final Report

    Reliability Model and Assessment of Redundant Arrays of Inexpensive Disks (RAID) Incorporating Latent Defects and Non-Homogeneous Poisson Process Events

    Today's most reliable data storage systems are made of redundant arrays of inexpensive disks (RAID). The quantification of RAID system reliability is often based on models that omit critical hard disk drive failure modes, assume all failure and restoration rates are constant (exponential distributions), and assume the RAID group times to failure follow a homogeneous Poisson process (HPP). This paper presents a comprehensive reliability model that accounts for numerous failure causes of today's hard disk drives, allows proper representation of repair and restoration, and does not rely on the assumption of an HPP for the RAID group. The model does not assume hard disk drives have constant transition rates, but allows each hard disk drive "slot" in the RAID group to have its own set of distributions, closed-form or user-defined. Hard disk drive (HDD) failure distributions derived from field usage are presented, showing that failure distributions are commonly non-homogeneous and frequently have increasing hazard rates from time zero. Hard disk drive failure modes and causes are presented and used to develop a model that reflects not only complete failure but also degraded conditions due to undetected corrupted data (latent defects). The model can represent user-defined distributions for completion of "background scrubbing" to correct (remove) corrupted data. Sequential Monte Carlo simulation is used to determine the number of double disk failures expected as a function of time. The RAID group can be any size up to 25 drives. The results are presented as mean cumulative failure distributions for the RAID group. Results estimate that over 10 years the number of double disk failures can be as much as 5000 times greater than that predicted by the mean-time-to-data-loss method or by Markov models when the characteristic lives of the input distributions are the same. Model results are compared to actual field data for two HDD families and two different RAID group sizes and show good correlation. Results show that the rate of occurrence of failure for the RAID group may be increasing, decreasing, or constant depending on the parameters used for the four input distributions.
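
    The sequential Monte Carlo approach described above can be illustrated with a heavily simplified sketch. The code below is not the paper's model: it ignores latent defects, background scrubbing, and per-slot distributions, assumes Weibull drive lifetimes with a fixed rebuild time and immediate replacement, and counts a double disk failure whenever a drive fails while a previous failure's rebuild is still in progress; all parameter values are illustrative.

        # Simplified sketch of sequential Monte Carlo for double disk failures in one
        # RAID group (assumptions: Weibull lifetimes, fixed rebuild time, immediate
        # replacement with a new drive, no latent defects or scrubbing).
        import random

        def mean_double_disk_failures(n_drives=8, shape=1.3, scale_hours=300_000,
                                      rebuild_hours=24.0, mission_hours=10 * 8760,
                                      trials=20_000):
            total_ddf = 0
            for _ in range(trials):
                # Draw the first lifetime for every drive slot.
                next_fail = [random.weibullvariate(scale_hours, shape) for _ in range(n_drives)]
                rebuild_end = -1.0  # completion time of the rebuild in progress, if any
                while True:
                    slot = min(range(n_drives), key=lambda i: next_fail[i])
                    t = next_fail[slot]
                    if t > mission_hours:
                        break
                    if t <= rebuild_end:      # second failure during an ongoing rebuild
                        total_ddf += 1
                    rebuild_end = t + rebuild_hours
                    # Replace the failed drive; its new lifetime starts at time t.
                    next_fail[slot] = t + random.weibullvariate(scale_hours, shape)
            return total_ddf / trials  # mean double disk failures per group over the mission

        print(mean_double_disk_failures())

    In the full model, each slot would presumably draw from its own closed-form or user-defined distribution, and additional events such as latent defect arrivals and scrubbing completions would compete with drive failures in the same event loop.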

    Achieving High Reliability and Efficiency in Maintaining Large-Scale Storage Systems through Optimal Resource Provisioning and Data Placement

    With the explosive increase in the amount of data generated by various applications, large-scale distributed and parallel storage systems have become common data storage solutions and have been widely deployed in both industry and academia. While these high-performance storage systems significantly accelerate data storage and retrieval, they also introduce critical issues in system maintenance and management. In this dissertation, I propose three methodologies to address three of these critical issues. First, I develop an optimal resource management and spare provisioning model to minimize the impact of component failures and keep large-scale storage systems highly operational. Second, to cost-effectively integrate solid-state drives (SSDs) into large-scale storage systems, I design a holistic algorithm that adaptively predicts the popularity of data objects by leveraging the temporal locality in their access patterns and adjusts their placement between solid-state drives and regular hard disk drives, improving both the data access throughput and the storage space efficiency of large-scale heterogeneous storage systems. Finally, I propose a new checkpoint placement optimization model that maximizes the computation efficiency of large-scale scientific applications while satisfying the endurance requirements of the SSD-based burst buffer in high-performance hierarchical storage systems. All these models and algorithms are validated through extensive evaluation using data collected from deployed large-scale storage systems, and the results demonstrate that they can significantly improve the reliability and efficiency of large-scale distributed and parallel storage systems.
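
    The abstract does not spell out the placement algorithm itself, so the sketch below shows only one common way a temporal-locality-based policy can work, not the dissertation's design: each object keeps an exponentially decayed access count, and the hottest objects are kept on SSD up to a capacity budget. The class name, half-life, and capacity values are illustrative assumptions.

        # Illustrative only: a simple temporal-locality popularity tracker and
        # SSD/HDD placement policy (not the dissertation's algorithm).
        import time
        from collections import defaultdict

        HALF_LIFE_SECONDS = 3600.0        # assumed decay half-life for access recency
        SSD_CAPACITY_BYTES = 100 * 2**30  # assumed SSD capacity budget

        class PopularityPlacer:
            def __init__(self):
                self.score = defaultdict(float)  # object id -> decayed access count
                self.last_seen = {}              # object id -> last access time

            def record_access(self, obj_id, now=None):
                # Decay the old count by elapsed time, then add the new access.
                now = time.time() if now is None else now
                last = self.last_seen.get(obj_id, now)
                decay = 0.5 ** ((now - last) / HALF_LIFE_SECONDS)
                self.score[obj_id] = self.score[obj_id] * decay + 1.0
                self.last_seen[obj_id] = now

            def plan_placement(self, sizes):
                # Hottest objects fill the SSD budget first; everything else goes to HDD.
                placement, used = {}, 0
                for obj_id in sorted(self.score, key=self.score.get, reverse=True):
                    size = sizes.get(obj_id, 0)
                    if used + size <= SSD_CAPACITY_BYTES:
                        placement[obj_id] = "ssd"
                        used += size
                    else:
                        placement[obj_id] = "hdd"
                return placement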