13 research outputs found

    RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)

    Full text link
    The RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in the 1990s, the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data, obviating the need for further redundancy. Variations of RAID based on local recovery codes and partial MDS codes reduce recovery cost. NAND flash Solid State Disks (SSDs) have lower latency and higher bandwidth, are more reliable, consume less power, and have a lower TCO than Hard Disk Drives, making them more viable for hyperscalers. Comment: Submitted to ACM Computing Surveys. arXiv admin note: substantial text overlap with arXiv:2306.0876
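As a concrete illustration of the parity-based redundancy this abstract refers to, below is a minimal Python sketch of RAID-5-style XOR parity and single-disk reconstruction. It is illustrative only; real arrays rotate parity across disks and operate at the block-device level.

```python
# Minimal sketch of RAID-5-style XOR parity and recovery (illustrative only).
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (parity computation)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# One stripe: three data blocks plus their parity block.
data = [b"disk0aa", b"disk1bb", b"disk2cc"]
parity = xor_blocks(data)

# Simulate losing disk 1: XOR of the survivors and the parity recovers it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
print(recovered)  # b'disk1bb'
```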

    Developing New Power Management and High-Reliability Schemes in Data-Intensive Environment

    Get PDF
    With the increasing popularity of data-intensive applications and of large-scale computing and storage systems, current data centers and supercomputers often deal with extremely large data sets. To store and process this huge amount of data reliably and energy-efficiently, system designers must consider three major challenges. Firstly, power conservation: multicore processors, or chip multiprocessors (CMPs), have become mainstream in the processor market thanks to tremendous improvements in transistor density and semiconductor technology. However, the increasing number of transistors on a single die or chip brings a super-linear growth in power consumption [4]. Balancing system performance against power saving is therefore a critical issue that needs to be solved effectively. Secondly, system reliability: reliability is a critical metric in the design and development of replication-based big data storage systems such as the Hadoop File System (HDFS). In a system with thousands of machines and storage devices, even infrequent failures become likely. In the Google File System, the annual disk failure rate is 2.88%, which meant about 8,760 expected disk failures per year (at that rate, this corresponds to a fleet on the order of 300,000 disks). Unfortunately, how often a cluster starts losing data as it scales out, given an increasing number of node failures, has not been well investigated. Thirdly, energy efficiency: the fast processing speeds of the current generation of supercomputers are a great convenience to scientists dealing with extremely large data sets, and the next generation of exascale supercomputers could, for the first time, provide accurate simulation results for the automobile industry, the aerospace industry, and even nuclear fusion reactors. However, the energy cost of supercomputing is extremely high, with total electricity bills of up to 9 million dollars per year. Conserving energy and increasing the energy efficiency of supercomputers has therefore become critical in recent years. This dissertation proposes new solutions to these three key challenges for current large-scale storage and computing systems. Firstly, we propose a novel power management scheme called MAR (model-free, adaptive, rule-based) for multiprocessor systems to minimize CPU power consumption subject to performance constraints. By introducing a new I/O-wait status, MAR is able to accurately describe the relationship between core frequencies, performance, and power consumption. Moreover, we adopt a model-free control method that filters the I/O-wait status out of the traditional CPU busy/idle model in order to respond quickly to bursts and take full advantage of power-saving opportunities. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype achieves 95.8-97.8% of the accuracy of the ideal power-saving strategy calculated offline. Compared with baseline solutions, MAR saves 12.3-16.1% more power while maintaining a comparable performance loss of about 0.78-1.08%. In addition, further simulation results indicate that our design achieves 3.35-14.2% higher power-saving efficiency and 4.2-10.7% less performance loss under various CMP configurations compared with baseline approaches such as LAST, Relax, PID, and MPC.
Secondly, we create a new reliability model that incorporates the probability of replica loss to investigate the system reliability of multi-way declustering data layouts and analyze their potential for parallel recovery. Our comprehensive simulation results in Matlab and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, improving data loss probability and system reliability by up to 63% and 85%, respectively. Our study of both 5-year and 10-year system reliability under various recovery bandwidth settings shows that the shifted declustering layout surpasses the two baseline approaches in both cases, consuming up to 79% and 87% less recovery bandwidth than the copyset layout, and 4.8% and 10.2% less than the random layout. Thirdly, we develop a power-aware job scheduler that applies a rule-based control method and takes real-world power and speedup profiles into account to improve power efficiency while adhering to predetermined power constraints. Intensive simulation results show that our proposed method achieves the maximum utilization of computing resources compared with baseline scheduling algorithms while keeping the energy cost under the threshold. Moreover, by introducing a Power Performance Factor (PPF) based on real-world power and speedup profiles, we are able to increase power efficiency by up to 75%.
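To make the abstract's description of MAR more concrete, here is a hypothetical Python sketch of a rule-based governor that treats I/O-wait time as frequency-insensitive when choosing a core frequency. All names, frequencies, and thresholds are illustrative assumptions, not the dissertation's code.

```python
# Hypothetical rule-based frequency governor in the spirit of MAR.
FREQS_GHZ = [1.2, 1.6, 2.0, 2.4, 2.8]  # assumed available P-states

def pick_frequency(busy, iowait, idle, loss_budget=0.05):
    """Lowest frequency keeping estimated slowdown within loss_budget.

    Key observation from the abstract: only the CPU-busy share of time
    scales with core frequency; time spent in I/O wait does not.
    """
    total = busy + iowait + idle
    cpu_frac = busy / total               # frequency-sensitive share
    f_max = FREQS_GHZ[-1]
    for f in FREQS_GHZ:                   # ascending: prefer low power
        slowdown = cpu_frac * (f_max / f - 1.0)
        if slowdown <= loss_budget:
            return f
    return f_max

# An I/O-bound interval can run slowly at negligible performance cost:
print(pick_frequency(busy=0.03, iowait=0.82, idle=0.15))  # -> 1.2
```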

    Studies of disk arrays tolerating two disk failures and a proposal for a heterogeneous disk array

    Get PDF
    There has been an explosion in the amount of generated data in the past decade. Online access to these data is made possible by large disk arrays, especially in the RAID (Redundant Array of Independent Disks) paradigm. Depending on the RAID level, a disk array can tolerate one or more disk failures, so that the storage subsystem can continue operating despite disk failure(s). RAID 5 is a single-disk-failure-tolerant array that dedicates the capacity of one disk to parity information. The content of the failed disk can be reconstructed on demand and written onto a spare disk. However, RAID 5 does not provide enough protection for data, since data loss may occur when there is a media failure (unreadable sectors) or a second disk failure during the rebuild process. Due to the high cost of downtime in many applications, two-disk-failure-tolerant arrays, such as RAID 6 and EVENODD, have become popular. These schemes use 2/N of the capacity of the array for redundant information in order to tolerate two disk failures. RM2 is another scheme that can tolerate two disk failures, with a slightly higher redundancy ratio. However, the performance of these two-disk-failure-tolerant RAID schemes is impaired, since two check disks have to be updated for each write request. Therefore, their performance, especially in the presence of disk failure(s), is of interest. In the first part of the dissertation, the operations of the RAID 5, RAID 6, EVENODD, and RM2 schemes are described. A cost model is developed for these RAID schemes by analyzing their operations in various operating modes. This cost model measures the volume of data being transmitted and provides a device-independent comparison of the efficiency of the RAID schemes. Based on this cost model, the maximum throughput of a RAID scheme can be obtained given detailed disk characteristics and the RAID configuration. Utilizing an M/G/1 queuing model and other favorable modeling assumptions, a queuing analysis to obtain the mean read response time is described. Simulation is used to validate the analytic results, as well as to evaluate the RAID systems in analytically intractable cases. The second part of this dissertation describes a new disk array architecture, namely the Heterogeneous Disk Array (HDA). The HDA is motivated by several observed trends in storage technology. The HDA architecture allows a disk array to have two forms of heterogeneity: (1) device heterogeneity, i.e., disks of different types can be incorporated in a single HDA; and (2) RAID level heterogeneity, i.e., various RAID schemes can coexist in the same array. The goals of this architecture are (1) to utilize the extra resources (i.e., bandwidth and capacity) introduced by new disk drives in an automated and efficient way, and (2) to use appropriate RAID levels to meet the varying availability requirements of different applications. In HDA, each new object is assigned an appropriate RAID level, and the allocation is carried out so as to keep disk bandwidth and capacity utilizations balanced. Design considerations for the data structures of HDA metadata are described, followed by the actual design of the data structures and flowcharts for the most frequent operations. A data allocation algorithm is then described in detail. Finally, the HDA architecture is prototyped using the DASim simulation toolkit developed at NJIT, and simulation results for an HDA with two RAID levels (RAID 1 and RAID 5) are presented
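The flavor of such a cost and queuing analysis can be sketched in a few lines: classic read-modify-write accounting yields the well-known small-write penalty (four disk accesses for RAID 5, six for RAID 6), and the Pollaczek-Khinchine formula gives the M/G/1 mean response time. This is a generic sketch under textbook assumptions, not the dissertation's actual model.

```python
# Small-write cost accounting and M/G/1 response time (textbook sketch).

def small_write_ops(check_disks):
    """Read old data + old check blocks, write new data + new check blocks."""
    return 2 * (1 + check_disks)   # RAID 5 -> 4 accesses, RAID 6 -> 6

def mg1_mean_response(lam, es, es2):
    """Pollaczek-Khinchine: R = E[S] + lam*E[S^2] / (2*(1 - rho))."""
    rho = lam * es
    assert rho < 1, "queue is unstable"
    return es + lam * es2 / (2 * (1 - rho))

print(small_write_ops(1))  # RAID 5: 4
print(small_write_ops(2))  # RAID 6: 6
# Disk with mean service time 5 ms, second moment 30 ms^2, 100 req/s:
print(mg1_mean_response(lam=0.1, es=5.0, es2=30.0))  # -> 8.0 ms
```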

    Reducing the Overhead of Memory Space, Network Communication and Disk I/O for Analytic Frameworks in Big Data Ecosystem

    Get PDF
    To facilitate big data processing, many distributed analytic frameworks and storage systems, such as Apache Hadoop, Apache Hama, Apache Spark, and the Hadoop Distributed File System (HDFS), have been developed. Many researchers are currently working either to make them more scalable or to enable them to support more analysis applications. In my PhD study, I conducted three main pieces of work on this topic: minimizing communication delay in Apache Hama, minimizing memory space and computational overhead in HDFS, and minimizing disk I/O overhead for approximation applications in the Hadoop ecosystem. Specifically, in Apache Hama, communication delay makes up a large percentage of the overall graph processing time. While most recent research has focused on reducing the number of network messages, we add a runtime communication and computation scheduler to overlap them as much as possible, thereby mitigating communication delay. In HDFS, the block-location table and its maintenance can occupy more than half of the memory space and 30% of the processing capacity of the master node, which severely limits the master node's scalability and performance. We propose Deister, which uses deterministic mathematical calculations to eliminate the huge block-location table and its maintenance. My third work enables both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Existing offline-sampling-based approximation systems are not adaptive to dynamic query workloads, and online-sampling-based systems suffer from low I/O efficiency and poor estimation accuracy. We therefore develop a distribution-aware method called Sapprox. Our idea is to collect the occurrences of a sub-dataset at each logical partition of a dataset (its storage distribution) in the distributed system at very small cost, and to use this information to facilitate online sampling
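A hedged sketch of the idea attributed to Deister above: derive a block's replica locations from a deterministic calculation so that no location table needs to be kept in the master's memory. The hash-based mapping below is an illustrative stand-in, not Deister's actual scheme.

```python
# Deterministic block-to-datanode mapping (illustrative stand-in).
import hashlib

DATANODES = [f"dn{i:03d}" for i in range(100)]  # hypothetical cluster

def block_replicas(block_id: str, replication: int = 3):
    """Deterministically map a block ID to `replication` distinct datanodes."""
    h = int(hashlib.sha256(block_id.encode()).hexdigest(), 16)
    start = h % len(DATANODES)
    # Any node can recompute this mapping on the fly, so the master need
    # not store or maintain a block-location table in memory.
    return [DATANODES[(start + k) % len(DATANODES)] for k in range(replication)]

print(block_replicas("blk_1073741825"))
```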

    ClusterRAID: Architecture and Prototype of a Distributed Fault-Tolerant Mass Storage System for Clusters

    Get PDF
    During the past few years, clusters built from commodity off-the-shelf (COTS) components have emerged as the predominant supercomputer architecture. Typically comprising a collection of standard PCs or workstations and an interconnection network, they have replaced traditional integrated systems due to their better price/performance ratio. As paradigms shift from merely compute-intensive to I/O-intensive applications, mass storage solutions for cluster installations become an increasingly crucial aspect of these systems. The inherent unreliability of the underlying components is one reason why no system has yet been established as a standard storage solution for clusters. This thesis sets out the architecture and prototype implementation of a novel distributed mass storage system for commodity off-the-shelf clusters and addresses the issue of unreliable constituent components. The key concept of the presented system is the conversion of the local hard disk drive of a cluster node into a reliable device while preserving the block device interface. By deploying sophisticated erasure-correcting codes, the system allows the number of tolerable failures, and thus the overall reliability, to be adjusted. In addition, the applied data layout considers the access behaviour of a broad range of applications and minimizes the number of required network transactions. Extensive measurements and functionality tests of the prototype, both stand-alone and in conjunction with local or distributed file systems, show the validity of the concept
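A back-of-the-envelope sketch of why an adjustable number of tolerable failures matters: under an independent-failure assumption, the probability that a group of n devices loses data drops sharply with each additional tolerated failure f. The parameters below are illustrative, not from the thesis.

```python
# P(data loss) for an n-device group tolerating f concurrent failures,
# assuming independent per-device failure probability p (illustrative).
from math import comb

def loss_probability(n, f, p):
    """P(more than f of n devices fail) under independent failures."""
    survive = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(f + 1))
    return 1 - survive

# 16 node-local disks, 3% per-disk failure probability over some window:
for f in range(4):
    print(f"tolerate {f} failure(s) -> loss prob {loss_probability(16, f, 0.03):.2e}")
```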

    Using Rotational Mirrored Declustering for Replica Placement in a Disk-Array-Based Video Server

    No full text
    In a video-on-demand (VOD) environment, disk arrays are often used to supply the required disk bandwidth. Disk failure can therefore pose serious problems for available disk bandwidth. In this paper, we explore replicating frequently accessed movies to provide the high data bandwidth and fault tolerance required in a disk-array-based video server. An isochronous continuous video stream imposes different requirements from the random access patterns of databases or file systems. Specifically, we propose a new replica placement method, called rotational mirrored declustering (RMD), to support high data availability for disk arrays in a VOD environment. In essence, RMD is similar to conventional mirrored declustering in that replicas are stored in different disk arrays, but differs from the latter in that the replica placements in the different disk arrays are properly rotated. Combining the merits of prior chained and mirrored declustering methods, RMD is particularly s..
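To illustrate the rotation idea described above (the paper's exact mapping may differ), a toy placement function can put each block's replica in the next array with a rotated disk index, so that a failed disk's read load spreads across many disks of the mirror array rather than hitting a single mirror disk.

```python
# Toy rotated replica placement, loosely in the spirit of RMD (assumed
# mapping for illustration only; not the paper's scheme).
NUM_ARRAYS = 3
DISKS_PER_ARRAY = 5

def place(block_id: int):
    primary_array = block_id % NUM_ARRAYS
    primary_disk = (block_id // NUM_ARRAYS) % DISKS_PER_ARRAY
    replica_array = (primary_array + 1) % NUM_ARRAYS
    # Rotation: shift the disk index so consecutive blocks of one disk's
    # content land on different disks of the mirror array.
    replica_disk = (primary_disk + block_id) % DISKS_PER_ARRAY
    return (primary_array, primary_disk), (replica_array, replica_disk)

for b in range(6):
    print(b, place(b))
```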

    Arquitectura multiagente para E/S de alto rendimiento en clusters (Multi-agent Architecture for High-Performance I/O in Clusters)

    Full text link
    I/O is currently one of the main bottlenecks of general-purpose distributed systems, owing to the imbalance between computation time and I/O time. One proposed solution to this problem is parallel I/O, an area that has produced a large number of parallel I/O libraries and parallel file systems. Such systems suffer from several shortcomings. Many were conceived for parallel machines and do not integrate well into distributed environments and clusters. The intensive use of clusters of workstations in recent years makes these systems ill-suited to the current computing landscape. Other systems, which do adapt to such environments, lack dynamic reconfiguration capabilities and therefore offer limited functionality. Finally, most I/O systems that apply various I/O optimizations give applications no flexibility to exploit them, attempting instead to hide these techniques from the user. Yet in order to optimize I/O operations, it is important that applications be able to describe their access patterns by interacting with the I/O system. In a different context, within the area of distributed systems lies the agent paradigm, which endows applications with a set of properties well suited to adapting to complex and dynamic environments. The characteristics of this paradigm make it a priori promising for addressing some of the open problems in parallel I/O. This thesis proposes a solution to the current I/O problem along three main lines: (i) the use of agent theory in high-performance I/O systems, (ii) the definition of a formalism enabling the dynamic reconfiguration of storage nodes in a cluster, and (iii) the use of configurable, application-oriented I/O optimization techniques
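As a purely hypothetical sketch of the kind of application-facing interface line (iii) argues for, an application could declare its access pattern and let the I/O system select matching optimizations. None of the names below come from the thesis.

```python
# Hypothetical access-pattern hint interface (illustrative assumption).
from dataclasses import dataclass

@dataclass
class AccessPattern:
    kind: str          # "sequential", "strided", or "random"
    block_size: int    # typical request size in bytes
    readers: int       # number of concurrent reading processes

def choose_optimizations(p: AccessPattern):
    """Map a declared access pattern to a list of I/O optimizations."""
    opts = []
    if p.kind == "sequential":
        opts.append("prefetch")
    if p.kind == "strided" and p.readers > 1:
        opts.append("collective-io")   # merge small strided requests
    if p.block_size < 64 * 1024:
        opts.append("write-aggregation")
    return opts or ["default"]

print(choose_optimizations(AccessPattern("strided", 4096, readers=8)))
```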

    Removal of antagonistic spindle forces can rescue metaphase spindle length and reduce chromosome segregation defects

    Get PDF
    Regular Abstracts - Tuesday Poster Presentations: no. 1925
    Metaphase describes a phase of mitosis where chromosomes are attached and oriented on the bipolar spindle for subsequent segregation at anaphase. In diverse cell types, the metaphase spindle is maintained at a relatively constant length. Metaphase spindle length is proposed to be regulated by a balance of pushing and pulling forces generated by distinct sets of spindle microtubules and their interactions with motors and microtubule-associated proteins (MAPs). Spindle length appears important for chromosome segregation fidelity, as cells with shorter or longer than normal metaphase spindles, generated through deletion or inhibition of individual mitotic motors or MAPs, showed chromosome segregation defects. To test the force balance model of spindle length control and its effect on chromosome segregation, we applied fast microfluidic temperature control with live-cell imaging to monitor the effect of switching off different combinations of antagonistic forces in the fission yeast metaphase spindle. We show that the spindle midzone proteins kinesin-5 cut7p and microtubule bundler ase1p contribute to outward pushing forces, and the spindle kinetochore proteins kinesin-8 klp5/6p and dam1p contribute to inward pulling forces. Removing these proteins individually led to aberrant metaphase spindle length and chromosome segregation defects. Removing these proteins in antagonistic combinations rescued the defective spindle length and, in some combinations, also partially rescued chromosome segregation defects. Our results stress the importance of proper chromosome-to-microtubule attachment over spindle length regulation for proper chromosome segregation.