15 research outputs found
Efficient data mappings for parity-declustered data layouts
The joint demands of high performance and fault tolerance in a large array of disks can be satisfied by a parity-declustered data layout. Such a data layout is generated by partitioning the data on the disks into stripes and choosing a part of each stripe to hold redundant information. Thus the data layout can be represented as a table of stripes. The data mapping problem is the problem of translating a data address into a disk identifier and an offset on that disk. Recent work has yielded mappings that compute disks and offsets directly from data addresses without the need to store tables. In this paper, we show that parity-declustered data layouts based on commutative rings yield mappings with improved computational efficiency and wider applicability.
RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)
The RAID proposal advocated replacing large disks with arrays of PC disks, but as
the capacity of small disks increased 100-fold in the 1990s the production of large
disks was discontinued. Storage dependability is increased via replication or
erasure coding. Cloud storage providers store multiple copies of data, obviating
the need for further redundancy. Variations of RAID based on local recovery
codes and partial MDS codes reduce recovery cost. NAND flash Solid State Disks (SSDs)
have low latency and high bandwidth, are more reliable, consume less power and
have a lower TCO than Hard Disk Drives, which are more viable for hyperscalers.
Comment: Submitted to ACM Computing Surveys. arXiv admin note: substantial
text overlap with arXiv:2306.0876
Scalability of RAID systems
RAID systems (Redundant Arrays of Inexpensive Disks) have dominated backend
storage systems for more than two decades and have grown continuously in size
and complexity. Currently they face unprecedented challenges from data intensive
applications such as image processing, transaction processing and data warehousing.
As the size of RAID systems increases, designers are faced with both performance and
reliability challenges. These challenges include limited back-end network bandwidth,
physical interconnect failures, correlated disk failures and long disk reconstruction
time.
This thesis studies the scalability of RAID systems in terms of both performance
and reliability through simulation, using a discrete event driven simulator for RAID
systems (SIMRAID) developed as part of this project. SIMRAID incorporates two
benchmark workload generators, based on the SPC-1 and Iometer benchmark specifications.
Each component of SIMRAID is highly parameterised, enabling it to explore
a large design space. To improve the simulation speed, SIMRAID develops a set of
abstraction techniques to extract the behaviour of the interconnection protocol without
losing accuracy. Finally, to meet the technology trend toward heterogeneous storage
architectures, SIMRAID develops a framework that allows easy modelling of different
types of device and interconnection technique.
Simulation experiments were first carried out on performance aspects of scalability.
They were designed to answer two questions: (1) given a number of disks, which
factors affect back-end network bandwidth requirements; (2) given an interconnection
network, how many disks can be connected to the system. The results show that
the bandwidth requirement per disk is primarily determined by workload features and
stripe unit size (a smaller stripe unit size has better scalability than a larger one), with
cache size and RAID algorithm having very little effect on this value. The maximum
number of disks is limited, as would be expected, by the back-end network bandwidth.
Studies of reliability have led to three proposals to improve the reliability and scalability
of RAID systems. Firstly, a novel data layout called PCDSDF is proposed.
PCDSDF combines the advantages of orthogonal data layouts and parity declustering
data layouts, so that it not only survives multiple disk failures caused by physical interconnect
failures or correlated disk failures, but also has good degraded and rebuild
performance. The generating process of PCDSDF is deterministic and time-efficient.
The number of stripes per rotation (namely the number of stripes to achieve rebuild workload balance) is small. Analysis shows that the PCDSDF data layout can significantly
improve the system reliability. Simulations performed on SIMRAID confirm
the good performance of PCDSDF, which is comparable to other parity declustering
data layouts, such as RELPR.
Secondly, a system architecture and rebuilding mechanism have been designed,
aimed at fast disk reconstruction. This architecture is based on parity declustering data
layouts and a disk-oriented reconstruction algorithm. It uses stripe groups instead of
stripes as the basic distribution unit so that it can make use of the sequential nature of
the rebuilding workload. The design space of system factors such as parity declustering
ratio, chunk size, private buffer size of surviving disks and free buffer size is explored
to provide guidelines for storage system design.
Thirdly, an efficient distributed hot spare allocation and assignment algorithm for
general parity declustering data layouts has been developed. This algorithm avoids
conflict problems in the process of assigning distributed spare space for the units on
the failed disk. Simulation results show that it effectively solves the write bottleneck
problem and, at the same time, there is only a small increase in the average response
time to user requests.
Developing New Power Management and High-Reliability Schemes in Data-Intensive Environment
With the increasing popularity of data-intensive applications as well as large-scale computing and storage systems, current data centers and supercomputers often deal with extremely large data sets. To store and process this huge amount of data reliably and energy-efficiently, system designers should take three major challenges into consideration. Firstly, power conservation: multicore processors or CMPs have become mainstream in the current processor market because of the tremendous improvement in transistor density and the advancement in semiconductor technology. However, the increasing number of transistors on a single die or chip reveals a super-linear growth in power consumption [4]. Thus, how to balance system performance and power saving is a critical issue which needs to be solved effectively. Secondly, system reliability: reliability is a critical metric in the design and development of replication-based big data storage systems such as Hadoop File System (HDFS). In a system with thousands of machines and storage devices, even infrequent failures become likely. In Google File System, the annual disk failure rate is 2.88%, which means you were expected to see 8,760 disk failures in a year. Unfortunately, given an increasing number of node failures, how often a cluster starts losing data when being scaled out is not well investigated. Thirdly, energy efficiency: the fast processing speeds of the current generation of supercomputers provide a great convenience to scientists dealing with extremely large data sets. The next generation of exascale supercomputers could provide accurate simulation results for the automobile industry, aerospace industry, and even nuclear fusion reactors for the very first time. However, the energy cost of super-computing is extremely high, with a total electricity bill of 9 million dollars per year.
Thus, conserving energy and increasing the energy efficiency of supercomputers has become critical in recent years. This dissertation proposes new solutions to address the above three key challenges for current large-scale storage and computing systems. Firstly, we propose a novel power management scheme called MAR (model-free, adaptive, rule-based) in multiprocessor systems to minimize the CPU power consumption subject to performance constraints. By introducing a new I/O wait status, MAR is able to accurately describe the relationship between core frequencies, performance and power consumption. Moreover, we adopt a model-free control method to filter out the I/O wait status from the traditional CPU busy/idle model in order to achieve fast responsiveness to burst situations and take full advantage of power saving. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype system achieves 95.8-97.8% accuracy of the ideal power saving strategy calculated offline. Compared with baseline solutions, MAR is able to save 12.3-16.1% more power while maintaining a comparable performance loss of about 0.78-1.08%. In addition, more simulation results indicate that our design achieved 3.35-14.2% more power saving efficiency and 4.2-10.7% less performance loss under various CMP configurations as compared with various baseline approaches such as LAST, Relax, PID and MPC. Secondly, we create a new reliability model by incorporating the probability of replica loss to investigate the system reliability of multi-way declustering data layouts and analyze their potential parallel recovery possibilities. Our comprehensive simulation results on Matlab and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, in terms of data loss probability and system reliability by up to 63% and 85% respectively.
Our study on both 5-year and 10-year system reliability equipped with various recovery bandwidth settings shows that the shifted declustering layout surpasses the two baseline approaches in both cases by consuming up to 79% and 87% less recovery bandwidth for copyset, as well as 4.8% and 10.2% less recovery bandwidth for random layout. Thirdly, we develop a power-aware job scheduler by applying a rule-based control method and taking into account real-world power and speedup profiles to improve power efficiency while adhering to predetermined power constraints. The intensive simulation results show that our proposed method is able to achieve the maximum utilization of computing resources as compared to baseline scheduling algorithms while keeping the energy cost under the threshold. Moreover, by introducing a Power Performance Factor (PPF) based on the real-world power and speedup profiles, we are able to increase the power efficiency by up to 75%.
Scalable Storage for Digital Libraries
I propose a storage system optimised for digital libraries. Its key features are its heterogeneous scalability; its integration and exploitation of rich semantic metadata associated with digital objects; its use of a name space; and its aggressive performance optimisation in the digital library domain.