3 research outputs found
Evaluating Impact of Human Errors on the Availability of Data Storage Systems
In this paper, we investigate the effect of incorrect disk replacement
service on the availability of data storage systems. To this end, we first
conduct Monte Carlo simulations to evaluate the availability of the disk subsystem
by considering disk failures and incorrect disk replacement service. We also
propose a Markov model that corroborates the Monte Carlo simulation results. We
further extend the proposed model to consider the effect of an automatic disk
fail-over policy. The results obtained by the proposed model show that
overlooking the impact of incorrect disk replacement can lead to underestimating
unavailability by up to three orders of magnitude. Moreover, this study
suggests that by considering the effect of human errors, the conventional
beliefs about the dependability of different RAID mechanisms should be
revised. The results show that in the presence of human errors, RAID1 can
result in lower availability compared to RAID5.
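As a rough illustration of the kind of Markov model described above, the following sketch computes the steady-state unavailability of a small disk array with and without incorrect disk replacement. The three-state chain (OK, DEGRADED, DOWN), the parameter names (`fail_rate`, `repair_rate`, `recovery_rate`, `hep`), and the numeric values are all assumptions made for illustration. They are not the authors' model or data.

```python
def steady_state_unavailability(n_disks, fail_rate, repair_rate,
                                recovery_rate, hep):
    """Steady-state probability of the DOWN state in a 3-state chain.

    Hypothetical model (not taken from the paper):
      OK       -> DEGRADED at n_disks * fail_rate
      DEGRADED -> OK       at repair_rate * (1 - hep)   (correct replacement)
      DEGRADED -> DOWN     at repair_rate * hep         (wrong disk pulled)
                            + (n_disks - 1) * fail_rate (second disk failure)
      DOWN     -> OK       at recovery_rate
    All rates are per hour; hep is the human error probability.
    """
    if not 0.0 <= hep <= 1.0:
        raise ValueError("hep must be a probability in [0, 1]")
    a = n_disks * fail_rate
    b = repair_rate * (1.0 - hep)
    c = repair_rate * hep + (n_disks - 1) * fail_rate
    d = recovery_rate

    # Solve the balance equations relative to pi_OK = 1, then normalize.
    degraded = a / (b + c)
    down = degraded * c / d
    return down / (1.0 + degraded + down)


if __name__ == "__main__":
    # Illustrative values only: a 2-disk mirror, rates per hour.
    params = dict(n_disks=2, fail_rate=1e-6, repair_rate=0.1, recovery_rate=0.1)
    print("no human error:  ", steady_state_unavailability(hep=0.0, **params))
    print("1% human error:  ", steady_state_unavailability(hep=0.01, **params))
```

Comparing the two printed values shows how even a small human error probability can dominate the unavailability in such a model, since the wrong-replacement rate is tied to the (fast) repair rate rather than to the (slow) disk failure rate.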
LBICA: A Load Balancer for I/O Cache Architectures
In recent years, enterprise Solid-State Drives (SSDs) have been used in the caching
layer of high-performance servers to close the growing performance gap between
processing units and the storage subsystem. SSD-based I/O caching is typically not
effective in workloads with burst accesses in which the caching layer itself
becomes the performance bottleneck because of the large number of accesses.
Existing I/O cache architectures mainly focus on maximizing the cache hit ratio
while they neglect the average queue time of accesses. Previous studies
suggested bypassing the cache when burst accesses are identified. These
schemes, however, are not applicable to a general cache configuration and also
result in significant performance degradation on burst accesses. In this paper,
we propose a novel I/O cache load balancing scheme (LBICA) with adaptive write
policy management to prevent the I/O cache from becoming a performance bottleneck
in burst accesses. Our proposal, unlike previous schemes, which disable the I/O
cache or bypass the requests into the disk subsystem in burst accesses,
selectively reduces the number of waiting accesses in the SSD queue and
balances the load between the I/O cache and the disk subsystem while providing
the maximum performance. The proposed scheme characterizes the workload based
on the type of in-queue requests and assigns an effective cache write policy.
We aim to bypass the accesses which 1) are served faster by the disk subsystem
or 2) cannot be merged with other accesses in the I/O cache queue. In doing so,
the selected requests are served by the disk layer, which prevents the I/O
cache from being overloaded. Our evaluations on a physical system show that
LBICA reduces the load on the I/O cache by 48% and improves the performance of
burst workloads by 30% compared to the latest state-of-the-art load balancing
scheme.
Comment: 6 pages
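The two bypass criteria named in the abstract above (requests served faster by the disk subsystem, and requests that cannot be merged with others in the cache queue) can be sketched as a simple selection function. This is a minimal sketch under stated assumptions, not the LBICA implementation: the `Request` type, the fixed per-request service times, the position-based wait estimate, and the contiguity-based merge test are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    lba: int     # starting logical block address (hypothetical field)
    blocks: int  # request length in blocks (hypothetical field)


def is_mergeable(req, queue):
    """Assumed merge rule: req is contiguous with another queued request."""
    return any(
        other is not req
        and (other.lba + other.blocks == req.lba
             or req.lba + req.blocks == other.lba)
        for other in queue
    )


def select_bypass(queue, ssd_service_ms, disk_service_ms):
    """Return the queued requests that would be redirected to the disk layer.

    A request is bypassed if (1) its estimated completion time in the SSD
    queue exceeds the assumed disk service time, or (2) it cannot be merged
    with any other request in the queue.
    """
    bypass = []
    for position, req in enumerate(queue):
        # Estimated SSD completion: requests ahead of it plus its own service.
        ssd_wait_ms = (position + 1) * ssd_service_ms
        faster_on_disk = disk_service_ms < ssd_wait_ms
        if faster_on_disk or not is_mergeable(req, queue):
            bypass.append(req)
    return bypass


if __name__ == "__main__":
    queue = [Request(0, 8), Request(8, 8), Request(100, 8)]
    for req in select_bypass(queue, ssd_service_ms=0.1, disk_service_ms=5.0):
        print("bypass to disk:", req)
```

In this sketch the isolated request at block 100 is bypassed because nothing in the queue is contiguous with it, while a long enough queue would also push its tail requests to the disk once their estimated SSD wait exceeds the disk service time.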
Modeling Impact of Human Errors on the Data Unavailability and Data Loss of Storage Systems
Data storage systems and their availability play a crucial role in
contemporary datacenters. Despite using mechanisms such as automatic fail-over
in datacenters, the role of human agents and consequently their destructive
errors is inevitable. Due to the very large number of disk drives used in exascale
datacenters and their high failure rates, the disk subsystem in storage systems
has become a major source of Data Unavailability (DU) and Data Loss (DL)
initiated by human errors. In this paper, we investigate the effect of
Incorrect Disk Replacement Service (IDRS) on the availability and reliability
of data storage systems. To this end, we analyze the consequences of IDRS in a
disk array, and conduct Monte Carlo simulations to evaluate DU and DL during
mission time. The proposed modeling framework can cope with a) different
storage array configurations and b) Data Object Survivability (DOS),
representing the effect of system level redundancies such as remote backups and
mirrors. In the proposed framework, the model parameters are obtained from
industrial and scientific reports alongside field data which have been
extracted from a datacenter operating with 70 storage racks. The results show
that ignoring the impact of IDRS leads to unavailability underestimation by up
to three orders of magnitude. Moreover, our study suggests that by considering
the effect of human errors, the conventional beliefs about the dependability of
different Redundant Array of Independent Disks (RAID) mechanisms should be
revised. The results show that RAID1 can result in lower availability compared
to RAID5 in the presence of human errors. The results also show that employing
an automatic fail-over policy (using hot spare disks) can reduce the drastic
impacts of human errors by two orders of magnitude.
Comment: 17 pages
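To make the Monte Carlo approach mentioned above more concrete, the following sketch estimates the fraction of mission time during which a disk array is unavailable, with incorrect replacement modeled as a per-service human error probability. It is a simplified illustration under stated assumptions, not the authors' framework: it uses exponential event times, covers only Data Unavailability (no Data Loss, hot spares, or Data Object Survivability), requires at least two disks, and all function and parameter names are hypothetical.

```python
import random


def simulate_unavailability(n_disks, fail_rate, repair_rate, recovery_rate,
                            hep, mission_hours, runs, seed=0):
    """Estimate the mean fraction of mission time the array is unavailable.

    Assumptions (illustrative only): rates are per hour, n_disks >= 2, and an
    outage starts when either a second disk fails before the replacement
    service finishes or the operator pulls the wrong disk (probability hep).
    """
    rng = random.Random(seed)
    down_total = 0.0
    for _ in range(runs):
        t = 0.0
        while True:
            # Healthy period until the first disk failure.
            t += rng.expovariate(n_disks * fail_rate)
            # Degraded period: replacement service races a second failure.
            service = rng.expovariate(repair_rate)
            second = rng.expovariate((n_disks - 1) * fail_rate)
            t += min(service, second)
            if t >= mission_hours:
                break
            wrong_pull = second >= service and rng.random() < hep
            if second < service or wrong_pull:
                outage = rng.expovariate(recovery_rate)
                # Count only the part of the outage inside the mission time.
                down_total += min(outage, mission_hours - t)
                t += outage
    return down_total / (runs * mission_hours)


if __name__ == "__main__":
    # Illustrative values only, chosen so that events are frequent enough
    # to observe in a short run.
    common = dict(n_disks=2, fail_rate=1e-3, repair_rate=0.1,
                  recovery_rate=0.1, mission_hours=10_000, runs=200)
    print("hep = 0.0:", simulate_unavailability(hep=0.0, **common))
    print("hep = 0.1:", simulate_unavailability(hep=0.1, **common))
```

With realistic disk failure rates, outages are rare events, so a practical simulation would need many more runs (or a variance-reduction technique) than this sketch uses.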