Joint Latency and Cost Optimization for Erasure-coded Data Center Storage
Modern distributed storage systems offer large capacity to satisfy the
exponentially increasing need for storage space. They often use erasure codes
to protect against disk and node failures and increase reliability, while
trying to meet the latency requirements of applications and clients. This paper
provides an insightful upper bound on the average service delay of such
erasure-coded storage with arbitrary service time distribution and consisting
of multiple heterogeneous files. Not only does the result supersede known delay
bounds that only work for a single file or homogeneous files, it also enables a
novel problem of joint latency and storage cost minimization over three
dimensions: selecting the erasure code, placement of encoded chunks, and
optimizing scheduling policy. The problem is efficiently solved via the
computation of a sequence of convex approximations with provable convergence.
We further prototype our solution in an open-source, cloud storage deployment
over three geographically distributed data centers. Experimental results
validate our theoretical delay analysis and show significant latency reduction,
providing valuable insights into the proposed latency-cost tradeoff in
erasure-coded storage.
Comment: 14 pages, presented in part at IFIP Performance, Oct 201
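As a hedged illustration of the order-statistics view behind such latency analyses (a Monte-Carlo sketch with hypothetical parameters, not the paper's closed-form bound): a read of an (n, k) erasure-coded file completes once the fastest k of n chunk fetches return.

```python
import random

def avg_read_latency(n, k, trials=20000, seed=0):
    """Estimate mean read latency for an (n, k)-erasure-coded file:
    the read completes once the fastest k of n chunk fetches finish,
    i.e., at the k-th order statistic of the chunk service times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        times = sorted(rng.expovariate(1.0) for _ in range(n))
        total += times[k - 1]          # k-th order statistic
    return total / trials

# Extra parity (larger n for fixed k) cuts latency but raises storage cost.
print(avg_read_latency(6, 4) < avg_read_latency(4, 4))   # → True
```

This is exactly the latency-cost tradeoff the paper optimizes over code choice, placement, and scheduling.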
Speeding Up Distributed Machine Learning Using Codes
Codes are widely used in many engineering applications to offer robustness
against noise. In large-scale systems there are several types of noise that can
affect the performance of distributed machine learning algorithms -- straggler
nodes, system failures, or communication bottlenecks -- but there has been
little interaction cutting across codes, machine learning, and distributed
systems. In this work, we provide theoretical insights on how coded solutions
can achieve significant gains compared to uncoded ones. We focus on two of the
most basic building blocks of distributed learning algorithms: matrix
multiplication and data shuffling. For matrix multiplication, we use codes to
alleviate the effect of stragglers, and show that if the number of homogeneous
workers is $n$, and the runtime of each subtask has an exponential tail, coded
computation can speed up distributed matrix multiplication by a factor of
$\log n$. For data shuffling, we use codes to reduce communication bottlenecks,
exploiting the excess in storage. We show that when a constant fraction
$\alpha$ of the data matrix can be cached at each worker, and $n$ is the number
of workers, \emph{coded shuffling} reduces the communication cost by a factor
of $(\alpha + \frac{1}{n})\gamma(n)$ compared to uncoded shuffling, where
$\gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to
that of multicasting a common message (of the same size) to $n$ users. For
instance, $\gamma(n) \simeq n$ if multicasting a message to $n$ users is as
cheap as unicasting a message to one user. We also provide experimental results,
corroborating our theoretical gains of the coded algorithms.
Comment: This work is published in IEEE Transactions on Information Theory and
presented in part at the NIPS 2015 Workshop on Machine Learning Systems and
the IEEE ISIT 201
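The straggler-mitigation idea for coded matrix multiplication can be sketched with a toy single-parity code (an illustrative (3, 2) example, not the paper's general construction):

```python
import numpy as np

def coded_matmul(A, x, straggler):
    """Toy (n=3, k=2) coded matrix-vector multiply: A is split row-wise
    into A1 and A2 plus a parity block A1 + A2; the results of any two
    of the three 'workers' suffice, so one straggler can be ignored."""
    k = A.shape[0] // 2
    work = {0: A[:k], 1: A[k:], 2: A[:k] + A[k:]}   # two data blocks + parity
    r = {i: W @ x for i, W in work.items() if i != straggler}
    if straggler == 0:                 # recover A1 @ x as parity minus A2 @ x
        return np.concatenate([r[2] - r[1], r[1]])
    if straggler == 1:                 # recover A2 @ x as parity minus A1 @ x
        return np.concatenate([r[0], r[2] - r[0]])
    return np.concatenate([r[0], r[1]])   # parity straggled: no repair needed

A = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
print(np.allclose(coded_matmul(A, x, straggler=1), A @ x))   # → True
```

With an exponential-tail runtime model, waiting for the fastest k of n such workers is what yields the logarithmic speedup claimed above.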
Applied Erasure Coding in Networks and Distributed Storage
The amount of digital data is rapidly growing. There is an increasing use of
a wide range of computer systems, from mobile devices to large-scale data
centers, and mitigating the occurrence and impact of errors in digital data is
important for the reliable operation of all of them. The demand
for new ultra-fast and highly reliable coding techniques for data at rest and
for data in transit is a major research challenge. Reliability is one of the
most important design requirements. The simplest way of providing a degree of
reliability is by using data replication techniques. However, replication is
highly inefficient in terms of capacity utilization. Erasure coding has
therefore become a viable alternative to replication since it provides the same
level of reliability as replication with significantly less storage overhead.
The present thesis investigates efficient constructions of erasure codes for
different applications. Methods from both coding and information theory have
been applied to network coding, Optical Packet Switching (OPS) networks and
distributed storage systems. The following four issues are addressed:
- Construction of binary and non-binary erasure codes;
- Reduction of the header overhead due to the encoding coefficients in network
coding;
- Construction and implementation of new erasure codes for large-scale
distributed storage systems that provide savings in the storage and network
resources compared to state-of-the-art codes; and
- Provision of a unified view on Quality of Service (QoS) in OPS networks when
erasure codes are used, with the focus on Packet Loss Rate (PLR),
survivability and secrecy.
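The storage-overhead advantage of erasure coding over replication claimed above is easy to make concrete (the example code rates below are hypothetical, not the thesis's constructions):

```python
def storage_overhead(n, k):
    """Raw bytes stored per byte of user data for an (n, k) erasure code;
    r-way replication is the special case (n = r, k = 1)."""
    return n / k

print(storage_overhead(3, 1))    # 3-way replication: 3.0x
print(storage_overhead(14, 10))  # e.g. a (14, 10) Reed-Solomon code: 1.4x
```

The (14, 10) code tolerates four erasures at 1.4x overhead, versus two erasures at 3x for triplication, which is why erasure coding is the viable alternative described here.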
Video Streaming in Distributed Erasure-coded Storage Systems: Stall Duration Analysis
The demand for global video has been burgeoning across industries. With the
expansion and improvement of video-streaming services, cloud-based video is
evolving into a necessary feature of any successful business for reaching
internal and external audiences. This paper considers video streaming over
distributed systems where the video segments are encoded using an erasure code
for better reliability; to the best of our knowledge, this is the first work
that considers video streaming over erasure-coded distributed cloud systems.
The download time of each coded chunk of each video segment is characterized,
and order statistics over the choice of the erasure-coded chunks are used to
obtain the playback times of different video segments. Using the playback
times, bounds on the moment generating function of the stall duration are used
to bound the mean stall duration. Bounds based on the moment generating
function of the order statistics are also used to bound the stall duration
tail probability, which determines the probability that the stall time is
greater than a pre-defined number. These two metrics, mean stall duration and
the stall duration tail
probability, are important quality of experience (QoE) measures for the end
users. Based on these metrics, we formulate an optimization problem that
jointly minimizes a convex combination of the two QoE metrics, averaged over
all requests, over the placement and access of the video content. The
non-convex
problem is solved using an efficient iterative algorithm. Numerical results
show significant improvement in QoE metrics for cloud-based video as compared
to the considered baselines.
Comment: 18 pages, accepted to IEEE/ACM Transactions on Networkin
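The stall-duration metric can be illustrated with a minimal sequential-playback model (all timings hypothetical; the paper's analysis is probabilistic, not this deterministic sketch):

```python
def total_stall(download_done, seg_len, startup):
    """Total stall time under sequential playback: segment i can start
    playing only after it has downloaded AND segment i-1 has finished."""
    clock = startup            # wall-clock time at which playback may resume
    stall = 0.0
    for d in download_done:
        start = max(clock, d)  # wait for the chunk if it isn't there yet
        stall += start - clock
        clock = start + seg_len
    return stall

# Downloads finish at t = 1, 3, 9; each segment plays for 2 s; 2 s startup.
print(total_stall([1, 3, 9], seg_len=2, startup=2))   # → 3.0
```

The mean and tail of this quantity, taken over random chunk download times, are the two QoE metrics the paper bounds and optimizes.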
Compressed Differential Erasure Codes for Efficient Archival of Versioned Data
In this paper, we study the problem of storing an archive of versioned data
in a reliable and efficient manner in distributed storage systems. We propose a
new storage technique called differential erasure coding (DEC) where the
differences (deltas) between subsequent versions are stored rather than the
whole objects, akin to a typical delta encoding technique. However, unlike
delta encoding techniques, DEC opportunistically exploits the sparsity (i.e.,
when the differences between two successive versions have few non-zero entries)
in the updates to store the deltas using compressed sensing techniques applied
with erasure coding. We first show that DEC provides significant savings in the
storage size for versioned data whenever the update patterns are characterized
by in-place alterations. Subsequently, we propose a practical DEC framework so
as to reap storage size benefits against not just in-place alterations but also
real-world update patterns such as insertions and deletions that alter the
overall data sizes. We conduct experiments with several synthetic workloads to
demonstrate that the practical variant of DEC provides significant reductions
in storage overhead (up to 60% depending on the workload) compared to a
baseline storage system that incorporates concepts from Rsync, a delta
encoding technique used to store and synchronize data across a network.
Comment: 16 pages, 15 figure
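The sparsity that DEC exploits can be illustrated with a toy delta computation (plain element-wise deltas over hypothetical data, without the compressed-sensing machinery of the actual scheme):

```python
def delta(old, new):
    """Element-wise difference between two equal-length versions."""
    return [b - a for a, b in zip(old, new)]

def sparsity(d):
    """Fraction of zero entries in a delta. High sparsity corresponds to
    in-place alterations, which is what DEC exploits to shrink storage."""
    return sum(1 for x in d if x == 0) / len(d)

v1 = [5, 5, 5, 5, 5, 5, 5, 5]
v2 = [5, 5, 9, 5, 5, 5, 5, 5]   # one in-place alteration
d = delta(v1, v2)
print(sparsity(d))               # → 0.875
```

A sparse delta like this can be stored in compressed form far more cheaply than the full new version, while v2 remains recoverable as v1 plus the delta.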
Taming Tail Latency for Erasure-coded, Distributed Storage Systems
Distributed storage systems are known to be susceptible to long tails in
response time. In modern online storage systems such as Bing, Facebook, and
Amazon, the long tails of the service latency are of particular concern, with
99.9th percentile response times being orders of magnitude worse than the mean.
As erasure codes emerge as a popular technique to achieve high data reliability
in distributed storage while attaining space efficiency, taming tail latency
still remains an open problem due to the lack of mathematical models for
analyzing such systems. To this end, we propose a framework for quantifying and
optimizing tail latency in erasure-coded storage systems. In particular, we
derive upper bounds on tail latency in closed form for arbitrary service time
distribution and heterogeneous files. Based on the model, we formulate an
optimization problem to jointly minimize the weighted latency tail probability
of all files over the placement of files on the servers, and the choice of
servers to access the requested files. The non-convex problem is solved using
an efficient, alternating optimization algorithm. Numerical results show
significant reduction of tail latency for erasure-coded storage systems with a
realistic workload.
Comment: 11 pages, 8 figure
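The latency tail probability targeted here can be estimated by simulation for a toy (n, k) read model (exponential service times assumed purely for illustration; the paper derives closed-form upper bounds instead):

```python
import random

def tail_prob(n, k, t, trials=50000, seed=1):
    """Estimate P(latency > t) for an (n, k) read: latency is the k-th
    fastest of n i.i.d. exponential chunk service times."""
    rng = random.Random(seed)
    hits = sum(sorted(rng.expovariate(1.0) for _ in range(n))[k - 1] > t
               for _ in range(trials))
    return hits / trials

# Extra parity thins the tail: compare a (6, 4) read against a bare (4, 4) one.
print(tail_prob(6, 4, t=3.0) < tail_prob(4, 4, t=3.0))   # → True
```

Choosing which k servers to access, and where to place chunks, shifts exactly this probability, which is what the weighted-latency-tail optimization exploits.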
Review of Replication Schemes for Unstructured P2P Networks
To improve unstructured P2P system performance, one wants to minimize the
number of peers that must be probed in order to shorten the search time. A
solution to the problem is to employ a replication scheme, which provides high
hit rate for target files. Replication can also provide load balancing and
reduce access latency if the file is accessed by a large population of users.
This paper briefly describes various replication schemes that have appeared in
the literature and also focuses on a novel replication technique called
Q-replication to increase availability of objects in unstructured P2P networks.
The Q-replication technique replicates objects autonomously to suitable sites
based on object popularity and site-selection logic, extensively employing
the Q-learning concept.
Comment: 7 page
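The tabular Q-learning update at the heart of such a scheme can be sketched as follows (the state/action/reward modeling here is hypothetical, not the exact Q-replication formulation):

```python
def q_update(Q, site, reward, alpha=0.5, gamma=0.9, next_best=0.0):
    """One tabular Q-learning step: Q(s) ← Q(s) + α(r + γ·maxQ(s') − Q(s)).
    In a Q-replication-style scheme the 'action' is replicating an object
    to a candidate site and the reward reflects subsequent hit rate."""
    Q[site] = Q[site] + alpha * (reward + gamma * next_best - Q[site])
    return Q

Q = {"siteA": 0.0, "siteB": 0.0}
q_update(Q, "siteA", reward=1.0)   # siteA served a popular object
best = max(Q, key=Q.get)
print(best)                        # → siteA
```

Repeating such updates lets a peer autonomously learn which sites are good replication targets for popular objects.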
Survey of Search and Replication Schemes in Unstructured P2P Networks
P2P computing raises challenging issues in various areas of computer science. The
largely used decentralized unstructured P2P systems are ad hoc in nature and
present a number of research challenges. In this paper, we provide a
comprehensive theoretical survey of various state-of-the-art search and
replication schemes in unstructured P2P networks for file-sharing applications.
The classifications of search and replication techniques and their advantages
and disadvantages are briefly explained. Finally, the various issues on
searching and replication for unstructured P2P networks are discussed.
Comment: 39 Pages, 5 Figure
Multi-Version Coding - An Information Theoretic Perspective of Consistent Distributed Storage
In applications of distributed storage systems to distributed computing and
implementation of key-value stores, the following property, usually referred
to as consistency in computer science and engineering, is an important
requirement: as the data stored changes, the latest version of the data must be
accessible to a client that connects to the storage system. An information
theoretic formulation called multi-version coding is introduced in the paper,
in order to study storage costs of consistent distributed storage systems.
Multi-version coding is characterized by ν totally ordered versions of a
message, and a storage system with n servers. At each server, values
corresponding to an arbitrary subset of the ν versions are received and
encoded. For any subset of c servers in the storage system, the value
corresponding to the latest common version, or a later version as per the total
ordering, among the c servers is required to be decodable. An achievable
multi-version code construction via linear coding is provided, along with a
converse result showing that the construction is approximately tight. An
implication of the converse is that there is an inevitable price, in terms of
storage cost, to ensure consistency in distributed storage systems.
Comment: 30 Pages. Extended version of conference publications in ISIT 2014
and Allerton 2014. Revision adds a section, Section VII, and corrects minor
typographical errors in the rest of the documen
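The decodability requirement can be illustrated for the trivial replication-based scheme (a real multi-version code stores less; this sketch with hypothetical version arrivals only shows what "latest common version" means):

```python
from itertools import combinations

def latest_common_version(stored, servers):
    """Latest version present at every one of the given servers -- the
    value multi-version coding requires to be decodable from any c
    servers (shown here for plain replication, not an actual code)."""
    common = set.intersection(*(stored[s] for s in servers))
    return max(common)

# Four servers receive versions 1..3 asynchronously; connectivity c = 2.
stored = {0: {1, 2}, 1: {1, 2, 3}, 2: {1}, 3: {1, 2, 3}}
worst = min(latest_common_version(stored, pair)
            for pair in combinations(stored, 2))
print(worst)   # → 1 (some pairs of servers only share version 1)
```

A multi-version code must guarantee this (or a later) version is decodable from every c-subset while storing less than full replicas, which is where the storage-cost price of consistency arises.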
Modeling and Optimization of Latency in Erasure-coded Storage Systems
As consumers are increasingly engaged in social networking and E-commerce
activities, businesses grow to rely on Big Data analytics for intelligence, and
traditional IT infrastructures continue to migrate to the cloud and edge, these
trends cause distributed data storage demand to rise at an unprecedented speed.
Erasure coding has quickly emerged as a promising technique to reduce storage
cost while providing reliability similar to that of replicated systems, and it
has been widely adopted by companies like Facebook, Microsoft and Google.
However, it
also brings new challenges in characterizing and optimizing the access latency
when erasure codes are used in distributed storage. The aim of this monograph
is to provide a review of recent progress (both theoretical and practical) on
systems that employ erasure codes for distributed storage.
In this monograph, we will first identify the key challenges and taxonomy of
the research problems and then give an overview of different approaches that
have been developed to quantify and model latency of erasure-coded storage.
This includes recent work leveraging MDS-Reservation, Fork-Join, Probabilistic,
and Delayed-Relaunch scheduling policies, as well as their applications to
characterize access latency (e.g., mean, tail, asymptotic latency) of
erasure-coded distributed storage systems. We will also extend the problem to
the case when users are streaming videos from erasure-coded distributed storage
systems. Next, we bridge the gap between theory and practice, and discuss
lessons learned from prototype implementation. In particular, we will discuss
exemplary implementations of erasure-coded storage, illuminate key design
degrees of freedom and tradeoffs, and summarize remaining challenges in
real-world storage systems such as in content delivery and caching. Open
problems for future research are discussed at the end of each chapter.
Comment: Monograph for use by researchers interested in latency aspects of
distributed storage system
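One of the scheduling policies surveyed, probabilistic scheduling, can be sketched by simulation (server loads and access weights below are hypothetical; the monograph treats the analytical model):

```python
import random

def prob_sched_latency(loads, pi, k, trials=20000, seed=2):
    """Probabilistic scheduling sketch: each request samples k distinct
    servers with weights pi (rejection-style sampling) and waits for all
    k of them. Server i's service time is exponential with mean loads[i]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        chosen = set()
        while len(chosen) < k:   # draw until k distinct servers are chosen
            chosen.add(rng.choices(range(len(loads)), weights=pi)[0])
        total += max(rng.expovariate(1.0 / loads[i]) for i in chosen)
    return total / trials

loads = [1.0, 1.0, 1.0, 4.0]    # server 3 is slow
uniform = [1, 1, 1, 1]
skewed = [3, 3, 3, 1]           # steer requests away from the slow server
print(prob_sched_latency(loads, skewed, k=2) <
      prob_sched_latency(loads, uniform, k=2))   # → True
```

Optimizing the access weights pi per file is the knob that probabilistic-scheduling analyses use to trade load balance against mean and tail latency.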