Global-Scale Data Management with Strong Consistency Guarantees
Global-scale data management (GSDM) empowers systems by providing higher levels of fault-tolerance, read availability, and efficiency in utilizing cloud resources. This has led to the emergence of global-scale data management and event processing. However, the Wide-Area Network (WAN) latency separating datacenters is orders of magnitude larger than typical network latencies, and this requires a reevaluation of many of the traditional design trade-offs of data management systems. Therefore, data management problems must be revisited to account for the new design space. In this dissertation, we propose theoretical foundations to understand the limits imposed by WAN latency on GSDM, and propose practical systems and protocols to minimize the overhead caused by WAN latency. The presented work spans global-scale transaction processing, communication, analytics, and machine learning. In all these directions, the focus is on the trade-off between consistency and latency, where we ask the question: what is the best performance (often latency) we can achieve without compromising the consistency and integrity of data? For transaction processing, we propose a lower-bound formulation for transaction latency that is imposed by the WAN latency. Also, we propose a new paradigm for transaction processing (proactive coordination) that inspired our two proposed protocols, Message Futures and Helios, which can achieve the lower-bound latency. We also propose a communication framework, called Chariots, to scale multi-datacenter communication. Chariots is carefully designed to allow scaling communication while providing a consistent view of the communicated information. Finally, we explore challenges in global-scale analytics and machine learning. Specifically, we propose Ogre, a scalable system for global-scale heterogeneous transactional and analytics workloads. Also, we propose COP, a system designed to speed up machine learning on globally generated data.
Crux: Locality-Preserving Distributed Services
Distributed systems achieve scalability by distributing load across many
machines, but wide-area deployments can introduce worst-case response latencies
proportional to the network's diameter. Crux is a general framework to build
locality-preserving distributed systems, by transforming an existing scalable
distributed algorithm A into a new locality-preserving algorithm ALP, which
guarantees for any two clients u and v interacting via ALP that their
interactions exhibit worst-case response latencies proportional to the network
latency between u and v. Crux builds on compact-routing theory, but generalizes
these techniques beyond routing applications. Crux provides weak and strong
consistency flavors, and shows latency improvements for localized interactions
in both cases, specifically up to several orders of magnitude for
weakly-consistent Crux (from roughly 900ms to 1ms). We deployed on PlanetLab
locality-preserving versions of a Memcached distributed cache, a Bamboo
distributed hash table, and a Redis publish/subscribe service. Our results indicate
that Crux is effective and applicable to a variety of existing distributed
algorithms.
Comment: 11 figures
ElasTraS: An Elastic Transactional Data Store in the Cloud
Over the last couple of years, "Cloud Computing" or "Elastic Computing" has
emerged as a compelling and successful paradigm for internet scale computing.
One of the major contributing factors to this success is the elasticity of
resources. In spite of the elasticity provided by the infrastructure and the
scalable design of the applications, the elephant (or the underlying database),
which drives most of these web-based applications, is not very elastic and
scalable, and hence limits scalability. In this paper, we propose ElasTraS
which addresses this issue of scalability and elasticity of the data store in a
cloud computing environment to leverage the elastic nature of the
underlying infrastructure, while providing scalable transactional data access.
This paper aims at providing the design of a system in progress, highlighting
the major design choices, analyzing the different guarantees provided by the
system, and identifying several important challenges for the research community
striving for computing in the cloud.
Comment: 5 Pages, In Proc. of USENIX HotCloud 200
Petuum: A New Platform for Distributed Machine Learning on Big Data
What is a systematic way to efficiently apply a wide spectrum of advanced ML
programs to industrial scale problems, using Big Models (up to 100s of billions
of parameters) on Big Data (up to terabytes or petabytes)? Modern
parallelization strategies employ fine-grained operations and scheduling beyond
the classic bulk-synchronous processing paradigm popularized by MapReduce, or
even specialized graph-based execution that relies on graph representations of
ML programs. The variety of approaches tends to pull systems and algorithms
design in different directions, and it remains difficult to find a universal
platform applicable to a wide range of ML programs at scale. We propose a
general-purpose framework that systematically addresses data- and
model-parallel challenges in large-scale ML, by observing that many ML programs
are fundamentally optimization-centric and admit error-tolerant,
iterative-convergent algorithmic solutions. This presents unique opportunities
for an integrative system design, such as bounded-error network synchronization
and dynamic scheduling based on ML program structure. We demonstrate the
efficacy of these system designs versus well-known implementations of modern ML
algorithms, allowing ML programs to run in much less time and at considerably
larger model sizes, even on modestly-sized compute clusters.
Comment: 15 pages, 10 figures, final version in KDD 2015 under the same title
SAFIUS - A secure and accountable filesystem over untrusted storage
We describe SAFIUS, a secure, accountable file system that resides over
untrusted storage. SAFIUS provides strong security guarantees such as
confidentiality, integrity, protection against rollback attacks, and
accountability. SAFIUS also enables read/write sharing of data and provides the
standard UNIX-like interface for applications. To achieve accountability with
good performance, it uses asynchronous signatures; to reduce the space required
for storing these signatures, a novel signature pruning mechanism is used.
SAFIUS has been implemented on a GNU/Linux based system modifying OpenGFS.
Preliminary performance studies show that SAFIUS has a tolerable overhead for
providing secure storage: while it has an overhead of about 50% of OpenGFS in
data intensive workloads (due to the overhead of performing
encryption/decryption in software), it is comparable (or better in some cases)
to OpenGFS in metadata intensive workloads.
Comment: 11pt, 12 pages, 16 figures
High-Performance Distributed ML at Scale through Parameter Server Consistency Models
As Machine Learning (ML) applications increase in data size and model
complexity, practitioners turn to distributed clusters to satisfy the increased
computational and memory demands. Unfortunately, effective use of clusters for
ML requires considerable expertise in writing distributed code, while
highly-abstracted frameworks like Hadoop have not, in practice, approached the
performance seen in specialized ML implementations. The recent Parameter Server
(PS) paradigm is a middle ground between these extremes, allowing easy
conversion of single-machine parallel ML applications into distributed ones,
while maintaining high throughput through relaxed "consistency models" that
allow inconsistent parameter reads. However, due to insufficient theoretical
study, it is not clear which of these consistency models can really ensure
correct ML algorithm output; at the same time, there remain many
theoretically-motivated but undiscovered opportunities to maximize
computational throughput. Motivated by this challenge, we study both the
theoretical guarantees and empirical behavior of iterative-convergent ML
algorithms in existing PS consistency models. We then use the gleaned insights
to improve a consistency model using an "eager" PS communication mechanism, and
implement it as a new PS system that enables ML algorithms to reach their
solution more quickly.
Comment: 19 pages, 2 figures
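The relaxed consistency models mentioned in the abstract above can be illustrated with a generic bounded-staleness rule. The following is a minimal sketch, not the paper's system or API: the class name `StalenessGate`, the parameter `staleness`, and the method names are all hypothetical, and the rule shown (a worker may start a new iteration only while it is at most `staleness` iterations ahead of the slowest worker) is one common form of bounded staleness chosen here for illustration.

```python
class StalenessGate:
    """Hypothetical helper: per-worker iteration clocks with a staleness bound."""

    def __init__(self, num_workers: int, staleness: int) -> None:
        # clocks[w] counts the iterations worker w has completed so far.
        self.clocks = [0] * num_workers
        # Assumed bound: how far ahead of the slowest worker anyone may start.
        self.staleness = staleness

    def can_advance(self, worker: int) -> bool:
        # A worker may start its next iteration only if it is at most
        # `staleness` completed iterations ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker: int) -> bool:
        # Run one iteration for `worker` if allowed; report whether it ran.
        if not self.can_advance(worker):
            return False
        self.clocks[worker] += 1
        return True


if __name__ == "__main__":
    gate = StalenessGate(num_workers=3, staleness=1)
    print(gate.tick(0), gate.tick(0), gate.tick(0))  # third call is blocked
    print(gate.clocks)
```

With `staleness=0` the gate degenerates to lock-step (bulk-synchronous) execution, since no worker can start iteration k+1 until every worker has finished iteration k; larger values let fast workers run ahead and read parameters that may be missing recent updates from slower workers.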