12,281 research outputs found
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep LearningInference in Function as a Service Environments
Deep neural networks (DNNs) have become core computation components within
low latency Function as a Service (FaaS) prediction pipelines: including image
recognition, object detection, natural language processing, speech synthesis,
and personalized recommendation pipelines. Cloud computing, as the de-facto
backbone of modern computing infrastructure for both enterprise and consumer
applications, has to be able to handle user-defined pipelines of diverse DNN
inference workloads while maintaining isolation and latency guarantees, and
minimizing resource waste. The current solution for guaranteeing isolation
within FaaS is suboptimal -- suffering from "cold start" latency. A major cause
of such inefficiency is the need to move large amount of model data within and
across servers. We propose TrIMS as a novel solution to address these issues.
Our proposed solution consists of a persistent model store across the GPU, CPU,
local storage, and cloud storage hierarchy, an efficient resource management
layer that provides isolation, and a succinct set of application APIs and
container technologies for easy and transparent integration with FaaS, Deep
Learning (DL) frameworks, and user code. We demonstrate our solution by
interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x
speedup in latency for image classification models and up to 210x speedup for
large models. We achieve up to 8x system throughput improvement.Comment: In Proceedings CLOUD 201
Cold Storage Data Archives: More Than Just a Bunch of Tapes
The abundance of available sensor and derived data from large scientific
experiments, such as earth observation programs, radio astronomy sky surveys,
and high-energy physics already exceeds the storage hardware globally
fabricated per year. To that end, cold storage data archives are the---often
overlooked---spearheads of modern big data analytics in scientific,
data-intensive application domains. While high-performance data analytics has
received much attention from the research community, the growing number of
problems in designing and deploying cold storage archives has only received
very little attention.
In this paper, we take the first step towards bridging this gap in knowledge
by presenting an analysis of four real-world cold storage archives from three
different application domains. In doing so, we highlight (i) workload
characteristics that differentiate these archives from traditional,
performance-sensitive data analytics, (ii) design trade-offs involved in
building cold storage systems for these archives, and (iii) deployment
trade-offs with respect to migration to the public cloud. Based on our
analysis, we discuss several other important research challenges that need to
be addressed by the data management community
When Queueing Meets Coding: Optimal-Latency Data Retrieving Scheme in Storage Clouds
In this paper, we study the problem of reducing the delay of downloading data
from cloud storage systems by leveraging multiple parallel threads, assuming
that the data has been encoded and stored in the clouds using fixed rate
forward error correction (FEC) codes with parameters (n, k). That is, each file
is divided into k equal-sized chunks, which are then expanded into n chunks
such that any k chunks out of the n are sufficient to successfully restore the
original file. The model can be depicted as a multiple-server queue with
arrivals of data retrieving requests and a server corresponding to a thread.
However, this is not a typical queueing model because a server can terminate
its operation, depending on when other servers complete their service (due to
the redundancy that is spread across the threads). Hence, to the best of our
knowledge, the analysis of this queueing model remains quite uncharted.
Recent traces from Amazon S3 show that the time to retrieve a fixed size
chunk is random and can be approximated as a constant delay plus an i.i.d.
exponentially distributed random variable. For the tractability of the
theoretical analysis, we assume that the chunk downloading time is i.i.d.
exponentially distributed. Under this assumption, we show that any
work-conserving scheme is delay-optimal among all on-line scheduling schemes
when k = 1. When k > 1, we find that a simple greedy scheme, which allocates
all available threads to the head of line request, is delay optimal among all
on-line scheduling schemes. We also provide some numerical results that point
to the limitations of the exponential assumption, and suggest further research
directions.Comment: Original accepted by IEEE Infocom 2014, 9 pages. Some statements in
the Infocom paper are correcte
Incremental Consistency Guarantees for Replicated Objects
Programming with replicated objects is difficult. Developers must face the
fundamental trade-off between consistency and performance head on, while
struggling with the complexity of distributed storage stacks. We introduce
Correctables, a novel abstraction that hides most of this complexity, allowing
developers to focus on the task of balancing consistency and performance. To
aid developers with this task, Correctables provide incremental consistency
guarantees, which capture successive refinements on the result of an ongoing
operation on a replicated object. In short, applications receive both a
preliminary---fast, possibly inconsistent---result, as well as a
final---consistent---result that arrives later.
We show how to leverage incremental consistency guarantees by speculating on
preliminary values, trading throughput and bandwidth for improved latency. We
experiment with two popular storage systems (Cassandra and ZooKeeper) and three
applications: a Twissandra-based microblogging service, an ad serving system,
and a ticket selling system. Our evaluation on the Amazon EC2 platform with
YCSB workloads A, B, and C shows that we can reduce the latency of strongly
consistent operations by up to 40% (from 100ms to 60ms) at little cost (10%
bandwidth increase, 6% throughput drop) in the ad system. Even if the
preliminary result is frequently inconsistent (25% of accesses), incremental
consistency incurs a bandwidth overhead of only 27%.Comment: 16 total pages, 12 figures. OSDI'16 (to appear
- …