Runtime Optimization of Join Location in Parallel Data Management Systems
Applications running on parallel systems often need to join a streaming
relation or a stored relation with data indexed in a parallel data storage
system. Some applications also compute UDFs on the joined tuples. The join can
be done at the data storage nodes, corresponding to reduce side joins, or by
fetching data from the storage system to compute nodes, corresponding to map
side join. Both may be suboptimal: reduce side joins may cause skew, while map
side joins may lead to a lot of data being transferred and replicated.
In this paper, we present techniques to make runtime decisions between the
two options on a per key basis, in order to improve the throughput of the join,
accounting for UDF computation if any. Our techniques are based on an extended
ski-rental algorithm and provide worst-case performance guarantees with respect
to the optimal point in the space considered by us. Our techniques use load
balancing taking into account the CPU, network and I/O costs as well as the
load on compute and storage nodes. We have implemented our techniques on
Hadoop, Spark and the Muppet stream processing engine. Our experiments show
that our optimization techniques provide a significant improvement in
throughput over existing techniques.
Comment: 17 page
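The per-key decision described in this abstract can be illustrated with the classic break-even ski-rental rule. This is a minimal sketch and not the paper's extended algorithm: it ignores load balancing and UDF cost. The class name `SkiRentalJoinPlanner` and the cost parameters `rent_cost` (cost of one remote join at the storage node) and `buy_cost` (one-time cost of fetching the stored record to the compute node) are assumptions made for illustration.

```python
from collections import defaultdict


class SkiRentalJoinPlanner:
    """Classic break-even ski-rental rule applied per join key (illustrative only)."""

    def __init__(self, rent_cost: float, buy_cost: float):
        self.rent_cost = rent_cost        # hypothetical cost of one remote join
        self.buy_cost = buy_cost          # hypothetical one-time fetch-and-cache cost
        self.spent = defaultdict(float)   # rent paid so far, per key
        self.cached = set()               # keys whose stored record is held locally

    def decide(self, key) -> str:
        """Return 'local', 'fetch', or 'remote' for the next tuple with this key."""
        if key in self.cached:
            return "local"
        # Break-even rule: buy once the rent already paid reaches the buy cost.
        if self.spent[key] >= self.buy_cost:
            self.cached.add(key)
            return "fetch"
        self.spent[key] += self.rent_cost
        return "remote"


planner = SkiRentalJoinPlanner(rent_cost=1.0, buy_cost=3.0)
decisions = [planner.decide("user42") for _ in range(5)]
print(decisions)
```

Under this rule the total cost paid for a key is at most twice what an offline optimum with full knowledge of the key's frequency would pay, which is the standard guarantee for deterministic ski rental.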
Kelly Cache Networks
We study networks of M/M/1 queues in which nodes act as caches that store
objects. Exogenous requests for objects are routed towards nodes that store
them; as a result, object traffic in the network is determined not only by
demand but, crucially, by where objects are cached. We determine how to place
objects in caches to attain a certain design objective, such as
minimizing network congestion or retrieval delays. We show that for a broad
class of objectives, including minimizing both the expected network delay and
the sum of network queue lengths, this optimization problem can be cast as an
NP-hard submodular maximization problem. We show that the so-called continuous
greedy algorithm attains a ratio arbitrarily close to 1-1/e
using a deterministic estimation via a power series; this drastically reduces
execution time over prior art, which resorts to sampling. Finally, we show that
our results generalize, beyond M/M/1 queues, to networks of M/M/k and symmetric
M/D/1 queues.
Comment: This is the extended version of the Infocom 2019 paper with the same
title. The authors gratefully acknowledge support from National Science
Foundation grant NeTS-1718355, as well as from research grants by Intel Corp.
and Cisco System
Wireless Video Caching and Dynamic Streaming under Differentiated Quality Requirements
This paper considers one-hop device-to-device (D2D)-assisted wireless caching
networks that cache video files of varying quality levels, with the assumption
that the base station can control the video quality but cache-enabled devices
cannot. Two problems arise in such a caching network: the file placement problem
and the node association problem. This paper suggests a method to cache videos of
different qualities, and thus of varying file sizes, by maximizing the sum of
video quality measures that users can enjoy. There exists an interesting
trade-off between video quality and video diversity, i.e., the ability to
provision diverse video files. By caching high-quality files, the cache-enabled
devices can provide high-quality video, but cannot cache a variety of files.
Conversely, when the device caches various files, it cannot provide a good
quality for file-requesting users. In addition, when multiple devices cache the
same file but their qualities are different, advanced node association is
required for file delivery. This paper proposes a node association algorithm
that maximizes time-averaged video quality for multiple users under a playback
delay constraint. In this algorithm, we also consider request collision, the
situation where several users request files from the same device at the same
time, and we propose two ways to cope with the collision: scheduling of one
user and non-orthogonal multiple access. Simulation results verify that the
proposed caching method and the node association algorithm work reliably.
Comment: 13 pages, 11 figures, accepted for publication in IEEE Journal on
Selected Areas in Communication
Modeling and Optimization of Latency in Erasure-coded Storage Systems
As consumers are increasingly engaged in social networking and E-commerce
activities, businesses grow to rely on Big Data analytics for intelligence, and
traditional IT infrastructures continue to migrate to the cloud and edge, these
trends cause distributed data storage demand to rise at an unprecedented speed.
Erasure coding has quickly emerged as a promising technique to
reduce storage cost while providing similar reliability as replicated systems,
widely adopted by companies like Facebook, Microsoft and Google. However, it
also brings new challenges in characterizing and optimizing the access latency
when erasure codes are used in distributed storage. The aim of this monograph
is to provide a review of recent progress (both theoretical and practical) on
systems that employ erasure codes for distributed storage.
In this monograph, we will first identify the key challenges and taxonomy of
the research problems and then give an overview of different approaches that
have been developed to quantify and model latency of erasure-coded storage.
This includes recent work leveraging MDS-Reservation, Fork-Join, Probabilistic,
and Delayed-Relaunch scheduling policies, as well as their applications to
characterize access latency (e.g., mean, tail, asymptotic latency) of
erasure-coded distributed storage systems. We will also extend the problem to
the case when users are streaming videos from erasure-coded distributed storage
systems. Next, we bridge the gap between theory and practice, and discuss
lessons learned from prototype implementation. In particular, we will discuss
exemplary implementations of erasure-coded storage, illuminate key design
degrees of freedom and tradeoffs, and summarize remaining challenges in
real-world storage systems such as in content delivery and caching. Open
problems for future research are discussed at the end of each chapter.
Comment: Monograph for use by researchers interested in latency aspects of
distributed storage system
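As a small illustration of why latency in erasure-coded storage is an order-statistics question, the sketch below considers an (n, k) code where a read is sent to all n nodes and completes when the fastest k respond. It assumes independent exponential service times with rate `mu` and no queueing, a deliberate simplification; the function names are hypothetical and the scheduling policies named in the abstract are not modeled.

```python
import random


def expected_latency(n: int, k: int, mu: float) -> float:
    """Mean of the k-th smallest of n i.i.d. Exp(mu) service times."""
    # The gap between the i-th and (i+1)-th completion is Exp((n - i) * mu).
    return sum(1.0 / (mu * (n - i)) for i in range(k))


def simulate_latency(n: int, k: int, mu: float, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (no queueing modeled)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        times = sorted(rng.expovariate(mu) for _ in range(n))
        total += times[k - 1]  # the read finishes when the k-th chunk arrives
    return total / trials


print(expected_latency(6, 4, 1.0), simulate_latency(6, 4, 1.0))
```

Adding redundancy (larger n for fixed k) lowers this idealized latency, but in a real system it also raises the load on every node, which is the trade-off that makes queueing analysis necessary.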
On the Complexity of Optimal Routing and Content Caching in Heterogeneous Networks
We investigate the problem of optimal request routing and content caching in
a heterogeneous network supporting in-network content caching with the goal of
minimizing average content access delay. Here, content can either be accessed
directly from a back-end server (where content resides permanently) or be
obtained from one of multiple in-network caches. To access a piece of content,
a user must decide whether to route its request to a cache or to the back-end
server. Additionally, caches must decide which content to cache. We investigate
the problem complexity of two problem formulations, where the direct path to
the back-end server is modeled as i) a congestion-sensitive or ii) a
congestion-insensitive path, reflecting whether or not the delay of the
uncached path to the back-end server depends on the user request load,
respectively. We show that the problem is NP-complete in both cases. We prove
that under the congestion-insensitive model the problem can be solved optimally
in polynomial time if each piece of content is requested by only one user, or
when there are at most two caches in the network. We also identify a structural
property of the user-cache graph that potentially makes the problem
NP-complete. For the congestion-sensitive model, we prove that the problem
remains NP-complete even if there is only one cache in the network and each
content is requested by only one user. We show that approximate solutions can
be found for both models within a (1-1/e) factor of the optimal solution, and
demonstrate a greedy algorithm that is found to be within 1% of optimal for
small problem sizes. Through trace-driven simulations we evaluate the
performance of our greedy algorithms, which show up to a 50% reduction in
average delay over solutions based on LRU content caching.
Comment: Infoco
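A greedy placement for the congestion-insensitive case might look like the sketch below. It is a plain marginal-gain heuristic under assumed inputs, not necessarily the authors' algorithm, and no approximation ratio is claimed for it. The function name, the dictionary layout, and the assumption that each cache holds a fixed number of equal-sized items are all hypothetical.

```python
def greedy_cache_placement(demand, delay, backend_delay, capacity):
    """
    demand:        dict (user, item) -> request rate
    delay:         dict (user, cache) -> access delay (missing pair = unreachable)
    backend_delay: fixed delay of the uncached path to the back-end server
    capacity:      dict cache -> number of items it can hold
    Returns (placement, total_rate_weighted_delay).
    """
    placement = {cache: set() for cache in capacity}

    def total_delay(pl):
        total = 0.0
        for (user, item), rate in demand.items():
            best = backend_delay
            for cache, stored in pl.items():
                if item in stored and (user, cache) in delay:
                    best = min(best, delay[(user, cache)])
            total += rate * best
        return total

    items = {item for (_, item) in demand}
    current = total_delay(placement)
    while True:
        best_gain, best_move = 0.0, None
        for cache, stored in placement.items():
            if len(stored) >= capacity[cache]:
                continue  # cache is full
            for item in items - stored:
                stored.add(item)                       # tentatively place the item
                gain = current - total_delay(placement)
                stored.remove(item)                    # undo the tentative placement
                if gain > best_gain:
                    best_gain, best_move = gain, (cache, item)
        if best_move is None:                          # no move reduces delay further
            return placement, current
        cache, item = best_move
        placement[cache].add(item)
        current -= best_gain
```

Each round re-evaluates every feasible (cache, item) pair, so the sketch favors clarity over speed; it is meant only to show the marginal-gain structure of the greedy step.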
Bandana: Using Non-volatile Memory for Storing Deep Learning Models
Typical large-scale recommender systems use deep learning models that are
stored on a large amount of DRAM. These models often rely on embeddings, which
consume most of the required memory. We present Bandana, a storage system that
reduces the DRAM footprint of embeddings, by using Non-volatile Memory (NVM) as
the primary storage medium, with a small amount of DRAM as cache. The main
challenge in storing embeddings on NVM is its limited read bandwidth compared
to DRAM. Bandana uses two primary techniques to address this limitation: first,
it stores embedding vectors that are likely to be read together in the same
physical location, using hypergraph partitioning, and second, it decides the
number of embedding vectors to cache in DRAM by simulating dozens of small
caches. These techniques allow Bandana to increase the effective read bandwidth
of NVM by 2-3x and thereby significantly reduce the total cost of ownership.
Forwarding, Caching and Congestion Control in Named Data Networks
Emerging information-centric networking architectures seek to optimally
utilize both bandwidth and storage for efficient content distribution. This
highlights the need for joint design of traffic engineering and caching
strategies, in order to optimize network performance in view of both current
traffic loads and future traffic demands. We present a systematic framework for
joint dynamic interest request forwarding and dynamic cache placement and
eviction, within the context of the Named Data Networking (NDN) architecture.
The framework employs a virtual control plane which operates on the user demand
rate for data objects in the network, and an actual plane which handles
Interest Packets and Data Packets. We develop distributed algorithms within the
virtual plane to achieve network load balancing through dynamic forwarding and
caching, thereby maximizing the user demand rate that the NDN network can
satisfy. Next, we show that congestion control can be optimally combined with
forwarding and caching within this framework to maximize user utilities subject
to network stability. Numerical experiments within a number of network settings
demonstrate the superior performance of the resulting algorithms for the actual
plane in terms of high user utilities, low user delay, and high rate of cache
hits.
Comment: This version of the paper contains a new section on congestion
control for NDN networks. Previous version of the paper appeared in
Proceedings of ACM Conference on Information-Centric Networking (ICN), Paris,
France, September 24-26, 201
The TrieJax Architecture: Accelerating Graph Operations Through Relational Joins
Graph pattern matching (e.g., finding all cycles and cliques) has become an
important component in many critical domains such as social networks, biology,
and cyber-security. This development has motivated research into faster
algorithms that target graph pattern matching. In recent years, the database
community has shown that mapping graph pattern matching problems to a new class
of relational join algorithms provides an efficient framework for computing
these problems.
In this paper, we argue that this new class of relational join algorithms is
highly amenable to specialized hardware acceleration thanks to two fundamental
properties: improved locality and inherent concurrency. The improved locality
is a result of the provably bounded number of intermediate results these
algorithms generate, which results in smaller working sets. In addition, their
inherent concurrency can be leveraged for effective hardware acceleration and
hiding memory latency. We demonstrate the hardware amenability of this new
class of algorithms by introducing TrieJax, a hardware accelerator for graph
pattern matching. The TrieJax design leverages the improved locality and high
concurrency properties to dramatically accelerate graph pattern matching, and
can be tightly integrated into existing manycore processors.
We evaluate TrieJax on a set of standard graph pattern matching queries and
datasets. Our evaluation shows that TrieJax outperforms recently proposed
hardware accelerators for graph and database processing that do not employ the
new class of algorithms by 7-63x on average (up to 539x), while consuming
15-179x less energy (up to 1750x). It also outperforms systems that do incorporate modern
relational join algorithms by 9-20x on average (up to 45x), while consuming
59-110x less energy (up to 372x).
Generalization of LRU Cache Replacement Policy with Applications to Video Streaming
Caching plays a crucial role in networking systems to reduce the load on the
network and is commonly employed by content delivery networks (CDNs) in order
to improve performance. One of the commonly used mechanisms, Least Recently
Used (LRU), works well for identical file sizes. However, for asymmetric file
sizes, the performance deteriorates. This paper proposes an adaptation to the
LRU strategy, called gLRU, where the file is sub-divided into equal-sized
chunks. In this strategy, a chunk of the newly requested file is added in the
cache, and a chunk of the least-recently-used file is removed from the cache.
Even though approximate analysis for the hit rate has been studied for LRU, the
analysis does not extend to gLRU since the metric of interest is no longer the
hit rate as the cache has partial files. This paper provides a novel
approximation analysis for this policy where the cache may have partial file
contents. The approximation approach is validated by simulations. Further, gLRU
outperforms the LRU strategy for a Zipf file popularity distribution and
censored Pareto file size distribution for the file download times. Video
streaming applications can further use the partial cache contents to reduce the
stall duration, and the numerical results indicate significant
improvements (32%) in stall duration using the gLRU strategy as compared to
the LRU strategy. Furthermore, the gLRU replacement policy compares favorably
to two other cache replacement policies when simulated on MSR Cambridge Traces
obtained from the SNIA IOTTA repository.
Comment: Accepted to ACM TOMPECS, Jun 201
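The chunk-level replacement step described in this abstract can be sketched as follows. This is one reading of the abstract rather than the authors' implementation: the class name `GLRUCache`, the choice to count capacity in chunks, and the rule that the requested file is never its own eviction victim are assumptions.

```python
from collections import OrderedDict


class GLRUCache:
    """Chunk-granular LRU sketch: each request adds one chunk of the requested file."""

    def __init__(self, capacity_chunks: int):
        self.capacity = capacity_chunks
        self.chunks = OrderedDict()  # file_id -> cached chunk count, oldest first
        self.used = 0

    def request(self, file_id, file_chunks: int) -> int:
        """Serve a request; return how many chunks were already cached."""
        hit = self.chunks.get(file_id, 0)
        self.chunks[file_id] = hit
        self.chunks.move_to_end(file_id)          # mark as most recently used
        if hit < file_chunks:                     # file not yet fully cached
            if self.used >= self.capacity:
                self._evict_one_chunk(exclude=file_id)
            if self.used < self.capacity:
                self.chunks[file_id] += 1
                self.used += 1
        if self.chunks[file_id] == 0:             # nothing could be cached
            del self.chunks[file_id]
        return hit

    def _evict_one_chunk(self, exclude) -> None:
        """Remove one chunk from the least recently used file other than `exclude`."""
        for victim in self.chunks:                # iterates oldest first
            if victim != exclude:
                self.chunks[victim] -= 1
                self.used -= 1
                if self.chunks[victim] == 0:
                    del self.chunks[victim]
                return                            # return right after mutating the dict
```

Because the cache can hold only the first few chunks of a file, the natural metric is how many chunks a request finds locally, which matches the abstract's point that the plain hit rate no longer describes performance.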
Prefix based Chaining Scheme for Streaming Popular Videos using Proxy servers in VoD
Streaming high-quality videos consumes a significantly large amount of network
resources. In this context, request-to-service delay, network traffic,
congestion and server overloading are the main parameters to be considered in
video streaming over the internet that affect the quality of service (QoS). In
this paper, we propose an efficient architecture as a cluster of proxy servers
and clients that uses a peer to peer (P2P) approach to cooperatively stream the
video using a chaining technique. We consider the following two key issues in the
proposed architecture: (1) a prefix caching technique to accommodate a larger number
of videos close to the client, and (2) cooperative client and proxy chaining to achieve
network efficiency. Our simulation results show that the proposed approach
yields a prefix caching close to the optimal solution, minimizing WAN bandwidth
usage on the server-proxy path by utilizing the proxy-client and client-client path
bandwidth, which is much cheaper than the server-proxy path bandwidth, and that
chaining significantly reduces server load and client rejection ratio.
Comment: 9 pages IEEE format, International Journal of Computer Science and
Information Security, IJCSIS 2009, ISSN 1947 5500, Impact Factor 0.423,
http://sites.google.com/site/ijcsis