39,916 research outputs found
Middleware-based Database Replication: The Gaps between Theory and Practice
The need for high availability and performance in data management systems has
been fueling a long running interest in database replication from both academia
and industry. However, academic groups often attack replication problems in
isolation, overlooking the need for completeness in their solutions, while
commercial teams take a holistic approach that often misses opportunities for
fundamental innovation. This has created over time a gap between academic
research and industrial practice.
This paper aims to characterize the gap along three axes: performance,
availability, and administration. We build on our own experience developing and
deploying replication systems in commercial and academic settings, as well as
on a large body of prior related work. We sift through representative examples
from the last decade of open-source, academic, and commercial database
replication systems and combine this material with case studies from real
systems deployed at Fortune 500 customers. We propose two agendas, one for
academic research and one for industrial R&D, which we believe can bridge the
gap within 5-10 years. This way, we hope to both motivate and help researchers
in making the theory and practice of middleware-based database replication more
relevant to each other.Comment: 14 pages. Appears in Proc. ACM SIGMOD International Conference on
Management of Data, Vancouver, Canada, June 200
Design and Implementation of MPICH2 over InfiniBand with RDMA Support
For several years, MPI has been the de facto standard for writing parallel
applications. One of the most popular MPI implementations is MPICH. Its
successor, MPICH2, features a completely new design that provides more
performance and flexibility. To ensure portability, it has a hierarchical
structure based on which porting can be done at different levels. In this
paper, we present our experiences designing and implementing MPICH2 over
InfiniBand. Because of its high performance and open standard, InfiniBand is
gaining popularity in the area of high-performance computing. Our study focuses
on optimizing the performance of MPI-1 functions in MPICH2. One of our
objectives is to exploit Remote Direct Memory Access (RDMA) in Infiniband to
achieve high performance. We have based our design on the RDMA Channel
interface provided by MPICH2, which encapsulates architecture-dependent
communication functionalities into a very small set of functions. Starting with
a basic design, we apply different optimizations and also propose a
zero-copy-based design. We characterize the impact of our optimizations and
designs using microbenchmarks. We have also performed an application-level
evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2
implementation achieves 7.6 s latency and 857 MB/s bandwidth, which are
close to the raw performance of the underlying InfiniBand layer. Our study
shows that the RDMA Channel interface in MPICH2 provides a simple, yet
powerful, abstraction that enables implementations with high performance by
exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is
the first high-performance design and implementation of MPICH2 on InfiniBand
using RDMA support.Comment: 12 pages, 17 figure
A Template for Implementing Fast Lock-free Trees Using HTM
Algorithms that use hardware transactional memory (HTM) must provide a
software-only fallback path to guarantee progress. The design of the fallback
path can have a profound impact on performance. If the fallback path is allowed
to run concurrently with hardware transactions, then hardware transactions must
be instrumented, adding significant overhead. Otherwise, hardware transactions
must wait for any processes on the fallback path, causing concurrency
bottlenecks, or move to the fallback path. We introduce an approach that
combines the best of both worlds. The key idea is to use three execution paths:
an HTM fast path, an HTM middle path, and a software fallback path, such that
the middle path can run concurrently with each of the other two. The fast path
and fallback path do not run concurrently, so the fast path incurs no
instrumentation overhead. Furthermore, fast path transactions can move to the
middle path instead of waiting or moving to the software path. We demonstrate
our approach by producing an accelerated version of the tree update template of
Brown et al., which can be used to implement fast lock-free data structures
based on down-trees. We used the accelerated template to implement two
lock-free trees: a binary search tree (BST), and an (a,b)-tree (a
generalization of a B-tree). Experiments show that, with 72 concurrent
processes, our accelerated (a,b)-tree performs between 4.0x and 4.2x as many
operations per second as an implementation obtained using the original tree
update template
Verification of the Tree-Based Hierarchical Read-Copy Update in the Linux Kernel
Read-Copy Update (RCU) is a scalable, high-performance Linux-kernel
synchronization mechanism that runs low-overhead readers concurrently with
updaters. Production-quality RCU implementations for multi-core systems are
decidedly non-trivial. Giving the ubiquity of Linux, a rare "million-year" bug
can occur several times per day across the installed base. Stringent validation
of RCU's complex behaviors is thus critically important. Exhaustive testing is
infeasible due to the exponential number of possible executions, which suggests
use of formal verification.
Previous verification efforts on RCU either focus on simple implementations
or use modeling languages, the latter requiring error-prone manual translation
that must be repeated frequently due to regular changes in the Linux kernel's
RCU implementation. In this paper, we first describe the implementation of Tree
RCU in the Linux kernel. We then discuss how to construct a model directly from
Tree RCU's source code in C, and use the CBMC model checker to verify its
safety and liveness properties. To our best knowledge, this is the first
verification of a significant part of RCU's source code, and is an important
step towards integration of formal verification into the Linux kernel's
regression test suite.Comment: This is a long version of a conference paper published in the 2018
Design, Automation and Test in Europe Conference (DATE
HeTM: Transactional Memory for Heterogeneous Systems
Modern heterogeneous computing architectures, which couple multi-core CPUs
with discrete many-core GPUs (or other specialized hardware accelerators),
enable unprecedented peak performance and energy efficiency levels.
Unfortunately, though, developing applications that can take full advantage of
the potential of heterogeneous systems is a notoriously hard task. This work
takes a step towards reducing the complexity of programming heterogeneous
systems by introducing the abstraction of Heterogeneous Transactional Memory
(HeTM). HeTM provides programmers with the illusion of a single memory region,
shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with
support for atomic transactions. Besides introducing the abstract semantics and
programming model of HeTM, we present the design and evaluation of a concrete
implementation of the proposed abstraction, which we named Speculative HeTM
(SHeTM). SHeTM makes use of a novel design that leverages on speculative
techniques and aims at hiding the inherently large communication latency
between CPUs and discrete GPUs and at minimizing inter-device synchronization
overhead. SHeTM is based on a modular and extensible design that allows for
easily integrating alternative TM implementations on the CPU's and GPU's sides,
which allows the flexibility to adopt, on either side, the TM implementation
(e.g., in hardware or software) that best fits the applications' workload and
the architectural characteristics of the processing unit. We demonstrate the
efficiency of the SHeTM via an extensive quantitative study based both on
synthetic benchmarks and on a porting of a popular object caching system.Comment: The current work was accepted in the 28th International Conference on
Parallel Architectures and Compilation Techniques (PACT'19
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
algorithm for large scale data analysis. DL algorithms are computationally
expensive - even distributed DL implementations which use MPI require days of
training (model learning) time on commonly studied datasets. Long running DL
applications become susceptible to faults - requiring development of a fault
tolerant system infrastructure, in addition to fault tolerant DL algorithms.
This raises an important question: What is needed from MPI for de- signing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data and hybrid); a need (or lack thereof) for check-pointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by ex- tending MaTEx-Caffe for using ULFM-based implementation. Our evaluation
using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault tolerant DL implementation
using OpenMPI based ULFM
SkelCL - A Portable Skeleton Library for High-Level GPU Programming
While CUDA and OpenCL made general-purpose programming for Graphics Processing Units (GPU) popular, using these programming approaches remains complex and error-prone because they lack high-level abstractions. The especially challenging systems with multiple GPU are not addressed at all by these low-level programming models. We
propose SkelCL – a library providing so-called algorithmic skeletons that capture recurring patterns of parallel computation and communication, together with an abstract vector data type and constructs for specifying data distribution. We demonstrate that SkelCL greatly simplifies programming GPU systems. We report the competitive performance results of SkelCL using both a simple Mandelbrot set computation and an industrial-strength medical imaging application. Because the library is implemented using OpenCL, it is portable across GPU hardware of different vendors
- …