18,544 research outputs found
Multi-GPU Graph Analytics
We present a single-node, multi-GPU programmable graph processing library
that allows programmers to easily extend single-GPU graph algorithms to achieve
scalable performance on large graphs with billions of edges. Directly using the
single-GPU implementations, our design only requires programmers to specify a
few algorithm-dependent concerns, hiding most multi-GPU related implementation
details. We analyze the theoretical and practical limits to scalability in the
context of varying graph primitives and datasets. We describe several
optimizations, such as direction optimizing traversal, and a just-enough memory
allocation scheme, for better performance and smaller memory consumption.
Compared to previous work, we achieve best-of-class performance across
operations and datasets, including excellent strong and weak scalability on
most primitives as we increase the number of GPUs in the system.Comment: 12 pages. Final version submitted to IPDPS 201
Confidentiality-Preserving Publish/Subscribe: A Survey
Publish/subscribe (pub/sub) is an attractive communication paradigm for
large-scale distributed applications running across multiple administrative
domains. Pub/sub allows event-based information dissemination based on
constraints on the nature of the data rather than on pre-established
communication channels. It is a natural fit for deployment in untrusted
environments such as public clouds linking applications across multiple sites.
However, pub/sub in untrusted environments lead to major confidentiality
concerns stemming from the content-centric nature of the communications. This
survey classifies and analyzes different approaches to confidentiality
preservation for pub/sub, from applications of trust and access control models
to novel encryption techniques. It provides an overview of the current
challenges posed by confidentiality concerns and points to future research
directions in this promising field
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (6 on Top500). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduces latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-revie
A Survey on Homomorphic Encryption Schemes: Theory and Implementation
Legacy encryption systems depend on sharing a key (public or private) among
the peers involved in exchanging an encrypted message. However, this approach
poses privacy concerns. Especially with popular cloud services, the control
over the privacy of the sensitive data is lost. Even when the keys are not
shared, the encrypted material is shared with a third party that does not
necessarily need to access the content. Moreover, untrusted servers, providers,
and cloud operators can keep identifying elements of users long after users end
the relationship with the services. Indeed, Homomorphic Encryption (HE), a
special kind of encryption scheme, can address these concerns as it allows any
third party to operate on the encrypted data without decrypting it in advance.
Although this extremely useful feature of the HE scheme has been known for over
30 years, the first plausible and achievable Fully Homomorphic Encryption (FHE)
scheme, which allows any computable function to perform on the encrypted data,
was introduced by Craig Gentry in 2009. Even though this was a major
achievement, different implementations so far demonstrated that FHE still needs
to be improved significantly to be practical on every platform. First, we
present the basics of HE and the details of the well-known Partially
Homomorphic Encryption (PHE) and Somewhat Homomorphic Encryption (SWHE), which
are important pillars of achieving FHE. Then, the main FHE families, which have
become the base for the other follow-up FHE schemes are presented. Furthermore,
the implementations and recent improvements in Gentry-type FHE schemes are also
surveyed. Finally, further research directions are discussed. This survey is
intended to give a clear knowledge and foundation to researchers and
practitioners interested in knowing, applying, as well as extending the state
of the art HE, PHE, SWHE, and FHE systems.Comment: - Updated. (October 6, 2017) - This paper is an early draft of the
survey that is being submitted to ACM CSUR and has been uploaded to arXiv for
feedback from stakeholder
Optimization of Analytic Window Functions
Analytic functions represent the state-of-the-art way of performing complex
data analysis within a single SQL statement. In particular, an important class
of analytic functions that has been frequently used in commercial systems to
support OLAP and decision support applications is the class of window
functions. A window function returns for each input tuple a value derived from
applying a function over a window of neighboring tuples. However, existing
window function evaluation approaches are based on a naive sorting scheme. In
this paper, we study the problem of optimizing the evaluation of window
functions. We propose several efficient techniques, and identify optimization
opportunities that allow us to optimize the evaluation of a set of window
functions. We have integrated our scheme into PostgreSQL. Our comprehensive
experimental study on the TPC-DS datasets as well as synthetic datasets and
queries demonstrate significant speedup over existing approaches.Comment: VLDB201
- …