33,217 research outputs found
Recommended from our members
Security architectures in mobile integrated pay-TV
This paper presents the design and describes the advantage of the state-of-the-art Mobile Integrated Conditional Access System (MICAS) concerning interoperability, personalisation, security and operational costs in Pay-TV systems. The Message Handling Subsystem is proposed and outlined together with âFollow-Meâ service, which proposed herewith to extend mobility and personalisation concepts on Pay-TV service
Using GPI-2 for Distributed Memory Paralleliziation of the Caffe Toolbox to Speed up Deep Neural Network Training
Deep Neural Network (DNN) are currently of great inter- est in research and
application. The training of these net- works is a compute intensive and time
consuming task. To reduce training times to a bearable amount at reasonable
cost we extend the popular Caffe toolbox for DNN with an efficient distributed
memory communication pattern. To achieve good scalability we emphasize the
overlap of computation and communication and prefer fine granu- lar
synchronization patterns over global barriers. To im- plement these
communication patterns we rely on the the Global address space Programming
Interface version 2 (GPI-2) communication library. This interface provides a
light-weight set of asynchronous one-sided communica- tion primitives
supplemented by non-blocking fine gran- ular data synchronization mechanisms.
Therefore, Caf- feGPI is the name of our parallel version of Caffe. First
benchmarks demonstrate better scaling behavior com- pared with other
extensions, e.g., the Intel TM Caffe. Even within a single symmetric
multiprocessing machine with four graphics processing units, the CaffeGPI
scales bet- ter than the standard Caffe toolbox. These first results
demonstrate that the use of standard High Performance Computing (HPC) hardware
is a valid cost saving ap- proach to train large DDNs. I/O is an other
bottleneck to work with DDNs in a standard parallel HPC setting, which we will
consider in more detail in a forthcoming paper
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK pose additional design constraints due
to very large message communication of GPU buffers during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK.Comment: 8 pages, 3 figure
FastPay: High-Performance Byzantine Fault Tolerant Settlement
FastPay allows a set of distributed authorities, some of which are Byzantine,
to maintain a high-integrity and availability settlement system for pre-funded
payments. It can be used to settle payments in a native unit of value
(crypto-currency), or as a financial side-infrastructure to support retail
payments in fiat currencies. FastPay is based on Byzantine Consistent Broadcast
as its core primitive, foregoing the expenses of full atomic commit channels
(consensus). The resulting system has low-latency for both confirmation and
payment finality. Remarkably, each authority can be sharded across many
machines to allow unbounded horizontal scalability. Our experiments demonstrate
intra-continental confirmation latency of less than 100ms, making FastPay
applicable to point of sale payments. In laboratory environments, we achieve
over 80,000 transactions per second with 20 authorities---surpassing the
requirements of current retail card payment networks, while significantly
increasing their robustness
- âŠ