Fast Parallel Randomized QR with Column Pivoting Algorithms for Reliable Low-rank Matrix Approximations
Factorizing large matrices by QR with column pivoting (QRCP) is substantially
more expensive than QR without pivoting, owing to communication costs required
for pivoting decisions. In contrast, randomized QRCP (RQRCP) algorithms have
proven themselves empirically to be highly competitive with high-performance
implementations of QR in processing time, on uniprocessor and shared memory
machines, and as reliable as QRCP in pivot quality.
We show that RQRCP algorithms can be as reliable as QRCP with failure
probabilities exponentially decaying in oversampling size. We also analyze
efficiency differences among different RQRCP algorithms. More importantly, we
develop distributed memory implementations of RQRCP that are significantly
better than QRCP implementations in ScaLAPACK.
As a further development, we introduce the concept of and develop algorithms
for computing spectrum-revealing QR factorizations for low-rank matrix
approximations, and demonstrate their effectiveness against leading low-rank
approximation methods in both theoretical and numerical reliability and
efficiency. Comment: 11 pages, 14 figures, accepted by 2017 IEEE 24th International
Conference on High Performance Computing (HiPC), awarded the best paper prize
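The core idea behind randomized pivoting can be illustrated with a minimal sketch. It is a single-pass simplification, not the blocked RQRCP algorithm of the paper: pivots are chosen on a small Gaussian-compressed copy of the matrix, and the full matrix is then factorized without pivoting. The names `pivoted_columns` and `randomized_qrcp`, the default oversampling of 8, and the greedy pivot loop are assumptions made for illustration.

```python
import numpy as np

def pivoted_columns(B, k):
    """Greedily pick up to k column indices of B by largest residual norm."""
    R = np.array(B, dtype=float)
    available = np.ones(R.shape[1], dtype=bool)
    chosen = []
    for _ in range(k):
        norms = np.where(available, np.sum(R * R, axis=0), -1.0)
        j = int(np.argmax(norms))
        nrm = np.linalg.norm(R[:, j])
        if nrm == 0.0:          # nothing left to select
            break
        chosen.append(j)
        available[j] = False
        q = R[:, j] / nrm
        R -= np.outer(q, q @ R)  # project out the chosen direction
    return chosen

def randomized_qrcp(A, k, oversample=8, seed=0):
    """Choose k pivots on a compressed sketch of A, then QR the permuted A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((k + oversample, m))
    B = Omega @ A                       # (k + oversample) x n, cheap to pivot
    piv = pivoted_columns(B, k)
    rest = [j for j in range(n) if j not in piv]
    perm = np.array(piv + rest)
    Q, R = np.linalg.qr(A[:, perm])     # unpivoted QR of the permuted matrix
    return Q, R, perm
```

For a matrix of numerical rank at most `k`, the leading block `Q[:, :k] @ R[:k, :]` should reproduce `A[:, perm]` up to rounding, because the pivot decisions were made on the much smaller matrix `B` rather than on `A` itself.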
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This,
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK, poses additional design constraints due
to very large message communication of GPU buffers during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK. Comment: 8 pages, 3 figures
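The pipelined chain idea mentioned in the abstract can be sketched as a serial simulation. This is not MPI or MVAPICH2-GDR code: the function name `pipelined_chain_bcast`, the step-based timing model, and the byte-string message are all assumptions made for illustration. The message is split into chunks, and in each step every rank forwards the next chunk its successor is missing.

```python
def pipelined_chain_bcast(data, num_ranks, chunk_size):
    """Simulate a chunked broadcast along a chain 0 -> 1 -> ... -> P-1."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    received = [[] for _ in range(num_ranks)]
    received[0] = list(chunks)          # the root owns the full message
    steps = 0
    while any(len(r) < len(chunks) for r in received):
        # Snapshot the counts so a chunk moves at most one hop per step.
        have = [len(r) for r in received]
        for rank in range(num_ranks - 1):
            nxt = have[rank + 1]
            if nxt < have[rank]:
                received[rank + 1].append(received[rank][nxt])
        steps += 1
    return [b"".join(r) for r in received], steps
```

In this model a message of C chunks reaches all P ranks in C + P - 2 steps, and each step moves only one chunk per link, which is why splitting a large message into chunks lets the links work concurrently.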
Task mapping on a dragonfly supercomputer
The dragonfly network topology has recently gained traction in the design of high performance computing (HPC) systems and has been implemented in large-scale supercomputers. The impact of task mapping, i.e., placement of MPI ranks onto compute cores, on the communication performance of applications on dragonfly networks has not been comprehensively investigated on real large-scale systems. This paper demonstrates that task mapping affects the communication overhead significantly in dragonflies and that the magnitude of this effect is sensitive to the application, job size, and the OpenMP settings. Among the three task mapping algorithms we study (in-order, random, and recursive coordinate bisection), selecting a suitable task mapper reduces application communication time by up to 47%.
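Recursive coordinate bisection, one of the three mappers named above, can be sketched in a few lines. This is a generic illustration and not the paper's implementation: the function name `rcb`, the median split, and the power-of-two part count are assumptions. Tasks with coordinates are split in half along the dimension of largest extent, and the split is repeated on each half.

```python
def rcb(points, ids, parts):
    """Split ids into `parts` groups (a power of two) by recursive bisection."""
    if parts == 1:
        return [ids]
    dims = len(points[ids[0]])
    # Cut along the dimension with the largest coordinate extent.
    spans = [
        max(points[i][d] for i in ids) - min(points[i][d] for i in ids)
        for d in range(dims)
    ]
    d = spans.index(max(spans))
    ordered = sorted(ids, key=lambda i: points[i][d])
    mid = len(ordered) // 2
    left = rcb(points, ordered[:mid], parts // 2)
    right = rcb(points, ordered[mid:], parts // 2)
    return left + right
```

On a 4 x 4 grid of tasks split into four parts, each group comes out as a compact 2 x 2 block. A mapper could then assign each group to nearby cores so that neighbouring tasks stay close in the network.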
Securing Internet Protocol (IP) Storage: A Case Study
Storage networking technology has enjoyed strong growth in recent years, but
security concerns and threats facing networked data have grown equally fast.
Today, there are many potential threats that are targeted at storage networks,
including data modification, destruction and theft, DoS attacks, malware,
hardware theft and unauthorized access, among others. In order for a Storage
Area Network (SAN) to be secure, each of these threats must be individually
addressed. In this paper, we present a comparative study by implementing
different security methods in an IP storage network. Comment: 10 pages, IJNGN Journal