Parallelization of implicit finite difference schemes in computational fluid dynamics
Implicit finite difference schemes are often the preferred numerical schemes in computational fluid dynamics because they are subject to less stringent stability bounds than explicit schemes. Each iteration in an implicit scheme involves global data dependencies in the form of second and higher order recurrences, so efficient parallel implementations of such iterative methods are considerably more difficult and less intuitive than those of explicit schemes. The parallelization of the implicit schemes that are used for solving the Euler and the thin layer Navier-Stokes equations, and that require inversions of large linear systems in the form of block tri-diagonal and/or block penta-diagonal matrices, is discussed. Three-dimensional cases are emphasized and schemes that minimize the total execution time are presented. Partitioning and scheduling schemes for alleviating the effects of the global data dependencies are described. An analysis of the communication and the computation aspects of these methods is presented. The effect of the boundary conditions on the parallel schemes is also discussed.
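The recurrences mentioned in this abstract can be illustrated with a small sketch. The code below is the standard Thomas algorithm for a scalar tridiagonal system, not the block solvers or the partitioning schemes of the paper; the function and argument names are assumptions made for illustration. Each loop iteration depends on the previous one, which is the kind of global data dependency that makes a direct parallelization hard.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a scalar tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    All four lists have the same length n. No pivoting is done, so the
    matrix is assumed to be diagonally dominant or otherwise well behaved.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: step i needs the result of step i-1.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution: step i needs the result of step i+1.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # Illustrative 4x4 system with 2 on the diagonal and -1 off the diagonal.
    print(thomas_solve([0.0, -1.0, -1.0, -1.0],
                       [2.0, 2.0, 2.0, 2.0],
                       [-1.0, -1.0, -1.0, 0.0],
                       [0.0, 0.0, 0.0, 5.0]))
```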
A simple parallel prefix algorithm for compact finite-difference schemes
A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the systems resulting from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is highly efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study was conducted to provide a simple truncation formula. Experimental results were measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. They show that the simple parallel prefix algorithm is a good algorithm for the compact scheme on high-performance computers.
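The "prefix communication pattern" named in this abstract can be pictured with a generic parallel prefix (scan). The sketch below is the textbook Hillis-Steele inclusive scan, simulated sequentially; it is not the SPP algorithm itself, whose details are not given here, and the function name is an assumption. In each round every element combines with a partner `step` positions to its left, so the number of rounds grows with the logarithm of the input length.

```python
import operator


def inclusive_scan(values, op):
    """Hillis-Steele inclusive scan for an associative operator op.

    The inner loop stands in for what would be one parallel round:
    all updates in a round read only the snapshot from the round before.
    """
    result = list(values)
    n = len(result)
    step = 1
    while step < n:
        prev = list(result)  # snapshot of the previous round
        for i in range(step, n):
            result[i] = op(prev[i - step], prev[i])
        step *= 2
    return result


if __name__ == "__main__":
    print(inclusive_scan([1, 2, 3, 4, 5], operator.add))
```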
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This,
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK, poses additional design constraints due
to the very large messages of GPU buffers communicated during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK.
Comment: 8 pages, 3 figures
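To see why a pipelined chain helps for very large messages, consider a deliberately idealized step-count model. This is a sketch under stated assumptions, not the MVAPICH2-GDR design: every hop moves one chunk per time step, all chunks are the same size, and startup latency is ignored. The function name is hypothetical.

```python
def pipelined_chain_steps(num_ranks, num_chunks):
    """Count time steps until the last rank in a chain holds every chunk.

    Rank 0 is the root and starts with all chunks. In each step a rank
    receives one chunk from its predecessor if the predecessor is ahead.
    """
    have = [num_chunks] + [0] * (num_ranks - 1)  # chunks held per rank
    steps = 0
    while have[-1] < num_chunks:
        before = list(have)  # snapshot: a chunk moves one hop per step
        for r in range(1, num_ranks):
            if before[r - 1] > before[r]:
                have[r] += 1
        steps += 1
    return steps


if __name__ == "__main__":
    ranks, chunks = 4, 8
    steps = pipelined_chain_steps(ranks, chunks)
    # Cost in units of "time to send the whole message over one hop".
    print("pipelined:", steps / chunks, "unpipelined:", float(ranks - 1))
```

In this model the pipelined broadcast finishes after `num_chunks + num_ranks - 2` steps, each moving only one chunk, whereas forwarding the whole message hop by hop takes `num_ranks - 1` full-message steps.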
Phase-coherent lightwave communications with frequency combs
Fiber-optical networks are a crucial telecommunication infrastructure in
society. Wavelength division multiplexing allows for transmitting parallel data
streams over the fiber bandwidth, and coherent detection enables the use of
sophisticated modulation formats and electronic compensation of signal
impairments. In the future, optical frequency combs may replace multiple lasers
used for the different wavelength channels. We demonstrate two novel signal
processing schemes that take advantage of the broadband phase coherence of
optical frequency combs. This approach allows for a more efficient estimation
and compensation of optical phase noise in coherent communication systems,
which can significantly simplify the signal processing or increase the
transmission performance. With further advances in space division multiplexing
and chip-scale frequency comb sources, these findings pave the way for compact
energy-efficient optical transceivers.
Comment: 17 pages, 9 figures
A parallel framework for in-memory construction of term-partitioned inverted indexes
With the advances in cloud computing and the large memories made addressable by 64-bit architectures, it is possible to tackle large problems using memory-based solutions. Construction of term-based, partitioned, parallel inverted indexes is a communication-intensive task and is suitable for memory-based modeling. In this paper, we provide an efficient parallel framework for in-memory construction of term-based, partitioned inverted indexes. We show that, by utilizing an efficient bucketing scheme, we can eliminate the need for the generation of a global vocabulary. We propose and investigate assignment schemes that can reduce the communication overheads while minimizing the storage and final query processing imbalance. We also present a study on how communication among processors should be carried out with limited communication memory in order to reduce the total inversion time. We present several different communication-memory organizations and discuss their advantages and shortcomings. The conducted experiments indicate promising results. © 2012 The Author. Published by Oxford University Press on behalf of The British Computer Society.
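One way a bucketing scheme can avoid a global vocabulary is to derive each term's owner from a deterministic hash, so every processor computes the same assignment without exchanging term lists. The sketch below illustrates that general idea only; the bucket count, the round-robin bucket-to-processor mapping, and all names are assumptions, and the paper's actual bucketing and assignment schemes may differ.

```python
import zlib
from collections import defaultdict


def bucket_of(term, num_buckets):
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(term.encode("utf-8")) % num_buckets


def build_term_partitioned_index(docs, num_procs, num_buckets=64):
    """Build one in-memory inverted index per (simulated) processor.

    docs maps a document id to its text. Each term is owned by exactly
    one processor, chosen through its bucket.
    """
    # Hypothetical round-robin mapping of buckets to processors.
    owner = [b % num_procs for b in range(num_buckets)]
    indexes = [defaultdict(list) for _ in range(num_procs)]
    for doc_id, text in sorted(docs.items()):
        for term in sorted(set(text.lower().split())):
            proc = owner[bucket_of(term, num_buckets)]
            indexes[proc][term].append(doc_id)
    return indexes


if __name__ == "__main__":
    sample = {1: "parallel index", 2: "parallel query"}
    for proc, index in enumerate(build_term_partitioned_index(sample, 2)):
        print(proc, dict(index))
```

A smarter bucket-to-processor assignment, for example one that balances posting-list sizes, would replace the round-robin `owner` list; that is the part the abstract's assignment schemes address.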
Performance Modeling and Analysis of a Massively Parallel DIRECT— Part 1
Modeling and analysis techniques are used to investigate the performance of a massively parallel version of DIRECT, a global search algorithm widely used in multidisciplinary design optimization applications. Several high-dimensional benchmark functions and real-world problems are used to test the design effectiveness under various problem structures. Theoretical and experimental results are compared for two parallel clusters with different system scale and network connectivity. The present work aims at studying the performance sensitivity to important parameters for problem configurations, parallel schemes, and system settings. The performance metrics include the memory usage, load balancing, parallel efficiency, and scalability. An analytical bounding model is constructed to measure the load balancing performance under different schemes. Additionally, linear regression models are used to characterize two major overhead sources, interprocessor communication and processor idleness, and are also applied to the isoefficiency functions in scalability analysis. For a variety of high-dimensional problems and large-scale systems, the massively parallel design has achieved reasonable performance. The results of the performance study provide guidance for efficient problem and scheme configuration. More importantly, the generalized design considerations and analysis techniques are beneficial for transforming many global search algorithms into effective large-scale parallel optimization tools.
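Two of the quantities named in this abstract have standard definitions that are easy to sketch: parallel efficiency is speedup divided by the processor count, and a linear regression fits overhead as a straight line in some predictor. The code below uses those textbook definitions; the function names and the sample numbers are illustrative assumptions, not data or models from the paper.

```python
def parallel_efficiency(t_serial, t_parallel, num_procs):
    """Efficiency = speedup / processors, with speedup = T_serial / T_parallel."""
    speedup = t_serial / t_parallel
    return speedup / num_procs


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Made-up timings: 100 s serially, 25 s on 8 processors.
    print(parallel_efficiency(100.0, 25.0, 8))
    # Made-up overhead measurements against processor count.
    print(fit_line([1, 2, 3, 4], [3.0, 5.0, 7.0, 9.0]))
```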