Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning
In the realm of distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications, such as simulations, computational fluid dynamics, and scalable deep learning, require complex computations that can be parallelized across multiple nodes in a distributed system. These applications often involve data-dependent communication patterns, where collective operations are critical for achieving high performance in data exchange. Optimizing collective operations for scientific applications and deep learning involves improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency.
Within the context of this dissertation, the specific focus is on optimizing the alltoall operation in 3D Fast Fourier Transform (FFT) applications and the allreduce operation in parallel deep learning, particularly on High-Performance Computing (HPC) systems. Advanced communication algorithms and methods are explored and implemented to improve communication efficiency, consequently enhancing the overall performance of 3D FFT applications.
Furthermore, this dissertation investigates the identification of performance bottlenecks during collective communication over Horovod on distributed systems. These bottlenecks are addressed by proposing an optimized parallel communication pattern specifically tailored to alleviate the aforementioned limitations during the training phase in distributed deep learning. The objective is to achieve faster convergence and improve the overall training efficiency.
Moreover, this dissertation proposes fault tolerance and elastic scaling features for distributed deep learning by leveraging User-Level Failure Mitigation (ULFM) from the Message Passing Interface (MPI). By incorporating ULFM MPI, the dissertation aims to enhance the elastic capabilities of distributed deep learning systems. This approach enables graceful and lightweight handling of failures while facilitating seamless scaling in dynamic computing environments.
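As an illustration of the kind of allreduce operation discussed in the abstract above, the following is a minimal single-process sketch of the well-known ring allreduce (a reduce-scatter phase followed by an allgather phase). It is not taken from the dissertation and is not necessarily the algorithm it optimizes; the function name `ring_allreduce` and the list-of-lists representation of per-rank buffers are assumptions made for this example, and real implementations exchange the chunks over MPI or a similar transport.

```python
def ring_allreduce(buffers):
    """Simulate a sum-allreduce over a logical ring of len(buffers) ranks.

    buffers[r] is the local vector of rank r (hypothetical representation).
    Returns a new list in which every rank holds the element-wise sum.
    """
    p = len(buffers)
    n = len(buffers[0])
    # Chunk c covers indices bounds[c]:bounds[c + 1].
    bounds = [c * n // p for c in range(p + 1)]
    data = [list(b) for b in buffers]

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod p
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(p - 1):
        # Snapshot outgoing chunks so all sends in a step happen "at once".
        outgoing = []
        for r in range(p):
            c = (r - s) % p
            outgoing.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, chunk) in enumerate(outgoing):
            dst = (r + 1) % p
            for i, value in enumerate(chunk):
                data[dst][bounds[c] + i] += value

    # Now rank r holds the fully reduced chunk (r + 1) mod p.
    # Phase 2: allgather. In step s, rank r forwards chunk (r + 1 - s) mod p
    # to its right neighbour, which overwrites its copy of that chunk.
    for s in range(p - 1):
        outgoing = []
        for r in range(p):
            c = (r + 1 - s) % p
            outgoing.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, chunk) in enumerate(outgoing):
            dst = (r + 1) % p
            data[dst][bounds[c]:bounds[c + 1]] = chunk

    return data


# Three simulated ranks, each contributing a vector of four values.
example_result = ring_allreduce([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
])
```

In this scheme each rank sends and receives roughly 2(p - 1)/p times the buffer size in total, independent of the number of ranks p, which is one reason ring-style algorithms are often considered for large gradient buffers.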
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.
Comment: arXiv admin note: text overlap with arXiv:1104.451
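For readers unfamiliar with the terms in the abstract above, the following is a minimal serial sketch of level-synchronous top-down breadth-first search over a graph stored in CSR (compressed sparse row) form. It is an illustration only, not the chapter's distributed implementation: the function name `bfs_top_down_csr` and the array names `row_ptr` and `col_idx` are assumptions for this example, and a distributed-memory version would additionally partition the vertices (1D) or the adjacency matrix (2D) and exchange frontier information between processes at each level.

```python
def bfs_top_down_csr(row_ptr, col_idx, source):
    """Level-synchronous top-down BFS on a CSR graph.

    row_ptr[u]:row_ptr[u + 1] indexes the neighbours of vertex u in col_idx.
    Returns a list of BFS levels, with -1 for unreachable vertices.
    """
    num_vertices = len(row_ptr) - 1
    level = [-1] * num_vertices
    level[source] = 0
    frontier = [source]
    depth = 0

    # Each iteration of the loop expands one complete BFS level.
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            # Top-down step: every frontier vertex scans its out-edges.
            for k in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[k]
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier

    return level


# Undirected edges 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
example_row_ptr = [0, 2, 4, 6, 9, 10, 10]
example_col_idx = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
example_levels = bfs_top_down_csr(example_row_ptr, example_col_idx, 0)
```

The direction-optimizing variant mentioned in the abstract switches, for large frontiers, to a bottom-up step in which unvisited vertices look for a parent in the current frontier; that step is not shown here.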
Preparing HPC Applications for the Exascale Era: A Decoupling Strategy
Production-quality parallel applications are often a mixture of diverse
operations, such as computation- and communication-intensive, regular and
irregular, tightly coupled and loosely linked operations. In conventional
construction of parallel applications, each process performs all the
operations, which might be inefficient and seriously limit scalability,
especially at large scale. We propose a decoupling strategy to improve the
scalability of applications running on large-scale systems.
Our strategy separates application operations onto groups of processes and
enables a dataflow processing paradigm among the groups. This mechanism is
effective in reducing the impact of load imbalance and increases the parallel
efficiency by pipelining multiple operations. We provide a proof-of-concept
implementation using MPI, the de-facto programming system on current
supercomputers. We demonstrate the effectiveness of this strategy by decoupling
the reduce, particle communication, halo exchange and I/O operations in a set
of scientific and data-analytics applications. A performance evaluation on
8,192 processes of a Cray XC40 supercomputer shows that the proposed approach
can achieve up to 4x performance improvement.
Comment: The 46th International Conference on Parallel Processing (ICPP-2017
Nonorthogonal Quantum States Maximize Classical Information Capacity
I demonstrate that, rather unexpectedly, there exist noisy quantum channels
for which the optimal classical information transmission rate is achieved only
by signaling alphabets consisting of nonorthogonal quantum states.
Comment: 5 pages, REVTeX, mild extension of results, much improved
presentation, to appear in Physical Review Letter
Effects of communication efficiency and exit capacity on fundamental diagrams for pedestrian motion in an obscure tunnel: a particle system approach
Fundamental diagrams describing the relation between pedestrian speed and density are key to understanding pedestrian dynamics. Experimental data show the onset of complex behaviors in which the velocity decreases with the density and different logistic regimes are identified. This paper addresses pedestrian transport and fundamental diagrams for a scenario involving the motion of pedestrians escaping from an obscure tunnel. We capture the effects of the communication efficiency and the exit capacity by means of two thresholds controlling the rate at which particles (walkers, pedestrians) move on the lattice. Using a particle system model, we show that, in the absence of limitations on communication among pedestrians, we reproduce with good accuracy the standard fundamental diagrams, whose basic behaviors can be interpreted in terms of the exit capacity limitation. When the effect of a limited communication ability is considered, interesting non-intuitive phenomena occur. In particular, we shed light on the loss of monotonicity of the typical speed-density curves, revealing the existence of a pedestrian density that optimizes the escape. We study both the discrete particle dynamics and the corresponding hydrodynamic limit (a porous medium equation and a transport (continuity) equation). We also point out the dependence of the effective transport coefficients on the two thresholds, which carry the essence of the microstructure information.
- …