On the acceleration of wavefront applications using distributed many-core architectures
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C, and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed that of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.
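To make the wavefront dependency pattern mentioned in the abstract concrete, the following is a minimal 2D sketch, not the LU solver itself. It assumes a toy update in which each cell depends only on its west and south neighbours; the function name `wavefront_sweep`, the `boundary` parameter, and the update rule are all hypothetical and chosen for illustration.

```python
def wavefront_sweep(nx, ny, boundary=1.0):
    """Fill an nx-by-ny grid where each cell depends on its west and south neighbours.

    Toy update rule (an assumption, not taken from the paper):
        grid[i][j] = west + south
    """
    grid = [[0.0] * ny for _ in range(nx)]
    # Cells on the same anti-diagonal (i + j == d) do not depend on each other,
    # so every diagonal could be processed in parallel once the previous one is done.
    for d in range(nx + ny - 1):
        for i in range(max(0, d - ny + 1), min(nx, d + 1)):
            j = d - i
            west = grid[i - 1][j] if i > 0 else boundary
            south = grid[i][j - 1] if j > 0 else boundary
            grid[i][j] = west + south
    return grid


if __name__ == "__main__":
    for row in wavefront_sweep(3, 4):
        print(row)
```

The outer loop over `d` is the sequential "wave"; the inner loop is the part that a parallel implementation would distribute across threads or processors.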
Platform independent profiling of a QCD code
The supercomputing platforms available for high performance computing based
research evolve at a great rate. However, this rapid development of novel
technologies requires constant adaptations and optimizations of the existing
codes for each new machine architecture. In such a context, minimizing the time
needed to port the code efficiently to a new platform is of crucial importance. A
possible solution for this common challenge is to use simulations of the
application that can assist in detecting performance bottlenecks. Due to the
prohibitive cost of classical cycle-accurate simulators, coarse-grained
simulations are more suitable for large parallel and distributed systems. We
present a procedure for profiling the openQCD code [1] through
simulation, which will enable the global reduction of the cost of profiling and
optimizing this code, which is commonly used in the lattice QCD community. Our approach
is based on the well-known SimGrid simulator [2], which allows for fast and
accurate performance predictions of HPC codes. Additionally, accurate
estimations of the program behavior on some future machines, not yet accessible
to us, are anticipated.
Massively parallel quantum computer simulator, eleven years later
A revised version of the massively parallel simulator of a universal quantum
computer, described in this journal eleven years ago, is used to benchmark
various gate-based quantum algorithms on some of the most powerful
supercomputers that exist today. Adaptive encoding of the wave function reduces
the memory requirement by a factor of eight, making it possible to simulate
universal quantum computers with up to 48 qubits on the Sunway TaihuLight and
on the K computer. The simulator exhibits close-to-ideal weak-scaling behavior
on the Sunway TaihuLight, on the K computer, on an IBM Blue Gene/Q, and on Intel
Xeon based clusters, implying that the combination of parallelization and
hardware can track the exponential scaling due to the increasing number of
qubits. Results of executing simple quantum circuits and Shor's factorization
algorithm on quantum computers containing up to 48 qubits are presented.
Comment: Substantially rewritten + new data. Published in Computer Physics
Communicatio
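The memory claim in the abstract above can be checked with simple arithmetic. The sketch below assumes a full state vector of 2^n complex amplitudes stored as two 8-byte doubles each (16 bytes), and takes the stated factor-of-eight reduction to mean 2 bytes per amplitude; both byte counts and the helper name `state_vector_bytes` are assumptions for illustration, not details from the abstract.

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude):
    """Memory needed to store all 2**n_qubits amplitudes of a state vector."""
    return (2 ** n_qubits) * bytes_per_amplitude


PIB = 2 ** 50  # bytes in one pebibyte

# Assumption: double-precision complex amplitudes, 16 bytes each.
full = state_vector_bytes(48, 16)
# Assumption: a factor-of-eight reduction corresponds to 2 bytes per amplitude.
compact = state_vector_bytes(48, 2)

print(full / PIB)     # 4.0 PiB
print(compact / PIB)  # 0.5 PiB
```

Each additional qubit doubles both figures, which is the exponential scaling that the abstract says the simulator tracks.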