Solving the Ghost-Gluon System of Yang-Mills Theory on GPUs
We solve the ghost-gluon system of Yang-Mills theory using Graphics
Processing Units (GPUs). Working in Landau gauge, we use the Dyson-Schwinger
formalism for the mathematical description, as this approach is well suited to
benefit directly from the computing power of GPUs. With the help of a
Chebyshev expansion for the dressing functions and a subsequent application of
a Newton-Raphson method, the non-linear system of coupled integral equations is
linearized. The resulting Newton matrix is generated in parallel using OpenMPI
and CUDA(TM). Our results show that it is possible to cut the run time by
two orders of magnitude compared to a sequential version of the code. This
makes the proposed techniques well suited for Dyson-Schwinger calculations on
more complicated systems, where the Yang-Mills sector of QCD serves as a
starting point. In addition, the computation of Schwinger functions using GPU
devices is studied.
Comment: 19 pages, 7 figures; additional figure added, dependence on
block size is investigated in more detail; version accepted by CP
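The linearization scheme the abstract describes (expand the unknown functions, then apply Newton-Raphson to the resulting coupled nonlinear algebraic system) can be sketched on a toy one-dimensional integral equation. This is an illustrative stand-in, not the Yang-Mills system: Gauss-Legendre collocation replaces the paper's Chebyshev expansion for brevity, and the kernel and constants are invented so the exact solution is known.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# Toy nonlinear integral equation (stand-in for the ghost-gluon system):
#   f(x) = 1 + x + lam * x * \int_{-1}^{1} y f(y)^2 dy
# Discretize by collocation at Gauss-Legendre nodes, then linearize the
# coupled nonlinear system with Newton-Raphson, as in the paper's scheme.
lam = 0.3
n = 16
y, w = leggauss(n)                     # quadrature nodes and weights

f = np.ones(n)                         # initial guess at the collocation points
for _ in range(20):
    integral = w @ (y * f**2)
    resid = f - 1.0 - y - lam * y * integral
    # Newton matrix J_ij = dR_i/df_j = delta_ij - lam * y_i * w_j * y_j * 2 f_j;
    # in the paper it is this dense matrix that is generated on the GPUs.
    J = np.eye(n) - lam * np.outer(y, 2.0 * w * y * f)
    step = np.linalg.solve(J, resid)
    f -= step
    if np.linalg.norm(step) < 1e-13:
        break

a = 1.0 / (1.0 - 4.0 * lam / 3.0)      # exact solution of the toy: f(x) = 1 + a*x
```

Each Newton step solves a linear system whose matrix entries are independent of one another, which is what makes generating the Newton matrix an embarrassingly parallel task.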
Solving Lattice QCD systems of equations using mixed precision solvers on GPUs
Modern graphics hardware is designed for highly parallel numerical tasks and
promises significant cost and performance benefits for many scientific
applications. One such application is lattice quantum chromodynamics (lattice
QCD), where the main computational challenge is to efficiently solve the
discretized Dirac equation in the presence of an SU(3) gauge field. Using
NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector
product that performs at up to 40 Gflops, 135 Gflops and 212 Gflops for double,
single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have
developed a new mixed precision approach for Krylov solvers using reliable
updates which allows for full double precision accuracy while using only single
or half precision arithmetic for the bulk of the computation. The resulting
BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations
until convergence, perform better than the usual defect-correction approach for
mixed precision.
Comment: 30 pages, 7 figures
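The reliable-update idea (run the Krylov recurrence mostly in low precision, but periodically recompute the true residual in double precision) can be sketched in NumPy. This is a minimal CPU analogue of the concept with illustrative parameters, not the paper's CUDA BiCGstab/CG:

```python
import numpy as np

def cg_reliable(A, b, tol=1e-10, update_every=20, max_iter=2000):
    """Conjugate gradients with float32 arithmetic for the bulk of the work.
    Every `update_every` iterations a 'reliable update' recomputes the true
    residual b - A x in float64, so the solver can reach far better accuracy
    than the low-precision inner recurrence alone would allow."""
    A32 = A.astype(np.float32)
    x = np.zeros(len(b))                       # iterate accumulated in float64
    r = (b - A @ x).astype(np.float32)         # working residual in float32
    p = r.copy()
    rs = r @ r
    for it in range(1, max_iter + 1):
        Ap = A32 @ p                           # cheap float32 mat-vec product
        alpha = rs / (p @ Ap)
        x += np.float64(alpha) * p
        r = r - alpha * Ap
        if it % update_every == 0:             # reliable update in float64
            r = (b - A @ x).astype(np.float32)
        rs_new = r @ r
        if np.sqrt(float(rs_new)) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Because the perturbations introduced by the float32 arithmetic scale with the current residual, the periodic double-precision corrections let the true residual keep contracting well below the single-precision limit.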
Accelerated Event-by-Event Neutrino Oscillation Reweighting with Matter Effects on a GPU
Oscillation probability calculations are becoming increasingly CPU intensive
in modern neutrino oscillation analyses. The independence of reweighting
individual events in a Monte Carlo sample lends itself to parallel
implementation on a Graphics Processing Unit. The library "Prob3++" was ported
to the GPU using the CUDA C API, allowing for large scale parallelized
calculations of neutrino oscillation probabilities through matter of constant
density, decreasing the execution time by a factor of 75 when compared to
performance on a single CPU.
Comment: final post-submission update; quantified the difference in event
rates for binned and event-by-event reweighting with a typical binning
scheme; improved formatting of references
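Since each Monte Carlo event's probability depends only on that event's own baseline and energy, events map one-to-one onto GPU threads. A vectorized NumPy version of the simplest, two-flavor vacuum case shows the per-event shape of the computation; the parameter values here are illustrative, and the paper's library handles the full three-flavor calculation with constant-density matter effects:

```python
import numpy as np

def osc_prob(L_km, E_GeV, sin2_2theta=0.085, dm2_eV2=2.5e-3):
    """Two-flavor vacuum appearance probability per event:
        P = sin^2(2 theta) * sin^2(1.267 * dm2 * L / E)
    with dm2 in eV^2, L in km and E in GeV (the 1.267 fixes the units).
    Each array element is one Monte Carlo event; on the GPU each element
    would be handled by its own thread."""
    phase = 1.267 * dm2_eV2 * np.asarray(L_km) / np.asarray(E_GeV)
    return sin2_2theta * np.sin(phase) ** 2
```

Reweighting N events is thus a single elementwise kernel with no inter-thread communication, which is why the GPU port pays off so directly.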
APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters
We describe herein the APElink+ board, a PCIe interconnect adapter featuring
the latest advances in wire speed and interface technology, plus hardware
support for an RDMA programming model and experimental acceleration of GPU
networking. This design allows us to build a low-latency, high-bandwidth PC
cluster, the APEnet+ network, the new generation of our cost-effective
cluster network architecture, scalable to tens of thousands of nodes. Some
test results and a characterization of data transmission on a complete
testbench, based on a commercial development card mounting an Altera FPGA,
are provided.
Comment: 6 pages, 7 figures; proceedings of CHEP 2010, Taiwan, October 18-2
Fine-grained bit-flip protection for relaxation methods
Resilience is considered a challenging, under-addressed issue that the high performance computing (HPC) community will have to face in order to produce reliable exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit flips, rejecting some component updates and turning the initially synchronized solver into an asynchronous iteration. Our experimental study, with sparse incomplete factorizations from a collection of real-world applications and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
This material is based upon work supported in part by the U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA. E. S. Quintana-Ortí was supported by project CICYT TIN2014-53495-R of MINECO and FEDER.
Anzt, H.; Dongarra, J.; Quintana-Ortí, E. S. (2019). Fine-grained bit-flip protection for relaxation methods. Journal of Computational Science, 36:1-11. https://doi.org/10.1016/j.jocs.2016.11.013
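The mechanism the abstract describes can be sketched as follows. This is a minimal illustration with made-up thresholds, not the paper's actual detection criterion or GPU kernel: each Jacobi sweep checks, per component, whether the update changed implausibly relative to the previous sweep, and rejects suspicious updates, which turns the synchronized sweep into an asynchronous iteration.

```python
import numpy as np

def jacobi_protected(A, b, sweeps=500, growth=100.0):
    """Jacobi iteration with component-wise bit-flip protection (sketch).
    An update whose change versus the last iterate exceeds `growth` times
    that component's previous change is treated as corrupted and rejected;
    the component keeps its old value and simply lags one or more sweeps
    behind, i.e. the method becomes an asynchronous iteration."""
    D = np.diag(A)
    R = A - np.diag(D)
    x = np.zeros_like(b, dtype=float)
    prev_delta = np.full(len(b), np.inf)   # never reject on the first sweep
    for _ in range(sweeps):
        x_new = (b - R @ x) / D
        delta = np.abs(x_new - x)
        ok = delta <= growth * prev_delta  # cheap component-wise sanity check
        x = np.where(ok, x_new, x)         # rejected components stay put
        prev_delta = np.where(ok, delta, prev_delta)
    return x
```

Because a falsely rejected component is simply recomputed from fresher neighbors on the next sweep, the check costs convergence delay rather than correctness, matching the trade-off the paper studies.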
Simulation of reaction-diffusion processes in three dimensions using CUDA
Numerical solution of reaction-diffusion equations in three dimensions is one
of the most challenging applied mathematical problems. Since these simulations
are very time consuming, any ideas and strategies aiming at the reduction of
CPU time are important topics of research. A general and robust idea is the
parallelization of source codes/programs. Recently, the technological
development of graphics hardware created a possibility to use desktop video
cards to solve numerically intensive problems. We present a powerful parallel
computing framework to solve reaction-diffusion equations numerically using the
Graphics Processing Units (GPUs) with CUDA. Four different reaction-diffusion
problems, (i) diffusion of chemically inert compound, (ii) Turing pattern
formation, (iii) phase separation in the wake of a moving diffusion front and
(iv) air pollution dispersion were solved, and additionally both the Shared
method and the Moving Tiles method were tested. Our results show that the
parallel implementation achieves typical acceleration factors of 5-40
compared to a single-threaded CPU implementation on a 2.8 GHz desktop
computer.
Comment: 8 figures, 5 tables
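The computational core such a framework parallelizes is the stencil update. A minimal NumPy version of one explicit time step for plain diffusion on a periodic 3-D grid is sketched below; it is an illustrative analogue, with the paper's Shared and Moving Tiles methods being CUDA tiling strategies for exactly this kind of kernel:

```python
import numpy as np

def diffusion_step(c, D=1.0, dx=1.0, dt=0.1):
    """One explicit Euler step of dc/dt = D * laplace(c) on a periodic grid,
    using the seven-point stencil; stable for D*dt/dx**2 <= 1/6. On the GPU
    each grid point becomes one thread; np.roll stands in for the neighbor
    loads that a shared-memory tiling scheme would stage explicitly."""
    lap = (np.roll(c, 1, axis=0) + np.roll(c, -1, axis=0)
         + np.roll(c, 1, axis=1) + np.roll(c, -1, axis=1)
         + np.roll(c, 1, axis=2) + np.roll(c, -1, axis=2) - 6.0 * c) / dx**2
    return c + D * dt * lap
```

A reaction-diffusion solver adds the (pointwise, and therefore equally parallel) reaction terms to the same update; the neighbor loads are what make memory tiling the performance-critical choice on the GPU.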
Skyline: Interactive In-Editor Computational Performance Profiling for Deep Neural Network Training
Training a state-of-the-art deep neural network (DNN) is a
computationally-expensive and time-consuming process, which incentivizes deep
learning developers to debug their DNNs for computational performance. However,
effectively performing this debugging requires intimate knowledge about the
underlying software and hardware systems---something that the typical deep
learning developer may not have. To help bridge this gap, we present Skyline: a
new interactive tool for DNN training that supports in-editor computational
performance profiling, visualization, and debugging. Skyline's key contribution
is that it leverages special computational properties of DNN training to
provide (i) interactive performance predictions and visualizations, and (ii)
directly manipulatable visualizations that, when dragged, mutate the batch size
in the code. As an in-editor tool, Skyline allows users to leverage these
diagnostic features to debug the performance of their DNNs during development.
An exploratory qualitative user study of Skyline produced promising results;
all the participants found Skyline to be useful and easy to use.
Comment: 14 pages, 5 figures; appears in the proceedings of UIST'2