
    Solving Lattice QCD systems of equations using mixed precision solvers on GPUs

    Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40 Gflops, 135 Gflops and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates, which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach to mixed precision.
    Comment: 30 pages, 7 figures
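    The reliable-update idea described above can be sketched in a few lines: run the Krylov iteration in low precision, keep the solution and the "true" residual in double precision, and recompute that residual whenever the iterated low-precision residual has dropped far enough. The sketch below (plain NumPy, not the paper's CUDA code) is a minimal single/double mixed-precision CG; the function name, the `delta` threshold, and the restart-on-update policy are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of mixed-precision CG with reliable updates (assumed details;
# not the authors' GPU implementation). Inner arithmetic is float32; the
# solution and the true residual are kept in float64.
import numpy as np

def mixed_precision_cg(A, b, tol=1e-12, delta=0.1, max_iter=10000):
    A32, A64 = A.astype(np.float32), A.astype(np.float64)
    b64 = b.astype(np.float64)
    x = np.zeros_like(b64)                 # solution accumulated in double
    r = b64.copy()                         # true residual (double)
    r32 = r.astype(np.float32)             # iterated residual (single)
    p32 = r32.copy()
    rnorm_ref = np.linalg.norm(r)          # residual norm at last reliable update
    bnorm = np.linalg.norm(b64)
    for _ in range(max_iter):
        Ap = A32 @ p32                     # low-precision matrix-vector product
        alpha = (r32 @ r32) / (p32 @ Ap)
        x += np.float64(alpha) * p32.astype(np.float64)
        r_new = r32 - alpha * Ap
        beta = (r_new @ r_new) / (r32 @ r32)
        r32, p32 = r_new, r_new + beta * p32
        if np.linalg.norm(r32) < delta * rnorm_ref:
            r = b64 - A64 @ x              # reliable update: true residual in double
            rnorm_ref = np.linalg.norm(r)
            r32, p32 = r.astype(np.float32), r.astype(np.float32)  # restart direction
            if rnorm_ref < tol * bnorm:
                break
    return x
```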

    Resolution of Linear Algebra for the Discrete Logarithm Problem Using GPU and Multi-core Architectures

    In cryptanalysis, solving the discrete logarithm problem (DLP) is key to assessing the security of many public-key cryptosystems. The index-calculus methods, which attack the DLP in multiplicative subgroups of finite fields, require solving large sparse systems of linear equations modulo large primes. This article deals with how we can run this computation on GPU- and multi-core-based clusters, featuring InfiniBand networking. More specifically, we present the sparse linear algebra algorithms that are proposed in the literature, in particular the block Wiedemann algorithm. We discuss the parallelization of the central matrix-vector product operation from both algorithmic and practical points of view, and illustrate how our approach has contributed to the recent record-sized DLP computation in GF(2^809).
    Comment: Euro-Par 2014 Parallel Processing, Aug 2014, Porto, Portugal. <http://europar2014.dcc.fc.up.pt/>
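    The kernel whose parallelization the article discusses is, at its core, a sparse matrix-vector product with all arithmetic reduced modulo a large prime, applied repeatedly by the block Wiedemann algorithm. The sketch below shows that kernel for a CSR-stored matrix; the function name and the CSR layout are illustrative assumptions (the paper's versions distribute exactly this loop across GPU and multi-core nodes).

```python
# Sketch of the central block-Wiedemann kernel: y = A x mod p for a sparse A
# stored in CSR form (indptr, indices, data). Each output row is independent,
# which is what makes row-wise distribution across nodes and GPUs possible.
def spmv_mod_p(indptr, indices, data, x, p):
    n = len(indptr) - 1
    y = [0] * n
    for i in range(n):                           # rows can be processed in parallel
        acc = 0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]       # exact integer arithmetic
        y[i] = acc % p                           # reduce once per row
    return y
```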

    Complexity transitions in global algorithms for sparse linear systems over finite fields

    We study the computational complexity of a very basic problem, namely that of finding solutions to a very large set of random linear equations in a finite Galois field modulo q. Using tools from statistical mechanics we are able to identify phase transitions in the structure of the solution space and to connect them to changes in the performance of a global algorithm, namely Gaussian elimination. Crossing the phase boundaries produces a dramatic increase in the memory and CPU requirements of the algorithm; in turn, this causes the running time to saturate its upper bounds. We illustrate the results on the specific problem of integer factorization, which is of central interest for deciphering messages encrypted with the RSA cryptosystem.
    Comment: 23 pages, 8 figures
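    For reference, the "global algorithm" being analyzed can be written down compactly: the sketch below is a plain Gauss-Jordan elimination over GF(q) for a prime q. Dense row storage and the non-singularity assumption are simplifications made to keep the example short; on sparse random systems it is precisely the fill-in created during elimination that drives the memory and CPU blow-up discussed above.

```python
# Gauss-Jordan elimination over GF(q), q prime; assumes the system is
# non-singular so a pivot always exists. Requires Python 3.8+ for pow(a, -1, q).
def gauss_gf_q(A, b, q):
    n = len(A)
    M = [[a % q for a in row] + [bi % q] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] % q != 0)  # find a pivot row
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, q)           # modular inverse of the pivot
        M[col] = [v * inv % q for v in M[col]]  # scale pivot row to 1
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(vr - f * vc) % q for vr, vc in zip(M[r], M[col])]  # eliminate
    return [row[n] for row in M]                # solution vector
```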

    Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform

    We provide an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA) to find the solution of a large sparse system of linear equations over GF(2). One of the important applications of solving such systems arises in most integer factorization algorithms, such as the Number Field Sieve. In this paper, we describe how hybrid parallelization can be adapted to speed up the most time-consuming stage of the BWA, sequence generation. This stage generates a sequence of matrix-matrix products and matrix transpose-matrix products in which the matrices are very large, highly sparse, and have entries over GF(2). We describe a GPU-accelerated parallel method for computing these matrix-matrix products using techniques such as row-wise parallel distribution of the first matrix across a multi-node, multi-GPU platform (via MPI and CUDA) and word-wise XORing of rows of the second matrix. We also describe the hybrid parallelization of the matrix transpose-matrix product computation, in which both matrices are divided row-wise into equal-sized blocks using MPI; after a GPU-accelerated matrix transpose-matrix product for each block, the partial results are combined with the MPI_BXOR operation in MPI_Reduce to obtain the final product. The performance of the hybrid (MPI + GPU) parallelization of the sequence generation step on a hybrid cluster is compared with parallelization using MPI processes alone. We have used this hybrid parallel sequence generation tool for benchmarking an HPC cluster. Detailed timings of the complete solution of the Number Field Sieve matrices for RSA-130, RSA-140, and RSA-170 are also compared, using up to 4 NVIDIA V100 GPUs of a DGX station. We obtained a speedup of 2.8 on 4 V100 GPUs relative to a single GPU.
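    The word-wise XOR technique mentioned above has a compact serial form: pack each row of the second matrix B into 64-bit words, and compute each row of C = A·B over GF(2) as the XOR of the B-rows selected by the nonzeros of that row of A. The sketch below illustrates this; the variable names and the list-of-column-indices representation of A are illustrative assumptions, and the paper distributes the outer loop row-wise across MPI ranks and GPUs.

```python
# Sketch of the GF(2) sparse-times-dense product from the sequence-generation
# stage: B is word-packed (numpy uint64), and each output row is accumulated
# by word-wise XOR, the operation that maps well to GPUs.
import numpy as np

def gf2_spmm(A_rows, B_bits, n_cols_B):
    """A_rows[i]: nonzero column indices of row i of the sparse matrix A over GF(2).
    B_bits[j]: row j of B packed into uint64 words. Returns C = A @ B, word-packed."""
    words = (n_cols_B + 63) // 64
    C = np.zeros((len(A_rows), words), dtype=np.uint64)
    for i, cols in enumerate(A_rows):       # outer loop: distributed row-wise
        acc = np.zeros(words, dtype=np.uint64)
        for j in cols:
            acc ^= B_bits[j]                # addition over GF(2) is XOR, one word at a time
        C[i] = acc
    return C
```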