Resolution of Linear Algebra for the Discrete Logarithm Problem Using
  GPU and Multi-core Architectures by Jeljeli, Hamza
ar
X
iv
:1
40
2.
36
61
v2
  [
cs
.C
R]
  4
 D
ec
 20
14
Resolution of Linear Algebra for the Discrete
Logarithm Problem Using GPU and Multi-core
Architectures
Hamza Jeljeli
CARAMEL project-team, LORIA, INRIA / CNRS / Université de Lorraine,
Campus Scientifique, BP 239, 54506 Vandœuvre-lès-Nancy Cedex, France
Hamza.Jeljeli@loria.fr
Abstract. In cryptanalysis, solving the discrete logarithm problem (DLP)
is key to assessing the security of many public-key cryptosystems. The
index-calculus methods, that attack the DLP in multiplicative subgroups
of finite fields, require solving large sparse systems of linear equations
modulo large primes. This article deals with how we can run this com-
putation on GPU- and multi-core-based clusters, featuring InfiniBand
networking. More specifically, we present the sparse linear algebra algo-
rithms that are proposed in the literature, in particular the block Wiede-
mann algorithm. We discuss the parallelization of the central matrix–
vector product operation from both algorithmic and practical points
of view, and illustrate how our approach has contributed to the recent
record-sized DLP computation in GF(2809).
Keywords: Discrete logarithm problem, sparse linear algebra, parallel comput-
ing, GPU acceleration, multi-core processors, InfiniBand.
1 Introduction
The security of several public-key cryptosystems and protocols relies on the
hardness of the computation of the discrete logarithm problem (DLP) in a given
cyclic group [?]. To name but a few, we can mention the Diffie–Hellman key
exchange protocol [?], the ElGamal encryption system [?] or the pairing-based
cryptography [?].
In this context, a family of algorithms, known as index-calculus methods,
is used to attack the DLP on finite fields. The majority of these algorithms
propose to solve it in time sub-exponential in the size of the finite field. While
a stream of recent algorithmic improvements for fields of small characteristic,
including a quasi-polynomial algorithm [?], have produced several record-sized
computations [?], the sub-exponential methods appear to be most competitive
for fields of prime extension degree, at least so far.
Index calculus algorithms require solving large sparse systems of linear equa-
tions over finite fields. It is important to mention that, most considerations and
methods in the case of numerical computations do not apply here. Several pa-
pers have focused on efficient implementations of sparse linear algebra over finite
2 Hamza Jeljeli
fields. For instance, Schmidt et al. [?] treated linear algebra over GF(2) for in-
teger factorization; Boyer et al. [?] worked on the case of small finite rings and
fields.
Problem Statement. Let GF(q) be the field in which the DLP is to be solved.
The linear algebra is performed modulo a large prime ℓ that divides q − 1. We
consider ℓ between 160 and 650 bits, along with an N -by-N sparse matrix A
defined over Z/ℓZ. The size N ranges from hundreds of thousands to millions.
Each row of A contains O
(
(logN)2
)
non-zero coefficients. The very first columns
of A are relatively dense, then the column density decreases gradually. The row
density does not change significantly (cf. Figure 2). The so-called linear algebra
step in the DLP computation consists in finding a non-trivial vectorw ∈ (Z/ℓZ)N
such that Aw = 0.
We assume that we have access to one or several high-performance comput-
ing clusters, containing multi-core CPUs and/or GPUs, interconnected by fast
communication links (typically InfiniBand). We want to optimize the use of these
resources in order to solve the linear algebra problem efficiently. In particular,
we aim to minimize the overall wall-clock time for solving the problem. First, at
an algorithmic level, we study how these heavy computations can be distributed
into smaller parallel subtasks. Then, we focus on more practical concerns, for
instance the communication within these different subtasks.
Organization. This article is organized as follows: Section 2 gives an overview
of the relevant algorithms for sparse linear algebra, while we discuss the par-
allelization of the matrix–vector product operation and focus on the communi-
cation concerns in Section 3. Finally, Section 4 details how our implementation
has been used in concrete DLP computations with different hardware setups.
2 Algorithms for Sparse Linear Algebra
To solve systems of linear equations, two families of algorithms are available:
direct methods, such as Gaussian elimination or LU/QR decompositions, and
iterative methods, such as the conjugate gradient method and, in the context of
linear algebra over finite fields, the Lanczos [?] and Wiedemann [?] algorithms.
The first set of algorithms requires O (Nω) field operations, where ω is, for
implementation concerns, 2.81 at best using the Strassen algorithm for matrix
multiplication [?]. However, these methods tend to densify the matrix, which
quickly raises storage issues. The second set of algorithms does not modify the
matrix and requires O (N) sparse-matrix–vector products (SpMVs). As long as
an SpMV can be performed faster than O
(
Nω−1
)
field operations, the itera-
tive methods are asymptotically faster. This condition is reasonable, since the
complexity of an SpMV is O (Nγ), where γ is the average number of non-zero co-
efficients per row. From both storage and complexity points of view, the iterative
methods appear to be more suited to sparse linear algebra.
Still, in our case, despite the fact that the matrix is extremely sparse, the
cost of an iterative solver remains high because the matrix is very large. The
exact nature of the computation calls for no less than N iterations, or a number
proportional to N depending on some fine points. The approach following is
Resolution of Linear Algebra for DLP Using Parallel Architectures 3
applied to tackle that problem. First, a structured Gaussian elimination (SGE)
is run as a preprocessing step so as to reduce the size of the matrix [?]; then an
iterative solver is used. Although the Gaussian elimination increases the average
row weight, it nevertheless allows us to decrease the cost of the iterative solver
and to reduce the amount of required memory, which is a major implementation
concern as will be seen in the following. It is important that we stop the SGE
when the projected cost of the iterative solver starts to increase again or when
memory requirements are small enough so as to fit on the hardware at hand [?].
The Lanczos and Wiedemann algorithms are the most commonly used it-
erative algorithms in the context of finite fields linear algebra. The Lanczos
algorithm is known to have a better complexity than the Wiedemann algorithm.
However, the block extension of Wiedemann algorithm (a.k.a block Wiedemann)
offers the opportunity to split the computation into several independent sub-
tasks, which is an important practical advantage [?,?].
The Wiedemann algorithm and block Wiedemann algorithms return both a
vector w of the kernel of A. This vector is non-trivial with high probability. In
practice, a single run of the solver is sufficient to find an appropriate solution.
Wiedemann algorithm The starting point of the Wiedemann algorithm is to
choose two random vectors x, y ∈ (Z/ℓZ)N . The algorithm is organized in three
steps [?], for which we use monikers borrowed from the CADO-NFS software
implementation [?].
– The first step computes the first 2N terms of the linearly recurrent sequence
(ai)i∈N ∈ (Z/ℓZ)
N, where ai =
txAiy. This step is usually called Krylov .
– Then, thanks to the Berlekamp–Massey algorithm, we compute the minimal
polynomial of the sequence, which is the polynomial F (X) =
∑d
i=0 fiX
i of
lowest degree d such that
∑d
i=0 fiak+i = 0 for all k ≥ 0. The degree d is
close to N . We commonly call this step Lingen .
– The last step, called Mksol , finally computes w = F (A)y.
The Wiedemann algorithm requires 3N SpMVs for the Krylov and Mksol
steps and O (N logN) field operations for the Lingen step.
BlockWiedemann algorithm Wiedemann algorithm is fully sequential. In [?,?],
Coppersmith et al. presented a block variant that provides parallelism. The block
Wiedemann algorithm replaces the vector y ∈ (Z/ℓZ)N by a block of n vectors
y(0), . . . , y(n−1), each in (Z/ℓZ)N , and similarly uses a block of m vectors for x.
The sequence of scalars ai is thus replaced by a sequence of m-by-n matrices.
There is a complete freedom in the choice of the blocking parameters (m,n). For
the efficiency of the Lingen run, m is chosen to be equal to 2n [?].
– The Krylov step now computes the first
⌈
N
n
⌉
+
⌈
N
m
⌉
terms of the sequence
(ai)i∈N. Notice that the j-th column of the m-by-n matrix
txAiy depends
only on the j-th column of the block vector y. Thus, the computation of(
txAiy
)
i∈N
can be distributed into n parallel tasks, each computing
(
txAiy(j)
)
i∈N
. These tasks need no synchronization nor communication, ex-
cept at the end when all their results are combined.
4 Hamza Jeljeli
– The Lingen step seeks a linear generator for the previous sequence. The com-
plexity of this step becomes O
(
nω−1N logN
)
with m = 2n = o (logN) [?].
The output of Lingen is composed of n generators F (0), . . . , F (n−1), each of
them a polynomial over Z/ℓZ of degree less than
⌈
N
n
⌉
.
– The Mksol step computes the following element of the null-space of A: w =∑n
j=1 F
(j)(A)y(j). Similarly to the Krylov phase, the computation can be
distributed into n independent computations.
In the rest of the paper, we focus on the Krylov and Mksol steps, as they
dominate the overall cost and can benefit from parallel hardware. For the Lingen
computation, we use the CADO-NFS software [?].
3 The Matrix–Vector Product
The Lingen step complexity depends roughly quadratically on the blocking pa-
rameter n. Therefore, we can not increase too much the blocking parameters
(n,m). We observe also that the block Wiedemann algorithm does not distribute
the matrix–vector product, so it does not reduce the amount of required mem-
ory per node. Thus, the parallelism provided by the block Wiedemann algorithm
is soon limited. We need to explore how to carry out a Krylov/Mksol task on
more than one computation node. Typically, this is related to performing each
matrix–vector product in parallel on many computation nodes. In this section,
we study how to accelerate this major operation on parallel hardware.
We assume that we have a set of identical computing nodes organized ac-
cording to a 2D rectangular grid and interconnected by a network. Each node is
identified by its coordinates (i, j) in the grid. At this level, we ignore the nature
of the nodes. The nodes could be cores within a machine, independent machines
or GPUs. The matrix A is split into square parts of equal size, such that each
node (i, j) gets the part Ai,j .
3.1 Communication/Computation scheme
An SpMV iteration takes an input vector u and computes v = Au. At the
beginning of an iteration, a node (i, j) holds the sub-matrix Aij and the j-th
fragment uj of the input vector u. The nodes collaborate together to compute the
output vector, which will be the input vector to the next iteration. To be able to
run the next iteration, the node Aij only needs to know the j-th fragment vj of
the output vector v. More specifically, the parallel SpMV product is performed
as follows.
1. Each node (i, j) computes the partial SpMV Aijuj.
2. Each diagonal node (i, i) collects and sums the partial results from the nodes
of the row i. The sum corresponds to the i-th fragment of v.
3. Each diagonal node (i, i) broadcasts its fragment vj to the nodes of the
column i.
In Figure 1, we give an example of a run for 4 parallel nodes with a 2 × 2
split of the matrix. In this figure, the 4 nodes are, represented in gray, numbered
from 0 to 3. On the left-hand side, we indicate how the matrix A and the input
Resolution of Linear Algebra for DLP Using Parallel Architectures 5
A00
0 A01
1
A10
2 A11
3
u0
0
2
u1
1
3
v0
v1
u0 u1 u0 u1
SpMV
partial
SpMV
partial
partial
SpMV
partial
SpMV
0
(0, 0)
1
(0, 1)
2
(1, 0)
3
(1, 1)
A00u0 A01u1 A10u0 A11u1
+
+
v0 v1
v0 v1 v0 v1
Initial state
SpMV
Reduction
Broadcast
Fig. 1: Computation/Communication scheme for a 2× 2 split of A
vector u are distributed among the nodes. We detail on the right-hand side the
intermediate data present on each node after each step.
The communication scheme suffers from the fact that only one node per
row collects the partial products. A parallelization of the Reduction/Broadcast
operations is possible, typically using the ReduceScatter/AllGather operations.
This should yield to a significant speedup of the communication delay. However,
the output of the iteration will be permuted, i.e., the fragments of v will not be
distributed as were those of u in the beginning of the iteration. In summary, it
remains an improvement that can be explored.
3.2 Balancing the workload
The particular distribution of the non-zero coefficients is such that the nodes
will get unbalanced workloads, and the nodes working on the denser parts will
take more time than those working on the sparser ones. For the particular kind
of input, this unbalance problem can fortunately be solved efficiently. To fix this
problem, we apply permutations of the rows and columns, so that the distribu-
tion of non-zero coefficients for each sub-matrix is close to that of the matrix A,
as shown in Figure 2. One possibility to obtain this permutation is to sort the
columns by their weight and distribute them evenly among the nodes, then pro-
ceed likewise with the rows. This is made possible by the fact that the standard
deviation of the row weight is much smaller than that of the column weight.
3.3 The partial SpMV
The matrix is stored in a sparse format, adapted from the Compressed Sparse
Row (CSR) format for the particular distribution of the non-zero coefficients.
We chose to implement the arithmetic operations in Z/ℓZ using the Residue
Number System (RNS). The use of this representation for finite field arithmetic
6 Hamza Jeljeli
Initial Balanced
Fig. 2: Distribution of non-zero coefficients for initial and balanced matrices
provides a fine grained parallelism, which can be exploited by Single Instruction,
Multiple Data (SIMD) architectures.
In the remainder of the article, we consider the partial matrix–vector product
as a black box, that is, a subroutine which, on inputs A and u returns the product
Au. We give more details about how this subroutine is implemented in [?].
3.4 Communication concerns
We now focus on how to share data between the computing nodes, in the cases
of CPU nodes and GPU nodes.
CPU communications The case of CPU-only setups is quite straightforward,
as we use the MPI operation MPI_Reduce to collect and combine on a diagonal
node the results of nodes belonging to the same row, and MPI_Bcast to broadcast
the combined results to the nodes of each column. In the following subsections,
we assume that we execute the application over a cluster of GPUs and we discuss
the data movement. We restrict to NVIDIA graphics hardware. Distributing an
SpMV on several GPUs requires considering two possible (and not mutually
exclusive) cases: the first one where a single CPU node harbors two or more
GPUs, and the second one where the GPUs are in different CPU nodes.
Intra-node GPU communications We are in the case of sharing data be-
tween two GPUs within the same CPU node. In order to do so, CUDA, the
parallel programming model for NVIDIA GPUs [?] offers three possibilities:
– Staging through CPU: the communication has to involve the host CPU.
Thus, it is composed of two transfers, a device-to-host copy (D2H) then a
host-to-device copy (H2D).
– Device-to-device copy (D2D): from the programmer’s perspective, it is a
direct copy of the GPU buffers. Although the transfer still passes through
the host memory, the copy is fully pipelined.
– Peer-to-Peer Direct Access (P2P DMA): using this feature, the devices can
share data independently of the CPU. P2P DMA requires to enable peer
access for each GPU, which is supported by recent hardware.
The P2P DMA feature should decrease the host overhead and thus accelerate
the memory copies. To verify it, we ran benchmarks to compare the bandwidth
and latency of each approach (cf. Figure 3). The experiment is performed using
two NVIDIA GeForce GTX 680 cards. The benchmarks measure the run time
for sending messages of increasing size from one GPU to the other. The latencies
for the first two options are 19.7µs and 19.4µs, respectively, and only 14µs when
the P2P DMA is enabled. The peak bandwidths are 6.1 GB/s for the explicit
host staging transfer, 7.3 GB/s for the device to device transfer, and 10.4 GB/s
for the P2P DMA transfer.
Resolution of Linear Algebra for DLP Using Parallel Architectures 7
10−1 100 101 102 103 104 105 106
0
2
4
6
8
10
12
14
Buffer size (kB)
B
a
n
d
w
id
th
(G
B
/
s)
D2H + H2D
D2D
D2D (P2P DMA)
GPU0 GPU1
Chipset
Host Memory
PCIe PCIe
Fig. 3: Benchmarking Intra-node GPU communications
Inter-node GPU communications Now, we are interested in the case of
sharing data between GPUs installed in different CPU nodes. The trivial option
in this case is to perform the transfer in three steps: a data copy from device
to host using CUDA routines, then use MPI to copy data between hosts, and
finally a CUDA copy from host to device on the destination node (cf. Figure 4).
It is however possible to overcome the host staging using the Cuda-aware
MPI feature which combines MPI and CUDA. It allows one to address GPU
buffers directly in the MPI routines (cf. Figure 5). From the programmer’s point
of view, a data transfer boils down to one call to an MPI routine. With Cuda-
aware MPI, the data transfers are fully pipelined, while without the feature, the
transfers between hosts and those between the device and the host are pipelined
separately. The Cuda-aware MPI feature is incorporated in several widely used
MPI libraries and considerably improves the data movement latencies.
GPU0 GPU1IB IB
Chipset Chipset
Host0 Memory Host1 Memory
CUDA
buffer
CUDA
buffer
CPU
buffer
CPU
buffer
PCIe PCIe
IB
GPU to
CPU copy
CPU to CPU transfer CPU to
GPU copy
Fig. 4: Data copy from GPU0 to GPU1 without Cuda-aware MPI
In Figure 6, we report the results of bandwidth benchmarks for inter-node
GPU-to-GPU communications. We ran the experiment using two NVIDIA GTX
680 installed in two nodes connected with QDR InfiniBand. We use CUDA 5.0
and Open MPI 1.7.3. In addition to benchmarks for the two ways of communica-
tion, we added the Host-to-Host (H2H) communication results as a reference, for
8 Hamza Jeljeli
GPU0 GPU1IB IB
Chipset Chipset
Host0 Memory Host1 Memory
CUDA
buffer
CUDA
buffer
CPU
buffer
CPU
buffer
PCIe PCIe
IB
Fig. 5: Data copy from GPU0 to GPU1 with Cuda-aware MPI
which we measured the data movement from one CPU buffer to another CPU
buffer using the regular MPI routines.
The latency of a plain Device-to-Device transfer is 11µs. It becomes 9 µs if the
feature Cuda-aware MPI is used. The latency of the Host-to-Host transfer is 1µs.
Without Cuda-aware MPI, the bandwidth is bounded by 2.3 GB/s. The Cuda-
aware MPI feature allows to reach the Host-to-Host peak bandwidth, which is
3.7 GB/s.
10−1 100 101 102 103 104 105 106
0
1
2
3
4
5
6
Buffer size (kB)
B
a
n
d
w
id
th
(G
B
/
s)
D2D (without Cuda-aware MPI)
D2D (with Cuda-aware MPI)
H2H
Fig. 6: Benchmarking Inter-node GPU communications
Another feature that further optimizes data transfers is GPU Direct. The
GPU Direct offers lower latency for moving data compared to transfers staged
through the host. However, its bandwidth is significantly limited. We could not
deploy this feature in our application, as it is supported only by the recent Tesla
and Quadro cards. A comparison of the performance of this feature with the
transfers staged through host can be found in [?].
4 Examples of computations
4.1 DLP in GF(2809)× using FFS
The function field sieve (FFS) [?] is an index-calculus algorithm designed to
attack the DLP in the multiplicative subgroup of a finite field GF(pn), where the
characteristic p is a small prime. Barbulescu et al. announced in [?] the solving
of the DLP in the 202-bit prime order subgroup of GF(2809)× using FFS. This
computation is the largest DLP computation in a binary field extension of prime
degree. The previous record was the computation of a DLP in GF(2613)× [?].
Resolution of Linear Algebra for DLP Using Parallel Architectures 9
Matrix. In this computation, the linear algebra step is performed in Z/ℓZ where
ℓ is 202 bits long. The relation collection phase produced an initial matrix of
78.8M rows. A preliminary structured Gaussian elimination reduced the matrix
to 3,602,667 rows and columns, with an average of 100 non-zero coefficients per
row. Each non-zero coefficient of A fits in a single machine word. Around 90%
of them are ±1 [?], [?].
Linear Algebra Setup. At the time of the computation, we had access to a
4-node cluster, with 2.4 GHz Intel Xeon E5620 Westmere processors connected
with InfiniBand network at 40 Gb/s. Each node is equipped with 2 NVIDIA
Tesla M2050 graphics processors.
The total memory required to handle the matrix along with the input and
output vectors is 3.16 GB. Since the available memory on one card is only
3 GB, the block Widemann configuration (n = 8,m = 16), for which a se-
quence
(
txAiy(j)
)
i∈N
can be computed on a single device, is not feasible. We
have to compute each sequence on more than one device; the configuration (n =
4,m = 8) with a 2 × 1 split of the matrix and the configuration (n = 2,m = 4)
with a 2 × 2 split of the matrix are possible. Theoretically, the former appears
to be the most convenient, since only two GPUs connected to the same node
communicate, while, with the latter, 4 GPUs interconnected with the network
are required to communicate.
In the following table, we detail a comparison between these two configura-
tions. The comparison shows how the inter-node GPU communication for the
second configuration slows down the overall computation time. We also present
benchmarks related to a smaller matrix, for which the three configurations are
possible and a bigger matrix, for which only the (n = 2,m = 4) configuration is
feasible.
Matrix size Possible blocking SpMV + comm. Overall Ratio com.
(required memory) parameters delays per iteration comp. time /iteration
3.6M × 3.6M (n = 4,m = 8) 142 + 27 ms 4.5 days 16%
(3.2 GB) (n = 2, m = 4) 72 + 41 ms 6 days 37%
3M × 3M
(n = 8,m = 16) 228 + 0 ms 2.5 days 0%
(2.7 GB)
(n = 4, m = 8) 115 + 23 ms 3 days 17%
(n = 2, m = 4) 58 + 35 ms 4.1 days 38%
6M × 6M
(n = 2,m = 4) 123 + 69 ms 16.7 days 36%
(5.4 GB)
With the (n = 4,m = 8) blocking parameters, an iteration takes 169 ms on
each node, including 27 ms for the GPU communications. The initial sequence
computation required 2.6 days in parallel on the 4 nodes. The linear generator
computation was carried out in parallel using 16 threads running on 16 CPU
cores. It required 2 hours. Finally, computing the kernel vector required 1.8 days
in parallel on the 4 GPU nodes. The overall computation took a total wall-clock
time of about 4.5 days.
10 Hamza Jeljeli
4.2 DLP in a 596-bit prime field using NFS
To compute discrete logarithms in a prime field GF(p), the Number Field Sieve
(NFS) algorithm is used [?]. The last NFS record was accomplished by T. Klein-
jung et al. [?] for a 530-bit (160-decimal-digit) prime p using NFS. We are
currently running an NFS-based computation to attack the DLP in a 596-bit
(180-decimal-digit) prime field. The linear algebra step is defined over a 595-bit
prime ℓ.
Matrix. The matrix contains 179M rows at the end of the relation collection.
The preliminary structured Gaussian elimination reduced the number of rows
to 7,287,476, with an average weight of 150 non-zero coefficients per row. The
matrices issued from NFS computations contain a small number (5, here) of
dense columns, whose elements live in Z/ℓZ. The rest of the matrix is similar
to an FFS matrix in terms of distribution and coefficient size. Taking this dense
part into account adds a non-negligible cost when compared to FFS matrices.
GPU Setup. For this computation, we have access to 8 NVIDIA GeForce GTX
680 graphics processors, plugged into a 4-node cluster of Intel Xeon E5-2609 pro-
cessors running at 2.4 GHz, and connected with QDR Infiniband network. Each
graphics card has 4 GB of memory. The total memory required to carry out the
SpMV on one GPU is 9.8 GB. Thus, 4 GPUs should work on a single sequence,
i.e., at most two sequences can be computed in parallel, and the blocking param-
eters are (n = 2,m = 4). An iteration takes 615 ms on each group of 4 GPUs,
with 195 ms for the GPU-to-GPU communications. The overall computation
should take a total wall-clock time of about 65 days.
CPU Setup. Another option was tried, using our CPU implementation on a
768-core cluster. The cluster contains 48 nodes connected with FDR Infiniband.
Each node hosts two 2-GHZ 8-core Intel Xeon E5-2650 processors. With this
setup, we propose an 8×8 split of the matrix, so that 64 MPI processes running on
4 nodes work together to carry out a matrix–vector product, each process running
on one core. This yields to a (n = 12,m = 24) block Wiedemann configuration.
The processes are distributed so that the processes running on the same node
are contiguous. This allows to accelerate the reduction/broadcast operations,
since data sharing between threads belonging to the same node is performed on
shared memory, not across the network. A speedup of 2.4 on the communication
delay is observed when comparing with the default MPI processes mapping.
In the following table, we compare the GPU and CPU setups. We observe
that starting from a certain matrix size and with these setups, the multi-core
acceleration prevails over the GPU one. For comparison, we add a setup, where
we have a cluster similar to the 48-node multi-core cluster, but containing two
NVIDIA GeForce GTX 680 on each node. This setup is not speculative, the
GPU setup obtained with 8 GPUs scales perfectly to 96 GPUs, thanks to the
cost-free distribution of block Wiedemann algorithm.
Resolution of Linear Algebra for DLP Using Parallel Architectures 11
Matrix size
Setup
Blocking SpMV+com. Overall Ratio
(Memory) parameters delays [ms] comp. time com.
7.3M × 7.3M
8 GPUs (n = 2, m = 4)
420 + 195 65 days 32%
(9.8 GB)
on 4 nodes 4 GPUs ↔ 1 subtask
768 cores (n = 12,m = 24)
1700 + 400 39 days 19%
on 48 nodes 64 cores ↔ 1 subtask
96 GPUs (n = 24, m = 48)
420 + 195 5.5 days 32%
on 48 nodes 4 GPUs ↔ 1 subtask
With the CPU setup, an iteration is performed in 2.1 s by the 64 parallel
threads, including 0.4 ms for communications. The Krylov phase required 22
in the 768-core cluster, which is equivalent to 46-core years. The Lingen phase
required 15 hours running on 144 cores. The Mksol phase took 16 days in the
768-core cluster (i.e. 34-core years). The overall computation required around
80-core years.
5 Conclusion
In this article, we presented how the block solvers, in our case block Wiedemann
algorithm, distribute heavy computations without an additional overhead. We
discussed a further parallelization of the matrix-vector product and detailed
how we can efficiently run this computation in a cluster of GPUs or CPUs. In
the examples that we ran, we did not combine the two architectures on the
same computation. However, our final implementation can be run on a hybrid
GPU/Multi-core architecture.
Acknowledgments The author is grateful to Jérémie Detrey and Emmanuel
Thomé. This work would not be possible without their support. We also thank
the reviewers for their valuable comments.
