Abstract-In the context of cryptanalysis, computing discrete logarithms in large cyclic groups using index-calculus-based methods, such as the number field sieve or the function field sieve, requires solving large sparse systems of linear equations modulo the group order. Most of the fast algorithms used to solve such systems -e.g., the conjugate gradient or the Lanczos and Wiedemann algorithms -iterate a product of the corresponding sparse matrix with a vector (SpMV). This central operation can be accelerated on GPUs using specific computing models and addressing patterns, which increase the arithmetic intensity while reducing irregular memory accesses. In this work, we investigate the implementation of SpMV kernels on NVIDIA GPUs, for several representations of the sparse matrix in memory. We explore the use of Residue Number System (RNS) arithmetic to accelerate modular operations. We target linear systems arising when attacking the discrete logarithm problem on groups of size 160 to 320 bits, which are relevant for current cryptanalytic computations. The proposed SpMV implementation contributed to solving the discrete logarithm problem in GF(2 619 ) and GF(2 809 ) using the FFS algorithm.
I. INTRODUCTION
The security of many cryptographic protocols used for authentication, key exchange, encryption, or signature, depends on the difficulty of solving the discrete logarithm problem (DLP) in a given cyclic group [16] . For instance, we can rely on the hardness of the DLP in a multiplicative subgroup of a finite field. There are algorithms, such as Pollard-rho [17] or Baby-Step/Giant-Step [21] that solve the problem in time exponential in the subgroup size. Another family of methods, known as Index-calculus methods [4] propose to solve it in time sub-exponential in the finite field size. These algorithms require in their linear algebra step the resolution of large sparse systems of linear equations modulo the group order [14] . In cryptographic applications, the group order ℓ is of size 160 to 320 bits. The number of rows and columns of the corresponding matrices is in the order of hundreds of thousands to millions, with only hundreds or fewer non-zero elements per row. This linear algebra step is a serious limiting factor in such algorithms. For example, it was reported in [12] that the linear algebra step of the Function Field Sieve (FFS) implementation to solve the DLP over GF (3 6×97 ) took 80.1 days on 252 CPU cores, which represents 54% of the total time.
To solve such systems, ordinary Gaussian elimination is inefficient. While some elimination strategies aiming at keeping the matrix as sparse as possible can be used to reduce the input system somewhat, actual solving calls for the use of other techniques (Lanczos [15] , Wiedemann [27] ) that take advantage of the sparsity of the matrix [18] . Either for Lanczos, Wiedemann or their block variants, the iterative sparsematrix-vector product is the most time-consuming operation. For this reason, we investigate accelerating this operation on GPUs.
The paper is organized as follows. Section II presents the background related to the hardware and the context. Section III discusses the arithmetic aspects of our implementation. We present several matrix formats and their corresponding implementations in Section IV. We compare the results of different implementations run on NVIDIA GPUs in Section V, and present optimizations based on hardware considerations in Section VI. Section VII discusses our reference software implementation. Section VIII describes how our implementation have contributed on solving the linear algebra step of the FFS algorithm in GF(2 619 ) and GF(2 809 ). Finally, Section IX concludes the paper.
II. BACKGROUND

A. GPUs and the CUDA programming model
CUDA is the hardware and software architecture that enables NVIDIA GPUs to execute programs written in C, C++, OpenCL and other languages [2] .
A CUDA program instantiates a host code running on the CPU and a kernel code running on the GPU. The kernel code runs according to the Single Program Multiple Threads (SPMT) execution model across a set of parallel threads. The threads are executed in groups of 32, called warps. As a warp only has a single instruction fetch/decode unit, each instruction in the execution path is issued to all the threads in the warp. However, if one or more threads have a different execution path, execution divergence occurs. The different paths will then be serialized, negatively impacting the performance.
The threads are further organized into thread blocks and grids of thread blocks:
• A thread executes an instance of the kernel. It has a unique thread ID within its thread block, along with registers and private memory.
• A thread block is a set of threads executing the same kernel which can share data through shared memory and perform barrier synchronization which ensures that all threads within that block reach the same instruction before continuing. It has a unique block ID within its grid.
• A grid is an array of thread blocks executing the same kernel. All the threads of the grid can also read inputs, and write results to global memory. At the hardware level, the blocks are distributed on an array of multi-core Streaming Multiprocessors (SMs). Each SM schedules and launches the threads in groups of warps. Recent GPUs (such as the NVIDIA Fermi architecture) allow for up to 48 active warps per SM. The ratio of active warps to the maximum supported is called occupancy. Maximizing the occupancy is important, as it helps hiding the memory latency. One should therefore pay attention to the usage of shared memory and registers in order to maximize occupancy.
Another important performance consideration in programming for the CUDA architecture is coalescing global memory accesses. To understand this requirement, global memory should be viewed in terms of aligned segments of 32 words of 32 bits each. Memory requests are serviced for one warp at a time. If the warp requests hit exactly one segment, the access is fully coalesced and there will be only one memory transaction performed. If the warp accesses scattered locations, the accesses are uncoalesced and there will be as many transactions as the number of hit segments. Consequently, a kernel should use a coalescing-friendly pattern for greater memory efficiency.
Despite their high arithmetic intensity and their large memory bandwidth, GPUs provide small caches. In fact, Fermi GPUs provide the following levels of cache for the on-board DRAM (local and global memory):
• 768-kB L2-cache per GPU.
• 16-kB L1-cache (per SM). It can be extended to 48 kB, but this decreases shared memory from 48 kB to 16 kB. • A texture cache: an on-chip cache for the read-only texture memory. It can accelerate memory accesses when neighboring threads read from nearby addresses.
B. Sparse-Matrix-Vector product on GPUs
Sparse-matrix computations pose some difficulties on GPUs, such as irregular memory accesses, load balancing and low cache efficiency. Several papers have focused on choosing suitable matrix formats and appropriate kernels to overcome the irregularity of the sparse matrix [6] , [7] , [26] . These works have explored implementing efficiently SpMV over real numbers. Schmidt et al. [19] proposed an optimized matrix format to accelerate exact SpMV over GF(2), that can be used in the linear algebra step of the Number Field Sieve (NFS) for integer factorization [23] . Boyer et al. [10] have adapted SpMV kernels over small finite fields and rings Z/mZ, where they used double-precision floating-point numbers to represent ring elements. In our context, since the order of the considered finite ring is large (hundreds of bits), specific computing models and addressing models should be used.
In this work, we have a prime ℓ (typically between 160 and 320 bits), along with an N -by-N sparse matrix A defined over Z/ℓZ, and we want to solve the linear system Aw = Fig. 1 . Approximative distribution of non-zero elements in an FFS matrix 0 over Z/ℓZ. A feature of the index calculus context that we consider here, is that A contains small values (e.g. 32-bit integers). In fact, around 90% of the non-zero coefficients are ±1. The very first columns of A are relatively dense, then the column density decreases gradually. The row density does not change significantly. We denote by n NZ the number of nonzero elements in A. See Figure 1 for an example matrix taken from an FFS computation.
We will use the Wiedemann algorithm as a solver. This algorithm iterates a very large number of matrix-vector products of the form v ′ ← Av, where v and v ′ are dense Ncoordinate vectors. The major part of this work deals with how to accelerate this product.
In order to carry out this product, we compute the dot product between each row of A and the vector v. The basic operation is of the form x ← (x + λy) mod ℓ, where λ is a non-zero coefficient of A, and x and y are coordinates of the vectors v ′ and v, respectively. To minimize the number of costly reductions modulo ℓ, we accumulate computations, and postpone the final modular reduction of the result as late as possible. When iterating many products (computations of the form A i v), we can further accumulate several SpMVs before reducing modulo ℓ, as long as the intermediate results do not exceed the largest representable integer. As far as arithmetic over Z/ℓZ is concerned, we chose to use the Residue Number System, which appears to be more suited to the fine grained parallelism inherent to the SPMT computing model than the usual multi-precision representation of large integers. A comparison of the two representations will be proposed in Subsection V-E.
III. RESIDUE NUMBER SYSTEM AND MODULAR ARITHMETIC
A. A brief remainder on RNS
The Residue Number System (RNS) is based on the Chinese Remainder Theorem (CRT). Let B = (p 1 , p 2 , . . . , p n ) be a set of coprime integers, which we call an RNS-basis. We define P as the product of all the p i 's. The RNS uses the fact that any integer x within [0, P − 1] can be uniquely represented by the list (x 1 , x 2 , . . . , x n ), where each x i is the residue of x modulo p i , which we write as
This number system is particularly interesting for arithmetic over large integers, since it distributes the computation over several small residues. In other words, the computation units that will work on the residues are independent and need no synchronization nor communication, as there is no carry propagation [24] , [25] .
If x and y are given in their RNS representations x RN S = (x 1 , . . . , x n ) and y RN S = (y 1 , . . . , y n ), according to B, and such that x, y < P , RNS addition and multiplication are realized by modular addition and multiplication on each component:
The result (e.g., x + y) should belong to the interval [0, P − 1] if we want to obtain a valid RNS representation. Otherwise, it will be reduced modulo P .
We can convert back an RNS vector to the integer form by using the CRT formula:
, where P i P/p i .
B. RNS reduction modulo ℓ
In the chosen RNS representation, (P − 1) is the largest representable integer. So in the case of repeated SpMVs over Z/ℓZ, we can accumulate at most log( P ℓ−1 )/ log(r) matrixvector products before having to reduce modulo ℓ, where r corresponds to the largest row norm (defined as the sum of the absolute values of its elements) in the matrix. To reduce the vector v ′ modulo ℓ, we use the method introduced by Bernstein in [8] , which allows us to perform the reduction without having to convert the vector back to the integer form.
We assume that the RNS-basis B contains n moduli p 1 , . . . , p n of k bits each. We impose that the p i 's are close to 2 k . The reasons will be detailed in the following subsection. We want to reduce modulo ℓ an RNS vector (x 1 , . . . , x n ).
We start from the CRT reconstruction:
where we have defined γ i
. Let us also define the integer α as follows
The vector x can then be written as
γ i P i − αP and, since γ i < p i , we have that 0 ≤ α < n. Now, if we assume that α is known, we define
We can easily check that z is congruent to x modulo ℓ and lies in the interval [0, ℓ
What remains to be done is to determine α. Since p i ≈ 2 k , we approximate the quotient γ i /p i using only the s most significant bits of γ i /2 k . Hence, an estimate for α is proposed asα
where s is an integer parameter in [1, k] and ∆ an error correcting term in ]0, 1[. Bernstein states in [8] 
Once α is determined, we are able to perform an RNS computation of z. Algorithm 1 summarizes the steps of the computation.
Algorithm 1: RNS modular reduction
Inputs
Broadcast of the γ j 's by all the threads 4 foreach Thread j do
All the operations can be evaluated in parallel on the residues, except for step 3, where a broadcast of all the γ j 's is needed.
Even if the obtained result z is not the exact reduction of x, it is bounded by n2 k ℓ. Thus, we guarantee that the intermediate results of the SpMV computation do not exceed a certain bound less than P . Notice that this RNS reduction algorithm imposes that P be one modulus (k bits) larger than implied by the earlier condition ℓ < P .
In conclusion, P is chosen, such that r × n2 k ℓ < (1 − ∆)P , with r the largest row norm of the matrix.
C. RNS using primitive data types
We represent the finite ring Z/ℓZ as the integer interval [0, ℓ − 1]. Each element is stored in its RNS form. Each RNS residue is represented using a primitive data type: either integers in basis 2 32 or 2 64 , floats in simple or double precision.
The basic RNS operation is z j ← (x j + λ · y j ) mod p j , where 0 ≤ x j , y j , z j < p j are RNS residues and λ a positive element of the matrix. So, it consists of an AddMul followed by a reduction modulo p j . To speed up the reduction modulo p j , the moduli are of the pseudo-Mersenne form 2 k − c j , with c j as small as possible.
In fact, let us define t j x j + λ · y j as the intermediate result before the modular reduction. t j can be written as
where
. So, we compute t j ← t jL + t jH · c j , then we have to consider two cases:
• else we have reduced t j by approximately k bits. Thus, we repeat the previous procedure with t j ← t jL + c j ·t jH The output lies in [0, 2 k − 1], so we propose to relax the condition on the input and output:
. With this approach, the reduction can be done in a small number of additions and products [22] .
We now discuss the various native data types for representing residues.
1) RNS mapped on integers:
k is fixed as a multiple of machine word size, such that 32 or 64 bits, so that the computation of the high and low parts and the comparison with 2 k are cheap. For λ = 1 (cf. Algorithm 2), the algorithm is even simplified, since the high part is one bit. When the high part is non-zero (overflow detection), we need to subtract p j , which is equivalent to adding a c j , since we compute modulo 2 k .
Algorithm 2: RNS Add (k-bit integers)
For λ > 1 (cf. Algorithm 3), the algorithm is valid as long as |z jL × c j | < 2 k . This ensures correctness for |λ| up to 2 k−k0 , if we bound c j by 2 k0 . For the problem sizes considered, we choose k = 64, and taking c j < 2 8 is sufficient, since it allows to consider sufficiently many p j 's to work with Z/ℓZ up to ℓ ≈ 2 1000 .
2) RNS mapped on floating-point numbers:
When using floats to hold an RNS residue, only the mantissa bits are used, so the number of needed RNS moduli increases. We choose k one bit less than the number of mantissa bits, so that the standard floating point addition gives the exact summation of two residues.
To perform the RNS AddMul using floats (cf. Algorithm 5), we start by computing the upper and lower 53-bit parts of Algorithm 3: RNS AddMul (k-bit integers)
Algorithm 4: RNS Add (double-precision floats) 
Algorithm 5: RNS AddMul (double-precision floats)
Inputs : p j < 2 52 and c j < 2 8 , such that p j = 2 52 − c j 0 ≤ x j < 2 52 , 0 ≤ y j < p j and 0 ≤ λ < 2
D. Possible RNS Mappings on GPU/CPU
On GPU, we opted for 64-bit moduli, for performance considerations. Each RNS residue is mapped on a vector type uint2, that holds two 32-bit words. We use the PTX (parallel thread execution) pseudo-assembly language for CUDA [3] to implement the RNS operations.
On CPU, we implemented three versions based on:
• MMX instruction set: we map an RNS residue on an unsigned 64-bit integer.
• Streaming SIMD Extensions (SSE) set: a 128-bit XMM register holds two residues, so the processor can process two residues simultaneously. The moduli are 63 bits long, since only signed values are supported.
• Advanced Vector Extensions (AVX) set: we use a 256-bit YMM register to hold four residues. As the intrinsics for packed arithmetic are available for floating-point formats only, we use the 64-bit floating point components to hold four 52-bit residues.
IV. SPARSE MATRIX STORAGE FORMATS
In this part, we will discuss how to store the non-zero coefficients of A and the strategy for processing them. In the following pseudo-code, the vectors v and w are noted src and dst, respectively.
Processing a non-zero coefficient λ at row i and column j of the matrix A means reading the n RNS residues that compose the j th element in src, multiply them by λ and adding them to the n RNS residues that compose the i th element in dst.
A. Compressed Sparse Row (CSR)
The CSR format stores the column indices and the values of the non-zero elements of A into two arrays of n NZ elements: id and data. A third array of pointers, ptr, of length N +1, is used to indicate the beginning and the end of each row. Non-zero coefficients are sorted by their row index. The CSR format eliminates the explicit storage of the row index, and is convenient for a direct access to the matrix, since ptr indicates where each row starts and ends in the other two ordered arrays.
1) Scalar approach:
To parallelize the product for the CSR format, a simple way is to assign one thread for each row (scalar approach) (cf. Listing 1). Temporary results are stored in registers and the final result is written to global memory. For simplicity, the RNS AddMul is denoted in the running sum by the operators "*", "+" and "%". BLOCK_SIZE, blockIdx.x and threadIdx.x refer to the number of threads in the block, the block index and the thread index within the block, respectively. The major drawback of the scalar approach is that making each thread iterate over all the RNS residues increases the number of registers per thread. This approach also suffers from a poor global memory load efficiency, because the threads within a same warp access the vectors id and data in a non-contiguous fashion.
2) Vector approach:
The vector approach consists in assigning a warp to each row of the matrix. The threads within a warp access neighboring non-zeros elements, which makes the warp accesses to id and data contiguous. Each thread computes its partial result in shared memory, then a parallel reduction in shared memory is required to combine the perthread results. No synchronization is needed, since threads belonging to a same warp are implicitly synchronized.
However, in the context of RNS arithmetic, the vector scheme still suffers from a low load/write efficiency when accessing src and dst, because threads within the same warp simultaneously access residues of different source vector elements, which are not contiguous.
3) Residue-vector approach:
To overcome the limitations of the previous approaches, we propose to organize the threads within a warp into n GPS groups of n threads, where n GPS × n (that we denote by n CUS ) is closest to 32, the number of threads per warp. Each group is associated to a non-zero matrix coefficient (cf. Listing 2). For instance for n = 5, we take n GPS = 6, so the first 5 threads process in parallel the residues of the 1 st source vector element, threads 5 to 9, process the 2 nd source vector element, and so on, and we will have two idle threads per warp. As for the vector approach, a reduction is needed to combine the results of threads working on the same residue and belonging to different groups. In this scheme, which we call residue-vector, the number of registers is reduced by eliminating the per-thread loop to process all the residues of a vector element, and contiguous accesses to src and dst are performed.
The three mentioned approaches scalar, vector and residuevector are not specific to the CSR format. For the following formats, all three approaches can be used, but we will mainly describe the residue-vector implementations.
B. Coordinate (COO)
The format COO consists of three arrays row_id, col_id and data of n NZ elements. The row/column indices and the value are explicitly stored to specify a non-zero matrix coefficient. In this work, we propose to sort the matrix coefficients by their row index.
A typical way to work with the COO format on GPU is to assign one thread to each non-zero matrix coefficient. This implies that different threads from different warps will process a same row. To combine their results, one possibility is to do atomic updates on global memory to the result vector dst, which significantly decreases the performance. Another possibility is that each thread computes its partial result, then performs a segmented reduction [9] , [20] to sum the partial results of the other threads belonging to the same warp and spanning the same row. We followed the scheme proposed by the library CUSP 1 , which performs the segmented reduction in shared memory, using the row indices as segment descriptors.
As for the residue-vector CSR kernel, we assign a group of threads to each source vector element. Each warp iterates over its interval, processing n GPS elements at a time. If a spanned row is fully processed, its result is written to dst, otherwise, the row index and the partial dot product are stored in temporary arrays. Then, a second kernel performs the combination of the per-warp results.
C. Sliced Coordinate (SLCOO)
The SLCOO format was inspired from the CADO-NFS [1] software for CPUs. It was introduced for GPUs by Schmidt et al. for integer factorization, in the particular case of matrices over GF(2) [19] . The aim of this format is to increase the cache hit rate that limits the CSR and COO performance. Like COO, the SLCOO representation stores the row indices, column indices and values. However, it divides the matrix into horizontal slices, where the non-zero coefficients of a slice are sorted according to their column index in order to reduce the irregular accesses on source vector src, if they had been sorted by their row indices. A fourth array ptrSlice indicates the beginning and end of each slice.
For the SLCOO kernel (cf. Listing 3), each warp works on a slice. A group of n threads processes in parallel a non-zero matrix coefficient, each thread being assigned to a residue. Since each thread works on more than one row, it needs to have individual storage for its partial per-row results, or to be able to have exclusive access to a common resource. In [19] , where a thread block had been assigned to each slice, three possibilities have been mentioned to solve this issue:
• Small SLCOO: each thread has one exclusive entry in shared memory to store the partial result for each row.
• Medium SLCOO: threads having the same lane (index within the warp) share one entry per row in shared memory and access it by an atomic XOR operation.
1 http://code.google.com/p/cusp-library/
• Large SLCOO: all threads share one entry per row. The Medium SLCOO allows one to put (BLOCK SIZE/32) times as many rows on one slice than the Small SLCOO. Similarly, the Large SLCOO allows one to fill 32 times as many rows than the Medium SLCOO. This way, increasing the number of rows per slice (from Small to Large variants) increases the texture cache hit rate, which compensates the drawbacks of the atomic accesses.
However, because of the reduction, it is not possible to perform an atomic addition in RNS. For this reason, we only implement the Small SLCOO. Furthermore, this kernel suffers from the excessive usage of shared memory, since it stores elements of Z/ℓZ in our case. 
D. ELLPACK (ELL)
The ELL format extends the CSR arrays to N -by-K arrays, where K corresponds to the maximum number of non-zero coefficients per row. The rows that have less than K non-zero coefficients are padded. As in the CSR, elements are sorted by their row indices. Since the padded rows have the same length, only column indices are explicitly stored.
This format suffers from the overhead due to the padding, which is why it has generally been combined with the COO format. By taking K smaller than the maximum number of non-zero coefficients, we can tune its value so as to minimize the adverse effects of padding, while the remaining elements are stored in COO format. There are in the literature other SpMV formats such as DIA (Diagonal) format, that are appropriate only for matrices that satisfy some sparsity patterns, which is not our case.
V. COMPARATIVE ANALYSIS OF SPMV KERNELS
The experiments presented in this section were run on an NVIDIA GeForce GTX 580 graphics processor. Each SpMV kernel was executed 100 times. We report the computational throughput in terms of GFLOP/s, which we determine by dividing the number of required operations (twice the number of non-zero elements in A multiplied by 2 × n) by the running time. Our measurements do not include the time spent to copy data between the host and the GPU, since the matrix and vectors do not need to be transferred back and forth between each SpMV iteration. These delays are thus dwarfed by the computation latencies. The reduction modulo ℓ happens only once every few iterations, which is why the timing of an iteration includes the timing of the reduction modulo ℓ kernel multiplied by the frequency of its invocation. The reported measurements are based on NVIDIA developer tools. Table I summarizes the considered matrices over Z/ℓZ, where ℓ is a prime. The matrices were obtained during the resolution of discrete logarithm problems in the 217-bit and the 202-bit prime order subgroups of GF(2 619 ) and GF(2 809 ), respectively, using the FFS algorithm. For the two matrices, the Z/ℓZ elements fit in four 64-bit residues. Since, there is an extra 64-bit residue needed for the modular reduction, the total number of residues is n = 5. In this comparative section, we report only results obtained with the FFS-619 matrix, run on the NVIDIA GeForce GTX 580 card. But the two matrices have close properties. In Section VIII, results related to the two matrices will be mentioned.
A. Comparison of CSR variants
Benchmarks for the three variants of CSR are presented in Table II . The scalar kernel uses a large number of registers, which limits the maximum number of warps that can be scheduled on an SM to 24 active warps (per SM), out of the 48 supported. This is reported on the theoretical occupancy column in Table II . On the other hand, the vector kernel suffers from shared memory usage (4.5kB for a 96-thread block), which also limits the maximal reachable occupancy to 50%. For these two kernels, the low occupancy significantly decreases the performance. Compared to the vector kernel, the scalar one reaches higher throughput, since it allows an L1-oriented configuration (48kB L1, 16kB shared) of the onchip memory, while for the vector kernel, a shared-oriented configuration (16kB L1, 48kB shared) is required.
Concerning the global memory access pattern, the column Global Load/Store Efficiency gives the ratio of requested memory transactions to the number of transactions performed, which reflects whether accesses are all perfectly coalesced (100% efficiency) or not. For scalar and vector kernels, uncoalesced accesses cause the bandwidth loss and the performance degradation.
The residue-vector (CSR-RV) kernel satisfies better the GPU architectural characteristics. It makes the write accesses coalesced (100% store efficiency). For the loads, we obtain only 45% efficiency, because a warp accesses several non-zero coefficients, which are not necessarily contiguous (sparsity of A).
B. COO/CSR-RV comparison
Due to the segmented reduction, the COO kernel performs more instructions and requires more registers than the CSR-RV kernel. Thread divergence happens more often, because of the several branches that threads belonging to the same warp can take. Memory access patterns for writing is less efficient than with the CSR-RV kernel (cf. Store Efficiency in Table IV) , as some results are written to dst, while the rest is written to the temporary vector. Consequently, the COO kernel reaches a lower throughput than the CSR-RV one.
C. SLCOO/CSR-RV comparison
Table III compares several SLCOO kernels for different slice sizes. We remark that increasing the slice size improves the cache hit rate, since accesses on the source vector are less irregular. However, by making the slices larger, we increase the usage of shared memory proportionally to the slice size, which limits the maximal number of blocks that can run concurrently. This limitation of the occupancy yields poor performance compared to the CSR-RV kernel.
D. ELL/CSR-RV comparison
As far as the ELL kernel is concerned, the padded rows have the same length. This yields a good balancing across the warps and the threads (cf. Occupancy and Thread Divergence in Table IV ), compared to the CSR-RV kernel. However, the maximum row length (K = 418) is too large, since there are only 100 non-zero coefficients per row on average. This high overhead makes the performance poor.
The combination of ELL and COO formats reduces significantly the overhead. The best kernel is obtained for K = 106. Even though the ELL/COO kernel reaches good performance, it does not allow to reach higher performance in terms of timing than the CSR-RV kernel.
Similar combinations of different formats have been tested, such as CSR-RV/COO. However they do not give better results. Splitting the matrix into column major blocks, or processing separately the first dense columns did not improve the performance as well. For our matrix, the main bottleneck is memory access. In the CSR-RV kernel, 72% of the time is spent in reading data, 2% in writing data and 26% in computations.
E. Comparison of RNS and Multi-precision arithmetics
We implemented the RNS and the multi-precision (MP) arithmetics.
For the MP representation, to perform the reduction modulo ℓ, we use a precomputed inverse of ℓ so as to divide by ℓ using a single multiplication. For the 217-bi prime order subgroup, choosing the largest representable integer M = 2 320 − 1 is sufficient to accumulate a few number of SpMVs before reducing modulo ℓ. The maximum row norm that we have (492) allows to do up to 4 iterative SpMVs before having to reduce.
For MP kernel, the reduction kernel takes 2 ms, which corresponds to 0.5 ms per iteration. In RNS, it takes 1.6 ms (i.e., 0.4 ms per iteration).
The idea behind the use of RNS rather than MP arithmetic is that RNS can significantly decrease data sharing between the threads and arithmetic operations required for the carry generation/propagation (cf. Table V) . The RNS kernel allows us to reach higher occupancy and better performance. The speed-up of RNS compared to multi-precision on the SpMV timing is around 15%.
VI. IMPROVEMENTS ON CSR-RV KERNEL
To improve the kernel performance, one should take into account the GPU architectural characteristics: the management of the memory accesses and the partitioning of the computations. The effects and the results corresponding to each improvement are reported in Table VI-D.
A. Texture caching
Although our SpMV kernel suffers from irregular load accesses, a thread is likely to read from an address near the addresses that nearby threads (of the same group) read. For this reason, we enable the texture caching by binding the source vector on texture memory and by replacing reads of the form src[j] with texture fetches tex1Dfetch(src_tex,j). This improves the global memory efficiency and consequently the throughput.
B. Reordering the non-zero coefficients of a row
When computing discrete logarithms, most of the coefficients of the matrix are ±1. It seems promising to treat multiplications by these coefficients differently from other coefficients: additions and subtractions are less expensive than multiplications. Moreover, we are not able to produce the same code for positive and negative values, so negative coefficients are processed differently as well. All these separations result in code divergence, that we fix by reordering the nonzeros in the matrix such that values of the same category (+1, −1, > 0, < 0) are contiguous. This decreases the branch divergence and increases the total throughput.
C. Compressing the values array data
Since the majority of the coefficients are ±1, after reordering the coefficients per row, we can replace the ±1 coefficients by their occurrence count. This reduces the length of the values array data by more than 10 times, and so reduces the number of reads. This improvement allows to reduce the usage of Device RAM, which can be a critical issue for larger matrices.
Another array, ptr_data, of length N + 1 is used to indicate where each row starts and ends in the array data. 
D. Improving warp balancing
In the proposed kernel (cf. Listing 2), each warp processes a single row. This requires launching a large number of warps. Consequently, there is a delay to schedule those launched warps. Instead, we propose that each warp iterates over a certain number of rows. Although this increases the usage of registers (from 19 per thread to 24), it improves the achieved occupancy.
To further increase the occupancy, we permute the rows such that each warp roughly gets the same work load.
VII. REFERENCE SOFTWARE IMPLEMENTATION
For comparison purposes, we implemented SpMV on the three software instruction set architectures MMX, SSE and AVX, based on the RNS representation for the arithmetic and the CSR-RV format for the storage of the matrix. We have not explored other formats that can be suitable for CPU. Probably blocked formats that better use the cache can further improve the performance on CPU.
The experiment was run on an Intel Xeon CPU E5-2650 (2.0GHz) using 8 threads on 8 cores (cf. Table VII ). The AVX implementation using floats permits to reach the highest throughput. However, the fact that we use floating point operations and the fact that the number of moduli is a multiple of four entail heavy overheads, and the SSE implementation using integers ends up being the fastest. Software performance is 4 to 5 times slower than on one graphics processor.
VIII. RESOLUTION OF LINEAR ALGEBRA OF THE FUNCTION FIELD SIEVE
The linear algebra step consists in solving the system Az = 0, where A is the matrix produced by the filtering step of the FFS algorithm (cf. Table I) . A is singular and square. Finding a vector of the kernel of the matrix is generally sufficient for the FFS algorithm.
The simple Wiedemann algorithm [27] which resolves such a system, is composed of three steps:
• "Krylov": It consists on the computation of a sequence of scalars a i = t xA i y, where 0 ≤ i ≤ 2N , and x and y are random vectors in (Z/ℓZ) N . We take x in the canonical basis, so that instead of performing a full dot product between t x and A i y, we just store the element of A i y that corresponds to the non-zero coordinate of x.
• "Lingen": Using the Berlekamp-Massey algorithm, this step computes a linear generator of the a i 's. The output F is a polynomial whose coefficients lie in Z/ℓZ, and whose degree is very close to N .
• "Mksol": The last step computes
A i F i y, where F i is the i th coefficient of F . Like "Krylov", this step iterates Sparse-Matrix-Vector products. The result is with high probability a non-zero vector of the kernel of A.
The Block Wiedemann algorithm [13] proposes to use m random vectors for x and n random vectors for y (do not confuse the blocking parameter n with the number of RNS residues n previously mentioned). The sequence of scalars is thus replaced by a sequence of m × n matrices and only (N/n+N/m) "Krylov" iterations and N/n "Mksol" iterations are needed. The n subsequences can be computed independently and in parallel. So, the block Wiedemann method allows one to distribute the computation without an additional overhead.
A. Linear Algebra of FFS for GF(2
619 )
The computation was completed using the simple Wiedemann algorithm on a single NVIDIA GeForce GTX 580 for the "Krylov" and "Mksol" steps. The "Lingen" step was computed using the CADO-NFS software [1] running on an Intel Core i5-2500 CPU. The overall computation needed 16 GPU hours and 1 CPU hour.
B. Linear Algebra of FFS for GF(2
809 ) [5] Two independent computations were carried out, and the choice of these two setups was driven by the hardware which was available to us at the time of the computation. • A simple Wiedemann algorithm was run on a single node equipped with two NVIDIA GeForce GTX 680 graphics processors. The computation time for this setup took 18 days: 12 days on both GPUs for the "Krylov" computation, 35 minutes for the "Lingen" step, and 6 days on both GPUs for the "Mksol" computation.
• Another option was to run a Block Wiedemann using a different computing facility. We used 4 distinct nodes, each equipped with two NVIDIA Tesla M2050 graphics processors, and ran the Block Wiedemann algorithm with blocking parameters m = 8 and n = 4. The "Krylov" computation required 2.6 days in parallel on the 4 nodes. The "Lingen" step took 2 hours in parallel using 16 jobs on a 4-node cluster with Intel Core i5-2500 CPUs (3.3GHz) connected with Infiniband QDR network. The "Mksol" computation required 1.8 days in parallel on the 4 GPU nodes.
IX. CONCLUSION
We have investigated different data structures to perform iterative SpMV for DLP matrices on GPUs. We have adapted the kernels for the context of large finite fields and added optimizations suitable to the sparsity and the specific computing model. The CSR format based on the residue-vector approach appears to be the most efficient one. The SLCOO poses for the sizes that we use some hardware difficulties that nullify its contribution on increasing the cache hit rate. Future GPUs may enhance the performance. We have shown that using RNS for finite field arithmetic provides a considerable degree of independence, which can be exploited by massively parallel hardware.
For the largest matrix we processed (FFS on GF(2 809 )), 25% of the Device RAM has been used. To compute discrete logarithms in larger fields such as GF(2 1039 ), we should deal with matrices of dimension 20 times larger. Other issues should be explored, such as the matrix and the vectors splitting, multi-GPU communication, the use of a CPU cluster, etc.
ACKNOWLEDGMENTS
The author is grateful to the members of the project-team CARAMEL for many inspiring conversations and helpful suggestions and for providing several DLP matrices.
