Abstract. CUMODP is a CUDA library for exact computations with dense polynomials over finite fields. A variety of operations like multiplication, division, computation of subresultants, multi-point evaluation, interpolation and many others are provided. These routines are primarily designed to offer GPU support to polynomial system solvers and a bivariate system solver is part of the library. Algorithms combine FFT-based and plain arithmetic, while the implementation strategy emphasizes reducing parallelism overheads and optimizing hardware usage.
With the advent of hardware accelerator technologies, multi-core processors and Graphics Processing Units (GPUs), this reduction to multiplication is, of course, still desirable, but becomes more complex since both algebraic complexity and parallelism need to be considered when selecting and implementing a multiplication algorithm. In fact, other performance factors, such as cache usage or CPU pipeline optimization, should be taken into account on modern computers, even on single-core processors. These observations guide the developers of projects like SPIRAL 8 [16] or FFTW 9 [3] .
The CUMODP library provides arithmetic operations for dense matrices and dense polynomials primarily with modular integer coefficients, targeting manycore GPUs. Some operations are available for integer or floating point coefficients as well. A large portion of the CUMODP library code is devoted to polynomial multiplication and the integration of that operation into higher-level algorithms.
Typical CUMODP operations are matrix determinant computation, polynomial multiplication (both plain and FFT-based), univariate polynomial division, the Euclidean algorithm for univariate polynomial GCDs, subproduct tree techniques for multi-point evaluation and interpolation, subresultant chain computation for multivariate polynomials, bivariate system solving. The CUMODP library is written in CUDA [15] and its source code is publicly available at www.cumodp.org.
In this note, we give an overview of the implementation techniques of the CUMODP library. In Section 2, we discuss a model of multithreaded computation, combining fork-join and single-instruction-multiple-data parallelisms, with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. For each key routine of the CUMODP library this model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter, e.g. number of threads per block. Experimentation confirms the effectiveness of this model. Secondly, the design of the CUMODP library emphasizes the importance of adaptive algorithms in the context of many-core GPUs, see Section 3. that is, algorithms which adapt their behavior according to the available computing resources. Based on these techniques, we have obtained the first GPU implementation of subproduct tree techniques for multi-point evaluation and interpolation of univariate polynomials. Hence we compare our code against probably the best serial C code, namely the FLINT library, for the same operations. For sufficiently large input data and on NVIDIA Tesla C2050, our code outperforms its serial counterpart by a factor ranging between 20 to 30.
We conclude in Section 4 by presenting an application of the CUMODP library to bivariate system solving.
2 A many-core machine model for designing algorithms with minimum parallelism overheads
Our model of multithreaded computation [9] extends the following previous works for which we summarize key features and limitations. The PRAM (parallel random access machine) model [5] supports data parallelism but not task parallelism. Moreover, this model cannot support memory traffic issues like cache complexity and memory contention. The Queue Read Queue Write PRAM [6] considers memory contention, however, it unifies in a single quantity time spent in arithmetic operations and time spent in read/write accesses. We believe that this unification is not appropriate for recent many-core processors, such as GPUs, for which the ratio between one global memory read/write access and one floating point operation can be in the 100's. The TMM (Threaded Many-core Memory) model [12] retains many important characteristics of GPU-type architectures, however, the running time estimate on P cores is not given by a Graham-Brent theorem [7] . We believe that, for the purpose of code optimization, this latter theorem is an essential tool. Our proposed many-core machine model (MMM) aims at optimizing algorithms targeting implementation on GPUs. Our abstract machine possesses an unbounded number of streaming multiprocessors (SMs). However, each SM has a finite number of processing cores and a fixed-size local memory. An MMM machine has a two-level memory hierarchy, comprising an unbounded global memory with high latency and low throughput while SMs local memories have low latency and high throughput. Similarly to a CUDA program, an MMM program specifies for each kernel the number of thread-blocks and the number of threads per thread-block. An MMM machine has two parameters:
U : time (expressed in clock cycles) to transfer one machine word between the global memory and the local memory of any SM, Z: size (expressed in machine words) of the local memory of any SM.
An MMM program P is a directed acyclic graph (DAG), called the kernel DAG, whose vertices are kernels and edges indicate serial dependencies. Since each kernel of the program P decomposes into a finite number of thread-blocks, we map P to a second graph, called the thread-block DAG of P, whose vertex set consists of all thread-blocks of P. We consider three complexity measures:
-the work W (P), which is the total number of local operations (arithmetic operation, read/write requests in the local memory) performed by all threads, -the span S(P), which is the longest path, counting the weight (span) of each vertex (kernel), in the kernel DAG, -the parallelism overhead O(P), which is the total data transfer time (between global and local memories) of all its kernels.
Using these complexity measures, we derive a Graham-Brent theorem with parallelism overhead.
Theorem 1. Let K be the maximum number of thread blocks along an antichain of the thread-block DAG of P. Then the running time T P of the program P satisfies:
where N (P), L(P) and C(P) are respectively: the number of vertices in the thread-block DAG, the critical path length (where length of a path is the number of edges in that path) in the thread-block DAG and the maximum running time of local operations by a thread among all the thread-blocks.
We
. If this holds, we view P(s min ) as a solution of our problem of algorithm optimization (in terms of parallelism overheads). For each operation, the program parameter s controls the amount of work and parallelism overheads of a thread-block. Figures 1 and 2 illustrate the role of this parameter in our implementation of plain division. See [9] for details. Applying the optimization strategy described above lead us to determine an optimum value of s among those implied by constraints like the size of the local memory Z or the data transfer time U . For plain polynomial multiplication, this analysis suggested to minimize s which was verified experimentally, as illustrated by Figure 3 . For the Euclidean algorithm, our analysis suggested to maximize the program parameter s, which was again verified experimentally, as illustrated by Figure 4 . Our experimental results were obtained on a GPU card NVIDIA Tesla C2050. Figures 5 and 6 show that the optimized CUMODP implementation of the plain division and the Euclidean algorithm outperforms the NTL implementation of the FFT-based plain division and polynomial GCD computation. Of course, CUMODP code is multithreaded while NTL code is serial. On the other hand, NTL uses asymptotically fast algorithms. The key observation is that optimized implementation of multithreaded plain algorithms provide useful alternative to any serial code. In fact, as we will see in the next section, multithreaded plain algorithms play an essential in higher-level applications targeting many-core GPUs.
Adaptive algorithms
Up to our knowledge, the CUMODP library offers the first GPU implementation of subproduct tree techniques [4] [Chapter 10] for multi-point evaluation and interpolation of univariate polynomials. The parallelization of those techniques raises the following challenges on hardware accelerators.
1. The divide-and-conquer formulation of operations on subproduct-trees is not sufficient to provide enough parallelism and one must also parallelize the underlying polynomial arithmetic operations, in particular polynomial multiplication. 2. During the course of the execution of a subproduct tree operation (construction, evaluation, interpolation), the degrees of the involved polynomials vary greatly; thus, so does the work load of the tasks, which makes those algorithms complex to implement on many-core GPUs.
To address the first challenge on many-core GPUs, we combine parallel plain arithmetic and parallel fast arithmetic. For the former we rely on [8] and, for the latter we extend the work of [13] . Indeed, parallel fast arithmetic alone would not suffice to provide good speedup factors since subproduct tree operations require lots of calculations with low-degree polynomials.
To address the second challenge, we employ adaptive algorithms. That is, algorithms that adapt their behavior according to the available computing resources. For instance, each plain multiplication is performed by a single streaming multiprocessor (SM), since plain arithmetic is used for input polynomials of small sizes. Meanwhile, each FFT-based multiplication is computed by a kernel call, thus using several SMs. In fact, this kernel computes a number of FFT-based products concurrently. To evaluate our implementation of subproduct tree techniques, we measured the effective memory bandwidth of our GPU code for parallel multi-point evaluation and interpolation on a card with a theoretical maximum memory bandwidth of 148 GB/S, our code reaches peaks at 64 GB/S. Since the arithmetic intensity of our algorithms is high, we believe that this is a promising result.
All implementation of subproduct tree techniques that we are aware of are pure serial code. This includes [2] for GF (2)[x], the FLINT library [10] and the Modpn library [11] . Hence we compare our code against probably the best serial C code (namely the FLINT library). For sufficiently large input data, running on NVIDIA Tesla C2050, our code outperforms its serial counterpart by a factor ranging between 20 to 30. Experimental data can be found in Table 1 .
Application
In [14] , two of the co-authors of this note reported on the implementation of a bivariate polynomial system solver (based on the theory of regular chains and working with coefficients in small prime fields) partially written in CUDA and partially written in C. In that implementation, polynomial subresultant chains were calculated in CUDA while univariate polynomial GCDs were computed in C either by means of the plain Euclidean algorithm or an asymptotically fast algorithm).
The authors observed that about 90% of the overall running time of their solver was spent in univariate GCD computations. They also noted that most of these GCD calculations were using the plain algorithm since the degrees of the input polynomials were not large enough for using the FFT-based algorithm. These observations have lead to a CUDA implementation of the plain Euclidean algorithm which is reported in [8] . More recently, the same authors have put together in a single CUDA application the work reported in [14] and [8] , leading to a bivariate polynomial system solver which is mostly written in CUDA. Table 2 compares this latter with an implementation of our bivariate system solver (presented in [14] ) entirely written in C. Some of the input systems are random dense and the others are sparse. The number attached to each system name is the total degree of each input polynomial. For each system, the total number of solutions is essentially the square of that degree.
One can see that for a complex application like a polynomial system solver, a CUDA implementation can provide substantial benefit w.r.t. a pure C implementation. We should also point out that our CUDA implementation can be further improved. In particular, the top-level algorithm is still implemented in C and lots of data transfers are still taking place between the host (CPU) and the device (GPU). This performance bottleneck can be removed by using the latest programming model of CUDA.
