Abstract. We present a highly scalable algorithm for multiplying sparse multivariate polynomials represented in a distributed format. This algorithm targets not only the shared memory multicore computers, but also computers clusters or specialized hardware attached to a host computer, such as graphics processing units or many-core coprocessors. The scalability on the large number of cores is ensured by the lacks of synchronizations, locks and false-sharing during the main parallel step.
Introduction
Since the emergence of computers with multiple processors, and nowadays with several cores per processor, computer algebra systems have been trying to take advantage of such computational powers to reduce execution timings. As sparse multivariate polynomials are intensively present in many symbolic computation problems, the algorithms of the basic operations on these objects, such as multiplication, have been designed to use the available processors in workstations. These algorithms depend on the polynomial representation in main memory. The multivariate polynomials are usually stored in a distributed or recursive format. In the distributed format, a polynomial is a list of terms, each term being a tuple of a coefficient and an exponent. In the recursive form, a polynomial is considered as an univariate polynomial whose coefficients are polynomials in the remaining variables.
When the inputs are sparse multivariate polynomials, only the naive schoolbook product, that is the pairwise term products, is usually more optimal in practice than the asymptotically fast multiplication algorithms. Several parallel algorithms have been proposed for modern parallel hardware. An algorithm for the recursive representation has been designed using a work-stealing technique [1] . It scales at least up to 128 cores for large polynomials. Several algorithms [2] , [3] have been designed for the distributed format for a parallel processing. The algorithm due to Monagan [2] uses binary heaps to merge and sort produced terms but its scalability depends on the kind of the operands. Indeed, dense operands shows a super-linear scalability [2] , [1] . If the number of cores becomes large, no improvement is observed because each thread should process at least a fixed number of terms of the input polynomials. This behavior is due to the fact that Monagan's algorithm benefits from the shared cache inside the processor. For sparse operands, a sub-linear speedup on a limited number of cores and a regression above these number of cores is observed but their algorithm only focuses on single processor computers with multiple cores sharing a large cache. The working threads have a private heap to sort their owned results and share a global heap for computing the output polynomial. The access by a thread to the global heap requires a lock statement to avoid a race condition. This lock statement avoids to obtain a good scalability on a large number of cores. The algorithm designed by Biscani [3] and implemented in the Piranha algebraic manipulator [4] has no limitation according to the number of cores. The work is split in closed intervals based on the hashed value of the operands' terms and pushed in a list of available tasks. Since the result of the different tasks may overlap, the access to the two lists of available and busy tasks is controlled by a mutual-exclusion lock to avoid a race condition. This single mutual-exclusion lock becomes a bottleneck when the number of cores becomes very large. Biscani's algorithm assumes that the cost of the access to the global memory for the result is the same for all the cores and does not depend on the memory location while two threads executed on two different processors may write successively the result to the same location.
Several new hardware processing units have appeared in the last decade, such as the multi-core processors in desktop computers or laptops, GPU with hundred or thousands of elementary processing units or specialized accelerators. In these different hybrid architectures, the memory access times depend on the memory location because each processor accesses faster to its a own attached global memory. Cluster of nodes embedding all these different processors are available and may be used to perform the multiplication of sparse polynomials. We present a new algorithm for sparse distributed multivariate polynomial targeting these different architectures in section 2. Using a small specialization of one step inside this algorithm due the constraints of the different hardwares, we adapt it to a cluster of computers in section 3 and to specialized many-core hardware in section 4. Benchmarks for these computers are presented in section 6.
Algorithm on Shared Memory Computers
The designed algorithm should minimize the number of synchronizations or locks between threads in order to obtain a good scalability on many cores. Indeed, many synchronizations or locks are required only if different threads compute the terms of the result which have the same exponent. To avoid any lock or synchronization during the computations of resulting terms, a simple strategy is that each thread computes independent terms. Computation of independent terms is very easy if a recursive data structure for the polynomials is used, as shown in Wang [5] and Trip [6] but if a distributed form is used then this task is much more tricky. The proposed algorithm 1 requires two major steps. A preliminary step is required to split the work between threads to avoid any communication between the threads during the computational task.
Let c be the number of available cores and the same number of computational threads. Let us consider the polynomials in m variables x 1 , . . . , x m ,
where x corresponds to the variables x 1 , . . . , x m , the a i and b j are numerical coefficients, and the m-dimensional integer vectors α i and β j are the exponents. These polynomials are stored in a sparse distributed format and their terms are sorted with a monomial order ≺.
The product P of A and B is the sum of the terms P i,j = a i b j x γi,j where γ i,j = α i + β j for i = 1 . . . n a and j = 1 . . . n b . We can construct the n a × n b matrix of the sum of exponents, called pp-matrix following Horowitz' denomination [7] , to understand how the work is split between the threads. In fact, this matrix ( Fig. 1 ) is never stored in memory during the execution of the algorithm due to its size. As each thread must compute independent terms, each thread must process all the pairwise term products P i,j of the pp-matrix which have the same value for γ i,j . Since the possible values of all the γ i,j are in the interval [γ 1,1 , γ na,n b ], this interval may be split into subintervals which are processed by the different threads. If this interval is split in equal subintervals, the load-balancing between the threads will be very poor. Several values γ i,j of the pp-matrix are selected and used to split into subintervals in order to obtain a better load-balancing. Since a simple and fast method to select these values does not guarantee that duplicated values are not selected, duplicated selected values are removed and the remaining values γ i,j are the bounds of the subintervals. Left-closed, rightopen subintervals are required to avoid that all the γ i,j , which have the value of one of the bounds of the interval, remain in a single subinterval. The exponent γ end is introduced in order to have a same interval type (left-closed, right-open) for the last subinterval which must contain the value γ na,n b and γ end is any exponent greater than γ na,n b according to the monomial order.
The algorithm begins with the construction of the set S which consists of the selection of n s values (or exponents) inside Γ . The selection method must always select the exponents γ 1,1 and γ end in order to be sure that all pairwise term products will be processed in the next step of the algorithm. n s needs to be provided as an input of the algorithm and its value must remain very small in front of the size of Γ since the first step of the algorithm needs to be fast. The way to select the n s exponents will be discussed in section 5. This set S is then sorted according to the monomial order and its duplicate values are removed in order to obtain the new subset S. The number of elements of the set S is noted n s . This first step is very fast and could be computed by a single thread. Of course, this step may be parallelized using a parallel sorting algorithm. If we define the set Γ = {γ i,j | 1 ≤ i ≤ n a and 1 ≤ j ≤ n b } {γ end }, then the first step produces the following set
After this preliminary step, every γ i,j could be located inside a single interval [S k , S k+1 [ and all the γ i,j with the same value are located in the same interval. The threads may process the n s −1 intervals of exponents at the same time since they compute independent terms of the result. Indeed, if a thread processes the interval [S k , S k+1 [, it computes the summation of the selected terms P i,j such that S k ≤ γ i,j < S k+1 . So the second step consists in computing the resulting terms using a parallel loop over all the n s − 1 intervals. As each interval may have different execution times, due to a variable amount of P i,j involved, the work should be balanced between the cores using a number of intervals greater than the number of cores. The load-balancing may be done using a dynamic scheduling, such as work-stealing [8] , which does not require any bottleneck synchronization.
During this second step, each thread needs to check if the entry γ i,j of the pp-matrix is included inside its own current interval in order to process it or not. If the thread checks each entry, each thread will perform n a n b comparisons which are very inefficient. As the pp-matrix has an ordered structure, as γ i,j < γ i+1,j and γ i,j < γ i,j+1 , this property may be exploited to find efficiently the necessary entries of the intervals. For each line of this matrix, only the location of the first and last element, which corresponds to the first and last exponent processed by the thread, should be determined. So each thread needs to find the edge of the area of terms that it should process for the current interval. This edge consists almost of two lines, which correspond to the first exponents and last exponents on each line, as shown in the figure 2(b). This work is done by the function FindEdge. This algorithm consists in storing the location of the first, respectively last, column j where S k ≤ γ i,j , respectively γ i,j < S k+1 , in two arrays L min and L max of size n a . Its complexity is O(n a + n b ) because, when the thread processes the line i + 1 of the matrix, it does not start at the column 1 but at the found column in the previous line i, as γ i,j < γ i+1,j . n s needs to be kept small because each thread will have to process (n a + n b )(n s − 1)/c exponents to compute these arrays if the work is well balanced. As each thread Algorithm 1: mul(A, B, n s ). Return A × B using at most n s intervals.
Input: ns : integer number of intervals Input: monomial order ≺ Output: C = nc k=1 c k x γ k // First step 1 S ←Compute ns exponents γi,j = αi + βj using an almost regular grid over the pp-matrix associated to A and B 2 S ← sort S using the monomial order ≺ 3 remove duplicate values from S // S has now ns sorted elements // Second step 4 Initialize an array D of ns empty containers for the result
has its own arrays L min and L max , the additional memory usage requirement for these arrays is only 2n a c integers during the second step. However, the storage of these arrays is not required if it is possible to combine this function with the function MergeSort but this depends on the algorithm used in that function.
Using its own arrays L min and L max , each thread computes the summation of its own terms P i,j = a i b j x γi,j using any sequential comparison-based sorting algorithm (function MergeSort in the algorithm) and store them in a container D k associated to the corresponding interval. No concurrent writing or reading access occurs to the same container because threads need to read or write data only about their own current interval. Johnson proposes a sequential algorithm [9] which computes the result using a binary heap. If the multiplication produces O(n a + n b ) terms, only O(n a n b log min(n a , n b )) comparisons of exponents are required. Monagan and Pearce have improved this algorithm with a chained heap [10] . When all threads have finished to process all the intervals, a simple concatenation of the containers is performed to obtain the canonical form of the polynomial as the containers of D are already sorted according to the sorted intervals.
Adaptation to Computer Cluster
As the second step could be computed in independent parallel tasks, our algorithm could be easily adapted to a cluster of computational nodes. Cluster of computer nodes offers a distributed memory architecture where the access time to the memory located on the other nodes is several magnitude order greater Algorithm 2: mul(A, B, n s ). Return A × B using at most n s intervals on a cluster of N computer nodes. 
// similar to the step 2 of Alg. 1
than the access to the local memory. A message passing paradigm, such as MPI standard, should be used to perform the communications between the nodes. But a pure MPI application does not take advantage of the multiple cores available inside a node. An hybrid (multi-threading+MPI) approach must be used in order to reduce the cost of the communication and to improve the parallel scheduling of the second step of the previous algorithm. We assume that the operands are located on a single node and the result should be stored on this node. So the operands should be broadcast to the other nodes. If the operands are located on different nodes, each node broadcasts its content of the operands to the other nodes. A simple parallel scheduling could use the master-slave paradigm where a node is dedicated to be the master and other nodes request intervals to this master node, process the intervals and send the result to the master. Good loadbalancing in this context requires to have many intervals which involve many communications. Furthermore, the result may generate large messages which require to use the Rendezvous protocol and imply a waiting for the other slaves.
In order to limit the number of communications, a node should process consecutive intervals and send a single result for all these sorted intervals. Inside this node, the same parallel scheduling as in the shared memory context may be chosen to distribute the work between the threads. To reduce to the minimal number of communications, the number of group of consecutive intervals is chosen to be equal to the number of nodes. But an extra step is introduced to perform a good load-balancing between all the nodes. This extra step requires to compute the number of multiplications or operations required to process each interval. The cumulative summation of the number of operations is performed in order to create the group of consecutive intervals with almost the same number of operations. The master node will also compute a part of the result. Other nodes send their results back to the master node. The algorithm 2 shows the processing steps required to perform the multiplication on the cluster of nodes.
Adaptation to Specialized Many-core Hardware
GPU and other dedicated cards are able to perform general-purpose computations but they have dedicated memory. So the same adaptation as for the cluster of computers is done for the data transfer between the host memory and the GPU memory. The scheduling is easier than on the cluster since it can be done by the processor of the host computer. The values O k are not computed to perform the scheduling. If several many-core hardware are connected to the host computer, only the bounds l 1 and l 2 are sent to the different cards. As the memory transfer may be expensive, host processor may compute other data, e.g. put the result in a canonical form, in order to overlap the memory communication.
Choice of the Set S
The set S should be chosen in order to balance the work as better as possible between the intervals, even if a perfect work balancing is impossible without computing all the elements of the pp-matrix. The choice of the elements of S could be done using an almost regular grid over the pp-matrix. This method is very fast and very simple to implement. In order to obtain the fixed n s elements, our grid is defined using the following rules. are the distances between two selected points on the same line or column in the pp-mtarix. The value j 0,i is used to avoid the effect of the large strip. Due to the integer division, some elements in the last column and in the last line are selected to avoid a too large strip in the last part of the matrix. Figure 2(a) shows an example of a computed grid and the figure 2(b) shows the edge of the intervals computed by the different threads. Other sort of grids may be used instead of our selected grid but they have insignificant impact on the performance of the algorithm. For example, the pp-matrix may be divided in n s submatrices and a random exponent may be chosen inside each submatrix. Instead of selecting equidistant points on the line i of our grid, non-equidistant points may be selected on the line i to generate another sort of grid. We have tested these grids and the differences of the execution time of the product are less than 1.5% on a multicore multiprocessor computer using these grids.
To achieve maximal performance, the value n s or l should be chosen dynamically according to the number of available cores and/or to the number of terms of the polynomials. As n s should remain small in order to reduce the time spent in the first step and in the function FindEdge of the algorithm, the parameter should be fitted only to the number of available cores. The parameter l is preferred instead of n s for the tuning because a simple linear variation on this parameter is possible. Its value must be tuned only once, for example at the installation of the software. Of course, for small polynomials, the tuned value l may be too large and must be reduced in order to have enough work for each thread.
Benchmarks
Three examples are selected to test the implementation of our algorithm. The two first examples are due to Fateman in [11] and Monagan and Pearce in [2] .
-Example 1 : f 1 × g 1 with f 1 = (1 + x + y + z + t) 40 The scalability of our algorithm depends on the number of intervals n s , the size of operands (n a , n b ), and the number of cores (c) available on the computer. We have implemented two kinds of MergeSort algorithms for the parallel step to show its independence with respect to this algorithm. As in Monagan and Pearce, a chained heap algorithm is implemented to perform the summation of the terms but it does not include any lock as the binary heap is accessed only by one thread. This algorithm is noted heap in the tables and figures. The second sorter algorithm, noted tree, uses a tree data structure in which each internal node has exactly 16 children. At each level of this tree, four bits of the exponents are used to index the next children. If the exponents are encoded on 2 d bits, our tree will have 2 d−2 levels. The tree associated to each interval is converted to a distributed representation at the end of the algorithm in order to obtain a canonical distributed form of the polynomial. This container uses more memory but its complexity to insert all the elements is only in O(2 d−2 n a n b ). This practical complexity is better if the exponents are packed since many inserted terms have common bits inside their exponents. The exponents of the polynomials are packed on a 64-bit unsigned integer in the implemented algorithm.
To fit the value l, and so n s , on the available hardware, we generate randomly two sets of 280 sparse polynomials in several variables with different numbers of terms. The number of variables of these polynomials is from 4 to 8 and the number of terms varies from 10000 to 60000 terms. The products of the two sets of polynomials are performed with different values of l. An histogram is built with the values of l, whose time of execution does not differ more than 10% from the best time for each product.
Shared Memory Multiprocessors
As processors with multiple cores are now widely available in any computer, the three examples are executed on a computer with 8 Intel Xeon processors X7560 running at 2.27Ghz under the Linux operating system. Each processor has 8 physical cores sharing 24 Mbytes of cache. This computer has a total of 256Gbytes of RAM shared by its 64 cores. The parallel dynamic scheduling of the second step of the algorithm is performed by the OpenMP API [12] of the compiler. As the memory management could be a bottleneck in a multi-threading multiplication of sparse polynomials, the memory management is performed by the Intel threading building blocks library [13] , noted TBB, which provides a scalable allocator instead of the operating system C library.
The first step has been to tune the parameter n s or l on this hardware. Figure 3 shows the histogram of the number of best execution time according to the parameter l using the heap algorithm. For small value of l, not enough parallelism is provided to get good execution time. We fix the parameter l to 64 in order to perform the benchmarks on this computer. Fig. 3 . Number of products of the two sets of 280 randomly generated sparse multivariate polynomials using different values l, whose execution time does not differ more than 10% from the best time.
Our algorithm, noted DMPMC, is compared to the computer algebra systems Maple 16 [14] , Piranha [4] and Trip [6] . In all these software excepted Piranha, the coefficients of the polynomials are represented with integers using a mixed representation. For the integers smaller than 2 63 − 1 on 64-bit computers, hardware integers are used instead of integers' type of the GMP library [15] . The multiplication and additions of the terms use a three word-sized integers accumulator (a total of 192 bits) for the small integers. The same optimization is used in Maple [10] and Trip. The timings for Maple 16 are the timings reported by the multiplication step of the SDMP which excludes the DAG reconstruction of the polynomial. Piranha uses only the GMP integers and allocates memory with the same scalable memory allocator TBB. Two times are reported for Trip. The dense time is for the optimized dense recursive polynomial data structure (POLPV) and the sparse time is for the optimized sparse recursive polynomial data structure (POLYV). Table 1 shows the execution times of the three examples on the 64 cores computer. Even if our chained heap is less tuned for the dense polynomials on single core, our algorithm for distributed representation scales with the same behavior as the recursive algorithms of Trip. We define the speedup as T 1 /T c and the efficiency as T 1 /(c × T c ) where T c is the execution time on c cores. Figure 4 shows the speedup for the different implementations and confirms that the SDMP algorithm of Maple 16 [14] focuses only on a single multi-core processor with large shared memory cache. Indeed, the efficiency of the SDMP algorithm drops to less than 0.6 above 16 cores for the three examples while the efficiency of our algorithm remains above 0.7 on 64 cores. The limited scalability of Piranha is confirmed due to the access to its two global lists shared between all the threads. The kind of MergeSort algorithm has only a significant impact on the execution time but not on the scalability of our algorithm. The small differences observed on the speedup between the tree and heap algorithm come from the different number of memory allocations since the tree version requires more memory allocation. Although we have used a sequential implementation of the first step, its duration remains insignificant. The computation of the edge (function FindEdge) by each thread takes only a few percent of the total time.
Distributed Memory Computers
In this second set of experiments, algorithm 2 is implemented using the hybrid approach OpenMP+MPI on the MesoPSL cluster with 64 nodes interconnected with a QDR InfiniBand network for a total of 1024 cores. Each node embeds two Intel E5-2670 processors sharing a total of 64 Gbytes of RAM between the 16 cores of the node. Quadruple precision floating point numbers have been used for the polynomials coefficients instead of variable size integers coefficients in order to simplify the exchange of the coefficients between the nodes. the speedup of the algorithm on this cluster with the tuned parameter n s = 8 √ c where c is the total number of cores. The speedup is defined as T 1,OpenMP /T n,MPI where T 1,OpenMP is the execution time on one core of a single node using the OpenMP implementation of the algorithm 1 and T n,MPI is the execution time on n nodes using the hybrid OpenMP+MPI implementation of algorithm 2. The limitation of the 2 GB maximum message size in the MPI implementation of the cluster requires to implement a custom gather operation using the send/receive messages in order to collect the result on the root node which have significant impact on the timings. This bottleneck is especially visible on the large result of the second example which contains more than 300 million terms since half of the time is spent to transfer the result from the slave nodes to the master node. However, the algorithm always continues to scale according to the number of available cores. In all cases, the algorithm scales well up to at least two hundred cores. Similar behaviors are obtained if double precision coefficients are used instead of quadruple precision coefficients.
Specialized Many-core Hardware
To test the algorithm on a many-core hardware, the benchmarks are performed on a Nvidia Tesla S2050 computing System based on the Fermi architecture interconnected through two links to a host computer using a PCI Express 2 16x controller. The host computer is the same computer as for the shared memory benchmark. The Nvidia Tesla S2050 consists of four Fermi graphics processing units. So two GPU cards share the same links to the host controller. Each GPU embeds 14 streaming multiprocessors and 3 GBytes DDR5 memory. Each streaming processor has 32 cores running at 1.15 Ghz and is able to schedule two groups of 32 threads simultaneously. A total of 448 cores is available per GPU. The kernel function running on the GPU is implemented using the CUDA programming model [16] .
Only the tree algorithm for the MergeSort function is implemented since the heap version is not well adapted for the CUDA programming model. Indeed, an interval is not processed by a single thread but it is processed by a group of t threads, called a block in the CUDA terminology. So for each interval, a group of threads constructs a temporary tree in order to merge and sort the terms. Atomic instructions have been used in order to reduce the number of synchronization inside the group of threads. To avoid the divergence of the execution path of the threads inside a group, this one handles the terms line after line. Only t elements of the array L min are stored in the shared memory of the GPU at the same time in order to reduce the memory consumption and to avoid access to the global memory. In our implementation, the reconstruction of the canonical distributed representation from the tree is performed on the host processor with one host thread per GPU and overlaps the computations by the GPU. A simple static scheduling is performed : ns 5g intervals are processed at a same time by a GPU if g GPUs are used for the computations. So each GPU executes 5 times the kernel function. More sophisticated scheduling may be done by overlapping memory transfer between the host and GPU memory. As no version of the GMP library is available for the GPU side, double-precision floating point numbers have been used for the coefficients on CPU and GPU. Table 2 shows the execution time and speedup obtained on the GPU with different number of threads (t) inside the group. As the kernel function needs to be transferred on the card by the CUDA driver, the timings are the average of eight executions without two first useless executions. The group of 64 threads has the best execution time for the three examples. The scalability on several GPUs is not linear for several reasons. The major reason is that the four cards share the two links to the host. A better dynamic scheduling should be done using an optimized heterogeneous scheduler, such as StarPU [17] , and the computations on the card are not overlapped by the memory transfer. For comparison, the execution time of Trip, Piranha and DMPMC on the same host computer are reported in Table 3 with the same kind of numerical coefficients.
Conclusions
The proposed algorithm for the multiplication of sparse multivariate polynomials stored in a distributed format does not have any bottleneck related to the numbers of cores due to the lack of synchronization or locks during the main parallel step. But it requires a preliminary one time step to tune the size of the grid to the targeted hardware. The range of targeted processor units is wide for our algorithm. The only drawback comes from the time to transfer data between nodes on the distributed memory systems due to the limited performance of the interconnection network. It can use any available fastest sequential merge and sort algorithm to generate the terms of the result and can benefit from any efficient dynamic scheduling. A more appropriate algorithm for this merge and sort step may be designed for the GPU hardware in order to take into account all features of these specialized hardware.
