In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp that consists of 32 threads. In parallel processing using multiple threads, usually, it is costly to synchronize execution of threads and communicate within threads. In warpsynchronous programming technique, however, execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronous operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiplelength multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
Introduction
Recent Graphics Processing Units (GPUs), which have a lot of processing units, can be used for general purpose parallel computation. Since GPUs have very high memory bandwidth, the performance of GPUs greatly depends on memory access. CUDA (Compute Unified Device Architecture) [2] is the architecture for general purpose parallel computation on GPUs. Using CUDA, we can develop parallel algorithms to be implemented in GPUs. Therefore, many studies have been devoted to implement parallel algorithms using CUDA [3] - [5] .
Applications require arithmetic operations on integer numbers which exceed the range of processing by a CPU directly is called multiple-length numbers or multiple-lengthprecision numbers and hence, computation of these numbers is called multiple-length arithmetic. More specifically, application involving integer arithmetic operations for multiple-length numbers with size longer than 64 bits cannot be performed directly by conventional 64-bit CPUs, because their instruction supports integers with fixed 64 bits. To execute such application, CPUs need to repeat arithmetic opera- * The preliminary version of this paper has been presented at the Third International Symposium on Computing and Networking (CANDAR 2015) [1] .
a) E-mail: yasuaki@cs.hiroshima-u.ac.jp DOI: 10.1587/transinf.2016PAP0027
tions for those numbers with fixed 64 bits which increase the execution overhead. Suppose that a multiple-length number is represented by w words, that is, a multiple-length number is 64w bits on conventional 64-bit CPUs. The addition of such two number can be computed in O(w) time. However, the multiplication generally takes O(w 2 ) time. Multiplelength multiplication is widely used in various applications such as cryptographic computation [6] , and computational science [7] . Since multiple-length numbers of size thousands to several tens of thousands bits are used in such applications, the acceleration of the computation of their multiplications is in great demand. Also, considering practical cases, a large number of multiplications are usually computed. Therefore, in this work, we target at the computation for many multiple-length multiplications of such size.
Main contribution of this paper is to present an implementation of multiple-length multiplication optimized for CUDA-enabled GPUs. The idea of our GPU implementation is to adopt warp-synchronous programming technique [8] . We assign each multiple-length multiplication to one warp that consists of 32 threads. In parallel processing using multiple threads, usually, it is costly to synchronize execution of threads and communicate within threads. In warp-synchronous programming technique, however, execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronous operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. Using these ideas, we propose a warp synchronous implementation of 1024-bit multiplication on the GPU. In addition, we show multiple-length multiplication methods for more than 1024 bits using the 1024-bit multiplication method as a subroutine. To reduce the number of multiplications, we use Toom-Cook method [9] .
The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation by GMP (GNU Multiple Precision Arithmetic Library) [10] . Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
There are GPU implementations to accelerate multiplelength multiplications. In papers [11] , [12] , GPU implementations of very large integer multiplications using FFT are shown. FFT-based multiplication methods have a small
Copyright c 2016 The Institute of Electronics, Information and Communication Engineers time complexity, however, it is efficient for quite large numbers that consists of more than several hundred thousand bits [13] . Zhao et al. proposed multiple-length multiplication only for 512 to 2048-bit integers on the GPU as one of library functions [14] . Since this implementation is based on School method that is a naive multiplication method, it is not efficient for more than several thousand-bit numbers. Kitano et al. proposed a GPU implementation of parallel multiple-length multiplication also based on School method [15] . In the implementation, load of each thread is equalized by reordering the computation of partial products. However, since warp divergence occurs frequently to perform this reordering, its parallel algorithm is not suitable for GPU architecture. Although several GPU implementations including the above implementations have been proposed, as far as we know, there is no GPU implementation that focuses on bulk execution of multiple-length multiplication. More specifically, many researchers have been devoted to develop and implement parallel algorithms for one input. Although we can obtain outputs for many different inputs by repeating them, their efficiency is not discussed. For example, the aim of the above works [11] , [12] , [15] is to accelerate the computation of one multiplication for quite large integers by many threads on the GPU. By repeating these methods for many inputs, the bulk execution can be performed. However, there is no research that is premised on the bulk execution. On the other hand, to accelerate the bulk execution of multiple-length multiplication, our proposed method uses the idea in [16] that shows a technique of more efficient GPU implementations for the bulk execution by considering the GPU architecture.
The rest of this paper is organized as follows. Section 2 briefly describes about CUDA architecture. Section 3 describes multiple-length multiplication methods. This section also by reviewing the warp shuffle functions considered in this work. In Sect. 4, our GPU implementation of multiplelength multiplication using warp synchronize programming is proposed. Experimental results are shown in Sect. 5. Finally, Sect. 6 concludes the paper.
Compute Unified Device Architecture (CUDA)
We briefly explain CUDA architecture that we will use. Figure 1 illustrates the CUDA hardware architecture. CUDA uses three types of memories in the NVIDIA GPUs: the global memory, the shared memory, and the registers [17] . The global memory is implemented as an off-chip DRAM of the GPU, and has large capacity, say, 1.5-12 Gbytes, but its access latency is very long. The shared memory is an extremely fast on-chip memory with lower capacity, say, 16-112 Kbytes. The registers in CUDA are placed on each core in the multiprocessor and the fastest memory, that is, no latency is necessary. However, the size of the registers is the smallest during them. The efficient usage of the global memory and the shared memory is a key for CUDA developers to accelerate applications using GPUs. In particular, we need to consider the coalescing of the global [18] , [19] . To maximize the bandwidth between the GPU and the DRAM chips, the consecutive addresses of the global memory must be accessed in the same time. Thus, threads should perform coalescing access when they access to the global memory.
CUDA parallel programming model has a hierarchy of thread groups called grid, block and thread. A single grid is organized by multiple blocks, each of which has equal number of threads. The blocks are allocated to streaming processors such that all threads in a block are executed by the same streaming processor in parallel. All threads can access to the global memory. However, as we can see in Fig. 1 , threads in a block can access to the shared memory of the streaming processor to which the block is allocated. Since blocks are arranged to multiple streaming processors, threads in different blocks cannot share data in shared memories. Also, the registers are only accessible by a thread, that is, the registers cannot be shared by multiple threads.
CUDA C extends C language by allowing the programmer to define C functions, called kernels. By invoking a kernel, all blocks in the grid are allocated in streaming processors, and threads in each block are executed by processor cores in a single streaming processor. In the execution, threads in a block are split into groups of thread called warps. A warp is an implicitly synchronized group of threads. Each of these warps contains the same number of threads and is executed independently. When a warp is selected for execution, all threads execute the same instruction. Any flow control instruction (e.g. if-statements in C language) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge, that is, to follow different execution paths. If this happens, the different execution paths have to be serialized. When all the different execution paths have completed, the threads back to the same execution path. For example, for an ifelse statement, if some threads in a warp take the if-clause and others take the else-clause, both clauses are executed in serial. On the other hand, when all threads in a warp branch in the same direction, all threads in a warp take the if-clause, or all take the else-clause. Therefore, to improve the performance, it is important to make branch behavior of all threads in a warp uniform. When one warp is paused or stalled, other warps can be executed to hide latencies and keep the hardware busy.
There is a metric, called occupancy, related to the number of active warps on a streaming processor. The occupancy is the ratio of the number of active warps per streaming processor to the maximum number of possible active warps. It is important in determining how effectively the hardware is kept busy. The occupancy depends on the number of registers, the numbers of threads and blocks, and the size of shard memory used in a block. Namely, utilizing too many resources per thread or block may limit the occupancy. To obtain good performance with the GPUs, the occupancy should be considered.
The kernel calls terminates, when threads in all blocks finish the computation. Since all threads in a single block are executed by a single streaming processor, the barrier synchronization of them can be done by calling CUDA C syncthreds() function. However, there is no direct way to synchronize threads in different blocks. One of the indirect methods of inter-block barrier synchronization is to partition the computation into kernels. Since continuous kernel calls can be executed such that a kernel is called after all blocks of the previous kernel terminates, execution of blocks is synchronized at the end of kernel calls. On the other hand, all threads of a warp perform the same instruction at the same time. More specifically, any synchronizing operations are not necessary to synchronize threads within a warp.
In CUDA, warp shuffle functions allow the exchange of 32-bit data between threads within a warp, which become available on relatively recent GPUs with compute capability 3.0 and above [17] . Threads in the warp can read other threads' registers without accessing the shared memory. The exchange is performed simultaneously for all threads within the warp. Of particular interest is the shfl() function, that is one of the warp shuffle functions. This function takes as parameters a local register variable x and a thread index id. As an example, consider the following function call shfl(x, 4). The shfl(x, 4) allows to transfer the data stored in the local register variable x from a thread whose id is 4 ( Fig. 2 (a) ). This function call corresponds to broadcasting a register variable in a thread to the other threads in a warp. We note that each thread has its own local register x, that is, each x cannot be accessed from other threads. As another example, consider the function call shfl(x, (id + 1)%w), where w is the number of threads in a warp. The function call performs data transfer like right circular shift between threads as illustrated in Fig. 2 (b) . In the similar way, the shfl(x, (id + w − 1)%w) allows to transfer data like left right circular shift (Fig. 2 (c) ). The above data exchange can be performed via shared memory. However, the latency of shared memory access is longer than that of the warp shuffle functions. Since the use of shared memory may cause for decreasing occupancy, if the warp shuffle functions can be used, they should be used.
Warp synchronous programming technique [8] is a parallel programming technique such that one warp is used as an execution unit. The characteristic of this technique is that any synchronous operations are not necessary. Usually, it is costly to synchronize execution of threads and communicate 
Multiple-Length Multiplication
In the following, we will represent multiple-length numbers as arrays of r-bit words. In general, r = 32 or 64 for conventional CPUs. Let R denote the bit-length of numbers and d be the number of r-bit words. Therefore, d = R r . For example, a 1024-bit integer consists of 32 words. Next, we will introduce several multiplication methods for such multiple-length numbers.
Suppose A and B represent two multi-length numbers. We are multiplying A by B and the result is stored in C, that is C = AB. To compute this multiplication, School method is often used. The algorithm of School method is shown in Algorithm 1. For simplicity, in the algorithm, the sizes of the multiplicand and the multiplier are the same and {x, y} denotes a concatenation of x and y. School method multiplies the multiplicand by each word of the multiplier and then adds up all the properly shifted results illustrated in Fig. 3 (a) . As illustrated in the figure, calculation of School method is performed in the row order and some storage needs to be allocated to store intermediate results that are partial products. In School method, intermediate data that are partial products need to be stored to the memory as described at line 6 in Algorithm 1 is necessary.
To avoid storing the partial products, Comba method [20] is used. The algorithm of Comba method is shown in Algorithm 2. According to the algorithm, the readers may think that it is more complicated than School method. However, the difference is only the order of multiplications of words and the number of multiplications of 
Algorithm 1 School method
c i+ j ← v 7: end for 8: c w+ j ← u 9: end for words is the same as illustrated in Fig. 3 . More specifically, calculation of Comba method is performed in the column order. In Comba method, intermediate data also has to be stored. However, the data corresponds to carry data for the next column. Since the size of the carry data does not depend on the size of numbers and it is only one or two words, its storage can be placed to the register. Table 1 shows the number of word-wise multiplications and memory access of School and Comba methods. From the table, the number of memory access, especially memory write, of Comba method is greatly reduced.
Toom-Cook method [9] is an algorithm for multiplying two numbers that reduce the number of multiplications compared with School method and Comba method. Toom-Cook method splits each number into multiple parts of equal length. A k-way Toom-Cook method (called Toomk) can do a multiplication by dividing an integer into k parts. Toom-3 reduces the number of multiplications to from 9 to 5.
For simplicity, we will explain how to perform Toom-3 as follows. Let us consider two numbers A and B to be multiplied are split into three parts of size
In Toom-Cook method, these formulae are considered as polynomials by replacing 2 R 3 with a variable x, as follows:
Algorithm 2 Comba method
If the coefficients C 0 , C 1 , C 2 , C 3 , and C 4 are determined, the product A × B can be computed from C(2 R 3 ). Since the number of coefficients is 5, we can determine them by solving 5 equations obtained by assigning 5 distinct values to x in C(x). For example, the following 5 equations are produced by x = 0, 1, −1, 2, ∞;
Note that C(∞) is equivalent to lim x→∞ C(x) x 4 , that is, C(∞) equals to the value of the highest-degree coefficient of C(x).
By solving these equations, we have the following values of coefficients:
On the other hand, from the definition we have
Using these formulae, C(0), C(1), C(−1), C(2), and C(∞)
are computed from A and B. After that, by computing C 0 , C 1 , C 2 , C 3 , and C 4 , the final result of the product can be obtained. According to the result, the number of multiplication excluding small constant numbers to be multiplied is reduced from 9 to 5. For k ≥ 2, we can perform Toom-k multiplication in a similar way.
Parallel Multiple-Length Multiplication for the GPU
This section presents the main contribution of this work. First, we propose a parallel 1024-bit multiple-length multiplication method, called Sum-rotate multiplication. After that, we show Toom-k multiplication methods for more than 1024 bits with the 1024-bit multiple-length multiplication method as a sub-routine.
Sum-Rotate Multiplication
We propose a parallel 1024-bit multiple-length multiplication method using a warp. The method is based on warpsynchronous programming. In the following, w threads, which correspond to one warp, are used and work in parallel without any barrier synchronize operations since threads within a warp execute the same instruction and synchronize for each instruction. Also, the proposed parallel multiplelength multiplication does not use any shared memory. It is a parallel algorithm that parallelizes School method basically, called Sum-rotate multiplication. To achieve this, we employ warp shuffle functions as described in Sect. 2. More specifically, data exchange methods, broadcast and right/left circular shift, as shown in Fig. 2 using warp shuffle function shfl() are utilized. The detail of the parallel algorithm are presented next. In our approach, a product C = (c 2w−1 , . . . , c 1 , c 0 ) of two w-word numbers A = (a w−1 , . . . , a 1 , a 0 ) and B = (b w−1 , . . . , b 1 , b 0 ) is computed, where the size of each word is 32 bits. Since w = 32 unless the value of w is not changed for changing the GPU architecture in the future, this algorithm supports a multiplication of two 1024-bit numbers.
Let us consider how to perform the computation using multiple threads. A simple idea is to assign threads column by column as illustrated in Fig. 4 . In the figure, threads are assigned to two columns to balance the computation load of each threads. However, since threads have to switch columns in distinct timings during the computation, warp divergence, described in Sect. 2, occurs. This parallel approach is not suitable for GPUs.
On the other hand, in the proposed approach, w threads, that correspond to one warp, are used. Each thread is assigned to one of the partial products in each column. In the proposed approach, since each thread takes partial products shifting to the upper digits row by row, it is necessary to obtain the partial products, except the carry, from a thread assigned to the upper digits. To achieve this, we use the inter-thread right circular shift described in Sect. 2. in each row, thread 0 obtains the final product of c j . According to Fig. 5 , a thread assigned to the lowest digits can obtain the lower words of the final product c 0 , . . . , c w−1 for each row. On the other hand, the upper words of c w , . . . , c 2w−1 are finally computed by thread 0, . . . , thread w − 1, respectively. After completing the multiplication, 2w words of the final results are placed such that thread i has two words c i and c i+w to store the results to consecutive address of the global memory using coalescing access in parallel.
The details of Sum-rotate multiplication are shown in Algorithm 3. Each step of the algorithm is executed by w threads in parallel. First of all, in lines 3 and 4, each thread loads one word of each from A and B stored in the global memory and stores them to its own registers a and b. After that, the multiplication is performed row by row as illustrated in Fig. 5 . In line 6, thread j broadcasts b j to local register b using the warp shuffle function to compute the product a · b in the next step. In line 7, partial products are computed including the addition of the carry from the upper digits. Each thread obtains the partial products except the carry from a thread assigned to the upper digits as the carry for the next digits by right circular shift of register v in line 8. In line 9, product c i of the final product computed by thread 0 is transferred to the right thread using right circular shift of register c . Next, thread w − 1, that is assigned to the leftmost thread in Fig. 5 , set registers c and v to v and 0, respectively. This is for the right circular shift operations in lines 8 and 9. Since this operation is performed only by thread w − 1, warp divergence occurs, but the effect to the performance seems to be very small. After that, each thread obtains the value of the next digits in line 14.
After for loop, each thread has the lower digits of the final products c 0 , . . . , c w−1 , respectively. At that time, the upper digits c w , . . . , c 2w−1 has not been computed yet since each thread still has the carry. Therefore, the while loop in lines 16 to 19, carry propagation is performed using left circular shift until any threads have no carry. In order to check whether any threads have no carry, we use warp vote function any() that evaluates truth values given from all threads of the warp and return non-zero if any of the truth values is non-zero [17] . This while loop is iterated at most w − 1 times. After the loop, since thread i has two words c i and c i+w , they are stored to the global memory with coalescing access in lines 20 and 21.
Toom-k Multiplication with Sum-Rotate Multiplication
In our GPU implementation, we use Toom-k multiplication method for more than 1024-bit numbers. For simplicity, we explain our GPU implementation with Toom-3 as follows.
The GPU implementation consists of two kernels. In the first kernel, five blocks that consist of one warp, that is five warps in total, are assigned to each multiplication. Each warp computes one value of C(0), C(1), C(−1), C (2) , and C(∞) shown in Sect. 3 with Sum-rotate method. When the size of multiplication is more than 1024 bits, each warp sequentially performs Comba method by repeating Sum-rotate method. In the second kernel, one block consisting of one warp is assigned to every 32 multiplications. Each thread computes one multiplication using the values obtained in the first kernel. Namely, one thread computes the values of C 0 , . . . , C 4 in serial and then, the final result of the multiplication is computed. In the second kernel, each thread runs for individual multiplications, but warp divergence does not occur since all threads execute the same instructions. In the above two kernels, each block is composed of only one warp so that the occupancy is increased by reducing the number of warps in a block. Moreover, all the data stored in the global memory is arranged so that the memory access can be performed with coalesced access, without going into detail. For k ≥ 2, the GPU implementation performs Toom-k multiplication in a same way. Let us evaluate the number of multiplications and memory access in the above methods that are important factor for the computing time. Table 1 shows a comparison of the number of multiplications and memory read/write for multiplying two w-word numbers in our implementation. Regarding as the number of multiplication, School and Comba methods are the largest among them. In Toom-k, when k is larger, the number of multiplications is smaller. Although every memory access is performed by coalesced access, memory access time is much longer than the other instructions such as arithmetic operations including multiplication and addition. Therefore, the computing time does not always become shorter for larger k. Hence, we should find the best parameter k that minimizes the computing time for the bit size of numbers. Algorithm 3 Sum-rotate multiplication using a warp 
Optimization of the Code for Arithmetic
As mentioned in the above, arithmetic with larger integers having more than 64 bits is necessary. In C language, however, there is no efficient way of doing such arithmetic because C language does not support operations with the carry bits. To optimize the arithmetic with more than 64-bit integers, therefore, a part of the code is written in PTX [21] that is an assembly language for NVIDIA GPUs and can be used as inline assembler in CUDA C language. PTX supports arithmetic operations with the carry bits.
Experimental Results
The main purpose of this section is to show the experimental results. In order to evaluate the computing time for multiplelength multiplication, we have used NVIDIA GeForce GTX 980, which has 2048 cores running on 1.216GHz [22] . The source program of the GPU implementation is compiled by nvcc version 6.5.13 with -O2 and -arch=sm 50 options. In the following, the computing time is average of 10 times execution and the computing time of the GPU does not include data transfer time between the main memory in the CPU and the device memory in the GPU. The reason that data transfer time between the main memory in the CPU and the device memory in the GPU is excluded is that in practical case this bulk execution of multiplication is performed before and/or after some process on the GPU. Therefore, input and output data is not always located in the main memory. First, we evaluate the performance of Sum-rotate multiplication based on warp-synchronous programming technique. We have implemented the single thread implementation such that each thread computes one multiplication. This implementation is based on the idea proposed in [16] . In the implementation, there is no warp divergence since all threads execute the same instructions, that is, this implementation is also based on warp-synchronous programming technique. In addition, to evaluate the effect of the use of warp shuffle function, we have implemented a multiplication method with the shared memory instead of the warp shuffle function. Table 2 shows the computing time when 100000 multiple-length multiplications of 1024 bits are computed. In the above implementations, every block has 32 threads, that is, one warp. According to the table, we can find that one warp implementation is faster than the single thread implementation. For data communication within threads, use of warp shuffle functions is more effective than that of shared memory. Table 2 also shows the computing time of the sequential school method and parallel column-based multiplication with shared memory. These two methods are mainly used in the existing methods proposed by Zhao [14] and Kitano [15] . As a result, the proposed method is 10.2 and 2.1 times faster than the existing two methods, respectively. Figure 6 shows the computing time of the GPU implementations of Comba, Toom-k (k = 2, . . . , 9) when 10240 multiplications are computed. Recall that each implementation uses Sum-rotate method with warp shuffle functions as a sub-routine for 1024-bit multiplication. Although the graph may be unclear, Comba method is faster for less than Fig. 6 The computing time of GPU implementations in milliseconds for 10240 multiple-length multiplications 10240 bits. For 10240 or more bits, Toom-k is faster than Comba method. Also, when the number of bits is larger, Toom-k for larger k (k ≤ 8) is faster. However, Toom-9 is not the fastest for any number of bits because the overhead such as memory access cannot be ignored even though the number of multiplication is less than the other methods from Table 1 .
We have also used Intel PC using Xeon X7460 running on 2.6GHz to compare the performance of the GPU implementation with sequential algorithms on the CPU. In the CPU implementation, we have utilized GMP version 6.1.0. The source program is compiled by gcc version 4.8.3 with -O2 option. Table 3 shows the comparison between CPU and GPU implementations for the computing time in milliseconds when 10240 multiple-length multiplications are computed. The table also shows the multiplication methods that have been used. In the CPU implementation, when the GMP library was installed, the best method has been selected according to the execution environment. On the other hand, the best method shown in Fig. 6 has been used in the GPU implementation. We note that in 64-, 128-, 256-, and 512-bit multiplications, we cannot directly use the sum-rotate method shown in Sect. 4 since 32 threads in one warp compute one 1024-bit multiplication. Therefore, in such smaller size of bits, several multiplications are concurrently computed by one warp to use all the threads in a warp. For example, in 256-bit multiplications, four 256-bit multiplications are simultaneously computed by assigning 8 threads in a warp to each 256-bit multiplication. According to the table, using the proposed GPU implementation, the comput-ing time can be reduced by a factor of 19.65 to 52.14. Table 3 also shows the computing time of GPU implementation using CUMP library version 1.0.1 [23] that supports multiple-length multiplication on the GPU. Since CUMP uses the local memory whose size is limited, 65535-bit multiplication could not be executed. Also, in the GPU implementation with CUMP, each multiplication is computed with school method by one thread sequentially. As a result, the proposed method can compute multiplication 1.6 to 50 times faster than CUMP.
Conclusion
In this paper, we have presented a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt warp-synchronous programming technique. Using this idea, we have proposed Sum-rotate multiplication of two 1024-bit integers. We assign each multiple-length multiplication to one warp that consists of 32 threads. The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the single CPU implementation using GNU multiple precision arithmetic library. Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speedup factor of 21 for 65536-bit multiple-length multiplication.
