Introduction
Large operands are widely used in scientific, cryptographic, multimedia, and signal processing applications. Multiplication is one of the most frequently used arithmetic operations in these applications [1, 2]. General purpose processors do not contain large multipliers. To compensate for the lack of hardware, special software routines or multiple-precision arithmetic libraries, such as the GNU Multiple Precision Arithmetic Library, can be used to multiply large operands. These routines decompose the large operands into standard size suboperands and perform multiple suboperand multiplications; the products of these multiplications are aligned and summed to generate the large product. There are algorithms faster than this simple method [3, 4]; however, they are also constrained to use the standard size multipliers. Thus, the software-only approach becomes extremely time-consuming when an application executes a vast number of large multiplications. Therefore, there is a genuine need for large multipliers that work fast and use as little logic as possible.
Recent work on the design of large multipliers focuses on field-programmable gate array (FPGA) implementations because of their rapid design and flexibility advantages [5-12]. A brief discussion of the previous work is provided below.
In [5], hybrid sequential large multipliers are designed using Broadcast (an implementation of the decomposition method) and Karatsuba-Ofman (KO) multiplier blocks. Various combinations of these multiplier blocks are tried out to implement 256-bit multipliers. Among these implementations, the one that uses four hierarchical stages of KO multipliers is the fastest, but uses the most logic resources; the implementation that uses two hierarchical stages of two Broadcast multipliers is the slowest.
In [6], combinational large multiplier and squarer designs that use the decomposition method are presented. Both designs use fast adder trees to sum the partial products generated by suboperand multiplications; 20-bit to 85-bit multiplier implementations are mapped on Spartan-3 FPGAs.
In [7], another combinational large multiplier design that uses the decomposition method is presented. The design method exploits the structure of the arithmetic slices and the fast carry chains provided in Virtex 4 FPGAs; 16-bit to 221-bit implementations of the proposed design are mapped on the FPGAs.
In [8], a bit-serial large multiplier design is presented. The design uses carry-save adders to perform the addition of partial product bits. The product bits are converted on the fly from borrow-save format to two's complement format; 128-bit to 1024-bit implementations are mapped on Virtex 2 FPGAs.
In [9], truncated large multiplier designs for high-precision floating-point multiplication are presented. The truncated multipliers can be used by applications that tolerate truncation error. The study modifies the KO method and applies it to both multiplication and squaring operations; 23-bit, 52-bit, and 112-bit pipelined implementations are synthesized and mapped on Virtex 4 FPGAs.
In [10], a large multiplier design that uses a modified KO method for high-precision floating-point multiplication is presented. A 128-bit quadruple-precision mantissa multiplier is constructed using one 66-bit and two 65-bit multipliers instead of four 64-bit multipliers.
In [11], two combinational signed large multiplier designs for FPGAs are presented. The first design uses symmetric multiplier blocks, while the second one uses asymmetric multiplier blocks; 51 by 68 to 51 by 190 multiplier implementations are mapped on Virtex 5 FPGAs.
In [12], three sequential large multiplier designs for FPGAs are presented. The paper uses the modified decomposition method and presents the speed-area tradeoff among those designs; 256-bit to 2048-bit implementations are synthesized and mapped on Virtex 5 FPGAs.
The main aspects of the previous work are summarized in Table 1. The columns of the table show the reference of the work, the type of multiplication method, the target FPGA device, and the size, delay, and resource usage of the largest implementation mapped on the target platform. The resource usage is expressed in terms of the number of slices, LUTs, and utilized embedded multipliers. The delays for the designs are rounded to the nearest integer and given in nanoseconds. The designs of Quan et al. [5] and Athow and Al-Khalili [7] use excessive amounts of hardware resources. The multiplier proposed by Bessalah et al. [8] can support 1024-bit multiplication, but is extremely slow. The designs reported in Banescu et al. [9] and in Jaiswal and Cheung [10] are the fastest, but they are designed for floating-point multiplication and cannot multiply operands larger than 130 bits. The design reported in Gao et al. [11] is approximately five times slower than the fastest implementations and suffers from the same limited operand size.
The previous large multiplier designs are mostly combinational and achieve high execution speeds by liberally using FPGA resources. Currently, a 256-bit multiplier is the largest combinational implementation that can be mapped on a Virtex 5 FPGA by using all arithmetic slices. In practice, however, all the arithmetic slices cannot be dedicated to multiplier logic alone. Another issue is that the performance of the previous designs usually depends on key attributes of the platform, such as the existence of fast carry chains and the size of the built-in multipliers. Model-dependent optimization may not give the same results on all platforms, since even the members of the same FPGA family can have structural differences. Especially when resources are very limited, sequential designs are good alternatives to combinational designs. They require a relatively small amount of resources, they can be mapped on any FPGA model, and they can multiply operands of any size as long as there are enough resources for storage. Naturally, sequential designs have higher latency than combinational designs. On the other hand, pipelining and fast methods such as the KO method can improve the performance of sequential designs.
This paper presents single and multiple precision sequential large multiplier designs that explore this niche. The proposed designs decompose the large operands and use the KO algorithm to multiply the suboperands. The designs are pipelined to achieve maximum clock frequency. Both can generate full size products, a capability that most of the previous work does not mention; 256-bit, 512-bit, 1024-bit, and 2048-bit implementations of the proposed designs are mapped on FPGAs. The synthesis results are compared against the synthesis results reported for previous large multiplier implementations.
The rest of the paper is organized as follows: Section 2 presents the sequential large multiplication method and its implementation, Section 3 presents the multiple precision large multiplication method and its implementation, Section 4 gives delay and hardware usage results, and Section 5 presents the conclusion.
The sequential large KO multiplication (SLKOM)
The SLKOM algorithm first performs suboperand multiplications and then adds their products. The decomposition method is briefly explained as follows. Assume that the $w$-bit large operands $A$ and $B$ are decomposed into $n$-bit suboperands. $A$ and $B$ can be expressed as sums of the suboperands:

$$A = \sum_{i=0}^{p-1} A_i \, 2^{in}, \qquad B = \sum_{j=0}^{p-1} B_j \, 2^{jn} \tag{1}$$

where $A_i$ and $B_j$ represent the $i$th and $j$th suboperands of $A$ and $B$, respectively, and $p$ represents the number of suboperands, computed as $p = \lceil w/n \rceil$. The multiplication of $A$ and $B$ generates a $2w$-bit product, $M$, which can also be expressed as the sum of the suboperand multiplications:

$$M = A \cdot B = \sum_{i=0}^{p-1} \sum_{j=0}^{p-1} A_i B_j \, 2^{(i+j)n} \tag{2}$$

The computation of $M$ using Eq. (2) requires $p^2$ $n$-bit multiplications and $(p^2 - 1)$ $2n$-bit additions.
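The decomposition method of Eq. (2) can be illustrated with a short software model. The following is a minimal sketch, not the hardware design: the function names `split_operand` and `decomposition_multiply` are hypothetical, and Python integers stand in for the registers and adders.

```python
def split_operand(x, n, p):
    """Return the p n-bit suboperands of x, least significant first."""
    mask = (1 << n) - 1
    return [(x >> (i * n)) & mask for i in range(p)]


def decomposition_multiply(a, b, w, n):
    """Multiply two w-bit integers using only n-bit by n-bit products.

    Returns the product and the number of n-bit multiplications used.
    """
    p = -(-w // n)  # p = ceil(w / n)
    a_parts = split_operand(a, n, p)
    b_parts = split_operand(b, n, p)
    product = 0
    mult_count = 0
    for i in range(p):
        for j in range(p):
            # Each partial product is at most 2n bits wide and is aligned
            # at bit position (i + j) * n before being summed.
            product += (a_parts[i] * b_parts[j]) << ((i + j) * n)
            mult_count += 1
    return product, mult_count
```

For 256-bit operands and an assumed suboperand size of 16 bits, `p` is 16, so the model performs 256 suboperand multiplications, matching the $p^2$ count stated above.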
The decomposition method is modified for KO implementation as follows: let $A_{2i+1}A_{2i}$ and $B_{2j+1}B_{2j}$ be $2n$-bit suboperands obtained by concatenating the $n$-bit suboperands $A_{2i}$, $A_{2i+1}$ and $B_{2j}$, $B_{2j+1}$, respectively.
Eq. (1) is rewritten as:
Three terms are defined using the suboperands as:
Eq. (2) is rewritten using these terms as:
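A form of Eqs. (3) to (5) that is consistent with the definitions above and with the products $PL$, $PH$, and $PT$ used in the implementation stages is sketched below. The double-index notation, the name $PM$ for the middle term, its definition as $PT - PL - PH$, and the assumption that $p$ is even are illustrative choices rather than statements taken from the text.

```latex
% Eq. (3): operands expressed with paired 2n-bit suboperands (p assumed even)
A = \sum_{i=0}^{p/2-1} \left( A_{2i+1}\, 2^{n} + A_{2i} \right) 2^{2in},
\qquad
B = \sum_{j=0}^{p/2-1} \left( B_{2j+1}\, 2^{n} + B_{2j} \right) 2^{2jn}

% Eq. (4): the three terms (PT is the product of the suboperand sums)
PL_{i,j} = A_{2i} B_{2j}, \qquad
PH_{i,j} = A_{2i+1} B_{2j+1}, \qquad
PM_{i,j} = \underbrace{(A_{2i} + A_{2i+1})(B_{2j} + B_{2j+1})}_{PT_{i,j}}
           - PL_{i,j} - PH_{i,j}

% Eq. (5): the product expressed with the three terms
M = \sum_{i=0}^{p/2-1} \sum_{j=0}^{p/2-1}
    \left( PH_{i,j}\, 2^{2n} + PM_{i,j}\, 2^{n} + PL_{i,j} \right) 2^{2(i+j)n}
```

Expanding $(A_{2i+1} 2^n + A_{2i})(B_{2j+1} 2^n + B_{2j})$ gives a middle coefficient of $A_{2i+1}B_{2j} + A_{2i}B_{2j+1}$, which equals $PM_{i,j}$ as defined here, so each pair of $2n$-bit suboperands costs two $n$-bit multiplications and one $(n+1)$-bit multiplication.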
The computation of $M$ using Eq. (5) requires $0.5p^2$ $n$-bit multiplications, $0.25p^2$ $(n+1)$-bit multiplications, and
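The multiplication counts quoted for the KO form can be checked with a small software model. This is a sketch under stated assumptions: the function name `ko_decomposition_multiply` is hypothetical, the middle term is taken to be `pt - pl - ph` (the standard Karatsuba-Ofman identity), and the suboperand count is padded to an even number so that suboperands can be paired.

```python
def ko_decomposition_multiply(a, b, w, n):
    """Multiply two w-bit integers by pairing n-bit suboperands (KO form).

    Returns the product and a dict counting the n-bit and (n+1)-bit
    multiplications that were performed.
    """
    p = -(-w // n)          # p = ceil(w / n)
    if p % 2:
        p += 1              # assumption: pad so suboperands can be paired
    mask = (1 << n) - 1
    a_parts = [(a >> (i * n)) & mask for i in range(p)]
    b_parts = [(b >> (j * n)) & mask for j in range(p)]
    counts = {"n_bit": 0, "n_plus_1_bit": 0}
    product = 0
    for i in range(p // 2):
        a_lo, a_hi = a_parts[2 * i], a_parts[2 * i + 1]
        a_sum = a_lo + a_hi                  # (n+1)-bit sum
        for j in range(p // 2):
            b_lo, b_hi = b_parts[2 * j], b_parts[2 * j + 1]
            b_sum = b_lo + b_hi              # (n+1)-bit sum
            pl = a_lo * b_lo                 # n-bit multiplication
            ph = a_hi * b_hi                 # n-bit multiplication
            pt = a_sum * b_sum               # (n+1)-bit multiplication
            counts["n_bit"] += 2
            counts["n_plus_1_bit"] += 1
            pm = pt - pl - ph                # equals a_lo*b_hi + a_hi*b_lo
            block = pl + (pm << n) + (ph << (2 * n))
            product += block << (2 * n * (i + j))
    return product, counts
```

With 256-bit operands and an assumed 16-bit suboperand size ($p = 16$), the model uses 128 $n$-bit and 64 $(n+1)$-bit multiplications, that is, $0.5p^2$ and $0.25p^2$, compared with the 256 multiplications of the plain decomposition method.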
Algorithm 1 shows the steps and the data flow in time and space for the SLKOM. In this algorithm, '&' represents the concatenation operation, '$\{0\}^{n-3}$' represents a string of $n - 3$ zeros, and the subscript notation '$x:p$' represents the string of bits from position $x$ to position $p$. For example, $M_{2n-1:n}$ represents the bits from position $2n - 1$ to position $n$. The algorithm consists of two parts. The first part generates the $w$ less significant product bits. The second part generates the $w$ more significant product bits when needed. The inner loop in the first part is not iterated in time; its iterations show the inputs and outputs of the multipliers and adders. For example, the multiplication $A_0 \cdot B_0$ is performed by Multiplier 0 at iteration 0, and the multiplication $A_0 \cdot B_{p-1}$ is performed by Multiplier $p - 1$ at iteration 0. In the first part, each iteration of the outer loop generates $2n$ bits of the product. In the second part, the loop shows how the carry and sum values ($C$ and $S$) are aligned and combined into two vectors, $CN$ and $SN$, respectively. These vectors are added to generate the $w$ more significant product bits.
Figure 1 shows the block diagram for the SLKOM. The design has two main parts. The first part has five pipeline stages, and once the pipeline is filled it computes the less significant $w$ bits of the product in $p/2$ cycles. An extra cycle is needed between large multiplications to reset the registers that hold values left by the previous multiplication. The second part is called the "Align and add stage", and it computes the more significant $w$ bits of the product. The units and their functions in all stages are explained as follows:
Implementation of a SLKOM
Stage 1: In this stage, two $w$-bit registers, R1 and R, are used to store operands $A$ and $B$, respectively. R1 is a right-shift register, which shifts $2n$ bits in each cycle. Moreover, in the first stage $p/2 + 1$ $n$-bit adders perform the additions(
Stage 2: In the second stage, the even numbered $n$-bit multipliers multiply the suboperand $A_{2j}$ by the suboperands $B_0, B_2, \ldots, B_{p-4}, B_{p-2}$. The product generated by an even numbered $n$-bit multiplier $j$ is represented as $PL_j$. The odd numbered $n$-bit multipliers multiply the suboperand $A_{2j+1}$ by the suboperands $B_1, B_3, \ldots, B_{p-3}, B_{p-1}$. The product generated by an odd numbered $n$-bit multiplier $j$ is represented as $PH_j$. Furthermore, $p/2$ $(n+1)$-bit multipliers multiply the outputs of the adders in Stage 1: the output of the first adder, $R1S$, is multiplied by the outputs of the other adders, $RS_j$. The products generated by these multiplications are represented as $PT_j$.
Stage 4: This stage consists of $p$ multioperand adders (MOAs) that sum the products $PL$, $PM$, and $PH$ generated in the previous stages and the outputs of the MOAs generated in the previous cycle. The sum and carry outputs of $MOA_j$ at cycle $i$ are represented as $S_j(i)$ and $C_j(i)$. To align the inputs of the MOAs, the $PL$, $PM$, and $PH$ values are further divided into low and high parts as $PLL$, $PLH$, $PML$, $PMH$, $PHL$, and $PHH$, respectively. $MOA_0$ adds $PLL_0$, $S_2(i-1)$, $C_1(i-1)$, and a carry bit, $CT$, which is generated by the $n$-bit adder in Stage 5.
Stage 5: This stage consists of an $n$-bit adder and a $w$-bit right-shift register (R2). In every cycle, $2n$ bits of the product are generated by adding $S_1(i-1)$ and $C_0(i-1)$ and concatenating their sum with $S_0(i-1)$. The sum output of this adder is shifted into R2. The carry-out of the $n$-bit adder, $CT$, is added by $MOA_0$ in Stage 4. After $p/2$ iterations, the $w$-bit right-shift register R2 holds the less significant half of the product. Then, the $S_i$ and $C_i$ values generated in this stage can be used to calculate the more significant half of the product in the "Align and add stage".
Align and add stage: This stage is independent of the pipelined structure. It can compute the $w$ more significant bits of the product while the first part is multiplying another pair of large operands. The $w$-bit carry-propagate adder (CPA) located in this stage adds the $S_{p+1:2}$ and $C_{p:1}$ vectors with the carry bit $CT$. This addition can also be carried out sequentially, as long as its delay is less than the delay of the first part. In this way, a smaller adder can be used in the implementation.
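The overall data flow of the stages above (consume $2n$ bits of $A$ per cycle, emit $2n$ bits of the low product half per cycle, and resolve the high half at the end) can be mimicked by a behavioral model. The sketch below is a simplification under stated assumptions: the function name `sequential_multiply` is hypothetical, a single Python integer `acc` replaces the carry-save sum and carry vectors of the MOAs, the per-cycle product `chunk * b` is computed directly rather than through the KO multipliers, and $w$ is assumed to be a multiple of $2n$.

```python
def sequential_multiply(a, b, w, n):
    """Behavioral model of a sequential multiplier emitting 2n bits per cycle.

    Returns (low_half, high_half, iterations), where low_half and high_half
    are the less and more significant w bits of the 2w-bit product.
    """
    step = 2 * n
    assert w % step == 0, "this sketch assumes w is a multiple of 2n"
    iterations = w // step                 # p / 2 cycles
    mask = (1 << step) - 1
    acc = 0                                # stands in for the S and C vectors
    low_half = 0                           # models the right-shift register R2
    for i in range(iterations):
        # R1 shifts right by 2n bits each cycle, exposing the next chunk of A.
        chunk = (a >> (i * step)) & mask
        acc += chunk * b
        # The 2n least significant bits are final and are shifted into R2.
        low_half |= (acc & mask) << (i * step)
        acc >>= step
    # What remains corresponds to the value resolved by the align and add stage.
    high_half = acc
    return low_half, high_half, iterations
```

After the last iteration, `low_half` holds the less significant $w$ bits and the remaining accumulator value is the more significant half; in the hardware this remainder is held as separate sum and carry vectors until the CPA adds them.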
Implementation of a multiple-precision SLKOM (MPSLKOM)
The SLKOM design can also be used to perform low precision multiplications. For example, a 2048-bit SLKOM can multiply operands smaller than 2048 bits by setting the unused inputs to zero and decreasing the number of iterations. However, this method is not very efficient, since the hardware that processes the zero inputs does not contribute to the computation. This problem is solved by modifying the SLKOM design. The modified design is called MPSLKOM. Figure 3 shows the block diagram for the MPSLKOM design. Similar to the SLKOM implementation, the design has five pipeline stages. Each stage consists of $k$ blocks that can process $(w/k)$-bit operands. At the lowest precision, each column functions as an independent $(w/k)$-bit multiplier, and the design executes $k$ parallel multiplications. When the operand precision is doubled, the columns are paired and each pair functions as a $(2w/k)$-bit multiplier. At the highest precision, all the columns are combined and function as a single $w$-bit multiplier. In general, the MPSLKOM design can multiply $(cw/k)$-bit operands, where $c$ is any power of 2 that is less than or equal to $k$. The precision of the multiplier is set by the control signal $sp$.
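The precision modes described above can be enumerated with a few lines of code. This is an illustrative sketch only: the function name `mpslkom_modes` is hypothetical, and it assumes that $k$ is a power of 2 and that a mode with $(cw/k)$-bit operands runs $k/c$ multiplications in parallel, as the column grouping implies.

```python
def mpslkom_modes(w, k):
    """List (operand_bits, parallel_multiplications) for each precision mode.

    Assumes k is a power of 2 and that w is divisible by k.
    """
    assert k > 0 and k & (k - 1) == 0, "k is assumed to be a power of 2"
    assert w % k == 0, "w is assumed to be divisible by k"
    modes = []
    c = 1
    while c <= k:
        # c columns are grouped into one (c*w/k)-bit multiplier,
        # leaving k/c independent groups.
        modes.append((c * w // k, k // c))
        c *= 2
    return modes
```

For example, `mpslkom_modes(2048, 8)` lists eight 256-bit, four 512-bit, two 1024-bit, and one 2048-bit multiplication, which agrees with the modes reported for the 2048-bit implementation in the results.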
In general, the logic design of the MPSLKOM is almost identical to the logic design of the SLKOM. Thus, only the details of the modified stages are shown in Figure 4. The logic designs of the blocks in Stages 2 and 3 are exactly the same as the logic designs of the SLKOM's Stages 2 and 3. In Figure 4 to k − 2 , when the blocks are combined, the stored value is changed to R2 0 (i + 1) in the leftmost block of the group, and it is kept the same in the other blocks of the group.
Align and add stage: Similar to the SLKOM design, the blocks in this stage are independent of the blocks in the pipelined part. The blocks contain $(w/k)$-bit adders that compute the more significant half of the products. The inputs of the adder are modified as follows. The first input is $CT$ in block 0; it can be either $CT$ or [t − 1]CO in the other blocks. When the blocks are combined, the first input is $CT$ in the rightmost block of the group, and it is [t − 1]CO in the other blocks of the group. In block $k - 1$, the second input is $S_{p+1:2}$. In blocks 0 to $k - 2$, when the blocks are combined, the second input is $S_{p+1:2}$ in the leftmost block of the group, and it is [t + 1]S 1:0 &S2 p−1 in the other blocks of the group. The third input is the same in all blocks when they operate independently. When the blocks are combined, the third input of the adder is the aligned $C_{p:1}$ in the leftmost block of the group, and it is $0 \,\&\, C_{p-1:1}$ in the other blocks of the group.
Results
This section presents the synthesis results for the SLKOM and MPSLKOM implementations and compares them with previous large multiplier designs. VHDL models of the proposed designs are written, and all models are functionally verified by exhaustive simulation. The models are synthesized using the Xilinx ISE tool set and mapped on Virtex FPGAs. For all syntheses, the models are optimized for speed and the target FPGA speed grades are set to -2.
Table 2 compares the standard sequential large multiplier (SSLM) implementations presented in [12] with the SLKOM and MPSLKOM implementations presented in this study. VHDL models of these implementations are mapped on Virtex 5 xc5vfx100t FPGAs. The columns in Table 2 show the operand sizes, the multiplier types, the number of clock cycles, the delays in nanoseconds, and the number and utilization percentages of registers, LUTs, and DSPs. In [12], the clock periods for all SSLM implementations are given in the range of 4.143 to 4.157 ns. The clock periods for all SLKOM and MPSLKOM implementations are equal to 4.159 ns. The total delay for each implementation is equal to $(p/2 + 1)$ clock periods, where $p$ is the number of suboperands. The delay of the "Align and add stage" is not included in the total delay, since this stage is independent of the other stages and can run while they perform the next large multiplication. The SLKOM and MPSLKOM implementations use more hardware resources and require fewer cycles to generate the product than the SSLM implementations. The synthesis results show that the SLKOM implementations are 2.11 to 2.23 times faster and use 55% to 59% more DSP slices than the SSLM implementations. The MPSLKOM implementations have up to 3% more register and LUT utilization than the SLKOM implementations, while both designs use the same number of DSP slices.
Table 3 presents a comparison of the SLKOM implementations with the previous implementations. Since the previous designs were mapped on different Virtex FPGAs, 256-bit and 512-bit SLKOM implementations were mapped on the same models of the Virtex 2, Virtex 4, and Virtex 5 families to make the comparisons fair. The 256-bit SLKOM implementation had a better delay than the referenced previous implementations, except the ones presented in [7] and [11]. Compared with the 256 by 256 SLKOM, the 221 by 221 design reported in Athow and Al-Khalili [7] was 2.75 times faster and used 7 times more DSPs; the 51 by 192 design reported in Gao et al. [11] was 2.64 times faster and used the same number of DSPs. On the other hand, at least six 51 by 192 multipliers are needed to multiply 256-bit operands. The register usage values for most of the previous implementations have not been reported, and thus this parameter is not shown in the resource usage column. However, the pipelined designs are expected to use many more registers than the combinational designs. The register utilization percentages for 256-bit SLKOM implementations are roughly 5% for all Virtex 5 platforms.
Table 4 presents the synthesis results for the 512-bit, 1024-bit, and 2048-bit MPSLKOM implementations on Virtex 5 xc5vfx100t FPGAs. The table presents the following values for each supported precision: the total number of cycles per multiplication, the number of parallel multiplications, the total delay for a single operation, and the delay per multiplication. For each implementation, the minimum operand precision is 256 bits. The delay per multiplication is calculated by dividing the delay for a single multiplication by the number of parallel multiplications. The results show that the delay per multiplication of the 2048-bit MPSLKOM is less than the delay of the fastest 256-bit combinational multiplier [10].
The 2048-bit implementation can also perform 512-bit and 1024-bit multiplications four and two times faster than an SLKOM implementation, respectively. In general, all the MPSLKOM implementations have higher throughput than the SLKOM implementations in low precision operation modes. Since a small amount of extra hardware is enough to convert an SLKOM to an MPSLKOM, MPSLKOMs are expected to be preferred over SLKOMs. Note that instead of an MPSLKOM, multiple low precision SLKOMs can be mapped on an FPGA using approximately the same amount of hardware, but those low precision SLKOMs cannot be used to multiply higher precision operands.
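The delay figures discussed in this section follow from simple arithmetic, which the sketch below reproduces. The function names are hypothetical. The suboperand size $n = 16$ used in the usage note is an assumption: it is not stated in this section, but it is the value for which the $(p/2 + 1)$ cycle formula and the 4.159 ns clock period give the 4.679 ns delay per multiplication quoted for the 2048-bit MPSLKOM in 256-bit mode.

```python
CLOCK_PERIOD_NS = 4.159  # clock period reported for SLKOM and MPSLKOM


def cycles_per_multiplication(w, n):
    """Number of clock periods, (p/2 + 1), for a w-bit multiplication."""
    p = -(-w // n)  # p = ceil(w / n), the number of suboperands
    return p // 2 + 1


def total_delay_ns(w, n, clock_period_ns=CLOCK_PERIOD_NS):
    """Total delay of one w-bit multiplication in nanoseconds."""
    return cycles_per_multiplication(w, n) * clock_period_ns


def delay_per_multiplication_ns(w, n, parallel, clock_period_ns=CLOCK_PERIOD_NS):
    """Delay of a single operation divided by the parallel multiplications."""
    return total_delay_ns(w, n, clock_period_ns) / parallel
```

Under the assumed $n = 16$, a 256-bit multiplication takes 9 clock periods (about 37.43 ns), and dividing by the 8 multiplications that a 2048-bit MPSLKOM runs in parallel in 256-bit mode gives about 4.679 ns per multiplication.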
Conclusion
This paper presented single and multiple precision sequential large multiplier designs for FPGAs (SLKOM and MPSLKOM). Both designs offer significant hardware savings compared with combinational designs, and thus much larger sequential implementations than combinational ones can be mapped on FPGAs. For example, 2048-bit SLKOM and MPSLKOM implementations use 75% of the DSP slices of a Virtex 5 FPGA. We modeled and synthesized 256-bit to 2048-bit implementations of the SLKOM and MPSLKOM designs. The synthesis results show that the speed disadvantage of the sequential implementations can be offset by increasing the throughput. This can be observed from the results of the MPSLKOM implementations. For example, the delay per multiplication for a 2048-bit MPSLKOM implementation was 4.679 ns in 256-bit multiplication mode, which was less than the delay of the fastest combinational implementation. The 2048-bit MPSLKOM implementation can also perform two 1024-bit multiplications or four 512-bit multiplications in parallel.
