Turkish Journal of Electrical Engineering and Computer Sciences
Volume 24

Number 4

Article 65

1-1-2016

Single and multiple precision sequential large multipliers for field-programmable gate arrays
ALİ ŞENTÜRK
MUSTAFA GÖK


Recommended Citation
ŞENTÜRK, ALİ and GÖK, MUSTAFA (2016) "Single and multiple precision sequential large multipliers for
field-programmable gate arrays," Turkish Journal of Electrical Engineering and Computer Sciences: Vol.
24: No. 4, Article 65. https://doi.org/10.3906/elk-1405-107
Available at: https://journals.tubitak.gov.tr/elektrik/vol24/iss4/65


Turkish Journal of Electrical Engineering & Computer Sciences
http://journals.tubitak.gov.tr/elektrik/

Turk J Elec Eng & Comp Sci
(2016) 24: 2961 – 2973
© TÜBİTAK
doi:10.3906/elk-1405-107

Research Article

Single and multiple precision sequential large multipliers for field-programmable
gate arrays
Ali ŞENTÜRK1,∗, Mustafa GÖK2
1 Department of Computer Engineering, Çukurova University, Adana, Turkey
2 Department of Electrical Engineering, Çukurova University, Adana, Turkey

Received: 16.05.2014 • Accepted/Published Online: 18.01.2015 • Final Version: 15.04.2016

Abstract: This paper presents single and multiple precision sequential large multiplier designs for field-programmable
gate arrays. Both designs use the Karatsuba–Ofman method. They are pipelined and can generate a full size (double
operand size) or a single size product. The synthesis results show that the sequential large Karatsuba–Ofman multiplier
(SLKOM) implementations have up to 2.23 times less delay compared with the standard sequential large multipliers
implementations presented in previous research. The 2048-bit multiple precision sequential Karatsuba–Ofman large
multiplier (MPSLKOM) implementation can simultaneously execute eight 256-bit multiplications. The MPSLKOM
implementations use roughly 1% more registers and up to 3% more LUTs than the SLKOM implementations.
Key words: Large multipliers, Karatsuba–Ofman, field-programmable gate array

1. Introduction
Large operands are widely used in scientific, cryptography, multimedia, and signal processing applications.
Multiplication is one of the most used arithmetic operations in these applications [1,2]. General purpose
processors do not contain large multipliers. To compensate for the lack of the hardware, special software
routines or multiple-precision arithmetic libraries can be used to perform the multiplication of large operands
(The GNU Multiple Precision Arithmetic Library). These routines decompose the large operands into standard
size suboperands and perform multiple suboperand multiplications; the products of these multiplications are
aligned and summed to generate the large product. There are algorithms faster than this simple method [3,4];
however, they are also constrained to use the standard size multipliers. Thus, the software-only approach
becomes extremely time-consuming when a vast number of large multiplications are executed in applications.
Therefore, there is a genuine need for large multipliers that work fast and use as little logic as possible.
The recent work on the design of large multipliers focuses on field-programmable gate array (FPGA)
implementations due to their rapid design and flexibility advantages [5–12]. A brief discussion of the previous
work is provided below.
In [5], hybrid sequential large multipliers are designed using Broadcast (decomposition method’s implementation) and Karatsuba–Ofman (KO) multiplier blocks. Various combinations of these multiplier blocks are
tried out to implement 256-bit multipliers. Among these implementations, the one that uses four hierarchical
stages of KO multipliers is the fastest, but uses the most logic resources; the implementation that uses two
hierarchical stages of two broadcast multipliers is the slowest.
∗ Correspondence: asenturk@cu.edu.tr

ŞENTÜRK and GÖK/Turk J Elec Eng & Comp Sci

In [6], combinational large multiplier and squarer designs that use the decomposition method are presented. Both designs use fast adder trees to sum the partial products generated by suboperand multiplications;
20-bit to 85-bit multiplier implementations are mapped on Spartan-3 FPGAs.
In [7], another combinational large multiplier design that uses the decomposition method is presented.
The design method exploits the structure of the arithmetic slices and the fast carry chains provided in Virtex
4 FPGAs; 16-bit to 221-bit implementations of the proposed design are mapped on the FPGAs.
In [8], a bit serial large multiplier design is presented. The design uses carry save adders to perform the
addition of partial product bits. The product bits are converted on the fly from borrow save format to two’s
complement format; 128-bit to 1024-bit implementations are mapped on Virtex 2 FPGAs.
In [9], truncated large multiplier designs for high-precision floating-point multiplication are presented.
The truncated multipliers can be used by applications that tolerate truncation error. The study modifies the
KO method and applies it to both multiplication and squaring operations; 23-bit, 52-bit, and 112-bit pipelined
implementations are synthesized and mapped on Virtex 4 FPGAs.
In [10], a large multiplier design that uses a modified KO method for high-precision floating-point
multiplication is presented. A 128-bit quadruple-precision mantissa multiplier has been constructed using one
66-bit and two 65-bit multipliers instead of four 64-bit multipliers.
In [11], two combinational signed large multiplier designs for FPGAs are presented. The first design
uses symmetric multiplier blocks, while the second one uses asymmetric multiplier blocks; 51 by 68 to 51 by
190 multiplier implementations are mapped on Virtex 5 FPGAs.
In [12], three sequential large multiplier designs for FPGAs are presented. The paper uses the modified
decomposition method and presents the speed-area tradeoff among those designs; 256-bit to 2048-bit implementations are synthesized and mapped on Virtex 5 FPGAs.
The main aspects of the previous work are summarized in Table 1. The columns of the table show the
following: the reference of the work, the type of multiplication method, the target FPGA device, the size, delay,
and the resource usage of the largest implementation mapped on the target platform. The resource usage is
expressed in terms of the number of slices, LUTs, and utilized embedded multipliers.
Table 1. Previous work on large multipliers.

                                                            Maximum implementation
Publication                      Mult. method    FPGA       Size (bits)  Delay (ns)  Hardware (slices, LUTs, muls.)
-------------------------------  --------------  ---------  -----------  ----------  -------------------------------
Quan et al., 2005, [5]           Hybrid          Virtex 2   256          380         17564 slices, 144 mul.
Gao et al., 2007, [6]            Decomposed      Spartan 3  85           22          400 LUT, 15 mul.
Athow and Al-Khalili, 2008, [7]  Decomposed      Virtex 4   221          16          60000 slices, 169 mul.
Bessalah et al., 2008, [8]       Serial          Virtex 2   1024         12091       2688 (CLBs)
Banescu et al., 2011, [9]        Truncated KO    Virtex 5   112          3           2497 LUT, 19 mul.
Jaiswal and Cheung, 2012, [10]   KO, Div. Conq.  Virtex 4   130          3           2685 slices, 3505 LUTs, 27 mul.
Gao et al., 2012, [11]           Decomposed      Virtex 5   192          14          1150 LUT, 24 mul.
Senturk and Gok, 2012, [12]      Decomposed      Virtex 5   2048         570         8245 LUT, 121 DSP

The delays for the designs are rounded to the nearest integer and given in nanoseconds. Quan et al. [5]
and Athow and Al-Khalili’s [7] designs use excessive amounts of hardware resources. The multiplier proposed by
Bessalah et al. [8] can support 1024-bit multiplication, but is extremely slow. The designs reported in Banescu
et al. [9] and in Jaiswal and Cheung [10] are the fastest, but they are designed for floating point multiplication
and they cannot multiply operands larger than 130 bits. The design reported in Gao et al. [11] is approximately
five times slower than the fastest implementations and suffers from the same limited operand size shortcoming.
The previous large multiplier designs are mostly combinational and achieve high execution speeds by
liberally using FPGA resources. Currently, a 256-bit multiplier is the largest combinational implementation
that can be mapped on a Virtex 5 FPGA by using all arithmetic slices. However, in practice all the arithmetic
slices cannot be dedicated only to multiplier logic. Another issue is that the performances of the previous
designs usually depend on some key attributes of the platforms such as the existence of the fast carry chains
and the size of the built-in multipliers. Model-dependent optimization may not give the same results on all
platforms, since even the members of the same FPGA family can have structural differences. Especially when
the resources are very limited, the sequential designs are good alternatives to combinational designs. They
require a relatively small amount of resources, they can be mapped on any FPGA model, and they can multiply
operands of any size as long as there exist enough resources for storage. Naturally, the sequential designs have
higher latency compared with the combinational designs. On the other hand, pipelining and using fast methods
such as the KO method can improve the performance of sequential designs. This paper presents single and
multiple precision sequential large multiplier designs that explore this niche. The proposed designs decompose
the large operands and use the KO algorithm to multiply the suboperands. The designs are pipelined to achieve
maximum clock frequency. Both of them can generate full size products, a function that is not mentioned
in most of the previous work; 256-bit, 512-bit, 1024-bit, and 2048-bit implementations of the proposed designs
are mapped on FPGAs. The synthesis results are compared against the synthesis results reported for previous large
multiplier implementations. The rest of the paper is organized as follows: Section 2 presents the sequential large
multiplication method, Section 3 presents its implementation, Section 4 presents the implementation of the
multiple precision large multiplier, Section 5 gives delay and hardware usage results, and Section 6 presents the
conclusion.
2. The sequential large KO multiplication (SLKOM)
The SLKOM algorithm first performs suboperand multiplications and then adds their products. A brief
explanation for the decomposition method is given in the following: assume that w -bit large operands A and
B are decomposed into n -bit suboperands. A and B can be expressed as the summation of the suboperands
as:
$$A = \sum_{i=0}^{p-1} A_i \cdot 2^{ni}, \qquad B = \sum_{j=0}^{p-1} B_j \cdot 2^{nj} \tag{1}$$

where $A_i$ and $B_j$ represent the $i$th and $j$th suboperands of A and B, respectively, and p represents the number
of suboperands and is computed using $p = \lceil w/n \rceil$. The multiplication of A and B generates a 2w-bit product,
M, which can also be expressed as the sum of the suboperand multiplications as:

$$M = \sum_{i=0}^{p-1} \sum_{j=0}^{p-1} A_i \cdot B_j \cdot 2^{n(i+j)} \tag{2}$$

The computation of M using Eq. (2) requires $p^2$ n-bit multiplications and $(p^2 - 1)$ 2n-bit additions.
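The decomposition in Eqs. (1) and (2) can be sketched in software as follows. This is a minimal illustration rather than the hardware design: the function names `split` and `decomposed_multiply` are hypothetical, and Python's arbitrary-precision integers stand in for the n-bit multipliers and the 2n-bit adders.

```python
import random


def split(x, n, p):
    """Split x into p suboperands of n bits each, least significant first (Eq. 1)."""
    mask = (1 << n) - 1
    return [(x >> (n * i)) & mask for i in range(p)]


def decomposed_multiply(a, b, w, n):
    """Multiply two w-bit integers using only n-bit by n-bit products (Eq. 2)."""
    p = -(-w // n)  # p = ceil(w / n)
    a_parts = split(a, n, p)
    b_parts = split(b, n, p)
    product = 0
    mults = 0
    for i in range(p):
        for j in range(p):
            # Each suboperand product is aligned by 2^(n(i+j)) before summation.
            product += (a_parts[i] * b_parts[j]) << (n * (i + j))
            mults += 1
    return product, mults


random.seed(0)
a, b = random.getrandbits(256), random.getrandbits(256)
m, mults = decomposed_multiply(a, b, w=256, n=16)
print(m == a * b, mults)
```

The multiplication counter returned by the sketch equals $p^2$, which matches the cost stated for Eq. (2).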
The decomposition method is modified for KO implementation as follows: let $A_{2i+1}A_{2i}$ and $B_{2j+1}B_{2j}$
be 2n-bit suboperands obtained by concatenating the n-bit suboperands $A_{2i}$, $A_{2i+1}$ and $B_{2j}$, $B_{2j+1}$, respectively.

Eq. (1) is rewritten as:
$$A = \sum_{i=0}^{p/2-1} \left[A_{2i} + 2^{n} A_{2i+1}\right] \cdot 2^{2ni}, \qquad B = \sum_{j=0}^{p/2-1} \left[B_{2j} + 2^{n} B_{2j+1}\right] \cdot 2^{2nj} \tag{3}$$

Three terms are defined using the suboperands as:
$$\begin{aligned}
P(2i, 2j) &= A_{2i} \cdot B_{2j}, \qquad P(2i+1, 2j+1) = A_{2i+1} \cdot B_{2j+1} \\
P(2i+1, 2j) + P(2i, 2j+1) &= [A_{2i+1} + A_{2i}] \cdot [B_{2j+1} + B_{2j}] - P(2i+1, 2j+1) - P(2i, 2j)
\end{aligned} \tag{4}$$

Eq. (2) is rewritten using these terms as:
$$M = \sum_{i=0}^{p/2-1} \sum_{j=0}^{p/2-1} \left[P(2i, 2j) + 2^{n}\bigl(P(2i+1, 2j) + P(2i, 2j+1)\bigr) + 2^{2n} P(2i+1, 2j+1)\right] \cdot 2^{2n(i+j)} \tag{5}$$

The computation of M using Eq. (5) requires $0.5p^2$ n-bit multiplications, $0.25p^2$ (n + 1)-bit multiplications, and
$1.5p^2$ additions.
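The KO form of the decomposition can be checked with a short software sketch. It assumes that p is even (a zero suboperand is appended otherwise) and weights the middle term by $2^n$, as follows from expanding the bracketed terms of Eq. (3). The names `ko_decomposed_multiply` and `counts` are hypothetical, and the sketch only models the arithmetic of Eqs. (3) to (5), not the pipelined datapath.

```python
import random


def split(x, n, p):
    """Split x into p suboperands of n bits each, least significant first."""
    mask = (1 << n) - 1
    return [(x >> (n * i)) & mask for i in range(p)]


def ko_decomposed_multiply(a, b, w, n):
    """Multiply two w-bit integers with the pairwise KO terms of Eq. (4)."""
    p = -(-w // n)   # p = ceil(w / n)
    p += p % 2       # assumption: pad with a zero suboperand if p is odd
    a_parts = split(a, n, p)
    b_parts = split(b, n, p)
    counts = {"n_bit": 0, "n_plus_1_bit": 0}
    product = 0
    for i in range(p // 2):
        a_lo, a_hi = a_parts[2 * i], a_parts[2 * i + 1]
        for j in range(p // 2):
            b_lo, b_hi = b_parts[2 * j], b_parts[2 * j + 1]
            p_ll = a_lo * b_lo                      # P(2i, 2j)
            p_hh = a_hi * b_hi                      # P(2i+1, 2j+1)
            counts["n_bit"] += 2
            # One (n+1)-bit product replaces the two cross products.
            p_mid = (a_hi + a_lo) * (b_hi + b_lo) - p_hh - p_ll
            counts["n_plus_1_bit"] += 1
            term = p_ll + (p_mid << n) + (p_hh << (2 * n))
            product += term << (2 * n * (i + j))
    return product, counts


random.seed(1)
a, b = random.getrandbits(256), random.getrandbits(256)
m, counts = ko_decomposed_multiply(a, b, w=256, n=16)
print(m == a * b, counts)
```

For w = 256 and n = 16 (so p = 16), the counters come out as 128 n-bit and 64 (n + 1)-bit multiplications, that is, $0.5p^2$ and $0.25p^2$.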
Algorithm 1 shows the steps and the data flow in time and space for the SLKOM. In this algorithm, ‘&’
represents the concatenation operation, ‘$\{0\}^{n-3}$’ represents a string of n − 3 zeros, and the subscript notation
‘x : p’ represents the string of bits from position x to p. For example, $M_{2n-1:n}$ represents the bits from
positions 2n − 1 to n. The algorithm consists of two parts. The first part generates the w less significant product
bits. The second part generates the w more significant product bits when needed. The inner loops in the first part
are not iterated in time; the iterations in these loops show the inputs and outputs of the multipliers and adders.
For example, the multiplication $A_0 \cdot B_0$ is performed by Multiplier 0 at iteration 0, and the multiplication
$A_0 \cdot B_{p-1}$ is performed by Multiplier p − 1 at iteration 0. In the first part, each iteration of the outer loop
generates 2n bits of the product. In the second part, the loop shows how the carry and sum values (C and S)
are aligned and combined into two vectors, CN and SN, respectively. These vectors are added to generate the w
more significant product bits.
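The two-part structure described above can be illustrated with a behavioral sketch. It is not a transcription of Algorithm 1: it assumes w is a multiple of 2n, keeps the running total in a single integer instead of separate sum and carry vectors, and uses Python's built-in multiplication for each row. The name `sequential_multiply` is hypothetical.

```python
def sequential_multiply(a, b, w, n):
    """Behavioral model: emit 2n low product bits per iteration, then the high half."""
    assert w % (2 * n) == 0, "sketch assumes w is a multiple of 2n"
    iterations = w // (2 * n)          # p/2 iterations of the outer loop
    chunk_mask = (1 << (2 * n)) - 1
    acc = 0                            # stands in for the stored sum and carry values
    low = 0                            # stands in for the shift register holding low bits
    for i in range(iterations):
        # Take the 2n-bit suboperand A_{2i+1} & A_{2i} and multiply it by all of B.
        a_pair = (a >> (2 * n * i)) & chunk_mask
        acc += a_pair * b
        # The lowest 2n bits are final: no later row can change them.
        low |= (acc & chunk_mask) << (2 * n * i)
        acc >>= 2 * n
    high = acc                         # second part: the w more significant bits
    return low, high


a = (1 << 64) - 1
b = (1 << 64) - 3
low, high = sequential_multiply(a, b, w=64, n=8)
print(((high << 64) | low) == a * b)
```

After p/2 iterations the retired chunks form the less significant half of the product, and what remains in the accumulator is the more significant half, which is why the second part is needed only when a full size product is requested.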
3. Implementation of a SLKOM
Figure 1 shows the block diagram for the SLKOM. The design has two main parts. The first part has five
pipeline stages and when the pipeline is filled this part computes the less significant w -bits of the product in
p/2 cycles. Moreover, an extra cycle is needed between large multiplications to reset the registers that hold
values left by the previous multiplication. The second part is called “Align and add stage” and it computes the
more significant w bits of the product. The units and their functions in all stages are explained as follows:
Stage 1: In this stage, two w-bit registers, R1 and R0, are used to store operands A and B, respectively.
R1 is a right-shift register, which shifts 2n bits in each cycle. Moreover, in the first stage, p/2 + 1 n-bit adders
perform the additions $(A_{2j} + A_{2j+1})$, $(B_0 + B_1)$, $(B_2 + B_3)$, …, $(B_{p-4} + B_{p-3})$, $(B_{p-2} + B_{p-1})$.
Stage 2: In the second stage, the even numbered n-bit multipliers multiply the suboperand $A_{2j}$ by
the suboperands $B_0, B_2, \ldots, B_{p-4}, B_{p-2}$. The product generated by an even numbered n-bit multiplier j is
represented as $PL_j$. The odd numbered n-bit multipliers multiply the suboperand $A_{2j+1}$ by the suboperands
$B_1, B_3, \ldots, B_{p-3}, B_{p-1}$. The product generated by an odd numbered n-bit multiplier j is represented as $PH_j$.
Furthermore, p/2 (n + 1)-bit multipliers multiply the outputs of the adders generated in Stage 1. The output
of the first adder, R1S, is multiplied by the outputs of the other adders, $R0S_j$. The products generated by these
multiplications are represented as $PT_j$.

Algorithm 1. Sequential KO large multiplication.

Figure 1. Block diagram for the sequential Karatsuba–Ofman large multiplier. [Diagram not reproduced; it shows the five pipeline stages, the R1 and R2 right shift registers, and the align & add stage that outputs $M_{2w-1:w}$.]

Figure 2. An example of addition of partial products. [Diagram not reproduced; it shows the PLL, PLH, PML, PMH, PHL, and PHH parts and the S and C outputs of $MOA_0$ to $MOA_4$ over the first and second iterations.]
Stage 3: This stage consists of p/2 3-operand subtractors. Each subtractor j computes $PM_j = PT_j - PH_j - PL_j$.
Stage 4: This stage consists of p multioperand adders (MOAs) that sum the products PL, PM, and
PH generated in the previous stages and the outputs of the MOAs generated in the previous cycle. The sum
and carry outputs of $MOA_j$ at cycle i are represented as $S_j(i)$ and $C_j(i)$. To align the inputs of the MOAs,
the PL, PM, and PH values are further divided into low and high parts as PLL, PLH, PML, PMH, PHL,
and PHH, respectively. $MOA_0$ adds $PLL_0$, $S_2(i-1)$, $C_1(i-1)$, and a carry bit, CT, which is generated by
the adder located in Stage 5. $MOA_1$ adds $PML_0$, $PLH_0$, $S_3(i-1)$, and $C_2(i-1)$. The rest of the MOAs
except $MOA_p$ have five inputs. The even numbered MOAs add $PLL_{j/2}$, $PMH_{j/2-1}$, $PHL_{j/2-1}$, $S_{j+2}(i-1)$,
and $C_{j+1}(i-1)$; the odd numbered MOAs add $PML_{(j-1)/2}$, $PLH_{(j-1)/2}$, $PHH_{(j-3)/2}$, $S_{j+2}(i-1)$, and
$C_{j+1}(i-1)$, where $2 \le j \le p-1$. $MOA_p$ adds $PHL_{p/2-1}$ and $PMH_{p/2-1}$. All MOAs generate 3-bit carries.
Example 1. Figure 2 shows an example for the alignment and addition of MOA inputs. It is assumed that there
are five MOAs and the partial products are generated by the multiplication of operands A = A(3)A(2)A(1)A(0)
and B = B(3)B(2)B(1)B(0). The $S_j(i)$ and $C_j(i)$ values represent the sum and carry outputs of MOA j at
iteration i, respectively. $S_1(i)$, $S_0(i)$, and $C_0(i)$ are added in the fifth stage. $PHH_3$ and $C_4(2)$ are added
in the next iteration.
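The alignment of the low and high parts in Stage 4 can be illustrated for a single pair of suboperands. The sketch below is an assumption-laden simplification: it handles one KO unit only, ignores the feedback of the previous cycle's S and C values, and uses hypothetical names (`split_low_high`, `aligned_columns`, `combine`). Each list entry collects the parts that share the same n-bit weight, as in the columns of Figure 2.

```python
def split_low_high(x, n):
    """Return the low n bits of x and the remaining high part."""
    return x & ((1 << n) - 1), x >> n


def aligned_columns(a_lo, a_hi, b_lo, b_hi, n):
    """Group the parts of PL, PM, and PH by n-bit column weight."""
    pl = a_lo * b_lo                               # Stage 2, even multiplier
    ph = a_hi * b_hi                               # Stage 2, odd multiplier
    pt = (a_lo + a_hi) * (b_lo + b_hi)             # Stage 2, (n+1)-bit multiplier
    pm = pt - ph - pl                              # Stage 3 subtractor
    pll, plh = split_low_high(pl, n)
    pml, pmh = split_low_high(pm, n)
    phl, phh = split_low_high(ph, n)
    # Column c carries weight 2^(n*c); PMH can be one bit wider than n.
    return [pll, plh + pml, pmh + phl, phh]


def combine(columns, n):
    """Sum the columns with their weights to recover the 4n-bit product."""
    return sum(col << (n * idx) for idx, col in enumerate(columns))


n = 8
a_lo, a_hi, b_lo, b_hi = 0xFF, 0xA7, 0x3C, 0xFE
cols = aligned_columns(a_lo, a_hi, b_lo, b_hi, n)
print(combine(cols, n) == ((a_hi << n) | a_lo) * ((b_hi << n) | b_lo))
```

The grouping shows why PLH and PML land in the same adder and why PMH and PHL do: they have the same weight once PM is shifted by n bits and PH by 2n bits.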
Stage 5: This stage consists of an n-bit adder and a w-bit right-shift register (R2). In every cycle,
2n bits of the product are generated by adding $S_1(i-1)$ and $C_0(i-1)$, and concatenating their sum with
$S_0(i-1)$. The result is shifted into R2. The carry-out of the n-bit adder, CT, is added
by $MOA_0$ in Stage 4. After p/2 iterations, the w-bit right-shift register R2 holds the less significant half
of the product. Then, the $S_i$ and $C_i$ values generated in this stage can be used to calculate the more significant half of
the product in the “Align and add stage”.
Align and add stage: This stage is independent from the pipelined structure. The align and add stage
can compute the w more significant bits of the product while the first part is multiplying another large operand.
The w-bit carry-propagate adder (CPA) located in this stage adds the $S_{p+1:2}$ and $C_{p:1}$ vectors with the carry bit CT. This addition can
also be carried out sequentially as long as the delay for the computation is less than the delay for the first part. In
this way, a smaller adder than the current one can be used in the implementation.
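The idea of keeping intermediate results as separate sum and carry vectors and resolving them with one final carry-propagate addition can be sketched as below. The sketch uses a generic 3:2 carry-save compressor on whole integers, which is an assumption made for illustration; the MOAs described above are n-bit adders that produce 3-bit carries, and their internal structure is not modeled here. The names `carry_save_add` and `reduce_to_sum_and_carry` are hypothetical.

```python
import random


def carry_save_add(x, y, z):
    """3:2 compressor: return (sum, carry) with sum + carry == x + y + z."""
    s = x ^ y ^ z                                  # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1         # carries, shifted to their weight
    return s, c


def reduce_to_sum_and_carry(operands):
    """Accumulate operands while keeping the total in sum and carry form."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s, c


random.seed(2)
operands = [random.getrandbits(64) for _ in range(10)]
s, c = reduce_to_sum_and_carry(operands)
# A single carry-propagate addition resolves the two vectors at the end.
print(s + c == sum(operands))
```

Deferring carry propagation to one final addition is what allows that addition to run outside the pipeline, or sequentially on a narrower adder, without affecting the per-cycle delay of the accumulation.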
4. Implementation of a multiple-precision SLKOM (MPSLKOM)
The SLKOM design can also be used to perform low precision multiplications. For example, a 2048-bit SLKOM
can multiply operands smaller than 2048 bits by setting the unused inputs to zeroes and decreasing the number
of iterations. However, this method is not very efficient, since the hardware that processes the zero inputs
does not really contribute to the computation. This problem is solved by modifying the SLKOM design. The
modified design is called MPSLKOM.
Figure 3 shows the block diagram for the MPSLKOM design. Similar to the SLKOM implementation,
the design has five pipeline stages. Each stage consists of k blocks that can process (w/k) -bit operands. At
the lowest precision, each column functions as an independent (w/k) -bit multiplier and executes k parallel
multiplications. When the operand precision is doubled, columns are paired and each pair of columns functions
as a (2w/k)-bit multiplier. At the highest precision, all the columns are combined and function as a single
w -bit multiplier. In general, the MPSLKOM design can multiply (cw/k) -bit operands, where c is any power
of 2 that is less than or equal to k . The precision of the multiplier is set by using the control signal sp .
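The supported precisions and the number of parallel multiplications in each mode follow directly from w and k. The sketch below enumerates them under two assumptions that are not spelled out in the text: k is a power of 2, and combined blocks are always adjacent. The names `precision_modes` and `block_groups` are hypothetical.

```python
def precision_modes(w, k):
    """List (operand_bits, parallel_multiplications) for c = 1, 2, 4, ..., k."""
    modes = []
    c = 1
    while c <= k:
        modes.append((c * w // k, k // c))
        c *= 2
    return modes


def block_groups(k, c):
    """Group block indices 0..k-1 into runs of c adjacent blocks (assumed layout)."""
    return [list(range(start, start + c)) for start in range(0, k, c)]


for bits, parallel in precision_modes(w=2048, k=8):
    print(f"{bits}-bit operands: {parallel} parallel multiplication(s)")
print(block_groups(k=8, c=2))
```

For w = 2048 and k = 8, the enumeration yields 256-bit, 512-bit, 1024-bit, and 2048-bit modes with 8, 4, 2, and 1 parallel multiplications, respectively.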
In general, the logic design of the MPSLKOM is almost identical to the logic design of the SLKOM. Thus,
only the details of the modified stages are shown in Figure 4. The logic designs of the blocks in Stages 2 and 3
are exactly the same as the logic design of the SLKOM’s Stages 2 and 3. In Figure 4, [t − 1] and [t + 1] represent
the previous and the next blocks, respectively. The details of the modifications are explained as follows:
Stage 4: As in the SLKOM implementation, an array of MOAs is used to perform the addition of
partial products and the outputs of MOAs generated in the previous cycle. In the MPSLKOM implementation,
the MOAs 0, 1, p − 2 and p − 1 are modified. They have extra inputs designated by dotted and dashed boxes.
The extra inputs are mutually exclusive. The dotted boxes show the extra inputs that exist only in block 0 or
block k − 1; the dashed boxes show the extra inputs that exist in the rest of the blocks. The multiplexers in the
boxes select one of the two input signals based on the precision of the operands. As explained above, the blocks
are combined when the precision is increased. The details of the modifications made in the MOAs are given in
the following:
• $MOA_0$: The first input of $MOA_0$ is always CT in block 0. It can be either CT or 0 in the other blocks.
When the blocks are combined, the first input of $MOA_0$ is CT in the rightmost block of the group,
and it is 0 in the other blocks of the group. The second input exists in blocks 1 to k − 1. When the
blocks are combined, the second input of $MOA_0$ is 0 in the rightmost block of the group, and it is
$[t-1](C_p \,\&\, S_p)$ in the other blocks of the group.
• $MOA_1$: The first input of this adder exists in blocks 1 to k − 1. When the blocks are combined, the first
input of $MOA_1$ is $[t-1]S_{p+1}(i)$ in the rightmost block of the group, and it is 0 in the other blocks
of the group.
• $MOA_{p-2}$: The second input of this adder in block k − 1 is $S_p(i-1)$. In blocks 0 to k − 2, when the
blocks are combined, the second input of $MOA_{p-2}$ is $S_p(i-1)$ in the leftmost block of the group,
and it is $[t+1]S_0(i-1)$ in the other blocks of the group.
• $MOA_{p-1}$: The first input of this adder in block k − 1 is $C_p(i-1)$. It can be either $C_p(i-1)$ or
$[t+1]C_0(i-1)$ in the other blocks. When the blocks are combined in a group, the first input of
$MOA_{p-1}$ is $C_p(i-1)$ in the leftmost block of the group, and it is $[t+1]C_0(i-1)$ in the other blocks
of the group. The second input of $MOA_{p-1}$ in block k − 1 is $S_{p+1}(i-1)$, and it can be either
$S_{p+1}(i-1)$ or $[t+1]S_1(i-1)$ in the other blocks. When the blocks are combined, the second input of
$MOA_{p-1}$ is $S_{p+1}(i-1)$ in the leftmost block, and it is $[t+1]S_1(i-1)$ in the other blocks of the group.
Stage 5: The block of this stage consists of an n-bit adder and a (w/k)-bit right-shift register R2. In
each cycle, 2n bits of the product are generated by adding $S_1(i-1)$ and $C_0(i-1)$ and concatenating the sum
with $S_0(i-1)$. This value is shifted into R2. The carry-out of the adder is added by $MOA_0$. In blocks 0
to k − 2, when the blocks are combined, the stored value is changed to $R2_0(i+1)$ in the leftmost block of the
group, and it is kept the same in the other blocks of the group.
Align and add stage: Similar to the SLKOM design, the blocks in this stage are independent from
the blocks in the pipelined part. The blocks contain (w/k)-bit adders that compute the more significant half
of the products. The inputs of the adder are modified as follows: the first input is CT in block 0. It can
be either CT or $[t-1]CO$ in the other blocks. When the blocks are combined, the first input is CT in the
rightmost block of the group, and it is $[t-1]CO$ in the other blocks of the group. In block k − 1, the second
input is $S_{p+1:2}$. In blocks 0 to k − 2, when the blocks are combined, the second input is $S_{p+1:2}$ in the leftmost
block of the group, and it is $[t+1]S_{1:0} \,\&\, S_{p-1:2}$ in the other blocks of the group. The third input is the same in
all blocks. When the blocks are combined, the third input of the adder is the aligned $C_{p:1}$ in the leftmost block of
the group, and it is $0 \,\&\, C_{p-1:1}$ in the other blocks of the group.
5. Results
This section presents the synthesis results for the SLKOM and the MPSLKOM implementations and their
comparisons with previous large multiplier designs. VHDL models for the implementations of the proposed
designs are written. All models are functionally verified by exhaustive simulation. The models are
synthesized using the Xilinx ISE tool set and mapped on Virtex FPGAs. For all syntheses, the models are optimized
for speed and the target FPGA speed grades are set to –2.
Table 2 presents the comparison between the standard sequential large multiplier (SSLM) implementations presented in [12] and the SLKOM and MPSLKOM implementations presented in the present study. VHDL
models of these implementations are mapped on Virtex 5 xc5vfx100t FPGAs. The columns in Table 2 show the
operand sizes, the multiplier types, the number of clock cycles, the delays in nanoseconds, and the number and
utilization percentages of registers, LUTs, and DSPs. In [12], the clock periods for all SSLM implementations
are given in the range of 4.143 to 4.157 ns. The clock periods for all SLKOM and MPSLKOM implementations
are equal to 4.159 ns. The total delay for each implementation is equal to (p/2 + 1) clock periods, where p
is the number of the suboperands. The delay for the “Align and add stage” is not taken into account for the
calculation of the total delay since this stage is independent from the other stages and it can run while the
other stages perform the next large multiplication. The SLKOM and MPSLKOM implementations use more
hardware resources and require fewer cycles to generate the product than the SSLM implementations. The
synthesis results show that the SLKOM implementations are 2.11 to 2.23 times faster and use 55% to 59% more
DSP slices than the SSLM implementations. The MPSLKOM implementations have up to 3% more register
and LUT utilization compared to the SLKOM implementations, while both designs’ implementations use the
same number of DSP slices.
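The delay figures quoted above can be reproduced from the cycle formula and the clock period. In the sketch below, the SSLM delays are copied from Table 2, while the values of p (32, 64, and 128) are an assumption inferred from the reported cycle counts through cycles = p/2 + 1; the text does not state the suboperand size directly. The names `slkom_cycles`, `slkom_delay_ns`, and `speedup_over_sslm` are hypothetical.

```python
CLOCK_NS = 4.159  # clock period reported for the SLKOM and MPSLKOM implementations

# SSLM delays in ns, copied from Table 2.
SSLM_DELAY_NS = {512: 157.434, 1024: 294.508, 2048: 569.509}

# Assumption: p inferred from the cycle counts in Table 2 (cycles = p/2 + 1).
SLKOM_P = {512: 32, 1024: 64, 2048: 128}


def slkom_cycles(p):
    """Cycles per large multiplication: p/2 iterations plus one reset cycle."""
    return p // 2 + 1


def slkom_delay_ns(p, clock_ns=CLOCK_NS):
    """Total delay, excluding the align and add stage."""
    return slkom_cycles(p) * clock_ns


def speedup_over_sslm(size):
    """Ratio of the SSLM delay to the SLKOM delay for one operand size."""
    return SSLM_DELAY_NS[size] / slkom_delay_ns(SLKOM_P[size])


for size in (512, 1024, 2048):
    print(size, slkom_cycles(SLKOM_P[size]),
          f"{slkom_delay_ns(SLKOM_P[size]):.3f} ns",
          f"{speedup_over_sslm(size):.2f}x")
```

Under these assumptions the ratios span the 2.11 to 2.23 range reported above, with the largest gain at 512 bits.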
Table 2. Comparison of the SSLM, SLKOM, and MPSLKOM implementations (Virtex 5).

                                       Register        LUT             DSP
Size  Type      Cycles  Delay (ns)  Num.   Ut.%    Num.   Ut.%    Num.  Ut.%
----  --------  ------  ----------  -----  ----    -----  ----    ----  ----
512   SSLM      38      157.434     2732   4%      2125   3%      31    12%
512   SLKOM     17      70.703      6248   10%     3913   6%      48    19%
512   MPSLKOM   17      70.703      6445   10%     4217   7%      48    19%
1024  SSLM      71      294.508     5342   8%      4165   7%      61    24%
1024  SLKOM     33      137.247     12392  19%     7801   12%     96    38%
1024  MPSLKOM   33      137.247     12931  20%     8603   13%     96    38%
2048  SSLM      137     569.509     10562  16%     8245   12%     121   47%
2048  SLKOM     65      270.335     24680  39%     15609  24%     192   75%
2048  MPSLKOM   65      270.335     25903  40%     17180  27%     192   75%

Figure 3. Block diagram for the multiple-precision sequential Karatsuba–Ofman large multiplier. [Diagram not reproduced; it shows Stages 1 to 5 and the align & add (A. & A.) stage, each divided into blocks 0 to k − 1 of w/k bits, with the control signal sp.]

Table 3 presents a comparison of the SLKOM implementations with the previous implementations. Since
the previous designs were mapped on different Virtex FPGAs, to make fair comparisons, 256-bit and 512-bit
SLKOM implementations were mapped on the same models of Virtex 2, Virtex 4, and Virtex 5 families. The
256-bit SLKOM implementation had better delay than the referenced previous implementations, except the
ones presented in [7] and [11]. Compared with the 256 by 256 SLKOM, the 221 by 221 design reported in
Athow and Al-Khalili [7] was 2.75 times faster and used 7 times more DSPs; the 51 by 192 design reported in
Gao et al. [11] was 2.64 times faster and used the same number of DSPs. On the other hand, at least six 51
by 192 multipliers are needed to multiply 256-bit operands. The register usage values for most of the previous
implementations have not been reported, and thus this parameter is not shown in the resource usage column.
However, the pipelined designs are expected to use many more registers than the combinational designs. The
register utilization percentages for 256-bit SLKOM implementations are roughly 5% for all Virtex 5 platforms.
Table 3. Comparison of SLKOM with previous implementations.

Presented in               FPGA      Size       Hardware                       Delay (ns)
-------------------------  --------  ---------  -----------------------------  ----------
Quan et al., [5]           Virtex 2  256 × 256  17564 Slices                   380
SLKOM                      Virtex 2  256 × 256  2539 Slices, 24 18 × 18 Mults  67
Jaiswal and Cheung, [10]   Virtex 4  130 × 130  2685 Slices, 24 DSPs           47
Athow and Al-Khalili, [7]  Virtex 4  221 × 221  60000 Slices, 169 DSPs         16
SLKOM                      Virtex 4  256 × 256  2541 Slices, 24 DSPs           44
Gao et al., [11]           Virtex 5  51 × 192   1150 LUTs, 24 DSPs             14
Senturk and Gok, [12]      Virtex 5  256 × 256  816 LUTs, 16 DSPs              78
SLKOM                      Virtex 5  256 × 256  2018 LUTs, 24 DSPs             37

Table 4 presents the syntheses results for MPSLKOM 512-bit, 1024-bit, and 2048-bit implementations
on Virtex 5 xc5vfx100t FPGAs. The table presents the following values for each supported precision: the
total number of cycles per multiplication, the number of parallel multiplications, the total delay for a single
operation, and the delay per multiplication. For each implementation, the minimum operand precision is 256
bits, the delay/multiplication is calculated by dividing the delay for a single multiplication by the number
of parallel multiplications. The results show that the 2048-bit MPSLKOM’s delay/multiplication is less than
the delay/multiplication of the fastest 256-bit combinational multiplier’s delay/multiplication [10]. The 2048bit implementation can also perform 512-bit, 1024-bit multiplications 4 and 2 times faster than an SLKOM
implementation, respectively. In general, all the MPSLKOM implementations have higher throughput than the
SLKOM implementations in low precision operation modes. Since a small amount of extra hardware is enough
Table 4. Synthesis results for MPSLKOM implementations.

Max Size
512
1024

2048

Size
256
512
256
512
1024
256
512
1024
2048

Cycles
9
17
9
17
33
9
17
33
65

# Muls.
2
1
4
2
1
8
4
2
1

Delay (ns)
37.431
70.703
37.431
70.703
137.247
37.431
70.703
137.247
270.335

Delay/Mult
18.716
70.703
9.358
35.352
137.247
4.679
17.676
68.624
270.335

2971

ŞENTÜRK and GÖK/Turk J Elec Eng & Comp Sci

[Figure: block diagrams of the Stage 1 block, Stage 4 block, Stage 5 block, and Align & Add stage block]

Figure 4. Details of the blocks in the MPSLKOM.

to convert an SLKOM to an MPSLKOM, the MPSLKOMs are expected to be preferred over the SLKOMs. Note that,
instead of an MPSLKOM, multiple low precision SLKOMs can be mapped on an FPGA by using approximately
the same amount of hardware, but those low precision SLKOMs cannot be used to multiply higher precision
operands.
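
As a quick check of the arithmetic behind Table 4, the following minimal sketch recomputes the delay per multiplication for the 2048-bit MPSLKOM rows by dividing each delay by the number of parallel multiplications. The names `ROWS_2048`, `delay_per_multiplication`, and `implied_clock_period` are illustrative and do not come from the paper. The clock period is not stated in the text; it is only inferred here as delay divided by cycle count.

```python
# Rows of Table 4 for the 2048-bit MPSLKOM:
# (operand size in bits, cycles, parallel multiplications, delay in ns)
ROWS_2048 = [
    (256, 9, 8, 37.431),
    (512, 17, 4, 70.703),
    (1024, 33, 2, 137.247),
    (2048, 65, 1, 270.335),
]


def delay_per_multiplication(delay_ns, parallel_mults):
    """Delay of one operation divided by the number of parallel multiplications."""
    return delay_ns / parallel_mults


def implied_clock_period(delay_ns, cycles):
    """Clock period inferred from the table (an assumption, not stated in the paper)."""
    return delay_ns / cycles


for size, cycles, mults, delay in ROWS_2048:
    per_mult = delay_per_multiplication(delay, mults)
    period = implied_clock_period(delay, cycles)
    print(f"{size:5d}-bit: {per_mult:8.3f} ns/mult, clock period {period:.3f} ns")
```

Under these assumptions, every row implies the same clock period, so the delay grows with the cycle count alone, and the gain in the low precision modes comes entirely from the number of multiplications performed in parallel.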
6. Conclusion
This paper presented single and multiple precision sequential large multiplier designs for FPGAs (SLKOM and
MPSLKOM). Both designs offer significant hardware savings compared with combinational designs, and thus
much larger sequential implementations than combinational ones can be mapped on FPGAs. For example, the
2048-bit SLKOM and MPSLKOM implementations use 75% of the DSP slices of a Virtex 5 FPGA. We modeled and
synthesized 256-bit to 2048-bit implementations of the SLKOM and MPSLKOM designs. The synthesis results show
that the speed disadvantage of the sequential implementations can be offset by increasing the throughput. This
can be observed from the results of the MPSLKOM implementations. For example, the delay per multiplication for
the 2048-bit MPSLKOM implementation was 4.679 ns in 256-bit multiplication mode, which was less than the
delay of the fastest combinational implementation. The 2048-bit MPSLKOM implementation can also perform
two 1024-bit multiplications or four 512-bit multiplications in parallel.
References
[1] FIPS P. 186-2. Digital Signature Standard (DSS). Gaithersburg, MD, USA: National Institute of Standards and
Technology (NIST), 2000.
[2] Menezes AJ, Van Oorschot PC, Vanstone SA. Handbook of Applied Cryptography. Boca Raton, FL, USA: CRC
Press, 1996.
[3] Bodrato M. Towards optimal Toom-Cook multiplication for univariate and multivariate polynomials in characteristic
2 and 0. In: Arithmetic of Finite Fields, June 21–22 2007; Madrid, Spain: Springer. pp. 116-133.
[4] Karatsuba A, Ofman Y. Multiplication of multidigit numbers on automata. English translation in Soviet Physics-Doklady 1963; 7: 595-596.
[5] Quan G, Davis JP, Devarkal S, Buell DA. High-level synthesis for large bit-width multipliers on FPGAs: a case
study. In: Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and
System Synthesis; 19–21 September 2005; Jersey City, NJ, USA: ACM. pp. 213-218.
[6] Gao S, Chabini N, Al-Khalili D, Langlois P. Optimised realisations of large integer multipliers and squarers using
embedded blocks. IET Comput Digit Tec 2007; 1: 9-16.
[7] Athow JL, Al-Khalili AJ. Implementation of large-integer hardware multiplier in Xilinx FPGA. In: 15th IEEE
International Conference on Electronics, Circuits and Systems, ICECS; 31 August–3 September 2008; St. Julian’s,
Malta: IEEE. pp. 1300-1303.
[8] Bessalah H, Messaoudi K, Issad M, Anane N, Anane M. Left to right serial multiplier for large numbers on FPGA.
In: The 3rd International Design and Test Workshop; 20–22 December 2008; Monastir, Tunisia: IEEE. pp. 288-293.
[9] Banescu S, De Dinechin F, Pasca B, Tudoran R. Multipliers for floating-point double precision and beyond on
FPGAs. ACM Comp Ar 2011; 38: 73-79.
[10] Jaiswal M, Cheung RC. Area-efficient architectures for large integer and quadruple precision floating point multipliers. In: The 20th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM);
29 April–1 May 2012; Toronto, Canada: IEEE. pp. 25-28.
[11] Gao S, Al-Khalili D, Chabini N, Langlois P. Asymmetric large size multipliers with optimised FPGA resource
utilisation. IET Comput Digit Tec 2012; 6: 372-383.
[12] Senturk A, Gok M. Pipelined large multiplier designs on FPGAs. In: The 15th Euromicro Conference on Digital
System Design; 5–8 September 2012; Izmir, Turkey: IEEE. pp. 809-814.


