High-speed FPGA 10's complement adders-subtractors by Bioul, G. et al.
Hindawi Publishing Corporation
International Journal of Reconfigurable Computing
Volume 2010, Article ID 219764, 14 pages
doi:10.1155/2010/219764
Research Article
High-Speed FPGA 10’s Complement Adders-Subtractors
G. Bioul,1, 2 M. Vazquez,2 J. P. Deschamps,1, 3 and G. Sutter4
1 Faculty of System Engineering, FASTA University, Mar del Plata, Argentina
2 Faculty of System Engineering, UNCPBA University, Tandil, Argentina
3 School of Engineering, Rovira I Virgili University, Tarragona, Spain
4 School of Engineering, Universidad Autónoma de Madrid, Madrid, Spain
Correspondence should be addressed to G. Sutter, gustavo.sutter@uam.es
Received 3 June 2009; Accepted 22 October 2009
Academic Editor: Elías Todorovich
Copyright © 2010 G. Bioul et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper first presents a study on the classical BCD adders from which a carry-chain type adder is redesigned to fit within
Xilinx FPGA platforms. Some new concepts are presented to compute the P and G functions for carry-chain optimization
purposes. Several alternative designs are presented. Then, attention is given to FPGA implementations of add/subtract algorithms
for 10’s complement BCD numbers. Carry-chain type circuits have been designed on 4-input LUTs (Virtex-4, Spartan-3) and
6-input LUTs (Virtex-5) Xilinx FPGA platforms. All designs are presented with the corresponding time performance and area
consumption figures. Results have been compared to straight implementations of a decimal ripple-carry adder and an FPGA 2’s
complement binary adder-subtractor using the dedicated carry logic, both carried out on the same platform. Better time delays
have been registered for decimal numbers within the same range of operands.
1. Introduction
In a number of computer arithmetic applications, decimal
systems are preferred to the binary ones. The reasons come
not only from the complexity of coding/decoding interfaces
but mostly from the lack of precision and clarity in the results
of the binary systems.
Decimal arithmetic plays a key role in data processing
environments such as commercial, financial, and Internet-
based applications [1–3]. Performances required by appli-
cations with intensive decimal arithmetic are not met by
most of the conventional software-based decimal arith-
metic libraries [1]. Hardware implementation embedded in
recently commercialized general purpose processors [3, 4] is
gaining importance.
Furthermore, IEEE has recently published a new standard
754-2008 [5] that supports the floating point representation
for decimal numbers.
At the moment, Binary Coded Decimal (BCD) is used for
decimal arithmetic algorithm implementations. Although
other coding systems may be of interest, BCD seems to be
the best choice until now. Issues of hardware realization of
decimal arithmetic units appear to be widely open: potential
improvements are expected in what refers to algorithm
concepts as well as to hardware design. This paper summarizes
some new concepts about carry-chain type algorithms for
adding BCD numbers. Two key ideas have been introduced:
(i) the Propagate P and generate G functions are computed
from the input data instead of intermediate BCD sums, and
(ii) the functions have been implemented in Xilinx Virtex-4
[6] and Virtex-5 FPGA platforms [7], taking advantage of the
6-input LUTs structure of Virtex-5 version.
Signed numbers addition is used as a primitive operation
for computing most arithmetic functions, so that it deserves
particular attention. It is well known that in classical
algorithms the execution time of any program or circuit is
proportional to the number N of digits of the operands.
In order to minimize the computation time, several ideas
have been proposed in the literature [8, 9]. Most of them
consist in modifying the classical algorithm in such a way
as to minimize the computation time of each carry; the
time complexity may still be proportional to N, but the
proportionality constant may be reduced. Moreover, it has
to be pointed out that, within the same range, decimal
addition involves shorter carry propagation process than for
the straight binary code. It will be shown in the practical
implementations that adding BCD digits not only saves
coding interfaces but also provides time delay
reductions. Hardware consumption for BCD will be greater
if coding and decoding processes are not considered; however,
the dramatic decrease in hardware cost today stimulates
work on time saving.
In this paper, decimal carry-chain and ripple-carry
adders have been implemented on Virtex-4 Xilinx FPGA
platforms, for a number of operand sizes; comparative
performances are presented for binary and BCD digit
operands.
Additionally, three implementations of adders-
subtractors have been implemented on FPGA Xilinx
Virtex-5 platforms for a number of operand sizes;
comparative performances are presented for binary and
BCD digit operands, respectively. Adder-subtractor inputs
are 10’s complement signed BCD numbers; sign-change
algorithm is used whenever subtraction is at hand.
2. Base-B Ripple-Carry Adders
Consider the base-B representations of two n-digit numbers:
$$
\begin{aligned}
x &= x_{n-1} \cdot B^{n-1} + x_{n-2} \cdot B^{n-2} + \cdots + x_0 \cdot B^0, \\
y &= y_{n-1} \cdot B^{n-1} + y_{n-2} \cdot B^{n-2} + \cdots + y_0 \cdot B^0.
\end{aligned}
\tag{1}
$$
Algorithm 1 (pencil and paper) computes the (n + 1)-digit
representation of the sum z = x + y + c_in, where c_in is an initial
carry equal to 0 or 1.
Algorithm 1. Classic addition (ripple carry):
c(0) := c_in;
for i in 0 .. n-1 loop
  if x(i) + y(i) + c(i) > B - 1 then c(i + 1) := 1;
  else c(i + 1) := 0; end if;
  z(i) := (x(i) + y(i) + c(i)) mod B;
end loop;
z(n) := c(n);
As c(i + 1) is a function of c(i), the execution time
of Algorithm 1 is proportional to n (Figure 1). In order to
reduce the execution time of each iteration step, Algorithm 1
can be modified as shown in Section 3.
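Algorithm 1 can be sketched in Python as follows. This is only an illustration: the function name `ripple_carry_add` and the convention of storing digits least significant first are assumptions made here, not part of the original algorithm statement.

```python
def ripple_carry_add(x, y, base, c_in=0):
    """Add two equal-length digit lists (least significant digit first).

    Returns the (n + 1)-digit sum, also least significant digit first.
    """
    assert len(x) == len(y)
    c = c_in
    z = []
    for xi, yi in zip(x, y):
        s = xi + yi + c
        c = 1 if s > base - 1 else 0   # carry c(i + 1)
        z.append(s % base)             # sum digit z(i)
    z.append(c)                        # z(n) := c(n)
    return z


# 4589 + 7643 = 12232 in base 10 (digits listed least significant first)
print(ripple_carry_add([9, 8, 5, 4], [3, 4, 6, 7], 10))  # [2, 3, 2, 2, 1]
```

Each iteration needs the carry of the previous one, which is the dependency that makes the execution time proportional to n.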
3. Base-B Carry-Chain Adders
First define two binary functions of two B-valued variables,
namely, the propagate (P) and generate (G) functions:
$$
P(i) \equiv
\begin{cases}
P(x(i), y(i)) = 1 & \text{if } x(i) + y(i) = B - 1, \\
P(x(i), y(i)) = 0 & \text{otherwise;}
\end{cases}
\qquad
G(i) \equiv
\begin{cases}
G(x(i), y(i)) = 1 & \text{if } x(i) + y(i) > B - 1, \\
G(x(i), y(i)) = 0 & \text{otherwise.}
\end{cases}
\tag{2}
$$
The next carry c(i + 1) can be calculated as follows:
  if P(i) = 1 then c(i + 1) := c(i);
  else c(i + 1) := G(i); end if;
The corresponding modified Algorithm 2 is the following
one.
Algorithm 2. Carry-chain addition
-- computation of the generation and propagation conditions:
for i in 0 .. n-1 loop
  G(i) := G(x(i), y(i));
  P(i) := P(x(i), y(i));
end loop;
-- carry computation:
c(0) := c_in;
for i in 0 .. n-1 loop
  if P(i) = 1 then c(i + 1) := c(i);
  else c(i + 1) := G(i); end if;     -- (3)
end loop;
-- sum computation:
for i in 0 .. n-1 loop
  z(i) := (x(i) + y(i) + c(i)) mod B;
end loop;
z(n) := c(n);
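A minimal Python sketch of Algorithm 2 is given below. The helper names (`propagate`, `generate`, `carry_chain_add`) and the least-significant-digit-first ordering are illustrative assumptions. The three phases of the algorithm are kept separate to mirror its structure.

```python
def propagate(xi, yi, base):
    """P(i) of definition (2): 1 when x(i) + y(i) = B - 1."""
    return 1 if xi + yi == base - 1 else 0


def generate(xi, yi, base):
    """G(i) of definition (2): 1 when x(i) + y(i) > B - 1."""
    return 1 if xi + yi > base - 1 else 0


def carry_chain_add(x, y, base, c_in=0):
    """Digits are given least significant first; returns n + 1 digits."""
    n = len(x)
    # Phase 1: generate and propagate conditions, independent per digit.
    g = [generate(x[i], y[i], base) for i in range(n)]
    p = [propagate(x[i], y[i], base) for i in range(n)]
    # Phase 2: carry computation, one 2-to-1 selection per digit.
    c = [c_in]
    for i in range(n):
        c.append(c[i] if p[i] == 1 else g[i])
    # Phase 3: sum digits, independent per digit once the carries are known.
    z = [(x[i] + y[i] + c[i]) % base for i in range(n)]
    z.append(c[n])
    return z


print(carry_chain_add([9, 9, 9], [0, 0, 0], 10, 1))  # [0, 0, 0, 1]
```

In the last example every digit pair propagates, so the initial carry travels through the whole chain.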
Comments.
(1) Instruction sentence (3) is equivalent to the following
Boolean equation:
$$
c(i + 1) = P(i) \cdot c(i) \vee \overline{P(i)} \cdot G(i). \tag{4}
$$
Furthermore, if the preceding relation is used, then
the definition of the generate function can be modified:
$$
G(i) =
\begin{cases}
1 & \text{if } x(i) + y(i) > B - 1, \\
0 & \text{if } x(i) + y(i) < B - 1, \\
0 \text{ or } 1 \text{ (do not care)} & \text{otherwise.}
\end{cases}
\tag{5}
$$
(2) Another Boolean equation equivalent to (4) is
$$
c(i + 1) = G(i) \vee P(i) \cdot c(i). \tag{6}
$$
If the preceding relation is used, then the definition of the
propagate function can be modified:
$$
P(i) =
\begin{cases}
1 & \text{if } x(i) + y(i) = B - 1, \\
0 & \text{if } x(i) + y(i) < B - 1, \\
0 \text{ or } 1 \text{ (do not care)} & \text{otherwise.}
\end{cases}
\tag{7}
$$
The structure of an n-digit adder with separate carry
calculation is shown in Figure 2. It is based on Algorithm 2.
The G-P (Generate-Propagate) cell calculates the Generate
and Propagate functions (2).
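The do-not-care conditions in (5) and (7) can be checked exhaustively. The following sketch (function names are illustrative) verifies, for a given base, that the multiplexer form (4) yields the correct carry for either value of G(i) when x(i) + y(i) = B − 1, and that the AND-OR form (6) yields the correct carry for either value of P(i) when x(i) + y(i) > B − 1.

```python
from itertools import product


def carry_mux(p, g, c):
    """Equation (4): c(i+1) = P.c v not(P).G."""
    return (p & c) | ((1 - p) & g)


def carry_and_or(p, g, c):
    """Equation (6): c(i+1) = G v P.c."""
    return g | (p & c)


def dont_cares_hold(base):
    for x, y, c in product(range(base), range(base), (0, 1)):
        s = x + y
        expected = 1 if s + c > base - 1 else 0
        p = 1 if s == base - 1 else 0
        g = 1 if s > base - 1 else 0
        # (5): with the multiplexer form, G is free when x + y = B - 1.
        g_options = (0, 1) if s == base - 1 else (g,)
        if any(carry_mux(p, gv, c) != expected for gv in g_options):
            return False
        # (7): with the AND-OR form, P is free when x + y > B - 1.
        p_options = (0, 1) if s > base - 1 else (p,)
        if any(carry_and_or(pv, g, c) != expected for pv in p_options):
            return False
    return True


print(dont_cares_hold(10))  # True
```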
Figure 1: Ripple-Carry adder (n cascaded mod B full-adder cells; cell i receives x(i), y(i), and c(i) and produces z(i) and c(i + 1), with c(0) = c_in and z(n) = c(n)).
Figure 2: Carry-chain adder (for each digit i, a G-P cell producing g(i) and p(i), a Cy.Ch. cell producing c(i + 1), and a mod B sum cell producing z(i), with c(0) = c_in and z(n) = c(n)).
The Cy.Ch (carry-chain) cell computes the next carry,
that is to say
$$
c(i + 1) =
\begin{cases}
c(i) & \text{if } P(i) = 1, \\
G(i) & \text{otherwise,}
\end{cases}
\tag{8}
$$
so that G(i) generates a carry, whatever happens upstream in
the carry-chain, and P(i) propagates the carry from level i − 1.
The mod B sum cell calculates
$$
z(i) = (x(i) + y(i) + c(i)) \bmod B. \tag{9}
$$
As regards the computation time T, the critical path is
shaded in Figure 2. It has been assumed that $T_{\mathrm{sum}} > T_{\mathrm{Cy.Ch}}$.
Another interesting time is the delay $T_{\mathrm{carry}}(n)$ from c(0)
to c(n) assuming that all propagate and generate functions
have already been calculated:
$$
T_{\mathrm{carry}}(n) = n \cdot T_{\mathrm{mux2\text{-}1}}. \tag{10}
$$
Comments. The carry-chain cells are binary circuits, whereas
the generate-propagate and the mod B sum cells are B-ary
ones.
Equation (4) can be implemented by a 2-to-1 binary
multiplexer (Figure 3(a)), while (6) can be implemented by a 2-gate
AND-OR circuit (Figure 3(b)). In the first case, the per-digit delay of a carry-
chain adder is equal to the delay $T_{\mathrm{mux2\text{-}1}}$ of a 2-to-1 binary
multiplexer, whatever the base B is.
Figure 3: Carry-chain cells: (a) carry multiplexer; (b) AND-OR circuit.
If B = 2 and the carry-chain cell of Figure 3(a) is used,
then P(i) = x(i) ⊕ y(i) and G(i) can be chosen equal to,
for example, y(i). The corresponding cell for an n-bit binary
adder is shown in Figure 4.
Figure 4: Binary adder cell.
4. Base-10 Complement and Addition
4.1. Ten’s Complement Numeration System. General principles of B’s complement
representation are available in the literature, for example, in computer
arithmetic books such as [8, 9]. This paper restricts itself to the 10’s
complement system. A one-to-one function R(x),
associating a natural number to x, is defined as follows.
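Every integer x belonging to the range
$$
-\frac{10^n}{2} \le x < \frac{10^n}{2} \tag{11}
$$
is represented by $R(x) = x \bmod 10^n$, so that the integer
represented in the form $x_{n-1}\, x_{n-2} \cdots x_1\, x_0$ is
$$
\begin{cases}
x_{n-1} \cdot 10^{n-1} + x_{n-2} \cdot 10^{n-2} + \cdots + x_0
  & \text{if } x_{n-1} \cdot 10^{n-1} + x_{n-2} \cdot 10^{n-2} + \cdots + x_0 < \dfrac{10^n}{2}, \\[1ex]
x_{n-1} \cdot 10^{n-1} + x_{n-2} \cdot 10^{n-2} + \cdots + x_0 - 10^n
  & \text{if } x_{n-1} \cdot 10^{n-1} + x_{n-2} \cdot 10^{n-2} + \cdots + x_0 \ge \dfrac{10^n}{2}.
\end{cases}
\tag{12}
$$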
The conditions (12) may be more simply expressed as
$$
\begin{cases}
x_{n-1} \cdot 10^{n-1} + x_{n-2} \cdot 10^{n-2} + \cdots + x_0 & \text{if } x_{n-1} < 5, \\
x_{n-1} \cdot 10^{n-1} + x_{n-2} \cdot 10^{n-2} + \cdots + x_0 - 10^n & \text{if } x_{n-1} \ge 5.
\end{cases}
\tag{13}
$$
Another way to express a 10’s complement number is
$$
x'_{n-1} \cdot 10^{n-1} + x_{n-2} \cdot 10^{n-2} + \cdots + x_0, \tag{14}
$$
where
$$
x'_{n-1} = x_{n-1} - 10 \ \text{ if } x_{n-1} \ge 5, \qquad x'_{n-1} = x_{n-1} \ \text{ if } x_{n-1} < 5, \tag{15}
$$
while the sign definition rule is the following one:
if x is negative, then $x_{n-1} \ge 5$; otherwise $x_{n-1} < 5$.
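The representation rule (11) to (13) can be illustrated with a short Python sketch. The function names `encode` and `decode` and the most-significant-digit-first ordering are choices made for this example only.

```python
def encode(x, n):
    """Return R(x) = x mod 10**n as n digits, most significant first."""
    half = 10**n // 2
    if not -half <= x < half:
        raise ValueError("x is outside the n-digit 10's complement range (11)")
    r = x % 10**n                      # Python's % is non-negative here
    return [(r // 10**k) % 10 for k in range(n - 1, -1, -1)]


def decode(digits):
    """Inverse of encode, using the leading-digit test of (13)."""
    value = 0
    for d in digits:
        value = value * 10 + d
    return value - 10**len(digits) if digits[0] >= 5 else value


print(encode(-25, 3))      # [9, 7, 5]
print(decode([9, 7, 5]))   # -25
```

The leading digit alone decides the sign: it is at least 5 exactly when the represented integer is negative.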
4.2. Ten’s Complement Sign Change. Given an n-digit 10’s
complement integer x, the inverse z = −x of x is an (n + 1)-
digit 10’s complement integer. Actually, the only case in which −x
cannot be represented with n digits is $x = -10^n/2$, so that
$-x = 10^n/2$, that is to say, $-x = 0 \cdot 10^n + 5 \cdot 10^{n-1} + 0 \cdot 10^{n-2} + \cdots + 0 \cdot 10^0$.
The computation of the representation of −x is
based on the following property.
Assuming x to be represented as an n-digit 10’s comple-
ment number R(x), −x may be readily computed as
$$
-x = 10^{n+1} - R(x). \tag{16}
$$
A straightforward inversion algorithm then consists in
representing x with n + 1 digits, complementing every digit
to 9, then adding 1. Observe that sign extension is obtained
by adding a digit 0 to the left of a positive number or 9 for a
negative number, respectively.
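The inversion procedure just described (sign extension to n + 1 digits, digitwise complement to 9, then add 1) can be sketched as follows. The function name `negate` and the most-significant-digit-first ordering are assumptions of this example.

```python
def negate(digits):
    """Sign change of an n-digit 10's complement number.

    digits: most significant digit first. Returns n + 1 digits.
    """
    # Sign extension: prepend 0 for a positive number, 9 for a negative one.
    sign_digit = 9 if digits[0] >= 5 else 0
    extended = [sign_digit] + list(digits)
    # Complement every digit to 9.
    nines = [9 - d for d in extended]
    # Add 1, rippling the carry from the least significant digit.
    carry = 1
    result = []
    for d in reversed(nines):
        s = d + carry
        result.append(s % 10)
        carry = s // 10
    result.reverse()
    return result                      # the final carry is discarded


print(negate([9, 7, 5]))   # -25  -> [0, 0, 2, 5]
print(negate([5, 0, 0]))   # -500 -> [0, 5, 0, 0]
```

The second call is the corner case mentioned above: −(−500) = 500 needs the extra digit.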
5. Base-10 Adders
5.1. Base-10 Ripple-Carry Adders. For B = 10, the classic and
naïve approach [8] of ripple-carry for a BCD decimal adder
cell can be implemented as in Figure 5. Observe that the
critical path involves the carry propagation through 7 binary
adders plus a 4-bit Boolean circuit (checking if the sum s is
greater than 9 or not).
5.2. Base-10 Carry-Chain Adders. If B = 10, the carry-chain
circuit remains unchanged but the P and G functions as well
as the modulo-10 sums are somewhat more complex. In base
2, the mod B sum cell appears to be a single XOR function,
while the mod 10 sum cell is more complex as suggested by
Figure 5.
In base 2, the P and G cells are, respectively, synthesized
by XOR and AND functions, while in base 10, P and G are
now defined as follows:
$$
P(i) \equiv
\begin{cases}
P(x(i), y(i)) = 1 & \text{if } x(i) + y(i) = 9, \\
P(x(i), y(i)) = 0 & \text{otherwise;}
\end{cases}
\qquad
G(i) \equiv
\begin{cases}
G(x(i), y(i)) = 1 & \text{if } x(i) + y(i) > 9, \\
G(x(i), y(i)) = 0 & \text{otherwise.}
\end{cases}
\tag{17}
$$
A straightforward way to synthesize P and G is shown
in Figure 6. Nevertheless, functions P and G may be directly
computed from x(i) and y(i) inputs. The following formulas
(18) are Boolean expressions of conditions (17):
$$
\begin{aligned}
P(i) &= p_0 \cdot \left[ k_1 \cdot (p_3 \cdot k_2 \vee k_3 \cdot g_2) \vee g_1 \cdot k_3 \cdot p_2 \right], \\
G(i) &= g_0 \cdot \left[ p_3 \vee g_2 \vee p_2 \cdot g_1 \right] \vee g_3 \vee p_3 \cdot p_2 \vee p_3 \cdot p_1 \vee g_2 \cdot p_1 \vee g_2 \cdot g_1,
\end{aligned}
\tag{18}
$$
where $p_j = x_j \oplus y_j$, $g_j = x_j \cdot y_j$, and $k_j = x'_j \cdot y'_j$ are the binary
propagator, generator, and carry-kill for the jth components
of the BCD digits x(i) and y(i).
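Formulas (18) can be compared against conditions (17) for every pair of valid BCD digits. The sketch below does this by brute force; the function names are illustrative, and bit 0 denotes the least significant bit of each digit.

```python
from itertools import product


def bits(v):
    """Bits of a 4-bit value, index j = weight 2**j."""
    return [(v >> j) & 1 for j in range(4)]


def pg_formula_18(x, y):
    xb, yb = bits(x), bits(y)
    p = [a ^ b for a, b in zip(xb, yb)]              # binary propagators
    g = [a & b for a, b in zip(xb, yb)]              # binary generators
    k = [(1 - a) & (1 - b) for a, b in zip(xb, yb)]  # binary carry-kills
    P = p[0] & ((k[1] & ((p[3] & k[2]) | (k[3] & g[2]))) | (g[1] & k[3] & p[2]))
    G = ((g[0] & (p[3] | g[2] | (p[2] & g[1]))) | g[3] | (p[3] & p[2])
         | (p[3] & p[1]) | (g[2] & p[1]) | (g[2] & g[1]))
    return P, G


def matches_definition_17():
    """True if (18) agrees with (17) on all 100 BCD digit pairs."""
    return all(pg_formula_18(x, y) == (int(x + y == 9), int(x + y > 9))
               for x, y in product(range(10), repeat=2))


print(matches_definition_17())  # True
```

The check only covers digit values 0 to 9; the formulas are not claimed to hold for the six unused 4-bit codes.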
The BCD carry-chain adder ith cell is shown at Figure 7.
It is made of a first mod 16 adder stage, a carry-chain cell
driven by the G-P functions, and an output adder stage
performing a correction (adding 6) whenever the carry-out
is one. Actually, a zero carry-out c(i+1) identifies that the
mod 16 sum does not exceed 9 if c(i) = 0, respectively, 8 if
c(i) = 1; so no corrections are needed. Otherwise, the add-6
correction applies.
The G-P functions may be computed according to
Figure 6, using the outputs of the mod 16 stage, including the
carry-out s4. With more hardware consumption, but saving
time delays, formulas (18) may be used.
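The behavior of the cell described above can be modeled at the arithmetic level. The following sketch assumes, following the output adder stage of Figure 8(b), that the first stage adds x(i) and y(i) without carry-in and that the output stage adds both the incoming carry c(i) and the correction 6; the function name `bcd_cell` is illustrative.

```python
def bcd_cell(x, y, c_in):
    """Arithmetic model of one carry-chain BCD adder cell.

    x, y: BCD digits (0..9); c_in: incoming carry (0 or 1).
    Returns (z, c_out).
    """
    s = x + y                     # first adder stage, 5-bit result s4..s0
    p = int(s == 9)               # propagate
    g = int(s > 9)                # generate
    c_out = c_in if p else g      # carry-chain multiplexer
    # Output stage: low 4 bits of s, plus carry-in, plus 6 when c_out = 1,
    # all taken modulo 16.
    z = ((s & 0xF) + c_in + (6 if c_out else 0)) & 0xF
    return z, c_out


print(bcd_cell(9, 9, 1))  # (9, 1), since 9 + 9 + 1 = 19
print(bcd_cell(4, 5, 1))  # (0, 1), since 4 + 5 + 1 = 10
```

When the carry-out is 0, the sum plus carry-in is at most 9 and passes through unchanged; otherwise adding 6 modulo 16 is the same as subtracting 10.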
Figure 5: Ripple-Carry BCD adder cell (carry-out: c(i + 1) = s4 ∨ s3 · (s2 ∨ s1)).
Figure 6: G-P cell for BCD adder (P = s′4 · s3 · s′2 · s′1 · s0; G = s4 ∨ s3 · (s2 ∨ s1)).
Figure 7: Carry-chain BCD adder ith cell.
6. FPGA Implementations of the Base-10 Adders
on 4-Input LUTs Xilinx Platforms
The base-10 adders of Figures 5 and 7 have been imple-
mented on 4-LUTs (Look-Up Tables up to 4 inputs) Xilinx
devices. Virtex-4, Spartan 3, and the obsolete Virtex-2, Virtex
and Spartan 2 are 4-input LUTs-based FPGA [6, 10]. In what
follows the area is expressed in LUTs. In the Xilinx Virtex-4
technology a configurable logic block (CLB) involves 4 slices
and a slice is made by two 4-LUTs and some additional logic.
VHDL models are available at [11].
6.1. Base-10 Ripple-Carry Adder. The classic implementation
of the ripple carry adder cell in FPGA implies a 4-bit adder, a
4-LUT to detect the carry condition, and a final 3-bit adder.
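The delay and area consumption of an N-digit ripple carry
adder are
$$
\begin{aligned}
T_{N\text{-digit-B10-rc-adder}} &= N \cdot \big[ T_{\mathrm{LUT}} + 4 \cdot T_{\mathrm{mux\text{-}cy}} + T_{\mathrm{XOR}} + T_{\mathrm{con}} + T_{\mathrm{LUT}} + T_{\mathrm{con}} + T_{\mathrm{LUT}} \\
&\qquad\quad + 3 \cdot T_{\mathrm{mux\text{-}cy}} + T_{\mathrm{XOR}} + T_{\mathrm{con}} \big], \\
C_{N\text{-digit-B10-rc-adder}} &= 8 \cdot N \ \text{LUTs}.
\end{aligned}
\tag{19}
$$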
6.2. FPGA Implementation of the Base-10 Carry-Chain Adder.
In order to make the best use of the resources, the design
has been achieved using relative location techniques (RLOC)
[12] with low-level component instantiations. This first
architecture is called GP a.
The adding stages are implemented as shown in Figures
8(a) and 8(b) while the carry-chain structure with the G-P
functions has been implemented as shown in Figure 9 where
G is computed according to Figure 6, while P is computed as
$$
P = s_3 \cdot s_0 \cdot G' \tag{20}
$$
equivalent to the expression of Figure 6. Figure 9 emphasizes
that G depends on s1, s2, s3, and s4 while P is computed
from s0, s3, and G.
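The equivalence between (20) and the expression of P in Figure 6 can be confirmed over all 5-bit values of the intermediate sum. This is a small sanity-check sketch; the function names are illustrative.

```python
def g_fig6(s4, s3, s2, s1, s0):
    """G = s4 v s3.(s2 v s1), as in Figure 6."""
    return s4 | (s3 & (s2 | s1))


def p_fig6(s4, s3, s2, s1, s0):
    """P = s4'.s3.s2'.s1'.s0, as in Figure 6."""
    return (1 - s4) & s3 & (1 - s2) & (1 - s1) & s0


def p_eq20(s4, s3, s2, s1, s0):
    """P = s3.s0.G', equation (20)."""
    return s3 & s0 & (1 - g_fig6(s4, s3, s2, s1, s0))


def equivalent():
    for s in range(32):
        b = [(s >> j) & 1 for j in (4, 3, 2, 1, 0)]   # s4, s3, s2, s1, s0
        if p_fig6(*b) != p_eq20(*b):
            return False
    return True


print(equivalent())  # True
```

Expression (20) reuses G, which is why Figure 9 shows P depending only on s0, s3, and G.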
Figure 8: FPGA implementation of adders: (a) 4-bit adder stage and (b) output adder stage.
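The time delays corresponding to the 4-bit adder stage
(Figure 8(a)) and the output adder stage (Figure 8(b)) are
given as
$$
T_{\text{4-bit adder}} = T_{\mathrm{LUT}} + 4 \cdot T_{\mathrm{mux\text{-}cy}}, \tag{21}
$$
$$
T_{\text{output adder}} = T_{\mathrm{LUT}} + 3 \cdot T_{\mathrm{mux\text{-}cy}} + T_{\mathrm{XOR}}. \tag{22}
$$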
Both adder stages of Figures 8(a) and 8(b) have the
same hardware requirement; expressed in LUTs, the area
consumption is given as
$$
C_{\text{4-bit adder}} = C_{\text{output adder}} = 4 \ \text{LUTs}. \tag{23}
$$
The complexity figures of the carry-chain circuit for a 4-
digit unit, as shown in Figure 9, are given as
$$
T_{\text{Cy-Ch-a}} = 2 \cdot T_{\mathrm{LUT}} + T_{\mathrm{con1}} + 4 \cdot T_{\mathrm{mux\text{-}cy}}, \tag{24}
$$
$$
C_{\text{Cy-Ch-a}} = 8 \ \text{LUTs}, \tag{25}
$$
where $T_{\mathrm{con1}}$ stands for the average connection delay between
two neighboring slices of the same CLB.
The overall circuit is represented in Figure 10. The overall
time delay is computed from formulas (21), (22), and (24):
$$
T_{N\text{-digit adder-a}} = 4 \cdot T_{\mathrm{LUT}} + (N + 7) \cdot T_{\mathrm{mux\text{-}cy}} + T_{\mathrm{XOR}} + T_{\mathrm{con1}} + 2 \cdot T_{\mathrm{con2}}, \tag{26}
$$
where $T_{\mathrm{con2}}$ stands for the average connection delay between
two slices located in neighboring columns. $T_{\mathrm{con2}}$ has to be
counted twice to account for both the connection delay
between the 4-bit adder and the carry-chain and the one
between the carry chain and the output adder.
From (23) and (25), the area requirement may be
computed as
$$
C_{N\text{-digit adder-a}} = 10 \cdot N \ \text{LUTs}. \tag{27}
$$
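As a cross-check, (26) and (27) can be recovered by summing the stage contributions. The derivation below assumes that the carry-chain delay (24), stated for a 4-digit unit, extends to N digits by replacing the term 4 · T_mux-cy with N · T_mux-cy, and that the 8 LUTs of (25) serve 4 digits, that is, 2 LUTs per digit. Both are readings of the text rather than statements made in it.

```latex
\begin{aligned}
T_{N\text{-digit adder-a}}
 &= \underbrace{T_{\mathrm{LUT}} + 4\,T_{\mathrm{mux\text{-}cy}}}_{\text{4-bit adder, (21)}}
  + T_{\mathrm{con2}}
  + \underbrace{2\,T_{\mathrm{LUT}} + T_{\mathrm{con1}} + N\,T_{\mathrm{mux\text{-}cy}}}_{\text{carry chain, (24) over } N \text{ digits}}
  + T_{\mathrm{con2}}
  + \underbrace{T_{\mathrm{LUT}} + 3\,T_{\mathrm{mux\text{-}cy}} + T_{\mathrm{XOR}}}_{\text{output adder, (22)}} \\
 &= 4\,T_{\mathrm{LUT}} + (N + 7)\,T_{\mathrm{mux\text{-}cy}} + T_{\mathrm{XOR}} + T_{\mathrm{con1}} + 2\,T_{\mathrm{con2}}, \\[1ex]
C_{N\text{-digit adder-a}}
 &= \Big( \underbrace{4}_{\text{4-bit adder}} + \underbrace{4}_{\text{output adder}} + \underbrace{8/4}_{\text{carry chain}} \Big) \cdot N
  = 10 \cdot N \ \text{LUTs}.
\end{aligned}
```

Only the carry-chain term grows with N; the two adder stages contribute a constant 7 · T_mux-cy.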
Figure 9: FPGA implementation of the carry-chain circuit (GP a architecture).
6.3. Other Implementations of Base-10 Carry-Chain Adders.
Functions P and G may be directly computed from x(i) and
y(i) inputs using the Boolean expression (18). Using 4-input
LUTs (4-LUTs), a first implementation (Figure 11) computes
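$$
\begin{aligned}
c &= (x_2 \oplus y_2) \cdot (x'_3 \cdot y'_3), \\
b &= (x'_2 \cdot y'_2) \cdot (x_3 \oplus y_3) \vee (x_2 \cdot y_2) \cdot (x'_3 \cdot y'_3), \\
a &= (x'_1 \cdot y'_1) \cdot b \vee (x_1 \cdot y_1) \cdot c, \qquad \mathit{prop} = (x_0 \oplus y_0) \cdot a, \\
h &= (x_0 \cdot y_0) \cdot (x_1 \cdot y_1), \qquad g = (x_0 \cdot y_0) \vee x_1 \vee y_1, \\
e &= (x_2 \cdot y_2) \cdot g \vee h \cdot (x_2 \oplus y_2), \\
f &= (x_1 \oplus y_1) \vee (x_2 \oplus y_2), \\
d &= (x_0 \cdot y_0) \vee f, \qquad \mathit{gen} = (x_3 \cdot y_3) \vee (x_3 \oplus y_3) \cdot d \vee e.
\end{aligned}
\tag{28}
$$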
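The decomposition (28) can be evaluated signal by signal and compared with the arithmetic definitions of propagate and generate. The sketch below mirrors the intermediate names a to h of (28); the function name `gp_b` is illustrative, and the check is restricted to valid BCD digits.

```python
def gp_b(x, y):
    """Evaluate prop and gen of (28) for BCD digits x and y."""
    x0, x1, x2, x3 = [(x >> j) & 1 for j in range(4)]
    y0, y1, y2, y3 = [(y >> j) & 1 for j in range(4)]

    def n(v):
        return 1 - v                     # Boolean complement

    c = (x2 ^ y2) & (n(x3) & n(y3))
    b = ((n(x2) & n(y2)) & (x3 ^ y3)) | ((x2 & y2) & (n(x3) & n(y3)))
    a = ((n(x1) & n(y1)) & b) | ((x1 & y1) & c)
    prop = (x0 ^ y0) & a
    h = (x0 & y0) & (x1 & y1)
    g = (x0 & y0) | x1 | y1
    e = ((x2 & y2) & g) | (h & (x2 ^ y2))
    f = (x1 ^ y1) | (x2 ^ y2)
    d = (x0 & y0) | f
    gen = (x3 & y3) | ((x3 ^ y3) & d) | e
    return prop, gen


ok = all(gp_b(x, y) == (int(x + y == 9), int(x + y > 9))
         for x in range(10) for y in range(10))
print(ok)  # True
```

Every intermediate signal depends on at most four inputs, which is what allows each one to be mapped onto a single 4-input LUT.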
This architecture, called GP b, is shown in Figure 11. The
corresponding time and area of a carry-chain cell using this
architecture are
$$
\begin{aligned}
T_{\text{Cy-Ch-b}} &= 4 \cdot T_{\mathrm{LUT}} + 3 \cdot T_{\mathrm{con1}} + T_{\mathrm{mux\text{-}cy}}, \\
C_{\text{Cy-Ch-b}} &= 10 \ \text{LUTs}.
\end{aligned}
\tag{29}
$$
The complete cell includes a 4-bit adder and a con-
ditional 3-bit output adder adding 6 whenever necessary
(similar to Figure 5). The overall time delay and area
consumption using this carry-computation cell are
$$
T_{N\text{-digit adder-b}} = 4 \cdot T_{\mathrm{LUT}} + (N + 3) \cdot T_{\mathrm{mux\text{-}cy}} + T_{\mathrm{XOR}} + 3 \cdot T_{\mathrm{con1}} + T_{\mathrm{con2}}, \tag{30}
$$
$$
C_{N\text{-digit adder-b}} = 18 \cdot N \ \text{LUTs}. \tag{31}
$$
The results in area and speed are poor compared to the
GP a implementation (obtaining G-P from the results of the
4-bit adder).
Another alternative is based on the use of dedicated
multiplexers. Xilinx Spartan 3, Virtex-2, and Virtex-4 devices
have Look-Up Table multiplexers (muxf5, muxf6, muxf7,
muxf8) in order to construct functions of 5, 6, 7, and 8
variables without using the general purpose routing fabric.
Using this feature the circuit of Figure 12 (GP c) can be
implemented using the following relations:
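$$
\begin{aligned}
b &= y'_3 \cdot y'_2 \cdot y'_1; \qquad d = x'_2 \cdot x'_1; \\
c &= y_3 \cdot d \vee y'_3 \cdot e; \qquad a = x_3 \cdot b \vee x'_3 \cdot c; \\
p1 &= (x'_0 \vee y'_0) \cdot a, \\
e &= (x_2 \cdot x_1 \cdot y'_2 \cdot y_1) \vee (x'_2 \cdot x_1 \cdot y_2 \cdot y_1) \vee (x_2 \cdot x'_1 \cdot y_2 \cdot y'_1), \\
k &= (x_2 \cdot y_2) \vee (x_1 \cdot y_1) \cdot (x_2 \vee y_2); \qquad h = j = 1; \\
i &= y_3 \cdot j \vee y'_3 \cdot k; \qquad g1 = x_3 \cdot h \vee x'_3 \cdot i; \\
p2 &= (x_0 \vee y_0).
\end{aligned}
\tag{32}
$$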
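The corresponding time and area of a carry-chain cell
GP c are
$$
\begin{aligned}
T_{\text{Cy-Ch-c}} &= T_{\text{6-LUT}} + T_{\mathrm{LUT}} + T_{\mathrm{con1}} + T_{\mathrm{mux\text{-}cy}}, \\
C_{\text{Cy-Ch-c}} &= 8 \ \text{LUTs},
\end{aligned}
\tag{33}
$$
where $T_{\text{6-LUT}}$ stands for the delay from an LUT input to a
muxf6 output. The complete cell also includes a 4-bit adder
and a conditional 3-bit adder. The overall delay and area for the GP c
cell are
$$
T_{N\text{-digit adder-c}} = T_{\text{6-LUT}} + T_{\mathrm{LUT}} + (2 \cdot N + 3) \cdot T_{\mathrm{mux\text{-}cy}} + T_{\mathrm{XOR}} + T_{\mathrm{con1}} + T_{\mathrm{con2}}, \tag{34}
$$
$$
C_{N\text{-digit adder-c}} = 16 \cdot N \ \text{LUTs}. \tag{35}
$$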
7. FPGA Implementations of Base-10 Adders
and Adders-Subtractors on 6-Input LUTs
Xilinx Platforms
7.1. Base-10 BCD Carry-Chain Adder. In a first version, Ad-
I, the adding stage and correction stage are implemented
as shown at Figures 8(a) and 8(b), respectively, while the
carry-chain structure with the G-P functions is computed
according to Figure 6.
Xilinx Virtex-5 6-input/2-output LUT is built as two 5-
input functions while the sixth input controls a 2-1 multi-
plexor allowing to implement either two 5-input functions
or a single 6-input one; so G and P functions fit in a single
LUT as shown at Figure 13.
In a second version, Ad-II, the carry-chain is speeded up
thanks to a direct computation of the G-P, namely, using
inputs x(i) and y(i), instead of the intermediate sum bits sk.
For this purpose one could use formulas (18); nevertheless,
in order to minimize time and hardware consumption the
implementation of P(i) and G(i) is revisited as follows.
Remembering that P(i) = 1 whenever the arithmetic sum
x(i) + y(i) = 9, one defines a 6-input function pp(i) set to
1 whenever the arithmetic sum of the three most significant bits of x(i) and
y(i), that is, (x3 x2 x1) + (y3 y2 y1), is 4. Then P(i) may be computed as
$$
P(i) = (x_0(i) \oplus y_0(i)) \cdot pp(i). \tag{36}
$$
On the other hand, gg(i) is defined as a 6-input function
set to 1 whenever that same sum is 5 or more.
So, remembering that G(i) = 1
whenever the arithmetic sum x(i) + y(i) > 9, G(i) may be
computed as
$$
G(i) = gg(i) \vee (pp(i) \cdot x_0(i) \cdot y_0(i)). \tag{37}
$$
As Xilinx Virtex-5 LUTs may compute 6-variable func-
tions, then gg(i) and pp(i) may be synthesized using 2 LUTs
in parallel while G(i) and P(i) are computed through an
additional single LUT as shown at Figure 14.
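The split of P(i) and G(i) into pp(i), gg(i), and the least significant bits follows from x + y = 2 · [(x3 x2 x1) + (y3 y2 y1)] + x0 + y0. A short sketch of (36) and (37), with an illustrative function name, is given below.

```python
def pg_direct(x, y):
    """P and G of a BCD digit pair via pp and gg, equations (36) and (37)."""
    upper = (x >> 1) + (y >> 1)      # arithmetic sum of bits 3..1 of x and y
    pp = int(upper == 4)             # 6-input function pp(i)
    gg = int(upper >= 5)             # 6-input function gg(i)
    x0, y0 = x & 1, y & 1
    P = (x0 ^ y0) & pp               # (36)
    G = gg | (pp & x0 & y0)          # (37)
    return P, G


ok = all(pg_direct(x, y) == (int(x + y == 9), int(x + y > 9))
         for x in range(10) for y in range(10))
print(ok)  # True
```

Both pp and gg depend on the same six inputs, so they can be evaluated in parallel before the final combination with x0 and y0.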
7.2. 10’s Complement BCD Carry-Chain Adder-Subtractor.
To compute X + Y, the same algorithm as in Section 7.1 is used.
To compute X − Y, the 10’s complement subtraction
algorithm actually adds (−Y) to X.
7.2.1. 10’s Complement (AS-I). The 10’s complement sign change
algorithm may be implemented through a digitwise 9’s
complement stage followed by an add-1 operation. It can
be shown that the 9’s complement binary components
z3, z2, z1, and z0 of a given BCD digit y3, y2, y1, and y0
are expressed as
$$
z_3 = y'_3 \cdot y'_2 \cdot y'_1; \qquad z_2 = y_2 \oplus y_1; \qquad z_1 = y_1; \qquad z_0 = y'_0. \tag{38}
$$
For a first implementation,
AS-I, Figure 15 presents a 9’s complement implementation
using 6-input/2-output LUTs, available in the Virtex-5 Xilinx
technology. A′/S is the add/subtract control signal; if A′/S =
1 (subtract), formulas in (38) apply; otherwise A′/S = 0 and
zj(i) = yj(i) for all i, j.
The AS-I circuit is similar to the Ad-I (Figures 8 and 13)
using, instead of input Y , the input Z as produced by the
circuit of Figure 15.
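The conditional 9's complement stage can be modeled bit by bit from (38). In this sketch the function name `nines_complement_stage` and the argument name `sub` (standing for the control signal A′/S) are illustrative.

```python
def nines_complement_stage(y, sub):
    """Return Z for a BCD digit y: 9 - y if sub = 1, y itself if sub = 0."""
    y0, y1, y2, y3 = [(y >> j) & 1 for j in range(4)]
    if sub:
        # Formulas (38)
        z3 = (1 - y3) & (1 - y2) & (1 - y1)
        z2 = y2 ^ y1
        z1 = y1
        z0 = 1 - y0
    else:
        z3, z2, z1, z0 = y3, y2, y1, y0
    return (z3 << 3) | (z2 << 2) | (z1 << 1) | z0


print([nines_complement_stage(y, 1) for y in range(10)])
# [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```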
7.2.2. Improving the Adder Stage. To avoid the delay pro-
duced by the 9’s complement step, this operation may be
carried out within the first binary adder stage, as depicted
in Figure 16, where p(i) and g(i) are computed as
$$
\begin{aligned}
p_0(i) &= x_0(i) \oplus y_0(i) \oplus (A'/S), \\
p_1(i) &= x_1(i) \oplus y_1(i), \\
p_2(i) &= x_2(i) \oplus y_2(i) \oplus y_1(i) \cdot (A'/S), \\
p_3(i) &= x_3(i) \oplus \big( y_3(i)' \cdot y_2(i)' \cdot y_1(i)' \big) \cdot (A'/S) \oplus y_3(i) \cdot (A'/S)', \\
g_k(i) &= x_k(i), \quad \forall k.
\end{aligned}
\tag{39}
$$
Figure 10: FPGA implementation of an N-digit BCD Adder.
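Expressions (39) fold the conditional 9's complement into the propagate bits of the first binary adder stage, so that each p_k equals x_k ⊕ z_k with z given by (38) in subtract mode and z = y in add mode. The sketch below evaluates (39) directly; the function name and the argument `sub` (for A′/S) are illustrative.

```python
def p_bits_39(x, y, sub):
    """Propagate bits [p0, p1, p2, p3] of equation (39)."""
    x0, x1, x2, x3 = [(x >> j) & 1 for j in range(4)]
    y0, y1, y2, y3 = [(y >> j) & 1 for j in range(4)]
    add = 1 - sub                                   # (A'/S)'
    p0 = x0 ^ y0 ^ sub
    p1 = x1 ^ y1
    p2 = x2 ^ y2 ^ (y1 & sub)
    p3 = x3 ^ (((1 - y3) & (1 - y2) & (1 - y1)) & sub) ^ (y3 & add)
    return [p0, p1, p2, p3]


# Subtract mode: 7 and the 9's complement of 2 (which is 7) agree in every bit.
print(p_bits_39(7, 2, 1))  # [0, 0, 0, 0]
```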
7.2.3. Carry-Chain Stage Computing G and P Directly from
the Input Data (AS-II). As far as addition is concerned,
the P and G functions may be implemented according to
formulas (36) and (37). The idea of the AS-II is computing
the corresponding functions in the subtract mode and then
multiplexing according to the add/subtract control signal
A′/S. For this reason, assuming that the operation at hand
is X + (±Y), one defines on one hand ppa(i) and gga(i)
according to Section 7.1, that is, using the straight values of
Y’s BCD components.
Figure 11: Carry generation (GP b architecture).
Figure 12: Carry generation (GP c architecture).
On the other hand, pps(i) and ggs(i) are defined accord-
ing to the same Section 7.1 but using zk(i) as computed
by the 9’s complement circuit shown in Figure 15. As the zk(i)
are expressed from the yk(i) by (38), both pps(i) and ggs(i)
may be computed directly from xk(i) and yk(i) as shown
in Figure 17. Nevertheless, for subtraction, the computation
of z0(i) = y′0(i) is carried out at the output LUT level. So
formulas (36) and (37) are then expressed as
$$
\begin{aligned}
P(i) &= \big( x_0(i) \oplus y_0(i) \oplus (A'/S) \big) \cdot pp(i), \\
G(i) &= gg(i) \vee \big( pp(i) \cdot x_0(i) \cdot ( y_0(i) \oplus (A'/S) ) \big).
\end{aligned}
\tag{40}
$$
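The AS-II carry-chain stage can be modeled arithmetically as follows. The sketch computes the add-mode and subtract-mode candidates, selects between them with the control signal, and then applies (40). The function name `pg_as2` and the argument `sub` (for A′/S) are illustrative, and the upper bits of the 9's complement are obtained arithmetically here rather than through the LUT equations.

```python
def pg_as2(x, y, sub):
    """P and G for X + (+/-Y) on one BCD digit pair, following (40)."""
    xu = x >> 1                               # bits 3..1 of x
    yu = y >> 1                               # bits 3..1 of y
    zu = (9 - y) >> 1                         # bits 3..1 of the 9's complement of y
    ppa, gga = int(xu + yu == 4), int(xu + yu >= 5)   # add-mode candidates
    pps, ggs = int(xu + zu == 4), int(xu + zu >= 5)   # subtract-mode candidates
    pp = pps if sub else ppa                  # multiplexing on A'/S
    gg = ggs if sub else gga
    x0, y0 = x & 1, y & 1
    P = (x0 ^ y0 ^ sub) & pp
    G = gg | (pp & x0 & (y0 ^ sub))
    return P, G


print(pg_as2(7, 7, 1))  # (1, 0): 7 + (9 - 7) = 9 propagates
```

Inverting y0 at the output level works because the least significant bit of the 9's complement is simply the complement of y0.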
Figure 13: FPGA carry-chain circuit for Ad-I.
Figure 14: FPGA carry-chain circuit for Ad-II.
8. Experimental Results
8.1. Xilinx Virtex-4 Adder Implementations. The base-10
adders have been implemented on the Xilinx Virtex-4 FPGA
family, speed grade -11 [6]. The synthesis and implementa-
tion have been carried out with XST (Xilinx Synthesis Technol-
ogy) [13] and Xilinx ISE (Integrated System environment)
version 10.1 [14].
Performances of different N-digit BCD adders have been
compared to those of an M-bit binary carry chain adder
(implemented by XST [13] using Xilinx fast carry logic)
covering the same range, that is, with
$$
M = \lceil N \cdot \log_2(10) \rceil \cong 3.322 \times N. \tag{41}
$$
The time and hardware complexities of an M-bit ripple-
carry adder implemented on the same 4-LUT based Xilinx
FPGA are given by
$$
T_{M\text{-bit adder}} = T_{\mathrm{LUT}} + M \cdot T_{\mathrm{mux\text{-}cy}} = T_{\mathrm{LUT}} + 3.322 \times N \times T_{\mathrm{mux\text{-}cy}}, \tag{42}
$$
$$
C_{M\text{-bit adder}} = M \ \text{LUTs} = 3.322 \times N \ \text{LUTs}. \tag{43}
$$
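The binary width M of (41) can be computed exactly with integer arithmetic, avoiding floating-point rounding of the logarithm. The function name `bits_for_digits` is illustrative; the identity used is that ⌈log2(10^N)⌉ equals the bit length of 10^N − 1, because 10^N is never a power of two for N ≥ 1.

```python
def bits_for_digits(n_digits):
    """Smallest M such that 2**M >= 10**n_digits, as in (41)."""
    return (10**n_digits - 1).bit_length()


for n in (8, 16, 32, 64, 128):
    print(n, bits_for_digits(n))
# 8 27
# 16 54
# 32 107
# 64 213
# 128 426
```

These values match the M column reported with the experimental results.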
Formulas (26), (30), (34), and (42) show that, asymptot-
ically, TN-digit adder should be somewhat inferior to TM-bit adder.
Nevertheless, as shown by the experimental results, the
additive values appearing in (26), (30), and (34) are not
negligible for reasonable values of N; so the saving in time
will mainly appear for applications where BCD-to-binary
coding and decoding operations play a significant role in the
overall delay.
Figure 15: FPGA 9’s complement circuit for AS-I.
Figure 16: FPGA implementation of the adder stage for a 10’s complement BCD adder-subtractor.
Post place-and-route time delays and area consumptions
are quoted in Tables 1 and 2, respectively, where N stands for
the number of BCD digits while M stands for the number of
bits required to cover the decimal N-digit range. The results
presented in the table are as follows:
(i) Ripple: Naïve implementation of base-10 ripple-carry
(Section 5.1, Figures 1 and 5),
(ii) PG a: base-10 carry chain using an adder to produce
P-G values (Section 5.2, Figures 2, 6, 8, 9, and 10),
(iii) PG b: base-10 carry chain computing directly P-G
values (Section 6.3, Figures 2 and 11),
(iv) PG c: base-10 carry chain computing directly P-G
values using muxf5 and muxf6 (Section 6.3, Figures
2 and 12),
(v) M-bit Binary: base-2 carry chain adder covering the
same range as an N-digit adder.
Figure 17: FPGA implementation of the carry-chain stage AS-II for BCD adder-subtractor.
Figure 18 shows the delays for the compared adders.
Observe that, for the technology at hand, Table 1 and
Figure 18 suggest that from N = 48 onward the carry-chain decimal
implementation of adders (PG a) is faster than the binary one for the
equivalent range. Furthermore, for small numbers of digits
to add (N ≤ 40) the PG c architecture is faster than the other
decimal implementations.
8.2. Virtex-5 Adder-Subtractor Implementations. The adder-
subtractor circuits have been implemented on the Xilinx Virtex-5
family with speed grade -2 [7]. The synthesis and imple-
mentation have been carried out with XST (Xilinx Synthe-
sis Technology) [13] and Xilinx ISE (Integrated System
environment) version 10.1 [14]. The critical parts were
designed using low-level component instantiation (lut6_2,
muxcy, xorcy, etc.) in order to obtain the desired behavior.
Performances of different N-digit BCD adders have been
compared to those of an M-bit binary carry chain adder
(implemented by XST) covering the same range, that is, such
that $M = \lceil N \cdot \log_2(10) \rceil \cong 3.322 \times N$.
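Table 1: Time delays (ns) for different adders in Virtex-4, speed grade -11. Columns Ripple to PG c are N-digit BCD adders; the last column is the M-bit binary adder.

| N | Ripple | PG a | PG b | PG c | M | M-bit Binary |
|---|---|---|---|---|---|---|
| 8 | 14 | 5.8 | 6.7 | 4.5 | 27 | 2.7 |
| 12 | 20 | 6.0 | 6.8 | 4.8 | 40 | 3.2 |
| 16 | 27 | 6.1 | 7.0 | 5.1 | 54 | 3.7 |
| 24 | 40 | 6.4 | 7.3 | 5.7 | 80 | 4.7 |
| 32 | 53 | 6.7 | 7.6 | 6.3 | 107 | 5.7 |
| 40 | 66 | 7.0 | 7.9 | 6.9 | 133 | 6.6 |
| 48 | 79 | 7.3 | 8.2 | 7.5 | 160 | 7.6 |
| 64 | 105 | 7.9 | 8.7 | 8.7 | 213 | 9.6 |
| 96 | n/a | 9.1 | 9.9 | 11.0 | 319 | 13.5 |
| 128 | n/a | 10.3 | 11.1 | 13.4 | 426 | 17.5 |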
Table 3 exhibits the post place-and-route delays
in ns for the decimal adder implementations Ad-I and
Ad-II of Section 7.1; Table 4 exhibits the delays in ns for
the decimal adder-subtractor implementations AS-I and AS-
II of Section 7.2. Table 5 lists the consumed areas expressed
in terms of 6-input look-up tables (6-input LUTs). The esti-
mated area presented in Table 5 was empirically confirmed.

Figure 18: Delay in ns for different adders in Virtex-4 (horizontal axis: decimal digits; vertical axis: delay in ns; curves: Ripple, PG a, PG b, PG c, Binary).

Table 2: Area (in LUTs) for different adder sizes in Virtex-4.

| Adder circuit | # LUTs |
|---|---|
| Ripple | 8 × N |
| PG a | 10 × N |
| PG b | 18 × N |
| PG c | 16 × N |
| Binary | ⌈3.32 × N⌉ |

Table 3: Delays in ns for decimal and binary adders in Virtex-5, speed grade -2.

| N (BCD digits) | Ad-I | Ad-II | M (bits) | Binary Adder |
|---|---|---|---|---|
| 8 | 3.8 | 3.6 | 27 | 2.1 |
| 16 | 4.4 | 4.0 | 54 | 2.6 |
| 32 | 5.0 | 4.5 | 107 | 3.8 |
| 48 | 5.2 | 5.1 | 160 | 5.1 |
| 64 | 5.6 | 5.2 | 213 | 6.6 |
| 96 | 6.1 | 5.9 | 319 | 8.7 |
Comments. Observe that, for large operands, the decimal
operations are faster than the binary ones.
The area overhead with respect to binary computation is
not negligible. In Virtex-4, the area is between 2.4 and 5.4
times that of an equal-range binary adder. In the 6-input LUT
family Virtex-5, an adder-subtractor is between 3.0 and 3.9
times the size of its binary counterpart.
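The quoted area factors follow directly from the per-digit LUT counts in Tables 2 and 5. The sketch below reproduces that arithmetic; the names are hypothetical, and it assumes the binary adder costs about 3.32 LUTs per decimal digit of range, ignoring any rounding of the binary width to a whole number of bits.

```python
BINARY_LUTS_PER_DIGIT = 3.32  # assumed binary cost per decimal digit of range

# Per-digit LUT counts taken from Table 2 (Virtex-4) and Table 5 (Virtex-5).
VIRTEX4_ADDERS = {"Ripple": 8, "PG a": 10, "PG b": 18, "PG c": 16}
VIRTEX5_ADDER_SUBTRACTORS = {"AS-I": 10, "AS-II": 13}


def area_ratios(luts_per_digit):
    """Decimal-to-binary LUT ratio per circuit; the operand size N cancels out."""
    return {name: round(luts / BINARY_LUTS_PER_DIGIT, 1)
            for name, luts in luts_per_digit.items()}


print(area_ratios(VIRTEX4_ADDERS))
print(area_ratios(VIRTEX5_ADDER_SUBTRACTORS))
```

Under these assumptions the smallest and largest Virtex-4 ratios are 2.4 (Ripple) and 5.4 (PG b), and the Virtex-5 adder-subtractor ratios are 3.0 (AS-I) and 3.9 (AS-II).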
Table 4: Delays in ns for decimal and binary adder-subtractors in Virtex-5 (speed grade -2).
N (BCD digits)   AS-I   AS-II   M (bits)   Binary add-sub
      8           3.8    3.8       27          2.1
     16           4.1    4.0       54          2.6
     32           4.7    5.0      107          3.8
     48           5.3    5.2      160          5.2
     64           5.7    5.5      213          6.6
     96           6.3    6.2      319          8.8
Table 5: Area in 6-input LUTs for different adders and adder-subtractors.
Circuit                      # LUTs
Adder Ad-I                   8 × N
Adder Ad-II                  10 × N
Binary adder                 ⌈3.32 × N⌉
Adder-subtractor AS-I        10 × N
Adder-subtractor AS-II       13 × N
Binary adder-subtractor      ⌈3.32 × N⌉
9. Conclusions
The present interest in BCD arithmetic systems stimulates
further research at both the algorithmic and design levels.
As hardware costs become ever more affordable, full-hardware
BCD units are now very attractive, and their potential is
likely to grow in the near future.
This paper has developed several implementations of BCD
adders and subtractors on FPGA platforms. Experimental
results show good time performance at a reasonable cost
in terms of area. Compared with binary implementations, the
decimal implementations become faster as operand size grows
(break-even at around 50 digits).
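The break-even figure can be cross-checked against the tabulated delays. The sketch below is a rough estimate, not the authors' procedure: it assumes the delay varies linearly between adjacent table rows, and the function and variable names are hypothetical. The data are copied from Table 1 (Virtex-4) and Table 3 (Virtex-5).

```python
def break_even(digits, decimal_ns, binary_ns):
    """Estimate the operand size at which the decimal delay drops to the binary delay.

    Uses linear interpolation between neighbouring table rows and returns None
    if the decimal adder never catches up within the tabulated range.
    """
    previous = None
    for n, dec, bin_ in zip(digits, decimal_ns, binary_ns):
        gap = dec - bin_                      # positive while binary is still faster
        if gap <= 0:
            if previous is None:
                return float(n)               # decimal already as fast at the first row
            n0, gap0 = previous
            return n0 + (n - n0) * gap0 / (gap0 - gap)
        previous = (n, gap)
    return None


# Delays in ns from Table 1 (Virtex-4).
N_V4 = [8, 12, 16, 24, 32, 40, 48, 64, 96, 128]
PG_A = [5.8, 6.0, 6.1, 6.4, 6.7, 7.0, 7.3, 7.9, 9.1, 10.3]
PG_C = [4.5, 4.8, 5.1, 5.7, 6.3, 6.9, 7.5, 8.7, 11.0, 13.4]
BIN_V4 = [2.7, 3.2, 3.7, 4.7, 5.7, 6.6, 7.6, 9.6, 13.5, 17.5]

# Delays in ns from Table 3 (Virtex-5).
N_V5 = [8, 16, 32, 48, 64, 96]
AD_I = [3.8, 4.4, 5.0, 5.2, 5.6, 6.1]
AD_II = [3.6, 4.0, 4.5, 5.1, 5.2, 5.9]
BIN_V5 = [2.1, 2.6, 3.8, 5.1, 6.6, 8.7]

for name, estimate in [
    ("PG a (Virtex-4)", break_even(N_V4, PG_A, BIN_V4)),
    ("PG c (Virtex-4)", break_even(N_V4, PG_C, BIN_V4)),
    ("Ad-I (Virtex-5)", break_even(N_V5, AD_I, BIN_V5)),
    ("Ad-II (Virtex-5)", break_even(N_V5, AD_II, BIN_V5)),
]:
    print(f"{name}: break-even near {estimate:.1f} digits")
```

With this interpolation the four estimates fall between roughly 44 and 50 digits, which is consistent with the break-even stated above.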
One of the key points about delay is that the
carry-propagation computation remains binary; a faster
carry-chain circuit can therefore be designed because,
for the same operand range, the number of digits (and hence
of carries to propagate) is lower in decimal than in binary.
In the carry-chain structures studied in this paper, the
propagate (P) and generate (G) functions are more complex,
and thus more time- and area-consuming, than in binary
adders; consequently, the speed improvements only appear
for large enough operands. The break-even point is
technology dependent, so it could be expected to occur for a
smaller number of digits in the near future.
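The trade-off described here can be illustrated with a small behavioural model. It is a software sketch under stated assumptions, not the circuits of this paper: it uses the usual decimal definitions (a digit position generates a carry when the digit sum exceeds 9 and propagates one when the sum equals 9), and the names `to_digits` and `bcd_add` are hypothetical. In hardware the P and G values of all digits are computed in parallel and only the one-bit carry chain is sequential, with one stage per decimal digit.

```python
def to_digits(value, n_digits):
    """Split a non-negative integer into n_digits decimal digits, least significant first."""
    return [(value // 10 ** i) % 10 for i in range(n_digits)]


def bcd_add(x_digits, y_digits, carry_in=0):
    """Add two equal-length decimal digit lists (least significant digit first)."""
    if len(x_digits) != len(y_digits):
        raise ValueError("operands must have the same number of digits")
    carry = carry_in
    result = []
    for x, y in zip(x_digits, y_digits):
        s = x + y                       # binary sum of two BCD digits, 0..18
        generate = s > 9                # G: this digit produces a carry by itself
        propagate = s == 9              # P: this digit passes an incoming carry on
        result.append((s + carry) % 10)           # final mod 10 reduction
        carry = 1 if (generate or (propagate and carry == 1)) else 0  # one chain stage
    return result, carry


digits, carry_out = bcd_add(to_digits(999, 3), to_digits(1, 3))
print(digits, carry_out)   # three propagate/generate stages instead of ten binary ones
```

A 3-digit operand needs three carry stages here, whereas a binary adder covering the same range needs ten; the price is the more complex P, G, and mod 10 logic inside each stage.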
The area overhead with respect to binary computation
is not negligible; it is around five times in Virtex-4 and
nearly four times in Virtex-5. That is mainly due to the more
complex definition of the carry propagate and carry generate
functions and to the final mod 10 reduction. The decreasing
costs of technology make hardware consumption less central.
For BCD addition, the performance considerations on the
Xilinx Virtex-5 platform are similar to those for the
4-input LUT-based Virtex-4 technology. That is, BCD addition
remains faster than its binary counterpart under the same
conditions.
Finally, the BCD adder/subtractor, with a relatively small
penalty in area, presents time performances quite similar to
those of a straight BCD adder.
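One way to see why the penalty is small is that 10's complement subtraction reuses the adder unchanged: each digit of the second operand is replaced by its 9's complement and the carry-in is set to 1. The model below is a generic textbook sketch on unsigned N-digit values, not the AS-I or AS-II circuits; the name `add_sub` and its parameters are hypothetical.

```python
def add_sub(x, y, n_digits, subtract=False):
    """Return ((x + y) or (x - y)) modulo 10**n_digits together with the carry-out."""
    carry = 1 if subtract else 0        # the "+1" that turns 9's into 10's complement
    result = 0
    for i in range(n_digits):
        xd = (x // 10 ** i) % 10
        yd = (y // 10 ** i) % 10
        if subtract:
            yd = 9 - yd                 # digit-wise 9's complement of the subtrahend
        s = xd + yd + carry
        result += (s % 10) * 10 ** i    # decimal sum digit
        carry = s // 10                 # decimal carry to the next digit
    return result, carry


print(add_sub(1234, 567, 4))                  # addition
print(add_sub(1234, 567, 4, subtract=True))   # subtraction, non-negative result
print(add_sub(567, 1234, 4, subtract=True))   # negative result in 10's complement form
```

In this unsigned model a carry-out of 1 on subtraction indicates a non-negative difference, while a carry-out of 0 means the result is the 10's complement of the magnitude (9333 for -667 with four digits).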
Acknowledgments
This work is supported by the Universities FASTA (Mar del
Plata, Argentina), UNCPBA (Tandil, Argentina), UAM (Madrid,
Spain), and URV (Tarragona, Spain); it has been partially
funded by the CICYT of Spain under contract
TEC2007-68074-C02-02/MIC.
References
[1] M. F. Cowlishaw, “Decimal floating-point: algorism for
computers,” in Proceedings of the 16th IEEE Symposium on
Computer Arithmetic, pp. 104–111, June 2003.
[2] G. Jaberipur and A. Kaivani, “Binary-coded decimal digit
multipliers,” IET Computers and Digital Techniques, vol. 1, no.
4, pp. 377–381, 2007.
[3] M. Cowlishaw, “General Decimal Arithmetic,”
http://speleotrove.com/decimal/.
[4] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and
S. R. Carlough, “The IBM Z900 decimal arithmetic unit,” in
Proceedings of the Asilomar Conference on Signals, Systems and
Computers, vol. 2, pp. 1335–1339, November 2001.
[5] IEEE Standard for Floating-Point Arithmetic (IEEE 754), IEEE,
2008.
[6] Xilinx Inc., “Virtex-4 User Guide,” April 2007,
http://www.xilinx.com/.
[7] Xilinx Inc., “Virtex-5 User Guide,” 2008,
http://www.xilinx.com/.
[8] J.-P. Deschamps, G. Bioul, and G. Sutter, Synthesis of
Arithmetic Circuits: FPGA, ASIC and Embedded Systems, John Wiley
& Sons, New York, NY, USA, 2006.
[9] B. Parhami, Computer Arithmetic: Algorithms and Hardware
Designs, Oxford University Press, Oxford, UK, 2000.
[10] Xilinx Inc., Xilinx, http://www.xilinx.com/.
[11] “Decimal Arithmetic in FPGA,”
http://arithmetic-circuits.org/decimal/.
[12] Xilinx Inc., Constraints Guide—ISE9.2i, chapter 2, Relative
Location (RLOC), 2008.
[13] Xilinx Inc., “XST User Guide-10.1i,” 2008,
http://www.xilinx.com/.
[14] Xilinx Inc., “ISE 10.1 Documentation,” 2008,
http://www.xilinx.com/.