Fast parallel-prefix architectures for modulo $2^n - 1$ addition with a single representation of zero by Patel RA et al.
Fast Parallel-Prefix Architectures for
Modulo $2^n - 1$ Addition
with a Single Representation of Zero
Riyaz A. Patel, Mohammed Benaissa, Senior Member, IEEE, and
Said Boussakta, Senior Member, IEEE
Abstract: Novel modulo $2^n - 1$ addition algorithms for residue number system (RNS) applications are presented. The proposed algorithms depart from the traditional approach of modulo $2^n - 1$ addition by setting the input carry in the first stage of the addition to one, which only ever produces one representation of zero. The resulting architectures not only offer significant speedup in a modulo $2^n - 1$ addition, but they can also offer a reduction in area and thus provide improvements in the cost functions area $\times$ delay$^2$ and energy $\times$ delay. The superiority of these architectures is validated through back-annotated VLSI designs using 130 nm CMOS technology.
Index Terms: Modulo $2^n - 1$ adders, one's complement adders, parallel-prefix adders, computer arithmetic, VLSI design.
1 INTRODUCTION
Circuits for modulo $2^n - 1$, or, equivalently, one's complement, addition play an important role in various applications. The corresponding architectures are desirable for the efficient implementation of bespoke digital signal processing (DSP) and communications systems using the residue number system (RNS) [1], [2]. They are also used in fault-tolerant computer systems [3], in error detection in computer networks [4], and in floating-point arithmetic [5]. In general, modulo $2^n - 1$ addition is performed using a carry propagate adder (CPA) with end-around-carry (EAC), which produces a double representation of zero, that is, both the positive zero representation $\{00 \ldots 0\}$ and the negative zero representation $\{11 \ldots 1\}$ are produced [6]. For correct functionality, the single representation of zero is not necessarily required during the intermediate stages of computation in an RNS circuit. Furthermore, single zero correction adds complexity to the design. However, if the all 1s representation of zero is encountered during computation, then additional unnecessary signal activity results in stages following the addition as the data is autonomously brought back into the dynamic range of the modulus. This phenomenon increases the amount of switched capacitive power in the corresponding circuit. In addition, single-zero-representing modulo $2^n - 1$ adders are needed in certain residue-to-binary converters for efficient moduli sets [7], [8]. Hence, investigations leading to optimized single-zero-representing modulo $2^n - 1$ adders are required for the optimized implementation of bespoke DSP and communications systems using RNS.
In RNS, an integer $A \in [0, \prod_{i=0}^{N-1} m_i - 1]$ is uniquely represented by a vector of residues $\{a_0, a_1, \ldots, a_{N-1}\}$, where $a_i$ is the residue of $A$ with respect to the modulus $m_i$, that is, $a_i = |A|_{m_i}$ for $i \in [0, N-1]$. The arithmetic operation $C = A \diamond B$, where the $\diamond$-operator represents addition, subtraction, or multiplication, is isomorphic to $C = \{c_0, c_1, \ldots, c_{N-1}\}$, where $c_i = |a_i \diamond b_i|_{m_i}$ for $i \in [0, N-1]$. As $c_i$ is solely dependent upon $a_i$ and $b_i$, the $\diamond$ operation is performed in parallel with no interaction between the RNS channels, which leads to fast arithmetic implementations. Moreover, aside from offering improved speed performance, the parallelization also unfolds design choices that result in low-power DSP implementations [9]. The use of modulo $2^n - 1$ arithmetic for RNS is attractive due to its relatively low complexity [10], [11] and due to the existence of efficient residue decoders for moduli sets that include moduli of this form [7], [8], [12].
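The channel independence described above can be illustrated with a short Python sketch. The moduli set (7, 15, 31) and the function names `to_rns` and `rns_op` are assumptions chosen for illustration; they are not taken from the paper.

```python
from math import prod

# Hypothetical moduli set of the form 2^n - 1 (pairwise coprime): 7, 15, 31.
MODULI = (7, 15, 31)
M = prod(MODULI)  # dynamic range: integers in [0, M - 1]

def to_rns(a, moduli=MODULI):
    """Return the residue vector {a_0, ..., a_{N-1}} with a_i = |A|_{m_i}."""
    return tuple(a % m for m in moduli)

def rns_op(ra, rb, op, moduli=MODULI):
    """Apply op channel by channel; no channel reads another channel's data."""
    return tuple(op(a, b) % m for a, b, m in zip(ra, rb, moduli))

A, B = 123, 45
sum_rns = rns_op(to_rns(A), to_rns(B), lambda a, b: a + b)
prod_rns = rns_op(to_rns(A), to_rns(B), lambda a, b: a * b)
```

Each channel only needs a small modulo adder or multiplier, which is where a fast modulo $2^n - 1$ adder would be used.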
Several researchers have addressed the problem of modulo $2^n - 1$ addition in the literature. In [6], one and two-level carry-look-ahead (CLA) modulo $2^n - 1$ adder designs are presented. Thereafter, the majority of contributions have been based on the faster and more regular parallel-prefix carry computation trees [13], [14], [15], [16], [17], [10]. In [13] and [14], the EAC approach is used, where the output carry from the integer addition is added via an additional prefix level. Although this approach suffers from a fan-out loading equal to $n$, it provides a good trade-off performance for small and medium operand widths. The parallel-prefix algorithm for $n = 2^l$ in [15] utilizes the idea of carry recirculation (CR) at each prefix level. The resulting adders have the desired properties of minimum logic depth and bounded fan-out. Although fast, the corresponding adders suffer from an increase in area overhead. The work in [15] was extended in [16] to enable the design of modulo $2^n - 1$ adders for every $n$. As a compromise between the unbounded fan-out loading of EAC adders in [13] and [14] and the increased area complexity of CR adders in [16] and [15], the authors in [17] proposed area $\times$ delay$^2$ efficient modulo $2^n - 1$ adders using select-prefix blocks. These adders are suitable for medium and large operand widths only. In [10], a whole family of minimum logic depth parallel-prefix modulo $2^n - 1$ adders was presented which reduced the area overhead of [15] and [16] at the cost of increased fan-out. Note that all of the adders mentioned above can be implemented such that the all ones representation of zero $\{11 \ldots 1\}$ never occurs.
1484 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 11, NOVEMBER 2007
• R.A. Patel and M. Benaissa are with the Department of Electronic and Electrical Engineering, University of Sheffield, Mappin Building, Mappin Street, Sheffield S1 3JD, UK. E-mail: {r.a.patel, m.benaissa}@sheffield.ac.uk.
• S. Boussakta is with the Department of Electronic Engineering, University of Newcastle, Newcastle, NE1 7RU, UK. E-mail: s.boussakta@newcastle.ac.uk.
Manuscript received 13 May 2005; revised 21 Apr. 2006; accepted 15 May 2007; published online 11 June 2007.
Recommended for acceptance by P. Gibbons.
For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-0156-0505.
Digital Object Identifier no. 10.1109/TC.2007.70750.
0018-9340/07/$25.00 © 2007 IEEE Published by the IEEE Computer Society
In this paper, novel modulo $2^n - 1$ addition algorithms for single zero representation are introduced. The proposed class of algorithms is based on initially computing an $n$-bit integer addition with the input carry set to one. As in [10], [15], [16], the resulting architectures cyclically feed back generate and propagate signals at each stage in the adder. The difference, however, is that the proposed adders are not exposed to the additional delay introduced when the adders in [10], [15], [16] are modified for a single zero correction. Post-place-and-route back-annotated designs using 130 nm CMOS technology show that the proposed architectures are significantly faster and provide improved trade-offs in area $\times$ delay$^2$ ($AD^2$) and energy $\times$ delay ($ED$) products when compared to previously reported parallel-prefix modulo $2^n - 1$ adders. Note that all contributions on modulo $2^n - 1$ adders in the literature have only considered performance in the delay-area space. This paper provides an additional dimension to the modulo $2^n - 1$ adder complexity evaluation space by also including power estimation information for all adders.
2 PRELIMINARIES
The key to fast radix-2 addition of two $n$-digit (or $n$-bit) operands $X = \{x_{n-1} x_{n-2} \ldots x_0\}$ and $Y = \{y_{n-1} y_{n-2} \ldots y_0\}$, with an input carry $c_{in}$, is in the reduction of the latency in the carry network. In this regard, the actual operand bits are not important. What is important is whether a carry is generated or propagated from a given bit position. In order to implement such an adder, the architecture can be divided into three distinct stages. In the first stage, which is also known as the preprocessing stage, carry generate ($g_i$), carry propagate ($p_i$), and partial sum ($ps_i$) bits are computed for every bit $i$, $i \in [0, n-1]$, according to (1), where $\cdot$, $\vee$, and $\oplus$ represent the logical operators AND, OR, and XOR, respectively:

$$g_i = \begin{cases} x_0 \cdot y_0 \vee c_{in} \cdot (x_0 \vee y_0) & \text{if } i = 0 \\ x_i \cdot y_i & \text{otherwise,} \end{cases} \qquad p_i = x_i \vee y_i, \qquad ps_i = x_i \oplus y_i. \tag{1}$$
The second stage computes the carries $c_i$, $i \in [1, n-1]$, as well as the output carry $c_{out}$. In the final stage, the sum $S = \{s_{n-1} s_{n-2} \ldots s_0\}$ is computed according to

$$s_i = ps_i \oplus c_i \quad \text{for } i \in [0, n-1], \tag{2}$$

where $c_0 = c_{in}$. As the first and the third stages are relatively simple, the main problem in adder design lies in the fast generation of the $n$ carries. One possible method of reducing the latency of carry computation is by considering it as a prefix problem [18]. In a prefix problem, $n$ inputs, $\{e_{n-1}, e_{n-2}, \ldots, e_0\}$, and an arbitrary associative operator, $\circ$, are used to produce $n$ outputs $\{f_{n-1}, f_{n-2}, \ldots, f_0\}$ according to the relation $f_i = e_i \circ e_{i-1} \circ \ldots \circ e_0$, where $i \in [0, n-1]$. For prefix addition, the $\circ$ operator is defined as [18]

$$(g, p) \circ (g', p') = (g \vee p \cdot g', \; p \cdot p'). \tag{3}$$

With the carry generate and propagate terms defined as in (1), the carry into bit $i$, $c_i$, is given by $G_{i-1,0}$, where

$$(G_{i-1,0}, P_{i-1,0}) = \begin{cases} (g_0, p_0) & \text{if } i = 1 \\ (g_{i-1}, p_{i-1}) \circ (G_{i-2,0}, P_{i-2,0}) & \text{if } i \in (1, n). \end{cases} \tag{4}$$

Note that, in general, $G_{i,k}$ and $P_{i,k}$ denote the group generate and propagate signals over bit positions $i$ to $k$, where $0 \leq k \leq i$.
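The three stages and the recurrence in (4) can be sketched in Python as follows. This is a minimal bit-level model for the case $c_{in} = 0$, not a hardware description; the function names are illustrative assumptions.

```python
def preprocess(x, y, n):
    """Stage 1 for cin = 0: per-bit (g_i, p_i) pairs and partial sums ps_i."""
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    gp = [(a & b, a | b) for a, b in zip(xb, yb)]
    ps = [a ^ b for a, b in zip(xb, yb)]
    return gp, ps

def prefix_op(hi, lo):
    """(g, p) o (g', p') = (g OR p AND g', p AND p'), as in (3)."""
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)

def add_serial_prefix(x, y, n):
    """n-bit addition with cin = 0 using the serial recurrence (4)."""
    gp, ps = preprocess(x, y, n)
    carries = [0]                       # c_0 = cin = 0
    group = gp[0]                       # (G_{0,0}, P_{0,0})
    for i in range(1, n):
        carries.append(group[0])        # c_i = G_{i-1,0}
        group = prefix_op(gp[i], group) # extend the group to bit i
    cout = group[0]                     # G_{n-1,0}
    s = sum((ps[i] ^ carries[i]) << i for i in range(n))
    return s, cout
```

Because the operator is associative, the same carries can be obtained with any bracketing of the chain, which is what parallel-prefix structures exploit.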
Many prefix structures in the literature parallelize the
computations of (4). Each of these structures results in a
parallel-prefix adder with a different depth (or critical path)
and size to the others and thus offers alternative power,
delay, and area trade-offs. The algorithms proposed by
Ladner and Fischer (LF) [18] and Kogge and Stone (KS) [19]
are the end cases of a family of parallel-prefix adders that
share the attractive property of processing all of the carries
using minimum logic depth prefix structures [20]. The
prefix structures for these adders are shown in Fig. 1, for
$n = 8$, where the black nodes represent the $\circ$ prefix operator
and the white nodes represent buffering nodes.
Fig. 1. Parallel-prefix structures by (a) LF and (b) KS.
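As an illustration of a minimum logic depth structure, the following Python sketch models the standard Kogge-Stone recurrence, in which the combining distance doubles at each level. It is a behavioral sketch only; the function names are assumptions and the code does not model fan-out or wiring.

```python
def kogge_stone(gp):
    """Return ([(G_{i,0}, P_{i,0}) for each bit i], number of prefix levels)."""
    cur = list(gp)
    n = len(cur)
    dist, levels = 1, 0
    while dist < n:
        nxt = list(cur)                 # positions below dist act as buffers
        for i in range(dist, n):
            g, p = cur[i]
            g_lo, p_lo = cur[i - dist]
            nxt[i] = (g | (p & g_lo), p & p_lo)   # the prefix operator of (3)
        cur, dist, levels = nxt, dist * 2, levels + 1
    return cur, levels

def add_kogge_stone(x, y, n):
    """n-bit addition with cin = 0; carries come from the Kogge-Stone tree."""
    gp = [(((x >> i) & (y >> i)) & 1, ((x >> i) | (y >> i)) & 1)
          for i in range(n)]
    groups, _ = kogge_stone(gp)
    carries = [0] + [groups[i - 1][0] for i in range(1, n)]  # c_i = G_{i-1,0}
    s = sum(((((x >> i) ^ (y >> i)) & 1) ^ carries[i]) << i for i in range(n))
    return s, groups[n - 1][0]
```

For $n = 8$ the loop runs three times, matching the $\log_2 n$ prefix levels of a minimum-depth structure.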
3 NOVEL PARALLEL-PREFIX MODULO $2^n - 1$
ADDERS WITH SINGLE ZERO REPRESENTATION
The algorithms and the corresponding architectures for the
proposed parallel-prefix modulo $2^n - 1$ adder will be
described in this section.
3.1 New Set of Algorithms for Fast Carry
Computation in Modulo $2^n - 1$ Addition with
Single Zero Representation
Modulo $2^n - 1$ addition with single zero representation can be formulated as [13]

$$Z = |X + Y|_{2^n - 1} = \begin{cases} |X + Y + 1|_{2^n} & \text{if } X + Y + 1 \geq 2^n \\ X + Y & \text{otherwise,} \end{cases} \tag{5}$$

where $X$, $Y$, and $Z$ are all $n$-bit unsigned integers less than $2^n - 1$. There are two possible methods for implementing (5). One method is to compute the two additions $X + Y$ and $X + Y + 1$ in parallel and thereafter use a multiplexer circuit to choose the correct result. If the output carry from the addition $X + Y + 1$ equals one, then the multiplexer chooses the corresponding $n$-bit sum. Otherwise, the $n$-bit sum of $X + Y$ is chosen. This approach is slower and larger compared to the state of the art in modulo $2^n - 1$ adder design and is thus not considered. An alternative method is to compute $X + Y + 1$ and then modify the result to produce the correct sum. This method is akin to setting the input carry to one for the addition $X + Y$. At first sight, this approach appears to be slower, as an $n$-bit addition is followed by what effectively is a conditional decrement operation. However, subsequent derivations will show that, by cyclically feeding back the carry generate and propagate signals at each prefix level in the adder, the proposed adders offer a significant improvement in addition latency over existing designs. Note that, by setting $c_{in}$ to zero, modulo $2^n - 1$ addition is performed by implementing [13]

$$Z = |X + Y|_{2^n - 1} = |X + Y + c_{out}|_{2^n}, \tag{6}$$

where $c_{out}$ is the output carry from the integer addition. In this case, the all 1s representation of zero is encountered and corrective hardware is thus required to accommodate single zero representation. The traditional method for implementing modulo $2^n - 1$ addition is based on (6) and we shall see that the additional correction results in inferior performance when compared to the proposed adders.
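The difference between the two formulations can be seen in a small integer-level Python sketch. The function names are illustrative assumptions; the code models the arithmetic of (5) and (6), not the adder hardware.

```python
def add_mod_eac(x, y, n):
    """Equation (6): integer add with cin = 0, then add cout back in."""
    mask = (1 << n) - 1
    t = x + y
    cout = t >> n
    return (t + cout) & mask

def add_mod_single_zero(x, y, n):
    """Equation (5): use X + Y + 1 only when it produces an output carry."""
    mask = (1 << n) - 1
    t1 = x + y + 1
    return t1 & mask if t1 >> n else x + y
```

With $n = 4$, the operands 7 and 8 sum to $15 = 2^4 - 1$. The end-around-carry form (6) returns the all 1s pattern 15, whereas (5) returns 0, so only the latter yields a single representation of zero.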
When $c_{in} = 1$, the carries $c^1_i$ can be computed as

$$\begin{aligned} c^1_i &= G_{i-1,0} \vee P_{i-1,0} = c_i \vee P_{i-1,0} \\ &= G_{i-1,1} \vee P_{i-1,1} \cdot g_0 \vee P_{i-1,0} \\ &= G_{i-1,1} \vee P_{i-1,1} \cdot p_0, \end{aligned} \tag{7}$$

where the superscript 1 is used to denote signals corresponding to the case $c_{in} = 1$. The absence of the superscript in a signal notation represents signals that correspond to the case $c_{in} = 0$. Knowing (from (5)) that the carries for the modulo sum, $c^m_i$, are equal to the carries $c_i$ when the output carry for the addition $X + Y + 1$, $c^1_{out}$, equals zero, the correct carries for the modulo $2^n - 1$ addition are given by

$$\begin{aligned} c^m_i &= c_i \cdot \overline{c^1_{out}} \vee (c_i \vee P_{i-1,0}) \cdot c^1_{out} \\ &= c_i \vee P_{i-1,0} \cdot c^1_{out} = G_{i-1,0} \vee P_{i-1,0} \cdot c^1_{out}, \end{aligned} \tag{8}$$

where $i \in (0, n)$. Equation (8) implies that one possible method of computing carries for modulo $2^n - 1$ addition for single zero representation is to add an additional row of prefix operators to the output of a parallel-prefix carry computation tree. This approach is similar to Zimmermann's proposal in [13], with the difference that the reentering carry in our case is the output carry generated when $c_{in} = 1$. Note that the critical path for computing the carries in (8) is theoretically equal to the carry computation critical path for double zero representation in [13]. Thus, when taking into consideration a single zero correction in [13], the modulo $2^n - 1$ adder designed according to (8) reduces the critical path by one logic level and thus results in a faster design.
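Equation (8) can be checked exhaustively for small word lengths with the following Python sketch. The helper names are assumptions; group terms are computed by a direct serial recurrence rather than by a prefix tree, and bit 0 is given the reentering carry $c^1_{out}$ as its input carry.

```python
def group_gp(x, y, hi, lo):
    """Return (G_{hi,lo}, P_{hi,lo}) with g_i = x_i AND y_i, p_i = x_i OR y_i."""
    G, P = 0, 1                      # identity element of the prefix operator
    for i in range(lo, hi + 1):      # fold upward from bit lo to bit hi
        a, b = (x >> i) & 1, (y >> i) & 1
        g, p = a & b, a | b
        G, P = g | (p & G), p & P
    return G, P

def add_mod_eq8(x, y, n):
    """Modulo 2^n - 1 sum built from the carries of (8)."""
    G_all, P_all = group_gp(x, y, n - 1, 0)
    c1_out = G_all | P_all           # output carry of X + Y + 1, from (7) with i = n
    z = 0
    for i in range(n):
        if i == 0:
            cm = c1_out              # bit 0 receives the reentering carry
        else:
            G, P = group_gp(x, y, i - 1, 0)
            cm = G | (P & c1_out)    # equation (8)
        ps = ((x >> i) ^ (y >> i)) & 1
        z |= (ps ^ cm) << i
    return z
```

For all operand pairs below $2^n - 1$ the result equals $(X + Y) \bmod (2^n - 1)$, including the cases where $X + Y = 2^n - 1$, which return 0.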
By expanding the logic for $c^1_{out}$ in (8) (using (7)), further speedup of carry computation is also possible. Consider the following:

$$c^m_i = G_{i-1,0} \vee P_{i-1,0} \cdot (G_{n-1,1} \vee P_{n-1,1} \cdot p_0). \tag{9}$$

As $P_{i-1,0} \cdot p_0 = P_{i-1,0}$, (9) becomes

$$c^m_i = G_{i-1,0} \vee P_{i-1,0} \cdot (G_{n-1,1} \vee P_{n-1,1}), \tag{10}$$

where

$$G_{n-1,1} \vee P_{n-1,1} = G_{n-1,2} \vee P_{n-1,2} \cdot (g_1 \vee p_1) = G_{n-1,2} \vee P_{n-1,1}. \tag{11}$$

Thus,

$$\begin{aligned} c^m_i &= G_{i-1,0} \vee P_{i-1,0} \cdot (G_{n-1,2} \vee P_{n-1,1}) \\ &= G_{i-1,0} \vee P_{i-1,0} \cdot (G_{n-1,i} \vee P_{n-1,i} \cdot G_{i-1,2} \vee P_{n-1,1}) \\ &= G_{i-1,0} \vee P_{i-1,0} \cdot (G_{n-1,i} \vee P_{n-1,i} \cdot (G_{i-1,2} \vee P_{i-1,1})). \end{aligned} \tag{12}$$

As $P_{i-1,0} \cdot P_{i-1,1} = P_{i-1,0}$, (12) becomes

$$\begin{aligned} c^m_i &= G_{i-1,0} \vee P_{i-1,0} \cdot (G_{n-1,i} \vee P_{n-1,i} \cdot (G_{i-1,2} \vee 1)) \\ &= G_{i-1,0} \vee P_{i-1,0} \cdot (G_{n-1,i} \vee P_{n-1,i}). \end{aligned} \tag{13}$$

In contrast to the recursive algorithms for a single zero representation presented in [10], [16], [15], the relation in (13) demonstrates that propagate terms computed using an XOR gate at level 0 are not required in the implementation of modulo $2^n - 1$ adders for single zero representation. This will lead to a slightly faster implementation with no additional area overhead.
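The passage from (12) to (13) is compact, so the intermediate reasoning may be worth spelling out. The following is a sketch of those steps, assuming $i$ is large enough for every index range shown to be non-empty:

```latex
\begin{align*}
G_{n-1,2} &= G_{n-1,i} \vee P_{n-1,i} \cdot G_{i-1,2},
  \qquad P_{n-1,1} = P_{n-1,i} \cdot P_{i-1,1}, \\
G_{n-1,2} \vee P_{n-1,1}
  &= G_{n-1,i} \vee P_{n-1,i} \cdot (G_{i-1,2} \vee P_{i-1,1}), \\
P_{i-1,0} \cdot P_{n-1,i} \cdot (G_{i-1,2} \vee P_{i-1,1})
  &= P_{n-1,i} \cdot (P_{i-1,0} \cdot G_{i-1,2} \vee P_{i-1,0})
   = P_{i-1,0} \cdot P_{n-1,i}.
\end{align*}
```

The first line splits the group terms at bit $i$, the second substitutes them into the bracket of (12), and the third uses $P_{i-1,0} \cdot P_{i-1,1} = P_{i-1,0}$ followed by absorption, which is why the bracket reduces to $G_{n-1,i} \vee P_{n-1,i}$ once it is multiplied by $P_{i-1,0}$.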
In order to obtain even faster designs, (13) can be further simplified by observing from (11) that $G_{n-1,i} \vee P_{n-1,i} = G_{n-1,i+1} \vee P_{n-1,i}$. Therefore,

$$c^m_i = G_{i-1,0} \vee P_{i-1,0} \cdot (G_{n-1,i+1} \vee P_{n-1,i+1} \cdot p_i), \tag{14}$$

which, in terms of the $\circ$ prefix operator, can be rewritten as

$$c^m_i = (g_{i-1}, p_{i-1}) \circ (g_{i-2}, p_{i-2}) \circ \ldots \circ (g_0, p_0) \circ (g_{n-1}, p_{n-1}) \circ \ldots \circ (g_{i+1}, p_{i+1}) \circ (p_i, p_i). \tag{15}$$

This algorithm for $c^m_i$ is identical to the one described in [10], [16], [15] for a double zero representation except for the last term $(p_i, p_i)$, which in [10], [16], [15] is defined as $(g_i, p_i)$. Hence, the adder designed according to (15) for a single zero representation can run as fast as the adders presented in [10], [16], [15] for double zero representation.
Recall that

$$c^m_0 = c^1_{out} = (g_{n-1}, p_{n-1}) \circ \ldots \circ (g_1, p_1) \circ (p_0, p_0). \tag{16}$$

If (15) is used for the general description for the modulo $2^n - 1$ carry computation for all $i$ including 0, then, for $i = 0$, the first term is equal to $(g_{-1}, p_{-1})$, which is not defined. A more appropriate definition for $c^m_i$ that accurately defines the carries for $i \in [0, n)$ is given by

$$\begin{aligned} c^m_i = {} & (g_{|i-1|_n}, p_{|i-1|_n}) \circ (g_{|i-2|_n}, p_{|i-2|_n}) \circ \ldots \\ & \ldots \circ (g_{|i-i|_n}, p_{|i-i|_n}) \circ \ldots \circ (g_{|i-(i+1)|_n}, p_{|i-(i+1)|_n}) \circ \ldots \\ & \ldots \circ (g_{|i-(n-1)|_n}, p_{|i-(n-1)|_n}) \circ (p_{|i-n|_n}, p_{|i-n|_n}). \end{aligned} \tag{17}$$

This equation can be readily validated through term-by-term comparison with (15) and (16).
From the above analysis, it is clear that fast modulo $2^n - 1$ adders with a single representation of zero can be designed for any of the existing modulo $2^n - 1$ adders by performing an operation equivalent to adding the carry generated from the most significant bit (MSB) when the input carry to the adder, $c_{in}$, is 1 to the sum when the input carry is 0.
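A direct Python reading of (17) is sketched below: for each bit $i$, the $n$ generate/propagate pairs are taken cyclically starting at bit $|i-1|_n$, with the final term replaced by $(p_i, p_i)$. The function names are assumptions, and the chain is folded serially here purely to check the arithmetic, whereas the hardware would evaluate it as a tree.

```python
def prefix_op(hi, lo):
    """(g, p) o (g', p') = (g OR p AND g', p AND p')."""
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)

def carry_eq17(x, y, n, i):
    """Carry c^m_i into bit i, following the cyclic chain of (17)."""
    g = [((x >> b) & (y >> b)) & 1 for b in range(n)]
    p = [((x >> b) | (y >> b)) & 1 for b in range(n)]
    terms = [(g[(i - h) % n], p[(i - h) % n]) for h in range(1, n)]
    terms.append((p[i], p[i]))           # last term: |i - n|_n = i
    acc = terms[-1]
    for t in reversed(terms[:-1]):       # right fold: t_1 o (t_2 o (... o t_n))
        acc = prefix_op(t, acc)
    return acc[0]                        # generate component is the carry

def add_mod_eq17(x, y, n):
    """Modulo 2^n - 1 sum with a single zero, z_i = ps_i XOR c^m_i."""
    z = 0
    for i in range(n):
        ps = ((x >> i) ^ (y >> i)) & 1
        z |= (ps ^ carry_eq17(x, y, n, i)) << i
    return z
```

Every carry is produced by the same kind of length-$n$ chain, which is the regularity that a minimum-depth tree implementation relies on.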
3.2 Architecture Description for the Proposed
Modulo $2^n - 1$ Addition Algorithms with Single
Zero Representation
This section aims to describe the hardware realization of the modulo $2^n - 1$ addition algorithms described in (13) and (17), respectively. The architecture for (13), hereafter denoted as Type-I, presents a modification to the final row of prefix operators in [10], [16], [15] that leads to faster implementations for single zero representation. In addition, the proposed Type-II adder, which implements the algorithm in (17), reduces the critical path such that it is theoretically equal to the critical path of the double-zero-representing adders in [10], [16], [15] and it thus provides a significant speedup for single zero representation.
3.2.1 The Proposed Type-I Adder
The first $m - 1$ levels, where $m = \lceil \log_2 n \rceil$, of the carry computation structure for the proposed Type-I architecture are identical to the first $m - 1$ levels in the carry computation structure of an entire family of parallel-prefix adders described in [10], [16], [15]. There are three architectural differences distinguishing the proposed designs from existing designs:
• In the proposed architecture, propagate terms at level 0 are computed using OR gates. In existing designs, XOR gates are used to accommodate single zero representation. This results in a slightly faster proposed design.
• The prefix operators on the last row compute the generate term $g = g_i \vee p_i \cdot (g_{i-1} \vee p_{i-1})$. For the adders in [10], [16], [15], the prefix operators on the last row implement $g = g_i \vee p_i \cdot g_{i-1}$ and $p = p_i \vee p_{i-1}$. Thus, both sets of operators have similar area requirements, with the proposed operator requiring an additional OR gate in the critical path.
• The sum bits in the Type-I adder are implemented as $z_i = ps_i \oplus c^m_i$, where $c^m_i$ is as defined in (13). For the adders in [10], [16], [15], the sum bits are computed using $z_i = ps_i \cdot \overline{PS_{n-1,0}} \oplus c^*_i$. Hence, the final sum computation logic in the existing designs requires one additional AND gate per bit and is thus slower by one gate level than the proposed Type-I adders. Note that, as propagate terms are computed using OR gates in the Type-I adder, the overall area requirement is very similar for all designs considered.
From the above, it is evident that the proposed Type-I adders require an almost identical amount of silicon area while at the same time being faster by virtue of the fact that propagate terms are computed using OR gates at level 0. Greater percentage improvements in speed are expected for small wordlength adders, as is the case with RNS. As the design methodologies for the adders in [16], [15] are different from that in [10], we will define two subclasses of the proposed Type-I adders. Type-I-A is defined as being similar to the adders described in [16], [15], whereas Type-I-B is defined as being similar to the family of adders described in [10]. As an illustration of the similarity between Type-I adders and the existing designs, Fig. 2 shows the design of the Type-I-A adder for the modulus $2^6 - 1$. Note that the vertical and lateral connections into the final row of prefix operators for Type-I adders are identical to those in [16], [15], [10].
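The behavior of the Type-I final row can be modeled in Python as below. This is a sketch under stated assumptions: the group terms $(G_{i-1,0}, P_{i-1,0})$ and $(G_{n-1,i}, P_{n-1,i})$ are computed by a simple serial helper instead of the first $m - 1$ prefix levels, and the names `group_gp`, `last_row`, and `add_mod_type1` are hypothetical.

```python
def group_gp(x, y, hi, lo):
    """(G_{hi,lo}, P_{hi,lo}) with OR-based propagates; (0, 1) if the range is empty."""
    G, P = 0, 1
    for i in range(lo, hi + 1):
        a, b = (x >> i) & 1, (y >> i) & 1
        g, p = a & b, a | b              # level-0 terms use OR, not XOR
        G, P = g | (p & G), p & P
    return G, P

def last_row(first, second):
    """Final-row generate term g = g_a OR p_a AND (g_b OR p_b)."""
    g_a, p_a = first
    g_b, p_b = second
    return g_a | (p_a & (g_b | p_b))

def add_mod_type1(x, y, n):
    """Modulo 2^n - 1 sum using the carries of (13) and z_i = ps_i XOR c^m_i."""
    z = 0
    for i in range(n):
        lower = group_gp(x, y, i - 1, 0)   # (G_{i-1,0}, P_{i-1,0}); identity for i = 0
        upper = group_gp(x, y, n - 1, i)   # (G_{n-1,i}, P_{n-1,i})
        cm = last_row(lower, upper)        # equation (13)
        ps = ((x >> i) ^ (y >> i)) & 1     # partial sum still uses XOR
        z |= (ps ^ cm) << i
    return z
```

No separate zero-detection term appears in the sum logic, which reflects the third bullet above: the sum bit is a single XOR of the partial sum and the carry.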
3.2.2 The Proposed Type-II Adder
The design approach in this case aims to construct the generic circuit for the computation of $c^m_i$ in (17). This circuit will then be unrolled for each $i$, $i \in [0, n)$, resulting in a complete carry computation unit for minimum logic depth modulo $2^n - 1$ addition with single zero representation.
The carry term $c^m_i$ is computed using an inverse binary tree of prefix operators with depth $m = \lceil \log_2 n \rceil$. It is clear from (17) that the length of the carry term $c^m_i$, that is, the number of generate/propagate terms that are associated in its computation, is equal to $n$. Note that the Type-II architecture produces all of the carries without using the idempotency property [19]. The tree structure for the computation of the carries can be viewed as an interconnection of subtrees, where the subtree at level $k$, $k \in [1, m]$, is an inverse binary tree with a maximum of $2^k$ inputs (root nodes) at level 0 and one solitary output (leaf node) at level $k$.
Fig. 2. The proposed Type-I-A architecture for $n = 6$.
If this subtree, with its leaf node on bit position $j$, $j \in [0, n)$, connects to a node on level $k + 1$ through a vertical connection, then this subtree shall be denoted as $LST_{k,j}$, that is, the left subtree (LST) with its leaf node at coordinates defined by $(k, j)$. Similarly, a lateral connection from a leaf node at $(k, j_1)$ to a node at $(k + 1, j)$, where $j_1 = |j - 2^k|_n$, will result in the corresponding subtree being labeled as the right subtree (RST) on $(k, j_1)$ or $RST_{k,j_1}$. It is worth noting that each lateral wire connecting level $k$ to level $k + 1$ spans $2^k$ bits, as shown in Fig. 3, where the half-white/half-gray node either represents a buffering node or a gray node (which implements the operator defined in Fig. 3). The specific implementation of this node will become apparent later. In addition, the lengths of the group generate/propagate terms produced by $RST_{k,|j-2^k|_n}$ and $LST_{k,j}$ are labeled as $L_{R(k,|j-2^k|_n)}$ and $L_{L(k,j)}$, respectively, where $L_{R(k,|j-2^k|_n)} = L_{R(k+1,j)} - L_{L(k,j)} = L_{R(k+1,j)} - 2^k$. Based on this notation, the entire carry computation tree for computing $c^m_i$ is denoted as $RST_{m,|i-1|_n}$ with $L_{R(m,|i-1|_n)} = n$, that is, the carry for bit $i$, $i \in [0, n)$, is produced at node $(m, |i-1|_n)$.
The carry computation tree for the proposed Type-II modulo $2^n - 1$ adder can be recursively designed by working in reverse from level $m$. We will initially consider the construction of the chain of interconnected RSTs and then describe the construction of the LSTs.
Consider an RST that is defined by $RST_{k,j}$. This node is either fed from a single subtree $RST_{k-1,j}$ or from two subtrees defined by $LST_{k-1,j}$ and $RST_{k-1,|j-2^{k-1}|_n}$. The precise implementation is determined by two factors: the length of the term being produced by $RST_{k,j}$, $L_{R(k,j)}$, and the number of levels available to produce such a term, that is, $k$. If more levels are available than required, that is, if $k > \lceil \log_2 L_{R(k,j)} \rceil$, then $N_{B(k,j)} = k - \lceil \log_2 L_{R(k,j)} \rceil$ cascaded buffers are used to traverse the extra $N_{B(k,j)}$ levels. In this case, $RST_{k,j}$ is considered to be a series cascade of $N_{B(k,j)}$ buffers, with the topmost buffer being fed from the leaf node of $RST_{k-N_{B(k,j)},j}$, as illustrated in Fig. 4, where the white and gray nodes represent buffers and the operator defined in Fig. 3, respectively.
On the other hand, if all $k$ available levels are required to produce a length $L_{R(k,j)}$ term, that is, if $k = \lceil \log_2 L_{R(k,j)} \rceil$, then $RST_{k,j}$ connects to $LST_{k-1,j}$ and $RST_{k-1,|j-2^{k-1}|_n}$ in much the same way as shown in Fig. 3. Note that, when $n = 2^l$, where $l > 1$, there is no redundancy in the architecture and, hence, no buffer nodes are used. Thereafter, the same procedure is repeated for generating all of the RSTs up until level 1 of the tree is reached. At this level, the required length of the carry generate/propagate term will either be one or two. If it is one, then the corresponding leaf node of $RST_{1,i}$ will simply be a buffering node. Otherwise, the subtree will be defined by $RST_{1,|i-(n-1)|_n}$, where the corresponding leaf node will implement the operator defined in Fig. 3.
As for the structure of $LST_{k,j}$, it is similar to that of a balanced inverse binary tree of prefix operators with $2^k$ inputs, as shown in Fig. 5. Note that level 0 is where the terms $g_i$, $p_i$, and $ps_i$ defined in (1) are computed. Corresponding bit positions are annotated at the top of the figure. Note that the length of the term produced by $LST_{m-1,|i-1|_n}$, $L_{L(m-1,|i-1|_n)}$, in the computation of $c^m_i$ is defined to be equal to $2^{m-1}$ and, in general, an LST node at level $k$ produces carry generate/propagate terms of length $2^k$.
As an example, the architecture for the computation of $c^m_i$ when $n = 6$ is illustrated in Fig. 6. This example demonstrates the construction of the two different types of RSTs, as shown in Figs. 3 and 4. At level 3, $RST_{3,|i-1|_n}$ is required to produce a carry term of length $n = 6$. The number of levels required to produce such a term is $\lceil \log_2 6 \rceil = 3$, that is, $k = \lceil \log_2 L_{R(k,j)} \rceil$. Hence, the topology described in Fig. 3 is used. Moving up to level 2, the LST produces a term of length $2^2 = 4$, which means that $RST_{2,|i-5|_n}$ needs to generate a term of length $6 - 4 = 2$. However, as only one prefix level is needed to produce a length 2 term, the topology presented in Fig. 4 is used, where the number of cascaded buffers used is given by $N_{B(2,|i-5|_n)} = 2 - 1 = 1$. Note that each node at $(0, |i-h|_n)$, for $h \in (0, n)$, produces a pair of generate/propagate terms defined by $(g_{|i-h|_n}, p_{|i-h|_n})$.
Fig. 3. Subtree definition for $c^m_i$ computation in Type-II adder.
Fig. 4. Buffer arrangement when $k > \lceil \log_2 L_{R(k,j)} \rceil$.
Fig. 5. Structure of $LST_{k,j}$.
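The recursive LST/RST construction can be sketched in Python as follows. This is an interpretation of the description above rather than the authors' design flow: the function names and the `plan` log are hypothetical, and the code evaluates one carry $c^m_i$ at a time instead of sharing nodes between carries on a common prefix graph.

```python
def ceil_log2(length):
    """ceil(log2(length)) for length >= 1, using integer arithmetic only."""
    return (length - 1).bit_length()

def prefix_op(hi, lo):
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)

def lst(terms):
    """Balanced inverse binary tree over a power-of-two number of terms."""
    if len(terms) == 1:
        return terms[0]
    half = len(terms) // 2
    return prefix_op(lst(terms[:half]), lst(terms[half:]))

def rst(terms, k, plan):
    """Right subtree producing a term of length len(terms) within k levels."""
    length = len(terms)
    need = ceil_log2(length)
    if k > need:                               # spare levels: N_B buffers
        plan.append(("buffers", k, k - need))
        k = need
    if length == 1:
        return terms[0]
    half = 1 << (k - 1)                        # the LST covers 2^(k-1) terms
    plan.append(("split", k, half, length - half))
    return prefix_op(lst(terms[:half]), rst(terms[half:], k - 1, plan))

def carry_type2(x, y, n, i, plan=None):
    """Carry c^m_i from the chain of (17), evaluated as an LST/RST tree."""
    g = [((x >> b) & (y >> b)) & 1 for b in range(n)]
    p = [((x >> b) | (y >> b)) & 1 for b in range(n)]
    terms = [(g[(i - h) % n], p[(i - h) % n]) for h in range(1, n)]
    terms.append((p[i], p[i]))
    return rst(terms, ceil_log2(n), [] if plan is None else plan)[0]

def add_mod_type2(x, y, n):
    z = 0
    for i in range(n):
        ps = ((x >> i) ^ (y >> i)) & 1
        z |= (ps ^ carry_type2(x, y, n, i)) << i
    return z
```

For $n = 6$, passing a list as `plan` records a split into lengths 4 and 2 at level 3, one buffer at level 2, and a final split into two single terms at level 1, which mirrors the worked example in the text.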
The final circuit for the computation of $c^m_i$ is obtained by connecting the prefix operators, as defined in Fig. 6, for each $i$, $i \in [0, n)$, on the same prefix graph. The corresponding carry computation unit for the proposed modulo $2^6 - 1$ adder is shown in Fig. 7. The figure shows that there is a prefix operator at each bit position on each prefix level, with the exception of the final level, where only the generate term of the prefix operator is needed. In addition, as the generate term produced by $RST_{k,j}$ in the computation of $c^m_r$ is different from the generate term produced by $LST_{k,j}$ in the computation of $c^m_q$, $r \neq q$, an additional buffer or operator is required at each level $k$, $k \in (0, m)$. This means that, in the worst case, that is, when $n = 2^l$, $l > 1$, where no buffers exist in the computation of $c^m_i$, there are $(m - 1) \times n$ additional operators on top of the $(m - 1) \times n$ prefix operators on the first $m - 1$ levels of the carry computation tree. This results in adders with an increased area complexity compared to the existing competitive adders. However, a speedup of two logic levels is achieved. Note that it is possible to reduce the number of additional operators in the carry computation structure for $n \neq 2^l$, that is, when buffers are used in the computation of $c^m_i$.
This is made possible by making use of the relation defined in (11). Consider the modification of the RST on level 2 of Fig. 6 in Fig. 8. It is easily verified that the generate term produced by the bottom node in both subtrees is the same. Although more area is required to implement the subtree on the right of the figure, area is actually reduced in the final carry computation circuit. This is because the prefix operator $\circ$ is required anyway and, thus, the additional operator is in effect replaced with an OR gate. The modified architecture for $n = 6$ is given in Fig. 9. Note that, in a vertically cascaded chain of buffers, only the topmost buffer in the chain is replaced by an OR gate. The rest remain as buffers.
4 COMPARISON RESULTS
The fastest and most competitive modulo $2^n - 1$ adders in the literature are the ones proposed in [15], [16] (Dimitra-A), and [10] (Dimitra-B). Since the adder in [15] can be designed using the methodology proposed in [16], subsequent comparisons will include results for the designs in [16] and [10] only. Furthermore, the Dimitra-B adder and, consequently, the Type-I-B adder are not defined for all $n$ [10]. The first step in the implementation flow for the Type-I-A, Type-I-B, Type-II, Dimitra-A [16], and Dimitra-B [10] adders involved writing VHSIC Hardware Description Language (VHDL) descriptions for a range of 3-bit to 9-bit moduli. These were then mapped and implemented onto Virtual Silicon's UMC 0.13 µm standard-cell kit using Cadence's PKS and Silicon Ensemble tools. The delay values were obtained in ns using PKS's in-built static timing analysis (STA) engine after back-annotating parasitic information on placed and routed designs. The area values were obtained as the number of standard-cell sites, where one site has an area equivalent to 1.728 µm$^2$. As for power estimation, the approach described in [21] was used for all adders, where independent pseudorandom inputs were clocked at 200 MHz. The adders were simulated until the power estimates (in µW) were within a 95 percent confidence interval and exhibited an error of less than 2 percent. Each design was recursively optimized for speed until the tool was unable to produce a faster design. Area was reclaimed by the tool whenever there was any positive slack. The results obtained are displayed in Table 1.
Fig. 6. $c^m_i$ computation structure for the proposed Type-II adder when $n = 6$.
Fig. 7. Proposed modulo $2^6 - 1$ carry computation structure.
Fig. 8. Example illustrating the removal of the operator.
Fig. 9. Reduced area modulo $2^6 - 1$ adder.
Of the proposed adders, the Type-II adder is the fastest, with average gains in delay of 9 percent and 11 percent over the Type-I-A and Type-I-B adders, respectively. The recursive implementation of the addition of $c^1_{out}$ in the Type-II adder means that the idempotency-exploiting area reduction techniques used in the Type-I adders cannot be employed. As a result, an average increase of at least 23 percent has been observed in area and power complexities from the standard-cell implementations. This means that the Type-II adder shares similar trade-off performance in the $AD^2$ and $ED$ complexity measures to the Type-I adders. The choice between these is driven by a given need for good delay, area, or power performance. Of the two Type-I adders, the increased fan-out property of the Type-I-B adder has a negligible effect on delay for the small wordlength adders considered. Moreover, the Type-I-B adder is generally smaller and more power efficient.
In terms of addition latency, as Fig. 10 shows, the Type-II adder significantly outperforms the competition. Average improvements of 16 percent and 18 percent are observed in comparison to Dimitra-A and Dimitra-B adders, respectively. This comes at a cost of at least 19 percent in both area and power complexities, which makes the Type-II adder suitable for high-performance applications. It is interesting to note that the proposed single zero correction technique for the existing modulo $2^n - 1$ adders results in faster adders, where the Type-I-A and Type-I-B adders are approximately 7 percent faster than the corresponding Dimitra-A and Dimitra-B adders, respectively. Although there is generally not much difference in area requirements (as expected), it is clear that the proposed Type-I-A and Type-I-B adders are less power efficient than the designs they are based on. This is due to the reliance of the proposed adders on both generate/propagate pairs at the last prefix level, which results in the tool optimizing for all paths in the circuit. Power-consuming high-drive-strength cells are used by the tool to optimize delay performance for each path. This is in contrast to the existing Dimitra-A and Dimitra-B adders for which the critical path is dependent only on the generate terms at the last prefix level. The propagate terms are required during sum computation only and, hence, they are required to be valid one logic level later in comparison.
An improved delay performance in conjunction with
similar area requirements has resulted in an average
improvement of at least 14 percent in the AD^2 complexities
for both Type-I and Type-II adders compared to the
Dimitra-A and Dimitra-B adders. Moreover, although the
proposed adders are less power efficient, the improvement
in delay has resulted in an overall improvement of
approximately 3 percent and 8 percent in E × D costs for
the Type-I-A and Type-I-B adders, respectively, over their
equivalent Dimitra-A and Dimitra-B adders.
TABLE 1
Area, Delay, and Power Results for Considered Modulo 2^n - 1 Adders
Fig. 10. Comparison of the improvement in delay provided by the Type-II adder.
Tables 2 and 3 show results for loosely constrained Type-I-A and Type-I-B
adders, respectively. The shaded cells indicate equal or
better performance over the equivalent competitive adders.
When optimized for area, these results show that improvement
in each of delay, area, and power is achieved by the
proposed single zero correction technique. However, in
comparison to Dimitra-A and Dimitra-B adders, the power
density, that is, the power consumed per unit area, remains
higher for the Type-I adders.
The results comparing a loosely constrained Type-II
adder with the fastest of the Type-I adders are shown in
Table 4. It is seen that the Type-II adder can be optimized to
run as fast as the Type-I adders and, in general, occupy less area,
especially for smaller values of n. For larger
values of n, the Type-II adder is expected to perform
comparatively worse in both area and power. Overall, there
is little to choose between the Type-II and Type-I adders.
In conclusion, the Type-II adder is the fastest. The Type-
I adders share similar area complexities to the Dimitra-A
and Dimitra-B adders, but consume more power per unit
area. Notwithstanding this, the faster Type-I adders can be
optimized to provide improvements across the entire
power-delay-area space when their delay performance
requirements are loosened to match those of the Dimitra-A
and Dimitra-B adders. The Type-II adder and Type-I
adders are comparable in the AD^2 and E × D metrics,
while, for metrics that are not skewed toward delay, the Type-I adders fare
better.
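The two cost functions used in this comparison can be made concrete with a small Python sketch. The function names and the normalized input values below are hypothetical and are not taken from Tables 1 to 4; the sketch only illustrates how the area × delay^2 metric weights a delay gain quadratically, whereas energy × delay weights it linearly.

```python
def area_delay2(area, delay):
    """Area x delay^2 cost (AD^2): delay is weighted quadratically."""
    return area * delay ** 2


def energy_delay(energy, delay):
    """Energy x delay cost (E x D): delay is weighted linearly."""
    return energy * delay


def improvement(new_cost, old_cost):
    """Fractional cost reduction; positive means the new design is better."""
    return 1.0 - new_cost / old_cost


# Hypothetical normalized designs (illustrative values only):
# the baseline is 1.0 in every dimension, and the candidate is assumed to be
# 7 percent faster, equal in area, and 4 percent higher in energy.
baseline = {"area": 1.0, "energy": 1.0, "delay": 1.0}
candidate = {"area": 1.0, "energy": 1.04, "delay": 0.93}

ad2_gain = improvement(
    area_delay2(candidate["area"], candidate["delay"]),
    area_delay2(baseline["area"], baseline["delay"]),
)
ed_gain = improvement(
    energy_delay(candidate["energy"], candidate["delay"]),
    energy_delay(baseline["energy"], baseline["delay"]),
)

print(f"AD^2 improvement:  {ad2_gain:.1%}")
print(f"E x D improvement: {ed_gain:.1%}")
```

Under these assumed values, a modest delay reduction yields a noticeably larger AD^2 gain, while the E × D gain is partly offset by the higher energy, which is the qualitative pattern described in the comparison above.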
5 CONCLUSIONS
New algorithms for modulo 2^n - 1 addition with single
zero representation have been presented. The algorithms
and, thus, the corresponding architectures, have been
derived by assuming an input carry of one into the first
stage of the addition. The derivations produce a new single
zero correction technique that can be applied to existing fast
designs to produce even faster adders. In addition, a new
technique for recirculating generate and propagate signals
in a parallel-prefix architecture that outperforms all existing
and other proposed modulo 2^n - 1 adders in terms of
modular addition latency has also been proposed. Back-annotated
very-large-scale integration (VLSI) implementations
have demonstrated improvements not only in speed,
but also in the area × delay^2 and energy × delay cost
functions when the proposed adders are compared against
existing adders.
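The effect of assuming an input carry of one can be illustrated with a purely behavioral Python sketch. This is an arithmetic view under stated assumptions, not the parallel-prefix architectures themselves: the function names are hypothetical, operands are assumed to lie in the range 0 to 2^n - 2, and the way the speculative carry is undone when no carry out occurs is simply one way to express the arithmetic. A conventional end-around-carry model is included for contrast.

```python
def eac_add(a, b, n):
    """Conventional end-around-carry addition (behavioral model).

    Returns 2**n - 1 (all ones) when a + b == 2**n - 1, which is the
    second representation of zero.
    """
    mask = (1 << n) - 1
    s = a + b
    carry_out = s >> n          # 0 or 1 for n-bit operands
    return (s + carry_out) & mask


def single_zero_add(a, b, n):
    """Modulo 2**n - 1 addition with an input carry of one (behavioral model).

    Assumes 0 <= a, b <= 2**n - 2.
    """
    mask = (1 << n) - 1
    s = a + b + 1               # first-stage addition with carry-in = 1
    carry_out = s >> n
    if carry_out:
        # a + b >= 2**n - 1: the input carry already acts as the
        # end-around carry, so the low n bits are the residue.
        return s & mask
    # a + b < 2**n - 1: no wrap-around occurred, so remove the input carry.
    return s - 1


if __name__ == "__main__":
    n = 4
    print(eac_add(5, 10, n))          # all-ones pattern for zero
    print(single_zero_add(5, 10, n))  # all-zeros pattern for zero
```

The case a + b = 2^n - 1 is the one that matters: with the input carry set to one it produces a carry out and an all-zeros sum, so the all-ones pattern never appears at the output.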
ACKNOWLEDGMENTS
This research was supported by a White Rose Studentship,
offered in collaboration with the Universities of Sheffield
and Leeds, United Kingdom.
TABLE 2
Implementation Results Comparing a Loosely Constrained
Type-I-A Adder to the Equivalent Dimitra-A Adder
TABLE 3
Implementation Results Comparing a Loosely Constrained
Type-I-B Adder to the Equivalent Dimitra-B Adder
TABLE 4
Implementation Results Comparing a Loosely Constrained
Type-II Adder to the Fastest of the Type-I Adders
Riyaz A. Patel received the BEng (first-class
hons.) degree in electronics and communica-
tions engineering and the MSc degree (with
distinction) in radio communications from the
University of Leeds in 2000 and 2001, respec-
tively, and the PhD degree in electronics from
the University of Sheffield in 2006, where he
studied VLSI aspects of modular adders.
Mohammed Benaissa is currently a senior
lecturer at the University of Sheffield. He has
been actively working in the area of VLSI signal
processing, error-control coding, and cryptogra-
phy for the past 18 years and has published
more than 80 papers in recognized journals and
conferences. His research interests include finite
number systems and their applications. He is a
senior member of the IEEE.
Said Boussakta received the Ingenieur d’Etat degree in electronic
engineering from the National Polytechnic Institute of Algiers (ENPA),
Algiers, Algeria, in 1985 and the PhD degree in electrical engineering
(signal and image processing) from the University of Newcastle upon
Tyne, Newcastle upon Tyne, United Kingdom, in 1990. From 1990 to
2000, he was with the University of Newcastle upon Tyne as a senior
research associate in digital signal and image processing. From 2000 to
2006, he was at the University of Leeds as a reader in digital
communications and signal processing. He is currently a professor of
communications and signal processing at the School of Electrical,
Electronic and Computer Engineering, University of Newcastle upon
Tyne, United Kingdom, where he is lecturing in communications
systems and signal processing subjects. His research interests are in
the areas of fast DSP algorithms, digital communications, communica-
tions networks systems, cryptography, and digital signal/image proces-
sing. He has authored and coauthored approximately 150 publications.
Professor Boussakta is a fellow of the IET and a senior member of the
IEEE and the IEEE Communications and Signal Processing Societies.
1492 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 11, NOVEMBER 2007
