Novel power-delay-area-efficient approach to generic modular addition by Patel RA et al.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 54, NO. 6, JUNE 2007 1279
Novel Power-Delay-Area-Efficient Approach to
Generic Modular Addition
Riyaz A. Patel, Mohammed Benaissa, Senior Member, IEEE, Neil Powell, and Said Boussakta, Senior Member, IEEE
Abstract—Modular adders are fundamental arithmetic compo-
nents typically employed in residue number system (RNS)-based
digital signal processing (DSP) systems. They are widely used in
modular multipliers and residue-to-binary converters and in im-
plementing other residue arithmetic operations such as scaling.
In this paper, a methodology for designing power-delay-area-ef-
ficient modular adders based on carry propagate addition is pre-
sented. The binary representational characteristics of the modulus
are exploited to allow the sharing of hardware in a fast modular
adder topology. VLSI implementation results using 0.13-μm standard-cell
technology, together with a theoretical analysis, show that
this approach produces adders that offer efficient tradeoffs when
compared with the fastest through to the smallest generic modular
adders in the literature.
Index Terms—Carry propagate adder, computer arithmetic,
ELM adder, modular adder, parallel-prefix adder, residue number
system (RNS), very large-scale integration (VLSI) design.
I. INTRODUCTION
MODULAR addition for arbitrary moduli plays an important role in the implementation of residue number
systems (RNSs) that provide balanced arithmetic operations as well
as large dynamic ranges [1], [2]. Various approaches to the im-
plementation of modular addition have been proposed in the lit-
erature, and, in general, these are confined to memory lookup,
combinational logic, or an amalgamation of both [3], [4].
The RNS is a nonweighted integer number system that
is defined by an $N$-tuple base of pairwise relatively prime
positive integers $\{m_1, m_2, \ldots, m_N\}$, which are collectively
known as the moduli of the system [5]. For a
given base, an integer $X$ is represented by an $N$-tuple word,
$(x_1, x_2, \ldots, x_N)$, where $x_i = X \bmod m_i$, i.e., it is the
nonnegative remainder when dividing $X$ by the modulus $m_i$.
Addition, subtraction, and multiplication operations are all
closed in RNS. Let $\circ$ denote the binary operation of addition,
Manuscript received July 29, 2004; revised May 5, 2005 and March 17, 2006.
This work was supported by a White Rose Studentship, offered in collaboration
with the Universities of Sheffield and Leeds, U.K. This work was presented in
part at the IEEE International Workshop on Signal Processing Systems (SiPS),
Austin, TX, October 2004. This paper was recommended by Associate Editor
S.-G. Chen.
R. A. Patel was with the Department of Electronic and Electrical Engineering,
University of Sheffield, S1 3JD, Sheffield, U.K. He is now with Detica Ltd.,
Surrey Research Park, Guildford, Surrey GU2 7YP, U.K.
M. Benaissa and N. Powell are with the Department of Electronic and Electrical
Engineering, University of Sheffield, S1 3JD, Sheffield, U.K.
S. Boussakta was with the Department of Electronic and Electrical Engineering,
University of Leeds, Leeds, U.K. He is now with the School of Electrical,
Electronic, and Computer Engineering, University of Newcastle upon
Tyne, Newcastle upon Tyne NE1 7RU, U.K.
Digital Object Identifier 10.1109/TCSI.2007.895369
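subtraction, or multiplication. It then follows that $Z = X \circ Y$ is
isomorphic to $(z_1, z_2, \ldots, z_N)$, where $z_i = (x_i \circ y_i) \bmod m_i$,
$i = 1, 2, \ldots, N$. Note that $z_i$ is solely dependent upon $x_i$ and $y_i$,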
and, hence, the arithmetic operation is performed in parallel
with no interaction between the RNS channels. As a direct
consequence of this property, RNS systems are capable of
performing high-speed addition and multiplication, usually at
a fraction of the time taken in traditional two’s complement
systems.
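As an illustration of this channel independence, the following Python sketch encodes two integers into residues, operates on each channel separately, and recovers the result with the Chinese remainder theorem. The base {29, 31, 32} and the helper names `to_rns`, `rns_op`, and `from_rns` are assumptions chosen for the example; they are not taken from the paper.

```python
from math import prod

def to_rns(x, moduli):
    """Residues of x, one per modulus (one RNS channel each)."""
    return tuple(x % m for m in moduli)

def rns_op(a, b, moduli, op):
    """Apply op channel by channel; no channel reads another channel's data."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(residues, moduli):
    """Chinese remainder theorem reconstruction (moduli must be pairwise coprime)."""
    dynamic_range = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m
        # pow(partial, -1, m) is the multiplicative inverse of partial modulo m
        total += r * partial * pow(partial, -1, m)
    return total % dynamic_range

moduli = (29, 31, 32)  # illustrative base, pairwise relatively prime
x, y = 1234, 567
z_sum = rns_op(to_rns(x, moduli), to_rns(y, moduli), moduli, lambda a, b: a + b)
z_prod = rns_op(to_rns(x, moduli), to_rns(y, moduli), moduli, lambda a, b: a * b)
```

Results are only meaningful modulo the dynamic range, here 29 × 31 × 32 = 28768, so the product above is recovered modulo that value.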
In addition to high-speed implementations, RNS circuits
have also demonstrated power efficiency when implementing
bespoke digital signal processing (DSP) functionality [6]–[9].
The minimization of power dissipation has become an im-
portant performance objective as the trend increases towards
portability, denser circuits, and high performance. One of the
main sources of power dissipation in CMOS circuits is dynamic
power dissipation, which is given by [10]
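$$P_{\mathrm{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f_{\mathrm{clk}} \qquad (1)$$
where $\alpha$ is the activity factor, $C_L$ is the load capacitance,
$V_{DD}$ is the supply voltage, and $f_{\mathrm{clk}}$ is the clock frequency. A variety of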
techniques exist to reduce the parameters of (1) at all levels of
design abstraction [10], including the use of alternate number
representations such as RNS [11]. As an example, Freking
and Parhi [8] showed that, as well as offering a reduction in
switching activity, the speed advantage offered by RNS can be
traded with a reduction in supply voltage to obtain a quadratic
reduction in power—this provides a power delay advantage
when compared with two’s complement implementations.
Amongst the plethora of DSP applications, RNS has shown
power efficiency in implementing FIR filters [8], [12], [13],
frequency synthesizers [14], programmable DSPs [15], and
secure image coding schemes [6].
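The quadratic dependence on supply voltage can be checked numerically. The sketch below assumes that (1) has the usual form of activity factor times load capacitance times supply voltage squared times clock frequency; the function name and all numerical values are arbitrary and chosen only for illustration.

```python
def dynamic_power(alpha, c_load, v_dd, f_clk):
    """Dynamic power, assuming the form alpha * C_L * V_DD^2 * f_clk for (1)."""
    return alpha * c_load * v_dd ** 2 * f_clk

# Arbitrary operating points: same activity, load, and clock, half the supply voltage.
baseline = dynamic_power(alpha=0.2, c_load=1e-12, v_dd=1.2, f_clk=100e6)
scaled = dynamic_power(alpha=0.2, c_load=1e-12, v_dd=0.6, f_clk=100e6)
ratio = scaled / baseline  # halving V_DD leaves about one quarter of the power
```

This ignores the slowdown caused by a lower supply voltage, which is exactly the speed margin that the faster RNS datapath is said to trade away.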
In general, the design approach for modular adders falls into
two distinct categories. On the one hand, they can be designed
for flexibility in which case the methodology allows the de-
sign of adders for any moduli. On the other hand, if one desires
architectural simplicity and the best performance, then adders
designed for a specific set of efficient moduli (e.g., $2^n \pm 1$)
have also been proposed [16], [17]. Our contribution is focused
on modular adders belonging to the former class, and we will
henceforth limit all discussion on this class of modular addition.
Note that, although the adders for moduli of the form $2^n \pm 1$ are
more efficient than generic modular adders, RNSs that restrict
the moduli to this form suffer from either large wordlength RNS
channels, imbalanced arithmetic, or both when large dynamic
ranges are required [1], [2].
1549-8328/$25.00 © 2007 IEEE
In [18], Taylor proposed several approaches based on binary
adders. Bayoumi and Jullien [4] proposed three types of mod-
ular adders. These were based on binary adders, lookup tables
(LUTs), and a hybrid of the two. Dugdale [19] described the im-
plementation of sequential modular adders using two cycles of
binary addition with feedback to perform the second addition.
Piestrak [20] presented a modular adder design as a final stage in
a residue-to-binary converter. This approach used a carry-save
adder (CSA) array with two binary adders. More recently, Hi-
asat [21] presented an adder, which was similar in parts to the
Piestrak [20] contribution, and was based on a combination of
both CSA and carry-propagate-addition (CPA) techniques.
In this paper, a new algorithm and several corresponding ar-
chitectures for generic modular addition are presented. The fun-
damental idea, as seen from a high abstraction level, exploits the
advantage offered by the two’s complement of the modulus to
reduce the circuit area for generic modular addition without sac-
rificing performance. In doing so, the moduli that define an RNS
base are characterized according to the complexity of the corre-
sponding modular addition. Moreover, it will be shown that, for
a given wordlength $n$, adders for moduli of the form $2^n - 3$ result
in the smallest implementations. Architectures encompassing a
range of parallel-prefix algorithms, which are commonly employed
in the implementation of fast arithmetic systems, are presented.
For fast modular addition, the sharing of hardware not
only reduces static dissipation but also reduces the overall
switching in the adders, thereby reducing dynamic
power dissipation.
All current designs in the literature have aimed at optimiza-
tions in the delay-area space. When comparing the proposed de-
signs with existing designs in the literature, it will be shown that,
as well as being efficient in terms of delay and area, the proposed
adders are also capable of dissipating less power.
The remainder of this paper is organized as follows. A brief
introduction to parallel-prefix addition is provided next with a
description of the competitive modular adders. Thereafter, the
proposed algorithm and architectures are described, followed
by a discussion on the characteristics of moduli that allow for
efficient implementation of the proposed adders. Before conclu-
sions are drawn, theoretical as well as VLSI implementation re-
sults are presented and compared with existing modular adders.
II. BACKGROUND
Parallel-prefix addition techniques and competitive modular
adder designs are described next.
A. Overview of Parallel-Prefix Addition
The key to fast radix-2 addition of two operands
$A = a_{n-1} a_{n-2} \ldots a_0$ and $B = b_{n-1} b_{n-2} \ldots b_0$ is in the
reduction of the latency in the carry network, where, e.g., $a_i$ is the
$i$th bit of $A$. In this regard, the actual operand bits are not important.
What is important is whether a carry is generated or propagated
from a given bit position. In order to implement such an
adder, the architecture can be divided into three distinct stages.
In the first stage, carry generate $g_i = a_i \cdot b_i$, carry propagate
$p_i = a_i + b_i$, and partial sum bits $h_i = a_i \oplus b_i$ are computed for
every bit $i$, where $0 \le i \le n-1$, and where $\cdot$, $+$, and $\oplus$ represent the
logical operators AND, OR, and XOR, respectively. The second
stage computes the carries $c_i$, $1 \le i \le n-1$, as well as the output
carry $c_{\mathrm{out}} = c_n$. In the final stage, the sum $S = s_{n-1} s_{n-2} \ldots s_0$ is
computed according to $s_i = h_i \oplus c_i$.
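The three stages can be sketched in Python as follows. This is a behavioural model only: the function name is hypothetical, the input carry is assumed to be zero, and stage 2 uses a plain serial carry recurrence for brevity rather than any of the fast carry networks discussed in this section.

```python
def three_stage_add(a, b, n):
    """Add two n-bit operands using generate, propagate, and partial-sum bits."""
    # Stage 1: per-bit generate (AND), propagate (OR), and partial sum (XOR).
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    h = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Stage 2: carries, with c[0] = 0 (no input carry is assumed here).
    c = [0] * (n + 1)
    for i in range(n):
        c[i + 1] = g[i] | (p[i] & c[i])

    # Stage 3: sum bits from the partial sums and the incoming carries.
    s = sum((h[i] ^ c[i]) << i for i in range(n))
    return s, c[n]  # n-bit sum and output carry
```

Only stage 2 grows in depth with the wordlength in this model, which is why the carry network is the part worth accelerating.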
As the first and third stages are relatively simple, the main
problem in adder design lies in the fast generation of the carries.
The carry at each bit position can be derived according
to the well-known carry recurrence $c_{i+1} = g_i + p_i \cdot c_i$
[22]. By unrolling this carry recurrence and by implementing
in parallel the resulting equations, the well-known CLA adders
are formed. Note that after fully unrolling the carry recurrence,
the maximum gate fan-in for an adder is . As
gates with a high number of inputs are either not available or
too slow in current VLSI technology, the resulting architectures
are implemented using multilevel lookahead structures [22]. A
faster method for reducing the latency of carry computation
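is to consider it a prefix problem [23]. In this case, $n$ inputs
$x_{n-1}, x_{n-2}, \ldots, x_0$ and an arbitrary associative operator $\bullet$
are used to produce $n$ outputs $y_{n-1}, y_{n-2}, \ldots, y_0$ according
to the relation $y_i = x_i \bullet x_{i-1} \bullet \cdots \bullet x_0$, where $0 \le i \le n-1$. For
prefix addition, the operator $\bullet$ is defined as [23]
$$(g, p) \bullet (g', p') = (g + p \cdot g',\; p \cdot p') \qquad (2)$$
With the carry generate and propagate terms defined as before,
the carry into bit $i+1$, $c_{i+1}$, is given by $G_{i:0}$, where
$$(G_{i:0}, P_{i:0}) = \begin{cases} (g_0, p_0), & \text{if } i = 0 \\ (g_i, p_i) \bullet (G_{i-1:0}, P_{i-1:0}), & \text{if } 1 \le i \le n-1 \end{cases} \qquad (3)$$
It is also possible to interchange $p_i$ with $h_i$ in the
computation of $G_{i:0}$ without affecting the result [22], where
$(G_{i:0}, H_{i:0})$ is given by
$$(G_{i:0}, H_{i:0}) = \begin{cases} (g_0, h_0), & \text{if } i = 0 \\ (g_i, h_i) \bullet (G_{i-1:0}, H_{i-1:0}), & \text{if } 1 \le i \le n-1 \end{cases} \qquad (4)$$
Note that, in general, $G_{i:j}$ and $P_{i:j}$ (or $H_{i:j}$) denote the
group generate and group propagate signals over bit positions
$j$ to $i$, where $i \ge j$.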
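A short Python sketch can confirm the two statements above: that the group generate signal over bits $i$ down to 0 equals the carry into bit $i+1$, and that OR-based and XOR-based propagate terms give the same carries. The operator is assumed to take the standard form $(g, p) \bullet (g', p') = (g + p g', p p')$; the function names are hypothetical and the input carry is assumed to be zero.

```python
def prefix_op(left, right):
    """Standard prefix operator: (g, p) . (g', p') = (g | (p & g'), p & p')."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)

def carries_serial(a, b, n, use_xor_propagate=False):
    """Return [c_1, ..., c_n], evaluating the group signals one bit at a time."""
    pairs = []
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        propagate = (ai ^ bi) if use_xor_propagate else (ai | bi)
        pairs.append((ai & bi, propagate))

    group = pairs[0]             # group signals over bit 0 only
    carries = [group[0]]         # c_1 is the group generate over bit 0
    for i in range(1, n):
        group = prefix_op(pairs[i], group)
        carries.append(group[0])  # c_{i+1} is the group generate over bits i..0
    return carries
```

Because the operator is associative, the same group signals can be formed in any bracketing, which is what the parallel-prefix structures exploit.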
Many prefix structures in the literature parallelize the compu-
tations of (3). Each of these structures results in a parallel-prefix
adder with a different depth and size to the others and thus offers
alternative power, delay, and area tradeoffs. The algorithms pro-
posed by Ladner–Fischer (LF) [23] and Kogge–Stone (KS) [24]
are the end cases of a family of parallel-prefix adders that share
the attractive property of processing all of the carries using min-
imum logic-depth prefix structures [25]. The prefix structures
for these adders are shown in Fig. 1 for , where the black
nodes represent the prefix —operator and the white nodes rep-
resent buffering nodes.
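For readers who prefer code to prefix graphs, the following Python sketch evaluates the carries with the textbook Kogge-Stone arrangement, in which every bit position combines with the position a power-of-two distance below it at each level. It is not a transcription of Fig. 1; the function names are hypothetical, and the input carry is assumed to be zero.

```python
def generate_propagate(a, b, n):
    """Per-bit generate (AND) and propagate (OR) lists for two n-bit operands."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    return g, p

def kogge_stone(g, p):
    """Return (G, levels) where G[i] is the carry into bit i+1."""
    n = len(g)
    G, P = list(g), list(p)
    levels = 0
    distance = 1
    while distance < n:
        next_G, next_P = list(G), list(P)
        for i in range(distance, n):
            # Prefix operator applied to spans that are `distance` bits apart.
            next_G[i] = G[i] | (P[i] & G[i - distance])
            next_P[i] = P[i] & P[i - distance]
        G, P = next_G, next_P
        distance *= 2
        levels += 1
    return G, levels
```

For eight bits the loop runs three times, which matches the minimum logic depth of $\log_2 n$ prefix levels; the price is that every level has a node at almost every bit position.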
An alternative parallel-prefix implementation appeared in
[26] under the name ELM. The main distinguishing feature of
the ELM adder over traditional parallel-prefix adders is the si-
multaneous addition of partial carry information (or processing
of partial sum information) while the proper carries are still
Fig. 1. Parallel-prefix structures by (a) Ladner–Fischer and (b) Kogge–Stone.
Fig. 2. 8-b ELM tree with cell definition.
being computed. For this case, the prefix operator can be
defined as
(5)
where the symbol represents either of the and logical
operations. Note that the -operator is not explicitly defined in
[26]. It is, however, implicitly assumed in the computation of
the sum bits. In this case, the carry into bit , , is
given by
if
if (6)
where and are as defined earlier. In comparison with the
carries computed using (3), the latency in the carry network is
increased when (6) is used. However, (6) allows the sum bits to
be computed in parallel with the carry computation and, thus,
results in adders that are fast with a reduced interconnect re-
quirement. Nagendra et al. [27] have shown in their paper that
these properties can lead to efficiency in power, delay and area
when compared to the adders designed according to (3). Fig. 2
illustrates the structure of an 8-b ELM adder with the input carry
set to zero, where the input at bit position represents the pair
of signals and defined before. Each node is implemented
using a combination of the ELM cells defined in the figure. The
functionality within the nodes at a given level is identical to all
of the other nodes on that level, with the exception of the right-
most node which passes up carry bits only. In addition, the func-
tionality of a node at a given level is different to that on another
level with nodes becoming increasingly complex as we traverse
up the tree [26].
B. Competitive Modular Adder Designs
The combinatorial modular adders in the literature offer
varying tradeoffs in design complexity. At first glance, Bay-
oumi et al.’s purely combinatorial adder [4] offers the worst
tradeoffs, as the design is both slow and large. However, it
will be shown that the nature of the design allows for sig-
nificant complexity reduction when optimized during VLSI
implementations. The fastest and, consequently, largest design
in the literature is the one proposed by Piestrak [20]. In the
most recent contribution from the literature, Hiasat [21] has
proposed an adder which trades speed for area in comparison
with Piestrak’s adder. In [21], Hiasat showed that his adder
was both faster and smaller compared with that of Bayoumi et
al. when both are implemented using full custom technologies.
Interestingly enough, and in contrast to Hiasat’s results in [21],
the results in this paper will show that Bayoumi et al.’s adder
can outperform Hiasat’s adder in both the critical path delay
and area when the designs are implemented and optimized
using 0.13-μm standard-cell technology. Moreover, whereas
Hiasat’s conclusions were based on comparative results for two
low-cost (LC) generic moduli,1 the conclusions made herein
are based on a more thorough examination of possible RNS
moduli within a range of 5–8 b.
1See Section IV for details on moduli characterization. Note that Hiasat did
not address the issue of moduli cost.
Fig. 3. Modular adders proposed by (a) Bayoumi et al. [4], (b) Piestrak [20], and (c) Hiasat [21].
Modular adders based on redundant number systems, such as
the binary signed-digit (BSD) number system, have also been
considered [28]. These adders, however, are efficient for algo-
rithms where a series of repeated additions are required (e.g., as
in modular multiplication) since conversion from the redundant
representation is costly. Hence, as this paper is concerned with
modular adder architectures where both input and output are in
residue form, adders based on redundant number systems have
not been considered. Moreover, ROM-based techniques are gen-
erally not as efficient as combinatorial implementations, and so
they are also not included in the subsequent comparisons.
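For an $n$-bit modulus $M$, where $n = \lceil \log_2 M \rceil$ and where
$\lceil x \rceil$, in this case, represents the smallest integer greater than $x$,
the modular addition problem can be formulated as
$$|X + Y|_M = \begin{cases} X + Y, & \text{if } X + Y < M \\ X + Y - M, & \text{if } X + Y \ge M \end{cases} \qquad (7)$$
where $X$, $Y$, and $|X + Y|_M$ are nonnegative integers less than $M$.
The subtraction in the above equation can be replaced by an addition
of the additive inverse of $M$,
$T = 2^n - M$, and, as a result, the modular addition equation
above can be redefined as [19], [21]
$$|X + Y|_M = \begin{cases} |X + Y + T|_{2^n}, & \text{if } X + Y + T \ge 2^n \\ X + Y, & \text{otherwise.} \end{cases} \qquad (8)$$
The equation for $|X + Y|_M$ above identifies the general approach
used to perform modular addition. If the carry from the sum
$X + Y + T$ is 1, then the result of the modular addition is the
integer represented by the $n$ LSBs of the corresponding sum.
Otherwise, the sum of $X + Y$ is chosen as the correct
result.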
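The equivalence of the two formulations can be checked exhaustively for small moduli. The Python sketch below assumes $T = 2^n - M$ (the two's complement of the modulus) and operands already reduced to the range 0 to $M - 1$; the function names are hypothetical.

```python
def mod_add_direct(x, y, m):
    """Formulation (7): compare with the modulus and subtract when needed."""
    s = x + y
    return s if s < m else s - m

def mod_add_two_sums(x, y, m):
    """Formulation (8): form X+Y and X+Y+T, then select using the carry of X+Y+T."""
    n = m.bit_length()        # wordlength; equals ceil(log2 M) when M is not a power of two
    t = (1 << n) - m          # T = 2^n - M
    sum_xy = x + y
    sum_xyt = x + y + t
    carry = sum_xyt >> n      # 1 exactly when X + Y >= M
    if carry:
        return sum_xyt & ((1 << n) - 1)   # keep the n LSBs
    return sum_xy
```

The second form needs no comparator or subtractor: the selection signal is simply the output carry of one of the two additions.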
The structure of the adder presented by Bayoumi et al. is
shown in Fig. 3(a). The circuit consists of a series connection of
two CPAs, an OR gate, and a multiplexer. Adder A computes
the sum $X + Y$ while Adder B computes $|X + Y|_{2^n} + T$,
where $|X + Y|_{2^n}$ represents the $n$ LSBs of $X + Y$. If the carry from
either addition is 1, the multiplexer chooses sum B, otherwise
sum A is chosen.
Piestrak’s fast architecture is shown in Fig. 3(b). The modulo
result is computed using a CSA and two CPAs working in parallel.
Adder A computes the sum $X + Y$, while the series connection
of the CSA and Adder B computes $X + Y + T$.
Note that the CSA is a simplified CSA structure, where the extent
of the simplification is dependent on the binary form of $T$.
If the carry from Adder B is 1, the multiplexer chooses sum B,
otherwise sum A is chosen.
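The two selection schemes just described can be modelled behaviourally in Python. This is a sketch under stated assumptions: $T = 2^n - M$, operands are already reduced modulo $M$, the function names are hypothetical, and in the second model the carry used for selection is taken as bit $n$ of the combined CSA and Adder B result, which glosses over how the top bit of the carry vector is wired in hardware.

```python
def series_adder(x, y, m):
    """Two adders in series: Adder B adds T to the n LSBs produced by Adder A."""
    n = m.bit_length()
    mask = (1 << n) - 1
    t = (1 << n) - m
    total_a = x + y
    sum_a, carry_a = total_a & mask, total_a >> n
    total_b = sum_a + t
    sum_b, carry_b = total_b & mask, total_b >> n
    # Sum B is selected when either addition produces a carry (the OR gate).
    return sum_b if (carry_a | carry_b) else sum_a

def parallel_adder(x, y, m):
    """Adder A forms X+Y while a carry-save stage plus Adder B forms X+Y+T."""
    n = m.bit_length()
    mask = (1 << n) - 1
    t = (1 << n) - m
    sum_a = x + y
    save_sum = x ^ y ^ t                             # bitwise full-adder sums
    save_carry = ((x & y) | (x & t) | (y & t)) << 1  # bitwise full-adder carries
    total_b = save_sum + save_carry                  # Adder B
    carry_b = total_b >> n
    return (total_b & mask) if carry_b else sum_a
```

Both models return the same residue; the difference the text draws attention to is structural, since the first places two carry-propagate additions on the critical path and the second only one plus a carry-save stage.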
The proposal by Hiasat is shown in Fig. 3(c). Hiasat first re-
duces the three-operand addition to a sum (a) and
carry (b) using a simplified CSA structure called the SAC
unit (similar to Piestrak’s CSA ). The addition is also
transformed to a sum and carry with set to 0 in the
SAC unit. Thereafter, a carry propagate and generate (CPG)
unit is used to compute the propagate and generate terms for the
addition and the addition .
Note that propagate terms are required to be computed using
XOR gates in this adder. If OR gates are used then the MUX
unit needs a row of 3:1 multiplexers, which compromises the
design’s area efficiency. The output carry is then calculated for
the addition. For -bit modular addition, a max-
imum of multiplexers are then used to select
the appropriate propagate and generate terms based on the com-
puted carry, after which the CLAS unit uses CPA techniques to
compute the modular sum using the selected propagate/generate
pairs.
Fig. 4. Block diagram of the proposed modular adder.
In this paper, all CPA-based architectures that are used
in these competitive adders are implemented using the
Ladner–Fischer parallel-prefix (LFPP) algorithm.
III. METHODOLOGY AND ARCHITECTURES FOR THE DESIGN
OF MODULAR ADDERS USING SHARED HARDWARE
Here, the proposed algorithm and the corresponding archi-
tectures for generic modular addition will be presented. The
modular addition algorithm shown in (8) can be implemented
by the block diagram of Fig. 4. The adder consists of three
main modules, namely, the simplified CSA (SCSA(T)) module, the
double addition with shared hardware using carry propagate
addition (DASH-CPA) module, and the MUX module. Each
module shall now be described separately.
A. SCSA(T) Module
The SCSA(T) module has the operands $X$ and $Y$ as input and
generates the carry-save representation of $X + Y$ and $X + Y + T$.
The structure is a simplified version of the conventional CSA,
where the extent of the simplification is highly dependent upon
the binary form of $T$. A similar structure was used by Piestrak
[20], and later by Hiasat [21], in their respective proposals when
reducing the sum $X + Y + T$ to two operands. For the convenience
of the reader, the resulting simplifications are reproduced
next.
Let $s_i$ and $c_{i+1}$, respectively, denote the sum and carry outputs
from the $i$th FA in a CSA, then
$$s_i = x_i \oplus y_i \oplus t_i \qquad (9)$$
$$c_{i+1} = x_i \cdot y_i + t_i \cdot (x_i + y_i) \qquad (10)$$
where $x_i$, $y_i$, and $t_i$ represent the $i$th bit in the unsigned binary
representation of $X$, $Y$, and $T$, respectively, and where
$0 \le i \le n-1$. If the input bit $t_i$ is 0, then (9) and (10) reformulate as
$$s_i = x_i \oplus y_i \qquad (11)$$
$$c_{i+1} = x_i \cdot y_i \qquad (12)$$
Similarly, if $t_i$ is 1, then (9) and (10) can be written as
$$s_i = \overline{x_i \oplus y_i} \qquad (13)$$
$$c_{i+1} = x_i + y_i \qquad (14)$$
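The bit-level simplification can be sketched in Python. The model below assumes that a full adder with a constant input reduces to XOR and AND when the constant bit is 0 and to XNOR and OR when it is 1, and that $T = 2^n - M$. The function and variable names are hypothetical and do not reproduce the paper's signal names.

```python
def scsa_t(x, y, m):
    """Carry-save pairs for X+Y and for X+Y+T, built from two-input gates only."""
    n = m.bit_length()
    t = (1 << n) - m
    plain_sum = plain_carry = 0      # carry-save form of X + Y
    with_t_sum = with_t_carry = 0    # carry-save form of X + Y + T
    for i in range(n):
        xi, yi, ti = (x >> i) & 1, (y >> i) & 1, (t >> i) & 1
        xor_bit, and_bit, or_bit = xi ^ yi, xi & yi, xi | yi
        plain_sum |= xor_bit << i
        plain_carry |= and_bit << (i + 1)
        if ti == 0:
            # Constant bit 0: same XOR and AND outputs as for X + Y.
            with_t_sum |= xor_bit << i
            with_t_carry |= and_bit << (i + 1)
        else:
            # Constant bit 1: sum becomes XNOR, carry becomes OR.
            with_t_sum |= (xor_bit ^ 1) << i
            with_t_carry |= or_bit << (i + 1)
    return (plain_sum, plain_carry), (with_t_sum, with_t_carry)
```

Adding each pair with an ordinary adder recovers $X + Y$ and $X + Y + T$, so the two pairs are exactly what a following carry-propagate stage needs.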
Fig. 5. Cells used to compute the sum and carry when $t_i$ is 1: (a) É-cell and
(b) O-cell.
From (11) and (12), it is evident that, when $t_i = 0$, the sum and
carry bits for the sum $X + Y + T$ are identical to the sum and carry
bits for the sum $X + Y$. Note that, for a modulus $M$, where
the binary representation of has ones, the bus signals
and con-
sist of bits that are different from those in
and .
In the remaining sections of this paper, all signals relating ex-
clusively to the addition will be denoted with the
superscript . In addition, for moduli, the most significant
bit (MSB) of the vector , , is always zero, and hence
forms partial output carry information for the addition .
The circuits for (11)–(13) can be readily implemented using
cells that have already been defined for the ELM adder in
Fig. 2, i.e., (11) can be implemented using an E-cell, (12) using
a P-cell, and (13) using an E-cell with an inverted output (im-
plementing the XNOR function), which will be referred to as the
É-cell. The block representations of (13) and (14) are shown in
Fig. 5(a) and (b), respectively, where the O-cell symbolizes the
logical OR function. If the binary representation of contains
ones, then the total hardware cost for the SCSA(T) module is
-cells, -cells, -cells, and -cells.
B. Dash-CPA Module
This module is at the heart of the proposed adder, and it is
the most crucial in determining the performance of the entire
adder module. The aim of this module is to compute both sums
and in parallel and, then, using the output
carry resulting from the sum , , to select the
correct modulo sum. The novelty of this module is that it allows
the sharing of logic units in the CPA, thereby allowing
the computation of two parallel additions using less hardware
than that required for two CPAs. The CPA designs considered in
this paper include the ELM [26], Ladner–Fischer (LF) [23] and
Kogge–Stone (KS) [24] parallel-prefix (PP) adders. The ELM
adder was considered due to its predemonstrated efficiency in
power, delay, and area [27]. The LF and KS prefix adders were
chosen over other existing prefix adders as they represent end
cases in area and delay efficiency, respectively, amongst existing
minimum-logic-depth prefix structures [25].
The DASH-CPA processes two additions and not
two additions. This is because , which
means that and are given by and , respectively.
Thus, the LSB of the sum for both additions need not be com-
puted in this module. In addition, no carry will be generated into
bit position 1 and thus and .
Fig. 6. Methodology describing the implementation of the DASH-CPA
module.
This module can be constructed using the methodology pre-
sented in Fig. 6, regardless of which CPA topology is used. Note
that each processing operator/node/generator that is commonly
used to describe the aforementioned CPA architectures will be
subsequently referred to as a logic unit under a uniform umbrella
notation for ease of understanding.
To illustrate the proposed modular CPA design methodology,
tree structures for ELMMA, LFPPMA, and KSPPMA corresponding
to the CPA architectures ELM, LFPP, and KSPP, respectively,
are shown in Fig. 7 for $M = 29$. Note that the
CPA structure for the addition can be identified by con-
joining the thinly dashed units with the solid-lined units, where
the solid-lined units represent hardware shared between both ad-
ditions. Similarly, the CPA structure for the addition
can be identified by conjoining the thickly dashed units with the
solid-lined units.
For , and .
Hence, and for and 1. This
means that the pairs of bits are different from the
corresponding pairs of bits for , 1, and 2, i.e.,
. As a result logic units
are shared on level 0, corresponding to bit positions 3 and 4 as
is shown in Fig. 7. At subsequent levels, it is evident that any
logic units that process information exclusively derived from
these two shared logic units are not replicated for the addition
. For example, logic unit 5 in the ELMMA tree is
Fig. 7. DASH-CPA implementation for M = 29 using (a) ELMMA tree, (b)
LFPPMA tree, and (c) KSPPMA tree.
Fig. 8. Cell layout for ELMMA when M = 29.
not replicated as it depends exclusively on logic units 2 and 3,
both of which are shared by the two additions. A more detailed
structure for the ELMMA tree is shown in Fig. 8, where the
logic units have been implemented using the predefined cells of
Fig. 2.
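Which bit positions can be shared at the input of the carry-propagate stage follows directly from the binary form of the two's complement of the modulus. The Python sketch below assumes $T = 2^n - M$ and that a 1 at bit $i$ of $T$ changes the carry-save sum bit at position $i$ and the carry bit at position $i + 1$; the function name is hypothetical.

```python
def differing_positions(m):
    """Return (differing, shared) bit positions at the carry-propagate inputs."""
    n = m.bit_length()
    t = (1 << n) - m
    ones = [i for i in range(n) if (t >> i) & 1]
    differing = set(ones)                                # sum bits that change
    differing |= {i + 1 for i in ones if i + 1 < n}      # carry bits that change
    shared = [i for i in range(n) if i not in differing]
    return sorted(differing), shared

print(differing_positions(29))  # ([0, 1, 2], [3, 4])
```

For modulus 29 this reproduces the example in the text: positions 0 to 2 differ between the two additions, and positions 3 and 4 can use shared level-0 logic.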
Note that the parallel-prefix structures LFPPMA and
KSPPMA have not been drawn in the conventional manner as
shown in Fig. 1. Instead, the functionality of each logic unit is
given by using appropriate labels, where and represent the
computation of the generate and the propagate term, respec-
tively, of the prefix operator . This was done to help identify
the two parallel CPAs.
C. Mux Module
This module consists of multiplexers that are used to
select the sum bits generated from the two additions performed
in the DASH-CPA module. The signal drives the multi-
plexer select port for all the multiplexers and it therefore has
a fan-out equal to . An example MUX module is illustrated in
Fig. 8.
In Section IV, a discussion follows that characterizes a given
modulus into either a high-cost generic modulus (HCGM) or a
low-cost generic modulus (LCGM), where a LCGM allows the
greatest savings in hardware for the proposed adders. The anal-
ysis applies to all moduli with an internal residue wordlength of
.
IV. CHARACTERIZATION OF MODULI BASED ON MODULAR
ADDER COMPLEXITY
For a given residue wordlength $n$, there are a total of $2^{n-1}$
integers that lie in the range $[2^{n-1} + 1, 2^n]$. Of these integers,
$2^{n-2}$ are odd and $2^{n-2}$ are even. Only one even integer can be
considered as a valid modulus because, for an RNS system to
achieve maximum representational efficiency, it is imperative
that all moduli are relatively prime [5], and thus only one even
modulus is allowed in the definition of an RNS base. Since the
design of modulo $2^n$ adders is easy and as they offer the largest
dynamic range over all possible moduli for a given wordlength
$n$, the even modulus is almost always chosen to be $2^n$.
In addition, as efficient adder circuits exist for moduli of the
form $2^n \pm 1$ [16], [17], there are a total of $2^{n-2} - 2$
odd integers in the range $[2^{n-1} + 3, 2^n - 3]$ that can
be classified as generic moduli. In order for a modulus to be
characterized as either an HCGM or an LCGM, attention has to be
paid to the binary representation of $T$.
From the description of the module in
Section III-A, it is evident that the module is
simpler when has a small number of
1’s, i.e., when its Hamming weight is small. As is odd,
the value of is always 1. In addition, for only
for , i.e., only for . Hence,
for the range under consideration the minimum value for
is 2 which corresponds to , 5, 9, , ,
where . In general, the generic moduli that offer
minimum complexity in the module are given by
,.
Note that is not the only consideration for the simplifi-
cation of the proposed modular adders. What is more significant
is the value of . From (13) and (14), it can be seen that, if
, then bit positions and are different at the input
of the DASH-CPA module for the two additions. It was shown
in Section III-B that, for , . When ,
and hence bit positions 0, 1, 3, and 4 are
different, i.e., . Hence, although is the same for
and , more hardware is required for the imple-
mentation of a modulo 23 adder. In general, for a given value
of , the amount of shared hardware is maximized when the
LSBs in are set to one.
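The two quantities discussed above, the number of ones in the two's complement of the modulus and the position of its most significant one, are easy to tabulate. The Python sketch below assumes $T = 2^n - M$ and takes the generic moduli for a wordlength $n$ to be the odd integers strictly between $2^{n-1} + 1$ and $2^n - 1$; the function names and the tuple layout are hypothetical.

```python
def characterize(m):
    """Return (T, Hamming weight of T, index of the most significant 1 in T)."""
    n = m.bit_length()
    t = (1 << n) - m
    weight = bin(t).count("1")
    msb_one = t.bit_length() - 1
    return t, weight, msb_one

def generic_moduli(n):
    """Odd n-bit moduli, excluding 2^(n-1) + 1 and 2^n - 1 (assumed non-generic)."""
    low = (1 << (n - 1)) + 1
    high = (1 << n) - 1
    return [m for m in range(low + 2, high, 2)]

for m in generic_moduli(5):
    print(m, characterize(m))
```

Moduli 29 and 23 both give a Hamming weight of 2, yet the most significant one sits at bit 1 for 29 and at bit 3 for 23, which is why modulo 23 leaves less hardware to share.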
From the above, it can be seen that each modulus can be char-
acterized by the variables and , where
and for . As di-
rectly impacts the amount of shared hardware in the main por-
tion of the adder, i.e., the DASH-CPA, its value shall be used
to classify the generic moduli. More specifically, if is less
than or equal to , where rep-
resents the largest integer smaller than or equal to , then the
modulus will be classified as a LCGM. This value for is
chosen as the midway point in the range of values for . If
then the corresponding modulus will be classi-
fied as a HCGM. Of all the LC moduli for a given wordlength,
the value of results in minimum values for and
. Note that moduli of this form have recently been shown
to form an efficient RNS base that offers balanced arithmetic as
well as a large dynamic range coverage [1].
Illustrative example: Consider the moduli and
where . For this case, .
When , , and hence 29 is an LCGM. If
, and , which means 19 is
an HCGM.
V. RESULTS
A comparison of the proposed adders is made with the com-
petitive adders in the power-delay-area space. Following this,
the advantage of using LC moduli over HC moduli is quantified
using VLSI implementation results for the proposed adder ar-
chitectures.
A. Performance Comparison
Both qualitative and quantitative comparisons of the proposed
architectures with the architectures proposed in [4], [20], and
[21] are presented. For the qualitative comparison, a simple
model is employed, and actual 0.13-μm CMOS implementation
results are described for the quantitative comparison.
1) Theoretical (Qualitative) Analysis: The widely used unit-
gate model ([29], [30]) is used to derive circuit size and delay es-
timates. When using this model, each two-input monotonic gate
(e.g., AND or OR) counts as one gate in both area and delay. An
XOR/XNOR gate counts as two gates in area and delay. The model
ignores the effect of fan-out and hence the resulting estimates
will be validated using VLSI implementations. The theoretical
estimates for a given wordlength will be derived for moduli that
do not allow the sharing of hardware, i.e., for moduli where
and . This corresponds to the worst
case for all of the proposed adders as well as for Hiasat’s adder
in [21]. For a given wordlength, the unit-gate delay and area
complexities of Bayoumi et al.’s and Piestrak’s adders are un-
affected by variations in and . Table I displays the
unit-gate delay and area values for commonly employed mod-
ular adder components.
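The unit-gate convention can be captured in a few lines of Python. The sketch below encodes only the costs stated above (one unit of area and delay for a two-input AND or OR, two for an XOR or XNOR) and, as an assumption of this example, treats gates that operate side by side as adding in area while contributing only the slowest delay. The names are hypothetical, and the two bit slices are illustrative rather than entries from Table I.

```python
# (area, delay) in unit gates for each two-input gate type
UNIT_GATE = {"AND": (1, 1), "OR": (1, 1), "XOR": (2, 2), "XNOR": (2, 2)}

def parallel_cost(gates):
    """Gates operating side by side: areas add, delay is that of the slowest gate."""
    area = sum(UNIT_GATE[g][0] for g in gates)
    delay = max(UNIT_GATE[g][1] for g in gates)
    return area, delay

# One carry-save bit slice with a constant third input:
slice_constant_zero = parallel_cost(("XOR", "AND"))   # constant bit is 0
slice_constant_one = parallel_cost(("XNOR", "OR"))    # constant bit is 1
```

Under these assumptions both kinds of slice cost the same in area and delay, so the model by itself does not distinguish between them; any difference comes from what the later stages can share.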
Detailed complexity evaluation for ELMMA follows, where
the complexities for the remainder of the proposed adders and
those for Bayoumi et al.’s, Hiasat’s, and Piestrak’s adders are
evaluated in the appendix to enhance readability.
In the worst case, the unit requires each of the
P-cells and E-cells and each of the É- and O-cells, and
thus the total unit-gate area is . The delay
through the unit , where is
the delay through an E-cell.
For the section of the ELMMA tree (DASH-CPA
module), the area requirement is that of a
ELM adder without the output carry computation circuitry.
Thus, any logic that generates or propagates carries from
the MSB is not required. This means that the generate term
computed at the MSB, i.e., at bit position , on levels 0
and along with the generate/propagate pairs computed
at the MSB on levels 1 to are not required. Hence,
TABLE I
UNIT-GATE CHARACTERIZATION OF MODULAR ADDER COMPONENTS
TABLE II
WORST CASE UNIT-GATE DELAY AND AREA ESTIMATIONS FOR PROPOSED
AND EXISTING MODULAR ADDERS
the circuit for the addition in the ELMMA tree re-
quires an area of
-
,
where , and represent the area complex-
ities of a -bit ELM adder with , a -cell and a
-cell respectively. The area requirement of the logic that
computes is given by
-
, where the
O-cell is used to compute the output carry . Hence, the
total area required by the ELMMA tree is
-
- -
. The critical path through the ELMMA
tree is the same as that through a -bit ELM adder, i.e.,
-
.
The MUX module requires multiplexers, which results
in a total area of
-
and a critical path delay of
-
. The overall unit-gate area and delay estimations
for ELMMA are thus given by
-
(15)
- -
(16)
Final unit-gate area and delay estimations for all adders are
summarized in Table II. As expected, the critical path delay
Fig. 9. Worst case theoretical variation in D × A.
estimates are superior for the proposed parallel-prefix adders
and Piestrak’s adder. Bayoumi et al.’s adder is the slowest
as the critical path consists of two CPAs. The area-reducing
technique used by Hiasat produces a design that is theoretically
the most frugal in silicon requirements, even when considering
that the evaluation reflects worst case figures. On the other
hand, KSPPMA is the most deficient in area. This is because the
bounding of fan-out in the KSPP architecture results in an area
overhead, which is effectively doubled for worst case moduli.
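The area overhead of bounded fan-out can be made concrete with a small counting exercise. The following Python sketch is an illustrative aside rather than part of the unit-gate analysis above: it counts the prefix operators and the largest lateral fan-out per level for a plain n-bit Kogge-Stone carry tree and for a minimum-depth, unbounded fan-out tree of the Sklansky type, which is assumed here to stand in for the LFPP structure. The function names are hypothetical, and the counts cover only the carry tree of an ordinary binary adder, not the modular adders or the cell costs of Table I.

```python
def kogge_stone_stats(n):
    """Return (prefix_operator_count, max_lateral_fanout) for an n-bit Kogge-Stone tree."""
    nodes, max_fanout = 0, 0
    d = 1
    while d < n:
        readers = {}
        # At distance d, every bit i >= d combines with bit i - d.
        for i in range(d, n):
            nodes += 1
            readers[i - d] = readers.get(i - d, 0) + 1
        max_fanout = max(max_fanout, max(readers.values()))
        d *= 2
    return nodes, max_fanout


def sklansky_stats(n):
    """Return (prefix_operator_count, max_lateral_fanout) for an n-bit Sklansky-type tree."""
    nodes, max_fanout = 0, 0
    d = 1
    while d < n:
        readers = {}
        for i in range(n):
            if i & d:
                # Bits in the upper half of each 2d-wide block all read
                # the top bit of the lower half of that block.
                src = (i // d) * d - 1
                nodes += 1
                readers[src] = readers.get(src, 0) + 1
        max_fanout = max(max_fanout, max(readers.values()))
        d *= 2
    return nodes, max_fanout


if __name__ == "__main__":
    for n in (5, 6, 7, 8):
        print(n, kogge_stone_stats(n), sklansky_stats(n))
```

For an 8-bit tree the sketch gives 17 operators with a lateral fan-out of 1 for Kogge-Stone, against 12 operators with a lateral fan-out of 4 for the Sklansky-type tree, which is the qualitative tradeoff described above.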
In order to illustrate the worst case theoretical design tradeoffs
for the adders, the delay × area (D × A) estimates of all
adders for wordlengths of 5–8 b have been plotted in Fig. 9.
The D × A performance for ELMMA and LFPPMA is very
similar for all , and the much increased area complexity of
KSPPMA results in a tradeoff performance reduction. However,
as the unit-gate model does not account for fan-out, the delay
performance of the bounded fan-out KSPPMA is expected to
be superior to the unbounded fan-out LFPPMA and ELMMA.
The D × A variation suggests that Piestrak’s adder will perform
slightly better than LFPPMA and ELMMA as increases.
However, it is worth noting that the unit-gate estimates for the
binary adders in Table I are accurate only when is a power
of two. The area is particularly overestimated when is not a
power of two. Moreover, the varying dependence of the modular
adders on the sizes of the CPAs used (e.g., and ), means
that the accuracy of the comparison is further compromised.
Thus, the unit-gate estimates allow us to postulate that Pies-
trak’s adder will exhibit a similar combined design complexity
to LFPPMA and ELMMA for high-cost (HC) moduli. This is
as expected. Furthermore, based on these estimates, Bayoumi
et al.’s and Hiasat’s adders will be the least efficient. One im-
portant aspect of Bayoumi et al.’s adder, which is not reflected
in the unit-gate estimates, is that one of the operands for Adder B
is a constant. This will lead to simplifications during logic opti-
mizations, and hence improve its performance in the delay-area
space. Results from actual VLSI implementations thus need to
be obtained so that accurate comparisons can be made.
2) VLSI Implementation Results: For greater practical in-
terest and to obtain meaningful power dissipation estimates for
the adders, we described the proposed modular adders, Bayoumi
et al.’s adder [4], Piestrak’s adder [20], and Hiasat’s adder [21]
in VHDL for a range of 5–8-b moduli. For each wordlength,
both HCGM and LCGM were implemented, where for the
HCGM and .
Similarly, and for
the chosen LC moduli. For a given wordlength , the moduli
and satisfy the requirements for the HCGM
and LCGM, respectively, and thus only moduli of these forms
were considered. The objective for choosing such moduli was
to illustrate the impact of modulus choice on the delay, area,
and power performance of a modular adder, and thus aid the
RNS system designer in selecting an implementation-economic
moduli set.
Cadence’s PKS and Silicon Ensemble tools were used to map
generic gate-level VHDL descriptions of all designs onto Virtual
Silicon’s UMC 0.13-μm standard-cell design kit. Each design
was recursively optimized for speed until the tool was unable
to meet the performance constraints. We then selected results
from the optimization run that provided the smallest
product, with the proviso that the maximum delay was within
2% of the minimum delay achieved through the extensive sim-
ulation process. The delay values were obtained after back-an-
notating parasitic information on the placed and routed designs
using PKS’s in-built static timing analysis (STA) engine. As
for power estimation, we adopted the power estimation method-
ology presented in [31], where the adders were presented with
independent pseudorandom inputs clocked at 200 MHz. The
adders were simulated until the power results were within a 95%
confidence interval and resulted in less than 1% error.
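The run-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scripts used with the synthesis tools: the function name, the dictionary keys, and the sample figures are hypothetical, and the cost product is assumed here to be delay × area.

```python
def select_run(runs, tolerance=0.02):
    """Pick the optimization run with the smallest delay * area product
    among the runs whose delay is within `tolerance` of the minimum delay."""
    min_delay = min(r["delay"] for r in runs)
    eligible = [r for r in runs if r["delay"] <= min_delay * (1.0 + tolerance)]
    return min(eligible, key=lambda r: r["delay"] * r["area"])


if __name__ == "__main__":
    # Hypothetical results (arbitrary units), for illustration only.
    runs = [
        {"name": "run_a", "delay": 1.00, "area": 500.0},
        {"name": "run_b", "delay": 1.01, "area": 450.0},
        {"name": "run_c", "delay": 1.05, "area": 300.0},
    ]
    print(select_run(runs)["name"])
```

In this made-up data set the third run has the smallest product overall but is rejected because its delay exceeds the 2% window, so the second run is chosen.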
We simulated the adders using two different optimization ap-
proaches: layout-level optimization and aggressive optimiza-
tion. During layout-level optimization, the tool was restricted to
performing buffer insertion (to alleviate fan-out problems) and
cell resizing (for nontiming-critical paths) optimizations only.
The aim here was to validate the theoretical estimates.
Table III presents the results for this approach. In the table,
we have indicated whether a modulus is high-cost (HC) or low-cost (LC).
The results show that the proposed adders and Piestrak’s adder
are significantly faster compared to Hiasat’s and Bayoumi et
al.’s adders. In general, of all the adders studied, Piestrak’s
adder and KSPPMA are the fastest, but with only slight im-
provements over ELMMA and LFPPMA. Although the critical
path delay for these adders is theoretically the same, the advan-
tages of KSPPMA and Piestrak’s design are due to different rea-
sons. For KSPPMA, the bounded fan-out parallelization of the
carry computation is a big factor. For Piestrak’s design, the ad-
vantage is down to its layout. Adder A and Adder B in Piestrak’s
adder of Fig. 3(b) are separate entities, which results in local
optimizations for these adders within their own respective envi-
ronments. On the other hand, the proposed adders are designed
to allow the sharing of hardware between the two parallel ad-
ditions, and thus combining their respective environments. This
means that when laid out in silicon, the proposed adders may
require lengthier interconnections since the wiring needs to be
routed around cells for both additions. As a result, the capac-
itive load on these wires is increased, which compromises the
delay performance. It is worth noting that Piestrak’s adder is ap-
proximately 2% faster than LFPPMA on average, but at an ex-
pense of at least 24% in both area and power complexities. This
demonstrates the superiority of the proposed hardware sharing
approach.
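For readers less familiar with the two-addition structure referred to above, the following Python sketch gives a purely behavioural model of modular addition using two parallel additions and a multiplexer. It is an illustration under stated assumptions rather than a description of any of the gate-level designs compared here: the function name is hypothetical, the operands are assumed to lie in the range 0 to m − 1, and neither hardware sharing nor timing is modelled.

```python
def mod_add_two_adders(x, y, m):
    """Behavioural model: compute (x + y) mod m with two parallel additions.

    Adder A forms x + y; Adder B forms x + y + (2**n - m), i.e. the sum
    with the constant correction term. The carry out of Adder B selects
    which n-bit result is passed through the multiplexer.
    """
    n = max(1, (m - 1).bit_length())   # wordlength needed for residues of m
    mask = (1 << n) - 1
    correction = (1 << n) - m          # constant operand of Adder B

    sum_a = x + y                      # Adder A
    sum_b = x + y + correction         # Adder B (addition with a constant)
    carry_b = sum_b >> n               # 1 exactly when x + y >= m

    # MUX: take the corrected sum when x + y has reached or passed m.
    return (sum_b & mask) if carry_b else (sum_a & mask)


if __name__ == "__main__":
    print(mod_add_two_adders(20, 15, 29))
```

Because x + y + (2^n − m) reaches 2^n exactly when x + y ≥ m, the carry out of the second addition is sufficient as the select signal in this model.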
TABLE III
AREA, DELAY, AND POWER RESULTS FOR THE LAYOUT-LEVEL OPTIMIZATION
FLOW
Although Hiasat’s architecture depicts increasing area ef-
ficiency with an increase in the wordlength , the superior
delay performance of the proposed adders leads to an improved
tradeoff performance. As expected, Bayoumi et al.’s adder
is inferior to the proposed adders for all cost parameters. In-
terestingly though, Bayoumi et al.’s adder is generally faster
than Hiasat’s adder. This is partly because the high fan-out
output carry signal that drives the MUX unit in Hiasat’s adder
[21] is on the critical path, whereas for Bayoumi et al.’s adder
Fig. 10. Modular adder tradeoff performance comparison for the layout-level optimization approach.
[4] it is not. In addition, Adder B in Bayoumi et al.’s adder
performs addition with a constant, which, even without logic
optimization, simplifies the first stage in the addition and thus
provides relative improvements in speed. During aggressive
optimization, greater savings are expected.
The power dissipated by the adders in our case is directly pro-
portional to the silicon area for all adders. This is intuitive since:
1) static dissipations are proportional to area and 2) switching
activity (dynamic dissipations) is proportional to the number of
nodes within a circuit, which in turn is proportional to its area.
As the underlying topology for all considered adders is the same
(parallel-prefix), and since the wordlengths are small, there is
little distinguishing the adders in terms of the impact of inter-
connect on the power dissipation. Furthermore, for LC moduli,
the area (and hence power) of both LFPPMA and ELMMA
is very similar to Hiasat’s area-efficient adder. The combined
complexity performance in this case is superior for the pro-
posed adders as they are much faster, especially for battery-op-
erated applications where energy consumption is of importance.
In comparison to Piestrak’s adder, the sharing of hardware has
resulted in a dramatic reduction in power, which demonstrates
the significant impact of area on power.
Focusing on the proposed adders, from Table II we expected
ELMMA to be as fast as the other parallel-prefix adders.
However, Table III indicates otherwise. The reason is that
for LFPP and KSPP adders, the critical path consists of a
series of AND-OR/AND gates, which in complementary CMOS
can be efficiently implemented using a series of alternating
AND-OR-INVERT (AOI)/NAND and OR-AND-INVERT (OAI)/NOR
gates. For ELM, the critical path is dependent on a series of
XOR gates, and as the XOR gates are not as fast in the CMOS
standard-cell library used [32] as the AO/AOI/OAI gates, the
ELM based adder is slightly slower.
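The AND-OR/AND structure mentioned above is that of the standard prefix operator on generate/propagate pairs, (g, p) ∘ (g′, p′) = (g + p·g′, p·p′). As a hedged illustration, the Python sketch below evaluates this operator level by level in a Kogge-Stone arrangement for an ordinary n-bit binary addition. The function name is hypothetical, and the code models only the logic function, not the CMOS gate mapping or the ELM tree.

```python
def prefix_add(a, b, n):
    """Add two n-bit integers with a Kogge-Stone style prefix computation.

    Returns (sum_bits_as_int, carry_out).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate bits
    half_sum = p[:]                                    # kept for the final XOR stage

    d = 1
    while d < n:
        new_g, new_p = g[:], p[:]
        for i in range(d, n):
            new_g[i] = g[i] | (p[i] & g[i - d])        # AND-OR
            new_p[i] = p[i] & p[i - d]                 # AND
        g, p = new_g, new_p
        d *= 2

    # After the last level, g[i] is the carry out of bit position i.
    total = 0
    for i in range(n):
        carry_in = g[i - 1] if i > 0 else 0
        total |= (half_sum[i] ^ carry_in) << i
    return total, g[n - 1]


if __name__ == "__main__":
    print(prefix_add(19, 27, 5))
```

Only the last step uses XOR gates in this formulation; every prefix level on the carry path consists of the AND-OR and AND terms.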
The graphical variation of the D × A and P × D × A cost
functions for all moduli studied is shown in Fig. 10. The graphs
show that the proposed adders significantly outperform the
competitive adders for both complexity functions, regardless of the
type of modulus. In D × A, ELMMA and LFPPMA offer an
average saving of at least 43%, 23%, and 27% compared with
Bayoumi et al.’s, Piestrak’s, and Hiasat’s adders, respectively.
As for comparison in P × D × A, we see that, from the competition,
Bayoumi et al.’s adder is the most deficient with average
savings of at least 58% offered by the proposed adders.
For the second optimization approach, we essentially opti-
mized a flattened netlist by allowing the tool full freedom in op-
timizing the designs for high performance. The results for this
optimization flow have been displayed in Table IV, with Fig. 11
illustrating the relative tradeoff performance of the adders.
The results show that, on average for the proposed adders,
KSPPMA is the fastest and ELMMA is the smallest and most
power-efficient. Where the cost functions D × A and P × D × A
are considered, ELMMA provides the greatest efficiency, with
marginal improvements over LFPPMA.
When compared with Hiasat’s adder, ELMMA provides savings
of approximately 25% in both D × A and P × D × A
complexities. Compared to Bayoumi et al.’s design, although
the tradeoff performance is worse for HC moduli, ELMMA
achieves average savings of 8% and 11% in D × A and
P × D × A, respectively, over all moduli. Note that, for Bayoumi et
al.’s adder, whether a modulus is HC or LC does not have much
of an impact on adder performance. This is because the addition
is with a constant in both cases. Thus, after aggressive logic re-
structuring and optimization, the adder circuit is significantly
simplified which, as a result, improves area, delay, and power
performance. Piestrak’s adder, although faster, is 12% and 30%
less efficient than ELMMA in D × A and P × D × A, respectively.
For applications where energy (i.e., power × delay) reduction
and delay are important considerations, the energy × delay
product is a valid performance metric [33]. Fig. 12 illustrates
the efficiency of LFPPMA over all other designs for the
aggressive optimization approach. We see that over all the 5–8-b
moduli considered, regardless of the type of modulus, LFPPMA
offers advantages over Bayoumi et al.’s, Piestrak’s, and Hiasat’s
adders, with average gains of 29%, 12%, and 48%, respectively.
LFPPMA and ELMMA have very similar average performance,
whereas the relative size of KSPPMA results in an increased
complexity.
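The composite cost functions used in this comparison are simple products of the measured delay, area, and power. A minimal Python sketch follows, assuming energy is taken as power × delay as stated above and that a percentage saving is measured relative to the reference design. The function names and the sample figures are hypothetical and are not taken from Tables III to V.

```python
def cost_metrics(delay, area, power):
    """Return the composite cost functions for one adder implementation."""
    energy = power * delay                 # energy = power * delay
    return {
        "DA": delay * area,                # delay * area
        "PDA": power * delay * area,       # power * delay * area
        "ED": energy * delay,              # energy * delay = power * delay**2
    }


def saving_percent(proposed, reference):
    """Percentage saving of `proposed` relative to `reference` (assumed definition)."""
    return 100.0 * (reference - proposed) / reference


if __name__ == "__main__":
    # Made-up figures in arbitrary units, for illustration only.
    ours = cost_metrics(delay=2.0, area=100.0, power=0.5)
    other = cost_metrics(delay=2.0, area=125.0, power=0.625)
    for key in ("DA", "PDA", "ED"):
        print(key, saving_percent(ours[key], other[key]))
```

Because energy × delay weights delay quadratically, a design that is slightly larger but faster can rank differently under this metric than under D × A.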
Finally, the impact of the proposed hardware sharing principle
is best measured by comparing LFPPMA with the LFPP-based
competitive adders for LC moduli. The efficiency in design
tradeoffs is quantified in Table V. Double-digit gains are reported
in comparison with each adder, which demonstrates the
superiority of the proposed methodology in comparison to the
smallest and the fastest modular adders in the literature.
TABLE IV
AREA, DELAY, AND POWER RESULTS FOR THE AGGRESSIVE OPTIMIZATION
FLOW
B. LCGM Versus HCGM
Here, we describe the relative merits of choosing an LCGM
over an HCGM in the design of the proposed adders. By direct
inspection of the graphs shown in Figs. 10 and 11, a marked
improvement in performance is evident for LC moduli. For the
layout-level optimization approach, the proposed designs offer
Fig. 11. Modular adder tradeoff performance comparison for the aggressive
optimization approach.
Fig. 12. % savings achieved in the energy × delay product by LFPPMA for
the aggressive optimization approach.
TABLE V
AVERAGE SAVINGS RESULTING FROM THE PROPOSED HARDWARE SHARING
PRINCIPLE IN THE D × A AND P × D × A COMPLEXITY EVALUATION
MEASURES
Fig. 13. Savings offered in tradeoff performance by choosing an LCGM over an HCGM.
average savings of approximately 17% and 25% in the D × A
and P × D × A products, respectively, when an LCGM is chosen
in preference to an HCGM. As for the aggressive optimization
approach, Fig. 13(a)–(c) show the percentage improvement in various
design tradeoff functions for wordlengths of 5–8 b for LFPPMA,
ELMMA, and KSPPMA, respectively. The figures illustrate sig-
nificant gains in performance.
VI. CONCLUSION
In this paper, a novel hardware-sharing approach for the
problem of fast modular addition has been presented. Three
different parallel-prefix modular adder topologies are introduced.
Theoretical estimates in conjunction with 0.13-μm
standard-cell implementation results show that the proposed
adders outperform existing designs under various criteria, such
as the delay × area, power × delay × area, and energy ×
delay. A method for characterizing a modulus in terms of its
associated cost for modular addition has been presented, which
is then validated through actual VLSI implementation. Of the
proposed adders, the ELM-based modular adder, ELMMA,
was the best, on average, across the range of moduli studied.
APPENDIX
Here, the area and delay complexities for the proposed LF-
PPMA and KSPPMA adders are derived along with the com-
plexities of Hiasat’s adder in [21], Piestrak’s adder in [20], and
Bayoumi et al.’s adder in [4]. Unit-gate delay and area estima-
tions of adder components are listed in Table I.
The area of the portion of the LFPPMA tree is equal
to
-
,
where . The area requirement for the sec-
tion of the LFPPMA tree is given by
-
. Hence,
the total area required by the LFPPMA tree is given
by
-
.
The critical path through the LFPPMA tree is given by
-
. Thus, the
total area and delay complexity of LFPPMA when including
the cost of the and MUX modules can be approx-
imated by and
, respectively.
For the computation in the KSPPMA tree, a unit
gate area of
-
is required. The area required to process in
the KSPPMA tree is given by
-
. Hence,
the total area required by the KSPPMA tree is given
by
-
. The
critical path through the KSPPMA tree is given by
-
. Thus,
the total area and delay complexity of KSPPMA can be ap-
proximated by , and
, respectively.
Bayoumi et al.’s adder requires two -bit CPA adders,
one OR gate, and 2:1 multiplexers. Thus, the total
hardware requirement using LFPP-based adders is
. The circuit delay equals
the delay through two -bit LFPP adders and a multiplexer
(assuming is computed before the sum bits are valid), i.e.,
.
Hiasat’s adder consists of an unit, a CPG unit, a
MUX unit, an -bit computation unit, and a -bit
CLAS unit. The unit requires, at most, HAL
cells and two HA cells. This translates to an area of
and a delay of . The CPG unit requires
HA cells, which consumes a total area of . The
delay through this unit is . The MUX unit
requires a maximum of multiplexers, thus resulting in
a maximum unit-gate area of . The delay through this unit
is equal to that through one multiplexer which is .
The computation unit requires —operators ar-
ranged in a binary tree. Additional area is saved by removing
an AND gate (or P-cell) from each level since there is no input
carry that needs to be propagated from the LSB. This has an
area and time complexity of and , re-
spectively. An additional OR gate is required to compute the ac-
tual as needs to be accounted for. This means that the
area and time complexity for the CLA for unit is formula
and ,
respectively. The CLAS unit is similar to a -bit
fast CPA adder, without the carry generate/propagate circuitry,
and without the circuitry that generates or propagates carries
from the MSB (as has already been computed). For an
LFPP-based implementation, the unit-gate cost in area and delay
is
and it has a delay
of . The
total estimated cost in the delay and area space is given by
, and
, respectively.
For Piestrak’s adder, Adder A computes using an -bit
LFPP adder with , and where is not computed. Thus,
. Assuming
and , then the is composed of each
of the E’-cells and O-cells, and 2 each of the E-cells and P-cells
(HAs), i.e., , and . Adder B has an
area complexity equivalent to a -bit LFPP adder with
plus an OR gate (to compute ). Therefore,
. The critical
path delay through Adder B is given by (assuming
is computed before all the sum bits are valid). Finally,
the -bit MUX has a total area and delay complexity equal to
and
-
, respectively. Therefore, the
total area and delay complexities for Piestrak’s modular adder
are given by
, and , respectively.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for
their valuable comments.
REFERENCES
[1] S. Ming Hwa et al., “An efficient VLSI design for a residue to binary
converter for general balance moduli (2^n − 3, 2^n + 1, 2^n − 1, 2^n + 3),”
IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 51, no.
3, pp. 152–5, Mar. 2004.
[2] A. A. Hiasat, “VLSI implementation of new arithmetic residue to bi-
nary decoders,” IEEE Trans. Very Large-Scale Integr. (VLSI) Syst., vol.
13, no. 1, pp. 153–8, Jan. 2005.
[3] M. A. Soderstrand et al., Residue Number System Arithmetic: Modern
Applications in Digital Signal Processing. New York: IEEE Press,
1986.
[4] M. A. Bayoumi et al., “A VLSI implementation of residue adders,”
IEEE Trans. Circuits Syst., vol. 34, no. 3, pp. 284–8, Mar. 1987.
[5] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and Its Applications
to Computer Technology. New York: McGraw-Hill, 1967.
[6] W. Wei et al., “RNS application for digital image processing,” in Proc.
4th IEEE Int. Workshop Syst.-on-Chip for Real Time Applications,
Banff, Alta, Canada, Jul. 2004, pp. 77–80.
[7] “Low-power implementation of polyphase filters in quadratic residue
number system,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS 2004),
Vancouver, BC, Canada, May 2004, vol. 2, pp. 725–8.
[8] W. L. Freking and K. K. Parhi, “Low-power FIR digital filters using
residue arithmetic,” in Conf. Record 31st Asil. Conf. Signals, Syst.
Comput. (ACSSC 1997), Pacific Grove, CA, Nov. 1997, vol. 1, pp.
739–43.
[9] R. Chaves and L. Sousa, “RDSP: A RISC DSP based on residue number
system,” in Proc. Euromicro Symp. Digital Syst. Design, Belek-An-
talya, Turkey, Sep. 2003, pp. 128–135.
[10] J. M. Rabaey and M. Pedram, Low Power Design Methodologies.
Boston, MA: Kluwer, 1996.
[11] T. Stouraitis and V. Paliouras, “Considering the alternatives in low-
power design,” IEEE Circuits Devices Mag., vol. 17, no. 4, pp. 22–9,
Jul. 2001.
[12] M. N. Mahesh and M. Mehendale, “Low power realization of residue
number system based FIR filters,” in Proc. 13th Int. Conf. VLSI Design,
Calcutta, India, Jan. 2000, pp. 30–33.
[13] A. D’Amora et al., “Reducing power dissipation in complex digital
filters by using the quadratic residue number system,” in Conf. Rec.
34th Asil. Conf. Signals, Syst. Comput. (ACSSC 2000), Pacific Grove,
CA, USA, Oct.-Nov. 2000, vol. 2, pp. 879–83.
[14] W. A. Chren, Jr, “One-hot residue coding for low delay-power product
CMOS design,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal
Process., vol. 45, no. 3, pp. 303–313, Mar. 1998.
[15] M. N. Mahesh and M. Mehendale, “Improving performance of high
precision signal processing algorithms on programmable DSPs,” in
Proc. IEEE Int. Symp. Circuits Syst. (ISCAS 1999), Orlando, FL,
May-Jun. 1999, vol. 3, pp. 488–491.
[16] G. Dimitrakopoulos et al., “A family of parallel-prefix modulo 2^n − 1
adders,” in Proc. IEEE Int. Conf. Application-Specific Syst., Arch.,
Processors (ASSAP 2003), The Hague, Netherlands, Jun. 2003, pp.
315–325.
[17] R. A. Patel et al., “Power-delay-area efficient modulo 2^n + 1 adder
architecture for RNS,” Electron. Lett., vol. 41, no. 5, pp. 231–2, Mar.
2005.
[18] F. Taylor, “A single modulus complex ALU for signal processing,” IEEE
Trans. Acoust., Speech Signal Process., vol. 33, no. 5, pp. 1302–15,
Oct. 1985.
[19] M. Dugdale, “VLSI implementation of residue adders based on binary
adders,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process.,
vol. 39, no. 5, pp. 325–9, May 1992.
[20] S. J. Piestrak, “Design of high-speed residue-to-binary number system
converter based on Chinese remainder theorem,” in Proc. IEEE Int.
Conf. Comput. Design (ICCD 1994), Cambridge, MA, Oct. 1994, pp.
508–511.
[21] A. A. Hiasat, “High-speed and reduced-area modular adder structures
for RNS,” IEEE Trans. Comput., vol. 51, no. 1, pp. 84–9, Jan. 2002.
[22] B. Parhami, Computer Arithmetic: Algorithms and Hardware De-
signs. Oxford, U.K.: Oxford Univ. Press, 2000.
[23] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” J. Assoc.
Computing Machinery, vol. 27, no. 4, pp. 831–8, Oct. 1980.
[24] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient
solution of a general class of recurrence equations,” IEEE Trans.
Comput., vol. C-22, no. 8, pp. 786–92, Aug. 1973.
[25] S. Knowles, “A family of adders,” in Proc. 15th IEEE Symp. Comput.
Arith. (ARITH-15), Vail, CO, Jun. 2001, pp. 277–81.
[26] T. P. Kelliher et al., “ELM-a fast addition algorithm discovered by a
program,” IEEE Trans. Comput., vol. 41, no. 9, pp. 1181–1184, Sep.
1992.
[27] C. Nagendra et al., “Area-time-power tradeoffs in parallel adders,”
IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43,
no. 10, pp. 689–702, Oct. 1996.
[28] N. Takagi and S. Yajima, “Modular multiplication hardware algorithms
with a redundant representation and their application to RSA
cryptosystem,” IEEE Trans. Comput., vol. 41, no. 7, pp. 887–91, Jul.
1992.
[29] R. Zimmermann, “Efficient VLSI implementation of modulo (2^n ± 1)
addition and multiplication,” in Proc. IEEE 14th Symp. Comput. Arith
(ARITH-14), Adelaide, SA, Australia, Apr. 1999, pp. 158–167.
[30] A. Tyagi, “A reduced-area scheme for carry-select adders,” IEEE
Trans. Comput., vol. 42, no. 10, pp. 1163–1170, Oct. 1993.
[31] R. Burch et al., “A Monte Carlo approach for power estimation,” IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., vol. 1, no. 1, pp. 63–71,
Mar. 1993.
[32] Virtual Silicon 0.13-μm high density standard-cell library, Virtual
Silicon Technology Inc., Sunnyvale, CA, 2000.
[33] M. Pedram, “Power minimization in IC design: Principles and
applications,” ACM Trans. Design Autom. Electron. Syst., vol. 1, no. 1, pp.
3–56, Jan. 1996.
Riyaz A. Patel received the First Class in the B.Eng.
(Hons.) degree in electronics and communications
engineering and the M.Sc. degree in radio commu-
nications from the University of Leeds, Leeds, U.K.,
in 2000 and 2001, respectively, and the Ph.D. degree
in electronics from the University of Sheffield,
Sheffield, U.K., in 2006.
He is currently a Software and Hardware Consul-
tant for Detica Ltd., Surrey, U.K.
Mohammed Benaissa (S’86–M’90–SM’06) re-
ceived the Dip. Ing. degree in Electronic Engineering
from Ecole Polytechnique d’Alger, and the Ph.D.
degree in VLSI processing from University of
Newcastle, Newcastle upon Tyne, U.K., in 1985 and
1990, respectively.
He is currently a Senior Lecturer with the Univer-
sity of Sheffield, Sheffield, U.K. He has been actively
working in the area of VLSI signal processing, error-
control coding and cryptography for the past 18 years
and has published over 80 papers in recognized jour-
nals and conferences. He has particular interest in number theory and finite
number systems and their applications.
Neil Powell received the B.Sc degree in electrical and electronic engineering
from John Moores University, Liverpool, U.K., in 1985.
From 1985 to 1989, he was an Electronic Engineer with a VLSI Design
Group, British Aerospace PLC, U.K. He is currently a Senior Experimental Of-
ficer with the University of Sheffield, Sheffield, U.K. His research interests in-
clude VLSI design, low-power design, and residue number systems.
Said Boussakta (S’86–M’90–SM’04) received the Ph.D. degree in signal pro-
cessing from the University of Newcastle, Newcastle upon Tyne, U.K., in 2006.
He joined the University of Newcastle, Newcastle, U.K., in October 2006 as
a Professor of Communication Networks. Prior to that, he was with the Univer-
sity of Leeds, Leeds, U.K. His research interests are in signal/image processing,
cryptography and digital communications. He has published over 100 technical
papers.
