J Sign Process Syst
DOI 10.1007/s11265-017-1238-6
Design of RNS Reverse Converters with Constant Shifting
to Residue Datapath Channels
Piotr Patronik1 · Stanisław J. Piestrak2
Received: 8 December 2015 / Revised: 24 November 2016 / Accepted: 1 March 2017
© The Author(s) 2017. This article is published with open access at Springerlink.com
Abstract This paper presents a new general approach to simplifying residue-to-binary (reverse) converters for a Residue Number System (RNS) composed of an arbitrary set of moduli. It is suggested to formulate the basic equation of the reverse converter in a form consisting of two separate parts: one depending on the input variables of the converter, the other being a single constant. Then the constant, instead of being added inside the reverse converter, can be shifted out to the residue datapath channels, in most cases at no hardware cost or extra delay. Thus, the hardware cost of the converter is reduced, because its multi-operand adder has one operand less to handle. To illustrate various design issues of this approach and to demonstrate its efficiency, a new design method of reverse converters for the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$ is considered. Two versions of the new converters for this moduli set, as well as several of their known counterparts, were synthesized for all dynamic ranges from 8 to 38 bits (i.e., for $3 \le n \le 13$). The results obtained suggest that, compared to the best of the state-of-the-art converters, at least one of the two versions of the new converter offers reduced area and power consumption for all dynamic ranges considered, in some cases accompanied by a slight delay reduction. The area is reduced by about 5 % to about 20 %, and the largest savings are observed for the power consumption: from over 10 % up to 27 %.

1 Department of Computer Engineering (W-4/K-9), Wrocław University of Technology, 50–370 Wrocław, Poland
2 Res. Team MAE, Institut Jean Lamour (UMR 7198 CNRS), Université de Lorraine, 54506 Vandœuvre-Les-Nancy, France
Keywords Residue number system (RNS) · Reverse
converter · Residue-to-binary converter · Computer
arithmetic · Digital signal processing (DSP)
1 Introduction
The Residue Number System (RNS) offers several well-documented advantages over the conventional 2's complement binary number system [16]. One of them is that the basic arithmetic operations such as addition, subtraction, and multiplication can be carried out simultaneously in a number of parallel independent datapaths on relatively short numbers. It is therefore particularly well suited for hardware implementation of a typical computational problem in many Digital Signal Processing (DSP) systems, such as the calculation of the inner product

$$S_N = \sum_{j=1}^{N} C_j \cdot X_j, \tag{1}$$

where $S_N$ is the numerical value of the function computed, $X_j$ is the $j$-th of the series of $N$ input operands, and $C_j$ is the $j$-th of the series of $N$ a priori known coefficients (which could be loaded at system initialization). The RNS representation allows inner product computations like those given in Eq. 1 to be executed using virtually carry-free arithmetic, allowing for area, time, and power consumption savings compared to its 2's complement positional counterpart.
Most digital systems operate on data using a positional
representation of numbers, hence using a non-positional
RNS representation of numbers in some computational
blocks requires conversions of the numbers back and forth
to RNS, performed respectively by reverse and forward con-
verters. Because, unlike residue datapaths executing useful
computations, both converters are a pure overhead, it is
desirable to maximally reduce their area, delay, and power
consumption. The forward conversion is conceptually rela-
tively simple, because it is an extraction of residues from a
positional number, which can be implemented in hardware
using residue generators [21, 23]. On the other hand, the
reverse conversion requires application of special methods,
amongst which the most commonly used are the Chinese
Remainder Theorem (CRT) and the Mixed-Radix Conver-
sion (MRC) [16], whose efficient hardware implementations
are significantly more difficult to design than forward con-
version.
The main goal of this paper is to propose a new approach which could be taken into account to improve all characteristics of the reverse converters for an arbitrary set of moduli. It relies on the hypothesis that a reverse converter equation is given in a form in which the variable terms and a constant are separated. Although the mathematical expression of such an equation can be easily obtained, the main difficulty lies in formulating it so that it can be efficiently implemented in hardware. Once such an equation has been found, the addition of the constant can be shifted from the reverse converter to the residue datapath channels, thus reducing by one the number of operands handled by the reverse converter. We will show that the latter can result in some area, delay, and power consumption savings, which can be achieved virtually at no cost. The above idea was first suggested by us in [19], but only for the particular case of reverse converters for two RNS moduli sets $\{2^k, 2^n-1, 2^n+1, 2^{n-1}-1\}$ and $\{2^k, 2^n-1, 2^n+1, 2^{n+1}-1\}$ ($n$ even). Also, [19] presented neither a discussion of the possible performance degradation of residue datapaths nor the feasibility of applying this approach to other RNS moduli sets.
The choice of the moduli set of an RNS significantly affects the performance of residue datapaths. Particularly efficient implementations of all basic modulo arithmetic operations and binary-to-residue (forward) converters are available for the powers-of-two related moduli of the forms $2^n$ and $2^n \pm 1$, which are hence considered low-cost. Consequently, in search of efficient RNS-based hardware implementations, several special moduli sets composed exclusively of low-cost moduli have been proposed. The most intensively investigated has been the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$ introduced in 1978 by Jenkins [14], which offers the $(3n-1)$-bit dynamic range with 3-bit resolution: e.g., for $n = 6$, 7, and 8, the dynamic ranges available are respectively 17, 20, and 23 bits. Over the years, several reverse converters with steadily improved parameters have been proposed for this RNS [2–5, 9, 11–13, 22, 31, 32, 34, 35]. Amongst them, the most hardware-efficient and high-speed converters can be designed using the methods from [12, 34, 35]. Note also that the reverse converter for this moduli set can be obtained using recently proposed design methods of the converter for the general 3-moduli set $\{2^n-1, 2^k, 2^n+1\}$ with flexible even modulus $2^k$ [8, 36], applied to the special case of $k = n$. Some specific applications of the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$ include Finite Impulse Response (FIR) filters [10, 15, 24, 28]. However, the introduction in 2007 of the flexible 3-moduli set $\{2^n-1, 2^k, 2^n+1\}$, accompanied by an efficient reverse converter for $k \le 2n$ [8], has superseded the 3-moduli set with only a single parameter $n$. This is because, for the dynamic range of $3n-1$ bits offered by the former, the equivalent 3-moduli set $\{2^{n-1}-1, 2^{n+1}, 2^{n-1}+1\}$ results in faster and more area-efficient residue datapaths. Although no specific designs have been explicitly considered in the literature, the overall complexity figures of the latter can be easily obtained from the complexity characteristics of the MACs mod $2^n \pm 1$ and $2^k$ for all the moduli sets concerned, provided in [10, 15, 25].
As a vehicle to present various facets of constant shifting to the residue datapath channels, to illustrate its feasibility, and to show its efficiency, we have chosen the design of a new reverse converter for the 3-moduli RNS $\{2^n-1, 2^n, 2^n+1\}$. (It is worth noting that although several reverse converter functions for this moduli set have been proposed to date, none of those referred to earlier is readily amenable to such a transformation.) Obviously, in the context of the state of the art presented above, building another reverse converter for the 3-moduli set $\{2^n, 2^n-1, 2^n+1\}$, even with slightly improved performance, could seem hardly justifiable. Nevertheless, of significantly more interest is the possibility of using it not as a stand-alone circuit but rather as the main building block in various reverse converters for larger multi-moduli RNSs formed by extending the basic 3-moduli set. The design of the latter architectures is based on the premises of the MRC algorithm. The 3-moduli set $\{2^n, 2^n-1, 2^n+1\}$ is one of the most commonly used for such an approach. Examples of the latter are reverse converters proposed recently for the special balanced 4-moduli sets $\{2^n, 2^n-1, 2^n+1, 2^{n+1}-1\}$ ($n$ even) [1, 7], $\{2^n, 2^n-1, 2^n+1, 2^{n-1}-1\}$ ($n$ even) [7], $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ ($n$ odd) [1, 30], and $\{2^n+1, 2^n-1, 2^n, 2^{n-1}+1\}$ ($n$ odd) [20], the 5-moduli sets $\{2^{n+1}, 2^n-1, 2^n+1, 2^n+2^{(n+1)/2}+1, 2^n-2^{(n+1)/2}+1\}$ ($n$ odd) [26], $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1, 2^{n-1}-1\}$ ($n$ even) [6], $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1, 2^{n-1}+1\}$ ($n$ odd) [18], as well as many more, e.g., those mentioned in [27]. Hence, the reverse converter for the above 3-moduli set is one of the most crucial parts, whose characteristics significantly affect the performance of the whole family of converters.
This paper is organized as follows. In Section 2, the theoretical background on RNS and the basic properties of arithmetic mod $2^n-1$ are presented. In Section 3, the general problem of designing a reverse converter for an arbitrary RNS moduli set, whose basic equation allows for shifting a constant to residue datapath channels, as well as some design suggestions are presented. In Section 4, the concept of shifting a constant to residue datapath channels is illustrated on the example of a new design method of the reverse converters for the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$, whose two versions are presented. In Sections 5 and 6, the complexity evaluation of the reverse converters for the 3-moduli set, those proposed here and their most efficient existing counterparts, is presented: it includes both the gate-level estimations of complexity figures and more accurate evaluations of the delay, area, and power efficiency of all circuits synthesized in 65 nm technology. Finally,




Let $X$ be an integer and $m$ be a positive integer called a modulus. We define the residue of $X$ modulo (mod) $m$ as the remainder of the integer division of $X$ by $m$, denoted $x = X \bmod m$ or $x = |X|_m$, where usually $0 \le |X|_m \le m-1$ (one notable exception is the widely accepted double representation of zero mod $2^n-1$, which has two equivalent representations of $n$ 0's $(0 \ldots 00)$ and $n$ 1's $(1 \ldots 11)$, because in this case $0 \le |X|_{2^n-1} \le 2^n-1$). An RNS is defined by a set of $r$ pairwise co-prime moduli $\{m_1, \ldots, m_r\}$. The dynamic range $M$ of an RNS is the product of its moduli, i.e., $M = \prod_{i=1}^{r} m_i$; hence, a positional binary representation of any RNS number occupies $a = \lceil \log_2 M \rceil$ bits. Any number $0 \le X < M$ in an RNS can be represented by an ordered set of residues $X = \{x_1, \ldots, x_r\}$, where $x_i = |X|_{m_i}$, $1 \le i \le r$, is represented on $a_i = \lceil \log_2 m_i \rceil$ bits.
Let $X$, $Y$, and $Z$ be integers, $0 \le X, Y, Z < M$, represented by the sets of residues as above. Then any arithmetic operation $\circ \in \{+, -, \times\}$ on these integers yielding $Z = |X \circ Y|_M$ is equivalent to the same arithmetic operation executed on their independent residues, i.e., $z_i = |x_i \circ y_i|_{m_i}$. These operations, executed on relatively short operands, are the main advantage of the RNS representation over its 2's complement positional integer counterpart.
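The channel-wise property stated above can be checked numerically. The following Python sketch is only an illustration: the helper names `to_rns` and `rns_op`, the choice $n = 4$, and the sample operands are assumptions made for this example and do not come from the paper.

```python
from math import prod

def to_rns(x, moduli):
    """Forward conversion: extract one residue per modulus."""
    return [x % m for m in moduli]

def rns_op(xs, ys, moduli, op):
    """Apply op independently in every residue channel."""
    return [op(x, y) % m for x, y, m in zip(xs, ys, moduli)]

n = 4
moduli = [2**n - 1, 2**n, 2**n + 1]    # pairwise co-prime moduli
M = prod(moduli)                        # dynamic range
X, Y = 1234, 567                        # arbitrary operands below M

for op in (lambda a, b: a + b, lambda a, b: a - b, lambda a, b: a * b):
    z = rns_op(to_rns(X, moduli), to_rns(Y, moduli), moduli, op)
    # The residues of |X op Y|_M equal the channel-wise results.
    assert z == to_rns(op(X, Y) % M, moduli)
```

No carry or borrow crosses from one channel to another, which is why each channel can be as narrow as its own modulus.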
A multiplicative inverse of $g$ mod $m$ ($0 < g < m$) is an integer $h$ ($0 < h < m$) such that $|hg|_m = 1$. It exists provided that $g$ and $m$ are co-prime. It can be written using the fractional notation as $h = |1/g|_m$. Here, the two following multiplicative inverses will be useful: (i) $|1/2^n|_{2^{2n}-1} = 2^n$, and (ii) $|1/(2^n+1)|_{2^n-1} = |1/(2^n-1)|_{2^n+1} = 2^{n-1}$.
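Both identities can be confirmed with Python's built-in three-argument `pow`, which accepts a negative exponent for modular inverses (Python 3.8 or later). The function name `inverses_hold` is a hypothetical helper introduced only for this check.

```python
def inverses_hold(n):
    """Check identities (i) and (ii) for a given n >= 2."""
    m = 2**n
    return (pow(m, -1, m * m - 1) == m          # (i)  |1/2^n| mod 2^(2n)-1 = 2^n
            and pow(m + 1, -1, m - 1) == m // 2  # (ii) |1/(2^n+1)| mod 2^n-1 = 2^(n-1)
            and pow(m - 1, -1, m + 1) == m // 2) # (ii) |1/(2^n-1)| mod 2^n+1 = 2^(n-1)

assert all(inverses_hold(n) for n in range(2, 33))
```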
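Reverse converters, which are used to convert from the RNS to the positional representation of numbers, can be designed using the following general methods [16, 33].
– The Chinese Remainder Theorem (CRT):
$$X = \left| \sum_{i=1}^{r} \hat{m}_i \left| \frac{x_i}{\hat{m}_i} \right|_{m_i} \right|_M, \tag{2}$$
where $\hat{m}_i = \prod_{j=1, j \ne i}^{r} m_j$, i.e., $\hat{m}_i$ is the product of all moduli but $m_i$.
– The New CRT:
$$X = \left| x_1 + k_1 (x_2 - x_1) m_1 + k_2 (x_3 - x_2) m_1 m_2 + \ldots + k_{r-1} (x_r - x_{r-1}) m_1 \cdots m_{r-1} \right|_M, \tag{3}$$
where $|k_i m_1 \cdots m_i|_{m_{i+1} \cdots m_r} = 1$ with $1 \le i \le r-1$.
– The Mixed-Radix Conversion (MRC), which is an iterative method given by
$$X = v_1 + v_2 m_1 + v_3 m_1 m_2 + \ldots + v_r m_1 \cdots m_{r-1}, \tag{4}$$
where $v_1 = x_1$, $v_2 = \left| (x_2 - v_1) \left| 1/m_1 \right|_{m_2} \right|_{m_2}$, and
$$v_i = \left| \left( \ldots \left( (x_i - v_1) \left| 1/m_1 \right|_{m_i} - v_2 \right) \left| 1/m_2 \right|_{m_i} - \ldots - v_{i-1} \right) \left| 1/m_{i-1} \right|_{m_i} \right|_{m_i}, \quad 3 \le i \le r.$$
For the special case of the 2-moduli set $\{m_1, m_2\}$ and two respective residues $\{x_1, x_2\}$, which will be needed in Section 4, the simplified version of Eq. 4 is
$$X = x_1 + m_1 \left| (x_2 - x_1) \left| 1/m_1 \right|_{m_2} \right|_{m_2}. \tag{5}$$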
2.2 Properties of Arithmetic Modulo $2^n - 1$
For the reader's convenience, all properties needed in the case study given in Section 4 are presented here. Let $n$, $s$, and $d$ be arbitrary positive integers, and let $z$, $x$, $y$ be non-negative integers such that $0 \le z, x \le 2^n-1$ and $0 \le y \le 2^{sn}-1$. The binary representations of $z$, $x$, and $y$ are respectively $(z_{n-1} \ldots z_0)$, $(x_{n-1} \ldots x_0)$, and $(y_{sn-1} \ldots y_0)$. As for $y$, if it has more than $(s-1)n$ bits but fewer than $sn$ bits, it is padded with leading zeros on the most significant bit positions. The symbol $\|$ denotes the concatenation of binary vectors. Then some basic arithmetic operations mod $2^n-1$ are performed as follows.
– The sign change of $z$ mod $2^n-1$ (the additive inverse of $z$) is obtained by bit-by-bit complementing of all bits, i.e.,
$$|-z|_{2^n-1} = |\bar{z}|_{2^n-1} = |(\bar{z}_{n-1} \ldots \bar{z}_0)|_{2^n-1}.$$
– The multiplication mod $2^n-1$ of $z$ by $2^d$ is obtained by the left cyclic shift of $z$ by $d$ positions, i.e., $|2^d z|_{2^n-1} = |(z_{n-d-1} \ldots z_0 z_{n-1} \ldots z_{n-d})|_{2^n-1}$, and is easily implemented at no hardware cost.
– The multiplication mod $2^n-1$ of $z$ by $2^{-d} = 1/2^d$ (i.e., the multiplication mod $2^n-1$ by the multiplicative inverse of $2^d$) is obtained by the right cyclic shift of $z$ by $d$ positions (equivalent to the left cyclic shift of $z$ by $n-d$ positions, because $|2^{-d} z|_{2^n-1} = |2^{n-d} z|_{2^n-1}$), i.e.,
$$|2^{-d} z|_{2^n-1} = |z/2^d|_{2^n-1} = |(z_{d-1} \ldots z_0 z_{n-1} \ldots z_d)|_{2^n-1}.$$
– For any non-negative integer $i$, $|2^{in}|_{2^n-1} = 1$. Therefore,
$$|y|_{2^n-1} = \left| \sum_{i=0}^{s-1} 2^{in} (y_{(i+1)n-1} \ldots y_{in}) \right|_{2^n-1} = \left| \sum_{i=0}^{s-1} (y_{(i+1)n-1} \ldots y_{in}) \right|_{2^n-1},$$
i.e., to obtain the residue $y$ mod $2^n-1$, it simply suffices to partition $y$ into $s$ $n$-bit parts and sum up all of them mod $2^n-1$.
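The four properties can be exercised exhaustively for a small word length. The sketch below is a software illustration only; `rotl` and `fold` are hypothetical helper names, and $n = 5$ is an arbitrary choice. Results are compared after a final reduction with `%` because the all-ones word is a second representation of zero.

```python
def rotl(z, d, n):
    """Left cyclic shift of the n-bit word z by d positions."""
    d %= n
    mask = (1 << n) - 1
    return ((z << d) | (z >> (n - d))) & mask

def fold(y, n):
    """Reduce y mod 2**n - 1 by repeatedly summing its n-bit parts."""
    mask = (1 << n) - 1
    while y > mask:
        y = (y & mask) + (y >> n)
    return y  # may be the all-ones word, i.e. the second form of zero

n = 5
m = 2**n - 1
for z in range(m + 1):                      # includes the all-ones word
    assert (z ^ m) % m == (-z) % m          # complement = additive inverse
    for d in range(n):
        assert rotl(z, d, n) % m == (z << d) % m                  # times 2^d
        assert rotl(z, n - d, n) % m == (z * pow(2, -d, m)) % m   # times 2^-d
for y in range(1 << 12):
    assert fold(y, n) % m == y % m          # sum of n-bit parts
```

In hardware the cyclic shifts are pure rewiring, which is the reason the text describes them as free of cost.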
3 General Scheme of Constant Shifting
for an Arbitrary RNS Moduli Set
In this section, we will present the general approach which
can be used to simplify reverse converters for an arbitrary
RNS moduli set, which relies on shifting constants from
a reverse converter to residue datapath channels. Specifi-
cally, any RNS reverse converter can be seen as a particular
modular datapath composed of a number of constant mul-
tiplications, bit-level manipulations, and finally additions.
Because these arithmetic and logic operations may involve
some variable or fixed operands, whose number and length
determine the amount of required hardware and perfor-
mance, any cost reduction boils down to the reduction of the
number of executed operations and specifically a number of
addition operands. Our goal is to consider the possibility of
transforming the functions of the reverse converter to such
a form that all constants are accumulated to a single one,
which then could be shifted out to the residue datapath, thus
reducing the number of operands handled inside the reverse
converter. As each reverse converter for a given moduli set is a kind of custom-designed datapath, isolating the constants can be done only on a case-by-case basis, depending on the moduli set and the architecture of the converter. The inspiration for presenting such a general framework for an arbitrary RNS reverse converter came from our earlier results obtained for the reverse converters for two RNS moduli sets $\{2^k, 2^n-1, 2^n+1, 2^{n-1}-1\}$ and $\{2^k, 2^n-1, 2^n+1, 2^{n+1}-1\}$ ($n$ even) [19].
The inspection of the basic formulas for the reverse conversion given by Eqs. 2–4 reveals that the positional representation of the output variable $X$ can only be the sum mod $M$ (for the CRT and the New CRT) or the simple sum (for MRC) of products of a single residue $x_i$ by some constant $b_i$ ($1 \le i \le r$), i.e., there never occurs any term involving the product of two residues like $x_i x_j$. In particular, any CRT-based reverse converter function can be written as
$$X = \left| \sum_{i=1}^{r} b_i x_i + C_k \right|_M, \tag{6}$$
where $b_i$ are some constant coefficients of the residues $x_i$ and $C_k$ is some total constant.
It is important to recall that, because it is implicitly assumed that the RNS considered is non-redundant, each of the $r$ moduli contributes to the dynamic range $M = \prod_{i=1}^{r} m_i$. Therefore, $X$ must depend on all coefficients $b_i$, which implies that all $b_i \ne 0$ and the final addition has at least $r$ operands. The actual number of operands of the latter, $p \ge r$, depends on the possibility of obtaining simple binary expressions for $|b_i x_i|_M$ or $b_i x_i$ (for MRC). Although either of the latter products can be implemented using ROM look-up tables, on the one hand this could be prohibitively complex for larger $a_i$, and on the other hand, for many moduli sets with arbitrary $a_i$ the most efficient implementations have been obtained using arithmetic circuits (adders and subtractors) and some bit manipulations.
Obviously, a simple arithmetic expression like Eq. 6 does not necessarily imply that a simple hardware implementation with the smallest possible number of operands can be obtained, even through skillful bit manipulations. Nevertheless, the motivation for reducing the number of operands stems from the fact that it can contribute not only to reducing area and power consumption (which seems quite obvious: eliminating one operand removes one CSA of up to $a$ bits, i.e., up to $a$ FAs or HAs, depending on whether the operand is a variable or a constant), but in some cases also to reducing by one the number of CSA stages in the CSA tree that processes multiple operands to obtain the final result $X$.
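Now consider the RNS-based implementation of Eq. 1, which consists of the set of $r$ equations
$$|S_N|_{m_i} = \left| \sum_{j=1}^{N} |C_j|_{m_i} \cdot |X_j|_{m_i} \right|_{m_i}, \quad 1 \le i \le r. \tag{7}$$
We assume that Eq. 7 is implemented using $N$ Multiply-Accumulate units (MACs), in which the $j$-th stage of the datapath ($1 \le j \le N$) is composed of $r$ MACs mod $m_i$, $1 \le i \le r$, each of which realizes
$$|S_j|_{m_i} = \left| |S_{j-1}|_{m_i} + |C_j|_{m_i} \cdot |X_j|_{m_i} \right|_{m_i}, \tag{8}$$
where the initial values are generally assumed to be $|S_0|_{m_i} = 0$.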
If the function of the reverse converter can be expressed
as in Eq. 6, then, the residue datapath channels followed
by the reverse converter can be implemented as shown in
the upper part of Fig. 1. Now observe that the addition
of the constant |Ck|M in the reverse converter is equiv-
alent to the set of additions of r constants |ci |mi in the
residue datapaths, 1 ≤ i ≤ r , where |ci |mi = |Ck|mi .
Therefore, the addition of the constant |Ck|M can be shifted
out from the (p + 1)-operand CSA tree mod M of the
reverse converter, because it can be taken into account
earlier at the stage of RNS datapath computations mod-
ulo all moduli mi , resulting in hardware implementation
shown in the lower part of Fig. 1. The case study pre-
sented in the following sections will show that in most
cases such a modification can be done at no hardware
cost.
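The equivalence described above can be illustrated with a small software model. In the sketch below, `simplified_converter` is a hypothetical stand-in for a converter whose adder tree no longer adds $C_k$, so on unmodified residues it would return the correct value minus $C_k$ (mod $M$). The moduli set and the value of $C_k$ are arbitrary assumptions chosen for the example; only the relation $c_i = |C_k|_{m_i}$ is taken from the text.

```python
from math import prod

def crt(residues, moduli):
    """Textbook CRT reconstruction, as in Eq. 2."""
    M = prod(moduli)
    total = 0
    for x, m in zip(residues, moduli):
        m_hat = M // m
        total += m_hat * ((pow(m_hat, -1, m) * x) % m)
    return total % M

def simplified_converter(residues, moduli, Ck):
    """Model of a converter with the addition of Ck removed."""
    return (crt(residues, moduli) - Ck) % prod(moduli)

moduli = [7, 9, 11, 13]                 # arbitrary pairwise co-prime set
M = prod(moduli)
Ck = 1234                               # arbitrary total constant
c = [Ck % m for m in moduli]            # channel constants c_i = |Ck|_{m_i}

for S in range(0, M, 97):
    # Each channel adds its own small constant to its residue ...
    shifted = [(S % m + ci) % m for m, ci in zip(moduli, c)]
    # ... and the simplified converter then returns the intended value.
    assert simplified_converter(shifted, moduli, Ck) == S
```

The channel constants are at most as wide as their moduli, whereas $C_k$ spans the full dynamic range, which is why the addition is cheaper when done in the channels.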
Figure 1 General scheme of shifting constants from the reverse converter for an arbitrary RNS moduli set to residue datapath channels, according to Eq. 6.

To facilitate finding a reverse converter function amenable to shifting constants out to the residue datapaths, we suggest proceeding as follows.
– Identify all modular datapaths inside the reverse converter.
The reverse converter is a network of interconnected modular datapaths, whose architecture depends both on the moduli set and on the general approach taken by the designer of the converter, resulting in a composition of CRT, New CRT, and MRC techniques. As such, each part of a converter is a chain of additions and multiplications of the residues $x_i$ by a constant. Hence, the first step is to identify such parts in the converter.
– Determine the value of constants added and multiplied [...] value of the constant for each of the residue datapaths $m_i$, $1 \le i \le r$.
The most common case of a constant addition is
when numbers are subtracted, following the identity
−xi = x¯i − 1. Alternatively, the addition mod mi of a
negative number is expressed as −xi = mi − xi . Note
also that each residue datapath channel could have its
own cumulative constant.
– Calculate the constants for each channel.
Each datapath has its own modulus mi , which can be
either the individual modulus from the moduli set (e.g.
as in Eq. 6) or the product of the moduli—if the hier-
archical structure of the converter is used (one example
can be found in [19]). In the latter case, the value of
the constant for each respective channel is the residue
of the constant for the datapath; otherwise, the constant
is added only in one channel. Because various datap-
aths inside a converter may result in various constants,
in such a case, the constants are accumulated in chan-
nels modulo their respective moduli and added only
once.
– Reformulate and simplify the functions of the con-
verter with shifted out constants, including reorgani-
zation of modular CSAs and new alternative bit-level
manipulations.
With the constants eliminated from a given datap-
ath of the reverse converter, ones can be replaced with
zeros on all relevant bits. The latter is done by replacing
FAs in one stage of the CSA tree with HAs or even by
completely eliminating one stage of them. Moreover, if
some bit combinations in datapaths do not occur, FAs
can be replaced with simpler AND/OR logic operators.
Figure 2 Block diagram of the new converter for the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$: a general scheme; b bit-level manipulations for variable $w_2$; c, d bit-level manipulations for variables $w_1$ and $w_3$
In the next section, we will show how to apply these general steps to the special case of the new reverse converter for the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$.
4 Design of Reverse Converters for the 3-Moduli Set $\{2^n-1, 2^n, 2^n+1\}$
In this section, we will first detail a new method for design-
ing reverse converters that implements a newly introduced
set of equations in which variable and constant terms could
be separated. Then, we will explain a new approach to
improving converter performance that relies on shifting con-
stants added by the converter out to the datapath channels
and will argue that the latter operation in most cases can be
done at no cost. A general logic scheme of the new converter
is presented in Fig. 2.
4.1 New Basic Functions
The basic functions of the reverse converter for the set of three moduli $\{m_1, m_2, m_3\} = \{2^n, 2^n-1, 2^n+1\}$ have been given in many papers, notably in [12, 34, 35], which proposed the designs currently considered the most efficient. The latter three methods start with the CRT (or the so-called New CRT in the case of [34]) and then reduce the conversion problem to a Multi-Operand Mod $2^{2n}-1$ Addition (MOMA mod $2^{2n}-1$). Because the final addition mod $2^{2n}-1$ seems inevitable in this RNS, all efforts to increase the efficiency of the converter have concentrated on reducing the number of MOMA operands. There are three operands in the case of [12, 35] and four in the case of [8, 34, 36] (although in [34] the fourth operand is reduced to one bit). The varying composition of MOMA operands is mainly achieved by various bit-level manipulations, with the exception of the converter of [36]. In all previous designs, the MOMA operands are composed of variable parts, depending on the three input residues $x_1$, $x_2$, and $x_3$, mixed with some constants. We have analyzed the equations of all of the above converters and have not found any obvious method to separate the two. Here, our idea is to show that, if a constant can be isolated from the variable parts, it becomes unnecessary to add that constant within the converter. Therefore, we propose first to accumulate the constants into a single total constant and then to move its addition out of the converter to the residue datapath channels. It will be seen that this approach gives new options for bit-level manipulations, being either a simplified version of the existing design from [35] or a new proposition which has not yet been explored in the open literature. In Section 4.4.2, we will argue that the addition of the constants in the residue datapaths can in most cases be done at no cost and that in a few special cases its cost is negligible in comparison to the benefits resulting from the simplification of the converter. Our new design method requires execution of the following two steps, which are eventually merged in a single arithmetic block.
The first step relies on applying the CRT of Eq. 2 to the 2-moduli set $\{m_2, m_3\} = \{2^n-1, 2^n+1\}$, which provides
$$X_{23} = \left| 2^{n-1} \left( (2^n+1)\, x_2 + (2^n-1)\, x_3 \right) \right|_{2^{2n}-1}. \tag{9}$$
The second step relies on applying the MRC of Eq. 5 (similarly as in [29]) to the 2-moduli set $\{m_1, m_2 m_3\}$, which provides the final result $X = \{x_1, X_{23}\}$ given by
$$X = x_1 + 2^n \left| 2^{-n} (X_{23} - x_1) \right|_{2^{2n}-1} = x_1 + 2^n X_h, \tag{10}$$
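where the term $X_h$ can be computed by replacing $X_{23}$ with its expression from Eq. 9, which gives
$$X_h = \left| 2^{-n} (X_{23} - x_1) \right|_{2^{2n}-1} = \left| -2^{-n} x_1 + 2^{-1} (2^n+1)\, x_2 + 2^{-1} (2^n-1)\, x_3 \right|_{2^{2n}-1}. \tag{11}$$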
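The three terms which appear in the above equation will be transformed through some bit-level manipulations to obtain expressions that are not only better adapted for implementation using logic gates, but also have the advantage that the variable parts (i.e., those depending on the three input residues $\{x_1, x_2, x_3\}$) are separated from the constant components.
Because $x_1$ is an $n$-bit number, $-x_1 = \bar{x}_1 - (2^n - 1)$, and because $|2^{-n}|_{2^{2n}-1} = 2^n$, the first term can be written as
$$\left| -2^{-n} x_1 \right|_{2^{2n}-1} = \left| (2^{n+1} \bar{x}_1)\, 2^{-1} + (2^{-n} - 1) \right|_{2^{2n}-1}, \tag{12}$$
which contains the variable term depending on $x_1$ and the constant term equal to $|2^{-n} - 1|_{2^{2n}-1}$. The former term can be expressed as the binary vector
$$\left| 2^{n+1} \bar{x}_1 \right|_{2^{2n}-1} = (\bar{x}_{1,n-2} \ldots \bar{x}_{1,0} \,\|\, \underbrace{0 \ldots 0}_{n} \,\|\, \bar{x}_{1,n-1}). \tag{13}$$
Note: because the multiplication by $2^{-1}$ will appear in the modified expressions for all three variable terms, we have intentionally left it in place instead of replacing it with the right cyclic shift by one bit.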
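In the second term, because $x_2 (2^n+1) = 2^n x_2 + x_2$ and $x_2$ is an $n$-bit number, the addition of $2^n x_2$ to $x_2$ is a simple concatenation of two copies of $x_2$, which yields
$$\left| x_2 (2^n+1) \right|_{2^{2n}-1} = (x_{2,n-1} \ldots x_{2,0} \,\|\, x_{2,n-1} \ldots x_{2,0}). \tag{14}$$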
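Finally, the third term can be rewritten as
$$\begin{aligned}
\left| x_3 (2^n-1) \right|_{2^{2n}-1} &= \left| 2^n x_3 - x_3 \right|_{2^{2n}-1} \\
&= \Big| (x_{3,n-1} \ldots x_{3,0} \,\|\, \underbrace{0 \ldots 0}_{n-1} \,\|\, x_{3,n}) + (\underbrace{1 \ldots 1}_{n-1} \,\|\, \bar{x}_{3,n} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,0}) \Big|_{2^{2n}-1} \\
&= \Big| (x_{3,n-1} \ldots x_{3,0} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,0}) + (\underbrace{1 \ldots 1}_{n-1} \,\|\, \bar{x}_{3,n} \,\|\, \underbrace{0 \ldots 0}_{n-1} \,\|\, x_{3,n}) \Big|_{2^{2n}-1} \\
&= \Big| (x_{3,n-1} \ldots x_{3,0} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,0}) + (\underbrace{0 \ldots 0}_{n-1} \,\|\, \bar{x}_{3,n} \,\|\, \underbrace{0 \ldots 0}_{n-1} \,\|\, x_{3,n}) + (1 - 2^{n+1}) \Big|_{2^{2n}-1}.
\end{aligned} \tag{15}$$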
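Now, consider the variable part of Eq. 15 involving $x_3$, denoted by $A$:
$$A = (x_{3,n-1} \ldots x_{3,0} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,0}) + (\underbrace{0 \ldots 0}_{n-1} \,\|\, \bar{x}_{3,n} \,\|\, \underbrace{0 \ldots 0}_{n-1} \,\|\, x_{3,n}). \tag{16}$$
First, note that if $x_3 < 2^n$ then $x_{3,n} = 0$, so that
$$A = (x_{3,n-1} \ldots x_{3,0} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,0}) + (\underbrace{0 \ldots 0}_{n-1} \,\|\, 1 \,\|\, \underbrace{0 \ldots 0}_{n}). \tag{17}$$
Otherwise, $x_3 = 2^n$ and $x_{3,n} = 1$, so that
$$A = (\underbrace{0 \ldots 0}_{n} \,\|\, \underbrace{1 \ldots 1}_{n}) + (0 \ldots 0 1) = \Big| (\underbrace{1 \ldots 1}_{2n}) + (\underbrace{0 \ldots 0}_{n-1} \,\|\, 1 \,\|\, \underbrace{0 \ldots 0}_{n}) \Big|_{2^{2n}-1}. \tag{18}$$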
Consequently, we consider two special cases depending on the value of $x_{3,n}$. First, note that if $x_{3,n} = 1$, then all other bits of $x_3$ are equal to 0. We will exploit this fact either to reduce the size of $A$ from two vectors in Eq. 16 to one vector, or to change the distribution of the bits of $x_3$ in such a way that it can be merged with the other operands, e.g., with the vector of Eq. 13. We have found two expressions for the variable $A$, of which the first is inspired by [35], while the second has not yet been explored in the open literature:
1. Addition of the constant $2^n$ and selective setting of the $n$ most-significant bits of the first vector as in Eqs. 17 and 18 yields
$$A = ((x_{3,n-1} \vee x_{3,n}) \ldots (x_{3,0} \vee x_{3,n}) \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,0}) + 2^n. \tag{19}$$
2. Merging of the two cases from Eqs. 17 and 18 by clearing the least-significant bit and adding $x_{3,n}$ on position 1 yields
$$A = (x_{3,n-1} \ldots x_{3,0} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,1} \,\|\, \overline{(x_{3,0} \vee x_{3,n})}) + (\underbrace{0 \ldots 0}_{n-1} \,\|\, \bar{x}_{3,n} \,\|\, \underbrace{0 \ldots 0}_{n-2} \,\|\, x_{3,n} \,\|\, 0). \tag{20}$$
One can easily verify that Eq. 17 (18) can be obtained mod $2^{2n}-1$ by setting $x_{3,n} = 0$ ($x_{3,n} = 1$) in Eqs. 19 and 20. Now, depending on which of Eqs. 19 and 20 is selected, two alternative versions of the converter can be obtained.
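The agreement of Eqs. 19 and 20 with Eq. 16 can be checked exhaustively for small $n$. The Python sketch below treats every bit vector as a plain integer; the function names are hypothetical, and the least-significant bit of the first vector of Eq. 20 is taken as the complement of $(x_{3,0} \vee x_{3,n})$, which is what "clearing the least-significant bit" requires when $x_{3,n} = 1$.

```python
def a_variants(x3, n):
    """Return A computed per Eqs. 16, 19 and 20 (as plain integers)."""
    mask = (1 << n) - 1
    top = x3 >> n                  # bit x_{3,n}
    low = x3 & mask                # bits x_{3,n-1} ... x_{3,0}
    not_low = low ^ mask           # complemented low bits
    a16 = ((low << n) | not_low) + (((1 - top) << n) | top)
    hi19 = low | (mask if top else 0)      # OR each high bit with x_{3,n}
    a19 = ((hi19 << n) | not_low) + (1 << n)
    lsb20 = 1 - ((low & 1) | top)          # complement of (x_{3,0} OR x_{3,n})
    a20 = (((low << n) | (not_low & ~1) | lsb20)
           + (((1 - top) << n) | (top << 1)))
    return a16, a19, a20

def check(n):
    m = (1 << (2 * n)) - 1
    for x3 in range((1 << n) + 1):         # all residues mod 2**n + 1
        a16, a19, a20 = a_variants(x3, n)
        assert a16 % m == a19 % m == a20 % m
        # Eq. 15: A plus the constant (1 - 2**(n+1)) equals x3 * (2**n - 1)
        assert (a16 + 1 - (1 << (n + 1))) % m == (x3 * ((1 << n) - 1)) % m
    return True

assert all(check(n) for n in range(2, 9))
```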
4.2 Version 1
First, we merge the constants $|2^{-n} - 1|_{2^{2n}-1}$, $|2^{-1}(1 - 2^{n+1})|_{2^{2n}-1}$, and $|2^n 2^{-1}|_{2^{2n}-1}$, which appear in Eqs. 12, 15, and 19, respectively, to obtain one total constant $C_k$ given by
$$C_k = \left| (2^{-n} - 1) + 2^{-1}(1 - 2^{n+1}) + 2^n 2^{-1} \right|_{2^{2n}-1} = \left| 2^{n-1} + 2^{-1} - 1 \right|_{2^{2n}-1} = 2^{2n-1} + 2^{n-1} - 1 = (1\, \underbrace{0 \ldots 0}_{n}\, \underbrace{1 \ldots 1}_{n-1}). \tag{21}$$
Second, we define new variables $w_1$, $w_2$, and $w_3$ involving the variable terms from Eqs. 12, 14, and 19, respectively, each involving exactly one of the input residues $x_1$, $x_2$, and $x_3$. For convenience, when rewriting Eqs. 12, 14, and 19 we have performed the multiplication by $|2^{-1}|_{2^{2n}-1}$ (here, it is a right cyclic shift by one bit over $2n$ bits) to obtain the three following raw binary vectors:
$$w_1 = (\bar{x}_{1,n-1} \ldots \bar{x}_{1,0} \,\|\, \underbrace{0 \ldots 0}_{n}), \tag{22}$$
$$w_2 = (x_{2,0} \,\|\, x_{2,n-1} \ldots x_{2,0} \,\|\, x_{2,n-1} \ldots x_{2,1}), \tag{23}$$
$$w_3 = (\bar{x}_{3,0} \,\|\, (x_{3,n-1} \vee x_{3,n}) \ldots (x_{3,0} \vee x_{3,n}) \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,1}). \tag{24}$$
Now Eq. 11 can be rewritten using these new variables as
$$X_h = \left| -2^{-n} x_1 + 2^{-1}(2^n+1)\, x_2 + 2^{-1}(2^n-1)\, x_3 \right|_{2^{2n}-1} = \left| w_1 + w_2 + w_3 + C_k \right|_{2^{2n}-1}. \tag{25}$$
Version 1 of the converter can be obtained by substituting in Eq. 10 $X_h$ with its expression of Eq. 25 and designing all circuitry as shown in Fig. 2a–c. We have intentionally omitted $C_k$ in Fig. 2, as it is expected to be already included in the input residues $x_2$ and $x_3$, as described below in Section 4.4.
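A bit-level software model of Version 1, with $C_k$ still added inside the converter, can be checked against every value in the dynamic range for small $n$. This is a behavioural sketch only, not the authors' circuit: the function names are hypothetical, and the vectors of Eqs. 22–24 are assembled with shifts and ORs on Python integers.

```python
def convert_v1(x1, x2, x3, n):
    """Version 1: X = x1 + 2**n * |w1 + w2 + w3 + Ck| mod (2**(2n) - 1)."""
    mask = (1 << n) - 1
    top, low = x3 >> n, x3 & mask            # x_{3,n} and the low n bits
    not_low = low ^ mask
    hi = low | (mask if top else 0)          # (x_{3,i} OR x_{3,n}) for each i
    w1 = (x1 ^ mask) << n                                           # Eq. 22
    w2 = ((x2 & 1) << (2 * n - 1)) | (x2 << (n - 1)) | (x2 >> 1)     # Eq. 23
    w3 = (((not_low & 1) << (2 * n - 1)) | (hi << (n - 1))
          | (not_low >> 1))                                         # Eq. 24
    ck = (1 << (2 * n - 1)) + (1 << (n - 1)) - 1                    # Eq. 21
    xh = (w1 + w2 + w3 + ck) % ((1 << (2 * n)) - 1)                 # Eq. 25
    return x1 + (xh << n)                                           # Eq. 10

def check_v1(n):
    m1, m2, m3 = 2**n, 2**n - 1, 2**n + 1
    return all(convert_v1(X % m1, X % m2, X % m3, n) == X
               for X in range(m1 * m2 * m3))

assert all(check_v1(n) for n in range(2, 6))
```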
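4.3 Version 2
First, we combine the variable terms from Eqs. 13 and 15 according to Eq. 20 and subtract the constant $(1 - 2^{n+1})$ from Eq. 15 to obtain
$$\begin{aligned}
\left| 2^{n+1} \bar{x}_1 + x_3 (2^n-1) - (1 - 2^{n+1}) \right|_{2^{2n}-1}
&= \Big| (x_{3,n-1} \ldots x_{3,0} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,1} \,\|\, \overline{(x_{3,0} \vee x_{3,n})}) \\
&\quad + (\bar{x}_{1,n-2} \ldots \bar{x}_{1,0} \,\|\, \underbrace{0 \ldots 0}_{n} \,\|\, \bar{x}_{1,n-1}) + (\underbrace{0 \ldots 0}_{n-1} \,\|\, \bar{x}_{3,n} \,\|\, \underbrace{0 \ldots 0}_{n-2} \,\|\, x_{3,n} \,\|\, 0) \Big|_{2^{2n}-1} \\
&= \Big| (x_{3,n-1} \ldots x_{3,0} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,1} \,\|\, \overline{(x_{3,0} \vee x_{3,n})}) \\
&\quad + (\bar{x}_{1,n-2} \ldots \bar{x}_{1,0} \,\|\, \bar{x}_{3,n} \,\|\, \underbrace{0 \ldots 0}_{n-2} \,\|\, x_{3,n} \,\|\, \bar{x}_{1,n-1}) \Big|_{2^{2n}-1}.
\end{aligned} \tag{26}$$
Next, we merge the constants $|2^{-n} - 1|_{2^{2n}-1}$ and $|2^{-1}(1 - 2^{n+1})|_{2^{2n}-1}$ from Eqs. 12 and 15, respectively, into one total constant $C_k$ given by
$$C_k = \left| (2^{-n} - 1) + 2^{-1}(1 - 2^{n+1}) \right|_{2^{2n}-1} = 2^{2n-1} - 1 = (0\, \underbrace{1 \ldots 1}_{2n-1}). \tag{27}$$
As in Version 1, we define two variables $w_1$ and $w_3$ involving the input residues $x_1$ and $x_3$ from Eq. 26 (which are actually slightly different than in Version 1), whereas the variable $w_2$ remains the same as before in Eq. 23. When rewriting Eq. 26 we have performed the multiplication by $|2^{-1}|_{2^{2n}-1}$ (the right cyclic shift by one bit over $2n$ bits) to obtain the two following raw binary vectors:
$$w_1 = (\bar{x}_{1,n-1} \ldots \bar{x}_{1,0} \,\|\, \bar{x}_{3,n} \,\|\, \underbrace{0 \ldots 0}_{n-2} \,\|\, x_{3,n}), \tag{28}$$
$$w_3 = (\overline{(x_{3,0} \vee x_{3,n})} \,\|\, x_{3,n-1} \ldots x_{3,0} \,\|\, \bar{x}_{3,n-1} \ldots \bar{x}_{3,1}). \tag{29}$$
Version 2 of the converter, whose circuitry is shown in Fig. 2a, b, and d, can be obtained by substituting in Eq. 10 $X_h$ with its expression of Eq. 25, wherein the variables $w_1$, $w_3$, and $C_k$ are given respectively by Eqs. 28, 29, and 27.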
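Version 2 can be modelled in the same way. In this sketch (hypothetical function names, not the authors' circuit) the vector $w_1$ carries $\bar{x}_{3,n}$ on bit position $n-1$ and $x_{3,n}$ on position 0, and the most-significant bit of $w_3$ is assumed to be the complement of $(x_{3,0} \vee x_{3,n})$; with these assumptions the model reproduces every value of the dynamic range for the tested $n$.

```python
def convert_v2(x1, x2, x3, n):
    """Version 2: same structure as Version 1 with different w1, w3, Ck."""
    mask = (1 << n) - 1
    top, low = x3 >> n, x3 & mask            # x_{3,n} and the low n bits
    not_low = low ^ mask
    lsb = 1 - ((low & 1) | top)              # complement of (x_{3,0} OR x_{3,n})
    # w1: complemented x1 in the upper half, then ~x_{3,n}, zeros, x_{3,n}
    w1 = ((x1 ^ mask) << n) | ((1 - top) << (n - 1)) | top
    w2 = ((x2 & 1) << (2 * n - 1)) | (x2 << (n - 1)) | (x2 >> 1)     # Eq. 23
    w3 = (lsb << (2 * n - 1)) | (low << (n - 1)) | (not_low >> 1)    # Eq. 29
    ck = (1 << (2 * n - 1)) - 1                                      # Eq. 27
    xh = (w1 + w2 + w3 + ck) % ((1 << (2 * n)) - 1)                  # Eq. 25
    return x1 + (xh << n)                                            # Eq. 10

def check_v2(n):
    m1, m2, m3 = 2**n, 2**n - 1, 2**n + 1
    return all(convert_v2(X % m1, X % m2, X % m3, n) == X
               for X in range(m1 * m2 * m3))

assert all(check_v2(n) for n in range(2, 6))
```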
4.4 Elimination of the Constant Ck
In this section, we will show how the general ideas pre-
sented in Section 3 can be applied in practice.
4.4.1 Theoretical Background
Besides three variable components w1, w2, and w3, Eq. 25
for Xh contains one global constant Ck given by Eq. 21
for Version 1 and by Eq. 27 for Version 2. Notice how-
ever that the addition of Ck in the calculation of Xh has
the same effect as the addition of 2nCk to X23. Even more,
the latter may be also achieved by adding simultaneously
c2 = |2nCk|2n−1 = |Ck|2n−1 to x2 mod 2n − 1 and
c3 = |2nCk|2n+1 = | − Ck|2n+1 to x3 mod 2n + 1 (the
even residue x1 is left unmodified because |2nCk|2n = 0). In
summary, the constant Ck indeed can be added either by the
CSA part of the reverse converter or beforehand in the data-
path channels mod 2n ± 1. The latter possibility is attractive
and feasible, because the addition of these constants may be
performed at no cost, as they may be initially loaded as the
starting values or merged with other constants in the residue
datapath channels.
Henceforth, the addition of $C_k$ in $X_h$ is omitted and Eq. 25 for $X_h$ can be rewritten as
$$X_h = |w_1 + w_2 + w_3|_{2^{2n}-1}, \tag{30}$$
which clearly shows that the MOMA mod $2^{2n}-1$ has only three operands. The closed-form expressions for the constants to be added by the datapath channels are as follows.
For Version 1, from Eq. 21 we obtain
$$c_2 = |2^n(2^{2n-1} + 2^{n-1} - 1)|_{2^n-1} = 0, \tag{31}$$
$$c_3 = |2^n(2^{2n-1} + 2^{n-1} - 1)|_{2^n+1} = |2^n(-2^{n-1} + 2^{n-1} - 1)|_{2^n+1} = 1. \tag{32}$$
It means that the datapath channel mod $2^n - 1$ does not have to be modified at all, whereas the value produced by the datapath channel mod $2^n + 1$ only needs to be incremented by $c_3 = 1$. For Version 2, from Eq. 27 we obtain
$$c_2 = |2^n(2^{2n-1} - 1)|_{2^n-1} = 2^{n-1} - 1 = (0\,\underbrace{1 \ldots 1}_{n-1}), \tag{33}$$
$$c_3 = |2^n(2^{2n-1} - 1)|_{2^n+1} = |2^n(-2^{n-1} - 1)|_{2^n+1} = 2^{n-1} + 1 = (0\,1\,\underbrace{0 \ldots 0}_{n-2}\,1). \tag{34}$$
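The closed forms of Eqs. 31 to 34 can be confirmed by direct evaluation for every n used in this paper. This is a small sketch; the function name `datapath_constants` is hypothetical, and the Version 1 constant is taken in the form that appears inside Eqs. 31 and 32.

```python
def datapath_constants(n, Ck):
    """Return (c2, c3) = (|2^n Ck| mod 2^n - 1, |2^n Ck| mod 2^n + 1)."""
    m2, m3 = 2**n - 1, 2**n + 1
    return (2**n * Ck) % m2, (2**n * Ck) % m3

results = {}
for n in range(3, 14):                            # 3 <= n <= 13
    Ck_v1 = 2**(2 * n - 1) + 2**(n - 1) - 1       # constant of Version 1 (as in Eqs. 31-32)
    Ck_v2 = 2**(2 * n - 1) - 1                    # constant of Version 2 (Eq. 27)
    results[n] = (datapath_constants(n, Ck_v1), datapath_constants(n, Ck_v2))

print(results[7])   # constants for the 20-bit dynamic range
```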
4.4.2 Shifting of Constants to the Datapath Channels
Figure 3 shows the RNS datapath using the 3-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ and the reverse converter which realizes Eq. 10, in which $X_h$ is given by Eq. 30. We assume that the initial version of the RNS datapath (prior to shifting the constants out of the reverse converter) implements the set of r = 3 equations (7), in which the j-th stage of the datapath (1 ≤ j ≤ N) is composed of three MACs mod $m_i$, 1 ≤ i ≤ 3, each of which realizes Eq. 8, where the initial values are generally assumed to be $|S_0|_{m_i} = 0$. All MACs (mod $2^n$, $2^n - 1$, and $2^n + 1$) can be implemented using the design approach presented in [10]. We suggest the following two solutions for adding the constants $c_2$ and $c_3$ in their respective datapath channels mod $2^n - 1$ and $2^n + 1$.
Solution 1 Suppose that each computation is initialized by activating the Clear signals in the registers $|S_0|_{m_i}$ containing the initial values mod $m_i$, $i \in \{1, 2, 3\}$. Then, the effect of adding the respective constants $c_2$ and $c_3$ in the residue datapaths mod $m_2 = 2^n - 1$ and $m_3 = 2^n + 1$ can be obtained without any extra delay and with virtually no hardware cost by activating in these registers the Preset signal instead, which loads the constants $c_2$ and $c_3$ into the channels mod $2^n - 1$ and $2^n + 1$.
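Solution 1 can be illustrated with a behavioural model of one residue channel. The sketch below assumes that a channel simply accumulates products mod m starting from its register value; the function `mac_channel` and the sample data in `taps` are hypothetical and do not model the MAC circuits of [10].

```python
def mac_channel(pairs, m, s0=0):
    """Accumulate sum(a * b) mod m, starting from the register value s0."""
    s = s0 % m
    for a, b in pairs:
        s = (s + a * b) % m
    return s

n = 7
m2, m3 = 2**n - 1, 2**n + 1
c2, c3 = 2**(n - 1) - 1, 2**(n - 1) + 1            # Version 2, Eqs. 33-34

taps = [(17, 5), (200, 3), (9, 250), (64, 64)]     # hypothetical (sample, coefficient) pairs

x2_cleared = mac_channel(taps, m2)                 # register cleared to 0
x2_preset = mac_channel(taps, m2, s0=c2)           # register preset to c2
x3_cleared = mac_channel(taps, m3)
x3_preset = mac_channel(taps, m3, s0=c3)           # register preset to c3

print(x2_preset == (x2_cleared + c2) % m2, x3_preset == (x3_cleared + c3) % m3)
```

Presetting the register gives the same residue as adding the constant after the accumulation, so no adder stage is spent on it in this model.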
Solution 2 Should the above simplest solution be unfeasible for some reason, which seems rather unlikely, the constants $c_2$ and $c_3$ can be added at any other stage of computation in the respective residue datapath channels. We will detail the case of Version 2, because adding mod $2^n - 1$ and $2^n + 1$ the respective constants $c_2 = 2^{n-1} - 1$ and $c_3 = 2^{n-1} + 1$ is more difficult than in the case of Version 1, for which always $c_2 = 0$ and $c_3 = 1$, and which will appear later as a special case.
First, consider the datapath channel mod $2^n - 1$. Conceptually, the simplest MAC mod $2^n - 1$ [10] is nothing else but an array of n 2-input AND gates followed by the (n + 1)-operand MOMA mod $2^n - 1$. Obviously, no modifications are required for Version 1, because $c_2 = 0$. For Version 2, the addition of $c_2$ requires introducing somewhere in the datapath one CSA stage composed of n half-adders (with carry outputs non-complemented or complemented, depending on whether a given bit of $c_2$ is 0 or 1, respectively). It can be done by modifying an arbitrarily selected MAC of the datapath mod $2^n - 1$. Such a modification































Figure 3 Scheme of shifting constants to residue datapath channels in the reverse converter for the 3-moduli set {2n − 1, 2n, 2n + 1}.
of the MAC mod 2n − 1, except for those n for which
θ(n+2) = θ(n+1)+1, where θ(p) denotes the number of
CSA stages on a CSA tree that processes p input operands),
i.e., only for n = 3, 5, 8, 12, 18, 27, etc.
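The listed values of n can be reproduced by counting CSA stages. The sketch below assumes a tree of 3-input carry-save adders in which every stage turns each group of three operands into two, until two operands remain; `csa_stages` is a hypothetical name standing for θ(p).

```python
def csa_stages(p):
    """theta(p): number of 3:2 CSA stages needed to reduce p operands to 2."""
    stages = 0
    while p > 2:
        p -= p // 3          # each full group of three operands becomes two
        stages += 1
    return stages

# Values of n for which one more operand (n + 2 instead of n + 1)
# costs one extra CSA stage in the MOMA of the MAC.
critical_n = [n for n in range(3, 30)
              if csa_stages(n + 2) == csa_stages(n + 1) + 1]
print(critical_n)   # [3, 5, 8, 12, 18, 27]
```

For all other n in this range, the tree that handles n + 1 operands has room for one more operand at the same depth.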
As for the datapath channel mod $2^n + 1$, several possibilities exist which do not involve any changes that could increase the area or delay. However, to clarify this issue, we must first recall some details regarding the internal logic structure of a MAC mod $2^n + 1$, e.g., the one proposed in [10]. The latter design employs a tree of n-bit CSAs with some signals complemented and, in particular, complemented End-Around Carry (EAC) signals. In a given arithmetic circuit mod $2^n + 1$, any complemented signal of weight $2^i$ requires adding mod $2^n + 1$ the corrective constant equal to $|-2^i|_{2^n+1}$ [21]. For such a circuit, a cumulative total constant $|C_{total}|_{2^n+1}$ can be calculated and added whenever it is most convenient, i.e., it does not have to be added at each stage of the datapath mod $2^n + 1$, unless an intermediate unbiased result is actually needed. In the best case, the cumulative total constant $|C_{total}|_{2^n+1}$ can be added only once, e.g., at the final stage of the datapath channel mod $2^n + 1$, when the final result $|S_N|_{2^n+1} = x_3$ is produced, which is one of the three input signals of the reverse converter considered here (recall also that $x_3$ assumes only the valid values mod $2^n + 1$, i.e., $0 \le x_3 \le 2^n$). Obviously, for all cases when $|C_{total}|_{2^n+1} \neq 0$, the addition of $c_3$ can be done at no cost by replacing the addition of $|C_{total}|_{2^n+1}$ with the addition of $|C_{total} + c_3|_{2^n+1}$. In the remaining special case of $|C_{total}|_{2^n+1} = 0$, the adder mod $2^n + 1$ used at the final stage of the datapath channel mod $2^n + 1$, which adds the pair of n-bit vectors in the carry-save form $C^* = (c^*_{n-2} \ldots c^*_0 c^*_{n-1})$ and $S^* = (s^*_{n-1} \ldots s^*_0)$ to compute $x_3 = |C^* + S^*|_{2^n+1}$, must be modified, as it should now produce $|C^* + S^* + c_3|_{2^n+1}$. The most obvious general implementation of the latter involves introducing one n-bit CSA with complemented EAC (the latter contributes the constant −1, so that the adder actually computes $|C^* + S^* + c_3 - 1|_{2^n+1}$) followed by the adder mod $2^n + 1$. Such a modification involves the extra delay of one half-adder and the area cost of n extra half-adders (which still halves the total of 2n half-adders that would be used, should this constant be added within the reverse converter). Finally, the special case of $|C_{total} + c_3|_{2^n+1} = 1$ (which is also the case of Version 1) involves no extra cost at all, because the addition $|C^* + S^* + 1|_{2^n+1}$ can be implemented using the n-bit adder with inverted EAC of [37].
4.4.3 Example
Consider an RNS-based implementation of the 16-tap FIR filter with 8-bit input operands and 8-bit coefficients. The required dynamic range of 20 bits is guaranteed by the 3-moduli set with n = 7, i.e., $\{m_1, m_2, m_3\} = \{128, 127, 129\}$. Eq. 1 is implemented using three residue datapaths mod $m_i$,












where the initial value of $|S_0|_{m_i}$ is usually set to 0. The three basic MAC units mod 128, 127, and 129, required to build these residue datapaths, can be implemented using the design approach presented in [10], although some slightly improved versions proposed in [25] can be used as well.
The general scheme is as shown in Fig. 3, assuming that the following constants are added inside the residue datapath channels: $c_2 = 0$ and $c_3 = 1$ for Version 1, and $c_2 = 63$ and $c_3 = 65$ for Version 2. Should it be feasible for the architecture actually used, at the beginning of computations the residue datapath registers containing the initial values $|S_0|_{2^n-1}$ and $|S_0|_{2^n+1}$ are loaded with the suitable constants required by the reverse converter rather than with 0's, i.e., $|S_0|_{2^n-1} = c_2$ and $|S_0|_{2^n+1} = |C_{total} + c_3|_{2^n+1}$. Should loading of the initial non-zero constants $c_2$ and $c_3$ be unfeasible, some of the alternative methods proposed in the previous subsection could be used. Further, in this example, we disregard the actual value of $|C_{total}|_{2^n+1}$ (the only other constant to be added, which can be calculated according to the internal structure of the MAC mod $2^n + 1$), take into account $c_2$ and $c_3$, and assume that the result of the calculations from the residue datapaths is $\{x_1, x_2, x_3\} = \{1, 2, 3\}$.
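For Version 1, adding the values of $c_2 = 0$ and $c_3 = 1$ to the appropriate datapath channels results in $\{x_1, x_2, x_3\} = \{1, 2, |1 + 3|_{129}\} = \{1, 2, 4\}$, which is in binary $\{x_1, x_2, x_3\} = \{(0000001), (0000010), (00000100)\}$. From Eqs. 22, 23, and 24 respectively, we have:
$$w_1 = 16128,$$
$$w_2 = \big(x_{2,0} \,\|\, (x_{2,n-1} \ldots x_{2,0} \,\|\, x_{2,n-1} \ldots x_{2,1})\big) = (0\|0000010\|000001) = 129,$$
$$w_3 = \big(\bar{x}_{3,0} \,\|\, ((x_{3,n-1} \vee x_{3,n}) \ldots (x_{3,0} \vee x_{3,n})) \,\|\, (\bar{x}_{3,n-1} \ldots \bar{x}_{3,1})\big) = (1\|0000100\|111101) = 8509.$$
From Eq. 30 we obtain $X_h = |w_1 + w_2 + w_3|_{2^{2n}-1} = |16128 + 129 + 8509|_{16383} = 8383$, which substitutes $X_h$ in Eq. 10 to provide the final value of X given by $X = x_1 + 2^n X_h = 1 + 2^7 \cdot 8383 = 1073025$.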
For Version 2, adding the values of $c_2 = 63$ and $c_3 = 65$ to the appropriate datapath channels results in $\{x_1, x_2, x_3\} = \{1, |2 + 63|_{127}, |3 + 65|_{129}\} = \{1, 65, 68\}$, which is in binary $\{x_1, x_2, x_3\} = \{(0000001), (1000001), (01000100)\}$. From Eqs. 28, 23, and 29 respectively, we have:
$$w_1 = \big((\bar{x}_{1,n-1} \ldots \bar{x}_{1,0}) \,\|\, \bar{x}_{3,n} \,\|\, \underbrace{(0 \ldots 0)}_{n-2} \,\|\, x_{3,n}\big) = (11111101\|00000\|0) = 16192,$$
$$w_2 = \big(x_{2,0} \,\|\, (x_{2,n-1} \ldots x_{2,0} \,\|\, x_{2,n-1} \ldots x_{2,1})\big) = (1\|1000001\|100000) = 12384,$$
$$w_3 = \big(\overline{(x_{3,0} \vee x_{3,n})} \,\|\, (x_{3,n-1} \ldots x_{3,0}) \,\|\, (\bar{x}_{3,n-1} \ldots \bar{x}_{3,1})\big) = (1\|1000100\|011101) = 12573.$$
From Eq. 30 we obtain the same value of $X_h = |w_1 + w_2 + w_3|_{2^{2n}-1} = |16192 + 12384 + 12573|_{16383} = 8383$.
In summary, we have shown the feasibility of avoiding
the cost of one additional CSA stage in our new convert-
ers, by assuming that the appropriate constants are already
added to the input residues x2 and x3 of the converter.
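The Version 2 part of this example can be replayed in software. The sketch below builds the three 2n-bit vectors bit by bit and sums them mod $2^{2n} - 1$; all function names are hypothetical, and the leading bit of $w_3$ is taken as the complement of $(x_{3,0} \vee x_{3,n})$, which is an assumption made to agree with the vector (1‖1000100‖011101) printed in the example.

```python
def bits(value, width):
    """Bits of value, most significant first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]

def to_int(bit_list):
    """Inverse of bits()."""
    out = 0
    for b in bit_list:
        out = (out << 1) | b
    return out

def invert(bit_list):
    return [1 - b for b in bit_list]

def version2_xh(x1, x2, x3, n):
    """X_h = |w1 + w2 + w3| mod (2^(2n) - 1), i.e. Eq. 30 without C_k."""
    b1 = bits(x1, n)                  # x1,n-1 ... x1,0
    b2 = bits(x2, n)                  # x2,n-1 ... x2,0
    b3 = bits(x3, n + 1)              # x3,n ... x3,0
    x3n, low3 = b3[0], b3[1:]         # x3,n and x3,n-1 ... x3,0
    w1 = invert(b1) + [1 - x3n] + [0] * (n - 2) + [x3n]
    w2 = [b2[-1]] + b2 + b2[:-1]
    w3 = [1 - (low3[-1] | x3n)] + low3 + invert(low3[:-1])
    return (to_int(w1) + to_int(w2) + to_int(w3)) % (2**(2 * n) - 1)

def convert_with_shifted_constants(x1, x2, x3, n):
    """Add c2, c3 in the channels (Eqs. 33-34), then apply X = x1 + 2^n * X_h."""
    c2, c3 = 2**(n - 1) - 1, 2**(n - 1) + 1
    x2s = (x2 + c2) % (2**n - 1)
    x3s = (x3 + c3) % (2**n + 1)
    return x1 + 2**n * version2_xh(x1, x2s, x3s, n)

print(version2_xh(1, 65, 68, 7))                    # 8383
print(convert_with_shifted_constants(1, 2, 3, 7))   # 1073025
```

The returned X = 1073025 has the residues {1, 2, 3} of the unmodified datapath result, which is the purpose of shifting the constants.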
4.5 Constant Removal From the Reverse Converters
for 4- and 5-moduli sets
The same rules of shifting out constants as presented above
can be directly applied to other reverse converters for the
special multi-moduli sets constructed by extending the clas-
sic 3-moduli set {2n − 1, 2n, 2n + 1}, like those of [1,
6, 20], in which the reverse converter for the 3-moduli
set {2n − 1, 2n, 2n + 1} can be used as a subcircuit. For
instance, it can be done by first dividing the moduli set
into at least two subsets, one of which is the 3-moduli set
{2n−1, 2n, 2n+1}. The residues corresponding to this mod-
uli set are transformed using the reverse converter to obtain
one number, which is an intermediate result for this moduli
set. (In particular, it can be the converter proposed above,
provided that the necessary constants are added in residue
datapath channels mod 2n ± 1.)
Next, the two-moduli MRC from Eq. 5 is applied once
or twice to complete the reverse converter for the 4- or 5-
moduli set, respectively. The application of the MRC in
this step is nothing else but the modulo generation from
the number obtained in the preceding step followed by the
modular subtraction and multiplication by the multiplica-
tive inverse. Depending on a specific implementation of this
step, a number of constants can appear. As this part of the
converter is a kind of residue datapath modulo one of the
two last moduli, the constants involved in these steps (or one
last step, in the case of the 4-moduli set) can then be shifted
out to the residue datapath channels corresponding to these
moduli.
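The extension step described above can be sketched as follows. The code uses the standard two-moduli mixed-radix form, which is assumed here to correspond to Eq. 5; the function name `mrc_extend`, the choice n = 4, and the test value are illustrative only, and the extra modulus $2^{n+1} - 1$ is the one of the first 4-moduli set of [1].

```python
def mrc_extend(X_sub, M_sub, x_new, m_new):
    """Combine X_sub (known mod M_sub) with the residue x_new mod m_new."""
    inv = pow(M_sub, -1, m_new)               # multiplicative inverse of M_sub mod m_new
    k = ((x_new - X_sub) * inv) % m_new       # modular subtraction and multiplication
    return X_sub + M_sub * k

n = 4
M3 = 2**n * (2**n - 1) * (2**n + 1)           # range of the 3-moduli subset (4080)
m4 = 2**(n + 1) - 1                           # fourth modulus (31), coprime to M3 for n = 4

X = 123456                                    # arbitrary value below M3 * m4
X3 = X % M3                                   # stands for the output of the 3-moduli converter
x4 = X % m4                                   # residue of the fourth channel
print(mrc_extend(X3, M3, x4, m4) == X)
```

The term `(x_new - X_sub) % m_new` is the point at which constants of this step appear in the channel mod `m_new`.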
5 Complexity Evaluation
In this section, we will evaluate the gate-level complexity of a number of converters for the 3-moduli set $\{2^n, 2^n - 1, 2^n + 1\}$, including: Versions 1 and 2 of our converters, three specific designs which in the literature have been considered to be the most efficient [12, 34, 35], and two converters for the general 3-moduli set $\{2^k, 2^n - 1, 2^n + 1\}$ designed for the special case of k = n [8, 36]. (Note that, because the expressions used to implement all the converters of [12, 34, 35] contain no explicit constants, our technique of shifting the constants to the datapath channels can be applied only to the converters proposed here.) Here, we count logic gates, based on a number of basic primitives used in the architectural design detailed in the previous section, and then we estimate the circuit delay as the number of logic primitives present on the critical path. In the next section, we will evaluate the results of logic synthesis of all converters using commercial design tools, which provide more accurate estimations not only of the area and delay but also of the power consumption.
5.1 Hardware Complexity Evaluation
In Table 1, which summarizes the gate-level hardware complexity, we can distinguish three groups of components. The first one consists of bit-level manipulation components like primitive gates, inverters and MUXes (which we also treat as primitive gates). As in every converter mentioned there is a 3-operand [12, 35] or a 4-operand [8, 34, 36] addition mod $2^{2n} - 1$, the second group are full-adders (FAs) and half-adders (HAs) which form 2n-bit CSA stages producing the carry-save vector pair. The third group is the final adder mod $2^{2n} - 1$ reducing the carry-save pair into the final result. The differences between converters are revealed in the first two groups of components, because the same final adder mod $2^{2n} - 1$ occurs in all converters. It is seen that Version 1 of our converter is comparable with the converter from [35] (as it uses similar design assumptions), while in the other converters (notably in Version 2 of our converter) the number of two-input gates is fixed and does not depend on n. A real advantage of our new design (resulting from removal of the constants) may be seen in the total number of full- and half-adders. While in all other converters there are at least 2n full-adders, accompanied in [8, 34, 36] by about 2n half-adders, in both versions of our design the numbers of full- and half-adders are each about n.

Table 1 Hardware complexity (in [8] and [36] k = n).
Element               Ver.1   Ver.2   [8]      [12]     [34]     [35]   [36]
OR                    n       1       1        1        –        n      –
AND                   –       –       –        1        –        –      –
XOR                   –       –       –        –        1        –      –
NOT                   2n      2n + 1  2n + 1   2n + 1   3n − 1   2n     2n + 1
MUX                   –       –       –        –        2        –      –
Full-adder (FA)       n       n + 2   2n + 1   2n       2n       2n     2n + 1
Half-adder (HA)       n       n − 2   2n − 1   –        2n       –      2n − 1
Add. mod 2^{2n} − 1   1       1       1        1        1        1      1

In summary, the above estimations suggest that our converters should occupy a slightly smaller area, owing to the smaller number of full- and half-adders.
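To make the comparison of adder cells concrete for a given n, the counts of Table 1 can be tabulated as below. This is only a bookkeeping sketch: the dictionary `ADDER_COUNTS` transcribes the FA and HA rows of Table 1, a dash is read as zero, and the names are hypothetical.

```python
# (full-adders, half-adders) as functions of n, transcribed from Table 1
ADDER_COUNTS = {
    "Ver.1": (lambda n: n,         lambda n: n),
    "Ver.2": (lambda n: n + 2,     lambda n: n - 2),
    "[8]":   (lambda n: 2 * n + 1, lambda n: 2 * n - 1),
    "[12]":  (lambda n: 2 * n,     lambda n: 0),
    "[34]":  (lambda n: 2 * n,     lambda n: 2 * n),
    "[35]":  (lambda n: 2 * n,     lambda n: 0),
    "[36]":  (lambda n: 2 * n + 1, lambda n: 2 * n - 1),
}

def csa_cells(n):
    """Return {design: (FA count, HA count)} for a given n."""
    return {name: (fa(n), ha(n)) for name, (fa, ha) in ADDER_COUNTS.items()}

for name, (fa, ha) in csa_cells(7).items():      # n = 7, 20-bit dynamic range
    print(f"{name:6s} FA={fa:2d} HA={ha:2d}")
```

For n = 7, both new versions use 14 adder cells in total, of which only 7 or 9 are full-adders, against 14 or 15 full-adders in the other designs.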
5.2 Gate-Level Delay Evaluation
Table 2 provides the estimations of the delay of all converters considered. Each column shows the data of one converter, with delay components appearing in consecutive rows according to the order of calculations. Besides all the logic primitives included in the analysis of the hardware complexity, we have also included some fanout characteristics (expressed by the number of inputs driven by one signal), for the following reasons. As gates have limited fanout, should it be exceeded, the logic synthesis tool either builds a buffer tree which drives a larger number of gates or uses slower gates with higher output fanout. This may result in a synthesized circuit which is actually slower than direct delay analysis of all components present on the critical path would indicate (indeed, most estimations presented in the open literature count only the number of gate levels, regardless of fanout). Note, however, that in all 3-moduli converters considered here, the only signals with high fanouts are the primary inputs to the converter. To expose the high fanout issue, we have found the maximal fanouts on the converter input (which we understood simply as the number of occurrences of some literals $x_{i,j}$ (1 ≤ i ≤ 3) in bit-level manipulation expressions) and included them in the first row of Table 2 as $d_f(a)$, where a is the maximal number of gate inputs driven by the primary input.
Prior to comparing delays, notice that any significant delay reduction seems unfeasible, because the most significant part of the overall delay is contributed by the final adder mod $2^{2n} - 1$, whose usage could not have been avoided in any design proposed to date. It is seen that, besides fanout-related delay, any differences between the delays of various converters stem from the number of CSA stages on the critical path. The first (faster) group of designs includes two versions of our converters and the converters from [12, 35], which have only one FA stage, whereas the second (slower) group with two FA stages includes the converters from [8, 34, 36]. In the faster group, the delay of Version 2 is comparable with the delay of the converter from [12], which is due to small fanout and using similar gates for bit-level manipulations. The second best are Version 1 and the converter from [35], with the fanout equal to n in either case. In the slower group, the fastest seems to be the converter from [36], while the slowest is the converter from [34] which has




To obtain complexity figures that are as realistic as possible, we have performed logic synthesis of both versions of our converter as well as their best known counterparts from [8, 12, 34–36]. In an attempt to produce systematic and fairly comparable descriptions, the hardware description of all converters was done in parametrized structural Verilog, following identical coding guidelines and a similar module/sub-module layout, i.e., we separated bit-level manipulations, CSAs, and adders mod $2^{2n} - 1$ (designed according to [17]), and used the hierarchy-preserving feature of the logic synthesis tool.
Table 2 Delay estimations (in [8] and [36] k = n).
Ver. 1 Ver. 2 [8] [12] [34] [35] [36]
d_f(n)+ d_f(3)+ d_f(3)+ d_f(3)+ d_f(n + 1)+ d_f(n)+ d_f(2)+
d_OR+ d_NOT+ d_NOT+
d_OR+ d_NOT+ d_OR+ d_OR+ d_XOR+ d_OR+
d_FA+ d_FA+ 2d_FA+ d_FA+ 2d_FA+ d_FA+ 2d_FA+
d_m(2n) d_m(2n) d_m(2n) d_m(2n) d_m(2n) d_m(2n) d_m(2n)
d_f(a): the delay of the fanout tree of size a
d_m(2n): the delay of an adder mod 2^{2n} − 1
We have performed logic synthesis of all converters for a range of delay targets using Cadence RC Compiler (v. 8.10-s222 1) over the commercial CMOS065LP 65 nm low-power library from ST Microelectronics (CORE/CLOCK65LPSVT, v. 5.2.2). As is customary, to obtain more realistic delay and power estimations, we have added input and output registers to all designs. For each design, the minimum delay was found, which we understood as the smallest delay target for which the logic synthesis tool still reported a non-negative timing slack. Next, we have performed physical synthesis on a rectangular die whose size was selected to obtain a density of about 75 % (we have achieved values ranging from 70 to 80 %). Physical synthesis was performed using Cadence Encounter (v. 10.12-s181 1) and NanoRoute (v. 10.12-s010) tools. We have used the same scripts for the synthesis of all converters, without any specific changes or optimization.
The results of the place and route for all dynamic ranges (DR) from 8 to 38 bits, namely the power and area consumption at the minimal delay, are visualized respectively in Figs. 5 and 6 (the minimal delay itself is shown in Fig. 4). It can also be observed that the results of place and route timing in general follow the estimations given by the logic synthesis tool alone, which seems to be the result of a relatively simple design size wherein the impact of placement and/or routing on the final timing is limited.
6.2 Delay Evaluation
Figure 4 reveals the logarithmic growth of the minimal
delay with the dynamic range for all converters considered,
which is the direct outcome of the logarithmic depth of the







Figure 4 Minimum delay as a function of the dynamic range. (Note:
For DR = 8 and 11, the delay of the converters of [12, 35] is the same
as for new Version 2).
contributor). It is also seen that delay differences remain fairly constant for all converters. The synthesis results confirm that our converters are the fastest for all dynamic ranges considered, which could be attributed to a smaller number of potentially critical paths in the final addition (our converters involve about two and a half operands compared to three complete operands in those of [12, 35]), which in turn enabled the logic synthesis tool to use larger and faster gates (because they are fewer) with a smaller overall impact on the final power and area.
While the delays of both Versions 1 and 2 of our con-
verters are largely the same for most of the dynamic ranges
larger than 8, a noticeable exception are the dynamic ranges
of 35 and 38 bits: a likely explanation of larger delay of
Version 1 than of Version 2 seems to be larger fanout in
the former, resulting in a deeper buffer tree (or slower gates
driving large fanout). The special cases of the converters of
[8, 36] involve the addition of four operands resulting in an
additional CSA stage contributing about 100 ps extra delay.
The slowest converter amongst ASIC-specific designs is the
converter from [34], in which further 50 ps is added by one
XOR gate (see the fifth column of Table 2).
6.3 Power Consumption Evaluation
Figure 5 shows the power consumption of all designs con-
sidered. Power consumption presented was obtained as a
sum of dynamic and leakage power from the simulation
using 1000 random vectors using PrimeTime PX v2012.1H
on the netlists provided by the physical synthesis tool. While
the minimal delay obtained was relatively easy to explain,
as it was strictly related to the number of logic levels
and hence it was easily traceable by evaluating the critical
paths, the power consumption is the result of the simulation
Figure 5 Power at minimal delay.
and internal estimations performed by the power simula-
tion tool. Due to this, the power consumption estimations
contain more nonlinearities and their contributing factors
are more difficult to explain exclusively on the basis of the
internal architecture.
In general, our converters have smaller power consumption (accompanied by smaller minimal delay) than all their counterparts. The power consumption of Version 1 is, on average, smaller than that of Version 2. The larger power consumption of the converters from [8] than of those from [36], which both have a very similar design and introduce nearly the same minimal delay, requires additional consideration. A likely reason is the presence of the additional OR gate on the critical path in [8] (compared to uniformly distributed NOT gates in [36]) that puts additional strain on all paths originating from this OR gate, as the logic synthesis tool must scale up all gates on these paths, which results in higher power consumption. Finally, it is noticeable that the converters from [34] enjoy relatively small power consumption which, unfortunately, is accompanied by the largest delay.
6.4 Area Evaluation
Figure 6 shows the area of all synthesized converters,
obtained as the sum of reported areas occupied by logic
cells and estimated area of interconnections. As these fig-
ures strictly depend on the numbers of cells used, they are
more exact and consistent than power estimations.
The area occupied by all presented converters in general follows a slightly faster than linear trend resulting from the composition of the O(n log n) complexity of the final adder and the linear complexities of the other components (cf. Table 1), so no significant differences are observed between various converters. A slight vertical shift from the linear trend of the area observed for nearly all converters between 23 and 26 bits results from the depth of the parallel prefix final adder [17] (as n changes from 8 to 9).
Figure 6 Area at minimal delay.
In general, the area of our converters is smaller than that of all their counterparts for all dynamic ranges considered. We can also observe that delay, power, and area are somewhat correlated in our converters. For instance, for the dynamic ranges of 14, 23, and 29 bits, we observe higher delay of Version 2 accompanied by nearly the same power consumption and compensated by smaller area (smaller gates consume more power despite higher delay). Also, Version 2 of our converter occupies larger area for larger dynamic ranges (from 32 bits), which results from the smaller delay achieved (even though Version 1 uses a number of OR gates not used in Version 2). It seems that the larger area of the converters from [12, 35] results from using 2n full-adders (Table 1 reveals that in our converters, there are roughly n full-adders and n half-adders). The area of the converters from [8, 34, 36] remains close to that of all other converters but, as was observed before, it is accompanied by higher delay (thus the relatively small area results from the use of smaller and consequently slower gates).
6.5 Summary
Table 3 summarizes the improvement, in percentage terms,
of our converters with respect to their best existing coun-
terparts of [12, 35] as a function of the dynamic range: for
a given parameter, the best of our two versions and of [12,
35] is compared. The delay of the new converters is either
the same as in the previous designs or it is only slightly
reduced, up to 2.34 %. It is not surprising, because in all
converters considered the final adder mod 22n−1 (preceded
by at least two CSA stages) is the main delay contributor.
Table 3 Percentage assessment of improvements obtained.
DR        Delay   Area    Power   ADP     PDP
8         1.69    15.41   11.72   16.84   13.22
11        0.00    6.77    17.83   6.77    17.83
14        0.00    10.84   21.84   10.14   21.23
17        0.78    4.95    10.37   6.42    11.76
20        0.78    10.13   25.37   9.52    25.95
23        2.34    14.07   15.41   13.40   14.75
26        0.72    10.83   19.73   11.23   19.73
29        0.00    19.95   18.03   18.50   19.18
32        0.72    10.71   11.71   11.35   12.97
35        2.16    16.56   27.01   14.76   25.96
38        0.72    8.39    16.69   9.67    17.85
Average   0.90    11.69   17.79   11.69   18.22
Consequently, replacing one CSA stage with one stage of logic gates performing simple bit manipulations can only result in a relatively small delay reduction (0.9 % on average). Nevertheless, the area is reduced from about 5 % to about 20 % (on average by 11.69 %), and it is accompanied by even larger savings in the power consumption, from over 10 % up to 27 % (on average by about 17.79 %). Due to the negligible delay reduction (if any), the reductions obtained for the area-delay and power-delay products are highly correlated with those observed for the area and power consumption: they range for the area-delay product from over 6 % up to about 18.5 % (on average by 11.69 %) and for the power-delay product from about 12 % up to about 26 % (on average by 18.2 %).
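The ranges and averages quoted above follow directly from Table 3 and can be recomputed as below. The data are copied from Table 3; the names `TABLE3`, `COLUMNS`, and `summarize` are hypothetical.

```python
# DR (bits): (delay, area, power, ADP, PDP) improvements in percent, from Table 3
TABLE3 = {
    8:  (1.69, 15.41, 11.72, 16.84, 13.22),
    11: (0.00, 6.77, 17.83, 6.77, 17.83),
    14: (0.00, 10.84, 21.84, 10.14, 21.23),
    17: (0.78, 4.95, 10.37, 6.42, 11.76),
    20: (0.78, 10.13, 25.37, 9.52, 25.95),
    23: (2.34, 14.07, 15.41, 13.40, 14.75),
    26: (0.72, 10.83, 19.73, 11.23, 19.73),
    29: (0.00, 19.95, 18.03, 18.50, 19.18),
    32: (0.72, 10.71, 11.71, 11.35, 12.97),
    35: (2.16, 16.56, 27.01, 14.76, 25.96),
    38: (0.72, 8.39, 16.69, 9.67, 17.85),
}
COLUMNS = ("delay", "area", "power", "ADP", "PDP")

def summarize(table):
    """Return {column: (min, max, average rounded to 2 decimals)}."""
    stats = {}
    for idx, name in enumerate(COLUMNS):
        col = [row[idx] for row in table.values()]
        stats[name] = (min(col), max(col), round(sum(col) / len(col), 2))
    return stats

stats = summarize(TABLE3)
for name in COLUMNS:
    print(name, stats[name])
```

The keys of `TABLE3` are the dynamic ranges 3n − 1 for 3 ≤ n ≤ 13.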
7 Conclusion
In this paper, a new general approach to improving char-
acteristics of reverse (residue-to-binary) converters for a
Residue Number System (RNS) composed of arbitrary
moduli sets was suggested. The main idea is to formulate the
basic equation of the reverse converter in a form consisting
of two separate parts: one depending on input variables of
the reverse converter whereas the other is a single constant.
Such a separation allows reducing the number of operands added inside the reverse converter by one, because the constant, instead of being added inside the reverse converter, can be shifted out to the residue datapath channels. We have argued that adding the constant in the residue datapath channels can, in most cases, be done at no hardware cost or extra delay, so that applying this design approach can lead to overall power and area reduction and in some cases also to decreasing delay. Some suggestions which facilitate obtaining the reverse converter equation with separated constants were also given.
To illustrate various design issues of this new design
approach and to prove its efficiency, a new design method
of the residue-to-binary (reverse) converters for the popular
classic 3-moduli set {2n − 1, 2n, 2n + 1} was considered.
Investigations of different bit manipulations of the con-
verter’s input operands resulted in its two new versions.
Unlike any of the previous designs, the new sets of equations
contain separated constants, which can be shifted out from
the converter to the datapath channels at no cost, thus reduc-
ing the cost of the tree of carry-save adders (CSAs) of the
converter. Experimental results suggest that compared to all
of the state-of-the-art converters for this 3-moduli set, the
converters obtained using the newly proposed approach are
superior with respect to the area and power consumption
which are reduced on average by about 12 % and 18 %,
respectively, while delay is the same or slightly smaller. As
several larger multi-moduli RNSs have been proposed by
extending the set {2n − 1, 2n, 2n + 1}, for which reverse
converters have been constructed using the converter for
the classic 3-moduli set as a basic building block, all of
the latter converters (including those that will be proposed
in the future) could also enjoy better performance, once
their basic building block is improved. Future research will
include considering the possibility of formulating equations
of reverse converters for other moduli sets, in which variable
and constant parts could be separated, which would allow
to add constants within the residue datapath channels and
would likely result in more efficient converters.
Open Access This article is distributed under the terms of the
Creative Commons Attribution 4.0 International License (http://
creativecommons.org/licenses/by/4.0/), which permits unrestricted
use, distribution, and reproduction in any medium, provided you give
appropriate credit to the original author(s) and the source, provide a
link to the Creative Commons license, and indicate if changes were
made.
References
1. Ananda Mohan, P.V., & Premkumar, A.B. (2007). RNS-to-binary
converters for two four-moduli sets {2n − 1, 2n, 2n + 1, 2n+1 − 1}
and {2n − 1, 2n, 2n + 1, 2n+1 + 1}. IEEE Transactions on Circuits
and Systems I: Regular Papers, 54(6), 1245–1254.
2. Ananda Mohan, P.V., Premkumar, A.B., & Bhardwaj, M. (2001).
Comments on “Breaking the 2n-bit carry-propagation barrier in
residue to binary conversion for the [2n−1, 2n, 2n+1] moduli set”
and Author’s Reply. IEEE Transactions on Circuits and Systems
I: Regular Papers, 48(8), 1031.
3. Bernardson, P. (1985). Fast memoryless, over 64 bits, residue-
to-binary convertor. IEEE Transactions on Circuits and Systems,
CAS-32(3), 298–300.
4. Bhardwaj, M., Premkumar, A.B., & Srikanthan, T. (1998). Break-
ing the 2n-bit carry propagation barrier in residue to binary con-
version for the [2n −1, 2n, 2n +1] modula set. IEEE Transactions
on Circuits and Systems I: Fundamental Theory and Applications,
45(9), 998–1002.
5. Bi, G., & Jones, E.V. (1988). Fast conversion between binary and
residue numbers. Electronics Letters, 24(19), 1195–1197.
6. Cao, B., Chang, C.H., & Srikanthan, T. (2007). A residue-to-
binary converter for a new five-moduli set. IEEE Transactions on
Circuits and Systems I: Regular Papers, 54(5), 1041–1049.
7. Cao, B., Srikanthan, T., & Chang, C.H. (2005). Efficient reverse
converters for the four-moduli sets {2n, 2n − 1, 2n + 1, 2n+1 − 1}
and {2n, 2n − 1, 2n + 1, 2n−1 − 1}. IEE Proceedings - Computers
and Digital Techniques, 152(5), 687–696.
8. Chaves, R., & Sousa, L. (2007). Improving residue number system
multiplication with more balanced moduli sets and enhanced mod-
ular arithmetic structures. IEE Journal on Computers and Digital
Techniques, 1(5), 472–480.
9. Conway, R., & Nelson, J. (1999). Fast converter for 3 moduli RNS
using new property of CRT. IEEE Transactions on Computers,
48(8), 852–860.
10. Conway, R., & Nelson, J. (2004). Improved RNS FIR filter archi-
tectures. IEEE Transactions on Circuits and Systems II: Express
Briefs, 51(1), 26–28.
11. Dhurkadas, A. (1990). Comments on “An efficient residue
to binary converter design” by K. M. Ibrahim and S. N.
Saloum. IEEE Transactions on Circuits and Systems, 37(6), 849–
850.
12. Dhurkadas, A. (1998). Comments on “A high speed realization of
a residue to binary number system converter”. IEEE Transactions
on Circuits and Systems II: Analog and Digital Signal Processing,
45(3), 446–447.
13. Ibrahim, K.M., & Saloum, S.N. (1988). An efficient residue
to binary converter design. IEEE Transactions on Circuits and
Systems, 35(9), 1156–1158.
14. Jenkins, W.K. (1978). Techniques for residue-to-analog conver-
sion for residue-encoded digital filters. IEEE Transactions on
Circuits and Systems, CAS-25(7), 555–562.
15. Liu, Y., & Lai, E.M.K. (2004). Moduli set selection and cost esti-
mation for RNS-based FIR filter and filter bank design. Design
Automation for Embedded Systems, 9(2), 123–139.
16. Omondi, A., & Premkumar, B. (2007). Residue number systems:
theory and implementation. London: Imperial College Press.
17. Patel, R., & Boussakta, S. (2007). Fast parallel-prefix architectures
for modulo $2^n - 1$ addition with a single representation of zero.
IEEE Transactions on Computers, 56(11), 1484–1492.
18. Patronik, P., Berezowski, K., Biernat, J., Piestrak, S.J., & Shri-
vastava, A. (2012). Design of an RNS reverse converter for a
new five-moduli special set. In Proceedings on ACM Great Lakes
symposium VLSI (GLSVLSI) (pp. 67–70). Salt Lake City.
19. Patronik, P., & Piestrak, S.J. (2014). Design of reverse converters
for general RNS moduli sets $\{2^k, 2^n - 1, 2^n + 1, 2^{n-1} - 1\}$ and
$\{2^k, 2^n - 1, 2^n + 1, 2^{n+1} - 1\}$ (n even). IEEE Transactions on
Circuits and Systems I: Regular Papers, 61(6), 1687–1700.
20. Patronik, P., & Piestrak, S.J. (2014). Design of reverse converters
for the new RNS moduli set $\{2^n + 1, 2^n - 1, 2^n, 2^{n-1} + 1\}$
(n odd). IEEE Transactions on Circuits and Systems I: Regular
Papers, 61(12), 3436–3449.
21. Piestrak, S.J. (1994). Design of residue generators and multi-
operand modular adders using carry-save adders. IEEE Transac-
tions on Computers, 43(1), 68–77.
22. Piestrak, S.J. (1995). A high-speed realization of a residue to
binary number system converter. IEEE Transactions on Circuits and
Systems II: Analog and Digital Signal Processing, 42(10), 661–663.
23. Piestrak, S.J. (2011). Design of multi-residue generators using
shared logic. In IEEE international symposium on circuits and
systems (ISCAS) (pp. 1435–1438). Rio de Janeiro.
24. Piestrak, S.J., & Berezowski, K.S. (2008). Architecture of efficient
RNS-based digital signal processor with very low-level pipelin-
ing. In IET Irish signals and systems conference (pp. 127–132).
Galway.
25. Piestrak, S.J., & Berezowski, K.S. (2008). Design of residue
multipliers-accumulators using periodicity. In IET Irish signals
and systems conference (pp. 380–385). Galway.
26. Skavantzos, A. (1998). Efficient residue to weighted converter
for a new residue number system. In Proc. IEEE Great Lakes
symposium VLSI (GLSVLSI) (pp. 185–191). Lafayette.
27. Skavantzos, A., Abdallah, M., & Stouraitis, T. (2007). Large
dynamic range RNS systems and their residue to binary converters.
Journal of Circuits, Systems and Computers, 16(2), 267–286.
28. Smitha, K.G., & Vinod, A.P. (2012). A reconfigurable channel
filter for software defined radio using RNS. Journal of Signal
Processing Systems, 67(3), 229–237.
29. Sousa, L., & Antão, S. (2012). MRC-based RNS reverse converters
for the four-moduli sets $\{2^n + 1, 2^n - 1, 2^n, 2^{2n+1} - 1\}$ and
$\{2^n + 1, 2^n - 1, 2^{2n}, 2^{2n+1} - 1\}$. IEEE Transactions on Circuits
and Systems II: Express Briefs, 59(4), 244–248.
30. Sousa, L., Antão, S., & Chaves, R. (2013). On the design of
RNS reverse converters for the four-moduli set
$\{2^n + 1, 2^n - 1, 2^n, 2^{n+1} + 1\}$. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 21(10), 1945–1949.
31. Sweidan, A., & Hiasat, A. (1988). A new efficient memoryless
residue to binary converter. IEEE Transactions on Circuits and
Systems, 35(11), 1441–1444.
32. Taylor, F.J., & Ramnarayan, A. (1981). An efficient residue-to-
decimal converter. IEEE Transactions on Circuits and Systems,
CAS-28(12), 1164–1169.
33. Wang, Y. (2000). Residue-to-binary converters based on new Chi-
nese Remainder Theorem. IEEE Transactions on Circuits and
Systems II, 47(3), 197–205.
34. Wang, Y., Song, X., Aboulhamid, M., & Shen, H. (2002). Adder
based residue to binary number converters for $(2^n - 1, 2^n, 2^n + 1)$.
IEEE Transactions on Signal Processing, 50(7), 1772–1779.
35. Wang, Z., Jullien, G.A., & Miller, W.C. (2000). An improved
residue-to-binary converter. IEEE Transactions on Circuits and
Systems I: Fundamental Theory and Applications, 47(9), 1437–
1440.
36. Wey, C.L. (2006). Residue-to-binary converters for high-speed
digital signal processing. In Proceedings on IEEE international
conference on electro-information technology (pp. 421–426). East
Lansing.
37. Zimmermann, R. (1999). Efficient VLSI implementation of modulo
$2^n \pm 1$ addition and multiplication. In Proceedings on the IEEE
symposium on computer arithmetic (pp. 158–167). Adelaide.
Piotr Patronik received the MSc and PhD degrees, both in computer
engineering, from the Wroclaw University of Technology in 2001 and
2006, respectively. He is currently an Assistant Professor at the
Faculty of Electronics of the Wroclaw University of Technology,
Wroclaw, Poland. During academic year 2012–2013 he was on leave at
the Institut Jean Lamour, Université de Lorraine, Nancy, France,
where he participated in the ARDyT project funded by the ANR. His
research interests include design of VLSI digital circuits,
computer arithmetic, fault-tolerant computing, and parallel computing.
Stanisław J. Piestrak is a Professor at the Institut Jean
Lamour/Université de Lorraine, Nancy/Metz, France. He received the
Ph.D. and Habilitation degrees, both in computer science, in 1982 and
1996 from the Wroclaw University of Technology and the Gdansk
University of Technology, Poland, respectively. His research
interests include design of VLSI digital circuits, fault-tolerant
computing (self-checking circuits design, coding theory, and
reconfigurable systems), and computer arithmetic (design of hardware
for applications using residue number system (RNS)).
