Hardware realization of residue number system algorithms by Boolean
  functions minimization by Gorodecky, Danila & Villa, Tiziano
Danila Gorodecky1 and Tiziano Villa2
1 National Academy of Science of Belarus, Minsk, Belarus
danila.gorodecky@gmail.com
2 University of Verona, Verona, Italy
tiziano.villa@univr.it
Abstract. Residue number systems (RNS) represent numbers by their
remainders modulo a set of relatively prime numbers. This paper pro-
poses an efficient hardware implementation of modular multiplication
and of the modulo function (X (mod P)), based on Boolean minimization.
We report experiments showing a performance advantage of up to 30 times
for our approach over the results obtained with state-of-the-art industrial
tools.
Keywords: modular multiplication · modulo function · residue number
system · computer arithmetic · Boolean minimization.
1 Introduction
The idea of the Residue Number System (RNS) goes back to an ancient Chinese
source showing how to convert residues into numbers, and was later formalized
by C.F. Gauss in the 19th century. Since the advent of digital computers, there
have been many papers proposing algorithms to implement RNS efficiently on
computers.
The main advantage of RNS is the speed and reliability of arithmetic
computations [1,2,3]. The first application of RNS was in the search for prime
numbers. Nowadays implementations of RNS can be found in anti-aircraft
systems [4], neural computations [1], real-time signal processing (pattern
recognition) [5], and cryptography [6]. Modular arithmetic (MA) is effective for
processing large data flows (with several hundred or several thousand bits) [7].
RNS is a form of parallel data processing, where computer arithmetic is
performed using the residues of the division by a pre-selected base of co-prime
moduli $\{p_1, p_2, \ldots, p_m\}$. The residues have fewer digits than the
original numbers, and arithmetic operations over the residues can be performed
separately for each modulus of the base, resulting in faster processing (e.g.,
faster addition and multiplication) compared to other forms of parallel data
processing.
Data processing in modular arithmetic includes the following steps. Firstly,
input operands $A_1, A_2, \ldots, A_n$ are converted from positional to modular
representations by computing the remainders (or residues) with respect to the
moduli $\{p_1, p_2, \ldots, p_m\}$ (see left block in Fig. 1); then arithmetic
operations over the residues of the operands are computed for each modulus
$p_i$, where $i = 1, \ldots, m$ (middle block in Fig. 1); finally, the results
$S_1, S_2, \ldots, S_m$ for each modulus are converted back from the modular to
the positional representation $S$ (see right block in Fig. 1).
arXiv:1808.03083v1 [cs.AR] 9 Aug 2018
Fig. 1: Common structure of RNS.
Conversion into modular representation (direct conversion) is realized by the
modulo function X (mod P), whose result is fed into the second step of
operations. The second step of the RNS computation requires performing modular
summation, multiplication, and other arithmetic functions such as
$A \cdot B + C$. The third step in RNS computes the polynomial form
$S_1 \cdot C_1 + S_2 \cdot C_2 + \ldots + S_m \cdot C_m - P \cdot r$, where
$S_1, S_2, \ldots$ are outputs of the previous step, $C_1, C_2, \ldots$ are
pre-calculated constants, $r$ is a constant which is obtained during the
computation of the polynomial, and $P = p_1 \cdot p_2 \cdot \ldots \cdot p_m$.
In other words, the third step in RNS computes
$(S_1 \cdot C_1 + S_2 \cdot C_2 + \ldots + S_m \cdot C_m) \pmod{P}$. Therefore,
the main arithmetic operations needed for RNS computations are the modulo
function X (mod P), modular summation, and modular multiplication.
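As a minimal illustration of the three steps above, the following Python sketch converts operands to residues, evaluates A · B + C independently for each modulus, and recovers the positional result as (S1 · C1 + ... + Sm · Cm) (mod P). The moduli {7, 11, 13}, the operand values, and all function names are illustrative assumptions. The constants Ci are computed here with the standard Chinese remainder construction Ci = (P/pi) · ((P/pi)^(-1) mod pi), which the text does not spell out.

```python
from math import prod

def to_rns(x, moduli):
    """Step 1: positional -> modular representation (one residue per modulus)."""
    return [x % p for p in moduli]

def crt_constants(moduli):
    """Pre-calculated constants C_i (standard CRT construction, assumed here)."""
    big_p = prod(moduli)
    return [(big_p // p) * pow(big_p // p, -1, p) for p in moduli]

def from_rns(residues, moduli):
    """Step 3: (S_1*C_1 + ... + S_m*C_m) mod P."""
    big_p = prod(moduli)
    constants = crt_constants(moduli)
    return sum(s * c for s, c in zip(residues, constants)) % big_p

moduli = [7, 11, 13]                      # pairwise co-prime, P = 1001
a, b, c = 25, 30, 17                      # A*B + C = 767 < P, so it is representable

ra, rb, rc = (to_rns(v, moduli) for v in (a, b, c))
# Step 2: the same operation is carried out separately for every modulus.
residues = [(x * y + z) % p for x, y, z, p in zip(ra, rb, rc, moduli)]
result = from_rns(residues, moduli)
```

The per-modulus computations in step 2 never see a value larger than the modulus squared, which is where the parallelism and the short word lengths come from.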
A major limitation when processing large numbers in RNS is the complexity
of hardware realization of converters (left and right blocks in Fig. 1).
This is due to the fact that to compute the modulo function and to recover
the positional representation one should perform division, modular
multiplication, and comparison. There are different approaches to solve this
problem (e.g. [1,3,8]), but, mostly, they are restricted with respect to the
modular values (e.g., mod $2^k - 1$, mod $2^k$, mod $2^k + 1$) and to the number
of operands.
In this paper we describe algorithms for the modulo function (X(mod P ))
and for modular multiplication. We report experimental results comparing with
industrial tools (Synopsys and Mentor Graphics).
2 Hardware design of arithmetic units
The approach that we propose is characterized as follows:
1. It is valid for an arbitrary modulo and bit range of the inputs.
2. It can be applied to modular multiplication and to the modulo function.
3. It is based on combinational logic.
In the literature we can find techniques to compute the modular multiplication
[9] and the modulo function [8,9], but they are based on memory usage, require
a large area, and incur high latency. The proposed procedures share some
common tasks:
1. Inputs (the factors A and B in multiplication, or the input X in X (mod P))
are split into subvectors.
2. All subvectors are combined to define a polynomial.
3. This procedure is iterated as long as the result is ≥ 2 · P.
2.1 Computation of the modulo function
The modulo function X (mod P) can be computed by means of combinational or
sequential circuits.
Some sequential realizations store pre-calculated values of the modulo function
[13,14], compute it using an automaton model [15], or resort to pipelining
using a chain of homogeneous arithmetic blocks [10], where every term of the
following expansion corresponds to an arithmetic block in hardware:
$$X = P \cdot Q + R = P \cdot 2^{\delta} \cdot q_{\delta} + P \cdot 2^{\delta-1} \cdot q_{\delta-1} + \ldots + P \cdot 2^{0} \cdot q_{0} + R$$
and X (mod P) = R, where $X = (x_{\psi}, x_{\psi-1}, \ldots, x_1)$ and $\delta$ is
defined by the inequality $P \cdot 2^{\delta} \le 2^{\psi} - 1 < P \cdot 2^{\delta+1}$.
Notice that P can be an arbitrary number.
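The chain of blocks can be mimicked in software. In the sketch below, block i subtracts P · 2^i whenever the running value allows it and thereby fixes the quotient bit q_i. The function name and the way δ is derived from ψ (the largest exponent for which P · 2^δ does not exceed the largest ψ-bit value) are assumptions of this sketch, not a description of the circuit in [10].

```python
def mod_by_chain(x, p, psi):
    """Return (Q, R) with x = p*Q + R, using one compare-and-subtract per block."""
    assert 0 <= x < (1 << psi) and 0 < p < (1 << psi)
    # Largest delta with p * 2**delta <= 2**psi - 1 (assumption of this sketch).
    delta = (((1 << psi) - 1) // p).bit_length() - 1
    q, r = 0, x
    for i in range(delta, -1, -1):        # blocks from most to least significant
        if r >= (p << i):                 # block i decides the quotient bit q_i
            r -= p << i
            q |= 1 << i
    return q, r

quotient, remainder = mod_by_chain(1575, 47, psi=11)
```

Every block performs the same operation on a shifted copy of P, which is why the chain is homogeneous and lends itself to pipelining.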
Approaches with no memory that are efficient with respect to performance
and area require special moduli sets [9], which consist of variations of
$2^s \pm v$, where $v = 1, 3, 5$: $\{2^s - 1, 2^s, 2^s + 1\}$,
$\{2^{2s} - 1, 2^s, 2^{2s} + 1\}$,
$\{2^s - 1, 2^{2s}, 2^s + 1, 2^{s-1} - 1, 2^{s+1} - 1\}$, etc.
Given that in the RNS representation the moduli must be co-prime numbers,
multiplication of two 1000-bit numbers using the moduli
$\{2^s - 1, 2^{2s}, 2^s + 1, 2^{s-1} - 1, 2^{s+1} - 1\}$ requires
$s \approx 400$ bits, which impairs the computational efficiency of the
transformation. The same multiplication can be realized using a set of smaller
moduli, since there are more than 400 pairwise co-prime numbers of up to
12 bits. Note that in order to represent numbers in RNS uniquely, the result of
the calculation must be smaller than $P = p_1 \cdot p_2 \cdot \ldots \cdot p_m$.
If $P = (2^s - 1) \cdot 2^{2s} \cdot (2^s + 1) \cdot (2^{s-1} - 1) \cdot (2^{s+1} - 1)$,
then $s$ is approximately 400.
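The claim about small moduli can be checked with a short sketch. It uses the primes below 2^12 as one convenient pairwise co-prime set (an assumption made for illustration; the text does not say which moduli are meant) and compares the resulting dynamic range P with the 2000 bits needed for the product of two 1000-bit numbers.

```python
from math import prod

def primes_below(limit):
    """Sieve of Eratosthenes: all primes p with p < limit."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            # Clear every multiple of n starting at n*n.
            sieve[n * n::n] = bytearray(len(range(n * n, limit, n)))
    return [n for n in range(limit) if sieve[n]]

moduli = primes_below(1 << 12)            # every prime that fits in 12 bits
dynamic_range = prod(moduli)              # P = p1 * p2 * ... * pm

enough_moduli = len(moduli) > 400
covers_product = dynamic_range.bit_length() > 2000
```

Distinct primes are pairwise co-prime by construction, so no further check is required for this particular set.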
We propose the following two-step procedure to compute X (mod P):
1. X is split into k subvectors with ≤ δ bits in every subvector, where
$\delta = [\log_2 P - 1]$.
2. The resulting subvectors are combined according to Eq. 1:
$$X \pmod{P} = \sum_{i=1}^{k} X_i \cdot \left( 2^{\delta \cdot (i-1)} \pmod{P} \right). \tag{1}$$
This formula can be applied recursively, producing reduced intermediate results
at every step. The coefficient $2^{\delta \cdot (i-1)} \pmod{P}$ is a constant
and it does not exceed P − 1. At the first step, Eq. 1 achieves its maximum
value when every $X_i = 2^{\delta} - 1$. Then Eq. 1 is applied recursively until
the result is < 2 · P. At the end, the result is compared with P and, if needed,
P is subtracted from the result of the last step.
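The procedure can be sketched in Python as follows. The function name is hypothetical, and the subvector width δ is taken to be the bit length of P, which is an assumption of this sketch (it gives δ = 6 for P = 47). In hardware the factors 2^(δ·(i−1)) (mod P) would be fixed constants; here they are produced with Python's three-argument `pow`.

```python
def modulo_by_subvectors(x, p):
    """Return x mod p by applying Eq. 1 repeatedly, without dividing x by p."""
    delta = p.bit_length()                # subvector width (assumption of this sketch)
    mask = (1 << delta) - 1
    while x >= 2 * p:
        total, i = 0, 0
        while x:
            # X_i times the constant 2^(delta*i) mod p (i counts from 0 here).
            total += (x & mask) * pow(2, delta * i, p)
            x >>= delta
            i += 1
        x = total                         # reduced intermediate result
    # Final comparison with p and, if needed, one subtraction.
    return x - p if x >= p else x

example = modulo_by_subvectors(2**18 - 1, 47)
```

With this choice of δ every constant is smaller than 2^δ, so each pass strictly decreases the value as long as it is at least 2 · P, and the loop terminates.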
For illustration, consider the following example. Suppose that X is an 18-bit
input and P = 47. Then P is a 6-bit number, and the input X is split into three
6-bit tuples $X = (X_3, X_2, X_1)$, where $X_1 = (x_6, x_5, \ldots, x_1)$,
$X_2 = (x_{12}, x_{11}, \ldots, x_7)$, and $X_3 = (x_{18}, x_{17}, \ldots, x_{13})$.
Then $2^{6} \pmod{47} = 17$ and $2^{12} \pmod{47} = 7$. Hence, in the first
iteration Eq. 1 takes the following form:
$$X \pmod{47} = X_1 + X_2 \cdot 2^{6} \pmod{47} + X_3 \cdot 2^{12} \pmod{47} = X_1 + X_2 \cdot 17 \pmod{47} + X_3 \cdot 7 \pmod{47} = S_1.$$
If the input is $X = 2^{18} - 1$, then its binary representation requires 18
bits, i.e., $X_1 = X_2 = X_3 = 63_{10} = 111111_2$. Then
$S_1 \le 63 + 63 \cdot 17 + 63 \cdot 7 = 1575_{10} = 11000100111_2$, an 11-bit
number that is split into a 6-bit tuple $S^1_1$ and a 5-bit tuple $S^1_2$. In
this case Eq. 1 takes the following form:
$$S_1 \pmod{47} = S^1_1 + S^1_2 \cdot 2^{6} \pmod{47} = S^1_1 + S^1_2 \cdot 17 \pmod{47} = S_2 \le 454.$$
The bound is reached for $S^1_1 = 111111_2$ and $S^1_2 = 10111_2$, which gives
$S_2 = 63 + 23 \cdot 17 = 454$. The second iteration splits the 9-bit number
$S_2$ into a 6-bit and a 3-bit tuple: $S_2 = (S^2_2, S^2_1)$, where
$S^2_2 = (s^2_9, s^2_8, s^2_7)$ and $S^2_1 = (s^2_6, s^2_5, \ldots, s^2_1)$. In
this case Eq. 1 takes the following form:
$$S_2 \pmod{47} = S^2_1 + S^2_2 \cdot 17 \pmod{47} = S_3 \le 165.$$
The bound is reached for $S^2_1 = 111111_2$ and $S^2_2 = 110_2$, which gives
$S_3 = 63 + 6 \cdot 17 = 165$. The third iteration splits the 8-bit number $S_3$
into a 6-bit and a 2-bit tuple: $S_3 = (S^3_2, S^3_1)$, where
$S^3_2 = (s^3_8, s^3_7)$ and $S^3_1 = (s^3_6, s^3_5, \ldots, s^3_1)$. In this
case Eq. 1 takes the following form:
$$S_3 \pmod{47} = S^3_1 + S^3_2 \cdot 17 \pmod{47} = S_4 \le 80.$$
The bound is reached for $S^3_1 = 111111_2$ and $S^3_2 = 01_2$, which gives
$S_4 = 63 + 17 = 80$. Since $S_4 < 2 \cdot P = 94$, $S_4$ is compared with
P = 47: if $S_4 \ge 47$, then X (mod 47) = $S_4 - 47$, else X (mod 47) = $S_4$.
2.2 Computation of the modular product
We propose the following two-step procedure to compute the product
$A \cdot B = R \pmod{P}$, where $A = (A_{\delta}, A_{\delta-1}, \ldots, A_1)$,
$B = (B_{\delta}, B_{\delta-1}, \ldots, B_1)$, and the subvectors $A_{\delta}$ and
$B_{\delta}$ consist of the most significant bits. For example, if A and B are
12-bit numbers and δ = 4, then $A_4 = (a_{12}, a_{11}, a_{10})$ and
$B_4 = (b_{12}, b_{11}, b_{10})$, where $a_{12}$ and $b_{12}$ are the most
significant bits.
This contribution proposes a computation of the modulo function for an
arbitrary modulus, without limitation on the value of P. The idea of the
approach is to use a large set of small moduli instead of a small set of large
moduli, as is done traditionally. Hence we consider A, B and P ranging from 6 to
12 bits.
1. The inputs are split into 2-, 3- and 4-bit subvectors.
2. The corresponding pairs of subvectors are multiplied applying the following
recursive formula:
$$R = \sum_{i=1}^{\delta} \sum_{j=1}^{\delta} A_i \cdot B_j \cdot \left( 2^{m \cdot (i+j-2) \cdot 3} \pmod{P} \right) = S_{temp}, \tag{2}$$
In the examples below, which use 3-bit subvectors, the pair $(A_i, B_j)$ is
weighted by $2^{3 \cdot (i+j-2)} \pmod{P}$.
The maximum value of $S_{temp}$ does not exceed $2^{3\delta+2}$, $2^{3\delta+3}$
or $2^{3\delta+4}$, depending on the value of the modulus P.
As an illustration, consider three common cases:
1. δ = 2, then $S_{temp} \le 2^{8}$ and
$S_{temp2} = S_{temp}[3:1] + S_{temp}[6:4] \cdot 2^{3} \pmod{P} + S_{temp}[8:7] \cdot 2^{6} \pmod{P}$;
2. δ = 3, then $S_{temp} \le 2^{12}$ and
$S_{temp2} = S_{temp}[3:1] + S_{temp}[6:4] \cdot 2^{3} \pmod{P} + S_{temp}[9:7] \cdot 2^{6} \pmod{P} + S_{temp}[12:10] \cdot 2^{9} \pmod{P}$;
3. δ = 4, then $S_{temp} \le 2^{15}$ and
$S_{temp2} = S_{temp}[3:1] + S_{temp}[6:4] \cdot 2^{3} \pmod{P} + S_{temp}[9:7] \cdot 2^{6} \pmod{P} + S_{temp}[12:10] \cdot 2^{9} \pmod{P} + S_{temp}[15:13] \cdot 2^{12} \pmod{P}$.
Finally, if $S_{temp2} \ge P$, then $S = S_{temp2} - P$, otherwise $S = S_{temp2}$.
Let us multiply two 6-bit numbers, $A \cdot B = S \pmod{47}$. Splitting the
operands into two (δ = 2) 3-bit subvectors, Eq. 2 takes the following form:
$$A \cdot B = S \pmod{47} = A_1 \cdot B_1 \pmod{47} + A_1 \cdot B_2 \cdot 2^{3} \pmod{47} + A_2 \cdot B_1 \cdot 2^{3} \pmod{47} + A_2 \cdot B_2 \cdot 2^{6} \pmod{47} = S_{temp}.$$
When A = 45 and B = 15, $S_{temp}$ achieves the maximum value, which is
$158_{10} = 10011110_2$: $A_1 = 101_2$, $A_2 = 101_2$, $B_1 = 111_2$,
$B_2 = 1_2$, hence
$$A \cdot B = 5 \cdot 7 \pmod{47} + 5 \cdot 1 \cdot 2^{3} \pmod{47} + 5 \cdot 7 \cdot 2^{3} \pmod{47} + 5 \cdot 1 \cdot 2^{6} \pmod{47} = 35 + 40 + 45 + 38 = 158.$$
For other values of A and B, $S_{temp}$ does not exceed 158.
The second iteration, followed by a final comparison with P, reduces $S_{temp}$
to a value < 47. Assume that $S_{temp} = 158$; then
$S_{temp2} = 6 + 3 \cdot 2^{3} \pmod{47} + 2 \cdot 2^{6} \pmod{47} = 6 + 24 + 34 = 64$.
Finally, taking into account that 64 ≥ 47, the result is S = 64 − 47 = 17.
Note that the bit range of $S_{temp}$ is preselected.
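The multiplication procedure can be sketched in Python as follows, assuming 3-bit subvectors and reducing every partial product modulo P on its own, as in the worked example. The function names are hypothetical. The final step is written as a loop rather than a single subtraction, which is a safeguard added in this sketch for moduli where one subtraction might not be enough.

```python
def split(value, width=3):
    """Split a non-negative integer into `width`-bit subvectors, least significant first."""
    parts = []
    while True:
        parts.append(value & ((1 << width) - 1))
        value >>= width
        if value == 0:
            return parts

def modmul(a, b, p, width=3):
    """Return (S_temp, S_temp2, S) for A*B mod P using subvector products."""
    # First iteration (Eq. 2): each term A_i*B_j*2^(3(i+j-2)) is reduced mod p.
    s_temp = sum((ai * bj * pow(2, width * (i + j), p)) % p
                 for i, ai in enumerate(split(a, width))
                 for j, bj in enumerate(split(b, width)))
    # Second iteration: the same reduction applied to the slices of S_temp.
    s_temp2 = sum((part * pow(2, width * k, p)) % p
                  for k, part in enumerate(split(s_temp, width)))
    s = s_temp2
    while s >= p:                         # safeguard of this sketch (see lead-in)
        s -= p
    return s_temp, s_temp2, s

s_temp, s_temp2, s = modmul(45, 15, 47)
```

For A = 45, B = 15 and P = 47 the three returned values reproduce the intermediate results of the example above: 158, 64 and 17.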
3 Boolean minimization in modular operations
The result of any arithmetic computation can be represented as a
sum-of-products (SOP). However, the original representation given by truth
tables may be unmanageable by synthesis tools, e.g., the truth table of the
product of two 16-bit input operands requires 64 columns (16 columns for each
operand and 32 columns for the result) and more than four billion rows.
For a pair of δ-bit tuples, consider the product
$X_i \cdot \left( 2^{\delta \cdot (i-1)} \pmod{P} \right)$, where $X_i$ and
$2^{\delta \cdot (i-1)} \pmod{P}$, $i = 1, 2, \ldots, k$, are the corresponding
factors of the multiplication. The second factor, written $2^{i} \pmod{P}$ for
short, is a constant whose bits are redundant in the minimization, because all
rows in the truth table corresponding to this constant have the same value
$2^{i} \pmod{P}$.
The initial truth table for X (mod P) consists of P rows and 2 · δ columns,
where the left δ columns correspond to all integers from 0 up to P − 1, and the
right columns correspond to $X \cdot 2^{i} \pmod{P}$.
Example. Consider $2^{8} \pmod{13} = 9 = 1001_2$. In this case, subtable 1
represents the truth table for X · 9 (mod 13) before minimization and subtable
2 represents the SOP after minimization (it can be obtained by tools like
Espresso [11] or ELS [12]). So the first four bits in the last row of the truth
table in subtable 1 represent $12_{10}$, and the right four bits represent
12 · 9 (mod 13) = $4_{10}$. For the 18-bit input X and P = 47, all pairs of
corresponding factors are represented as SOPs: with 12 columns (6 inputs and
6 outputs) for $X_2 \cdot 17 \pmod{47}$ and $X_3 \cdot 7 \pmod{47}$; with 11
columns (5 inputs and 6 outputs) for $X^1_2 \cdot 17 \pmod{47}$; with 9 columns
(3 inputs and 6 outputs) for $X^2_2 \cdot 17 \pmod{47}$; with 8 columns
(2 inputs and 6 outputs) for $X^3_2 \cdot 17 \pmod{47}$.
Table 1: Representation of X(mod 13) with SOPs

Subtable 1                    Subtable 2
a2 a1 b2 b1 | r4 r3 r2 r1     a2 a1 b2 b1 | r4 r3 r2 r1
0  0  0  0  | 0  0  0  0      0  1  0  0  | 1  0  0  0
0  0  0  1  | 1  0  0  1      1  0  1  -  | 1  0  0  0
0  0  1  0  | 0  1  0  1      0  0  0  1  | 1  0  0  0
0  0  1  1  | 0  0  0  1      0  1  1  1  | 1  0  0  0
0  1  0  0  | 1  0  1  0      0  1  0  1  | 0  1  0  0
0  1  0  1  | 0  1  1  0      -  0  1  0  | 0  1  0  0
0  1  1  0  | 0  0  1  0      1  -  0  0  | 0  1  0  0
0  1  1  1  | 1  0  1  1      1  0  0  -  | 0  0  1  0
1  0  0  0  | 0  1  1  1      0  1  -  -  | 0  0  1  0
1  0  0  1  | 0  0  1  1      -  0  0  1  | 0  0  0  1
1  0  1  0  | 1  1  0  0      1  0  0  -  | 0  0  0  1
1  0  1  1  | 1  0  0  0      0  -  1  1  | 0  0  0  1
1  1  0  0  | 0  1  0  0      0  0  1  -  | 0  0  0  1
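The relation between the two subtables can be checked with a short sketch: the truth table of X · 9 (mod 13) is generated for X = 0, ..., 12, and the cubes of Subtable 2 are evaluated as a sum of products, where "-" marks a don't-care input. The cube list is transcribed from Table 1; the function names and the string encoding are assumptions of this sketch and not the input format of any particular minimizer.

```python
P, C, BITS = 13, 9, 4      # modulus, constant 2^8 mod 13, input/output width

# Cubes of Subtable 2 as (input pattern a2 a1 b2 b1, output pattern r4 r3 r2 r1).
CUBES = [
    ("0100", "1000"), ("101-", "1000"), ("0001", "1000"), ("0111", "1000"),
    ("0101", "0100"), ("-010", "0100"), ("1-00", "0100"),
    ("100-", "0010"), ("01--", "0010"),
    ("-001", "0001"), ("100-", "0001"), ("0-11", "0001"), ("001-", "0001"),
]

def truth_table():
    """Subtable 1: one row per X in 0..P-1, mapping X to X*C mod P."""
    return {x: (x * C) % P for x in range(P)}

def matches(pattern, bits):
    """A cube covers an input if every non-don't-care position agrees."""
    return all(p in ("-", b) for p, b in zip(pattern, bits))

def eval_sop(x):
    """Subtable 2: OR together the outputs of all cubes that cover x."""
    bits = format(x, f"0{BITS}b")
    out = 0
    for pattern, output in CUBES:
        if matches(pattern, bits):
            out |= int(output, 2)
    return out

sop_is_correct = all(eval_sop(x) == y for x, y in truth_table().items())
```

Inputs 13 to 15 never occur for a residue modulo 13, so a minimizer is free to treat them as don't-cares, which is what allows cubes such as "1-00" to be merged.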
4 Experimental results
We compared our procedure with three electronic design automation (EDA) tools:
Synopsys, Mentor Graphics (for standard cells), and Xilinx (for FPGAs). Since
Mentor Graphics and Xilinx do not synthesize modular operations, we compared
using special moduli, such as $2^{s} - 1$ and $2^{s} + 1$. In these cases our
approach shows minor gains, within 10%.
Synopsys is the only EDA tool which generates X (mod P) circuits. We report
results of synthesis using Synopsys 2014 on 28 nm Standard Cell ASIC technology
from United Microelectronics Corporation. Plots 2 a) and 2 b) compare the
performance of circuits for X (mod P) (frequency in MHz) for inputs X of 400
and 500 bits, and for moduli P of 10, 11, and 12 bits, respectively. Plots 3 a)
and 3 b) compare the area of circuits for X (mod P) (number of cells from the
cell library) for inputs X of 400 and 500 bits, and for moduli P of 10, 11, and
12 bits, respectively.
Fig. 2: Modulo function: performance comparison of our approach vs. Synopsys
(frequency in MHz for each modulus P)

a) Bit range [400:1]
P           461     977     2011    4051
Synopsys    543     495     28      27
Approach    574     636     625     617

b) Bit range [500:1]
P           461     977     2011    4051
Synopsys    523     487     21      20
Approach    552     621     613     584
The experiments show significant gains by our approach compared with Syn-
opsys. The gain in performance is up to 30 times and in area is up to 15 times.
Moreover, Synopsys could not synthesize circuits for inputs X larger than 500
bits: the synthesis by Synopsys of the modulo function for a 600-bit input X
failed after nine days, whereas it takes only 20 minutes with our approach.
5 Conclusions and further research
Performance of computer arithmetic is one of the main advantages of RNS vs.
traditional approaches. We proposed a technique that significantly improves the
area and performance of RNS circuits with respect to synthesis using standard
EDA tools.
Fig. 3: Modulo function: area comparison (cells) of our approach vs. Synopsys
(number of cells for each modulus P)

a) Bit range [400:1]
P           461        977        2011      4051
Synopsys    90,821     92,106     25,946    27,655
Approach    5,759      6,509      7,581     7,168

b) Bit range [500:1]
P           461        977        2011      4051
Synopsys    1.09·10^5  1.2·10^5   32,154    35,370
Approach    7,804      7,665      8,931     7,926
Our approach is not limited to modular multiplication and to the modulo
function; it can be extended to any arithmetic operation. Dozens of circuits
were designed with the technique presented here and then embedded in
arithmetic units by the hi-tech factory Integral (Minsk, Belarus).
Topics of further research include:
1. Comparing different forms of representations (SOPs, Reed-Muller expan-
sions, binary decision diagrams) in the realization of partial products;
2. Designing FPGAs using Xilinx and Altera architectures.
References
1. N.I. Cherviakov et al., "Modular Structures of Parallel Computing Systems for
Neuroprocessors", Moscow, Russia, 2003, 288 p. (in Russian).
2. L. Sousa and R. Chaves, "A Universal Architecture for Designing Efficient Modulo
$2^n \pm 1$ Multipliers", IEEE Transactions on Circuits and Systems, 2005, Vol. 52,
No. 6, p. 1166-1178.
3. R. Zimmermann, "Efficient VLSI Implementation of Modulo $2^n + 1$ Addition and
Multiplication", 14th IEEE Symposium on Computer Arithmetic, Adelaide,
Australia, Apr. 14-16, 1999, p. 158-167.
4. B.M. Malashevich, "Unknown Modular Supercomputers", Proceedings of Conference
for 50 years of modular arithmetic, Nov. 23-25, 2005, p. 50-70 (in Russian).
5. H. Flatt, S. Hesselbarth, S. Flugel, and P. Pirsch, "A Modular Coprocessor
Architecture for Embedded Real-Time Image and Video Signal Processing", Embedded
Computer Systems: Architectures, Modeling, and Simulation, 7th International
Workshop, 2007, Samos, Greece, Proceedings, p. 241-250.
6. E. Ozturk, B. Sunar, E. Savas, "Low-Power Elliptic Curve Cryptography Using
Scaled Modular Arithmetic", Proceedings of the 6th International Workshop
Cryptographic Hardware in Embedded Systems, Cambridge, MA, USA, Aug. 11-13,
2004, Vol. 3156, p. 92-106.
7. P.L. Montgomery, "Modular Multiplication without Trial Division",
Mathematics of Computation, Vol. 44, No. 170, Apr. 1985, p. 519-521.
8. "Computers, software, engineering and digital devices", Ed. R.C. Dorf. Taylor and
Francis, 2006, 576 p.
9. A.R. Omondi and B. Premkumar, "Residue Number System: Theory and
Implementation", Imperial College Press, 2007.
10. J.T. Butler and T. Sasao, "Fast hardware computation of x mod z", 25th IEEE
International Parallel and Distributed Processing Symposium, Anchorage, AK, USA,
May 16-17, 2011, p. 289-292.
11. https://embedded.eecs.berkeley.edu/pubs/downloads/espresso/index.htm
12. P. Bibilo, L. Cheremisinova, S. Kardash, N. Kirienko, V. Romanov, and D.
Cheremisinov, "Automatizations of the logic synthesis of CMOS circuits with low
power consumption", Programnaia ingeniria, 2013, Vol. 8, p. 35-41 (in Russian).
13. P.V.A. Mohan, "Residue Number System. Theory and applications", Springer
International Publishing, 2016, 351 p.
14. V.P. Irhin, "Tabular implementation of modular arithmetic operations",
Sb.nauch.tr. YUbilejnoj Mezhdunarodnoj nauchno-tekhnicheskoj konferencii 50 let
modulyarnoj arifmetiki, 2005, p. 268-273 (in Russian).
15. M.A. Will and Ryan K. L. Ko, "Computing Mod Without Mod"
