Abstract. Residue number systems (RNS) represent numbers by their remainders modulo a set of relatively prime numbers. This paper proposes an efficient hardware implementation of modular multiplication and of the modulo function $X \pmod P$, based on Boolean minimization. We report experiments showing a performance advantage of up to 30 times for our approach over the results obtained with state-of-the-art industrial tools.
Introduction
The idea of the Residue Number System (RNS) goes back to an ancient Chinese source showing how to convert residues into numbers, and was later formalized by C.F. Gauss in the 19th century. Since the advent of digital computers, many papers have proposed algorithms to implement RNS efficiently on computers.
The main advantage of RNS is the speed and reliability of arithmetic computations [1, 2, 3]. The first application of RNS was in the search for prime numbers. Nowadays implementations of RNS can be found in anti-aircraft systems [4], neural computations [1], real-time signal processing (pattern recognition) [5], and cryptography [6]. Modular arithmetic (MA) is effective for processing large data flows (with several hundred or several thousand bits) [7].
RNS is a form of parallel data processing, where computer arithmetic is performed using the residues of the division by a pre-selected base of co-prime moduli $\{p_1, p_2, \ldots, p_m\}$. The residues have fewer digits than the original numbers, and arithmetic operations over the residues can be performed separately for each modulus of the base, resulting in faster processing (e.g., faster addition and multiplication) compared to other forms of parallel data processing.
Data processing in modular arithmetic includes the following steps. First, the input operands $A_1, A_2, \ldots, A_n$ are converted from positional to modular representation by computing the remainders (or residues) with respect to the moduli $\{p_1, p_2, \ldots, p_m\}$ (see left block in Fig. 1); then the arithmetic operations over the residues of the operands are computed for each modulus $p_i$, where $i = 1, \ldots, m$ (middle block in Fig. 1); finally, the results $S_1, S_2, \ldots, S_m$ for each modulus are converted back from modular to positional representation $S$ (see right block in Fig. 1). Conversion into modular representation (direct conversion) is realized by the modulo function $X \pmod P$, whose result is fed into the second step. The second step of the RNS computation requires performing modular summation, multiplication, and other arithmetic functions such as $A \cdot B + C$. The third step in RNS computes the polynomial form

$$S = S_1 \cdot C_1 + S_2 \cdot C_2 + \ldots + S_m \cdot C_m - r \cdot P,$$

where $S_1, S_2, \ldots$ are outputs of the previous step, $C_1, C_2, \ldots$ are precalculated constants, $r$ is a constant which is obtained during the computation of the polynomial, and $P = p_1 \cdot p_2 \cdot \ldots \cdot p_m$. In other words, the third step in RNS computes $(S_1 \cdot C_1 + S_2 \cdot C_2 + \ldots + S_m \cdot C_m) \pmod P$. Therefore, the main arithmetic operations needed for RNS computations are the modulo function $X \pmod P$, modular summation, and modular multiplication.
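The three steps can be illustrated in software. The following is a minimal sketch, not the hardware realization discussed in this paper: the function names `to_rns`, `crt_constants` and `from_rns` and the moduli {7, 11, 13} are assumptions made for the example, and the constants $C_i$ are taken to be the usual Chinese Remainder Theorem weights.

```python
from math import prod

def to_rns(x, moduli):
    """Step 1 (direct conversion): residues of x for every modulus."""
    return [x % p for p in moduli]

def crt_constants(moduli):
    """Precalculated constants C_i: C_i = 1 (mod p_i) and C_i = 0 (mod p_j), j != i."""
    P = prod(moduli)
    consts = []
    for p in moduli:
        M = P // p
        # pow(M, -1, p) is the modular inverse of M modulo p (Python 3.8+)
        consts.append(M * pow(M, -1, p))
    return consts

def from_rns(residues, moduli):
    """Step 3 (reverse conversion): S = sum(S_i * C_i) - r * P."""
    P = prod(moduli)
    total = sum(s * c for s, c in zip(residues, crt_constants(moduli)))
    r = total // P          # r is obtained while evaluating the polynomial
    return total - r * P

# Hypothetical moduli base with P = 7 * 11 * 13 = 1001
moduli = [7, 11, 13]
a, b, c = 23, 17, 5
ra, rb, rc = (to_rns(v, moduli) for v in (a, b, c))

# Step 2: A * B + C is computed independently for each modulus
rs = [(x * y + z) % p for x, y, z, p in zip(ra, rb, rc, moduli)]

# The result is recovered exactly because A * B + C = 396 < P
assert from_rns(rs, moduli) == a * b + c
```

Each residue channel in step 2 only handles numbers as wide as its modulus, which is why steps 1 and 3 dominate the hardware cost.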
A major limitation when processing large numbers in RNS is the complexity of the hardware realization of the converters (left and right blocks in Fig. 1). This is because computing the modulo function and recovering the positional representation require division, modular multiplication, and comparison. There are different approaches to solve this problem (e.g. [1, 3, 8]), but they are mostly restricted with respect to the moduli values (e.g., $2^k - 1$, $2^k$, $2^k + 1$) and to the number of operands. In this paper we describe algorithms for the modulo function $X \pmod P$ and for modular multiplication. We report experimental results comparing our approach with industrial tools (Synopsys and Mentor Graphics).
Hardware design of arithmetic units
The approach that we propose is characterized as follows:
1. It is valid for an arbitrary modulus and bit range of the inputs.
2. It can be applied to modular multiplication and to the modulo function.
3. It is based on combinational logic.

In the literature there are techniques to compute modular multiplication [9] and the modulo function [8, 9], but they are based on memory usage and require a large area with high latency. The proposed procedures share some common tasks:

1. Inputs (the input factors $A \cdot B$ in multiplication, or the input $X$ in $X \pmod P$) are split into subvectors.
2. All subvectors are combined to define a polynomial.
3. This procedure is iterated as long as the result is $> 2 \cdot P$.
Computation of the modulo function
The modulo function $X \pmod P$ can be computed by means of combinational or sequential circuits.
Some sequential realizations store pre-calculated values of the modulo function [13, 14], compute it using an automaton model [15], or resort to pipelining using a chain of homogeneous arithmetic blocks [10], where every term corresponds to an arithmetic block in hardware:
and $X \pmod P = R$, where $X = (x_{\psi}, x_{\psi-1}, \ldots, x_1)$ and $\delta$ is defined by the inequality $P \cdot 2^{\delta+1} < 2^{\psi} - 1 \le P \cdot 2^{\delta}$. Notice that $P$ can be an arbitrary number. Approaches with no memory that are efficient with respect to performance and area require special moduli sets [9], which consist of variations of $2^s \pm v$, where $v = 1, 3, 5$:
Given that in the RNS representation the moduli must be co-prime numbers, multiplication of two 1000-bit numbers using the moduli $\{2^s - 1, 2^{2 \cdot s}, 2^s + 1, 2^{s-1} - 1, 2^{s+1} - 1\}$ requires $s \approx 400$ bits, which impairs the computational efficiency of the transformation. The same multiplication can be realized using a set of smaller moduli, since there are more than 400 co-prime numbers of up to 12 bits. Note that, in order to represent numbers in RNS uniquely, the result of the calculation must not exceed the range given by the product of the moduli, $P = p_1 \cdot p_2 \cdot \ldots \cdot p_m$; under this constraint, $s$ is approximately a 400-bit number for the special moduli set above.
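The availability of enough small moduli can be checked with a short script. This is only a sketch under one assumption: the primes below $2^{12}$ are used as a convenient pairwise co-prime set (the text only requires co-primality, so other choices are possible). The names `primes_below`, `small_moduli` and `range_bits` are introduced here for illustration.

```python
from math import log2

def primes_below(limit):
    """Sieve of Eratosthenes: all primes p with p < limit."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"                      # 0 and 1 are not prime
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            # clear every multiple of n starting at n*n
            sieve[n * n::n] = bytearray(len(range(n * n, limit, n)))
    return [n for n in range(limit) if sieve[n]]

# Primes of at most 12 bits are pairwise co-prime by construction
small_moduli = primes_below(1 << 12)

# Number of bits covered by the product of all these moduli
range_bits = sum(log2(p) for p in small_moduli)

print(len(small_moduli), "moduli, dynamic range of about", int(range_bits), "bits")
```

The product of two 1000-bit numbers has at most 2000 bits, so only a subset of these moduli is needed to represent it uniquely.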
We propose the following two-step procedure to compute X(mod P ):
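1. $X$ is split into $k$ subvectors with $\le \delta$ bits in every subvector, where $\delta = [\log_2 P - 1]$.
2. The resulting subvectors are combined according to Eq. 1:

$$X \pmod P = \Big( \sum_{i=1}^{k} X_i \cdot \big( 2^{\delta \cdot (i-1)} \pmod P \big) \Big) \pmod P. \qquad (1)$$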
This formula can be applied recursively, producing reduced intermediate results at every step. Each coefficient $2^{\delta \cdot (i-1)} \pmod P$ is a constant that does not exceed $P - 1$. At the first step, Eq. 1 achieves its maximum value when $X_i = 2^{\delta} - 1$ for every $i$. Eq. 1 is then applied recursively until the result is $\le 2 \cdot P$. At the end, the result is compared with $P$ and, if needed, $P$ is subtracted from the result of the last step.
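A software model of this procedure might look as follows. It is a sketch under stated assumptions, not the synthesized circuit: $\delta$ is taken as the bit width of $P$, intermediate results are split into the same $\delta$-bit subvectors as the input, and the names `split` and `mod_by_subvectors` are hypothetical. In hardware the coefficients $2^{\delta \cdot (i-1)} \pmod P$ would be precalculated constants; here they are computed on the fly.

```python
def split(x, delta):
    """Split x into delta-bit subvectors X_1, X_2, ... (least significant first)."""
    mask = (1 << delta) - 1
    parts = []
    while True:
        parts.append(x & mask)
        x >>= delta
        if x == 0:
            return parts

def mod_by_subvectors(x, P):
    """Return (x mod P, list of intermediate sums) using the weighted-sum recursion."""
    delta = P.bit_length()   # assumption: delta equals the bit width of P
    s, trace = x, []
    # Iterate while the value is still at least 2 * P
    while s >= 2 * P:
        # Weighted sum of subvectors with constants 2^(delta*(i-1)) mod P
        s = sum(xi * pow(2, delta * i, P) for i, xi in enumerate(split(s, delta)))
        trace.append(s)
    # Final comparison with P and, if needed, one subtraction
    if s >= P:
        s -= P
    return s, trace

# 18-bit all-ones input and P = 47: the first weighted sum is 63 + 63*17 + 63*7
result, trace = mod_by_subvectors(2**18 - 1, 47)
assert trace[0] == 1575
assert result == (2**18 - 1) % 47
```

The loop terminates because every coefficient with $i > 1$ is smaller than the power of two it replaces, so each pass strictly decreases the value.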
For illustration, consider the following example. Suppose that $X$ is an 18-bit input and $P = 47$. Then the modulus $P$ is a 6-bit number, and the input $X$ is split into three 6-bit tuples $X = (X_3, X_2, X_1)$, where $X_1 = (x_6, x_5, \ldots, x_1)$, $X_2 = (x_{12}, x_{11}, \ldots, x_7)$, and $X_3 = (x_{18}, x_{17}, \ldots, x_{13})$. Then $2^6 \pmod{47} = 17 \pmod{47}$ and $2^{12} \pmod{47} = 7 \pmod{47}$. Hence, in the first iteration Eq. 1 takes the following form:

$$S_1 = X_1 + X_2 \cdot 17 + X_3 \cdot 7.$$
If the input is $X = 2^{18} - 1$, then its binary representation requires 18 bits, i.e., $X_1 = X_2 = X_3 = 63_{10} = 111111_2$. Then $S_1 \le 63 + 63 \cdot 17 + 63 \cdot 7 = 1575_{10} = 11000100111_2$. In this case Eq. 1 takes the following form: $S_1 \pmod{47}$ = S 
Computation of the modular product
We propose the following two-step procedure to compute the product $A \cdot B = R \pmod P$, where $A = (A_{\delta}, A_{\delta-1}, \ldots, A_1)$, $B = (B_{\delta}, B_{\delta-1}, \ldots, B_1)$, and the subvectors $A_{\delta}$ and $B_{\delta}$ consist of the most significant bits. For example, if $A$ and $B$ are 12-bit numbers and $\delta = 4$, then $A_4 = (a_{12}, a_{11}, a_{10})$ and $B_4 = (b_{12}, b_{11}, b_{10})$, where $a_{12}$ and $b_{12}$ are the most significant bits.
This contribution proposes a computation of the modulo function for an arbitrary modulus, without limitation on the value of $P$. The idea of the approach is to use a large set of small moduli instead of a small set of large moduli, as is traditionally done. Hence we consider $A$, $B$ and $P$ ranging from 6 to 12 bits.

1. The inputs are split into 2-, 3- and 4-bit subvectors.
2. The corresponding pairs of subvectors are multiplied applying the following recursive formula:

The maximum value of $S_{temp}$ does not exceed $2^{3 \cdot \delta + 2}$, $2^{3 \cdot \delta + 3}$ or $2^{3 \cdot \delta + 4}$, depending on the value of the modulus $P$.
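As an illustration, consider three common cases:

Finally, if $S_{temp2} > P$, then $S = S_{temp2} - P$; otherwise $S = S_{temp2}$.

Let us multiply two 6-bit numbers, $A \cdot B = S \pmod{47}$. Splitting each operand into two (i.e., $\delta = 2$) 3-bit subvectors, Eq. 2 is transformed into the following form:

$$A_1 \cdot B_1 \pmod{47} + A_1 \cdot B_2 \cdot 2^3 \pmod{47} + A_2 \cdot B_1 \cdot 2^3 \pmod{47} + A_2 \cdot B_2 \cdot 2^6 \pmod{47} = S_{temp}.$$

When $A = 45$ and $B = 15$, $S_{temp}$ achieves the maximum value, which is $158_{10} = 10011110_2$:

$$5 \cdot 7 \pmod{47} + 5 \cdot 1 \cdot 2^3 \pmod{47} + 5 \cdot 7 \cdot 2^3 \pmod{47} + 5 \cdot 1 \cdot 2^6 \pmod{47} = 35 \pmod{47} + 40 \pmod{47} + 45 \pmod{47} + 38 \pmod{47} = 158.$$

For other values of $A$ and $B$, $S_{temp} < 158$.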
The second iteration reduces $S_{temp}$ to a value below $2 \cdot 47$, so that at most one subtraction of $P$ remains. Assume that $S_{temp} = 158$; then $S_{temp2} = 6 + 3 \cdot 2^3 \pmod{47} + 2 \cdot 2^6 \pmod{47} = 6 + 24 + 34 = 64$. Finally, taking into account that $64 > 47$, the result is $S = 64 - 47 = 17$. Note that the bit range of $S_{temp}$ is preselected.
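The worked example can be reproduced with a small software model. This is a sketch that assumes the partial-product form used in the example above (every product $A_i \cdot B_j \cdot 2^{3(i+j-2)}$ is reduced modulo $P$ before summation); the function names and the loop condition are illustrative choices, not the paper's circuit.

```python
def split_fixed(x, bits, count):
    """Split x into `count` subvectors of `bits` bits (least significant first)."""
    mask = (1 << bits) - 1
    return [(x >> (bits * i)) & mask for i in range(count)]

def split_all(x, bits):
    """Split x into as many `bits`-wide subvectors as it needs."""
    parts = []
    while True:
        parts.append(x & ((1 << bits) - 1))
        x >>= bits
        if x == 0:
            return parts

def modmul_by_subvectors(a, b, P, bits, count):
    """Return (a * b mod P, [S_temp, S_temp2, ...]) for operands of bits * count bits."""
    A, B = split_fixed(a, bits, count), split_fixed(b, bits, count)
    # First iteration: sum of the partial products, each reduced modulo P
    s = sum((A[i] * B[j] * 2 ** (bits * (i + j))) % P
            for i in range(count) for j in range(count))
    trace = [s]
    # Further iterations: re-split the sum and reduce every weighted subvector
    while s >= 2 * P:
        s = sum((part * 2 ** (bits * i)) % P
                for i, part in enumerate(split_all(s, bits)))
        trace.append(s)
    # Final comparison; >= is used so that s == P is also reduced
    if s >= P:
        s -= P
    return s, trace

# A = 45, B = 15, P = 47, two 3-bit subvectors per operand
result, trace = modmul_by_subvectors(45, 15, 47, bits=3, count=2)
assert trace == [158, 64]
assert result == 17
```

In this model the second sum for the example is 64, matching the value of $S_{temp2}$ computed in the text before the final subtraction.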
Boolean minimization in modular operations
The result of any arithmetic computation can be represented as a sum-of-products (SOP). However, the original representation given by truth tables may be unmanageable by synthesis tools, e.g., the truth table of the product of two 16-bit input operands requires 64 columns (16 columns for each operand and 32 columns for the result) and more than four billion rows.
For a pair of $\delta$-bit tuples, consider the product $X_i \cdot 2^{\delta \cdot (i-1)} \pmod P$, where $X_i$ and $2^i \pmod P$, $i = 1, 2, \ldots, k$, are the corresponding factors of the multiplication. Then $2^i \pmod P$ is a constant whose bits are redundant in the minimization, because all rows in the truth table corresponding to this constant have the same value $2^i \pmod P$. The initial truth table for $X \pmod P$ consists of $P$ rows and $2 \cdot \delta$ columns, where the left $\delta$ columns correspond to all integers from 0 up to $P - 1$, and the right columns correspond to $X \cdot 2^i \pmod P$.

Example. Consider $2^8 \pmod{13} = 9 \pmod{13} = 1001_2$. In this case, subtable 1 represents the truth table for $X \cdot 9 \pmod{13}$ before minimization and subtable 2 represents the SOP after minimization (it can be obtained by tools like [11] or ELS [12]). So the first four bits in the last row of the truth table in subtable 1 represent $12_{10}$, and the right four bits represent $12 \cdot 9 \pmod{13} = 4_{10}$.

For the 18-bit input $X$ and $P = 47$, all pairs of corresponding factors are represented as a SOP: with 12 columns (6 inputs and 6 outputs) $X_2 \cdot 17 \pmod{47}$ and $X_3 \cdot 7 \pmod{47}$; with 11 columns (5 inputs and 6 outputs) X 
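The initial truth table for a constant multiplication such as $X \cdot 9 \pmod{13}$ can be generated as sketched below. The helper `truth_table` is a hypothetical name, and only the table before minimization is produced; the minimization itself is left to an external tool, as in the text.

```python
def truth_table(c, P, delta):
    """Rows (input bits, output bits) of X -> X * c mod P for X = 0 .. P - 1."""
    fmt = f"0{delta}b"                      # fixed-width binary with delta digits
    return [(format(x, fmt), format(x * c % P, fmt)) for x in range(P)]

# The constant 2^8 mod 13 = 9 is folded into the outputs, so it needs no input columns
table = truth_table(pow(2, 8, 13), 13, 4)

for inputs, outputs in table:
    print(inputs, outputs)

# Last row: input 12 = 1100, output 12 * 9 mod 13 = 4 = 0100
assert table[-1] == ("1100", "0100")
```

The table has $P = 13$ rows and $2 \cdot \delta = 8$ columns; input codes from 13 to 15 do not appear in it.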
Experimental results
We compared our procedure against three electronic design automation (EDA) tools: Synopsys, Mentor Graphics (for standard cells), and Xilinx (for FPGAs). Since Mentor Graphics and Xilinx do not synthesize modular operations, we compared against them using special moduli, such as $2^s - 1$ and $2^s + 1$. In these cases our approach shows minor gains, within 10%.
Synopsys is the only EDA tool which generates X(mod P ) circuits. We report results of synthesis using Synopsys 2014 on 28 nm Standard Cell ASIC technology from United Microelectronics Corporation. Plots 2 a) and 2 b) compare the latency of circuits of X(mod P ) (in MHz) for inputs X of 400 and 500 bits, and for moduli P of 10, 11, and 12 bits, respectively. Plots 3 a) and 3 b) compare the area of circuits of X(mod P ) (number of cells from the library cells) for inputs X of 400 and 500 bits, and for moduli P of 10, 11, and 12 bits, respectively. The experiments show significant gains by our approach compared with Synopsys. The gain in performance is up to 30 times and in area is up to 15 times. Moreover, Synopsys could not synthesize circuits for inputs X larger than 500 bits: the synthesis by Synopsys of the modulo function for a 600-bit input X failed after nine days, whereas it takes only 20 minutes with our approach.
Conclusions and further research
Performance of computer arithmetic is one of the main advantages of RNS over traditional approaches. We proposed a technique that significantly improves the area and performance of RNS with respect to synthesis using standard EDA tools. Our approach is not limited to modular multiplication and the modulo function; it can be extended to any arithmetic operation. Dozens of circuits were designed with the technique presented here and then embedded in arithmetic units by the hi-tech factory Integral (Minsk, Belarus).
Topics of further research include:
