In this paper, a new class of Hierarchical Residue Number Systems (HRNSs) is proposed, where the numbers are represented as a set of residues modulo factors of $2^k \pm 1$ and modulo $2^k$. The converters between the proposed HRNS and the positional binary number system can be built as 2-level structures using efficient circuits designed for the RNS $(2^k - 1, 2^k, 2^k + 1)$.
Introduction
In Residue Number Systems (RNSs), an integer $X$ is represented as a set of residues $X_i$ modulo given coprime moduli $M_i$ (Soderstrand et al., 1986). The moduli set is called the system base and the residues are called residue digits. The dynamic range of the system, i.e., the number of possible number representations, is defined as the product of all RNS moduli. On the residue representation, addition, subtraction and multiplication can be done without carry propagation between residue digits.
Since residue digits are usually much smaller than the represented number $X$, arithmetic circuits for individual digits are smaller and faster than circuits for the binary representation of $X$. This property allows building arithmetic units for large numbers as a set of small and fast circuits (Mohan, 2002). Residue arithmetic also allows a significant reduction of power consumption, especially in multipliers (Chokshi et al., 2009) and multiplier-accumulators (Piestrak and Berezowski, 2008a; 2008b). However, the implementation of other arithmetic operations (e.g., division, sign detection, number comparison, overflow detection) is complex in the RNS and should be avoided. Additionally, the residue arithmetic channels have to be connected with the rest of a digital system through converters between the RNS and the binary positional number system.
RNSs are especially useful in algorithms where multiplications and additions dominate, e.g., in digital signal and image processing (Wnuk, 2008), where RNS multiply-and-accumulate circuits can be used (Piestrak and Berezowski, 2008b). An RNS-based digital image processing application is proposed by Wang et al. (2004), while RNS-based Finite Impulse Response (FIR) filters are described by Conway and Nelson (2004) as well as Piestrak and Berezowski (2008b). An interesting review of RNS potential and applications is presented by Mohan (2002) and Soderstrand et al. (1986). The general scheme of an RNS circuit is shown in Fig. 1. The circuit consists of three main parts: the forward converter, the residue arithmetic channels and the reverse converter. The forward converter translates binary input numbers into the set of residue digits. Next, for the residue representation, additions, subtractions and multiplications are executed using independent circuits performing calculations modulo the appropriate moduli. The final part of the circuit is the reverse converter, which computes the positional binary representation from the residue representation obtained after the computations have been done in the arithmetic channels modulo $M_i$. The converters introduce hardware overhead, thus the use of the RNS is justified only if the savings in the residue channels outweigh the conversion cost. It should also be noted that all results are computed modulo the RNS dynamic range and there is no simple method to detect overflow.

[Fig. 1. General scheme of an RNS circuit: positional binary input, forward converter, residues, arithmetic channels (+, −, ×) modulo $M_1$, $M_2$, ..., $M_n$, reverse converter, positional binary output.]

T. Tomczak
One of the main problems in RNSs is the selection of the system base. The RNS base should be chosen for each application individually to get a suitable dynamic range, circuit speed and complexity. For the required dynamic range, a trade-off between the speed of arithmetic units and conversion complexity has to be found. Small moduli allow building small and fast arithmetic units, but the number of moduli in the RNS base and, consequently, the complexity of converters grow. On the other hand, arithmetic operations for large moduli could be costly. The combination of small moduli, a large dynamic range and simple converters is possible in Hierarchical Residue Number Systems (HRNSs) .
In HRNSs, the values of all or only some residue digits are represented in residue systems with dynamic ranges smaller than the range of the main system (Akusskij and Judickij, 1968; Yassine, 1992; Skavantzos and Abdallah, 1999). The system used to represent residue digits is called the lower level RNS, whereas the system whose digits are represented in the lower level RNS is called the higher level RNS. The digits of the lower level RNS can also be represented in the next RNS, leading to a multilevel HRNS. The highest system in the hierarchy is called the top level RNS, and the lowest system is called the bottom level RNS. The dynamic ranges of lower level RNSs can be chosen in two ways. In the first approach (Akusskij and Judickij, 1968; Yassine, 1992), the dynamic ranges of the lower level RNS are large enough to represent intermediate results. For example, if a multiplication is to be done, then the lower level RNS range has to be at least the square of the higher level modulus. The advantage of this solution is that the same moduli can be used for the representation of different residue digits. As an example, consider the top level RNS (17, 19, 20, 21) and a single multiplication as the arithmetic operation to compute. Since the maximum values of the residue digits are 16, 18, 19 and 20, the dynamic ranges of the bottom level RNS have to be at least $16^2 + 1 = 257$, $18^2 + 1 = 325$, $19^2 + 1 = 362$ and $20^2 + 1 = 401$. Thus, the RNS (3, 4, 5, 7) with the dynamic range 420 can be used for all four digits. It is then possible to build an HRNS with the range $17 \cdot 19 \cdot 20 \cdot 21 > 2^{17}$ using 3-bit moduli. Unfortunately, the main disadvantage of this solution is the fast growth of the lower level RNS dynamic ranges. Additionally, the converters between consecutive levels have to be used after a small number of arithmetic operations; in the above example, after one multiplication.
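The first approach can be sketched in Python as follows. This is only an illustrative software model under the assumptions of the example above (top level base (17, 19, 20, 21), shared bottom level base (3, 4, 5, 7)); the helper names `to_rns`, `from_rns`, `mul_digit` and `hrns_mul` are hypothetical and do not come from the cited works.

```python
from math import prod

TOP = (17, 19, 20, 21)   # top level base from the example
BOTTOM = (3, 4, 5, 7)    # shared bottom level base, dynamic range 420

def to_rns(x, base):
    """Residues of x for every modulus in base."""
    return tuple(x % m for m in base)

def from_rns(digits, base):
    """Plain CRT reconstruction (reference model, not a hardware design)."""
    M = prod(base)
    total = 0
    for d, m in zip(digits, base):
        Mi = M // m
        total += d * Mi * pow(Mi, -1, m)   # modular inverse, Python 3.8+
    return total % M

def mul_digit(a, b, top_modulus):
    """Multiply two top level digits inside the bottom level RNS."""
    da, db = to_rns(a, BOTTOM), to_rns(b, BOTTOM)
    dp = tuple((x * y) % m for x, y, m in zip(da, db, BOTTOM))
    product = from_rns(dp, BOTTOM)   # exact, because a * b < 420
    return product % top_modulus     # inter-level conversion after one multiplication

def hrns_mul(x, y):
    """Product of x and y as a tuple of top level digits."""
    xs, ys = to_rns(x, TOP), to_rns(y, TOP)
    return tuple(mul_digit(a, b, m) for a, b, m in zip(xs, ys, TOP))
```

The call to `from_rns` inside `mul_digit` models the inter-level conversion that this approach needs after every multiplication, which is the overhead the text points out.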
In the second approach (Skavantzos and Abdallah, 1999), the top level RNS base is chosen from numbers factorisable into small factors. The lower level RNS bases are then built from the factors of the corresponding moduli. In this method, the range of each lower level RNS is equal to the corresponding modulus of the higher level base. Thus, performing calculations in the lower level RNS is equivalent to performing them modulo the higher level moduli. The conversion between consecutive levels can be done once for all arithmetic operations, which results in low hardware overhead. However, the main disadvantage of this idea is the difficulty of finding the top level RNS base. In the work of Skavantzos and Abdallah (1999), this base is chosen from moduli $2^{2k} - 1$ and the bottom level RNSs are $(2^k - 1, 2^k + 1)$. This allows implementing converters and arithmetic units as simple structures based on Eqns. (12) and (13) presented in the next section. However, since all the moduli in the RNS base have to be coprime, in the HRNS by Skavantzos and Abdallah (1999) the difference in moduli widths can be large. Moreover, some moduli values can be close to the system dynamic range, so that any advantages are lost.
In this paper, a new method for HRNS base construction is proposed. The base includes the moduli $2^k \pm 1$ factorisable into small divisors. This approach allows building input/output converters as two-level circuits, as shown in Fig. 2. In the top level RNS, the conversion between large numbers (close to the system range) and the residues modulo $2^k \pm 1$ is done. In the bottom level RNS, transformations between residues modulo $2^k \pm 1$ and residues modulo factors of $2^k \pm 1$ are performed. We shall show that, due to an efficient implementation of operations modulo $2^k \pm 1$ for large numbers, the area and critical path delay of the proposed two-level converters are small. Additionally, arithmetic operations are performed modulo small moduli, and thus adders and multipliers can be implemented as small and fast circuits.

[Fig. 2. Two-level structure of the proposed HRNS circuit; recoverable labels: X, Converter, Factors conv., Arithmetic channels.]

This paper is organized as follows. Section 2 offers theoretical background including basic definitions, reverse conversion algorithms and methods of efficient implementation of residue operations using the periodicity property of the series of powers of 2 taken modulo $M_i$. Section 3 contains the proposition of a new HRNS class with the analysis of the multiplier's complexity and detailed formulas describing the conversion between the proposed HRNS and the positional binary number system. Section 4 presents detailed information about the implementation of reverse converters for the proposed HRNS and a comparison of the converters' hardware complexity with that of other RNSs. Section 5 summarizes the paper and offers conclusions.
RNS basics
The RNS base is a set of $n$ positive, pairwise coprime moduli

$$\{M_1, M_2, \ldots, M_n\}, \qquad M = \prod_{i=1}^{n} M_i.$$

According to the Chinese remainder theorem (Soderstrand et al., 1986), any integer $X \in [A, A + M)$ for any $A$ can be uniquely represented in the RNS as an $n$-tuple $(X_1, X_2, \ldots, X_n)$. The residue digits $X_i \in [0, M_i)$ are remainders of the division of $X$ by $M_i$. This is denoted by

$$X_i = |X|_{M_i}.$$

Two integers $X$ and $Y$ which have the same residue when divided by a given modulus $M_i$ are called congruent modulo $M_i$. This is denoted by

$$X \equiv Y \pmod{M_i}.$$

Addition, subtraction and multiplication can be executed for each modulus $M_i$ independently,

$$W_i = |X_i \circ Y_i|_{M_i}, \qquad \circ \in \{+, -, \times\}. \tag{4}$$

The result $W$ of the operations defined by (4) is computed modulo $M$, but there is no simple way to detect overflow. The dynamic range of the RNS can be extended by adding moduli to the base, but, unlike in weighted number systems, the values of the additional digits cannot be easily obtained during computations. Accordingly, the dynamic range of the RNS must be large enough to represent even the largest result of computations.
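A short Python sketch may make the channel-wise arithmetic and the silent wrap-around modulo $M$ concrete. The base (7, 8, 9) and the helper names are chosen here purely for illustration; the reverse step is a plain CRT reference model.

```python
from math import prod

BASE = (7, 8, 9)     # (2^3 - 1, 2^3, 2^3 + 1), pairwise coprime
M = prod(BASE)       # dynamic range, 504

def forward(x):
    """Forward conversion: one residue digit per modulus."""
    return tuple(x % m for m in BASE)

def channel_op(xs, ys, op):
    """Apply op digit by digit; no carries pass between channels."""
    return tuple(op(a, b) % m for a, b, m in zip(xs, ys, BASE))

def reverse(digits):
    """Reverse conversion by the classical CRT (Python 3.8+ for pow(.., -1, ..))."""
    total = sum(d * (M // m) * pow(M // m, -1, m) for d, m in zip(digits, BASE))
    return total % M

x, y = 21, 19
w_add = reverse(channel_op(forward(x), forward(y), lambda a, b: a + b))
w_mul = reverse(channel_op(forward(x), forward(y), lambda a, b: a * b))
# 30 * 20 = 600 exceeds the dynamic range, so the result wraps modulo M
w_over = reverse(channel_op(forward(30), forward(20), lambda a, b: a * b))
```

Nothing in the residue digits of `w_over` signals that the true product was out of range, which is why the dynamic range has to be sized for the largest expected result.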
Periodicity property.
The periodicity property of the series of powers of 2 taken modulo $M_i$, presented by Piestrak (1994), can be used for efficient implementation of residue arithmetic circuits and of forward and reverse converters. The periodicity property results from Euler's theorem (Biernat, 2007), which states that for coprime, positive integers $X$ and $M$,

$$|X^{\varphi(M)}|_M = 1,$$

where $\varphi(M)$ is the totient function. Thus, the residues of consecutive powers of $X$ repeat with a period equal to at most $\varphi(M)$. Let $2^j$, $2^k$, $j < k$, denote two different powers of 2 and let the distance between $2^j$ and $2^k$ be defined as $k - j$. According to Piestrak (1994), the period $P(M_i)$ of the odd modulus $M_i$ is defined as the minimum distance between two different powers of 2 for which the residues modulo $M_i$ are equal, i.e.,

$$P(M_i) = \min\{k - j : |2^j|_{M_i} = |2^k|_{M_i}\}.$$

The half-period $HP(M_i)$ of the odd modulus $M_i$ is the minimum distance between two different powers of 2 for which the residues modulo $M_i$ are additive inverses, i.e.,

$$HP(M_i) = \min\{k - j : |2^j + 2^k|_{M_i} = 0\}.$$

[...] the residues of a binary number $X$ modulo $2^k \pm 1$ are defined as

$$|X|_{2^k - 1} = \Big|\sum_{a} B_a\Big|_{2^k - 1} \tag{12}$$

and

$$|X|_{2^k + 1} = \Big|\sum_{a} (-1)^a B_a\Big|_{2^k + 1}, \tag{13}$$

with $B_a$ denoting the $a$-th $k$-bit field of the binary representation of $X$, counted from the least significant one.

According to Eqns. (12) and (13), the residues $|X|_{2^k \pm 1}$ can be computed as a sum or a difference modulo $2^k \pm 1$ of $k$-bit fields $B_a$. The addition can be done with multi-operand modulo adders (MOMAs) built using carry-save adders (CSAs) with end-around carry (EAC). Detailed design guidelines and an analysis of periodicity-based MOMAs for various moduli are presented by Piestrak (1994).
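The definitions of the period and half-period, and the field-wise computation of residues modulo $2^k \pm 1$, can be checked with a small Python model. This is a sketch only: the function names are hypothetical, the period is computed by brute force, and the residue functions assume that $X$ is split into $k$-bit fields $B_a$ starting from the least significant bits.

```python
def period(m):
    """P(m): smallest d > 0 with 2^d congruent to 1 modulo m (odd m > 1)."""
    d, r = 1, 2 % m
    while r != 1:
        r = (2 * r) % m
        d += 1
    return d

def half_period(m):
    """HP(m): smallest d > 0 with 2^d congruent to -1 modulo m, or None."""
    r = 2 % m
    for d in range(1, period(m) + 1):
        if r == m - 1:
            return d
        r = (2 * r) % m
    return None

def fields(x, k):
    """Split x into k-bit fields B_0, B_1, ... (least significant first)."""
    out = []
    while x:
        out.append(x & ((1 << k) - 1))
        x >>= k
    return out

def residue_minus(x, k):
    """|x| modulo 2^k - 1 as a sum of k-bit fields (2^k is congruent to 1)."""
    return sum(fields(x, k)) % ((1 << k) - 1)

def residue_plus(x, k):
    """|x| modulo 2^k + 1 as an alternating sum of fields (2^k is congruent to -1)."""
    signed = (b if a % 2 == 0 else -b for a, b in enumerate(fields(x, k)))
    return sum(signed) % ((1 << k) + 1)
```

For instance, `period(11)` and `half_period(11)` give 10 and 5, while `half_period(7)` is `None` because no power of 2 is congruent to −1 modulo 7.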
The idea of the use of EAC is shown in Fig. 3 on the basis of a ripple-carry adder (RCA) built from full-adder cells (FA). Since the weight $2^k$ of the carry-out bit of the $k$-bit sum has the residue 1 modulo $2^k - 1$, the carry-out can be added at the least significant position of the adder. Notice that the adder shown in Fig. 3 can produce a result equal to $2^k - 1$ instead of 0, e.g., when adding $2^k - 2$ and 1. Such an adder is called an adder with double zero representation. When the double zero representation is undesirable, additional circuits are required or other structures of adders modulo $2^k - 1$ should be used, e.g., parallel prefix adders (PPAs) (Biernat, 2007). A VHDL library of residue adders and multipliers using the periodicity property is presented by Zimmermann (1998).
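The behaviour of an adder with end-around carry, including the double zero representation, can be modelled in a few lines of Python. This is a behavioural sketch with a hypothetical function name, not a description of the circuit in Fig. 3.

```python
def add_eac(a, b, k):
    """Add two k-bit operands modulo 2^k - 1 with end-around carry.

    The carry-out has weight 2^k, which is congruent to 1 modulo 2^k - 1,
    so it is added back at the least significant position.
    The result may be 2^k - 1, a second representation of zero.
    """
    mask = (1 << k) - 1
    s = a + b
    return (s & mask) + (s >> k)

k = 3
double_zero = add_eac((1 << k) - 2, 1, k)   # 6 + 1 gives 7, congruent to 0 modulo 7
```

Reducing the output once more (mapping $2^k - 1$ to 0) removes the double zero at the cost of extra logic, which corresponds to the additional circuits mentioned above.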
Reverse conversion.
Reverse conversion algorithms are based on the Chinese Remainder Theorem (CRT) or Mixed Radix Conversion (MRC). In the classical CRT converters, the value of $X$ is computed as

$$X = \Big|\sum_{i=1}^{n} \hat{M}_i \, \big| \hat{M}_i^{-1} X_i \big|_{M_i}\Big|_M, \qquad \hat{M}_i = M / M_i.$$

In the MRC converters (Soderstrand et al., 1986), $X$ is defined as

$$X = a_1 + a_2 M_1 + a_3 M_1 M_2 + \cdots + a_n M_1 M_2 \cdots M_{n-1},$$

where the mixed-radix digits $a_i \in [0, M_i)$ are computed successively from the residues $X_i$.

Recently, an algorithm called the new Chinese remainder theorem II has been developed (Wang, 2000). For a system defined by two moduli $(M_1, M_2)$, the reverse conversion can be performed according to the equation

$$X = X_2 + M_2 \cdot \big| q_{2,1} \cdot (X_1 - X_2) \big|_{M_1}, \qquad q_{2,1} = \big| M_2^{-1} \big|_{M_1}. \tag{18}$$

The only requirement for $M_1$ and $M_2$ is that they are relatively prime. Converters based on (18) can also be used in an RNS with a base consisting of any number of pairwise prime moduli. The converters are then built as multi-level structures (Fig. 4). On each level the moduli are grouped in pairs and Eqn. (18) is applied to each pair, which yields a new residue modulo the product of the moduli from the pair. The computed residues are then the input to the next level, where they are grouped in new pairs and the whole process is repeated. On the last level, two moduli roughly equal to $\sqrt{M}$ are used. Since in (18) only one modulo operation, performed on the difference $X_1 - X_2$, is required and the other operations (multiplication by $M_2$ and addition of $X_2$) are done using positional arithmetic, this approach can bring a meaningful simplification compared with the CRT, where a multi-operand addition modulo the product of all moduli is needed.
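The pairwise, multi-level scheme can be sketched in Python as follows. The sketch assumes the two-moduli formula in the form $X = X_2 + M_2 \cdot |q \cdot (X_1 - X_2)|_{M_1}$ with $q = |M_2^{-1}|_{M_1}$, which matches the description above (one modulo-$M_1$ operation on the difference, then a multiplication by $M_2$ and an addition of $X_2$). The function names and the pairing order are assumptions made for illustration.

```python
def merge_pair(x1, m1, x2, m2):
    """Combine residues (x1 mod m1, x2 mod m2) into one residue mod m1 * m2."""
    q = pow(m2, -1, m1)                       # multiplicative inverse of m2 modulo m1
    x = x2 + m2 * ((q * (x1 - x2)) % m1)      # only one reduction, modulo m1
    return x, m1 * m2

def reverse_convert(residues, moduli):
    """Multi-level reverse conversion: pair up, merge, repeat."""
    level = list(zip(residues, moduli))
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (x1, m1), (x2, m2) = level[i], level[i + 1]
            nxt.append(merge_pair(x1, m1, x2, m2))
        if len(level) % 2:                    # odd element passes to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0][0]

moduli = (3, 5, 7, 11, 13)
value = 12345
recovered = reverse_convert([value % m for m in moduli], moduli)
```

How the moduli are paired affects the operand widths on each level; the simple left-to-right pairing used here is not meant to be optimal.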
Besides the above general algorithms, there are also reverse converters optimized for specific moduli sets (Cao et al., 2003; 2007; Molahosseini et al., 2010; Wang et al., 2000; 2003). Among many different moduli sets, the set $(2^k - 1, 2^k, 2^k + 1)$ has often been investigated and many efficient reverse converters have been developed for it (Piestrak, 1995; Wang et al., 2000; Mohan, 2001; Bi et al., 2004).
New HRNS class
In the proposed HRNS, the top level base is chosen from moduli $2^k \pm 1$ factorisable into small numbers. The factors of the moduli $2^k \pm 1$ then form the bases of the bottom level RNSs, so that the range of each bottom level RNS is equal to the corresponding modulus $2^k \pm 1$. The number of possible systems is limited by the number of factorisable numbers $2^k \pm 1$. The number $2^k - 1$ is not prime if $k$ is not prime (Biernat, 2007), since, for $k = i \cdot j$,

$$2^{ij} - 1 = (2^i - 1)\big(2^{i(j-1)} + 2^{i(j-2)} + \cdots + 2^i + 1\big).$$

There are also non-prime numbers $2^k - 1$ for prime $k$, e.g., $2^{11} - 1 = 23 \cdot 89$. The numbers $2^k + 1$ are not prime if $k$ is neither prime nor a power of two (Biernat, 2007), since, for any odd $j$,

$$2^{ij} + 1 = (2^i + 1)\big(2^{i(j-1)} - 2^{i(j-2)} + \cdots - 2^i + 1\big).$$
The condition that $2^k \pm 1$ is not prime is not sufficient to build an efficient HRNS, since many non-prime numbers have large factors. As an example, consider $2^{28} + 1 = 17 \cdot 15790321$. The arithmetic units (e.g., multipliers) modulo 15790321 can be larger and slower than those modulo $2^{28} + 1$. The limit of the factor size for efficient implementations depends strongly on the structure of the arithmetic units and on the implementation technology. Table 1 lists some factorisable numbers $2^k \pm 1$ which can be used as a starting point when searching for the required HRNS.
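Whether a candidate modulus $2^k \pm 1$ splits into small factors can be checked quickly in software. The following Python sketch uses plain trial division, which is adequate for the values of $k$ discussed here; the function name is hypothetical.

```python
from math import prod

def factorize(n):
    """Prime factors of n (with multiplicity) by trial division."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

small = factorize(2**11 - 1)           # composite although k = 11 is prime
largest = max(factorize(2**28 + 1))    # a large factor makes this modulus unattractive
```

Prime powers such as 9 or 25 appear here as repeated primes; in an RNS base they would be merged into a single modulus so that the moduli stay pairwise coprime.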
Many different HRNS classes can be built using moduli from Table 1. For large dynamic ranges the moduli with small factors can be combined, e.g., $2^{50} - 1$, $2^{84} - 1$, [...] or $2^{102} - 1$.

[Table 1 fragment, factor lists recovered from the extraction: (7, 13, 17, 19, 27, 37, 73, 109, 241, 433, 38737); (9, 13, 29, 43, 49, 113, 127, 337, 1429, 5419, 14449); (11, 31, 41, 101, 125, 251, 601, 1801, 8101, 4051, 268501); (9, 103, 307, 2143, 2857, 6529, 11119, 43691, 131071).]

In this paper, only HRNSs with the top level base $(2^k - 1, 2^k, 2^k + 1)$ are investigated. This approach allows using very efficient converter structures designed for the RNS $(2^k - 1, 2^k, 2^k + 1)$. Moreover, since one of the moduli is $2^k$, arithmetic operations and residue generators for this modulus are very simple (Piestrak, 1994). One disadvantage of this solution is a dynamic range limit of 90 bits, because for $k > 30$ it is difficult to find a pair $2^k \pm 1$ factorizable into small numbers. The proposed HRNS has two levels. Arithmetic computations are performed at the bottom level RNS, whose base consists of the factors of $2^k \pm 1$ and the modulus $2^k$. The converters can be built as hierarchical structures, where large number computations are done with converters for the RNS $(2^k - 1, 2^k, 2^k + 1)$. Additionally, the proposed HRNS allows performing some difficult operations (e.g., sign detection (Tomczak, 2008)) after partial conversion to the RNS $(2^k - 1, 2^k, 2^k + 1)$. Thus, the proposed HRNS class allows using small moduli in arithmetic channels and simple, efficient converters developed for the RNS $(2^k - 1, 2^k, 2^k + 1)$.

[...] Next, the residues modulo $2^{30} - 1$ are written using the RNS (7, 9, 11, 31, 151, 331), whereas the residues modulo $2^{30} + 1$ are written using the RNS (13, 25, 41, 61, 1321). Thus, the representations of the numbers $X$ and $Y$ in the bottom level RNS are $X = (3, 3, 4, 15, 75, 164, 1073741823, 8, 15, 15, 24, 788)$ and $Y = (2, 2, 3, 14, 74, 163, 1073741822, 7, 14, 14, 23, 787)$. Next, the value of $W = X \cdot Y$ is computed in the bottom level RNS according to Eqn. (4) as $W = (6, 6, 1, 24, 114, 252, 2, 4, 10, 5, 3, 607)$, which is a representation of the required result 309485009821292292166647810.
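The example can be replayed in Python. The operand values are not given in the surviving text; the sketch below assumes $X = 2^{44} - 1$ and $Y = 2^{44} - 2$, which were inferred here from the listed residues and should be read as an assumption. The bottom level base is ordered as in the text: factors of $2^{30} - 1$, then $2^{30}$, then factors of $2^{30} + 1$.

```python
from math import prod

LOW = (7, 9, 11, 31, 151, 331)    # factors of 2^30 - 1
HIGH = (13, 25, 41, 61, 1321)     # factors of 2^30 + 1
BASE = LOW + (2**30,) + HIGH      # bottom level base of the HRNS

X = 2**44 - 1   # assumption: inferred from the residues listed in the text
Y = X - 1       # assumption: inferred from the residues listed in the text

def to_rns(v):
    """Bottom level representation of v."""
    return tuple(v % m for m in BASE)

def from_rns(digits):
    """CRT reference model for checking the result (Python 3.8+)."""
    M = prod(BASE)
    total = sum(d * (M // m) * pow(M // m, -1, m) for d, m in zip(digits, BASE))
    return total % M

# digit-wise multiplication in the arithmetic channels
W = tuple((a * b) % m for a, b, m in zip(to_rns(X), to_rns(Y), BASE))
result = from_rns(W)
```

The product of `LOW` is $2^{30} - 1$ and the product of `HIGH` is $2^{30} + 1$, so the dynamic range is $2^{90} - 2^{30}$, large enough to hold the 88-bit product.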
Multiplier's complexity.
The reduction of the moduli width usually results in smaller, faster and less power-hungry arithmetic circuits. In this section, the standard unit-gate model by Zimmermann (1999) is used for area estimation and comparison of multipliers in the 3-moduli RNS $(2^k - 1, 2^k, 2^k + 1)$ and in the proposed HRNS. According to this model, each two-input monotonic gate (e.g., AND, OR, NAND, NOR) has the area $A = 1$, two-input XOR and XNOR gates have the area $A = 2$ and a 1-bit full-adder has the area $A = 7$. For gates with the number of inputs $a > 2$, the area is $a - 1$ times larger than that of a single two-input gate performing the same logic function.
Multiplier area comparison requires area estimation of different types of multipliers. The areas of the $k$-bit binary multiplier and of multipliers modulo $2^k \pm 1$ are taken from the work of Zimmermann (1999), whereas the area of a multiplier modulo any other $k$-bit number is estimated for the multiplier presented by Hiasat (2000). The area of a multiplier modulo $2^k$ is assumed to be half the area of a full $k$-bit binary multiplier. In all multipliers, Wallace trees with no Booth reduction are used for carry-save addition and fast parallel prefix adders are used. The formulas used for area estimation are given in Table 2.

Table 2. Area estimation formulas for different multipliers using a unit-gate model. The formulas for multipliers modulo $2^k \pm 1$ are taken from the work of Zimmermann (1999), the formula for a multiplier modulo any other $k$-bit number is based on the results of Hiasat (2000).

[Table 2 body (columns: Multiplier, Area) not recovered.]

The area of a multiplier in the RNS is computed as a sum of the areas occupied by the individual multipliers modulo the moduli from the system base. Thus, the area of the full multiplier in the RNS $(2^k - 1, 2^k, 2^k + 1)$ is computed as a sum of the areas occupied by multipliers modulo $2^k - 1$, $2^k$ and $2^k + 1$. The area of multipliers in the HRNS is computed in a similar way. For moduli which are not of the form $2^a \pm 1$, the multipliers from the work of Hiasat (2000) are used. The areas for RNSs with different dynamic ranges are compared in Table 3.
It should be noted that for all moduli the fastest multipliers with prefix adders are used. In real systems some multipliers, especially for smaller moduli, could be implemented as slower and smaller circuits, because the critical path is usually determined only by the largest modulus. Thus, additional area savings could be obtained.
As shown in Table 3, in all cases but one the multipliers in the HRNS are from 21% to 49% smaller than the multipliers in the RNS $(2^k - 1, 2^k, 2^k + 1)$ with the same dynamic range. The estimated area of the HRNS multiplier is larger for the dynamic range equal to 27 bits because of the general modulo multipliers used for the moduli 19, 27 and 73. The multiplier presented by Hiasat (2000) [...] of the periodicity property. A multiplier using the periodicity property could offer better parameters, because the periods or half-periods for these moduli are equal to 9.

[Table 3 caption fragment: "[...] and HRNS: the proposed HRNS."]
The complexity of the presented HRNS multipliers was also compared with that of some of the newest residue number systems offering a higher degree of parallel processing than the RNS $(2^k - 1, 2^k, 2^k + 1)$ [...] et al. (2007). The results are shown in Fig. 5. These moduli sets do not allow constructing systems with the same dynamic ranges as the proposed HRNS. Thus, the comparison is done for all achievable dynamic ranges in the scope covered by the proposed HRNS. The data in Fig. 5 show that for dynamic ranges larger than 30 bits the proposed HRNS offers much smaller multipliers than the 3- and 4-moduli RNSs, and for dynamic ranges larger than 70 bits the multiplier area is smaller than for the 5-moduli RNS. Moreover, for dynamic ranges between 30 and 70 bits the complexity of HRNS multipliers is comparable to that of multipliers for the 5-moduli RNS, but HRNS multipliers can be used for dynamic ranges for which the 5-moduli RNS is not available. Thus, the proposed HRNS based on the 3-moduli RNS allows building multipliers with a comparable or smaller area than the 5-moduli general RNS.
Conversion from the HRNS.
In this section, equations for efficient conversion from the proposed HRNS to the binary number system are described. The conversion from the proposed HRNS to the binary number system is performed in two steps. In the first one, the residues modulo $2^k \pm 1$ are computed from the residues for the appropriate factors of $2^k \pm 1$. In the second step, a typical converter from the RNS $(2^k - 1, 2^k, 2^k + 1)$ is used. Since many efficient converters from the RNS $(2^k - 1, 2^k, 2^k + 1)$ are known in the literature (Piestrak, 1995; Wang et al., 2000; Mohan, 2001; Bi et al., 2004), in this paper only the first step is analysed.
Because most of the reverse conversion equations proposed in this paper are based on (18), it is rewritten to allow simple and efficient implementation. Transformations are chosen to replace complex arithmetic operations (such as multiplications and residue computing) with simple logical operations on bit fields (concatenation, rotation, etc.). If complex modulo operations are difficult to implement with simple logic, Look-Up Tables (LUTs) based on read-only memories (ROMs) are used. In this case, the reverse conversion equations are rewritten to minimize the ROM address width.
The first operation in Eqn. (18) is the difference $X_1 - X_2$. If $M_1$ is small, then it is desirable to compute the difference as a residue modulo $M_1$, or at least as a number congruent modulo $M_1$ to $|X_1 - X_2|_{M_1}$ and smaller than the difference $X_1 - X_2$. Then the rest of the operations in $|q_{2,1} \cdot (X_1 - X_2)|_{M_1} \cdot M_2$ can be implemented with a simple circuit due to the low operand width. This approach gives especially good results when $M_2$ is much larger than $M_1$ and $P(M_1)$ or $HP(M_1)$ is small. Hence, the operation $|X_1 - X_2|_{M_1}$ can be replaced with

[...] (21)

or

[...] (22)

The operations from Eqns. (21) and (22) can be implemented efficiently with circuits based on the periodicity property. First, observe that if $X_1 < 2^b$ and $X_2 < 2^b$ for some $b$, then

$$X_1 - X_2 = X_1 + \bar{X}_2 - (2^b - 1),$$

where $\bar{X}_2 = (2^b - 1) - X_2$ denotes the bit-by-bit complementation of the binary representation of $X_2$. Notice that only the $b$ least significant bits of $X_2$ are complemented. Since $X_1 < 2$ [...] is a concatenation of $b$-bit binary fields $B_a$ defined by Eqn. (9), we have

[...]

Next, the difference [...] is a sum of a small number of powers of 2. In this case, the multiplication modulo $M_1$ can be computed as a sum modulo $M_1$ of cyclic rotations [...] $= 2^j$ and computations can be done modulo [...] The modulo multiplication by a multiplicative inverse is then a simple left cyclic rotation by $j$ bits.
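Two of the bit-level simplifications mentioned above can be illustrated with a short Python model, assuming that the computations are carried out modulo $2^k - 1$: multiplication by $2^j$ becomes a left cyclic rotation of a $k$-bit word, and subtraction becomes addition of the bitwise complement. The function names are hypothetical and the model is behavioural only.

```python
def rotl(x, j, k):
    """Left cyclic rotation of the k-bit word x by j positions."""
    j %= k
    mask = (1 << k) - 1
    return ((x << j) | (x >> (k - j))) & mask

def sub_mod(x1, x2, k):
    """|x1 - x2| modulo 2^k - 1 computed as x1 + complement(x2).

    complement(x2) = (2^k - 1) - x2 is congruent to -x2 modulo 2^k - 1.
    """
    mask = (1 << k) - 1
    s = x1 + (x2 ^ mask)          # add the bitwise complement of x2
    s = (s & mask) + (s >> k)     # end-around carry
    return s % mask               # fold the double zero (2^k - 1) to 0

k = 5
rotated = rotl(0b10011, 2, k)     # same residue as 0b10011 * 4 modulo 31
```

When the multiplicative inverse in the conversion formula happens to be a power of 2, the multiplier therefore reduces to wiring.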
The last complex operation in Eqn. (18) is the multiplication by $M_2$. In some of the proposed conversion equations, this multiplication is replaced with a concatenation or a sum of binary vectors representing $|q_{2,1} \cdot (X_1 - X_2)|_{M_1}$. The simplest implementation (concatenation only) is possible when $M_2$ is a sum of terms $\pm 2^{b_j}$ and $2^{b_{j+1}} / 2^{b_j} > M_1$ for any $j$. The transformed reverse conversion process is illustrated by the following equation for computing the residue modulo $2^{2k} - 1$ from the residues modulo $2^k \pm 1$:

[...] (28)

Equation (28) [...] Since the results of $|X|_{13} - |X|_5$ and $|X|_5 - |X|_{13}$ are both 5 bits wide, a small LUT of size $2^5 \times 6$ bits can be used to compute the second operand of the main sum. In this case, both equations should result in a similar circuit complexity.
Conversion for $2^9 - 1$. The number $2^9 - 1$ has two factors: 7 and 73. Thus

[...]

A second version with calculations modulo 73 is also possible, but circuits for computations modulo 7 are usually simpler, smaller and faster than those modulo 73.

3.2.4. Conversion for $2^{10} - 1$. The three factors 3, 11 and 31 of $2^{10} - 1$ impose a two-step conversion. In the first step, the residue modulo $3 \cdot 11 = 33$ is calculated according to

[...] (34)

and in the second step, the residue modulo $2^{10} - 1$ is calculated from the residues modulo 31 and 33 according to Eqn. (28). The computation of the residue modulo 33 can also be done with the second version of Eqn. (34), with the moduli 11 and 3 swapped, but then the multiplication of the difference $|X|_{11} - |X|_3$ by $|3^{-1}|_{11}$ would have to be done modulo 11, which has a much larger period $P(11) = 10$ and half-period $HP(11) = 5$ than $P(3) = 2$ and $HP(3) = 1$.
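The two-step conversion for $2^{10} - 1$ can be modelled in Python with the generic two-moduli formula $X = X_2 + M_2 \cdot |q \cdot (X_1 - X_2)|_{M_1}$. This is a reference sketch, not the optimized equations of the paper: the first step reduces modulo 3 as described above, while the choice of reducing modulo 31 in the second step is an assumption made here. The function names are hypothetical.

```python
def crt2(x1, m1, x2, m2):
    """Residue modulo m1 * m2 from (x1 mod m1, x2 mod m2); one reduction modulo m1."""
    q = pow(m2, -1, m1)               # multiplicative inverse of m2 modulo m1
    return x2 + m2 * ((q * (x1 - x2)) % m1)

def residue_1023(r3, r11, r31):
    """|X| modulo 2^10 - 1 from the residues modulo 3, 11 and 31."""
    r33 = crt2(r3, 3, r11, 11)        # step 1: arithmetic modulo 3 only
    return crt2(r31, 31, r33, 33)     # step 2: assumed here to reduce modulo 31

example = residue_1023(1000 % 3, 1000 % 11, 1000 % 31)
```

In the second step the inverse of 33 modulo 31 is 16, a power of 2, so under this assumption the constant multiplication is a cyclic rotation of a 5-bit word.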
Conversion for $2^{10} + 1$. Since $2^{10} + 1$ has two factors, 25 and 41, the conversion is done according to

[...]

A second version with computations modulo 41 is also possible. However, both 25 and 41 have large periods, so in either version the multiplication of the difference by a constant (the constant 11 for computations modulo 25) has to be done with an LUT. The number of LUT output bits is smaller by one bit for the result modulo 25 than for the result modulo 41.
All complex operations in Eqns. (40) and (41) are computed modulo $2^k \pm 1$. Since the binary representation of 1057 is $10000100001_2$ and the residue modulo 31 occupies 5 bits, the multiplication of a residue modulo 31 by 1057 can be implemented as a concatenation of 3 residues modulo 31.
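The concatenation trick for the constant $1057 = 2^{10} + 2^5 + 1$ can be checked directly in Python; the function name is hypothetical.

```python
def times_1057(r):
    """r * 1057 for a 5-bit residue r, built as the concatenation r | r | r."""
    assert 0 <= r < 32
    return (r << 10) | (r << 5) | r   # the three 5-bit fields never overlap

product = times_1057(30)
```

No adder is needed because the nonzero bits of 1057 are spaced exactly 5 positions apart, the width of a residue modulo 31.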
Conversion for $2^{15} + 1$. The three factors of $2^{15} + 1$ imply a two-level conversion. The first step to find $|X|_{2^{15}+1}$ is the calculation of the residue modulo $11 \cdot 331 = 3641$ as

[...] (42)

Knowing $|X|_{3641}$, the residue modulo $2^{15} + 1$ is given as

[...] (43)

In both equations the calculations are performed modulo very small moduli. Additionally, in (43) the residues for large numbers can be computed efficiently using an MOMA modulo $2^3 + 1$. The implementation of Eqns. (42) and (43) also requires multiplications by relatively large constants, but owing to the low width of residues modulo 9 and 11, those multiplications can be implemented with small LUTs of size $2^4 \times 12$ bits.
Conversion for $2^{18} + 1$. The factors of $2^{18} + 1$ can be grouped into pairs $5 \cdot 13 = 65$ and $37 \cdot 109 = 4033$. One of the products is equal to $2^6 + 1$, whereas the other one is $2^{12} - 2^6 + 2^0$. The value of $|X|_{2^{18}+1}$ is computed by a two-level circuit. The first level computes residues modulo 65 according to Eqns. (29) [...]

Equation (45) allows using the MOMA modulo 65, which performs operations according to (27), for computations in the right addend. Multiplication by 4033 can be implemented as a 12-bit subtraction due to the low width of the residue modulo 65 and the special form of the constant $4033 = 2^{12} - 2^6 + 2^0$. The idea is shown in Fig. 6. Notice that the result requires at most 18 bits, because the largest residue modulo 65 is 64 and $4033 \cdot 64 < 2^{18}$. The six least significant bits of the result are equal to the six least significant bits of the residue modulo 65.
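The multiplication by $4033 = 2^{12} - 2^6 + 1$ can be modelled in Python as one 12-bit subtraction on the upper part of the result, with the six low bits passed through unchanged. This is a behavioural sketch of the idea only; the exact arrangement in Fig. 6 is not reproduced and the function name is hypothetical.

```python
def times_4033(r):
    """r * 4033 for a residue r modulo 65 (0 <= r <= 64)."""
    assert 0 <= r < 65
    low = r & 0x3F                    # six least significant bits pass through
    concat = (r << 12) | r            # r * 2^12 + r, the fields do not overlap
    upper = (concat >> 6) - r         # 12-bit subtraction: removes r * 2^6
    assert 0 <= upper < (1 << 12)
    return (upper << 6) | low

widest = times_4033(64)               # largest possible operand
```

The largest result, $4033 \cdot 64 = 258112$, still fits in 18 bits, in line with the bound stated above.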
Conversion for
Multiplication by $65281 = 2^{16} - 2^8 + 2^0$ can be implemented as a subtraction in a way similar to that presented in Fig. 6. It is also possible to use one MOMA, which can realise the subtraction and the multiplication by 86 as a sum of shifted bit fields taken from $|X|_{65281}$ and $|X|_{257}$, as shown in Fig. 7 (which does not contain the additional constants equal to 2 defined by (27)). All bits set to 1 shown in Fig. 7 and the additional constants can be replaced by a cumulative correction equal to the sum modulo 257 of all constant values. For the case shown in Fig. 7, the cumulative correction (including bits set to ones) is equal to $|252 + 248 + 224 + 128 + 1 + 254 + 3 + 252 + 15 + 240 + 63 + 192 + 12 \cdot 2|_{257} = 97$.
Conversion for $2^{30} + 1$. The reverse conversion for $2^{30} + 1$ is the most complex task among those presented. In the first step, the residues modulo $25 \cdot 1321 = 33025 = 2^{15} + 2^8 + 1$ and $13 \cdot 41 \cdot 61 = 32513 = 2^{15} - 2^8 + 1$ are computed, whereas in the second one the residue modulo $2^{30} + 1$ is found.
The residue modulo 32513 is determined by the residues for factors 13, 41 and 61. Thus the reverse conversion according to (18) requires a two-level circuit. There are two forms of Eqn. (18), which result in a similar complexity. In the first one, the residue modulo 41 · 61 = 2501 is computed, whereas in the second case the residue modulo 13 · 61 = 793 is found first.
The first method of reverse conversion is described by [...]

Equation (53) can be implemented using an MOMA which computes a sum of shifted bit vectors representing $|X|_{32513}$ and $|X|_{33025}$. The reverse conversion based on the CRT was used because the final addition modulo $2^{30} + 1$ can be efficiently done with an MOMA adding operands constructed in a way similar to that shown in Fig. 7.

Table 4 (fragment). Periods $P(M)$ and half-periods $HP(M)$ for factors of $2^k \pm 1$; a dash means that the half-period does not exist. The preceding rows were not recovered (the fragment begins with a trailing entry 9).

M        HP(M)   P(M)
89       -       11
97       24      48
101      50      100
103      -       51
109      18      36
113      14      28
125      50      100
127      -       7
151      -       15
157      26      52
241      12      24
5419     21      42
6529     51      102
8101     50      100
8191     -       13
11119    -       51
14449    42      84
38737    36      72
43691    17      34
65537    16      32
131071   -       17
268501   50      100
Conversion to the HRNS.
Conversion from the binary number system to the HRNS can be performed in two steps. In the first one, the residues $|X|_{2^k \pm 1}$ and $|X|_{2^k}$ are computed; in the second, the residues of $|X|_{2^k \pm 1}$ modulo the factors of $2^k \pm 1$ are calculated. The first step can be efficiently performed using the periodicity property. However, this step can be omitted for operands smaller than $2^k$, e.g., when the dynamic range of the HRNS allows performing a multiplication and many additions without overflow. This situation is more likely for an HRNS with a small dynamic range, e.g., the HRNS constructed from the RNS $(2^6 - 1, 2^6, 2^6 + 1)$ allows accumulating $\approx 2^6$ results of the multiplication of two 6-bit operands.
The calculation of the residues of $|X|_{2^k \pm 1}$ modulo the factors of $2^k \pm 1$ can be done with any residue generator. Additionally, when the periods or half-periods for different factors of $2^k \pm 1$ are equal (e.g., $P(3) = HP(5) = 2$ or $P(19) = P(27) = 18$), the residue generators for these factors may share the same hardware, which allows further area reduction. The periods and half-periods for factors of $2^k \pm 1$ are listed in Table 4.
Hardware sharing may be carried out in two cases. First, when for the numbers M 1 and 18 , 2 18 + 1). The number 2 18 − 1 has the factor of 7, and the number 2 18 + 1 has the factors 5 and 13. Since P (13) = 4 · P (7) = 3 · P (5), if X < 2 18 − 1, we can use one adder modulo 2 12 − 1, which reduces X to a 12-bit number. Then the residues modulo 5, 7 and 13 can be computed for the 12-bit number. Even if X > 2 18 − 1, the reduced 12-bit vector can be used as input for generators modulo 5 and 13, which are factors of the same modulus.
The second case when the periodicity property can simplify residue generators is for HP(M 1 ) = r · P (M 2 ). In this case, the residues modulo 2 HP (M1) + 1 and 2 r·P (M2) − 1 are computed according to Eqns. (12) and (13). First, the input number X i has to be partitioned into HP(M 1 )-bit wide fields B a . Next, three multi-operand adders should be used for the computation of a even B a , a odd B a and a odd B a . The sum a odd B a can be computed modulo 2 HP (M1) + 1. Finally, the residues |X i | 2 HP(M 1 ) +1 and |X i | 2 r·P (M 2 ) −1 can be computed according to Eqns. (12) and (13). In such implementation the computation of a even B a is common for both circuits, which results in hardware savings.
As an example consider the top-level base $(2^{30} - 1, 2^{30}, 2^{30} + 1)$. The number $2^{30} - 1$ has the factors 7 and 9, and the number $2^{30} + 1$ has the factor 13. Now we have $HP(13) = P(9) = 2 \cdot P(7) = 6$. If the converted number $X$ has a 45-bit binary representation, then three multi-operand adders can be used. The first adder computes the sum of four 6-bit wide fields [...] Then, $|X|_7$ can be computed as a sum modulo 7 of the outputs of the first and the second multi-operand adder, $|X|_9$ can be computed as a sum modulo 9 of the outputs of the same adders, and $|X|_{13}$ can be computed as a sum modulo 13 of the outputs of the first and the third multi-operand adder.
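The sharing scheme of this example can be checked with a short behavioural model. The sketch assumes that the first adder sums the even-indexed 6-bit fields, the second the odd-indexed fields, and the third the bitwise complements of the odd-indexed fields; the name `shared_residues` and the explicit correction term are illustrative assumptions, not taken from the paper.

```python
def shared_residues(x: int, width: int = 6) -> tuple[int, int, int]:
    """Return (X mod 7, X mod 9, X mod 13) from three shared field sums."""
    mask = (1 << width) - 1
    n_bits = max(x.bit_length(), 1)
    fields = [(x >> s) & mask for s in range(0, n_bits, width)]
    even_sum = sum(fields[0::2])                         # first adder
    odd_sum = sum(fields[1::2])                          # second adder
    odd_compl_sum = sum(mask - f for f in fields[1::2])  # third adder
    n_odd = len(fields[1::2])
    # 2^6 = 1 (mod 63) and 63 = 7 * 9, so all fields have weight +1.
    r7 = (even_sum + odd_sum) % 7
    r9 = (even_sum + odd_sum) % 9
    # 2^6 = -1 (mod 65) and 65 = 5 * 13, so odd fields have weight -1.
    # -B = (2^6 - 1 - B) + 2 (mod 65): each complemented field needs +2.
    r13 = (even_sum + odd_compl_sum + 2 * n_odd) % 13
    return r7, r9, r13
```

The term `2 * n_odd` plays the role of a constant cumulative correction, since the number of odd-indexed fields is fixed for a given input width.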
Implementation of reverse converters
In this section, reverse converter implementations for the proposed HRNS are presented. The circuits were described in VHDL and simulated with Cadence IUS 0611. After functional simulation the circuits were synthesized with Cadence Encounter RTL Compiler, version 07.20. The FreePDK45nm library and design flow were used (Stine et al., 2005).
The converters proposed in this paper consist of two levels. The top level is responsible for conversion between a binary number and a representation in the RNS $(2^k - 1, 2^k, 2^k + 1)$, while the bottom level converts between residues modulo $2^k \pm 1$ and the RNS with the base consisting of factors of $2^k \pm 1$. For converters from the bottom-level RNS to residues modulo $2^k \pm 1$, two implementations were compared: the circuits based on the equations presented in Section 3.2 and the circuits built according to the CRT. The conversion from the RNS $(2^k - 1, 2^k, 2^k + 1)$ to the binary number system was performed with two different architectures: the circuits presented by Wang et al. (2002) and by Bi et al. (2004).
The VHDL code was written to maximize the use of optimisation algorithms built into the VHDL compiler. During an automated synthesis process, the VHDL compiler constructs the structures of inferred logic blocks (e.g., adders, multipliers, CSA trees) on the basis of parameters of supplied library cells to meet area and delay requirements. An example of the inferred circuit is a carry propagate adder (CPA). The same VHDL code results in different adder structures depending on area and time constraints. In this paper, two sets of synthesis constraints were applied. In the first case, no time constraints were imposed, therefore the circuits were synthesized to achieve the smallest area, i.e., all adders were synthesized as ripple-carry adders (RCAs). The second constraint set resulted in the fastest circuit, i.e., all adders were synthesized to achieve a minimal critical path delay. Detailed analysis of automatically generated adder structures showed that adders were constructed as hybrid structures. The part for less significant bits was a typical RCA to keep a small area, but for more significant bits fast carry propagation structures were used.
The second example of the inferred circuit is a CSA tree, which can be automatically built by RTL Compiler. The designer's only task is to compose a chain of adders in a way that allows many additions to be done in parallel. Compiler-generated CSAs usually have better area and delay than structures made by hand. The main reason is that, for a hand-made CSA, it is difficult to consider parameters of individual cells from the library. Moreover, the structure of a CSA can be automatically tuned by the compiler to meet time constraints. The VHDL compiler can also merge a cascade of arithmetic operations into one CSA with all intermediary signals transformed to CSA form and with only one final CPA.
It should be noted that it is difficult to write VHDL code which can exploit compiler optimisations and at the same time creates MOMA structures identical to those presented by Piestrak (1994). An example is shown in Fig. 8, where the two methods compute the result of the operation $||X|_5 - |X|_{13}|_5$ used in Eqn. (29). The circuit from Fig. 8(a) is built according to Piestrak (1994) and implemented in structural VHDL by instantiations of full-adder cells. The method from Fig. 8(b) is implemented in VHDL as two adders using the + operator. The first adder computes the 4-bit sum, whereas the second one adds 2-bit fields from the computed sum. Due to different structures, the circuits differ in the cumulative correction, which equals $|2 + 2 + 2 + 2 + 2|_5 = 0$ and $|2 + 2 + 2 + 2|_5 = 3$ for Figs. 8(a) and (b), respectively.
To check the area and the critical path delay of the circuits from Fig. 8, they were used in converters built according to Eqn. (29). The area and the critical path delay for the converter with the circuit from Fig. 8(a) were 177.40 µm² and 899 ps for the smallest version and 286.27 µm² and 553 ps for the fastest version. The area and the critical path delay for the converter with the circuit from Fig. 8(b) were 169.89 µm² and 858 ps for the smallest version and 347.28 µm² and 548 ps for the fastest version. Thus, the version from Fig. 8(b) gives a smaller area-optimized circuit and a slightly shorter critical path in both versions (although its fastest version is larger), and additionally leaves more room for compiler optimisations. In this work, all MOMAs are described in the way shown in Fig. 8(b).
Residue arithmetic circuits often involve complex computations, e.g., residue computations for moduli that have large periods. One of the most widely known methods for the implementation of complex arithmetic operations on small arguments is the use of ROM-based LUTs. Proper implementation of a ROM requires designing layout masks by hand or using automated generators supplied with a standard cell library. Unfortunately, for many free standard cell libraries there are no such generators. Moreover, after designing the layout mask, a simulation is necessary to find the delay and power of the created memory. Simulated parameters are then used to approximate full circuit characteristics. This process is complex and time consuming, especially when many ROMs are needed.
The reverse converters compared in this work require many LUTs. To avoid a lengthy development process, in this comparison ROMs are described in a high-level language (VHDL) as combinatorial circuits consisting of a row decoder, a column decoder and a connection matrix. VHDL codes for ROMs are generated automatically based on ROM data. The area and delay of the generated circuits are much better than for strictly combinatorial implementations of the ROM. The test synthesis of a ROM consisting of $2^{10} \times 6$ bits gives the following results: the pure combinatorial implementation has an area of 4010 µm² and a delay of 1.8 ns (20 logic levels), whereas the implementation with a row and column decoder results in an area of 1079 µm² and a delay of 0.769 ns (6 logic levels). The ROMs generated in this way are used both in the converters based on the proposed equations and in the converters built according to the CRT.
Conversion circuits.
For the proposed HRNS class, it is difficult to give a general reverse converter structure, because for each k the factors of 2 k ± 1 can be grouped in many ways. Therefore, in this section, synthesized reverse converter circuits are individually described.
Despite the circuit differences, many of the proposed conversion circuits share one common operation: the calculation of the residue modulo $2^{2k} - 1$ from the residues modulo $2^k \pm 1$. The circuit used in this paper implements Eqn. (28), as shown in Fig. 9. Its main advantage is the lack of multipliers, which are replaced with a left cyclic rotation and a concatenation. The first operation of the circuit from Fig. 9 is the calculation of $\big||X|_{2^k+1}\big|_{2^k-1}$. Since $|X|_{2^k+1} \le 2^k$, the residue modulo $2^k - 1$ can be computed by OR-ing the least and the most significant bits of $|X|_{2^k+1}$, because for a number less than $2^k + 1$ only one of these bits can be equal to 1. Next, the subtraction is replaced with the addition of the additive inverse of $\big||X|_{2^k+1}\big|_{2^k-1}$ modulo $2^k - 1$, which is nothing else but
$$\Big|-\big||X|_{2^k+1}\big|_{2^k-1}\Big|_{2^k-1} = 2^k - 1 - \big||X|_{2^k+1}\big|_{2^k-1}$$
and can be computed as the bit-by-bit complementation of $\big||X|_{2^k+1}\big|_{2^k-1}$. If the output of the adder modulo $2^k - 1$ at the top of Fig. 9 is without double zero representation, the rest of the circuit can be very simple, because the remaining multiplications from Eqn. (28) can be implemented with simple bit operations as shown in the middle of Fig. 9. Thus, depending on the required area and speed, the adder can be implemented as an RCA with EAC and an additional double-zero elimination circuit (Biernat, 2007) or as a parallel prefix adder, e.g., the one presented by Zimmermann (1999). The final operation is the addition of the number $|X|_{2^k+1}$ to
$$(2^k + 1) \cdot \Big|2^{k-1} \cdot \big(|X|_{2^k-1} - \big||X|_{2^k+1}\big|_{2^k-1}\big)\Big|_{2^k-1},$$
implemented with the adder at the bottom of Fig. 9. The final adder can be realised with any architecture (e.g., as a parallel prefix adder), but in Fig. 9 the RCA is used to show that the most significant $k - 1$ cells of the adder are used only for carry propagation. Thus, the total area of the final adder can be lowered compared to a full $2k$-bit adder.
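The data flow described above can be modelled bit by bit. The sketch below assumes that Eqn. (28) has the standard two-modulus form $|X|_{2^{2k}-1} = |X|_{2^k+1} + (2^k+1) \cdot |2^{k-1}(|X|_{2^k-1} - |X|_{2^k+1})|_{2^k-1}$; the function name `combine_2k` and its argument names are illustrative and do not come from the paper.

```python
def combine_2k(r_minus: int, r_plus: int, k: int) -> int:
    """Residue modulo 2^(2k) - 1 from r_minus = X mod (2^k - 1), r_plus = X mod (2^k + 1)."""
    m = (1 << k) - 1
    # Reduce r_plus (at most 2^k) modulo 2^k - 1: OR its MSB into its LSB.
    reduced = (r_plus & m) | (r_plus >> k)
    # Additive inverse modulo 2^k - 1 is the bitwise complement.
    negated = reduced ^ m
    # Modular adder; % m also removes the all-ones (double zero) pattern.
    diff = (r_minus + negated) % m
    # Multiplication by 2^(k-1) modulo 2^k - 1: left cyclic rotation by k - 1.
    rotated = (diff >> 1) | ((diff & 1) << (k - 1))
    # Multiplication by 2^k + 1: concatenation of two copies.
    product = (rotated << k) | rotated
    # Final binary addition of the residue modulo 2^k + 1.
    return product + r_plus
```

For example, with $k = 3$ and $X = 62$ the inputs are `r_minus = 6` and `r_plus = 8`, and the function returns 62.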
The circuits based on the architecture shown in Fig. 9 are used for the reverse conversion for moduli $2^6 - 1$, $2^{12} - 1$, $2^{18} - 1$, $2^{24} - 1$ and $2^{30} - 1$. Converters for other moduli are described below.
The converter for $2^6 + 1$ is based on Eqn. (30). First, the two 2-bit fields of the residue $|X|_5$ and the two 2-bit fields of the residue $|X|_{13}$ are added using an MOMA modulo $2^6 + 1$ of Fig. 8(b). The constant cumulative correction (here equal to 3) required in the MOMA is not added in this step to reduce the result width. The 3-bit output of the MOMA is next connected to the input of a combinatorial circuit synthesized from a truth table. The realized function comprises the addition of the correction required in the MOMA and all computations necessary to obtain $|2 \cdot (|X|_5 - |X|_{13})|_5 \cdot 13$. The last step is the addition of $|X|_{13}$ to get the final result $|X|_{2^6+1}$.
The converter for $2^9 - 1$ is built according to Eqn. (31). First, the difference $|X|_7 - |X|_{73}$ is computed using an MOMA as a 4-bit number congruent modulo 7 to $|X|_7 - |X|_{73}$. The rest of the operations from $|5 \cdot (|X|_7 - |X|_{73})|_7 \cdot 73$ are computed using a 4-input combinatorial circuit, whose output is then added to $|X|_{73}$ to calculate $|X|_{511} = |X|_{73} + |5 \cdot (|X|_7 - |X|_{73})|_7 \cdot 73$.
The converter for $2^9 + 1$ is built according to Eqn. (32). Since both $P(19) = 18$ and $HP(19) = 9$ are large compared with the operand width (5 bits), the converter consists of a subtractor computing $|X|_{19} - |X|_{27}$, a $2^6 \times 5$-bit LUT for the rest of the operations in $|12 \cdot (|X|_{19} - |X|_{27})|_{19}$, the final multiplier by 27 and an 8-bit adder.
The converter for $2^{10} - 1$ is built according to Eqns. (34) and (28). First, the difference $||X|_3 - |X|_{11}|_3$ is computed using an MOMA modulo $2^2 - 1$ adding 2-bit wide fields of $|X|_3$ and $|X|_{11}$. The 3-bit wide MOMA output is then connected to the inputs of a small (3-input, 5-output) combinatorial circuit performing the rest of the operations from (34) except the addition of $|X|_{11}$. The combinatorial circuit is synthesized from a truth table. The residue modulo 33 is obtained after adding $|X|_{11}$ to the output of the combinatorial circuit. Next, the circuit from Fig. 9 is used to compute the final residue modulo $2^{10} - 1$ from $|X|_{31}$ and $|X|_{33}$.
The converter for $2^{10} + 1$ is built according to Eqn. (35). First, the difference $|X|_{25} - |X|_{41}$ is computed as a 7-bit U2 number. Next, the rest of the operations from $|11 \cdot (|X|_{25} - |X|_{41})|_{25}$ are computed using a $2^7 \times 5$-bit ROM. The final residue modulo $2^{10} + 1$ is obtained after the multiplication of the ROM output by the constant 41 and the addition of $|X|_{41}$.
The converter for $2^{12} + 1$ is built according to Eqn. (36). First, the value of $|6 \cdot (|X|_{17} - |X|_{241})|_{17}$ is computed using an MOMA modulo 17. Next, a multiplier by the constant 241 is used, and the final addition of the multiplier output and $|X|_{241}$ is done using a 13-bit adder.
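The two-modulus converters described so far share one arithmetic pattern, which can be sketched generically. The function below is an illustrative software model under that assumption (the name `combine_pair` is hypothetical); the multiplier constants 2, 5, 12, 11 and 6 quoted in the text are the multiplicative inverses it computes with Python's built-in `pow` (Python 3.8 or later).

```python
def combine_pair(r_a: int, r_b: int, m_a: int, m_b: int) -> int:
    """Residue modulo m_a * m_b from r_a = X mod m_a and r_b = X mod m_b.

    Implements |X| = r_b + |c * (r_a - r_b)| mod m_a * m_b, where c is the
    multiplicative inverse of m_b modulo m_a (m_a and m_b must be coprime).
    """
    c = pow(m_b, -1, m_a)
    return r_b + ((c * (r_a - r_b)) % m_a) * m_b


# Constants used by the converters for 2^6+1, 2^9-1, 2^9+1, 2^10+1, 2^12+1.
CONSTANTS = {(5, 13): 2, (7, 73): 5, (19, 27): 12, (25, 41): 11, (17, 241): 6}
```

In the hardware versions, the modular multiplication by the constant is realized by a small combinatorial circuit, a LUT or an MOMA, depending on the operand width.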
The converter for $2^{14} - 1$ is built according to Eqns. (37) and (28). First, the difference $||X|_3 - |X|_{43}|_3$ is computed using an MOMA modulo $2^2 - 1$. Then, it is multiplied by the constant 43, and $|X|_{43}$ is added to get $|X|_{129}$. The residue modulo $2^{14} - 1$ is finally computed from $|X|_{127}$ and $|X|_{129}$ using the circuit of Fig. 9.
The converter for $2^{14} + 1$ is built according to Eqns. (38) and (39). Equation (38) is implemented as a three-level circuit. First, the difference $|X|_{29} - |X|_{113}$ is computed as an 8-bit U2 number. Next, the rest of the calculations in $|19 \cdot (|X|_{29} - |X|_{113})|_{29}$ is performed using a $2^8 \times 5$-bit ROM. The ROM output is then multiplied by the constant 113, and $|X|_{113}$ is added to the multiplier output to get $|X|_{3277}$. After obtaining $|X|_{3277}$, the circuit based on (39) is used. First, it computes $||X|_5 - |X|_{3277}|_5$ using an MOMA consisting of two parts. The first part performs computations modulo $2^4 - 1$; the second part reduces the 5-bit output of the first part to a 3-bit number modulo $2^2 + 1$ by adding 2-bit wide fields of the 5-bit output. The necessary corrections resulting from (24) are not built into the CSA tree forming the MOMA, thus the resulting 3-bit vector is next encoded using a 3-input, 3-output combinatorial circuit, which is synthesized from a truth table. Finally, the combinatorial circuit output is multiplied by the constant 3277, and $|X|_{3277}$ is added to the multiplier output to get the residue modulo $2^{14} + 1$.
The converter for $2^{15} - 1$ is built according to Eqns. (40) and (41). First, the difference $|X|_7 - |X|_{151}$ is computed using an MOMA as a 4-bit number congruent modulo 7 to $|X|_7 - |X|_{151}$. Next, the multiplication by 2 modulo 7 and the multiplication by 151 are done using a 4-input combinatorial circuit synthesized according to a suitable truth table. The residue $|X|_{1057}$ is obtained after the addition of $|X|_{151}$ to the output of the combinatorial circuit using a 10-bit adder. In the next stage, the difference $|X|_{31} - |X|_{1057}$ is computed using an MOMA modulo 31 as a 7-bit number congruent modulo 31 to $|X|_{31} - |X|_{1057}$. The following multiplication by 21 modulo 31 is calculated using a ROM with a 7-bit input and a 5-bit output. Since $1057 = 10000100001_2$, the multiplication of $|21 \cdot (|X|_{31} - |X|_{1057})|_{31}$ by 1057 is implemented as a concatenation of three ROM outputs. The final addition of the concatenation result and $|X|_{1057}$ is done with a 15-bit adder, which generates the unbiased result.
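The multiplication by 1057 reduces to wiring because $1057 = 2^{10} + 2^5 + 1$ and the multiplicand has only 5 bits, so the three shifted copies never overlap. The sketch below models this and, as an assumed software equivalent of the two stages described above, the whole converter for $2^{15} - 1$; the function names are illustrative.

```python
def times_1057(v: int) -> int:
    """Multiply a 5-bit value by 1057 = 2^10 + 2^5 + 1 by concatenating three copies."""
    if not 0 <= v < 32:
        raise ValueError("v must fit in 5 bits")
    return (v << 10) | (v << 5) | v


def convert_2_15_minus_1(r7: int, r31: int, r151: int) -> int:
    """Residue modulo 2^15 - 1 = 7 * 31 * 151 from the three bottom-level residues."""
    # Stage 1: combine moduli 7 and 151 into the residue modulo 1057.
    r1057 = r151 + ((2 * (r7 - r151)) % 7) * 151
    # Stage 2: combine moduli 31 and 1057; the product is pure concatenation.
    return r1057 + times_1057((21 * (r31 - r1057)) % 31)
```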
The converter for $2^{15} + 1$ is based on Eqns. (42) and (43). The difference $||X|_{11} - |X|_{331}|_{11}$ is computed using an MOMA as a 6-bit number congruent modulo $2^{HP(11)} + 1 = 2^5 + 1$ to $|X|_{11} - |X|_{331}$. The 6-bit wide MOMA output then feeds the ROM, which calculates the rest of the operations from $|2 \cdot (|X|_{11} - |X|_{331})|_{11} \cdot 331$. The residue $|X|_{3641}$ is obtained after the addition of the 12-bit ROM output (the result of the multiplication of a residue modulo 11 by 331) and $|X|_{331}$. Next, an MOMA modulo 9 is used to calculate a 4-bit number congruent to $||X|_9 - |X|_{3641}|_9$. Then, the 4-bit MOMA output feeds a combinatorial circuit which calculates the value of $|2 \cdot (|X|_9 - |X|_{3641})|_9 \cdot 3641$. The final result is obtained after the addition of $|X|_{3641}$ using a 15-bit adder.
The converter for $2^{18} + 1$ is built according to Eqns. (30), (44) and (45). For the residue modulo $2^6 + 1$, the circuit described before is used. The residue modulo $37 \cdot 109 = 4033$ is computed by the circuit based on (44). The first level computes the difference $|X|_{37} - |X|_{109}$ as an 8-bit U2 vector using a 7-bit binary subtractor. Next, the difference is partitioned into two nibbles, and two $2^4 \times 6$-bit ROMs are used to multiply the low and high nibbles by 18 modulo 37. The multiplication results are then added in an adder modulo 37, the result of the addition is multiplied by 109 and, finally, $|X|_{109}$ is added to get $|X|_{4033}$. The obtained residue modulo 4033 is then subtracted from $|X|_{65}$ using a CSA that implements Eqn. (27). Next, a second CSA is used to multiply the result by 22 modulo 65. The multiplication result is then multiplied by 4033. The final residue modulo $2^{18} + 1$ is obtained after the addition of $|X|_{4033}$ to the result of the multiplication by 4033.
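The nibble-wise ROM multiplication used for $|X|_{4033}$ can be modelled as follows. This is a sketch under an assumption the text does not spell out: the high nibble of the 8-bit U2 difference is treated as a signed value of weight 16, and the sign handling is folded into the contents of the second ROM. The table and function names are illustrative.

```python
# Two 16-entry tables with 6-bit outputs (values 0..36).
ROM_LO = [(18 * n) % 37 for n in range(16)]
# High nibble: entries 8..15 represent the negative values -8..-1 (U2 code).
ROM_HI = [(18 * 16 * (n - 16 if n >= 8 else n)) % 37 for n in range(16)]


def residue_4033(r37: int, r109: int) -> int:
    """Residue modulo 4033 = 37 * 109 from X mod 37 and X mod 109."""
    d = (r37 - r109) & 0xFF                        # 8-bit U2 difference
    t = (ROM_LO[d & 0xF] + ROM_HI[d >> 4]) % 37    # |18 * d| mod 37
    return r109 + t * 109
```

Splitting the 8-bit operand into nibbles replaces one 256-entry table with two 16-entry tables and a modular adder.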
The converter for $2^{24} + 1$ is based on Eqns. (46) and (47). First, the difference $|X|_{97} - |X|_{673}$ is computed as an 11-bit U2 vector. Next, the values of two 4-bit fields containing the eight most significant bits of the vector are multiplied by 16 modulo 97 with two 4-input ROMs, one ROM for each field. For the rest of the vector (three least significant bits), the multiplication by 16 modulo 97 is done using a 3-input combinatorial circuit. The three results of multiplication are then added using an MOMA modulo 97. Next, the result of the modulo addition is multiplied by 673, and $|X|_{673}$ is added to get $|X|_{65281}$. Having obtained $|X|_{65281}$, the residue $|X|_{2^{24}+1}$ is computed according to Eqn. (47). First, the difference $|X|_{257} - |X|_{65281}$ is computed from (27) as a 10-bit vector congruent modulo 257 to $||X|_{257} - |X|_{65281}|_{257}$. The result is then multiplied by 86 modulo 257 by adding rotated and/or complemented bit fields in a CSA tree built on the basis of the periodicity property. The CSA output represents the value of $|86 \cdot (|X|_{257} - |X|_{65281})|_{257}$. The residue modulo $2^{24} + 1$ is obtained after the multiplication of the CSA output by 65281 and the addition of $|X|_{65281}$ to the result of the multiplication by 65281.
[...] fast HRNS denotes the fastest and small HRNS the smallest implemented converters for the proposed HRNS.
The converter for $2^{30} + 1$ is based on Eqns. (53), (49) and (48). First, the circuits that realize Eqns. (49) and (48) compute in parallel the residues modulo 32513 and 33025, which are then multiplied by the constants $(2^{29} - 2^{21} + 2^6 + 1)$ and $(2^{29} + 2^{21} - 2^6 + 1)$ and added with an MOMA to get the residue modulo $2^{30} + 1$ (as shown in Eqn. (53)).
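The two constants can be checked numerically: the first is congruent to 1 modulo 32513 and to 0 modulo 33025, the second the other way round, so they act as CRT weights. The sketch below is an illustrative software model of this final combination step (the names `C1`, `C2` and `combine_2_30_plus_1` are assumptions); the hardware uses an MOMA rather than a general modular reduction.

```python
M = (1 << 30) + 1                                  # 2^30 + 1 = 32513 * 33025
C1 = (1 << 29) - (1 << 21) + (1 << 6) + 1          # weight of the residue modulo 32513
C2 = (1 << 29) + (1 << 21) - (1 << 6) + 1          # weight of the residue modulo 33025


def combine_2_30_plus_1(r32513: int, r33025: int) -> int:
    """Residue modulo 2^30 + 1 from X mod 32513 and X mod 33025."""
    return (r32513 * C1 + r33025 * C2) % M
```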
The circuit computing the residue modulo 32513 is built according to Eqn. (49). In the first step, the difference $|X|_{61} - |X|_{41}$ is computed as a U2 vector. The value encoded in the three most significant bits of the vector is then multiplied by 3 modulo 61 using a 3-input combinatorial circuit. The output of the circuit is added modulo 61 to the four remaining bits of the difference multiplied by 3, giving the value of $|3 \cdot (|X|_{61} - |X|_{41})|_{61}$. $|X|_{2501}$ is obtained after the multiplication of this value by 41 and the addition of $|X|_{41}$. The next step is to find the residue modulo 32513. First, the value of $8 \cdot (|X|_{13} - |X|_{2501})$ is computed using a CSA according to Eqn. (27). The result is an 8-bit number congruent modulo $2^{HP(13)} + 1 = 2^6 + 1$ to $|8 \cdot (|X|_{13} - |X|_{2501})|_{13}$. Next, the four most significant bits of the result are reduced modulo 13 using a 4-input ROM, and the obtained value is added to the remaining four bits with an adder modulo 13. The adder output is multiplied by 2501, and $|X|_{2501}$ is added to the multiplication result to get the final residue modulo 32513.
The residue modulo 33025 is computed according to Eqn. (48) in a channel parallel to the circuit computing $|X|_{32513}$. First, the difference $|X|_{25} - |X|_{1321}$ is computed as a 12-bit U2 number, which is partitioned into three 4-bit wide fields. The fields are multiplied by 6 modulo 25 using three 4-input ROMs. The ROM outputs are added with an adder modulo 25, whose output is multiplied by 1321. Finally, $|X|_{1321}$ is added to the multiplication result to get the residue modulo 33025.
Table 5. Size in standard unit-gates of converters from the RNS $(2^k - 1, 2^k, 2^k + 1)$ and of the proposed HRNS converters. The overhead in the last two columns is defined as the difference between the number of gates required for HRNS converters and for the RNS converter.
[...] the 5-moduli RNS, although the number of moduli in the HRNS is equal to 5 for dynamic ranges less than $2^{30}$ and larger than 5 for all RNSs with the dynamic range $\ge 2^{30}$. For example, the converter for the 90-bit HRNS with 12 moduli takes almost the same area as the converter for the 5-moduli RNS with the same dynamic range. This result shows that converters for the proposed HRNS have much better area characteristics than highly optimized converters for RNSs with moduli of the form $2^k$ and $2^k \pm 1$.
The fastest HRNS converters are larger than those for the 5-moduli RNS, but for many HRNSs there are no 5-moduli RNSs with similar dynamic ranges. Moreover, for some 5-moduli RNSs with a similar dynamic range the area overhead of the fastest HRNS converter can be offset by replacing 5-moduli multipliers with HRNS multipliers. For example, for the 90-bit dynamic range the fastest HRNS converter is larger by 3169 logic gates and the HRNS multiplier is smaller by 2522 logic gates. Thus, the converter area overhead can be offset by only two multipliers.
It is worth noting that for the 90-bit dynamic range the area of the smallest converter for the proposed HRNS is similar to that of the converter for the 5-moduli RNS. Thus, with comparable conversion cost, the HRNS offers much more parallelism (12 moduli instead of 5). This also shows that the proposed idea of the multi-level HRNS with a very simple top-level converter based on moduli $2^k$, $2^k \pm 1$ results in a large dynamic range and a high degree of parallel processing, allowing high performance of the arithmetic circuit while maintaining low converter cost.
