Abstract. Pipelined two-operand modular adder
Introduction
Modular addition plays an important role in the implementation of digital signal processing systems that use the residue number system (RNS) [1]-[4] as well as its derivatives, such as the quadratic residue number system (QRNS) [5] and the modified quadratic residue number system (MQRNS) [6], for the processing of complex signals. The RNS is a nonweighted integer number system that is determined by its base $B = \{m_1, m_2, \ldots, m_n\}$ of pairwise relatively prime moduli. Operations are performed independently on the residues for each modulus, $|X \circ Y|_{m_i}$, where $\circ$ denotes addition, subtraction or multiplication.
The reverse conversion from the RNS to a weighted system can be performed using the Chinese remainder theorem (CRT) [1], [2] or the mixed-radix system (MRS) [1], [2]. The main advantage of the RNS is that addition, subtraction and multiplication are carry-free, that is, they are performed without carries between individual positions of the number. For high-speed DSP, the principal advantage of the RNS is the replacement of large multipliers, which limit the pipelining frequency, by small multipliers modulo $m_i$. If their binary size $l = \lceil \log_2 m_i \rceil$, where $\lceil \cdot \rceil$ denotes rounding up to an integer, does not exceed six bits, multiplications by a constant can be performed by look-up using small ROMs or by combinatorial networks. General multiplications are also easier to perform, because their standard realizations are small or segmentation of operands can be used for the combinatorial realization. It is worth mentioning that moduli with $l < 7$ may provide dynamic ranges of over 90 bits [7]. An additional advantage of the RNS is the possibility of reducing power dissipation in CMOS circuits, owing to the lower switching activity and reduced supply voltages [9]. The RNS has found numerous applications in DSP, for example in FIR filters [8]-[11], FFT processors [12], digital downconversion [13] and image processing [14], [15].
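As a small illustration of the carry-free property, the following Python sketch performs addition and multiplication independently in each residue channel and recovers the result with the CRT. The base (29, 31, 32) and all function names are assumptions chosen for illustration only; they are not taken from the paper.

```python
from math import prod

# Hypothetical base of pairwise relatively prime 5-bit moduli (an assumption).
BASE = (29, 31, 32)
M = prod(BASE)  # dynamic range of this illustrative base


def to_rns(x, base=BASE):
    """Forward conversion: the residue of x for each modulus."""
    return tuple(x % m for m in base)


def rns_op(xr, yr, op, base=BASE):
    """Apply op channel by channel; no carries pass between channels."""
    return tuple(op(a, b) % m for a, b, m in zip(xr, yr, base))


def from_rns(residues, base=BASE):
    """Reverse conversion using the Chinese remainder theorem."""
    big_m = prod(base)
    total = 0
    for r, m in zip(residues, base):
        m_hat = big_m // m
        # pow(m_hat, -1, m) is the modular inverse (Python 3.8+).
        total += r * m_hat * pow(m_hat, -1, m)
    return total % big_m


x, y = 1234, 567
s = rns_op(to_rns(x), to_rns(y), lambda a, b: a + b)
p = rns_op(to_rns(x), to_rns(y), lambda a, b: a * b)
```

Converting `s` and `p` back with `from_rns` gives the sum and the product reduced modulo `M`, which is the expected behaviour as long as the true result is interpreted within the dynamic range of the base.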
Generally, two-operand modular adders (TOMAs) can be divided into two main categories determined by the type of the modulus: TOMAs for moduli akin to $2^n$ form the first category, and those for generic moduli the other. Several works in the literature consider the TOMA design.
Banerji [16] presented a look-up approach. Agrawal and Rao [17] proposed a TOMA for moduli of the form $(2^n + 1)$ based on binary adders. Soderstrand [18] introduced a hybrid approach based on a look-up table along with a binary adder. Bayoumi and Jullien [19] described TOMAs using the table approach and the binary adder approach. Dugdale [20] demonstrated an implementation of TOMAs that used binary adders. Piestrak [21] proposed a TOMA based on the carry-save adder (CSA) and two binary adders. Zimmermann [22] introduced modulo $(2^n \pm 1)$ adders based on the parallel-prefix architecture (PPA). Hiasat [23] proposed a TOMA with a reduced area based on the carry-look-ahead (CLA) adder. A novel delay-power-area-efficient approach to the TOMA design was also given by Patel et al. [24]. Their TOMA structure was based on the cascaded connection of a modified carry-save adder (CSA) and a reduced carry-propagate adder (CPA). The CPA designs used included the ELM [25], Kogge-Stone [26] and Ladner-Fischer [27] PPAs.
In this paper we propose a new TOMA based on a modified CLA adder. This TOMA has a smaller area than the other considered TOMAs and allows one to derive a new pipelined TOMA that is better than other known pipelined TOMAs in terms of the area and the number of pipeline register stages. We shall show the structure of the new pipelined TOMA and, for comparison, TOMAs based on the ripple-carry adder (RCA), the PPA in the Brent-Kung form [28] and the Hiasat TOMA [23]. Comparisons are made using the data from the VLSI standard cell library. We shall compare the structures of the individual TOMAs in terms of area, delay and pipelining frequency with the use of the additive method. The method sums the areas of the individual components expressed in gate equivalents (GE), where 1 GE is the area of a NAND gate with fan-out = 1 in the given standard cell library. The propagation delay of an individual element is taken as the worst-case delay over all possible inputs. The analysis relies upon the established 130 nm Samsung standard cell library STDH150 [29]. Calculations of areas and delays of individual components are practically technology independent and can be scaled down for VLSI technologies such as 28 nm or 22 nm. We may therefore suppose that, for the comparison of individual digital structures, the assumed technology gives sufficient and dependable information. The paper has the following structure: in Sec. 2 we review the basic TOMA structures, in Sec. 3 we consider the TOMA-RCA, in Sec. 4 the Hiasat TOMA, in Sec. 5 the TOMA based on the PPA adder, and finally in Sec. 6 the new TOMA. In each section we analyze a nonpipelined and a pipelined form.
Basic TOMA Structures Based on Binary Adders
In this section we shall briefly describe the basic known TOMA structures that use exclusively binary adders in series. Such structures may be the most suitable for transformation to the pipelined form, in contrast to those that use two parallel adders, as in [21]. We shall briefly analyze the operation of the Bayoumi-Jullien TOMA.
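The behaviour of a TOMA built from two binary adders in series can be sketched as follows. This is only a behavioural model, not a gate-level description of the Bayoumi-Jullien circuit: it assumes that the first adder forms X + Y, that the second adder adds the two's complement of m, and that the OR of the two carry-outs drives the output multiplexer. The function name `toma` and the parameter `k` are illustrative.

```python
def toma(x, y, m, k=5):
    """Behavioural model of (x + y) mod m using two k-bit adders in series.

    Assumes 0 <= x, y < m <= 2**k.
    """
    mask = (1 << k) - 1

    # First adder: X + Y, split into carry-out c1 and the k low-order bits.
    s = x + y
    c1, s_low = s >> k, s & mask

    # Second adder: add 2**k - m, i.e. subtract m in two's complement.
    d = s_low + ((1 << k) - m)
    c2, d_low = d >> k, d & mask

    # Either carry being set means X + Y >= m, so the MUX selects X + Y - m.
    return d_low if (c1 | c2) else s_low
```

For example, `toma(20, 15, 29)` selects the output of the second adder, because 20 + 15 exceeds the modulus.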
TOMA-RCA
By way of introduction we shall consider the realization of the Bayoumi-Jullien TOMA based on the RCA. In order to obtain a pipelined structure, layers of pipeline registers consisting of flip-flops (FFs) have to be inserted between the individual adders, as shown in Fig. 5. In the following we shall analyze the area of the TOMA-RCA expressed in GE, the delay and the maximum attainable pipelining frequency. The area will be estimated using the areas of the individual components from STDH150; the delay of a nonpipelined structure will be evaluated using the maximum delays of the individual components. In order to estimate the pipelining frequency, the structure is divided into layers that are balanced with respect to the delay, and the maximum pipelining frequency is obtained as the inverse of the sum of the delay of the slowest layer and the FF delay.
A. Nonpipelined 5-bit TOMA-RCA area
The indices of the individual components come from STDH150. The data of individual components is given in Appendix A. After inserting these data into (1) 
, we obtain 81 68
B. Nonpipelined 5-bit TOMA-RCA delay
We shall now estimate the delay of the nonpipelined structure. The delay of the 5-bit TOMA-RCA can be expressed as
The delays for the bits $s_4$ and $c_5$ can be calculated as
In order to compute $c'_5$, we shall first calculate $t_{c_i}$ and
C. Pipelined 5-bit TOMA-RCA area
The area is the sum of the nonpipelined 5-bit TOMA-RCA area and the area of the pipeline registers. In this case these registers require $n_s = 66$ FFs. Thus the area can be expressed as
As $A_{FF}$ we shall use the area of the flip-flop FD1Q, $A_{FD1Q}$, from STDH150. For the structure from Fig. 5 we obtain $A_{TOMA-RCAp} = 472.9$ GE.
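The additive area estimate for the pipelined structure can be checked with a few lines of Python. The sketch below assumes $A_{FD1Q} = 5.67$ GE, the value quoted for the Brent-Kung structure later in the paper; the nonpipelined area it prints is derived here by subtraction and is not quoted from the text. The function and variable names are illustrative.

```python
def pipelined_area(a_nonpipelined, n_ff, a_ff):
    """Additive estimate: core area plus the area of all pipeline flip-flops (GE)."""
    return a_nonpipelined + n_ff * a_ff


A_FD1Q = 5.67         # GE per flip-flop, as quoted for the TOMA-BK
N_S = 66              # flip-flops in the pipelined TOMA-RCA
A_TOMA_RCA_P = 472.9  # GE, pipelined TOMA-RCA of Fig. 5

register_area = N_S * A_FD1Q
# Core area implied by the figures above (a derived value, not a quoted one).
implied_core_area = A_TOMA_RCA_P - register_area

print(f"registers: {register_area:.2f} GE, implied core: {implied_core_area:.2f} GE")
```

Under these assumptions the pipeline registers account for roughly four fifths of the total area, which is why the number of flip-flops dominates the comparison of pipelined structures.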
D. Pipelined 5-bit TOMA-RCA pipelining rate
In order to design a pipelined structure of a TOMA, we have to decompose its nonpipelined structure into a certain number of layers and place pipeline registers between them. The decomposition is, to a certain extent, arbitrary. The lower limit of the number of layers is two, and the upper limit is determined by the delay of the component that we treat as indivisible. The minimum clock period is approximately the sum of the delay of the slowest layer and the delay of the pipeline register, and the maximum pipelining rate is its inverse. In this case we have assumed that a register layer is placed after each FA or HA, and that the OR gate and the MUXs are in the same layer. Hence we may evaluate the maximum delay of the layer as
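The relation between the layer delays and the attainable pipelining rate can be written as a short helper. The delay values below are hypothetical placeholders, not STDH150 data; only the formula, the inverse of the slowest layer delay plus the flip-flop delay, follows the text.

```python
def max_pipelining_frequency(layer_delays_ns, t_ff_ns):
    """Return the maximum pipelining frequency in GHz for delays given in ns."""
    # The clock period is set by the slowest layer plus the register delay.
    t_clk_ns = max(layer_delays_ns) + t_ff_ns
    return 1.0 / t_clk_ns


# Hypothetical layer delays and flip-flop delay in ns (illustrative only).
layers = [0.30, 0.42, 0.38]
t_ff = 0.25
f_max = max_pipelining_frequency(layers, t_ff)
```

Because only the slowest layer matters, balancing the layers raises the frequency, whereas adding registers to layers that are already fast only increases the area.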
Hiasat TOMA
In the following we shall examine the results of transforming the Hiasat TOMA, which requires the smallest hardware amount among the known TOMAs. This TOMA consists of a serial connection of five units: the sum-and-carry unit (SAC), the carry propagate and generate unit (CPG), the CLA for $c_{OUT}$, the multiplexer (MUX), and the CLA and summation unit (CLAS). The SAC is composed of HAs and HALs (the modified HAs of [23]). The SAC performs
In the following stage, the MUX uses $c_{OUT}$ to select the carry-propagate and generate signals, see (14). In the next step the sum bits are calculated as
First we shall determine the area for components of the Hiasat five-bit TOMA and then the area for m = 29.
A. 5-bit Hiasat TOMA area
The area of the five-bit Hiasat TOMA can be computed as follows 
In general, the area of the CFG_5 can be expressed as
The CLAS block consists of the five-bit Propagate-Generate Unit (PGU_5), the Carry-Generate Unit (CGU_5) and the Summation Unit (SU_5). Its hardware amount can be estimated as
with the fan-outs 1, 3, 3, 4, 2. We get 
and GE 0 15 5
Example 2. Area of the five-bit Hiasat TOMA for m = 29.
The two's complement system (TCS) representation of $(-m)$ is equal to 100011, hence $w = 3$ and $k - w = 2$ (the sign bit is excluded).
In (25), $n_h$ is the number of flip-flops used in the pipeline registers. For example, for the structure from Fig. 6
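The bit pattern used in Example 2 can be reproduced with the sketch below, which forms the (k + 1)-bit two's complement of $-m$ and counts the zeros and ones among the k bits that remain after the sign bit is dropped. Reading $w$ as the number of zeros is an assumption made here because it matches the quoted values; the function name is illustrative.

```python
def tcs_of_negative(m, k):
    """(k + 1)-bit two's complement representation of -m as a bit string."""
    width = k + 1
    return format((1 << width) - m, "0{}b".format(width))


k = 5
bits = tcs_of_negative(29, k)      # sign bit followed by k magnitude bits
magnitude_bits = bits[1:]          # the sign bit is excluded
zeros = magnitude_bits.count("0")  # assumed to correspond to w
ones = magnitude_bits.count("1")   # assumed to correspond to k - w
```

For $m = 29$ this gives the string 100011 with three zeros and two ones after the sign bit, in agreement with $w = 3$ and $k - w = 2$.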
D. Pipelining frequency of pipelined 5-bit Hiasat TOMA
In Fig. 6 , a pipelined form of the Hiasat TOMA is presented. Five pipeline register stages are used with 58 flip-flops.
In this case we have adopted a decomposition into six layers that leads to a balanced structure. In order to evaluate the maximum pipelining frequency we shall calculate delays of the adopted individual layers. The maximum pipelining frequency will depend on the delay of the layer with the maximum delay and the delay of the assumed pipeline register. These layers have the following delays: 
PPA-based TOMA
As the next structure we shall consider the TOMA based on a PPA. The Brent-Kung (BK) adder [28] has been selected as the PPA. The Brent-Kung TOMA can be transformed to the pipelined form relatively easily; moreover, the use of the Brent-Kung PPA allows one to simplify the adder used in the second stage, where one of the addends is a constant. The prefix operator $\circ$ is defined as
where
The block that implements (27a-b) will be denoted as BK i . Subsequently we shall analyze the area and delay of the TOMA based on two BK adders.
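For orientation, the sketch below uses the standard definition of the parallel-prefix operator on generate/propagate pairs, $(g, p) \circ (g', p') = (g + p g', p p')$, which is assumed here to be what (27a-b) state. The prefixes are evaluated serially for clarity; a Brent-Kung adder arranges the same associative operator in a tree. All names are illustrative.

```python
def prefix_op(left, right):
    """Combine two (generate, propagate) pairs; left is the more significant one."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def carries(a, b, k=5):
    """Carry out of each bit position of a + b, assuming carry-in c_0 = 0."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(k)]
    acc = gp[0]
    out = [acc[0]]
    for i in range(1, k):
        # The group generate of positions i..0 is the carry out of position i.
        acc = prefix_op(gp[i], acc)
        out.append(acc[0])
    return out
```

The operator is associative, which is what allows the prefixes to be regrouped into a tree of BK blocks without changing the computed carries.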
The area of the TOMA BK 
A. The area of BK adder
After transforming the logic functions used for the realization of the individual adders in (29), we obtain the following areas
Using (29) and (30)
C. The area of BK-m adder
The form of the first layer of the BK-m adder depends on the TCS representation of -m, m . We shall analyze the prefix operator computation for a pair of bits ( i m ,
(27a-b) can be expressed as ) ( 
D. The area of the pipelined TOMA BK
This area can be evaluated as
where $n_{BK}$ is the number of flip-flops in the pipeline registers. For the structure from Fig. 7 with $n_{BK} = 51$ and $A_{FF} = A_{FD1Q} = 5.67$ GE, we get 380.84 GE.
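Since the flip-flop counts of the three reference structures are given in the text (66 for the TOMA-RCA, 58 for the Hiasat TOMA and 51 for the TOMA-BK), the register overhead alone can be compared directly. The sketch below assumes the same flip-flop area $A_{FD1Q} = 5.67$ GE for all three structures; the dictionary keys are labels chosen here.

```python
A_FD1Q = 5.67  # GE per flip-flop

# Flip-flop counts of the pipelined reference structures, as stated in the text.
FLIP_FLOPS = {"TOMA-RCA": 66, "Hiasat TOMA": 58, "TOMA-BK": 51}

# Area of the pipeline registers only, in GE.
register_area = {name: n * A_FD1Q for name, n in FLIP_FLOPS.items()}

for name, area in sorted(register_area.items(), key=lambda item: item[1]):
    print(f"{name}: {area:.2f} GE in pipeline registers")
```

This covers only the register part of the additive estimate; the area of the combinational logic of each structure has to be added to obtain the totals compared in the paper.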
New Five-bit TOMA
In this section we shall show a new TOMA structure and its pipelined form, which requires a smaller area than the other TOMA structures. The TOMA is configured as a serial connection of an X + Y adder and an X + Y - m adder, designed in a manner that leads to a substantial simplification and thus to a smaller delay or a smaller number of pipeline levels. Both adders are modifications of the standard CLA adder. In the first stage of the proposed structure the propagate and generate signals and the transfer functions [30] $t_i = a_i + b_i$ are used. The first three carries $c_1$, $c_2$ and $c_3$ are computed simultaneously, and $c_3$ is used to generate $c_4$ and $c_5$.
Generally, the computation of the carry $c_i$ can be expressed, assuming $c_0 = 0$, as
In the above formulas the transfer function $t_i = a_i + b_i$ is used instead of $p_i$, which is justified as follows
We may express $c_2$ and $c_3$ as functions of $g_i$ and $t_i$ as
Consequently, we obtain
In the adder realization the above equations are transformed to the NAND form. The sum bits are generated using $s_i = p_{i-1} \oplus c_i$.
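The substitution of $t_i$ for $p_i$ rests on the identity $g_i + p_i c_i = g_i + t_i c_i$, which holds because $t_i = p_i + g_i$ and $g_i + g_i c_i = g_i$. The sketch below checks this for a five-bit X + Y adder. The particular expansions of $c_4$ and $c_5$ in terms of $c_3$ are an assumption consistent with the description above, and the sum bits use the usual indexing $s_i = p_i \oplus c_i$ with $c_0 = 0$, which may differ from the index convention of the paper.

```python
def cla_carries_with_transfer(a, b):
    """Carries c_0..c_5 of a five-bit addition using g_i and t_i = a_i OR b_i."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(5)]
    t = [((a >> i) | (b >> i)) & 1 for i in range(5)]
    # c_1, c_2 and c_3 depend only on g and t, so they can be formed in parallel.
    c1 = g[0]
    c2 = g[1] | (t[1] & g[0])
    c3 = g[2] | (t[2] & g[1]) | (t[2] & t[1] & g[0])
    # c_4 and c_5 are derived from c_3 (assumed two-level expansion).
    c4 = g[3] | (t[3] & c3)
    c5 = g[4] | (t[4] & g[3]) | (t[4] & t[3] & c3)
    return [0, c1, c2, c3, c4, c5]


def add_with_transfer(a, b):
    """Six-bit result of a + b for five-bit operands, built from the carries above."""
    c = cla_carries_with_transfer(a, b)
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(5)]
    s = sum((p[i] ^ c[i]) << i for i in range(5))
    return s | (c[5] << 5)
```

An exhaustive comparison of `add_with_transfer(a, b)` with `a + b` over all five-bit operand pairs is a quick way to confirm that the transfer function can replace the propagate signal in the carry equations, while the XOR-based $p_i$ is still needed for the sum bits.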
The hardware amount of the X + Y adder can be expressed as
It is seen that the area-delay product has the best values for the TOMA-RCA and the new TOMA; moreover, the new TOMA requires the smallest area for the pipelined structure, but at the cost of a reduced maximum pipelining frequency. In general, the new pipelined TOMA calls for about 35% less area than the TOMA-BK, the best of the three other considered structures.
[Figure: structure of the new five-bit TOMA with inputs a0-a4 and b0-b4, signals p_i, g_i and t_i, carries c2-c5, a MUX and outputs s0-s4.]
Conclusions
The structures of pipelined two-operand modular adders for five-bit moduli based on the ripple-carry adder, the Brent-Kung adder and the Hiasat adder have been presented and analyzed with respect to the area, the number of layers and the attainable pipelining frequency. A new structure of the two-operand modular adder based on the modified carry-look-ahead adder has also been proposed. It has been shown that the new pipelined adder has the smallest number of pipeline layers, and an area about 35% smaller than that of the best of the other considered structures.
