A novel approach for the reduction of the power dissipated in a signal processing application is introduced in this paper. By exploiting the properties of the Polynomial Residue Number System (PRNS) and of the arithmetic modulo $(2^n + 1)$, the power dissipation of implementing cyclic convolution is reduced up to four times. Furthermore, the corresponding power×delay product is reduced up to 2.4 times, while a simultaneous reduction of area cost is achieved. The particular performance improvement becomes possible by introducing a way to minimize the forward and inverse conversion overhead associated with PRNS. The introduced minimization exploits the fact that, for particular lengths of data sequences and particular moduli, the conversions require only multiplications with powers of two and additions, thus leading to low implementation complexity. In addition, multiple supply voltages are utilized to further reduce power dissipation by more than 30% for particular cases. Formulas that return the applicable supply voltage values per PRNS channel are derived in this paper.
INTRODUCTION
Recently, low power consumption has emerged as a major design optimization objective, due to the need for portable electronic equipment as well as a means for increasing reliability in high-performance systems. Several design techniques have been proposed to minimize power dissipation, spanning all levels of the design abstraction [1], from system level down to the VLSI technology level. Among the various optimization techniques, optimal arithmetic selection has been shown to have an impact on overall system power dissipation [2]. In this paper, a new approach is proposed for the design of low-power digital signal processing equipment. By utilizing a number theoretic approach, it is shown that a significant reduction of power dissipation can be achieved, combined with a reduction of the system area cost, at the cost of increased delay. Further optimization is possible by utilizing multiple supply voltages.
The Residue Number System (RNS) [3] is an integer system capable of supporting parallel, carry-free, high-speed arithmetic. The system also offers some useful properties for error detection, error correction and fault tolerance in digital systems. Important areas of application of the RNS include Digital Signal Processing (DSP) intensive computations, such as digital filtering, convolutions, correlations and DFT and FFT computations. Recent work in RNS arithmetic has resulted in the development of the Polynomial Residue Number System (PRNS) [4] which is capable of multiplying two polynomials using minimum computational complexity.
The PRNS examines the problem of multiplying two $(N-1)$-degree polynomials mod $(x^N \pm 1)$ in some modular ring $Z_m = \{0, 1, \ldots, m-1\}$, a ring which is closed with respect to the operations of addition and multiplication mod $m$. This system can perform the above polynomial product using the minimal number of multiplications. It should be noted that the polynomial product of two $(N-1)$-degree polynomials mod $(x^N - 1)$ implements the cyclic convolution of two $N$-point sequences, a task which is useful in efficiently computing linear convolutions. It should also be noted that the linear convolution of two sequences is a very useful computation because it mechanizes digital filtering. The PRNS isomorphic mapping is given by
Eq. (6) Let $C_{\text{PRNS}}$ and $C_{\text{non-PRNS}}$ denote the computational requirements for computing $\langle A(x)B(x) \rangle_{x^N \pm 1}$ in $Z_m$ using the PRNS and the traditional technique, respectively. Then $C_{\text{PRNS}}$ and $C_{\text{non-PRNS}}$ are given by
In (7) and (8) […] (3) and (5) become multiplications by powers of two, which can be implemented with simple shift operations, thereby simplifying the computational hardware. It can easily be shown that for several moduli of the form $m = 2^n + 1$ and several choices of $N$, the $N$ roots $r_i$, $i = 0, 1, \ldots, N-1$, modulo $2^n + 1$ are all perfect powers of two. If the diminished-1 system [5], [6] is used for performing arithmetic mod $2^n + 1$, then multiplications by powers of two can be implemented with leftwise rotations and complementation operations, so that very little computational hardware is required (only the inverters responsible for complementing some bits of the number being rotated). More on diminished-1 arithmetic mod $(2^n + 1)$ will be offered in Section 2 of the paper. [4].
Due to the fact that if $N$ is even and $r$ is a root of $x^N \pm 1 \equiv 0 \pmod{m}$ then $\langle -r \rangle_m$ is also a root, the number of scalings required for the forward PRNS mapping can be reduced to almost one half.
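The power-of-two property of the roots can be checked numerically. The following Python sketch is illustrative only: the helper names `roots_of_unity_poly` and `is_power_of_two_mod` are hypothetical, and the choice $N = 8$, $m = 17$ is borrowed from the example discussed later in the paper. It finds the roots of $x^N - 1$ in $Z_m$ by brute force, checks that each one is a power of two modulo $m$, and checks that $\langle -r \rangle_m$ is a root whenever $r$ is.

```python
def roots_of_unity_poly(N, m):
    """Return all r in Z_m with r**N == 1 (mod m), i.e. the roots of x^N - 1."""
    return [r for r in range(1, m) if pow(r, N, m) == 1]


def is_power_of_two_mod(r, m):
    """Return True if r == 2**k (mod m) for some k >= 0."""
    p = 1
    for _ in range(m):           # the powers of two repeat within m steps
        if p == r:
            return True
        p = (2 * p) % m
    return False


N, m = 8, 17                     # m = 2**4 + 1
roots = roots_of_unity_poly(N, m)
all_pow2 = all(is_power_of_two_mod(r, m) for r in roots)
closed_under_negation = all(((-r) % m) in roots for r in roots)
```

Because the roots come in pairs $(r, \langle -r \rangle_m)$ when $N$ is even, the even-degree and odd-degree parts of a polynomial need to be scaled only once per pair, which is the source of the near-halving mentioned above.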
The remainder of the paper is organized as follows: In Section 2 the basics of diminished-1 arithmetic are reviewed. In Section 3 the area, time and power dissipation performance of a PRNS architecture that exploits the scalings by powers of two for the forward and inverse converters is quantified and compared to a non-PRNS architecture. The impact of multiple supply voltages on the PRNS convolver architecture is discussed in Section 4. Finally, conclusions are presented in Section 5.
DIMINISHED-1 ARITHMETIC
Diminished-1 arithmetic has been proposed by Leibowitz [5] as an efficient means for performing arithmetic modulo $2^n + 1$. In diminished-1 arithmetic, the quantity $\langle x - 1 \rangle_{2^n+1}$ is used as an image of $x \in Z_{2^n+1}$. The particular mapping allows non-zero quantities to be represented using $n$ bits, while zero is mapped onto the quantity $2^n$, which requires $n + 1$ bits for its representation.
When performing arithmetic mod $(2^n + 1)$ using the diminished-1 system, all input operands and the corresponding results are expressed in diminished-1 form.
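As a small illustration of the mapping, the following Python sketch encodes and decodes diminished-1 values. The function names `to_dim1` and `from_dim1` are hypothetical and chosen for this example only.

```python
def to_dim1(x, n):
    """Map x in Z_(2**n + 1) to its diminished-1 image <x - 1> mod (2**n + 1)."""
    m = (1 << n) + 1
    return (x - 1) % m


def from_dim1(d, n):
    """Recover x from its diminished-1 image d."""
    m = (1 << n) + 1
    return (d + 1) % m
```

For $n = 4$, the non-zero values 1 through 16 map onto 0 through 15 and fit in four bits, whereas zero maps onto 16, which needs a fifth bit.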
By exploiting the diminished-1 representation, addition mod $(2^n + 1)$ is performed as an end-around carry operation, in two phases: An ordinary $n$-bit addition is performed, the carry out of which is negated and added back. Efficient parallel VLSI diminished-1 structures for modulo $2^n + 1$ two-operand addition have also been recently proposed [7].
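The two-phase addition can be sketched in Python as follows. This is a behavioral model under stated assumptions, not a description of the circuits in [7]: the name `dim1_add` is hypothetical, and zero operands (encoded as $2^n$) are handled by an explicit bypass, which is an assumption of this sketch.

```python
def dim1_add(a_d, b_d, n):
    """Add two diminished-1 operands modulo 2**n + 1.

    a_d and b_d are diminished-1 images; the value 2**n encodes zero.
    """
    zero = 1 << n
    if a_d == zero:              # 0 + b = b
        return b_d
    if b_d == zero:              # a + 0 = a
        return a_d
    s = a_d + b_d                # phase 1: ordinary n-bit addition
    carry = s >> n               # carry out of the n-bit adder (0 or 1)
    low = s & (zero - 1)         # n-bit sum
    return low + (1 - carry)     # phase 2: add back the negated carry
```

When no carry is produced and the $n$-bit sum consists of all ones, the result is $2^n$, which is the diminished-1 code of zero, as expected when the two operands sum to a multiple of $2^n + 1$.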
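Negation mod $(2^n + 1)$ is performed in the diminished-1 system as follows: When $A \neq 0$ and $A \in Z_{2^n+1}$, $A_{\text{dim-1}} = A - 1$. By taking the one's complement of $A - 1$, the quantity $\langle -A \rangle_{2^n+1}$ in diminished-1 format is obtained. In PRNS processing, scaling by powers of two is an important operation and it can be efficiently implemented in the diminished-1 system. In particular, a multiplication by a power of two can be implemented with a leftwise rotation and the complementation of some of the rotated bits.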
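Negation and scaling by powers of two in the diminished-1 system can be modeled with a few bit operations. The Python sketch below is illustrative; the function names are hypothetical. It relies on the identity $2^n \equiv -1 \pmod{2^n + 1}$, from which doubling a non-zero operand amounts to a one-bit left rotation in which the bit that wraps around is complemented.

```python
def dim1_negate(a_d, n):
    """Negate a diminished-1 operand: one's complement of the n-bit image."""
    zero = 1 << n
    if a_d == zero:              # -0 = 0
        return zero
    return a_d ^ (zero - 1)


def dim1_times_two(a_d, n):
    """Double a diminished-1 operand: rotate left, complementing the wrapped bit."""
    zero = 1 << n
    if a_d == zero:              # 2 * 0 = 0
        return zero
    msb = (a_d >> (n - 1)) & 1
    rest = a_d & ((1 << (n - 1)) - 1)
    return (rest << 1) | (msb ^ 1)


def dim1_scale_pow2(a_d, k, n):
    """Multiply a diminished-1 operand by 2**k via k complemented rotations."""
    for _ in range(k):
        a_d = dim1_times_two(a_d, n)
    return a_d
```

In hardware the $k$ rotations collapse into fixed wiring plus inverters on the wrapped bits, so no adders are involved in the scaling.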
PERFORMANCE OF PRNS ARCHITECTURES
In the following, the area, time and power dissipation performance of $N$-point cyclic convolution using the PRNS is quantified. The organization of the PRNS system is depicted in Fig. 1. The PRNS performance is compared to the performance of an architecture that employs conventional modular arithmetic. It is shown that the PRNS can substantially reduce both the area cost and the power dissipation of a system, even when the corresponding forward and inverse conversion overhead is taken into consideration.
The comparisons assume VLSI PRNS architectures that employ the modulo-$(2^n + 1)$ multiplier by Wang et al. [6], and carry-save $(3,2)$-counter Wallace trees for the $N$-operand additions mod $(2^n + 1)$, required by the forward and inverse converters. These structures employ diminished-1 arithmetic mod $2^n + 1$. The non-PRNS architectures employ the same multiplier and multi-operand adder structures for the direct computation of the cyclic convolution.
A novel observation is employed in this paper to simplify the converter design and hence reduce the PRNS implementation complexity. In particular, when certain moduli $m = 2^n + 1$ are utilized for certain values of $N$, the roots of the polynomial $x^N - 1$, and the multiplicative inverses of the roots and of $N$, are powers of two, as listed in Table 1 for several values of $N$ and $m$. Hence, the scalings in (3) and (5) are reduced to scalings by powers of two, which can be efficiently implemented in diminished-1 arithmetic as rotations and bit-negation operations, with very low hardware complexity. This is demonstrated for $N = 8$ and $m = 2^4 + 1 = 17$ in Fig. 2, for $a^*_1$. The remainder of the $a^*_i$ require scalings of similar implementation complexity.
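The following Python sketch illustrates the complete PRNS flow for $N = 8$ and $m = 17$. Since equations (3) and (5) are not reproduced here, the sketch assumes the usual evaluation and interpolation form of the mapping for $x^N - 1$: the forward conversion evaluates each polynomial at the roots $r_i = 2^i$, the channels multiply pointwise, and the inverse conversion scales by $N^{-1}$ and by the inverse roots. All function names are hypothetical.

```python
def cyclic_convolution_direct(a, b, m):
    """Reference N-point cyclic convolution in Z_m (N*N multiplications)."""
    N = len(a)
    return [sum(a[j] * b[(i - j) % N] for j in range(N)) % m for i in range(N)]


def prns_forward(a, roots, m):
    """Forward conversion: a*_i = A(r_i) mod m for every root r_i."""
    return [sum(c * pow(r, j, m) for j, c in enumerate(a)) % m for r in roots]


def prns_inverse(a_star, roots, m):
    """Inverse conversion: a_j = N^-1 * sum_i a*_i * r_i^(-j) mod m."""
    N = len(roots)
    n_inv = pow(N, -1, m)                      # a power of two for N = 8, m = 17
    inv_roots = [pow(r, -1, m) for r in roots]
    return [n_inv * sum(v * pow(ri, j, m) for v, ri in zip(a_star, inv_roots)) % m
            for j in range(N)]


def cyclic_convolution_prns(a, b, m):
    """Cyclic convolution through the PRNS: convert, multiply pointwise, convert back."""
    N = len(a)
    roots = [pow(2, i, m) for i in range(N)]   # assumes 2 has order N modulo m
    a_star = prns_forward(a, roots, m)
    b_star = prns_forward(b, roots, m)
    c_star = [(x * y) % m for x, y in zip(a_star, b_star)]  # N multiplications
    return prns_inverse(c_star, roots, m)


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
result_prns = cyclic_convolution_prns(a, b, 17)
result_direct = cyclic_convolution_direct(a, b, 17)
```

In this model every scaling inside the converters is a multiplication by a power of two modulo 17, so only the $N$ pointwise products require general multipliers.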
The area, time and power dissipation performance of the PRNS-enhanced cyclic convolution architectures is summarized in Table 2, for various numbers of points $N$ and several moduli of the form $m = 2^n + 1$. The relative performance of a PRNS and a non-PRNS architecture is compared in terms of the ratios
$$\frac{A_{\text{non-PRNS}}}{A_{\text{PRNS}}}, \quad \frac{T_{\text{non-PRNS}}}{T_{\text{PRNS}}}, \quad \frac{P_{\text{non-PRNS}}}{P_{\text{PRNS}}}, \quad \frac{PT_{\text{non-PRNS}}}{PT_{\text{PRNS}}},$$
where $A_x$, $T_x$, $P_x$, and $PT_x$ denote the area, time, power dissipation and power×delay product complexity of the architecture $x$, i.e., non-PRNS or PRNS. When a ratio assumes a value $r$ larger than one, $r > 1$, then the performance of the PRNS system is $r$ times better than the corresponding non-PRNS architecture; a value $r < 1$ denotes worse performance of the PRNS system.
Table 2: Area, time, power dissipation, and power×delay performance of $N$-point cyclic convolution using PRNS and modulo $2^n + 1$ arithmetic, compared to a non-PRNS implementation, for single-modulus channel implementations.
The area, time and power dissipation performance of the cells that build the diminished-1 arithmetic circuits, for both PRNS and non-PRNS architectures, is obtained from a 0.7-µm CMOS library [8]. Table 2 reveals that the totally parallel implementation of the $N$-point cyclic convolution using PRNS achieves significant area and power savings, at the cost of a higher delay, due to the forward and inverse conversion. As shown in the sixth column of Table 2, the power×delay product of the PRNS implementation can be up to 2.4 times better than the power×delay product of the traditional implementation, at only a fraction of the area. As the number $N$ of points grows larger, PRNS area and power dissipation savings increase over the corresponding performance of the non-PRNS system. The particular behavior of the experimental results is consistent with the computational complexities given by (7) and (8).
MULTI-VOLTAGE CONSIDERATIONS
In case of a PRNS system which includes several moduli, further optimization is possible by using a different supply voltage for each modulo channel. The capacitance along the critical path of the employed modulo $m = 2^n + 1$ diminished-1 multiplier is
$$C_{crit}(m) = h(\log_2(m-1))\, C_{FA} + \log_2(m-1)\, C_{FA} + C_{mux} \qquad (9)$$
where $C_{FA}$ is the capacitance of a 1-bit full adder, $C_{mux}$ is the capacitance of a 1-bit two-input multiplexer, and $h(L)$ returns the height of a Wallace tree that adds $L$ operands and is recursively computed using [9]:
The critical path capacitance can be exploited to derive the delay along the maximum-delay path of the particular multiplier architecture, which can be approximated by (cf. [10]):
where $V$ is the supply voltage, $k$ depends on implementation technology parameters, and $V_{th}$ is the device threshold voltage. Eq. (12) implies that the delay of a particular modulo-$m$ multiplier depends on $m$ and the supply voltage $V$. The different delays of the various modulo channels can be balanced by properly selecting the supply voltages of each channel. The utilization of multiple supply voltages in RNS FIR filters has been proposed by Del Re et al. [11]. In this paper, we study the application of multiple supply voltages to PRNS architectures, and derive models and formulas that return the supply voltage value per channel. Therefore, for a PRNS system employing three moduli channels $2^4 + 1$, $2^8 + 1$, and $2^{16} + 1$, the supply voltage for each channel can be computed by posing the requirement that the channels that correspond to smaller moduli demonstrate equal delay time to the larger moduli channels, i.e.,
Let $V$ denote the supply voltage of the modulo $2^{16} + 1$ channel, and $V_{17} = \beta_{17} V$ and $V_{257} = \beta_{257} V$ denote the supply voltages for the channels mod $2^4 + 1$ and mod $2^8 + 1$. By combining (9)-(14), equations can be formed the solution of which allows the computation of the supply voltage reduction factors $\beta_{17}$ and $\beta_{257}$:
The solution of (15) and (16) returns
From the two values obtained for each of $\beta_{17}$ and $\beta_{257}$, the one that leads to $V_{17}, V_{257} < V_{th}$ is not legitimate [10]. In order to quantify the voltage reduction factor, capacitance values taken from a 0.7-µm CMOS standard-cell library [8] are utilized as follows: Assuming $C_{FA} = 0.054$ pF, $C_{mux} = 0.067$ pF, $V_{th} = 0.6$ V and $V = 5$ V, it is obtained that $\beta_{17} = 0.472$ and $\beta_{257} = 0.674$. Therefore, without affecting the overall system delay, the mod 17 and mod 257 residue channels can operate at supply voltages $V_{17} = 2.36$ V and $V_{257} = 3.37$ V. The particular supply voltage reduction directly reduces the overall power dissipation of the system. Assuming a Wallace-tree based implementation of a binary multiplier, the power dissipated by an 8-point circular convolution by means of a three-modulus PRNS system, which offers a dynamic range of 28 bits, is reduced by 30%. It is noted that the particular performance improvement does not affect the latency of the system. Furthermore, when compared to a fully parallel conventional binary (non-RNS) implementation of the 8-point convolver, a five-fold reduction in power is anticipated.
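The delay-balancing procedure can be sketched numerically in Python. Two ingredients of the sketch are assumptions, because the corresponding equations are not reproduced in this text: the Wallace-tree height is computed with the usual $(3,2)$-counter reduction, and the delay is taken as the common first-order model $T = k\,C\,V/(V - V_{th})^2$. The factors it returns depend on these assumptions and are not expected to coincide exactly with the values quoted above. All function names are hypothetical.

```python
import math


def wallace_height(L):
    """Levels of (3,2) counters needed to reduce L operands to 2 (assumed recursion)."""
    levels = 0
    while L > 2:
        L = 2 * (L // 3) + L % 3
        levels += 1
    return levels


def c_crit(m, c_fa, c_mux):
    """Eq. (9): critical-path capacitance of the modulo-m multiplier, m = 2**n + 1."""
    n = (m - 1).bit_length() - 1               # log2(m - 1)
    return wallace_height(n) * c_fa + n * c_fa + c_mux


def delay(c, v, v_th, k=1.0):
    """Assumed first-order delay model: T = k * C * V / (V - V_th)**2."""
    return k * c * v / (v - v_th) ** 2


def reduction_factor(c_small, c_large, v, v_th):
    """Solve delay(c_small, beta*v) == delay(c_large, v) for beta with beta*v > v_th."""
    t = c_large * v / (v - v_th) ** 2          # target delay divided by k
    # t*(x - v_th)**2 = c_small*x  ->  t*x**2 - (2*t*v_th + c_small)*x + t*v_th**2 = 0
    b = 2 * t * v_th + c_small
    disc = b * b - 4 * t * t * v_th ** 2
    x = (b + math.sqrt(disc)) / (2 * t)        # larger root; the other lies below v_th
    return x / v


C_FA, C_MUX, V_TH, V = 0.054, 0.067, 0.6, 5.0  # pF, pF, V, V (values from the text)
c17, c257, c65537 = (c_crit(m, C_FA, C_MUX) for m in (17, 257, 65537))
beta17 = reduction_factor(c17, c65537, V, V_TH)
beta257 = reduction_factor(c257, c65537, V, V_TH)
# Dynamic power scales with V**2, so each channel's power scales by beta**2.
power_scale_17, power_scale_257 = beta17 ** 2, beta257 ** 2
```

The product of the two roots of the quadratic equals $V_{th}^2$, so exactly one root lies above the threshold voltage; the sketch keeps that root, in line with the remark that the other value is not legitimate.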
CONCLUSIONS
In this paper it has been shown that by properly selecting the modulus of operation, the conversion complexity inherent in a PRNS-based architecture can be reduced to scalings with powers of two and additions. This substantial reduction leads to significant power×delay product and area savings, in comparison to a traditional architecture for the implementation of $N$-point cyclic convolution. In addition, in a multi-modulus PRNS VLSI architecture, the different delays displayed by the residue channels can be exploited to further reduce power dissipation. This is achieved by reducing the supply voltage of the smaller residue channels, as dictated by (17) and (18).
