nanoseconds. We selected a representative example of a FIR filter with binary weights, and verified using simulation results that the neural network yields weights that enable the filter to perform very close to the theoretical peak performance that one can obtain from the given filter. We showed that the conventional LMS approach is unable to match the performance of the neural network since it cannot select the correct minimum from all the possible minima of the error function.
I. INTRODUCTION
Traditional weighted number systems suffer from carry propagation from low- to high-significance digits. As a consequence, arithmetic operations such as addition and multiplication are slowed down. The carry can be accelerated using special techniques at the expense of additional hardware. The Residue Number System (RNS) is a carry-free system with the capability to support high-speed concurrent arithmetic [1]. Applications such as fast Fourier transform, digital filtering, and image processing utilize the high-speed RNS arithmetic operations, addition and multiplication; they do not require the difficult RNS operations.
The scope of this brief is modulo multiplication. In general, lookup tables and PLA's [6], [7] have been the main logical modules used when the data granularity is the word. It has been found that such structures are efficient only for small moduli. For medium and large moduli, bit-level structures, where the data granularity is the bit, are more efficient [17]. Most of the reported multipliers for large moduli rely on a special set of moduli. In [6], [8] the moduli must be prime numbers. In [9], [10] the moduli must be from the set $\{2^n - 1, 2^n, 2^n + 1\}$. Due to the constraints imposed on the chosen moduli, such approaches have limited applications.
In this brief, we present a modulo multiplier for medium and large moduli. The multiplier is based on a $\Theta(1)$ modulo adder [11]. It is configured as a two-dimensional array of very simple cells (modulo adders). The modulo multiplication is performed in $\Theta(\log n)$ steps with no constraints imposed on the chosen moduli.
II. RESIDUE NUMBER SYSTEM
In RNS, an integer $X$ can be represented by an $S$-tuple of residue digits,
$$X = (r_1, r_2, \ldots, r_S)$$
where $r_i = |X|_{m_i}$, with respect to a set of $S$ moduli $\{m_1, m_2, \ldots, m_S\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is,
$$\gcd(m_i, m_j) = 1 \quad \text{for } i \neq j;$$
then it is shown that there is a unique representation for each number in the range $0 \le X < \prod_{i=1}^{S} m_i = M$, where $S$ is the number of moduli.
The arithmetic operation on two integers $A$ and $B$ is equivalent to the arithmetic operation on their residue representations, that is,
$$|A \circ B|_{m_i} = \big|\, |A|_{m_i} \circ |B|_{m_i} \,\big|_{m_i}$$
where "$\circ$" can be addition, subtraction, or multiplication. It is desired to convert binary arithmetic on large integers to residue arithmetic on smaller residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.
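The digit-wise property above can be checked with a short Python sketch. The moduli (7, 8, 9) and the operands 17 and 23 are illustrative choices, not values from this brief, and the helper names `to_rns`, `from_rns`, and `rns_op` are hypothetical. Conversion back to binary uses the standard Chinese Remainder Theorem, which the brief does not discuss.

```python
from math import gcd, prod

def to_rns(x, moduli):
    """Residue digits r_i = x mod m_i."""
    return tuple(x % m for m in moduli)

def from_rns(residues, moduli):
    """Chinese Remainder Theorem reconstruction (valid for 0 <= X < M)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m): modular inverse
    return total % M

def rns_op(a, b, moduli, op):
    """Apply op independently to each pair of residue digits (no carries between digits)."""
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a, b, moduli))

moduli = (7, 8, 9)  # illustrative, pairwise relatively prime, M = 504
assert all(gcd(p, q) == 1 for i, p in enumerate(moduli) for q in moduli[i + 1:])

a, b = to_rns(17, moduli), to_rns(23, moduli)
product = rns_op(a, b, moduli, lambda u, v: u * v)
total = rns_op(a, b, moduli, lambda u, v: u + v)
```

Each residue digit is processed without reference to the others, which is the source of the parallelism mentioned above. The result is only meaningful while the true value stays below $M$.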
A. The Modulo Multiplication
Generally, multiplication modulo $m$ has $2^n - m$ ($n = \lceil \log_2 m \rceil$) incorrect residue states. These states lie in the range $[m, 2^n - 1]$ and may be called overflow states. The corrected residue numbers can be obtained by two methods: employing a binary adder or a correction table. In the first method, the constant $2^n - m$ is added to correct the overflow residue states (generalized end-around carry), as shown in Fig. 1. The multiplication is performed as follows:
An $n$-bit multiplier and an inner-product cell are used; the multiplier computes $x_1 \cdot x_2$, while the inner-product cell computes $x_1 \cdot x_2 - m$.
The carry bit generated from the second unit indicates whether or not $x_1 \cdot x_2$ is greater than $m$. A multiplexer, controlled by the carry, selects the correct output. In the second method, a look-up table is used to correct the $2^n - m$ incorrect residue states (Fig. 2). The first algorithm of modulo multiplication is slow, and the second is not suitable for medium and large moduli.
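As a minimal sketch of the two correction methods, the Python below applies them to a single $n$-bit state rather than to a full product. The function names and the modulus 13 are hypothetical choices for illustration; the circuits of Figs. 1 and 2 are not reproduced here.

```python
def overflow_states(m):
    """Return n = ceil(log2 m) and the overflow states m .. 2^n - 1."""
    n = (m - 1).bit_length()  # equals ceil(log2 m) for m >= 2
    return n, list(range(m, 2 ** n))

def correct_by_adder(v, m, n):
    """Method 1: add the constant 2^n - m; the carry out selects the output."""
    t = v + (2 ** n - m)
    carry = t >> n                   # 1 exactly when v >= m
    return t & (2 ** n - 1) if carry else v

def correction_table(m, n):
    """Method 2: a table with one entry per n-bit state (2^n entries)."""
    return [v % m for v in range(2 ** n)]

n, states = overflow_states(13)
table = correction_table(13, n)
```

For $m = 13$ there are $2^4 - 13 = 3$ overflow states. The table has $2^n$ entries, which is why the table-based method scales poorly with the modulus size.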
III. MODULO ADDER
The modulo adder is the basic kernel used in performing the modulo multiplication. It is based on representing a number as a Carry and a Sum to obtain a scheme whose speed is constant and does not depend on the number of bits [11]. The modulo adder is used to add two numbers $A$ and $B$ modulo $m$. Fig. 3 shows that $A$ is represented as a pair of numbers $(A_S, A_C)$, $B$ is represented as $(B_S, B_C)$, and the output $C$ is represented as $(C_S, C_C)$. Each number is represented as a group of Sum bits and Carry bits. There is no unique representation for $A_S$ and $A_C$. The condition that needs to be satisfied is:
One possible representation is:
The choice of a representation has no implication on the complexity of the design. With such representation, four numbers 
A. The Modulo Addition Algorithm
The proposed algorithm for modulo $m$ addition of two numbers can be described as follows:
end.
An implementation of the algorithm is shown in Fig. 4. The proof that the modulo adder scheme for adding two $n$-bit numbers modulo $m$ has an asymptotic time complexity of $\Theta(1)$ is given in [11], with an example of the addition operation.
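The constant-time behaviour of a sum/carry representation can be illustrated with a generic carry-save adder (CSA). The Python sketch below is not the modulo adder of [11]: it omits the modulo correction entirely, and it assumes that a pair (sum, carry) stands for the value sum + carry, which is an assumption made only for this illustration. The function names are hypothetical.

```python
def csa(a, b, c):
    """Generic 3-to-2 carry-save step: returns (s, cy) with s + cy == a + b + c.
    Every bit position is computed independently, so the delay does not
    grow with the word length."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy

def add_sum_carry(a_pair, b_pair):
    """Add two numbers given as (sum, carry) pairs using two CSA levels.
    ASSUMPTION: each pair represents the value sum + carry."""
    s1, c1 = csa(a_pair[0], a_pair[1], b_pair[0])
    return csa(s1, c1, b_pair[1])

c_sum, c_carry = add_sum_carry((9, 0), (5, 6))  # 9 + (5 + 6) = 20
```

Four input words are reduced to two in a fixed number of gate levels, and no carry ripples across the word. The output is again a (sum, carry) pair, so it can feed another adder of the same kind directly.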
carry (A, B, C, D)
411 := 0
IV. THE MODULO MULTIPLIER
The modulo adder presented in the previous section is the main kernel of the modulo multiplier. The multiplier consists of two stages.
In the first stage (Fig. 5), an array of AND gates is used to obtain the partial products. The second stage of the multiplier is a binary tree of modulo adders that adds the $n$ partial products (Fig. 6). The correctness of the operation is established by the following theorem.
Theorem 1: Adding $n$ numbers $(y_1, y_2, \ldots, y_n)$ modulo $m$ can be performed in $\Theta(\log n)$ time using modulo adders.
Proof: The addition of $n$ numbers modulo $m$ can be performed as follows:
1) Adding $(y_1, y_2), \ldots, (y_i, y_{i+1}), \ldots,$ and $(y_{n-1}, y_n)$ modulo $m$ gives $y_{12}, \ldots, y_{(n-1)n}$.
2) Step (1) is repeated on the results $y_{12}, \ldots, y_{(n-1)n}$, taken in pairs.
3) Step (2) is repeated $\lceil \log n \rceil - 2$ times to obtain one final output represented as a sum and carry.
The previous method needs $\log n$ steps to be performed. We need to prove that the previous procedure returns the required
From the previous four cases:
Using the preceding relation, we can further expand this expression using the same method to get the addition process on the right-hand side in terms of only two operands added modulo $m$. □
Theorem 1 means that adding $n$ numbers modulo $m$ can be performed using a binary tree consisting of units that are capable of adding only two numbers modulo $m$. The modulo adders are used as those building blocks to perform the addition process. Since the modulo addition requires that the inputs be represented in the form of sum and carry, this form should be enforced at all levels. The form is enforced automatically for levels $\ge 2$, because the outputs of the previous levels are already in the correct form. For the first level we have the following:
For the second stage, the output is in the form of sum and carry, which is exactly the same form we have using the CSA's.
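The two-stage structure can be modelled behaviourally in a few lines of Python. This is a sketch under stated assumptions, not the circuit of Figs. 5 and 6: each modulo adder is modelled as an ordinary `(a + b) % m` instead of a sum/carry unit, and the weight $2^i$ of each partial product is assumed to be folded in and reduced modulo $m$ before the tree. The function name `modmul_tree` is hypothetical.

```python
def modmul_tree(x, y, m, n):
    """Behavioural model: AND-gate partial products, then a binary tree of
    two-input modulo-m additions. Returns (result, number of tree levels)."""
    # Stage 1: bit i of y gates x (the AND array).
    # ASSUMPTION: the weight 2^i is applied and reduced mod m at this point.
    partial = [((x if (y >> i) & 1 else 0) << i) % m for i in range(n)]

    # Stage 2: add neighbouring pairs, level by level, until one value remains.
    levels = 0
    while len(partial) > 1:
        nxt = [(partial[i] + partial[i + 1]) % m
               for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:          # an odd operand passes to the next level
            nxt.append(partial[-1])
        partial = nxt
        levels += 1
    return partial[0], levels

result, levels = modmul_tree(200, 123, 251, 8)
```

With $n = 8$ partial products the tree has three levels, matching the $\lceil \log_2 n \rceil$ depth used in the proof, and the modulus 251 is an arbitrary 8-bit example with no special form.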
V. MODULO MULTIPLIER EVALUATION

A. Asymptotic Complexity
Using the VLSI model of computation for asymptotic complexity [15], we compare the proposed multiplier with multipliers I and II. For multiplier I (Fig. 1), the complexity measures are as follows:
$T = \Theta(n)$, $AT^2 = \Theta(n^4)$.
[Table I: Comparison between different modulo multipliers]
For multiplier II (Fig. 2), using the complexity analysis of the correction table, $A = \Theta(n^2 + 2^n n)$.
For the proposed multiplier, the first stage consists of $n^2$ AND gates. The area of this stage is $\Theta(n^2)$. The partial products are obtained after a constant time (the AND gate delay). For the second stage, the number of adders required is:
$$\frac{n}{2} + \frac{n}{4} + \cdots + 1 = n - 1.$$
Since each modulo adder has $\Theta(n)$ full adders, the total area required for this stage is $\Theta(n^2)$. The time required to perform the addition process is the delay of the modulo adder multiplied by the depth of the tree, $\Theta(\log n)$.
$A = \Theta(n^2)$
$T = \Theta(\log n)$.
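The adder count and tree depth behind these bounds can be checked numerically. The sketch below simply counts two-input adders level by level; the function name `tree_cost` is hypothetical, and letting an odd operand pass through to the next level is an assumption for operand counts that are not powers of two.

```python
def tree_cost(n):
    """Count two-input adders and levels in a binary tree that reduces n operands."""
    adders, depth, width = 0, 0, n
    while width > 1:
        adders += width // 2        # one adder per pair at this level
        width = (width + 1) // 2    # an odd operand passes through unchanged
        depth += 1
    return adders, depth

adders_8, depth_8 = tree_cost(8)
```

Every adder reduces the number of operands by one, so $n - 1$ adders are always needed, and with $\Theta(n)$ full adders per modulo adder the tree area is $\Theta(n^2)$.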
From the previous analysis it is clear that the proposed multiplier is superior to the previously proposed schemes for medium and large moduli. Table I shows a comparison between the three schemes.
B. Layout Complexity
An 8-b multiplier is implemented in domino logic using a double-metal 3-micron CMOS technology. The CRYSTAL program is used to analyze the VLSI layout. Its input is a circuit description extracted from the mask layout using the MEXTRA program. Crystal determines the length of each clock phase and the circuit's critical paths, which helps in performance tuning by optimizing the critical paths. Although Crystal is designed for systems using multiple nonoverlapping clocks, it checks neither clock skew nor setup/hold times. Crystal can efficiently analyze the systolic multiplier due to its simplicity and regularity. Table II summarizes the area and time of the different components of the design.
For the modulo adder we use three cell types. Type I is used in the implementation of stages one and two, Type II in stage three, and Type III in stages four and five. Type I consists of $n$ full adders, Type II consists of a three-input multiplexer and $n$ full adders, and Type III consists of a two-input multiplexer and $n$ full adders. The longest delay among the three types is that of Type III, which determines the clock period. The clock period is 32.40 ns, which gives a throughput of 31 M modulo multiplication operations per second. We can generalize the previous figures for $n$-bit modulo multiplication. The total delay consists of the delay of the AND-gate array and the delay of the tree. The array has a delay of 1.45 ns and each adder has a delay of 124.9 ns; then:
Total delay = 124.9 * log n + 1.45 ns.
The clock period is constant regardless of $n$; then:
Clock period = 32.40 ns.
Throughput = 30 M modulo multiplications/s.
Table III shows the total delay for different multiplier sizes.
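The delay and throughput expressions can be evaluated directly. The Python sketch below uses only the figures quoted in the text (1.45 ns, 124.9 ns, 32.40 ns); taking the logarithm to base 2 is an assumption based on the binary-tree depth, and the function names are hypothetical. It does not reproduce the entries of Table III.

```python
from math import log2

AND_ARRAY_DELAY_NS = 1.45   # delay of the AND-gate array (from the text)
ADDER_DELAY_NS = 124.9      # delay of one modulo adder (from the text)
CLOCK_PERIOD_NS = 32.40     # pipeline clock period (from the text)

def total_delay_ns(n):
    """Total delay = 124.9 * log n + 1.45 ns; base-2 logarithm is ASSUMED."""
    return ADDER_DELAY_NS * log2(n) + AND_ARRAY_DELAY_NS

def throughput_mops():
    """One result per clock period, in millions of operations per second."""
    return 1e3 / CLOCK_PERIOD_NS

delay_8 = total_delay_ns(8)
rate = throughput_mops()
```

Under the base-2 assumption, the 8-b multiplier has a total delay of about 376 ns and a throughput of about 30.9 M operations per second, which the text rounds to 30 M or 31 M.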
VI. CONCLUSION
The modulo multiplier introduced in this brief has a total time-delay complexity of $\Theta(\log n)$ for multiplying two $n$-bit numbers modulo $m$. Based on the analysis of Section V-A, this multiplier is the fastest and the most area-efficient for large moduli. The VLSI implementation using a double-metal 3-micron technology shows that the pipelined multiplier can operate with a clock period of 32.4 ns, which leads to a throughput of 30 M modulo multiplication operations per second. The proposed design has the following advantages:
1) It does not have any limitation on the size of the modulus.
2) It is quite modular: it is a 2-D array of two cell types (modulo adders and AND gates).
3) It is easy to pipeline, yielding a very high throughput.
4) It is very fast and area-efficient compared with other schemes.
