Inner product calculations are often required in digital neural computing. The critical path of the inner product of two binary vectors is the carry propagation delay generated from the individual product terms. In this work, two architectures for arranging digital ratioed compressors are presented to reduce the carry propagation delay on the critical path. In addition, estimates of the carry propagation delay of these compressor building blocks are derived and compared. Both the theoretical analysis and the Verilog simulations indicate that one of the compressor building blocks presented here might offer a sub-optimal solution for the basic building blocks used in digital hardware realization of the inner product computation.
INTRODUCTION
Much effort has been devoted to the realization of neural networks, mainly owing to their attractive pattern recognition features [1, 2]. In the computation of neural networks, the inner product of two vectors is among the most frequently used mathematical operations. Carry propagation unavoidably occurs if the neural networks are dedicated to either discrete or digital signals. For instance, the recall of pattern pairs stored in a discrete bidirectional associative memory (BAM) needs to compute a summation of the form $Y = \mathrm{th}\left(\sum_{i=1}^{n} Y_i (X_i \cdot X)\right)$, where $X$ is the input pattern, $Y$ is the output pattern, the $X_i$'s and $Y_i$'s are the stored pattern pairs, and $\mathrm{th}()$ is a threshold function. Notably, the components of every vector are either bipolar or binary. If $n$ is large in the above calculation, then the carry propagation of the inner product of the vectors will likely become the critical delay of the entire neural computation.
Since neural computing is composed of a large number of inner product calculations, the demand for shortening this delay is urgent.
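The recall summation above can be sketched in a few lines. This is only an illustration of where the inner products arise, not the hardware discussed in this paper; the function names, the tie-breaking rule of the threshold, and the example patterns are all assumptions.

```python
# Illustrative sketch of discrete BAM recall: Y = th(sum_i Y_i * (X_i . X)).
# All names (inner, threshold, bam_recall) and the example data are hypothetical.

def inner(u, v):
    """Inner product of two equal-length bipolar (+1/-1) vectors."""
    return sum(a * b for a, b in zip(u, v))

def threshold(s):
    """Bipolar threshold; mapping a zero sum to +1 is an arbitrary assumption."""
    return 1 if s >= 0 else -1

def bam_recall(stored_x, stored_y, x):
    """Recall an output pattern from input x using the stored pairs (X_i, Y_i)."""
    weights = [inner(xi, x) for xi in stored_x]   # one inner product per stored pair
    m = len(stored_y[0])
    sums = [sum(w * yi[j] for w, yi in zip(weights, stored_y)) for j in range(m)]
    return [threshold(s) for s in sums]

stored_x = [[1, -1, 1, -1], [1, 1, -1, -1]]
stored_y = [[1, -1], [-1, 1]]
print(bam_recall(stored_x, stored_y, [1, -1, 1, -1]))   # recalls the first stored Y
```

Every recall evaluates one inner product per stored pair, which is why the summation delay of the product terms dominates when the vectors are long.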
Many high-speed logic design styles have been proposed. However, each of these logic styles suffers from its own difficulties. For example, domino logic [3] can implement only non-inverting functions; NORA [4] has the charge-sharing problem; all-N-logic [5] and robust single phase clocking [6] fix such a problem by employing a so-called C2PL (complex CPL), yet several physical design factors are not fully considered or implemented. First, the NMOS transistors used for pass logic cannot be of minimal size. Second, the sizes of the driving inverters have to be properly tuned. Third, the original design of [9] not only gives poor fan-in and fan-out capability, but also produces very asymmetrical rise and fall delays, which are very likely to cause glitch hazards and unwanted power consumption. Fourth, no further analysis of the reduction of carry propagation delay is performed. In this paper, two alternative architectures of the digital ratioed compressor building [...] follows:

(1)

where $F$ denotes $(a \oplus c)$. The feature of such a compressor is that the output represents the number of 1's given in the inputs [9]. We use TSMC 0.6 μm 1P3M technology to re-design the 3-2 compressors, and the schematic diagrams for the ratioed 3-2 compressors are shown in Figure 2. In Section 3 of this paper, we demonstrate that the redesigned 3-2 compressor circuits overcome all of the problems mentioned in Section 1.
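The counting behaviour of a 3-2 compressor can be checked with a short behavioural sketch. It assumes that F is the exclusive-OR of inputs a and c and uses the textbook full-adder form for the sum and carry outputs; these logic equations are an assumption made for illustration and are not taken from Eq. (1) or from the transistor-level design.

```python
from itertools import product

def compressor_3_2(a, b, c):
    """Behavioural 3-2 compressor: returns (carry, sum) for three input bits."""
    f = a ^ c                  # F = a XOR c (assumed meaning of F in the text)
    s = f ^ b                  # sum bit, textbook full-adder form (assumption)
    carry = b if f else a      # carry selected by F, multiplexer form (assumption)
    return carry, s

# The two output bits, read as a binary number, equal the number of 1's in the inputs.
for a, b, c in product((0, 1), repeat=3):
    carry, s = compressor_3_2(a, b, c)
    assert 2 * carry + s == a + b + c
```

The multiplexer form of the carry is chosen here only because it reuses F; any logically equivalent carry expression gives the same truth table.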
FRAMEWORK OF RATIOED

A Primitive Architecture of Digital Ratioed Compressors
A 7-3 compressor building block can be constructed by cascading four 3-2 compressors, as shown in Figure 3. A 15-4 compressor building block can be formed similarly with two 7-3 compressors and three 3-2 compressors, as shown in Figure 4. Based on this design methodology, a $(2^n-1)$-to-$n$ compressor can be composed of two $(2^{n-1}-1)$-to-$(n-1)$ compressors and $(n-1)$ 3-2 compressors.
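The recursive composition can be sketched behaviourally as follows. The wiring is an assumption: the two sub-block outputs are added bit by bit through a chain of 3-2 compressors, with the one remaining input bit used as the initial carry. The figures may arrange the cells differently; the sketch only checks that the general rule yields a correct counter and reports how many 3-2 cells it uses.

```python
from itertools import product

def compressor_3_2(a, b, c):
    """Behavioural 3-2 compressor: returns (carry, sum)."""
    total = a + b + c
    return total >> 1, total & 1

def compress(bits, n):
    """(2^n - 1)-to-n compressor built only from 3-2 cells (hypothetical wiring).

    Returns (output bits, least significant first, number of 3-2 cells used).
    """
    assert len(bits) == 2**n - 1
    if n == 1:
        return [bits[0]], 0                     # a single bit needs no cell
    half = 2**(n - 1) - 1
    lo, cells_lo = compress(bits[:half], n - 1)
    hi, cells_hi = compress(bits[half:2 * half], n - 1)
    carry = bits[2 * half]                      # the remaining input acts as carry-in
    out = []
    for x, y in zip(lo, hi):                    # (n - 1) cascaded 3-2 cells
        carry, s = compressor_3_2(x, y, carry)
        out.append(s)
    out.append(carry)
    return out, cells_lo + cells_hi + (n - 1)

# Exhaustive check of the 7-3 block: it counts the 1's and uses four 3-2 cells.
for bits in product((0, 1), repeat=7):
    out, cells = compress(list(bits), 3)
    assert sum(b << i for i, b in enumerate(out)) == sum(bits)
    assert cells == 4
```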
Since the total delay of such a design is approximately proportional to the number of 3-2 compressors on the critical path, let $D_n$ denote the count of 3-2 compressors on the critical path when $2^n-1$ bits are fed into the $(2^n-1)$-to-$n$ compressor block. By observing the structure of the compressor blocks, we can deduce $D_2$, $D_3$, and $D_n$ as (2). By solving the above recurrence relation, we obtain [...]. Apart from the delay of a single building block, we have to account for the processing stages needed for summing the individual inner product terms. The number of processing stages is roughly estimated as $\ln(n/M)/\ln(n/(2^n-1))$, where $n$ denotes the total number of bits of the basic building block output, and $M$ represents the bit count of the data inputs.
Therefore, the count of 3-2 compressors when $M$ bits are fed into the $(2^n-1)$-to-$n$ compressor building blocks can be shown as follows:

$$D_M \approx \frac{\ln(n/M)}{\ln(n/(2^n-1))} \cdot \frac{n(n-1)}{2} \qquad (4)$$
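The estimate can be evaluated numerically with a small sketch. Two assumptions are made here and should be read as such: the recurrence is taken to be $D_2 = 1$, $D_k = D_{k-1} + (k-1)$ (one sub-block followed by all $k-1$ cascaded 3-2 compressors), and Eq. (4) is read as the stage count multiplied by $D_n$. The function names are hypothetical.

```python
import math

def critical_path_cells(n):
    """D_n from the assumed recurrence D_2 = 1, D_k = D_{k-1} + (k - 1)."""
    d = 1
    for k in range(3, n + 1):
        d += k - 1
    return d

def stages(n, m):
    """Rough stage count ln(n/M) / ln(n/(2^n - 1)) quoted in the text."""
    return math.log(n / m) / math.log(n / (2**n - 1))

def total_cells(n, m):
    """Estimated 3-2 cells on the critical path for M input bits (assumed reading of Eq. (4))."""
    return stages(n, m) * critical_path_cells(n)

# The assumed recurrence has the closed form n(n - 1)/2.
for n in range(2, 10):
    assert critical_path_cells(n) == n * (n - 1) // 2

print(stages(7, 127), total_cells(7, 127))   # a single 127-7 block: one stage
```

Under these assumptions the per-block count grows quadratically in $n$, which is the behaviour the primitive architecture is later compared against.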
The Systolic-like Architecture of Ratioed Compressor Building Blocks
The second architecture presented in this work to reduce the carry propagation delay of the critical paths is shown in Figure 5. This architecture, inspired by the design methodology of systolic arrays, consists only of parallelized 3-2 compressor building blocks at every processing stage. Although it is difficult to derive an analytical form for the total delay of $(2^n-1)$-to-$n$ compressors in the systolic-like architecture, an upper bound for the delay can still be computed in light of the result given in Eq. [...] Compared with the primitive architecture presented in Section 2.3, the systolic-like architecture improves the delay of the inner product calculation from $O(n^2)$ to $O(n)$. This improvement evidently stems from the parallel computation performed at each processing stage, as shown in Figure 5.
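The idea of using only 3-2 compressors, all operating in parallel within a stage, can be illustrated with a generic column-compression sketch. This is not the arrangement of Figure 5: the grouping rule, the zero padding of leftover pairs, and the function name are assumptions chosen to keep the example short. It shows only that such a stage-wise scheme produces the correct count and lets the number of stages be observed.

```python
def parallel_reduce(bits, n):
    """Sum 2^n - 1 one-bit inputs with 3-2 compressors applied in parallel per stage.

    Hypothetical scheme: cols[w] holds the bits of weight 2^w; in each stage every
    group of three (or a leftover pair padded with a constant 0) is compressed.
    Returns (result value, number of stages).
    """
    assert len(bits) == 2**n - 1
    cols = [list(bits)] + [[] for _ in range(n - 1)]
    stages = 0
    while any(len(col) > 1 for col in cols):
        nxt = [[] for _ in cols]
        for w, col in enumerate(cols):
            i = 0
            while len(col) - i >= 2:
                group = col[i:i + 3]
                group += [0] * (3 - len(group))   # pad a leftover pair with 0
                total = sum(group)                # behavioural 3-2 compressor
                nxt[w].append(total & 1)          # sum bit keeps weight 2^w
                if w + 1 < n:
                    nxt[w + 1].append(total >> 1) # carry moves to weight 2^(w+1)
                # A carry out of the top column is always 0, because the total
                # is at most 2^n - 1, so it is dropped.
                i += 3
            nxt[w].extend(col[i:])                # a single leftover bit passes through
        cols = nxt
        stages += 1
    value = sum(col[0] << w for w, col in enumerate(cols) if col)
    return value, stages

value, stages = parallel_reduce([1] * 127, 7)
print(value, stages)
```

All compressors within one pass of the outer loop are independent of each other, so the delay of this sketch is governed by the number of stages rather than by the total number of cells.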
SIMULATION AND ANALYSIS
Re-designed Building Blocks
In order to verify the correctness of our theoretical analysis, we use HSPICE and Verilog to conduct a series of simulations. The improvement over the asymmetrical rise and fall delays of the original design is illustrated through HSPICE simulations. The simulation results are tabulated in Table I.
Delay Simulations
The Verilog simulations are run for 20000 iterations each on the primitive architecture and the systolic-like architecture of 127-7 compressor building blocks. Table II compares the carry propagation delay of the two architectures of 127-7 compressor building blocks when they sum 127 data inputs.
The results demonstrate that the systolic-like architecture of digital ratioed compressors indeed yields the smaller carry propagation delay.
CONCLUSION
In this paper, a re-designed ratioed 3-2 compressor is presented to correct several problems appearing in Zhang's work [9]. The equations for counting the number of 3-2 compressors on the critical path of $(2^n-1)$-to-$n$ compressors are derived and used to compare the performance of two digital ratioed compressor architectures. Our simulation results show that the systolic-like architecture gives a sub-optimal performance through the parallelized arrangement of 3-2 compressors at each stage of processing.
[7] Yuan, J. and Svensson, C., "High-speed CMOS circuit technique", IEEE J. on 
