Abstract-Inner-product calculations are often required in digital neural computing. The critical path of the inner product of two vectors is dominated by the carry propagation delay generated when the individual product terms are summed. In this work, a novel high-speed inner-product processor for the multi-valued exponential bidirectional associative memory (MV-eBAM) is presented. A systolic-like architecture of digital compressors is used to reduce the carry propagation delay in the critical path of the inner product of two vectors. The proposed architecture might offer a sub-optimal solution for the digital hardware realization of the inner-product computation.
I. INTRODUCTION

Since Kosko [1] proposed the bidirectional associative memory (BAM), many researchers have invested effort in exploring the network's properties and limitations. Due to its intrinsic architecture, the capacity of the BAM is unfortunately poor [2]. Notably, Chiueh and Goodman [3] proposed the exponential Hopfield associative memory, motivated by the exponential dependence of the MOS transistor's drain current on the gate voltage in the subthreshold region, which makes the VLSI implementation of an exponential function feasible. Although the impressive capacity of the eBAM was established [4], the data representation of the BAM or eBAM is still limited to either bipolar or binary vectors. We consider that expanding the data range from two-valued to multi-valued digits is also a feasible way to enlarge the capacity, and it enriches the data representation as well. This observation leads to the multi-valued exponential bidirectional associative memory (MV-eBAM) [5].
Since the neural computing used in networks similar to the MV-eBAM is composed of a massive number of inner-product calculations, shortening the associated delay becomes urgent. Otherwise, the hardware realization of any such neural network becomes impractical. The inner-product computation of two multi-valued vectors is conventionally done by a process of successive multiply-and-accumulate operations [6], [7]. This method generates each individual inner-product term with a multiplier and employs a single adder to sum up all the inner-product terms iteratively. The systolic-like architecture of the partial product reduction tree for the parallel multiplier introduced by Wallace [8] motivated the implementation of several parallel schemes for the inner-product calculation [6], [9]. Researchers have also proposed a variety of compressors to speed up the process of partial product reduction in multiplication or inner-product operations, such as 4-2, 5-5-4, or 9-2 compressors [10], [11]. However, Oklobdzija et al. [12] pointed out that it is the interconnection of the compressors, rather than the structure of the compressors, that leads to the fastest realization of partial product reduction in the multiplication operation. The superiority of the 4-2 and 7-3 compressors built by Zhang et al. [13] confirmed this conclusion. In addition, Wang et al. [14] compared different compressor architectures and concluded that the systolic-like architecture outperforms the others. Consequently, the novel inner-product processor dedicated to the MV-eBAM presented in this paper includes a systolic-like compressor unit in which the 3-2 compressors are arranged so that the carry propagation delay in the critical path is reduced. In addition to the compressor unit, an inner-product term generator is also proposed to produce the individual inner-product terms as the inputs to the compressor unit.
Manuscript received February 1999; revised June 2000. This research was supported in part by the National Science Council under Grant NSC 88-2219-E-110-001. This paper was recommended by Associate Editor F. Kub.
The authors are with the Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424, R.O.C. (e-mail: ccwang@ee.nsysu.edu.tw).
Publisher Item Identifier S 1057-7130(00)09927-4.
1057-7130/00$10.00 © 2000 IEEE
II. THEORY OF MV-eBAM
Before introducing the inner-product processor for the MV-eBAM, it is necessary to show how the MV-eBAM operates in theory. Suppose we are given pattern pairs, which are (1) where where is assumed to be smaller than or equal to without any loss of generality. Hence, the evolution equations of the MV-eBAM are shown as (2) where and key patterns; a positive number, called the radix , th digits of and with and for and , respectively; a staircase function shown as where is the number of finite levels, and is the finite interval of the staircase function. Note that if and , then , for . The staircase function is used because its argument in (3) is not necessarily a positive integer; we therefore have to assign this argument to the nearest integer.
The reasons for using an exponential scheme in (2) are to enlarge the attraction radius of every stored pattern pair and to augment the desired pattern in the recall reverberation process.
In the evolution equations (2), if the given input pattern is close to the desired pattern, the weighting coefficient will be close to the maximum, 1, while if the input pattern is far from the desired one, it will approach zero. As for the purpose of the denominator, it makes the and to be the centroids of all of the 's and 's, respectively. The capacity of the MV-eBAM can be shown to be very close to the maximum number of combinations of the input vector, i.e., when is large enough (Wang and Hwang, 1996) . Hence, the MV-eBAM indeed possesses a high capacity.
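The behaviour described above can be illustrated with a minimal Python sketch of one recall direction. It is a sketch under stated assumptions rather than the authors' formulation: the weight of each stored pair is taken as the radix raised to the negative squared distance between the input key and the stored key, which gives the maximum weight 1 for an exact match and a weight near zero for a distant key, and the digit levels are assumed to run from 0 to levels - 1. The names recall_step and staircase are hypothetical.

```python
def staircase(value, levels):
    """Assign a non-negative real value to the nearest integer level.

    Assumption: the levels are the integers 0 .. levels - 1.
    """
    nearest = int(value + 0.5)  # round half up, valid for non-negative values
    return max(0, min(levels - 1, nearest))


def recall_step(key, stored_keys, stored_values, radix, levels):
    """One recall direction: exponentially weighted centroid, then staircase.

    Assumption: weight_i = radix ** (-squared distance between key and stored key i).
    """
    weights = []
    for pattern in stored_keys:
        dist2 = sum((p - k) ** 2 for p, k in zip(pattern, key))
        weights.append(radix ** (-dist2))
    total = sum(weights)  # the denominator that turns the sum into a centroid
    width = len(stored_values[0])
    recalled = []
    for d in range(width):
        centroid = sum(w * v[d] for w, v in zip(weights, stored_values)) / total
        recalled.append(staircase(centroid, levels))
    return recalled


stored_keys = [[0, 1, 2], [3, 3, 0]]
stored_values = [[1, 2, 3, 0], [3, 0, 1, 2]]
# The input key differs from the first stored key in one digit only.
print(recall_step([0, 1, 3], stored_keys, stored_values, radix=2, levels=4))
```

Under this assumed weighting, the squared distance expands into the squared norms of the two vectors minus twice their inner product, which is one way the inner product enters the recall computation.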
However, one serious problem occurs when it comes to the physical implementation of such a high-capacity associative memory by digital VLSI circuits. In the computation of the MV-eBAM, the inner product of two vectors or might be one of the most frequently used mathematical operations. Notably, if or is large in the above calculation, then the carry propagation of the inner product of the vectors will likely become the critical delay of the entire neural computing. This side effect undoubtedly devalues the hardware realization of the MV-eBAM. 
III. HIGH-SPEED INNER PRODUCT PROCESSOR FOR THE MV-eBAM
In order to reduce the carry propagation delay produced in the implementation of the MV-eBAM, it is necessary to develop a special-purpose processor for the inner product of two multi-valued operands. The design of the multi-valued inner-product processor is divided into two parts: an individual inner-product term generator and a compressor unit. The inner-product term generator produces the individual inner-product terms of two multi-valued vectors and passes them to the compressor unit, in which the summation of the product terms is computed. Fig. 1 shows the data flow of a multi-valued inner-product calculation.
A. Inner-Product Term Generator
Considering the compatibility with the binary digital system, the number of finite levels in (3) is set to in the implementation of the inner-product processor for the MV-eBAM. Besides, the computation of each individual inner-product term in or turns out to be an unsigned integer operation because each term is always positive. Thus, each product term in (2), product, can be evaluated by (4) where is 0 or 1. Notably, the design of the inner-product term generator becomes simple because only AND gates are required to produce the partial products in (4). Fig. 2 shows the configuration of the inner-product term generator. Note that the dimension of the stored patterns is set to the count of the inputs to a -to-compressor, which will be introduced in the next section. Therefore, the length of the inputs to the inner-product term generator is , and the length of the outputs is according to (4). If the dimension of the stored patterns is less than , all the unused inputs to the inner-product term generator are padded with zeros.
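The AND-gate generation of partial products can be sketched in Python as follows. This is an illustrative model, not the circuit of Fig. 2: it assumes each digit is an unsigned integer of a given bit width, and the function names partial_products, term_generator, and weighted_sum are hypothetical. The AND outputs are only grouped by bit position; their summation is left to a later stage, as in the text.

```python
def partial_products(x, y, bits):
    """AND every bit of x with every bit of y, grouped by bit position i + j."""
    columns = {}
    for i in range(bits):
        for j in range(bits):
            and_out = ((x >> i) & 1) & ((y >> j) & 1)
            columns.setdefault(i + j, []).append(and_out)
    return columns


def term_generator(xs, ys, bits):
    """Collect the AND outputs of all digit pairs per bit position, without summing."""
    columns = {}
    for x, y in zip(xs, ys):
        for pos, outs in partial_products(x, y, bits).items():
            columns.setdefault(pos, []).extend(outs)
    return columns


def weighted_sum(columns):
    """Reference check only: add the bits of each column with weight 2**position."""
    return sum(sum(outs) << pos for pos, outs in columns.items())


cols = term_generator([3, 2, 1], [1, 3, 2], bits=2)
print({pos: len(outs) for pos, outs in sorted(cols.items())})
print(weighted_sum(cols))
```

In this model, with 2-bit digits each digit pair contributes one AND output to the lowest position, two to the middle position, and one to the highest, so the column heights grow linearly with the vector dimension.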
B. Framework of the Compressor Unit
1) Systolic-Like -to-Compressor Building Block:
A 3-2 compressor is basically a full adder. The feature of such a compressor is that the output represents the number of 1s among the inputs. The equations of a full adder are shown as follows: (5) where denotes . As shown in Fig. 3, the logic structure of a typical 3-2 compressor can be split up into two logic layers. One of the three inputs, , is not required in the first logic layer. A -to-compressor building block can be constructed by cascading 3-2 compressors, as shown in Fig. 4. This architecture, inspired by the design methodology of systolic arrays, consists only of parallelized 3-2 compressor building blocks at every processing stage. Note that the number beside the arrow pointing toward each circle represents the count of the inputs to the 3-2 compressors at each processing stage, while the number beside the two outward arrows indicates the count of the outputs of each 3-2 compressor building block. The number inside the circles denotes the count of the 3-2 compressors which process the inputs at some specific bit positions.
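A 3-2 compressor can be modelled in a few lines of Python. The sketch below uses the standard full-adder equations and one common two-layer decomposition in which the third input is needed only in the second layer; the input names a, b, c and the function name compressor_3_2 are illustrative choices, and the decomposition is an assumption about the structure in Fig. 3.

```python
def compressor_3_2(a, b, c):
    """Full adder: returns (sum, carry) such that a + b + c == sum + 2 * carry."""
    first = a ^ b                    # first logic layer, uses only a and b
    s = first ^ c                    # second logic layer brings in the third input
    carry = (a & b) | (first & c)    # equivalent to the majority of a, b, c
    return s, carry


for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, carry = compressor_3_2(a, b, c)
            print(a, b, c, "->", carry, s)
```

Read as a 2-bit number, the pair (carry, sum) is the count of 1s among the three inputs, which is the property the compressor tree relies on.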
To compute the total count of 3-2 compressors used in a -to-compressor, we consider an alternative architecture of the -to-compressor, which is composed of two -to-compressors and 3-2 compressors, as shown in Fig. 5. Based on the configuration of this compressor, we can derive the count of the 3-2 compressors used in this architecture as follows: (6) where denotes the number of the 3-2 compressors used in a -to-compressor. By solving the above recurrence relation, we obtain (7) The number of 3-2 compressors used in the two architectures presented above is identical because no unused inputs to the 3-2 compressors appear in either -to-compressor structure. Thus, we can conclude that the count of 3-2 compressors used in the systolic-like architecture of the -to-compressor is also .
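The counting argument can be checked with a small Python model. The sketch rests on assumptions that are not spelled out in the text as reproduced here: the building block is taken to be a counter with 2**n - 1 inputs and n outputs, and the alternative architecture is taken to be two half-size counters whose outputs are added by a ripple of n - 1 full adders, with the one remaining input used as the carry-in. Under these assumptions the recurrence is C(n) = 2 C(n - 1) + (n - 1) with C(1) = 0, whose solution is C(n) = 2**n - n - 1. The names counter and full_adder are hypothetical.

```python
def full_adder(a, b, c):
    """3-2 compressor: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def counter(bits_in, n):
    """Count the 1s among 2**n - 1 input bits using only full adders.

    Returns (output bits, LSB first; number of full adders used).
    """
    assert len(bits_in) == 2 ** n - 1
    if n == 1:
        return [bits_in[0]], 0           # a single bit is already its own count
    half = 2 ** (n - 1) - 1
    left, used_left = counter(bits_in[:half], n - 1)
    right, used_right = counter(bits_in[half:2 * half], n - 1)
    carry = bits_in[-1]                  # the remaining input enters as carry-in
    out = []
    for a, b in zip(left, right):        # ripple of n - 1 full adders
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)
    return out, used_left + used_right + (n - 1)


for n in range(1, 6):
    out, used = counter([1] * (2 ** n - 1), n)
    print(n, used, 2 ** n - n - 1)
```

The closed form also follows from a simple bit-count argument: every full adder takes three bits in and gives two out, so reducing 2**n - 1 bits to n bits with no unused inputs takes exactly 2**n - 1 - n of them.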
2) Framework of Digital Compressor Design:
According to (4), the summation of the partial product terms is not computed in the inner-product term generator. This implies that the outputs of the AND gates are fed into the compressor unit at the required bit positions. Besides, individual inner-product terms need to be accumulated at each bit position. Thus, there will be partial product terms at the LSB, partial product terms at the second bit position, partial product terms at the th bit position, and partial product terms at the th bit position (MSB), and so forth, as shown in Fig. 2. Since many accumulation operations must be performed to obtain the final result, reducing the carry propagation delay of the critical paths is the major consideration for the architecture of the compressor unit. The entire architecture we propose to achieve this goal is shown in Fig. 6. Since this compressor unit is composed of one or several -to-compressors at each bit position, we tend to set the dimension of the stored patterns to to reduce the number of unused inputs to the basic 3-2 compressor building blocks. Although it is difficult to derive a general form of the critical delay for the compressor unit due to its irregular structure, the delay can be estimated by attaching fictitious 3-2 compressors of stages on top of the compressor unit to form a single -to-compressor tree. Since there are inputs to the compressor unit, the length of the inputs to the made-up -to-compressor tree now becomes after tracing back to the top level of the -to-compressor. Let denote the count of 3-2 compressors in the critical path of the compressor unit; then the delay can be derived as follows: (8) The number of 3-2 compressors used in the compressor unit can be estimated based on (7). Let denote the number of the 3-2 compressors used in the made-up -to-compressor tree. First, we get (9) (10) (11) where denotes the count of the 3-2 compressors at the fictitious levels. Then, can be computed as follows:
Fig. 9. Schematic diagram of the 3-2 compressor building block.
IV. SIMULATION AND ANALYSIS
A. Performance Analysis
A conventional multi-valued inner-product processor is presented in Fig. 7 to facilitate the overhead analysis of our design. As Fig. 7 shows, the components of the two input vectors are fed into the individual inner-product term generator serially, where the number of finite levels, , in (3) is set to . During each cycle, the outputs of the individual inner-product term generator are streamed into a Wallace-tree multiplication array to obtain the -bit product. Notably, a fast adder such as a carry-lookahead adder (CLA) is required at the final stage of the multiplication. The product is then fed into another CLA to obtain the accumulated partial sum of the inner product. Since the maximum value of the inner product, , can be derived by (13) for and , the output bit length of the CLA is required to be at least , which is also the output bit length of the compressor unit as given in Fig. 1.
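The required output width can be illustrated with a short Python calculation. It is a sketch under the assumption that each digit is an unsigned value between 0 and 2**digit_bits - 1, so that the largest inner product occurs when every digit of both vectors takes its maximum; the function names are hypothetical.

```python
def max_inner_product(dimension, digit_bits):
    """Largest inner product when every digit equals 2**digit_bits - 1 (assumed range)."""
    largest_digit = (1 << digit_bits) - 1
    return dimension * largest_digit * largest_digit


def output_bits(dimension, digit_bits):
    """Minimum number of bits needed to hold the largest inner product."""
    return max_inner_product(dimension, digit_bits).bit_length()


print(max_inner_product(31, 2), output_bits(31, 2))
```

For the 31-dimensional vectors with 2-bit digits used in the simulations, this assumption gives a maximum of 279 and an output width of 9 bits.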
Similar to the approach taken for the derivation of (8) and (12), the propagation delay of the Wallace-tree multiplication array counted by the number of 3-2 compressors can be estimated as follows: (14) and the approximate number of 3-2 compressors used in the multiplication array becomes (15) Next, we need to evaluate the critical delay and the hardware complexity of the two CLAs. Based on the tree-like architecture of the CLA proposed by Dozza et al. [15], it can be shown that
In summary, the conventional scheme requires an extra -bit CLA, an -bit CLA, and an -bit register. However, the compressor unit shown in Fig. 6 is replaced by the simpler Wallace tree, and only one individual inner-product term generator is needed in this scheme. The extra hardware cost for our proposed scheme can be estimated as follows: 1) AND gates for the inner-product term generator; 2) 3-2 compressors used in the compressor unit. Although the hardware complexity of the above-mentioned conventional inner-product processor is lower than that of our scheme, the total delay of the inner-product calculation caused by this simple yet slow architecture turns out to be
where delay of the AND gate; delay of the Wallace tree multiplication array;
-bit CLA delay of the CLA;
-bit CLA delay of the CLA.
As for the total delay of our proposed scheme, it can be expressed as follows:
where stands for the delay of the compressor unit.
From (8) , (14), (16), (18), and (19), it can be shown that the total delay of multi-valued inner-product calculation counted by the number of 2-input logic gates is reduced by (20) As seen from (20), the delay of inner product is improved significantly in our proposed scheme.
B. Verilog Simulations
In order to verify the correctness and the performance of the implementation of the inner-product processor for the MV-eBAM, Verilog HDL is used to conduct a series of simulations with over 20 000 random test vectors to explore the critical delays of the proposed architecture. The dimension of the input vectors to the inner-product processor is 31, and each digit is 2 bits wide. The simulation results indicate a delay of about 6.4 ns for the critical paths. Fig. 8 shows a sample waveform diagram of the inner-product computation of two test vectors. Notice that the test vectors given in Fig. 8 are the inputs to the compressor unit.
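The functional part of such a random-vector check can be mirrored in software. The following Python sketch is not the authors' Verilog testbench and says nothing about timing; it only compares a bit-level model built from AND outputs against a direct inner product, for the vector dimension and digit width quoted above. The function names, the seed, and the digit range 0 to 3 are assumptions.

```python
import random


def reference(xs, ys):
    """Direct inner product used as the golden model."""
    return sum(x * y for x, y in zip(xs, ys))


def bit_level(xs, ys, bits):
    """Inner product rebuilt from single-bit AND outputs weighted by 2**(i + j)."""
    total = 0
    for x, y in zip(xs, ys):
        for i in range(bits):
            for j in range(bits):
                total += (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
    return total


def run_random_tests(count, dimension=31, bits=2, seed=0):
    """Return True if the bit-level model matches the reference on every random pair."""
    rng = random.Random(seed)  # fixed seed so a failing case can be reproduced
    for _ in range(count):
        xs = [rng.randrange(1 << bits) for _ in range(dimension)]
        ys = [rng.randrange(1 << bits) for _ in range(dimension)]
        if bit_level(xs, ys, bits) != reference(xs, ys):
            return False
    return True


print(run_random_tests(2000))
```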
C. Chip Implementation
We use the Taiwan Semiconductor Manufacturing Company (TSMC) 0.6-μm 1P3M technology to design the chip. The 3-2 compressor building block is designed as shown in Figs. 9 and 10. We then use the Cadence Silicon Ensemble automatic place and route tools to generate the abstract view and the layout of the chip. Finally, DRACULA and TimeMill are used to execute the full-chip-scale post-layout simulation. The circuit layout of the inner-product processor is given in Fig. 11.
V. CONCLUSION
In this paper, we have proposed a novel architecture for the inner-product processor, which can be employed in the implementation of the multi-valued exponential bidirectional associative memory. The inner-product processor consists of two major components: the individual product term generator and the compressor unit. The design of the individual product term generator is simplified because only the partial product terms are generated at this stage. The summation of the partial product terms and the accumulation of the individual inner-product terms are carried out at the next stage, i.e., the compressor unit. The systolic-like architecture of the compressor unit can significantly reduce the carry propagation delay in the critical path of the inner product, which is the bottleneck of the whole computation.
