Abstract-This brief presents a novel high-speed realization of a bipolar-valued inner product processor for associative memory networks, together with a treatment of the inner product of two bipolar vectors. A systolic architecture of digital compressors is used to reduce the carry propagation delay in the critical path of the inner product computation.
I. INTRODUCTION
Applications of associative memories include pattern recognition [6], code correction, storage of words and speech data [1], and chaotic neural networks [9]. In the computation of associative memories [8], the inner product of two vectors is among the most frequently used mathematical operations, since it is the core process of the recall computation [10]. Accordingly, shortening its delay has become an urgent demand. Notably, bipolar-valued data are more commonly used in digital associative memories [3], [5]. Much effort has been devoted to implementing associative memories with hardware circuits [2], [4], [7]. However, all of these implementations focus on binary-valued associative memories and leave the problem of the inner product of two bipolar-valued vectors unresolved. In this paper, a bipolar-valued inner product processor for associative memory networks is proposed to compute the inner product of two bipolar vectors, wherein a systolic architecture of digital compressors based on 3-2 compressor building blocks is included to compute the summation of the individual inner product terms.
II. ASSOCIATIVE MEMORY NETWORKS
A traditional feedforward heteroassociative network stores $N$ sample pattern pairs $\{(X_1, T_1), (X_2, T_2), \ldots, (X_N, T_N)\}$, where $X_i \in \{-1, +1\}^m$, $T_i \in \{-1, +1\}^p$, and $i = 1, 2, \ldots, N$. The objective is to retrieve pattern $Y_k$ ($k = 1, 2, \ldots, N$), where $Y_k = T_k$ whenever $X_k$ is the input to the network. We define a correlation matrix [5] $M_i = X_i^T T_i$ for the pair of patterns to be associated, $(X_i, T_i)$. The individual pattern matrices $M_i$ can then be superimposed to store $N$ patterns in the matrix $M$, where
$$M = \sum_{i=1}^{N} M_i.$$
Now assume $X_k$ is the input pattern, which is one of the stored patterns, $Y$ is the output pattern, and the $X_i$'s and $T_i$'s are stored pattern pairs. The input is processed and transferred to the output as follows:
$$Y = X_k M = \sum_{i=1}^{N} (X_k X_i^T)\, T_i,$$
so that every recall evaluates the inner products $X_k X_i^T$ of bipolar vectors. The network will then produce the stored pattern $X_k$ when the key pattern $X$ is presented as an input. Namely, an undistorted prototype vector can be recovered by the memory in response to a distorted prototype key vector. In such a case, vector $X_k$ can be regarded as stored data, and the distorted key serves as a search key or argument. Notably, when the cardinality of the pattern pairs is large in the above calculation, the carry propagation of the inner product of the bipolar-valued vectors will likely become the critical delay of the entire neural computation.
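The storage and recall steps of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`outer`, `store`, `recall`), the two small pattern pairs, and the sign threshold applied to the raw product are not taken from the paper.

```python
# Sketch of correlation-matrix storage and recall for bipolar pattern pairs.
# Helper names, sample patterns, and the sign threshold are assumptions.

def outer(x, t):
    """Correlation matrix M_i = X_i^T T_i (m rows, p columns)."""
    return [[xj * tl for tl in t] for xj in x]

def store(pairs):
    """Superimpose the M_i of all pattern pairs into one matrix M."""
    m, p = len(pairs[0][0]), len(pairs[0][1])
    M = [[0] * p for _ in range(m)]
    for x, t in pairs:
        Mi = outer(x, t)
        for r in range(m):
            for c in range(p):
                M[r][c] += Mi[r][c]
    return M

def recall(x, M):
    """Return the raw product x M and its sign-thresholded bipolar version."""
    raw = [sum(x[r] * M[r][c] for r in range(len(x))) for c in range(len(M[0]))]
    return raw, [1 if v >= 0 else -1 for v in raw]

# Two pattern pairs whose X vectors are orthogonal, so there is no crosstalk.
pairs = [([1, 1, 1, 1], [1, -1]), ([1, -1, 1, -1], [-1, -1])]
M = store(pairs)
raw, y = recall(pairs[0][0], M)
print(raw, y)  # [4, -4] [1, -1]
```

Each entry of the raw product is a weighted sum of inner products between the key and the stored $X_i$, which is the operation the processor in the next section accelerates.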
III. HIGH-SPEED BIPOLAR-VALUED VECTOR INNER PRODUCT PROCESSOR
The entire design of the bipolar-valued inner product processor is divided into four parts: an individual inner product term generator, a compressor unit, a bipolar-to-binary value converter, and an inner product adjustment unit. Given two bipolar-valued vectors, the individual inner product term generator produces the individual inner product terms and passes them to the compressor unit, which computes their summation. Then, a conversion from bipolar-valued digits to binary digits feeds a two's complement number into the last processing stage, where the augmented inner product is corrected to the precise value. Fig. 1 shows the architecture of the bipolar-valued inner product processor.
A. Inner Product Term Generator
Since $-1$ can be represented as $11\cdots11$ and $+1$ as $00\cdots01$ in two's complement form, we can use the most significant bits (MSBs) of the vector components to differentiate between $+1$ and $-1$ when the inputs to the inner product processor are restricted to bipolar values. The cardinality of the bipolar vectors may be less than or equal to the number of XNOR gates. If the cardinality of the bipolar vectors is less than $2^n - 1$, all the unused inputs are set to zeros, which pads the length of the input vectors to $2^n - 1$. The inner product therefore needs to be adjusted to the precise value later, because it is augmented at this stage by the accumulated outputs of the unused XNOR gates.
TABLE I. TRUTH TABLE FOR SUMMATION OF THREE BIPOLAR BINARY INPUTS
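The behavior of the term generator can be sketched as follows. The function names and the sample vectors are hypothetical; the sketch only assumes what the text states, namely that the MSB is 1 for $-1$ and 0 for $+1$, that one XNOR gate is used per component, and that unused inputs are tied to zero.

```python
# Sketch of the inner product term generator: XNOR of the sign bits (MSBs).
# Function names and sample vectors are illustrative assumptions.

def sign_bit(v):
    """MSB of the two's complement code: 1 for -1, 0 for +1."""
    if v not in (-1, 1):
        raise ValueError("bipolar components must be -1 or +1")
    return 1 if v == -1 else 0

def xnor(a, b):
    """XNOR of two one-bit values."""
    return 1 if a == b else 0

def term_generator(x, y, n):
    """Return 2**n - 1 one-bit terms; HIGH (1) stands for a +1 product."""
    width = 2 ** n - 1
    if len(x) != len(y) or len(x) > width:
        raise ValueError("vectors must have equal length of at most 2**n - 1")
    pad = [0] * (width - len(x))          # unused inputs are set to zeros
    a = [sign_bit(v) for v in x] + pad
    b = [sign_bit(v) for v in y] + pad
    return [xnor(p, q) for p, q in zip(a, b)]

x = [1, -1, -1, 1, 1]
y = [1, 1, -1, -1, 1]
terms = term_generator(x, y, 3)           # 7 XNOR gates, 2 of them unused
bipolar_sum = sum(1 if t else -1 for t in terms)
true_ip = sum(p * q for p, q in zip(x, y))
print(terms, bipolar_sum, true_ip)  # [1, 0, 1, 0, 1, 1, 1] 3 1
```

Each unused gate sees two zeros and outputs HIGH, so the bipolar sum of the terms exceeds the true inner product by the number of unused gates (here 2), which is the augmentation that must be removed later.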
B. Compressor Unit
The summation of three bipolar digits $q$, $r$, $t$ can be expressed as follows:
$$q + r + t = S_1 \cdot 2^1 + S_0 \qquad (2)$$
where $S_1$ and $S_0$ are bipolar digits.
All possible combinations of (2) are illustrated in Table I, which is identical to the truth table of a full adder if a logic LOW represents $-1$ and a logic HIGH represents $+1$ in this table. Therefore, we construct the compressor unit from full adder building blocks to calculate the summation of the individual inner product terms.
1) Basic 3-2 Compressor Building Block: A 3-2 compressor is basically a full adder. The feature of such a compressor is that its output represents the number of 1's present at its inputs. The equations of a full adder are as follows:
$$\mathrm{Sum} = a \oplus b \oplus C_{in}, \qquad C_{out} = ab + bC_{in} + aC_{in} \qquad (3)$$
where $a$ and $b$ are inputs, $C_{in}$ is the carry-in from a previous addition, and $C_{out}$ and Sum are the carry and sum outputs, respectively. The output of the compressor unit denotes the number of logic HIGHs at the inputs.
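The correspondence between a full adder and the bipolar summation can be checked exhaustively. The sketch below uses the standard full adder sum and carry equations and assumes the encoding stated in the text (LOW for $-1$, HIGH for $+1$); the function names are illustrative.

```python
# Exhaustive check that a full adder (3-2 compressor) counts its HIGH inputs
# and also sums three bipolar digits under the LOW = -1, HIGH = +1 encoding.
from itertools import product

def full_adder(a, b, cin):
    """Standard full adder: returns (carry-out, sum)."""
    s = a ^ b ^ cin
    cout = (a & b) | (b & cin) | (a & cin)
    return cout, s

def to_bipolar(bit):
    """Logic HIGH stands for +1, logic LOW for -1."""
    return 1 if bit else -1

def check_all():
    """Verify both readings for all 8 input rows; return the row count."""
    rows = 0
    for a, b, c in product((0, 1), repeat=3):
        cout, s = full_adder(a, b, c)
        # Binary reading: the two outputs encode the number of 1's.
        assert 2 * cout + s == a + b + c
        # Bipolar reading: q + r + t = S1 * 2 + S0.
        q, r, t = to_bipolar(a), to_bipolar(b), to_bipolar(c)
        assert q + r + t == 2 * to_bipolar(cout) + to_bipolar(s)
        rows += 1
    return rows

print(check_all())  # 8
```

The same gate therefore serves both interpretations, which is why no dedicated bipolar adder cell is needed inside the compressor unit.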
2) Framework of $(2^n - 1)$-to-$n$ Compressor: The compressor unit we propose to reduce the carry propagation delay of the critical path consists of parallel 3-2 compressor building blocks at every processing stage. To illustrate the functionality of the compressor unit, a 63-to-6 compressor that sums 63 one-bit data inputs is shown in Fig. 2. At the first processing stage, 21 3-2 compressors generate 21 bits at the second bit position and 21 bits at the least significant bit (LSB) position. Then, 14 3-2 compressors at the second processing stage produce 7 bits at the third bit position, 14 bits at the second bit position, and 7 bits at the LSB position. In the same fashion, a total of 57 3-2 compressors and nine processing stages are needed to produce the sum of 63 bipolar bits. The total delay is approximately proportional to the number of 3-2 compressors on the critical path, which is shown as the dashed line in Fig. 2. The carry propagation delay of $(2^n - 1)$-to-$n$ compressors, for $n$ from 2 to 13, has been computed by a program and can be formalized as follows:
$$D_{cmpr}(n) = \begin{cases} 2n - 3, & 2 \le n \le 6 \\ 2n - 4, & 7 \le n \le 13 \end{cases} \qquad (4)$$
where $D_{cmpr}(n)$ denotes the total delay of a $(2^n - 1)$-to-$n$ compressor, counted by the number of 3-2 compressor blocks.
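The stage counts quoted above can be reproduced with a small simulation. The sketch assumes a greedy schedule in which every stage groups the bits at each position into as many 3-2 compressors as possible and passes leftover bits through unchanged; this schedule and the function names are assumptions, and the authors' program may differ. The stage count is used here as a stand-in for the critical-path length.

```python
# Greedy stage-by-stage reduction of 2**n - 1 one-bit inputs with 3-2
# compressors. The schedule is an assumption made for illustration.

def compressor_tree(n):
    """Return (stages, compressors, history of per-position bit counts)."""
    cols = [2 ** n - 1]                   # cols[j] = bits at weight 2**j
    stages, total, history = 0, 0, []
    while any(c >= 3 for c in cols):
        nxt = [0] * (len(cols) + 1)
        for j, c in enumerate(cols):
            k = c // 3                    # compressors used at this position
            nxt[j] += k + c % 3           # sum bits plus passed-through bits
            nxt[j + 1] += k               # carry bits move one position up
            total += k
        while nxt and nxt[-1] == 0:
            nxt.pop()
        cols = nxt
        stages += 1
        history.append(list(cols))
    return stages, total, history

def d_cmpr(n):
    """Closed form (4) for the delay in 3-2 compressor blocks."""
    if 2 <= n <= 6:
        return 2 * n - 3
    if 7 <= n <= 13:
        return 2 * n - 4
    raise ValueError("formula is stated only for 2 <= n <= 13")

stages, total, history = compressor_tree(6)
print(stages, total)             # 9 57
print(history[0], history[1])    # [21, 21] [7, 14, 7]
```

For the 63-to-6 case the simulation yields the same 21 and 14 compressors in the first two stages, nine stages, and 57 compressors in total, and `d_cmpr(6)` returns 9 as well.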
C. Bipolar-to-Binary Value Converter
The next processing stage converts the bipolar-valued vector generated by the $(2^n - 1)$-to-$n$ compressor unit into the two's complement representation of a signed binary number. Note that a logic LOW represents $-1$ and a logic HIGH denotes $+1$ at the output of the compressor unit. In addition, an $(n + 1)$-bit two's complement form is required to represent the $n$-bit output of the compressor unit, since its value falls in the range from $-(2^n - 1)$ to $(2^n - 1)$. The equations for this converter are found as follows.
We assume the bit representation of the output of the compressor unit is $(Z_{n-1}, \ldots, Z_1, Z_0)$, $Z_i \in \{1, 0\}$. Keep in mind that the actual value of the bit representation $\{1, 0\}$ is, in fact, $\{+1, -1\}$. To avoid any overflow, we add an additional bit at the MSB.
Here, $(Z_{n-1}, \ldots, Z_0) + (Z_{n-1}, \ldots, Z_0)$ equals $(Z_{n-1}, \ldots, Z_0)$ shifted left by one bit.
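The role of the one-bit shift can be made explicit with a short derivation. It is a sketch that follows only from the stated encoding ($Z_i = 1$ for $+1$, $Z_i = 0$ for $-1$) and the $(n+1)$-bit two's complement target; it is not claimed to reproduce the authors' equation (5), and the symbol $V$ for the bipolar value is introduced here for illustration.

```latex
\begin{align*}
V &= \sum_{i=0}^{n-1} (2Z_i - 1)\,2^{i}
   = 2\sum_{i=0}^{n-1} Z_i\,2^{i} - (2^{n} - 1) \\
  &\equiv (Z_{n-1}, \ldots, Z_0, 0)_2 + (1, 0, \ldots, 0, 1)_2 \pmod{2^{n+1}} \\
  &= (\overline{Z_{n-1}}, Z_{n-2}, \ldots, Z_0, 1)_2 .
\end{align*}
```

The first term is the compressor output shifted left by one bit, and $(1, 0, \ldots, 0, 1)_2 = 2^n + 1$ is the $(n+1)$-bit two's complement code of $-(2^n - 1)$. For $n = 2$ and $(Z_1, Z_0) = (0, 1)$, the bipolar value is $-2 + 1 = -1$, and the derived code is $(1, 1, 1)_2 = -1$.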
We, thus, can implement the bipolar-to-binary value converter as shown in Fig. 3(a) according to (5) .
D. Inner Product Adjustment Unit
The inner product has to be corrected to the precise value. Consider the case in which the dimension of the bipolar vectors, $d$, is less than $2^n - 1$.
First, $d$, represented as an $n$-bit value, is fed into the addend inputs of the $(n + 1)$-bit carry-lookahead adder (CLA), as illustrated in Fig. 3(b), where the MSB of the $(n + 1)$-bit input is set to 1. The addend inputs of the adder thus hold the one's complement form of $-(2^n - 1 - d)$. Consequently, the output of the CLA is the two's complement form of $-(2^n - 1 - d)$, since the augend inputs are set to 1.
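The conversion and adjustment steps can be modeled bit-accurately in Python. The sketch rests on several assumptions that are labeled in the code: the function names are hypothetical, the compressor output is taken to be the binary count of HIGH terms, and the final addition of the correction term to the converted value is inferred from the purpose of the unit rather than quoted from the text.

```python
# Bit-level sketch of the converter and the adjustment unit.
# Names and the final modular addition are illustrative assumptions.

def bipolar_to_twos_complement(z, n):
    """(n+1)-bit two's complement code of 2*z - (2**n - 1), where z is the
    n-bit compressor output read as a count of HIGH terms."""
    mask = (1 << (n + 1)) - 1
    return ((z << 1) + (1 << n) + 1) & mask

def adjustment_term(d, n):
    """(n+1)-bit two's complement code of -(2**n - 1 - d)."""
    mask = (1 << (n + 1)) - 1
    addend = (1 << n) | d     # d with the MSB set to 1: the one's complement
    return (addend + 1) & mask  # adding 1 yields the two's complement

def to_signed(code, bits):
    """Interpret a bit pattern of the given width as a signed integer."""
    return code - (1 << bits) if code & (1 << (bits - 1)) else code

def inner_product(x, y, n):
    """Model of the processor datapath for vectors of length d <= 2**n - 1."""
    d, width = len(x), 2 ** n - 1
    # HIGH terms: matching components plus the unused (padded) XNOR gates.
    z = sum(1 for a, b in zip(x, y) if a == b) + (width - d)
    mask = (1 << (n + 1)) - 1
    total = (bipolar_to_twos_complement(z, n) + adjustment_term(d, n)) & mask
    return to_signed(total, n + 1)

x = [1, -1, -1, 1, 1]
y = [1, 1, -1, -1, 1]
print(to_signed(adjustment_term(len(x), 3), 4))  # -2
print(inner_product(x, y, 3))                    # 1
```

With $n = 3$ and $d = 5$, the correction term is $-(7 - 5) = -2$, which removes the contribution of the two unused XNOR gates from the augmented value 3 and leaves the true inner product 1.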
E. Performance Analysis
A simple yet slow bipolar-valued inner product processor is presented in Fig. 4 to facilitate the overhead analysis of our design. Although the hardware of this primitive inner product processor is simpler than that of our proposed scheme, its total delay for the inner product calculation turns out to be
$$\mathrm{Delay}_{\mathrm{Primitive\ scheme}} = (2^n - 1)(d_{XNOR} + d_{Reg} + d_{MUX} + d_{CLA}) \qquad (6)$$
where $d_{XNOR}$ denotes the delay of the XNOR gate, $d_{Reg}$ represents the delay of the register, $d_{MUX}$ stands for the delay of the multiplexer, and $d_{CLA}$ denotes the delay of the $(n + 1)$-bit CLA.
As for the total delay of our proposed scheme, it can be expressed as follows:
where $d_{3\text{-}2Comp}$ stands for the delay of the 3-2 compressor unit.
From (6) and (7), it is obvious that the total delay of inner product calculation is reduced significantly in our proposed scheme.
IV. CHIP IMPLEMENTATION AND MEASUREMENT
The proposed processor was approved by the Chip Implementation Center (CIC) of the National Science Council (NSC) and then fabricated in the Taiwan Semiconductor Manufacturing Company (TSMC) 0.6-μm 1P3M technology. The total chip area is 2260 × 2244 μm², the core area is 1203.7 × 1198.9 μm², and the gate count is 2.3 K.
A. Physical Chip Testing
Fig. 5 is the die photo. Fig. 6 is a snapshot generated by the HP 1660CP when the chip is under test. The maximum clock rate is 100 MHz, which is the same as predicted by the simulation results, and the power consumption is 300 mW.
V. CONCLUSION
We have proposed a novel architecture of a bipolar-valued inner product processor that can be employed in the implementation of associative memory networks. The systolic architecture of the $(2^n - 1)$-to-$n$ compressor significantly reduces the carry propagation delay in the critical path of the bipolar inner product, which is the bottleneck of the whole computation. A physical chip implementation is also presented, and the fabricated chip reaches the 100-MHz maximum clock rate predicted by the simulation results.
