Universität Dortmund

Abstract

Two VLSI special-purpose hardware implementations of an associative memory model are described: a pure digital and a mixed analog/digital architecture. Both architectures can be easily extended to large scale memories with several million storage elements. The advantages and disadvantages of both architectures are pointed out. The memory concept is based on a simple matrix structure with n×m binary elements, the connections, and on distributed storage of information as in artificial neural networks. There is no asynchronous feedback and the inputs and outputs are binary, too. Though the system concept is very simple, it has an asymptotic storage capacity of 0.69·n·m bits and the number of patterns that can be stored with low error probability is much larger than the number of columns (artificial neurons). A consequence for applications is that the input/output patterns have to be sparsely coded.
Introduction
In the last decade there has been an increasing interest in the use of artificial neural networks in various applications. One of these application areas, where the analysis of the performance of a neural network approach is comparatively advanced, is associative memory. Neural networks are well suited for the implementation of associative correlation memories [1]. The idea is that information is stored in terms of synaptic connectivities between artificial neurons. The activities of the neurons represent the stored patterns. Many different models have been discussed in the literature under such names as "Lernmatrix", "Correlation Matrix", "Associative Memory" etc. [1-4]. Recent advances have been largely supported by simulations on conventional computers. However, if these models are to offer a viable alternative for storing and processing information in large scale applications (e.g. pattern recognition), these systems will have to be implemented in hardware. Because of their regular and modular structure, neural networks are well adapted to VLSI system design. Implementing large numbers of individually primitive processing elements directly in VLSI hardware is intuitively appealing. There are two different approaches for supporting these models on parallel VLSI hardware [5]:
- General-Purpose Neurocomputers: generalized, programmable neural computers for emulating a wide range of neural network models, thus providing a framework for executing neural models in much the same way that traditional computers address the problems of "number crunching".
- Special-Purpose VLSI Systems: specialized neural network hardware implementations that are dedicated to a specific neural network model and therefore have a potentially higher performance than Neurocomputers.

This paper is devoted to a special-purpose hardware implementation of a very simple associative memory loosely based on neural networks. The memory has a simple matrix structure with binary elements (connections, synapses) and performs a pattern mapping or completion of binary input/output vectors. To the authors' knowledge, this comparatively simple model of a distributed associative memory was first discussed by Willshaw et al. [3] in 1969. However, similar structures have been discussed more generally, e.g. by Kohonen [1]. The characteristics of the implemented model are described in section 2.
The important aspect for VLSI implementation of this simple memory model is the close relationship to conventional memory structures. Hence, it can be densely integrated, and large scale memories with several thousands of columns (model neurons) can already be realized with current technologies. Furthermore, the regular topology results in a rigorous modularization of the system, indispensable for a successful management of the design and test complexity of VLSI systems. In this respect a pure digital and a mixed analog/digital VLSI architecture are described in section 3 and discussed in section 4.
2 The Memory Model

In general the basic operation of an associative memory (AM) is a certain mapping between two finite sets X and Y. In a more abstract sense these two sets may be regarded as questions and answers or stimuli and responses, both coded as vectors of numbers (Figure 1). The model stores z pairs of binary patterns in an n×m binary connection matrix W. The connections are programmed by a simple binary Hebbian rule, an OR over all stored pairs:

w_ij = OR over μ=1..z of (x_i^μ AND y_j^μ)    (1)

For recall, every column j computes the weighted sum of the input pattern x,

S_j = Σ_{i=1..n} w_ij · x_i    (2)

and the binary output is obtained by a threshold operation, with the number l of '1's in the input as threshold:

y_j = 1 if S_j ≥ l, else y_j = 0.    (3)

Obviously, because of the above mentioned programming rule the memory matrix gets more and more filled (the connections will never be switched off). Consequently, the output might contain more '1's than the desired output pattern. The chance that this kind of error will occur increases with the number z of stored pairs. This fact raises the following quantitative questions:
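As a minimal illustration of these three equations, the following Python sketch stores and recalls sparse binary patterns. All names and the dense list representation are our own illustrative choices, not part of the hardware design.

```python
def program(W, x_addrs, y_addrs):
    """Equation (1): switch connections on; they are never switched off."""
    for i in x_addrs:            # addresses of the '1's in the input pattern
        for j in y_addrs:        # addresses of the '1's in the output pattern
            W[i][j] = 1

def recall(W, x_addrs, m):
    """Equations (2) and (3): weighted sum per column, then threshold."""
    l = len(x_addrs)                              # number of active inputs
    sums = [sum(W[i][j] for i in x_addrs) for j in range(m)]
    return [1 if s >= l else 0 for s in sums]

n = m = 16
W = [[0] * m for _ in range(n)]
program(W, x_addrs=[2, 5, 11], y_addrs=[3, 7])
print(recall(W, x_addrs=[2, 5, 11], m=m))   # recovers '1's at positions 3 and 7
```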
i) How many patterns can be stored in an AM?
ii) How many bits of information can be stored in an AM?
Both questions were answered by Palm [6]. Summarizing his results, an AM has its optimal storage capacity I for sparsely coded input/output patterns. This means the number l (k) of active components ('1') in the input (output) patterns should be logarithmically related to the pattern length n (m). Asymptotically, the optimal storage capacity for heteroassociation is given by [6]:
I → ln 2 · n·m ≈ 0.69 · n·m bits

for n,m → ∞ and parameters l, k logarithmically related to the pattern lengths n, m.
Hence, the storage capacity I is proportional to the number of storage elements n·m. Furthermore, the number of patterns z that can be stored is much larger than the number of columns (artificial neurons). For example, an optimum of I=593000 bits for n,m=1000, z=34780 (k=2, l=9) can be stored in the AM under the constraint that on the average 90% of the information of the output vector of each pair is stored [6]. Figure 4 shows the information I that can be stored in an AM as a function of the number of stored patterns (z) and Table 1 shows I as a function of the number of activated components in the input (l) and output (k) pattern respectively.

Table 1: Number of patterns that can be stored with low error probability and the corresponding storage capacity I as a function of the parameters n, m, l (k = 3).
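The quoted optimum can be checked with a short back-of-the-envelope calculation. The script below is our own plausibility check, assuming the stored information is counted as 90% of the total information content of the z sparse output patterns.

```python
from math import comb, log2

n = m = 1000            # matrix dimensions
k, l = 2, 9             # '1's per output / input pattern
z = 34780               # number of stored pattern pairs

bits_per_output = log2(comb(m, k))   # information in one sparse output pattern
I = 0.9 * z * bits_per_output        # 90% of the output information is retained
print(round(I))                       # -> about 593000 bits
print(round(0.69 * n * m))            # asymptotic bound: about 690000 bits
```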
For the autoassociative case the optimal storage capacity is at most half the optimal capacity for heteroassociation [7]. This is because autoassociation leads to a symmetric weight matrix and hence only half of the matrix is used for storing information.
Furthermore, it turns out that the AM works for pattern mapping applications in a more economic way compared to conventional methods (e.g. hashing) and other neural network models, if the number of patterns is large and their individual information content small [7]. These results encourage a VLSI hardware implementation of this simple associative memory model in situations where such a mapping is a more natural way of storing information than a listing, especially since the AM works the more effectively the larger the matrix is [7].
3 VLSI Implementation
The Associative Matrix can be handled most flexibly as a simulation program on a conventional computer (workstation), of course. It could be shown that even a serial simulation of the AM, in terms of bitwise mask operations, has to perform fewer operations than a conventional implementation [8]. For the special case m<<n and a sparse matrix the serial implementation is, even for a large number of patterns, fast enough for certain applications [8]. But in general, a serial implementation has a poor performance, especially if applied to larger matrices.
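The following sketch indicates the idea of such a serial simulation. It is our own reading of the bitwise approach in [8], with an assumed row-per-machine-word representation: only the l active rows are touched during recall, and each row is processed as a word-wide mask.

```python
def recall_bitwise(rows, x_addrs, m):
    """Serial recall over bit masks: rows[i] holds row i of W as an m-bit int."""
    l = len(x_addrs)
    counts = [0] * m
    for i in x_addrs:                    # only l rows are read instead of all n
        row = rows[i]
        for j in range(m):               # word-wide in [8]; bit loop here for clarity
            counts[j] += (row >> j) & 1
    return sum(1 << j for j in range(m) if counts[j] >= l)   # output as bit mask
```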
It is quite obvious that the operation can be sped up considerably by parallel processing. The implementation by means of multiprocessor architectures (SIMD machines) is a promising compromise between flexible modelling (the system is still program controlled) and a completely parallel processing of large matrices. In fact, at least two research groups have already designed a parallel associative computer (SIMD) based on a set of conventional microprocessors communicating via a common bus [9, 10].
Consequently, the highest degree of parallelism is achieved by task-dedicated VLSI systems. It is well within the range of current technologies to implement an AM effectively on VLSI chips. Two different special-purpose VLSI architectures have been designed for an AM so far at the University of Dortmund: a digital and a digital/analog implementation. Both of them will be discussed in this paper.
The system architecture in both cases is split up vertically into "slices"; each slice manages an equal number of columns. The slices are controlled by a conventional microprocessor (system control, Figure 5), which distributes input data in an appropriate way to the slices and collects output data from the slices. As a consequence of the sparsely coded input/output patterns, the microprocessor transfers and collects the patterns optimally by means of the addresses of the activated components. Hence, a transfer operation of an m-bit pattern takes only log(m) cycles and address lines in the serial case.
Figure 5: Partition of an m×n Associative Matrix into slices.
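To illustrate this address-based transfer (a sketch under our own naming, not the actual bus protocol):

```python
from math import ceil, log2

def to_addresses(pattern):
    """Sparse binary pattern -> list of addresses of its '1's."""
    return [i for i, bit in enumerate(pattern) if bit]

m = 8192
addr_lines = ceil(log2(m))   # 13 address lines suffice for an 8k-bit pattern
# A pattern with l active components is moved in l cycles over addr_lines
# lines, instead of m cycles over a single serial data line.
```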
Digital Implementation
In the case of the digital implementation the columns of an AM are controlled by a special slice chip comprising several very simple processing units (PUs). Each PU controls one column of the matrix and computes bit-serially the weighted sum (Equation 2) of the input pattern and the respective column. Because the input/output signals as well as the connection elements are binary, the basic building blocks of a PU are a counter and a comparator (Figure 6a). The programming algorithm (Equation 1) for the connection matrix is realized by a simple OR-logic block and is incorporated on the chip, too. The connection matrix can be built up from conventional RAMs (Figure 6b).
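In software terms, the behavior of one PU during recall and programming can be sketched as follows (function and variable names are illustrative only):

```python
def pu_recall(column, x_addrs, threshold):
    """One PU: bit-serial counter over the column, then comparator (Eqs. 2, 3)."""
    counter = 0
    for addr in x_addrs:                 # one RAM access per active input row
        counter += column[addr]          # the counter accumulates the weighted sum
    return 1 if counter >= threshold else 0

def pu_program(column, x_addrs, y_bit):
    """OR-logic block: set connections if this column's output bit is '1' (Eq. 1)."""
    if y_bit:
        for addr in x_addrs:
            column[addr] = 1             # connections are only ever switched on
```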
In order to transfer the output pattern to the system control, the addresses of the '1's in the pattern are generated locally in the slice chips. All slice chips are connected to a common bus and the access to the bus can be controlled by daisy-chaining or by an additional priority encoder logic. In the case of the daisy-chaining method the time for transferring the output pattern is proportional to the number of slice chips. In the other case the time is proportional to the number of active components ('1'). In both cases the transfer of the associated output pattern and the calculation of the weighted sum of inputs can be pipelined.
Up to now, several standard-cell designs of a slice chip comprising 32 PUs (e.g. 2 µm CMOS, 31 mm², ≈20000 transistors, 53 pads, 10 MHz [11]) and a full-custom design comprising 128 PUs (2 µm CMOS, 64 mm², ≈50000 transistors; Figure 7) have been finished. Instead of realizing a whole chip in silicon, we have first fabricated and successfully tested a single PU of the full-custom design at the University of Dortmund. The tested 6-bit PU is able to perform the calculations for Equations (1)-(3).

With current submicron technologies it is possible to integrate 512 and more PUs on a single slice chip. Hence, the limiting factor for the number of PUs per chip is the pin count. Nevertheless, it is possible to have at least 256 PUs on a slice chip. The important disadvantage of this digital approach up to now is that most RAM chips have a one, four or eight bit organization, whereas a longer word length (>16) is more appropriate for this approach. In order to get the highest degree of parallelism the optimal memory organization is m×u (m = number of matrix rows, u = number of PUs per slice chip). For an 8k×8k AM, for example, RAM chips with an 8k×128 organization are required. Therefore, the system architecture has to be slightly modified in order to make effective use of currently available memory chips. Work on this topic is being done at the moment.
Such an implementation performs a pattern mapping within about 100 µs. The association time is proportional to the number of '1's in the input/output patterns (log(m) + log(n)) and hence independent of the number z of stored pairs. Table 2 shows some estimations of the association time for a single pattern (t_A) and the programming time for z pattern pairs (t_P), depending on the number of '1's in the input and output patterns. The estimations are based on test results of the fabricated PU in combination with a static RAM (100 ns cycle time).

Table 2: Association (t_A) and programming (t_P) time estimations as a function of the parameters n, m, k, l, z.
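A rough model of these two times, under our own simplifying assumptions (full column parallelism, no pipelining overlap, one RAM access per active component):

```python
T_CYCLE = 100e-9          # static RAM cycle time used for the estimates

def t_association(l, k, t_cycle=T_CYCLE):
    """l accesses for the weighted sums plus k output-address transfers."""
    return (l + k) * t_cycle

def t_programming(z, l, t_cycle=T_CYCLE):
    """One OR-write per active input row for each of the z stored pairs."""
    return z * l * t_cycle

print(t_association(l=13, k=13))        # ~2.6 microseconds per association
print(t_programming(z=100000, l=13))    # ~0.13 seconds for 100000 pairs
```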
Digital/Analog Implementation
The largest computational load in implementing an AM is incurred by the weighted sum of input signals (Equation 2). Using analog circuit techniques [12], this sum can be computed effectively by summing analog currents or charge packets, for example. In Figure 8 a simple circuit concept in CMOS technology is proposed. The matrix operation is calculated by current summing and the threshold operation is done by an analog voltage comparator.
The accuracy of analog circuits is not as high as that of digital circuits, but they can be built much more compactly and they are more appropriate for the highly parallel signal transfer operations immanent in neural networks. Note, however, that analog circuits are not so densely integrated as it may seem at first glance. They demand large-area transistors to assure an acceptable precision and to provide good matching of functional transistor pairs, as used in current mirrors or differential stages. Furthermore, analog circuits are influenced by device mismatches from the fabrication process and it is very difficult to control offset voltages, for example. Consequently, analog implementations should be applied to artificial neural networks requiring only modest precision. One example for such a network is the AM, because there are only log(m) terms contributing to the weighted sum S_i. The required accuracy of an AM is only about 4 to 5 bits even for large matrices (m,n > 10000) and hence in the range of analog circuit techniques.
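A quick plausibility check of the 4 to 5 bit figure (our own arithmetic, not from the paper): the comparator only has to separate a sum of l terms from a sum of l-1 terms.

```python
from math import ceil, log2

m = 10000
l = round(log2(m))                 # ~13 active inputs contribute to each sum S_i
required_bits = ceil(log2(l)) + 1  # resolve S_i = l vs. S_i = l - 1, with margin
print(l, required_bits)            # -> 13 active terms, about 5 bits of accuracy
```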
The design of the connection element is based on conventional storage devices (ROM, RAM, EEPROM). For example, a conventional static memory cell has to be enlarged by two transistors; an EEPROM cell requires no additional transistors [13]. Hence, one million programmable connections can be integrated on one chip with current VLSI techniques. An 8192×8192 AM requires 64 such chips, each comprising 128 columns.
Even more limiting to the overall size of an AM slice than the area needed for the connections are the pin requirements of each slice. Taking advantage of the sparsely coded patterns permits an effectively serial as well as parallel transfer of the input/output patterns. The input/output organization in the serial case is similar to that of the digital implementation (Figure 9a). For a fully parallel transfer of the patterns the m rows and n columns of the matrix are divided into g blocks of equal size. Under the assumption that at most one component in each block is active, only g (1-of-m/g) decoders are needed for a fully parallel transfer of the input pattern to the AM (Figure 9b). The number of input pins is then approximately g · log(m/g); for example, with m = 8192 and g = 16 blocks, only 16 · log(512) = 144 input pins are needed instead of 8192.

A further attractive feature of the AM is its fault tolerance to defective connections. Even in the presence of 5% defects, up to 20.000 sparsely coded patterns (l=3, k=3) can be stored in a 1.000 × 1.000 AM with low error probability. Therefore, the AM will also be well adapted to the evolving wafer-scale integration technique.

4 Discussion

Because of the modular and regular structure of the proposed architectures, the implementation of very large AMs (n,m > 10.000) is feasible. This aspect is very important for practical applications where the AM has to be extended to a useful number of storage elements. Work on possible applications of an associative memory of this type is being done at the moment by different research groups, e.g. in the fields of speech recognition, scene analysis and information retrieval.
Comparing the two VLSI approaches presented above: for the digital design we can call on efficient software tools for a fast and reliable design of even complex digital systems. For the memory matrix we can use standard RAM chips, employing the highest density in devices. In general, the matrix dimensions (n,m) can be extended by using additional RAM chips. An important disadvantage up to now is that most RAM chips have a one, four or eight bit organization, whereas a longer word length (>16) is more appropriate for the digital implementation.
In contrast, the design of analog circuits demands much more time, good theoretical knowledge of transistor physics and heuristic layout experience. Only a few process lines are characterized for analog circuits. The noise immunity and precision are low compared to digital circuits. The fixed matrix dimensions are a further disadvantage. In their favor, we point out that analog circuits can be built much more compactly and are more appropriate for the highly parallel signal transfer operations immanent in neural networks. For example, a 1.000×1.000 AM can be integrated on one chip, whereas the digital concept requires several slice and RAM chips. In conclusion, both approaches have their advantages and it remains to be seen which type of implementation will be more effective in certain applications.
