This paper describes a generic and fast classifier that uses a binary CMM (Correlation Matrix Memory) neural network for storing and matching a large number of patterns efficiently, and a k-NN rule for classification. To meet CMM input requirements, a robust encoding method is proposed to convert numerical inputs into binary ones with the maximally achievable uniformity. To reduce the execution bottleneck, a hardware implementation of the CMM is described, which shows that the network, with on-board training and testing, operates at over 200 times the speed of a current mid-range workstation and is scalable to very large problems. The CMM classifier has been tested on several benchmarks and, compared with a simple k-NN classifier, it gave less than 1% lower accuracy and speed-ups of over 4 and 12 times in software and hardware respectively.
Introduction
Desirable characteristics of Correlation Matrix Memory (CMM) neural networks include simple and quick training, and highly flexible and fast search ability [1]. Whereas most neural networks need long iterative training, a CMM is trained using a one-shot storage mechanism and simple binary operations. The CMM has been used as a match engine in a number of successful applications, e.g. symbolic reasoning in the AURA (Advanced Uncertain Reasoning Architecture) approach [2], chemical structure matching [4] and post code matching. This work investigates its use for pattern classification tasks. It is known that the k-NN rule [5] is applicable to a wide range of classification problems. However, this method is too slow for many applications with large amounts of data. To speed it up, previous researchers have considered reducing the training data [6] and improving computational efficiency via complex pre-processing of the training data [7]. In contrast, a CMM is a simple, general and powerful approach which can be used to store a large number of training patterns efficiently, and to retrieve both exact and near matches quickly for a test pattern. Therefore, the combination of CMM and k-NN techniques may result in a generic and fast classifier.
For most classification problems, patterns are in the form of multi-dimensional real numbers, and appropriate quantisation and encoding are needed to convert them into binary inputs to a CMM. A robust quantisation and encoding method is developed to meet requirements for CMM input codes, such as uniformity, orthogonality and sparseness [3] , and to overcome the common problem of identical data points in many applications, e.g. background of images or normal features in a diagnostic problem.
An analysis of the AURA [2] method quickly identified the execution of the CMM as the bottleneck in the processing. To reduce this bottleneck, the CMM has been implemented in dedicated hardware, namely the PRESENCE architecture. The primary aim is to improve the execution speed over conventional workstations in a cost-effective way. This work was also motivated by the needs of many research projects applying the CMM to the commercial problems mentioned above.
The next section discusses the CMM for pattern classification and the robust uniform (RU) encoding method, followed by descriptions of the PRESENCE architecture (the hardware implementation of the CMM). Experimental results are presented in Section 4, and concluding remarks in the last section.

Figure 1 shows the architecture of the CMM classifier. The RU encoder (as detailed in 2.2) quantises numerical inputs and generates binary codes; the CMM engine stores training patterns and retrieves the stored patterns close to a test pattern, which are supplied to a conventional k-NN module for classification. Both the CMM and k-NN modules are needed because the CMM is fast but produces spurious errors as a side effect [3]. These are removed through the application of the k-NN rule. More specifically, the speed of the classifier benefits from the use of the CMM for fast training and matching to pre-select a subset of patterns from a large amount of training data; the accuracy gains from the application of the k-NN rule to the subset in the original space, which reduces the effect of information loss and noise in the encoding and match processes.
CMM for Pattern Classification

Pattern Match and Classification with CMM
[Figure 1: Architecture of the CMM classifier. Training/test data pass through the RU encoder to the CMM pattern store and match engine; patterns preselected by the CMM are passed to the k-NN module.]

In the CMM there is a binary matrix M and, prior to any learning, all of its elements are set to '0'. In a training process a unique binary vector (or separator, as it is often called) s_i is generated to label an unseen input binary vector p_i; the CMM learns through the association of the two vectors by performing the following logical ORing operation,

$$ M = \bigvee_i s_i p_i^T \qquad (1) $$

In a recall process, for a given test input vector p_k, the CMM performs,

$$ v_k = M p_k \qquad (2) $$
followed by thresholding v_k and recovering individual separators using an MBI (Middle Bit Index) method [2]. For speed, it is appropriate to use a fixed thresholding method, with the threshold set either to the number of '1' bits in the input pattern, to allow an exact match, or to a lower level, to match a proportion of the input pattern as detailed below.
To understand the recall properties of the CMM, consider the case where a known pattern p_k is presented; then Equation 2 can be written as,

$$ v_k = n_p s_k + e_k \qquad (3) $$

where n_p = p_k^T p_k is the number of '1' bits in p_k, and e_k is a cross-talk term that is non-zero only when some other stored pattern p_i (i ≠ k) and p_k have some common components. Therefore, v_k also contains separators for partially matched patterns, and these separators can be obtained at lower threshold levels. This partial or near match property is useful for pattern classification as it allows the retrieval of stored patterns which are close to the test pattern in Hamming distance.
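The storage and recall steps described above can be illustrated with a minimal pure-Python sketch. It is not the AURA library: the class name `BinaryCMM`, the function `threshold`, and the use of one-bit separators (so that separator recovery by the MBI method can be skipped) are assumptions made for illustration only.

```python
class BinaryCMM:
    """Toy binary correlation matrix memory (hypothetical helper, not AURA)."""

    def __init__(self, n_inputs, n_outputs):
        # M[j][l] links separator bit j with input bit l; all '0' before training.
        self.M = [[0] * n_inputs for _ in range(n_outputs)]

    def train(self, p, s):
        # One-shot storage: logical OR of the outer product of s and p into M.
        for j, sj in enumerate(s):
            if sj:
                row = self.M[j]
                for l, pl in enumerate(p):
                    if pl:
                        row[l] = 1

    def recall(self, p):
        # Integer sums: each separator bit counts the input bits it matches.
        return [sum(m for m, pl in zip(row, p) if pl) for row in self.M]


def threshold(v, level):
    # Fixed thresholding: keep separator bits whose sum reaches the level.
    return [1 if x >= level else 0 for x in v]


cmm = BinaryCMM(n_inputs=6, n_outputs=3)
cmm.train([1, 1, 0, 0, 0, 0], [1, 0, 0])  # pattern 0
cmm.train([0, 1, 1, 0, 0, 0], [0, 1, 0])  # pattern 1, shares one bit with pattern 0
cmm.train([0, 0, 0, 0, 1, 1], [0, 0, 1])  # pattern 2, no bits in common

v = cmm.recall([1, 1, 0, 0, 0, 0])
exact = threshold(v, 2)  # level = number of '1' bits in the input: exact match
near = threshold(v, 1)   # lower level: partial (near) matches are also returned
```

With the threshold at the number of set input bits only the stored pattern itself is returned, whereas the lower level also returns the pattern that shares one bit with the input, which is the near-match behaviour exploited for classification.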
From those training patterns matched by the CMM engine, a test pattern is classified using the k-NN rule. Distances are computed in the original input space to minimise the information loss due to quantisation and noise in the above match process. As the number of matches returned by the CMM is much smaller than the number of training data, the distance computation and comparison are dramatically reduced compared with the simple k-NN method. Therefore, since the CMM stage is very fast, the CMM-based k-NN classifier can be faster than the simple k-NN method.
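As a rough sketch of this second stage, the k-NN rule can be applied only to the indices returned by the match engine. The function name `knn_on_candidates`, the Euclidean metric and the toy data are assumptions for illustration; the candidate list stands in for the output of the CMM stage.

```python
import math
from collections import Counter


def knn_on_candidates(x, candidates, train_x, train_y, k):
    """Classify x by the k-NN rule restricted to pre-selected training indices."""
    if not candidates:
        raise ValueError("no candidates returned by the match stage")
    # Distances are computed in the original (unquantised) input space.
    scored = sorted((math.dist(x, train_x[i]), i) for i in candidates)
    votes = Counter(train_y[i] for _, i in scored[:k])
    return votes.most_common(1)[0][0]


train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (9.0, 9.0)]
train_y = ["a", "a", "b", "b", "b", "c"]

# Hypothetical output of the CMM match stage: indices of near-matching patterns.
candidates = [0, 1, 2]
label = knn_on_candidates((0.05, 0.1), candidates, train_x, train_y, k=3)
```

Only three distances are computed here instead of six; the saving grows with the ratio of training set size to the number of patterns preselected.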
Robust Uniform Encoding
In addition to the above sparseness and orthogonality, another primary requirement for CMM input codes is that they should be distributed as uniformly as possible in order to avoid some parts of the CMM being used heavily while others are rarely used. Figure 2 shows three stages of the encoding process, that is quantising d-dimensional real numbers, x i , generating sparse and orthogonal binary vectors, c i , and concatenating them to form a CMM input vector.
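The three stages can be sketched as follows, assuming that the bin boundaries for each dimension are already known and that each quantised value is represented by a one-bit (one-hot) code. The one-hot choice, the function name `encode` and the example boundaries are illustrative assumptions rather than the encoder used in this work.

```python
from bisect import bisect_right


def encode(x, boundaries):
    """Quantise each component of x and concatenate one-hot codes."""
    code = []
    for value, edges in zip(x, boundaries):
        # Stage 1: the bin index is the number of boundaries <= value.
        b = bisect_right(edges, value)
        # Stage 2: a sparse code per dimension; codes of different bins are orthogonal.
        c = [0] * (len(edges) + 1)
        c[b] = 1
        # Stage 3: concatenate the per-dimension codes into the CMM input vector.
        code.extend(c)
    return code


# Hypothetical boundaries: 3 bins on the first axis, 2 bins on the second.
boundaries = [[0.5, 1.5], [10.0]]
code = encode((0.7, 12.0), boundaries)
```

Every encoded vector then has exactly d bits set, which gives the fixed input weight that a fixed threshold relies on.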
The code uniformity is met at the quantisation stage. For a given set of N training samples in some dimension (or axis), it is required to divide the axis into N_b small intervals, called bins, such that they contain uniform numbers of data points. As the data often have a non-uniform distribution, the sizes of these bins should be different. It is also quite common for real world problems that many data points are identical. For instance, there are 11%-99.9% identical data in the benchmarks used in this work. Our robust quantisation (RQ) method described below is designed to cope with the above problems and to achieve a maximal uniformity. The above partition process may be repeated to increase the uniformity. The boundaries of the bins obtained become parameters of the encoder in Figure 2. The sizes of the bins (or the number of bins) determine the match 'neighbourhood', since samples falling in the same bin have the same quantised value. In the extreme case when N_b = 1, the CMM matches all training samples for any test data and our complete system becomes a standard k-NN classifier. In general it is appropriate to choose N_b such that each bin contains a number of samples larger than the k nearest neighbours needed for the optimal classification.
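The goal of bins with roughly equal counts, in the presence of many identical values, can be illustrated with a simple equal-frequency partition. This is a generic sketch and not the RQ method itself: the function name `equal_frequency_boundaries` and the rule of dropping a boundary that would fall inside a run of identical values are assumptions.

```python
from bisect import bisect_right


def equal_frequency_boundaries(values, n_bins):
    """Boundaries giving roughly equal counts; identical values share one bin."""
    data = sorted(values)
    n = len(data)
    boundaries = []
    for b in range(1, n_bins):
        cut = b * n // n_bins  # ideal split position in the sorted data
        if cut == 0 or cut >= n:
            continue
        lo, hi = data[cut - 1], data[cut]
        if lo == hi:
            continue  # the cut would split a run of identical points: drop it
        edge = (lo + hi) / 2
        if not boundaries or edge > boundaries[-1]:
            boundaries.append(edge)
    return boundaries


values = [0, 0, 0, 0, 1, 2, 3, 4]  # half of the points are identical
boundaries = equal_frequency_boundaries(values, n_bins=4)

counts = [0] * (len(boundaries) + 1)
for v in values:
    counts[bisect_right(boundaries, v)] += 1
```

Requesting a single bin returns no boundaries, so every sample falls in the same bin, which corresponds to the N_b = 1 case in which the system reduces to a standard k-NN classifier.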
3 The PRESENCE Architecture
Architecture design
Some of the important design decisions for implementing the CMM were that the system should use cheap memory, and should not attempt to embed both the weight storage and the training and testing in hardware (VLSI). This is because the applications commonly use CMMs with over 100Mb of weight memory, which would be difficult and expensive to implement in custom silicon. The system must also be hosted on industry-standard buses to allow widespread application; thus VME and PCI were chosen.
The PRESENCE architecture implements the control logic and accumulators necessary to implement the core of the CMM. As shown in Figure 3a, the CMM takes a set of binary inputs within the input pattern. Each input selects rows from the CMM that are added into the accumulators (note that the input × weights operation is implicit in this process). The accumulated data is then thresholded using L-max [8] or fixed global thresholding. Finally, the data is returned to the host for further processing.

The outline of the PRESENCE architecture is shown in Figure 3b. The architecture consists of a bus interface, a buffer memory which allows interleaving of memory transfer and operation of the PRESENCE system, and a SATCON and SATSUM combination that accumulates and thresholds the weights. The data bus connects to a pair of memory spaces, each of which contains a control block, an input block and an output block. Thus the PRESENCE card is a memory-mapped device that uses interrupts to confirm the completion of each operation. To maintain an efficient use of input memory, the bits that are set to one in the input, p, are passed to the processor card as 'index values', one for each bit set. For efficiency, two memory input/output areas are provided so that one can be acted on from the external bus while the other is used by the card. The control memory input block feeds the control unit, which is an FPGA device programmed to carry out all necessary operations. The input data (index values) are fed to the weights on the card, and the area of memory that is read is then passed to a block of accumulators. In our current implementation the data width of each FPGA device is 32 bits, which allows us to add a 32-bit row from the weights memory in one cycle per device. Currently we have 16Mb of 20ns static memory implemented on the VME card, and 128Mb of dynamic (60ns) memory on the PCI card. The accumulators are implemented along with the thresholding logic on another FPGA device (SATSUM).
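The data flow just described (index values selecting weight rows, accumulation, then thresholding) can be mimicked in software. The sketch below is a behavioural illustration only, not the PRESENCE C++ library: the function names are hypothetical, rows are indexed by input bit as in Figure 3a, and the tie-breaking rule in the L-max function is an assumption.

```python
def accumulate(weights, index_values):
    """Sum the weight rows selected by the indices of the '1' bits in the input."""
    acc = [0] * len(weights[0])
    for idx in index_values:
        # The input x weights product is implicit: only rows of set bits are read.
        for col, w in enumerate(weights[idx]):
            acc[col] += w
    return acc


def fixed_threshold(acc, level):
    # Fixed global thresholding: outputs whose sum reaches the level are set.
    return [1 if a >= level else 0 for a in acc]


def l_max_threshold(acc, l):
    # L-max: set the l largest sums (ties broken by position, an assumption).
    top = sorted(range(len(acc)), key=lambda c: (-acc[c], c))[:l]
    out = [0] * len(acc)
    for c in top:
        out[c] = 1
    return out


# Toy weights memory: 4 input bits (rows) by 4 output bits (columns).
weights = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]
index_values = [0, 1]  # input bits 0 and 1 are set to one

acc = accumulate(weights, index_values)
fixed = fixed_threshold(acc, level=2)
lmax = l_max_threshold(acc, l=2)
```

Passing only the indices of the set bits means the work done is proportional to the number of '1' bits in the input rather than to its length, which suits the sparse codes used by the CMM.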
To enable the SATSUM processors to operate faster, a 5-stage pipeline architecture was used. The stages are: index value count; latch address into buffer memory; add index to memory offset; latch result of index calculation; access the weights memory with the address. The use of this pipeline reduces the data accumulation time from 175ns to 50ns. All PRESENCE operations are supported by a C++ library that is used in all AURA applications.
The design of the SATCON allows many SATSUM devices to be used in parallel in a SIMD configuration. The VME implementation uses 4 devices per board, giving a 128-bit-wide data path. In addition, the PCI version allows cards to be daisy-chained, so that a 4-card set gives a 512-bit-wide data path.
Memory interface
The complete VME card assembly is shown in Figure 4 . The SATCON and SATSUM devices are mounted on a daughter board for simple upgrading and alteration. The weights memory, buffer memory and VME interface are held on the mother board. 
Performance
By an analysis of the state machines used in the SATCON device the time complexity of the approach can be calculated. 
A comparison with a Silicon Graphics 133MHz R4600SC Indy is given in Table 1. This shows the speed-up of the matrix operation (Equation 2) for our VME implementation (128 bits wide). The timings are for a fixed threshold. The values for processing rate are given in millions of binary weight additions per second (MW/s). In this implementation the system cycle time needed to sum a row of weights into the counters (i.e. the time to accumulate one line) is 50ns for the VME version and 100ns for the PCI version. In the PCI form, we will use 4 closely coupled cards, which results in a speed-up of 432.
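As a rough cross-check of these figures, a peak processing rate can be estimated from the data-path width and the cycle time. The sketch assumes that one full-width row of weights is accumulated per cycle and ignores host transfers and thresholding, so the measured rates in Table 1 may differ; the function name is hypothetical.

```python
def peak_rate_mw_per_s(width_bits, cycle_ns):
    """Peak binary weight additions per second, in millions (MW/s)."""
    # width_bits additions every cycle_ns nanoseconds, converted to millions per second.
    return width_bits * 1000 / cycle_ns


vme = peak_rate_mw_per_s(width_bits=128, cycle_ns=50)   # one VME card
pci = peak_rate_mw_per_s(width_bits=512, cycle_ns=100)  # four chained PCI cards
ratio = pci / vme
```

Under these assumptions the four-card PCI set has twice the peak rate of a single VME card, since the fourfold wider data path outweighs the doubled cycle time.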
[Table 1: Platform and processing rate (MW/s).]

The build cost of the VME card was half the cost of the baseline SGI Indy machine given above, when using 4Mb of 20ns static RAM. In the PCI version the cost is greatly reduced through the use of dynamic RAM devices, allowing a 128Mb memory to be used for the same cost. This gives a system that is only 2x slower with 32x as much memory per card (note that the 4 cards used in Table 1 hold 512Mb of memory). The training and recognition speeds of the system are approximately equal. This is particularly useful in on-line applications, where the system must learn to solve the problem incrementally as it is presented. In particular, the use of the system for high speed reasoning allows the rules in the system to be altered without the long training times of other systems. Furthermore, our use of the system for a k-NN classifier also allows high speed operation compared with a conventional implementation of the classifier, while still allowing very fast training times.
To appreciate the utility of our implementation, consider its use as a pattern recognition system for a small mobile robot that must follow a pre-planned route. The aim is for the robot to follow the route using the images it captured on a previous guided tour of the route. To do this we use the N tuple pre-processing method [9]. This method takes an image frame and performs simple feature analysis on the image, which is passed to the CMM as a vector containing a fixed number of bits set to one; each bit represents a feature that has been found in the image. Consider a camera taking images at 25 frames per second, at a resolution of 512 × 512. If the image is sampled at 10% with an N tuple size of 4, then 6553 features ('tuples' in N tuple terminology) will be sampled and passed to the CMM in a vector of size 104856 bits. If a separator is used per image frame, and each separator is unique and has 2 bits set, then a separator of size 10240 will allow 20480 separators to be stored using a memory of size 128Mb (the PCI memory size). At 25 frames per second, this allows the robot to store images for 13 minutes. Recognition of the frames can also be performed at frame rate. Using the 4-card PCI implementation, almost an hour of frame rate video can be stored and recognised. For guidance, the robot stores the direction along with each video image of the scene ahead as it is hand-guided through the environment. In recognition, the recogniser finds the image that best matches the current view and recalls the guidance information. This shows the potential use of the technology in a novel application which is difficult to achieve in a cost-effective way by any other method.
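The arithmetic behind this example can be retraced in a few lines. The vector size, separator size and number of separators are taken directly from the text; the assumption that storage time scales linearly with the number of cards is ours.

```python
pixels = 512 * 512
tuples = int(pixels * 0.10 / 4)  # 10% sampling, 4 pixels per tuple

frames = 20480                   # separators (one per frame), figure from the text
fps = 25

memory_bits = 104856 * 10240     # input vector size x separator size
memory_mb = memory_bits / 8 / 2**20

minutes_one_card = frames / fps / 60
minutes_four_cards = 4 * minutes_one_card  # assumes linear scaling with cards
```

The results are 6553 tuples, a weight memory just under 128Mb, about 13.7 minutes of video on one card and about 54.6 minutes on four cards, in line with the figures quoted above.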
Results on Benchmarks
The performance of the robust quantisation method and the CMM classifier has been evaluated on four benchmarks consisting of large sets of real world problems from the Statlog project [10], including a satellite image database, a letter image recognition database, a shuttle data set and an image segmentation data set. To visualise the result of quantisation, Figure 5a shows the distribution of the numbers of data points of the 8th feature of the image segmentation data for equal-size bins. The distribution represents the inherent characteristics of the data. Figure 5b shows that our robust quantisation (RQ) has resulted in the uniform distribution desired.
We compared the CMM classifier with the simple k-NN method, multi-layer perceptron (MLP) and radial basis function (RBF) networks [11]. The performance measures of interest are the classification rate (c-rate) on test data sets and the relative speed (r-speed). In the evaluation we used the CMM software libraries developed in the AURA project at the University of York. It is appropriate to set 1-3 '1' bits in input vectors and separators. Experiments were conducted to study the influence of a CMM's size on the c-rate and r-speed measured against the k-NN method (as shown in Figure 6), where the r-speed of the CMM classifier includes the encoding, training and test time. The effects of the number of bins N_b on the performance were also studied (Figure 7). Choices of the CMM size and the number of bins N_b may be application dependent, for instance favouring speed or accuracy. In the experiment it was required that the r-speed be no less than 4 times, and the c-rate no more than 1% lower than, that of the k-NN method. Table 2 contains the speeds of the four methods relative to the recall speed of the CMM on the four benchmarks. It is interesting to note that the recall speeds of the MLP and RBF networks were 1 to 25 times faster than that of the CMM classifier, but their training speeds were several hundred times slower. The k-NN method needed no training and had recall speeds 0.043 to 0.176 times that of the CMM classifier. The overall speed (including training and recall time) of the CMM classifier is over 4 times that of the k-NN method. When using PRESENCE, i.e. the dedicated CMM hardware, the speed of the CMM was further increased by over 3 times.
The classification rates of the four methods are given in Table 3, which shows that the CMM classifier was less than 1% less accurate than the k-NN method.
The 'two-spirals' benchmark in Figure 8a is interesting as this highly non-linear problem is extremely hard for back-propagation networks and relatively easy for an RBF or Cascade-Correlation net [12] . We found that this task was extremely easy for the CMM. Figure 8c shows that a CMM correctly discriminated all data points, including training and unseen ones. 
Conclusions
[Table 3: Classification rates of four methods on four benchmarks.]

In this paper we have presented a classifier which uses a binary CMM for storing and matching a large number of patterns efficiently, and the k-NN rule for classification. The RU encoder converts numerical inputs into binary ones with the maximally achievable uniformity to meet the requirements of the CMM. Experimental results on the four benchmarks show that the CMM classifier, compared with the simple k-NN method, gave slightly lower classification accuracy (by less than 1%), with speed-ups of over 4 times in software and 12 times in hardware. Therefore our method has resulted in a generic and fast classifier. Compared with MLP and RBF networks, the CMM needs a very short training time. When new training data arrive incrementally, MLP and RBF nets need to be retrained, but with the CMM the new samples can simply be added to the memory.

This paper has also described a hardware implementation of an FPGA-based chip set and a processor card that will support the execution of binary correlation matrix memories. It has shown the viability of using a simple binary neural network to achieve high processing rates. The approach allows both recognition and training to be achieved at speeds well above two orders of magnitude faster than conventional workstations, at a much lower cost than the workstation. The system is scalable to very large problems with very large weight arrays. Our current research is aimed at showing that the system is scalable, evaluating methods for the acceleration of the pre- and post-processing tasks, and considering greater integration of the elements of the processor through VLSI. For more details of the AURA project and the hardware described in this paper see our web page (http://www.cs.york.ac.uk/arch/nn/aura.html).
