This paper describes the C-NNAP machine, a MIMD implementation of an array of ADAM binary neural networks, primarily designed for image processing. C-NNAP comprises an array of VME cards, each containing a DSP, a SCSI controller and a new design of the SAT peripheral processor. The SAT processor is a dedicated hardware implementation that performs binary neural network computations. The SAT processor yields a potential speed-up of between 108 and 182 times over the current DSP with its dedicated coprocessor. C-NNAP, in association with the SAT, provides a fast, parallel environment for performing binary neural network operations.
Introduction
Binary neural networks based on the N-tuple method [6] have been used in image processing for pig evisceration [1] and scene analysis [5], and they also have potential for use in knowledge manipulation [3]. The binary neural network described in this paper is the Advanced Distributed Associative Memory (ADAM) developed by Austin [5]. An obstacle to the use of ADAM in real-time image processing systems is the requirement to process large quantities of data. To implement a high performance parallel version of ADAM we have developed the Cellular Neural Network Associative Processor, C-NNAP, which is a MIMD machine with dedicated hardware assistance. The use of the C-NNAP machine for object recognition is explained in [2].
In the ADAM recall phase the majority of the computational effort is spent performing binary matrix multiplications that are implemented through binary summations. Although a coprocessor had previously been constructed to assist with the summing [10], that implementation has limited functionality. To improve the functionality the Sum And Threshold processor, SAT Version 1, was developed [8]. Analysis of this design highlighted a number of limitations that have been overcome in the design described in this paper, SAT Version 2. Because ADAM is a superset of the basic N-tuple method, the SAT processor can also be used as a dedicated N-tuple pattern recognition processor. Section 2 of this paper explains the recall phase of the ADAM algorithm, Section 3 gives an overview of the C-NNAP machine and Section 4 is a detailed explanation and analysis of the SAT processor.
The ADAM Algorithm
The ADAM algorithm is a neural associative memory that has been used in a wide range of image analysis tasks [4]. The advantage of neural associative memories over traditional content addressable memories is that they can operate on noisy data. The ADAM algorithm is a significant improvement over other neural associative methods, e.g. Willshaw [11], as it provides improved speed of operation, image storage ability and generalisation properties [5]. A complete description of the operation of ADAM can be found in [5]. The SAT processor is primarily involved with the recall phase of the ADAM neural network as shown in Figure 1. The operation is initiated by applying the input image to the tuple units. The tuple units perform an n-input to $2^n$-output logic decoder operation (n is 2 in the Figure) to activate a [...] (Figure 1). The class pattern is applied to the second stage matrix, which activates links (shown with a bold line) that are summed in an analogous way to stage one. These summed values are thresholded using Willshaw thresholding, i.e. those summed values which are equal to the number of class bits set (see Section 4) are set to 1 and all others are set to 0. The output from the thresholding stage is passed to logic encoders whose output is the recalled image.
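The decoder and second-stage steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not the C-NNAP implementation: the function names, the bit ordering inside a tuple and the row-per-class-bit layout of the second stage matrix are all assumptions made for the example.

```python
def tuple_decode(bits):
    """Decode n input bits into 2**n one-hot outputs (n-input logic decoder)."""
    index = 0
    for b in bits:  # assumption: the first bit is the most significant
        index = (index << 1) | (1 if b else 0)
    outputs = [0] * (2 ** len(bits))
    outputs[index] = 1
    return outputs


def willshaw_threshold(sums, class_bits_set):
    """Return 1 where a sum equals the number of class bits set, else 0."""
    return [1 if s == class_bits_set else 0 for s in sums]


def stage_two_recall(class_pattern, matrix):
    """Sum the matrix rows selected by set class bits, then threshold.

    Assumption: matrix[i] holds the binary weights linked to class bit i.
    """
    width = len(matrix[0])
    sums = [0] * width
    for bit, row in zip(class_pattern, matrix):
        if bit:
            for j in range(width):
                sums[j] += row[j]
    return willshaw_threshold(sums, sum(class_pattern))
```

With a tuple size of 2, as in Figure 1, `tuple_decode([1, 0])` activates exactly one of four outputs. In `stage_two_recall` an output is set only when every set class bit has a link to it, which is the Willshaw condition stated in the text.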
The Cellular Neural Network Associative Processor
C-NNAP is a VME based system with three basic building blocks: the S-Node, the C-Node(s) and the I-Node. The controlling workstation is the S-Node (Supervisor node). The S-Node controls the C-Node(s) (C-NNAP nodes) by providing them with address information and control programs. The I-Node (Information node) is the I/O memory and data acquisition system from which the S-Node reads in the input data.
The architecture of a C-Node, shown in Figure 2, has been designed to allow pipelined execution of the ADAM network. The daughter boards which hold the DSP and SAT processors have been designed so that they can be upgraded independently of the C-Nodes. The C-Node has three 25 ns static RAM memories:
Weights Memory: The weights memory is divided into two independently accessible areas. The first area is used by the DSP to store the weights after their calculation, whilst the second area contains the weights that the SAT processor will use during the recall operations. The switching between memories is controlled by the DSP.
Buffer Memory: The buffer memory is also divided into two independently accessible areas. The first area is used to store the non-tupled image prior to processing by the DSP and the tupled data prior to switching the memory into the SAT address area. This area can also be accessed by other nodes on the VME bus. The second area is used by the SAT as temporary storage and for storing results. The switching between memories is again controlled by the DSP.
DSP Memory: Only the DSP has access to the DSP memory, which it uses as a program store and temporary storage.
DSP Daughter Board: The DSP daughter board hosts an AT&T DSP32C clocked at 50 MHz. The DSP32C can be considered a slow processor, but the processing power required of the DSP is not extensive as the computationally intensive work is done by the SAT processor.
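The two-area arrangement of the weights and buffer memories can be pictured as a simple double buffer. The sketch below is a software analogy only: the class name `DualAreaMemory`, its methods and the symmetric swap are hypothetical, and the real switching is a hardware operation whose details are not given here.

```python
class DualAreaMemory:
    """Two independently accessible areas; the DSP decides which one the SAT sees."""

    def __init__(self, size):
        self.areas = [[0] * size, [0] * size]
        self.sat_area = 1  # index of the area currently mapped to the SAT

    @property
    def dsp_view(self):
        """Area the DSP is currently filling (e.g. new weights or tupled data)."""
        return self.areas[1 - self.sat_area]

    @property
    def sat_view(self):
        """Area the SAT is currently reading from or writing results to."""
        return self.areas[self.sat_area]

    def switch(self):
        """DSP-controlled switch (assumed here to be a symmetric swap)."""
        self.sat_area = 1 - self.sat_area
```

The point of the arrangement is that the DSP can prepare the next set of data in one area while the SAT works on the other, which is what allows the pipelined execution mentioned above.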
The Sum And Threshold Processor
This Section describes Version 2 of the SAT processor, which performs all of the operations within the dotted box of Figure 1. The SAT processor is implemented using an Actel A1280XL FPGA and an Actel A1425A FPGA. The SAT is a peripheral processor that operates in parallel with the DSP, thus releasing it to perform preprocessing and data movement operations. This Version of the SAT processor is significantly different from the Version 1 design [8] as it overcomes the stage two summing bottleneck and no longer uses a local memory. The SAT has access to the weights memory that contains the correlation matrices and to the buffer memory that holds the control information, tuple pointers, summed values and recalled patterns. The block diagram of the SAT processor is shown in Figure 3.
Thresholding begins by inspecting all the summed values to find the maximum value stored; this becomes the current threshold value. All of the summed values are then checked, and when a summed value equals the current threshold value the corresponding class bit is stored for use by the stage two summing controller. If insufficient class bits are found after the first thresholding iteration, the operation is repeated by finding the next highest summed value and using this as the new threshold value. This is repeated until all L class bits have been recovered, where L is the number of bits set in the class pattern.
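The iterative thresholding just described can be sketched as follows. This is a software illustration under stated assumptions, not a model of the SAT state machine: the function name `l_max_threshold` is hypothetical, and the handling of ties (all positions equal to the current threshold are kept, so more than L positions may be returned) is an assumption, since the text does not specify it.

```python
def l_max_threshold(sums, l_bits):
    """Recover class-bit positions by repeatedly taking the highest remaining sum."""
    recovered = []
    if not sums:
        return recovered
    threshold = max(sums)  # first pass: find the maximum summed value
    while len(recovered) < l_bits:
        # second pass: every sum equal to the threshold marks a class bit
        recovered.extend(i for i, s in enumerate(sums) if s == threshold)
        lower = [s for s in sums if s < threshold]
        if not lower:
            break  # no smaller values left to use as a new threshold
        threshold = max(lower)  # next highest summed value
    return recovered
```

For example, with summed values `[3, 7, 5, 7, 1]` and L = 3, the first iteration recovers positions 1 and 3 (value 7) and the second recovers position 2 (value 5).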
In Version 1 of the SAT all the class bits were first saved, then during stage two summing all the class bits were examined to determine the weights to sum. This was the bottleneck discussed in the introduction. In general, L is chosen to be $\log_2$ of the class size [11], which means that relatively few bits are set and most of the class bits examined will be zero. To take advantage of this, Version 2 of the SAT stores the relative addresses of the set class bits instead of the class bits themselves, as proposed in [7]. This is shown in Figure 6, where the values 0, 3, 9 etc. are stored in the buffer memory instead of the entire class pattern.
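The benefit of storing addresses rather than the whole class pattern can be illustrated with a short sketch. The names and the 12-bit example pattern are hypothetical, and "relative address" is taken here to mean the bit's offset within the class pattern, which is an assumption.

```python
def class_bit_addresses(class_pattern):
    """Return the offsets of the set bits, e.g. [0, 3, 9], not the full pattern."""
    return [i for i, bit in enumerate(class_pattern) if bit]


def stage_two_sum(addresses, matrix):
    """Sum only the matrix rows named by the stored addresses.

    Assumption: matrix[i] holds the binary weights linked to class bit i.
    """
    width = len(matrix[0])
    sums = [0] * width
    for address in addresses:  # L iterations instead of one per class bit
        for j in range(width):
            sums[j] += matrix[address][j]
    return sums


pattern = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]  # hypothetical 12-bit class
addresses = class_bit_addresses(pattern)          # [0, 3, 9]
```

Stage two then visits one row per stored address rather than testing every class bit, so the work scales with L rather than with the class size.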
SAT Processor Comparison
Using Equation 1 the new SAT was compared with Version 1 to produce the graph of Figure 7. To produce the graph a tuple size of four was used (a typical size used in many applications [9]) and the number of iterations was one. The graph shows the speed of operation of SAT Versions 1 and 2 for a range of input sizes and for class sizes of 50, 100 and 150 bits, with the number of class bits set to $\log_2(\text{class size})$.
In order to show the design benefits of the new SAT processor rather than just the improvements in the technology used (i.e. faster memory), the speeds of the old design have been scaled as if it were also using the new 25 ns memory.
Stage Two Summing and Thresholding
The basic hardware for stage two is the same as that used for the stage one summing. The only significant difference between the two operations is that the summed values are not written to the buffer memory unless the user specifically requests it.
SAT Performance Evaluation
The timing evaluation has been performed by analysing the state machine behaviour to derive Equation 1, which can be used to determine the execution time for any input data using the following variables: [...]
Execution Time (seconds) = 50 nanoseconds x [...]    (1)
From Figure 7 it is calculated that Version 2 of the SAT ranges from 2.25 times faster than Version 1 for a class size of 50 to 3.8 times faster for a class size of 150. Kennedy [7] showed that Version 1 of the SAT was 48 times faster than the standard DSP with its dedicated coprocessor. This suggests that Version 2 of the SAT processor is now between 108 and 182 times faster than the DSP with its dedicated coprocessor.
It is interesting to consider how large an image could be processed on a single C-Node with input images arriving at 25 frames per second (fps). It was shown in [7] that the maximum size input image that the DSP with the old dedicated coprocessor could process is 22 x 22 pixels at 25 fps using a tuple size of four and a class size of 32 bits; note that the DSP would also have to perform pre-processing operations, such as the tupling, that would further reduce the class size or the image size. The new version of the SAT processor can theoretically process an input image of 230 x 230 pixels, while freeing the DSP to perform the pre-processing and memory transfer operations. This is considered a significant improvement.
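The quoted speed-up range follows from multiplying the Version 1 speed-up over the DSP by the Version 2 gain over Version 1. As a worked check, using only the figures stated above (the symbols $S_{\min}$ and $S_{\max}$ are introduced here for illustration, and the last line is simply the ratio of the two pixel counts, not a figure claimed by the analysis):

```latex
\begin{align*}
S_{\min} &= 48 \times 2.25 = 108, \\
S_{\max} &= 48 \times 3.8 = 182.4 \approx 182, \\
\frac{230 \times 230}{22 \times 22} &= \frac{52900}{484} \approx 109.3 .
\end{align*}
```

The pixel-count ratio is of the same order as the lower speed-up bound, although the two figures need not have been derived under identical class sizes.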
Benchmarking the SAT Processor
To appreciate the benefits of the SAT processor it has been compared with a typical fast workstation, a Silicon Graphics R4600SC Indy workstation that has a 512K secondary cache and a processor clock speed of 133 MHz. The workstation takes 1790 ns to sum 32 bits of matrix data whilst the SAT takes 350 ns, making the SAT faster than the workstation by a factor of about 5. This is a large degree of speed-up considering that the SG workstation costs around £7000 per unit whilst the SAT daughter board costs in the region of £300. The SAT would benefit from implementation in VLSI, as this would allow the data width to be increased to 32 or 64 bits, providing an immediate speed-up over the workstation of 10 or 20 times respectively. VLSI would also allow the SAT clock speed to be greatly increased.
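The benchmark figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the projection assumes that speed-up scales linearly with data width, and the current data width of 16 bits is an assumption inferred from the quoted 5, 10 and 20 times progression rather than a figure stated in the text.

```python
def speedup(reference_ns, sat_ns):
    """Ratio of the reference time to the SAT time for the same work."""
    return reference_ns / sat_ns


def projected_speedup(current, current_width_bits, new_width_bits):
    """Scale a speed-up linearly with data width (a simplifying assumption)."""
    return current * new_width_bits / current_width_bits


measured = speedup(1790, 350)                 # workstation vs SAT, 32 bits summed
wide_32 = projected_speedup(5, 16, 32)        # assumes a 16-bit width today
wide_64 = projected_speedup(5, 16, 64)
```

Here `measured` is a little over 5, and the two projections reproduce the 10 and 20 times figures quoted for a VLSI implementation.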
Conclusion
This paper has described Version 2 of the C-NNAP parallel processor and Version 2 of the SAT processor. It has been seen that the SAT operates between two and four times faster than Version 1 of the SAT processor. The analysis has shown that, for image processing at 25 frames per second, the DSP with coprocessor can process a 22 x 22 pixel image whilst SAT Version 2 can process a 230 x 230 pixel image. These improvements will allow the users of the C-NNAP machine to develop more sophisticated real-time image processing applications than was previously possible.
