This paper presents the work that has resulted in the SAT processor, a dedicated hardware implementation of a binary neural image processor. The SAT processor is aimed specifically at supporting the ADAM algorithm and is currently being integrated into a new version of the C-NNAP parallel image processor. The SAT processor performs binary matrix multiplications, a task that is computationally expensive for a CPU with a standard instruction set. It can perform the matrix multiplication and thresholding between 100 and 200 times faster than the DSP32C with its in-house dedicated coprocessor. This speed-up will allow the SAT to process images of up to 220 x 220 pixels at a 25 Hz frame rate.
1 Introduction
A significant obstacle to the use of binary associative memories in real-time image processing systems is the large quantity of data that requires processing. To overcome the processing bottleneck the Advanced Computer Architecture Group of the Department of Computer Science has built a Cellular Neural Network Associative Processor (C-NNAP) machine which is aimed primarily at fast processing of ADAM (Advanced Distributed Associative Memory) data. ADAM is an associative memory algorithm developed by Austin and Stonham [3] for use in scene analysis. The C-NNAP machine is an array of processor cards working in parallel with data being exchanged between cards. This provides a distributed solution to ADAM and other image processing problems. The use of the C-NNAP machine is explained in Austin's paper in these proceedings [1].
Analysis of the ADAM algorithm has shown that the majority of the computational effort is spent performing binary matrix multiplications. Because the matrix contains only binary elements the matrix multiplication is implemented through large binary summations. Although the group have constructed a coprocessor to assist with the summing, the implementation has limited functionality. The most significant limitation is that as a coprocessor it is fully controlled by the DSP. This means that all data to be summed must be written to the coprocessor and all results read from it, thus using valuable DSP resources. To overcome these problems a peripheral processor has been developed, known as version 1 of the SAT (Sum And Threshold) processor [5]. Analysis of the version 1 design has highlighted design changes that have been incorporated into the design described in this paper, SAT version 2.
The following sections present version 2 of the Sum And Threshold (SAT) processor, which is a significant improvement on version 1. Section 2 describes the operations performed by the SAT processor, Section 3 describes the SAT processor design and Section 4 evaluates the SAT design and its performance.
2 The ADAM Algorithm
The ADAM algorithm has been used for recognising features in images, for example to extract features such as roads and urban areas from maps [2]. The advantage of associative memories over traditional methods of storing images in a content addressable memory (CAM) is that they have fast access times and can operate on noisy data. The ADAM algorithm is a significant improvement on other associative methods, e.g. Willshaw [6], as it provides improved speed of operation, image storage ability and generalisation properties [3].
The explanation that now follows describes the recall operation used in the ADAM memory. A full description of the operation of ADAM can be found in [3]. The recall operation is shown in Figure 1, and the SAT processor performs all of the operations within the dotted box.
The recall operation is as follows:
1. The input image is applied to the tuple units.
[...] Figure 1).
5. The class pattern is applied to the second stage matrix, which activates links (shown with a bold line) that are summed in the same way as item 3.
6. These summed values are thresholded using standard Willshaw thresholding, i.e. those summed values which are equal to the number of class bits are set to 1 and all others are set to 0.
7. The outputs from the thresholding are passed to logic encoders to provide the recalled image.
3 The SAT Processor
Hardware Introduction
Version 2 of the SAT processor is significantly different from the version 1 design, as it overcomes the stage two summing bottleneck and no longer uses a local memory [5].
The SAT processor has access to two memories:
1. A weights memory that contains the correlation matrices.
2. An IO memory that holds the control information, tuple pointers (see section 3.2), summed values and recalled patterns.
As a peripheral processor the SAT operates in parallel with the DSP thus releasing it to perform preprocessing and data movement operations. The SAT has 16 sixteen-bit registers (counters) for summing. The counters operate in parallel, thus allowing sixteen bits of the matrix to be summed in one clock cycle.
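The parallel summing described above can be modelled in software. The following is a minimal sketch, not the hardware design: the function name `sum_words` is hypothetical, and the mapping of bit 0 of each word to counter 0 is an assumption made only for illustration.

```python
from typing import Iterable, List

COUNTERS = 16  # the text states 16 sixteen-bit counters


def sum_words(words: Iterable[int]) -> List[int]:
    """Accumulate 16-bit weight words into 16 column counters.

    Each word stands for one sixteen-bit row segment read from the
    weights memory; in hardware all 16 bits are added in one clock
    cycle, here the inner loop simply visits them in turn.
    Assumption: bit 0 of a word feeds counter 0.
    """
    counters = [0] * COUNTERS
    for word in words:  # one word per modelled clock cycle
        for bit in range(COUNTERS):
            # Counters are sixteen bits wide, so wrap at 0xFFFF.
            counters[bit] = (counters[bit] + ((word >> bit) & 1)) & 0xFFFF
    return counters


example = sum_words([0b0101, 0b0001, 0b1000])
```

In this example counter 0 ends at 2, counters 2 and 3 end at 1, and every other counter stays at 0.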
The SAT processor has an interrupt handler to acknowledge the interrupt from the controlling DSP used to start the SAT processor. On completion of this task control is passed to a loader that reads control data from the IO memory. The control data consists of locations of matrices, where to place results, the number of rows, threshold levels and operations required. Once loading is completed control is passed to the main state machines that perform the summing and thresholding of the ADAM algorithm.
Stage One Summing
The matrix is stored as sixteen-bit rows, as shown in Figure 2, in order to access the data in the format required by the 16 summing counters. Figure 1 shows that the SAT needs to know which lines of the stage one matrix require summing. This information is called the tuple pointers and is stored in the IO memory. The tuple pointers are generated by the DSP from the input image. To calculate the exact address of the weights an offset is provided in the control data. When the tuple pointer values are added to the offset the correct row and line of weights is accessed. When the SAT sums the next row of the matrix it adds the length of the row in the matrix to the original offset, and this results in the new offset for the next row.
An example of this is shown in Figure 2, where the lines 1, 3, 6, 8 etc. are the lines to be summed relative to the offset, e.g. the first set of weights to be summed is at location 0x2001 (the offset + 1), the next weights are at 0x2003, etc. At the end of the row the length of the row, in this case N, is added to the offset to give the new start location. Figure 3 shows the hardware used to perform weights addressing and its relationship with the two memories. The weights address calculation has the longest propagation delay of the SAT operations and has an estimated delay of 120ns. In the evaluation section (Section 4.1) this is referred to as a LONG cycle.
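The addressing scheme of this section can be illustrated with a short software model. This is a sketch under stated assumptions: the function name `weight_addresses` is hypothetical, and the row length of 0x100 used in the example is an arbitrary stand-in for N, which the text does not give a value for. The offset 0x2000 and the pointers 1, 3, 6, 8 follow the Figure 2 example.

```python
from typing import List


def weight_addresses(offset: int, tuple_pointers: List[int],
                     row_length: int, rows: int) -> List[List[int]]:
    """Return the weights-memory addresses visited for each matrix row.

    For every row, each tuple pointer is added to the current offset;
    at the end of the row the row length is added to the offset to
    give the start location of the next row.
    """
    visited = []
    for _ in range(rows):
        visited.append([offset + pointer for pointer in tuple_pointers])
        offset += row_length  # new offset for the next row
    return visited


# Offset and pointers from the Figure 2 example; row_length is hypothetical.
rows = weight_addresses(0x2000, [1, 3, 6, 8], row_length=0x100, rows=2)
first_row = [hex(address) for address in rows[0]]
```

Here `first_row` holds 0x2001, 0x2003, 0x2006 and 0x2008, matching the locations quoted in the text, and the second row starts from the offset plus the assumed row length.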
Stage One Thresholding
When all the summed values have been written to the IO memory, l-max thresholding (section 2) is applied to them. The hardware used for this is shown in Figure 5. Thresholding begins by inspecting all the summed values to find the maximum value stored, and this becomes the current threshold value. All of the summed values are then checked, and where a summed value equals the current threshold value this indicates that the corresponding class bit should be set.
The recalled class bits need to be stored for use by the stage two summing controller. In version 1 of the SAT all the class bits were first saved, then during stage two summing all the class bits were examined to determine the weights to sum. This was the bottleneck discussed in the introduction. In general l is chosen to be $\log_2 N$, where N is the class size [6], which means that relatively few bits are set and most of the class bits examined will be zero. To overcome this problem version 2 of the SAT stores the relative addresses of the set class bits instead of the class bits themselves, as proposed by Kennedy [4] in the analysis of the version 1 SAT processor. This is shown in Figure 6, where the values 0, 3, 9 etc. are stored in the IO memory instead of the entire class pattern. If insufficient class bits are found after the first thresholding iteration then the operation is repeated by finding the next highest summed value and using this as the new threshold value. This is repeated until all l class bits have been recovered.
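The l-max thresholding loop and the storage of relative addresses can be sketched in software as follows. This is an illustrative model rather than the state machine itself: the function name `l_max_threshold` and the summed values in the example are hypothetical, and sorting the distinct values is a software shortcut for the repeated search for the next highest summed value. The text does not say how ties are resolved, so this sketch keeps every address that matches a threshold and may return more than l addresses.

```python
from typing import List


def l_max_threshold(sums: List[int], l: int) -> List[int]:
    """Return the relative addresses of the recalled class bits.

    The highest summed value is used as the first threshold; every
    position equal to it is recorded. If fewer than l addresses have
    been found, the next highest value becomes the new threshold.
    """
    addresses: List[int] = []
    # Distinct summed values, highest first (one per iteration).
    for threshold in sorted(set(sums), reverse=True):
        if len(addresses) >= l:
            break
        for index, value in enumerate(sums):
            if value == threshold:
                addresses.append(index)  # store the address, not the bit
    return addresses


# Hypothetical summed values chosen so that bits 0, 3 and 9 are recalled.
recalled = l_max_threshold([7, 2, 1, 7, 0, 3, 2, 1, 0, 6], l=3)
```

With these values the first iteration (threshold 7) finds addresses 0 and 3, and a second iteration (threshold 6) finds address 9, so only three small integers are stored rather than the ten-bit class pattern.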
Stage Two Summing and Thresholding
The basic hardware for stage two is the same as that used for the stage one summing. The only major difference between the two operations is that the summed values are not written to the IO memory unless the user specifically requests it.
The thresholding required is an equality comparison, i.e. does the summed value equal the number of class bits set?
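A minimal software sketch of stage two is given below. It assumes the second stage matrix is held as a list of binary rows, one row per class bit, and that the class pattern arrives as the list of relative addresses stored by stage one; the function name `stage_two_recall` and the example matrix are hypothetical.

```python
from typing import List


def stage_two_recall(matrix: List[List[int]],
                     class_addresses: List[int]) -> List[int]:
    """Sum the matrix rows selected by the class bits, then threshold.

    Only the rows whose class bit is set are visited. An output bit is
    set to 1 when its summed value equals the number of class bits set
    (the equality comparison described above), otherwise it is 0.
    """
    width = len(matrix[0])
    sums = [0] * width
    for row in class_addresses:
        for column in range(width):
            sums[column] += matrix[row][column]
    bits_set = len(class_addresses)
    return [1 if total == bits_set else 0 for total in sums]


# Hypothetical 3 x 3 second stage matrix; class bits 0 and 2 are set.
recalled = stage_two_recall([[1, 0, 1], [1, 1, 0], [1, 0, 1]], [0, 2])
```

In this example the column sums are 2, 0 and 2, so the first and third output bits are set.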
Implementation
The SAT processor uses an Actel A1280A FPGA and an Actel A1425A FPGA. The PCB for the SAT is a daughter board that plugs into the C-NNAP processor card.
4 Performance Evaluation
Timing Expressions
The timing evaluation has been done using the state machine behaviour to produce equations 1, 2 and 3. These equations can be used to determine the execution time for any data size. As stated in section 3.2 the slowest operation in the SAT processor is the weights calculation (see Figure 3), which is called a LONG (L) clock cycle. All the other clock cycles are called NORMAL (N) cycles. L and N are needed for the following equations:
Stage One Summing Equation: Equation (1) was used to calculate the time spent performing the stage one summing (S1). S1IS is the stage 1 input image size, S1TS is the stage 1 tuple size and CS is the class size.

$$\mathit{S1} = \frac{\mathit{CS}}{16} \times \frac{\mathit{S1IS}}{\mathit{S1TS}} \times (N + L) \qquad (1)$$

Stage One Thresholding Equation: The equation to calculate the time spent performing the stage one thresholding (S1T) is equation (2), where CS is the class size and SVEQTV is the number of times that a summed value is equal to the stored threshold value in an iteration. I is the number of iterations required to find all l bits of the class pattern.

$$\mathit{S1T} = I \, (4 \times \mathit{CS} \times N + 2 \times \mathit{SVEQTV} \times N) \qquad (2)$$

Stage [...]

SAT Performance Equation: Equations (1), (2) and (3) give the total execution time for all the SAT operations; this is equation (4).

$$\mathit{SATexecutiontime} = \mathit{S1} + \mathit{S1T} + \mathit{S2} \qquad (4)$$
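The timing expressions can be evaluated with a few lines of code. This sketch assumes that equation (1) reads S1 = (CS / 16) x (S1IS / S1TS) x (N + L) and that equation (2) reads S1T = I x (4 x CS x N + 2 x SVEQTV x N). The stage two time S2 is passed in by the caller because equation (3) is not reproduced in this text. The function names, the value of SVEQTV and the value of S2 in the example are hypothetical; the clock periods of 50ns and 150ns are the simulation estimates quoted in the SAT Analysis section.

```python
def stage_one_summing(cs: int, s1is: int, s1ts: int,
                      n: float, l: float) -> float:
    """Equation (1), as read above: stage one summing time."""
    return (cs / 16) * (s1is / s1ts) * (n + l)


def stage_one_thresholding(iterations: int, cs: int,
                           sveqtv: int, n: float) -> float:
    """Equation (2), as read above: stage one thresholding time."""
    return iterations * (4 * cs * n + 2 * sveqtv * n)


def sat_execution_time(s1: float, s1t: float, s2: float) -> float:
    """Equation (4): total time; s2 must come from equation (3)."""
    return s1 + s1t + s2


NORMAL_NS = 50   # estimated NORMAL clock period, in nanoseconds
LONG_NS = 150    # estimated LONG clock period, in nanoseconds

# 32-bit class, 4096-bit input image, tuple size 4.
s1 = stage_one_summing(32, 4096, 4, NORMAL_NS, LONG_NS)
# One iteration; SVEQTV = 5 is a hypothetical value.
s1t = stage_one_thresholding(1, 32, 5, NORMAL_NS)
# s2 = 1000 ns is a placeholder, not a value from the paper.
total = sat_execution_time(s1, s1t, s2=1000)
```

Under these assumptions the stage one summing takes 409600 ns and the stage one thresholding 6900 ns, which is consistent with the later observation that thresholding is a small share of the execution time.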
SAT Analysis
Using equation (4) the new SAT was compared with version 1 to produce the graph of Figure 7. To produce the graph a tuple size of four was used (a typical size used in most applications). The number of iterations for the stage one thresholding was limited to one. It was estimated from simulations that the LONG clock would be 150ns and a NORMAL clock would be 50ns. The graph shows the speed of operation of SAT versions 1 and 2 for a range of input sizes and for class sizes of 50, 100 and 150 bits. The number of class bits set was $\log_2 N$. In order to show the design benefits of the new SAT processor rather than just the improvements in the technology used (i.e. faster memory), the old design speeds have been shown as if it was also using the new 25ns memory. This means that the graph represents a fair comparison of the two designs independent of the technology used. From the results used to generate Figure 7 it was calculated that version 2 of the SAT is between 2.25 times faster than version 1 (for a class size of 50) and 3.8 times faster than version 1 (for a class size of 150). Kennedy [4] showed that version 1 of the SAT was 48 times faster than the standard DSP with its dedicated coprocessor. This suggests that version 2 of the SAT processor is now between 108 and 182 times faster than the DSP with its dedicated coprocessor.
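The overall speed-up range follows from multiplying the two quoted ratios, on the assumption that the speed-ups compose directly. The short sketch below reproduces that arithmetic; the names are hypothetical and the figures are the ones quoted in the text.

```python
V1_OVER_DSP = 48  # version 1 versus DSP with coprocessor, from [4]

# Version 2 versus version 1, by class size, as read from Figure 7.
V2_OVER_V1 = {50: 2.25, 150: 3.8}


def v2_over_dsp(v2_over_v1: float) -> float:
    """Compose the two speed-ups by multiplication (an assumption)."""
    return V1_OVER_DSP * v2_over_v1


speedups = {size: v2_over_dsp(ratio) for size, ratio in V2_OVER_V1.items()}
```

The products are 108 for a class size of 50 and approximately 182.4 for a class size of 150, which the text quotes as 182.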
Design Evaluation
Version 2 of the SAT processor has been analysed to determine where any remaining bottlenecks may be in the design. The amount of time spent in each summing state machine is shown in Figure 8, which indicates the time spent in the summing stages as a percentage of the total execution time. The time spent performing the stage one thresholding is minimal compared with the other execution times and has therefore not been included in the figure. SC and LC refer to small class (32 bit) and large class (1024 bit) patterns; SI and LI refer to small input image (4096 bits) and large input image (262144 bits, which is equivalent to a 512 by 512 image). The stage one/large class bottleneck could be overcome by providing all the tuple pointers to all the weights required instead of those for just one row. This was the method used in version 1 of the SAT and would eliminate the addition of the pointer to the offset, thus reducing the LONG propagation delay. However, the inclusion of the addition was an intentional design decision so that the DSP performs fewer calculations. This means that the DSP can be applied to the preprocessing of data and to data movements, which (at this stage) are considered to be more important. There seems little point in producing an extremely efficient peripheral processor if it has to wait for data before it can perform its calculations. Also, the addition delay is only in the region of 20ns and its removal would not result in a large increase in data throughput. Whether this is a good design decision can only be confirmed by testing the new C-NNAP machine, which will be the basis of future work.
Application Performance
Figure 7 was analysed to produce data throughput figures for a 25 Hz frame rate. In [4] it was shown that the maximum size of input image that the DSP with the dedicated coprocessor could process is 22 x 22, using a tuple size of four and a class size of 32 bits. The DSP would also have to perform pre-processing operations such as the tupling, which would further reduce the class size or the image size. The new version of the SAT processor can theoretically process an input image of 220 x 220 while freeing the DSP to perform the pre-processing and memory transfer operations, which is considered a significant improvement.
5 Testing
Fully testing the SAT processor required the use of two VHDL programs that emulated the characteristics of the weights and shared memory. All of the tests produced the expected results when simulating the SAT using actual delays.
6 Conclusion
This paper has described version 2 of the SAT processor. The SAT operates between two and four times faster than version 1. The speed increase is due to design improvements and not merely the application of new technology. The analysis showed the following improvement: the DSP with coprocessor can process a 22 x 22 image, SAT version 1 can process a 144 x 144 image and SAT version 2 can process a 220 x 220 image at 25 frames per second. This improvement will allow the users of the C-NNAP machine to work on significantly larger images than was previously possible.
