A word-parallel digital associative engine with accurate and wide-range Manhattan-distance computation is presented. It performs a continuous search operation to detect not only the nearest-match data but also all data in the sorted order of the exact Manhattan distance. The word-parallel digital implementation using a hierarchical search path provides a high-speed search operation with faultless precision, a low-voltage operation mode, and a potential capability of unlimited data capacity. Word-parallel distance calculation circuits autonomously count the Manhattan distance using a weighted search clock to detect the nearest-match data. An associative engine with 64 words of 8 bits × 32 elements has been fabricated using a 0.18 µm CMOS process and successfully tested. The worst-case search time for sorting all data is 5.85 µs at a supply voltage of 1.8 V.
Introduction
Associative processors based on content addressable memories (CAMs) have been proposed for various applications such as pattern recognition, data compression and intelligent processing to reduce the considerable memory access and processing time [1]-[6]. Some fully-parallel processors [1]-[4] employ Hamming distance for associative processing, since Hamming distance estimation requires less computational effort than Manhattan distance. On the other hand, associative processing based on Manhattan distance serves many practical applications such as vector-quantization recognition [5] and code-book-based image compression [6], as shown in Fig. 1. Although associative processors based on Hamming distance are capable of Manhattan distance estimation using thermometer encoding as reported in [2], they require a bit length of 2^i for i-bit data elements. Therefore, associative processing with a compact bit length requires natural binary coding for Manhattan distance, as in [6]-[9].
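The bit-length cost of thermometer encoding can be illustrated with a short sketch. It is not taken from [2]; the function names `thermometer` and `hamming` are hypothetical, and the code length of 2^i follows the figure quoted above (2^i - 1 bits would already suffice). The sketch only shows that the Hamming distance between two thermometer codes equals the absolute difference of the encoded values.

```python
def thermometer(value: int, bits: int) -> list[int]:
    """Thermometer code of an unsigned `bits`-bit value: `value` ones, then zeros.

    Assumption: the code length is 2**bits, matching the bit length quoted in the text.
    """
    length = 2 ** bits
    if not 0 <= value < length:
        raise ValueError("value does not fit in the given bit width")
    return [1] * value + [0] * (length - value)


def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions in which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


# A 3-bit element needs an 8-bit thermometer code; an 8-bit element would need 256 bits.
code_a = thermometer(5, bits=3)
code_b = thermometer(2, bits=3)
print(len(code_a), hamming(code_a, code_b))
```

For the 8-bit elements used in this work, such an encoding would expand each element from 8 to 256 bits, which is the motivation for natural binary coding.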
In this paper, we present a word-parallel associative engine with accurate and wide-range Manhattan-distance computation. The word-parallel digital implementation using a hierarchical search path enables a high-speed search operation with faultless precision, a low-voltage operation mode, and a potential capability of unlimited data capacity. These features are important for system-on-a-chip applications in future process technologies and are difficult to attain using the conventional mixed-signal approaches [7]-[9]. Furthermore, the engine performs a continuous search operation to detect not only the nearest-match data but also all data in the sorted order of the exact Manhattan distance, which requires considerable search operations in the conventional architectures [6]-[9]. Word-parallel distance calculation circuits autonomously count the Manhattan distance using a weighted search clock to detect the nearest-match data. The unique associative processing with accurate and wide-range Manhattan-distance computation efficiently realizes various new applications such as human-like learning and high-speed data sorting in addition to the conventional use. An associative engine with 64 words of 8 bits × 32 elements has been fabricated in a 0.18 µm CMOS process and successfully tested.
Word-Parallel Manhattan Distance Computation

A. Element Circuit Structure and Computation Flow
Associative processing based on Manhattan distance generally handles i-bit × j-element data, as shown in Fig. 1. Manhattan distance computation requires the sum of absolute differences (SAD) between an input and all stored data. Fig. 2(a) shows an 8-bit element structure. The stored data are divided into blocks and hierarchically connected by a bypass line to shorten the search signal propagation path, as shown in Fig. 2(b). The 8-bit element consists of 8 SRAM cells, a bit selector, a subtractor based on a half adder (HA) with an absolute function (ABS), a flag register (FR) with a bit comparison function, and a chained search circuit, as shown in Fig. 3.
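As a software reference for what the engine computes, the SAD between an input and every stored word can be sketched as follows. This is a plain functional model, not the circuit; the names `manhattan_distance` and `nearest_match` are hypothetical, and breaking ties by the lowest word index is an assumption made for the sketch.

```python
def manhattan_distance(a: list[int], b: list[int]) -> int:
    """Sum of absolute differences (SAD) between two equal-length element vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of elements")
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_match(query: list[int], stored: list[list[int]]) -> tuple[int, list[int]]:
    """Return (index of the nearest stored word, distances to all stored words).

    Assumption: ties are resolved in favor of the lowest word index.
    """
    distances = [manhattan_distance(query, word) for word in stored]
    best = min(range(len(stored)), key=distances.__getitem__)
    return best, distances


# Three stored words of four 8-bit elements each (illustrative data only).
stored = [[10, 200, 33, 7], [12, 190, 33, 0], [255, 0, 128, 64]]
query = [11, 195, 30, 5]
print(nearest_match(query, stored))
```

The fabricated engine uses words of 32 elements rather than four; the shorter vectors here only keep the example readable.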
The present algorithm and circuit implementation for Manhattan distance computation are shown in Fig. 4 through Fig. 7. First, absolute flags are generated in element parallel. Then, a distance counting operation is executed by chained search signal propagation in word parallel. It is driven by weighted search clocks, which are autonomously provided by the word-parallel distance calculation circuits. Finally, the nearest-match data is detected among the Candidates, which are the words activated by the word-parallel calculation circuits at the same time. All the data can be detected by a continuous search operation in the sorted order of Manhattan distance.
B. Absolute Flag Generation
The comparison result Fjk is stored in a flag register and used for the absolute function by switching the carry result Cij of the HA between Aij and Bij. The absolute difference is calculated in element parallel during the word-parallel summation.
C. Distance Counting Operation
The distance counting operation is executed from the LSBs to the MSBs of the elements in word parallel, as shown in Fig. 4(b). A sum result S0j of A0j and B0j is set to Mjk as the control signal of a chained search circuit. A search signal detects the first-encountered mismatch bit with Mjk = 1 in each block. The search clock period is limited by the search signal propagation path through the chained search circuits. Therefore, a hierarchical search path based on [3] is implemented, as shown in Fig. 2(b). A bypass search signal Pkb is also used as a mask permission signal to the next block, which makes only one mismatch bit maskable in each word for the next clock period. The interrupted search signal starts again from the masked bit, and finally a search signal is detected as Soutk when all the mismatch bits have been masked. Therefore, the number of operation clocks represents the number of mismatch bits. After that, a distance counting operation is executed again for the carry result C0j in a similar manner to the counting operation for the sum result S0j. These counting operations are repeated from A0j to A7j.
D. Weighted Search Clock Technique
Fig. 5 shows a word-parallel distance calculation circuit using autonomous weighted search clocks. The word-parallel circuit receives the search output signal Soutk and counts the Manhattan distance based on the weight of a search clock φschk. A search clock has different weights according to the bit number i that is currently evaluated in the elements. For example, it has a weight of 2^i and 2^(i+1) in Manhattan distance during a counting operation for the i-th sum and carry outputs, respectively. A word-parallel circuit autonomously provides φschk to count all the mismatch bits faster. Therefore, it has a local weight Wlk as the current weight of φschk, and accumulates a global weight Wg on a residual weight Wrk, as shown in Fig. 4. [...] by the precedence are stored as a residual weight Wrk.
In the present counting technique, the number of processing elements per word is determined by just the bit length N per element, as shown in Fig. 5. A word-parallel circuit also controls the bit select signals Selk according to Wlk, and finally provides Actk to a priority address encoder as a Candidate.
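The idea of counting mismatch bits one clock at a time and weighting each clock by the bit position can be modeled behaviorally. The following is a simplified sketch under stated assumptions, not the circuit: it works directly on the bits of each absolute difference instead of separate sum and carry passes, it counts sequentially rather than with autonomous per-word clocks and a hierarchical path, and the names `count_clocks` and `weighted_count_distance` are hypothetical.

```python
def count_clocks(mismatch_bits: list[int]) -> int:
    """Mask the first-encountered mismatch bit once per clock; return the clocks used.

    The number of clocks equals the number of mismatch bits in the word.
    """
    bits = list(mismatch_bits)
    clocks = 0
    while any(bits):
        first = bits.index(1)  # chained search: first bit still set to 1
        bits[first] = 0        # mask it for the next clock period
        clocks += 1
    return clocks


def weighted_count_distance(query: list[int], word: list[int], bits_per_element: int = 8) -> int:
    """Accumulate the distance bit plane by bit plane, from LSB to MSB.

    Assumption: each clock spent on bit plane i carries a weight of 2**i.
    """
    diffs = [abs(q - w) for q, w in zip(query, word)]
    distance = 0
    for i in range(bits_per_element):
        plane = [(d >> i) & 1 for d in diffs]
        distance += (1 << i) * count_clocks(plane)
    return distance


query = [10, 200, 33]
word = [12, 190, 33]
print(weighted_count_distance(query, word))  # same value as the direct SAD
```

The weighted count agrees with the direct sum of absolute differences because every set bit at position i of an element difference contributes exactly 2^i to the total.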
E. Nearest-Match Detection in Candidates
The distance counting operation is interrupted at the detection timing of Actk, and the process then moves to nearest-match detection among the Candidates, as shown in Fig. 6. Candidates are all the words activated by Actk at the same time. They have different residual weights according to their Manhattan distance from the input, since the distance is given by ΣWg - Wrk, where ΣWg is the total distance weight counted before the detection timing of Actk. Note that the Candidates are closer to the input than all the other undetected words in the present search algorithm; hence they include the nearest-match data. This feature contributes to the detection of the nearest-match data and also enables a continuous search operation for data sorting in order of the exact Manhattan distance. The nearest-match detection in Candidates is carried out by a nearest-match detector and a priority address encoder, which evaluate each residual weight Wrk from MSB to LSB, as shown in Fig. 6. The process keeps the remaining words consistent with one another: all residual weights other than that of the nearest data in the Candidates are kept, and the detected nearest data is then masked to continue the search operation for the next nearest data. The circuit configuration is shown in Fig. 7.
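Since the distance is the total weight minus the residual weight, the nearest Candidate is the one with the largest residual. A minimal sketch of an MSB-to-LSB evaluation follows. It assumes a bit-serial maximum search and a lowest-address-first priority rule, neither of which is confirmed in detail by the text, and it covers a single group of Candidates only. The name `sorted_candidates` and the sample values are hypothetical.

```python
def sorted_candidates(residuals: list[int], width: int) -> list[int]:
    """Return candidate indices ordered from nearest to farthest.

    Assumptions: distance = total_weight - residual, so the largest residual is
    nearest; residuals are compared bit-serially from MSB to LSB; ties go to the
    lowest index, as a priority address encoder might resolve them.
    """
    remaining = dict(enumerate(residuals))
    order = []
    while remaining:
        alive = sorted(remaining)
        for bit in range(width - 1, -1, -1):
            ones = [k for k in alive if (remaining[k] >> bit) & 1]
            if ones:  # words with a 0 at this bit drop out of the comparison
                alive = ones
        nearest = alive[0]
        order.append(nearest)
        del remaining[nearest]  # mask the detected word and continue the search
    return order


total_weight = 16
residuals = [3, 7, 0, 7]
order = sorted_candidates(residuals, width=3)
print(order, [total_weight - residuals[k] for k in order])
```

Masking each detected word and repeating the evaluation yields the Candidates in ascending order of distance, which mirrors the continuous search described above.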
Chip Implementation
We have designed and fabricated an associative engine using the present search architecture in a 1P5M 0.18 µm CMOS process.¹ Fig. 8 illustrates a block diagram of the search engine. It consists of a search memory array with 64 words of 8 bits × 32 elements, a memory read/write circuit with data shift registers, a word decoder, word-parallel distance calculation circuits, a priority address encoder for nearest-match detection in Candidates, and a CAM controller. These components are implemented in a die size of 2.8 × 2.8 mm². Fig. 9 shows a chip microphotograph and an 8-bit element cell layout. A 32-element word is divided into four blocks to reduce the critical path.
¹ The chip in this study has been fabricated through the VLSI Design and Education Center (VDEC), University of Tokyo, in collaboration with Hitachi Ltd. and Dai Nippon Printing Co.
Fig. 11. Characteristics of the present continuous search operation for wide-range associative processing (horizontal axis: search range, up to the i-th nearest data).
Measurement Results and Discussions
The measurement results show that the operation speed reaches 294.1 MHz and the power dissipation is 320.7 mW at a supply voltage of 1.8 V. The total search time for nearest-match detection is 2.00 µs in the worst case. Fig. 10 shows the operation speed as a function of the supply voltage from 0.8 V to 2.0 V.
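The reported figures can be related to each other with simple arithmetic. The sketch below takes the worst-case nearest-match search time as 2.00 µs and assumes that the clock rate and power dissipation stay constant over the whole search, which the measurements do not state explicitly. The derived cycle count and energy are therefore rough estimates, not reported results.

```python
clock_hz = 294.1e6       # measured operation speed at 1.8 V
power_w = 320.7e-3       # measured power dissipation at 1.8 V
search_time_s = 2.00e-6  # worst-case nearest-match search time

# Assumption: constant clock and constant power during the whole search.
period_ns = 1e9 / clock_hz
cycles = search_time_s * clock_hz
energy_nj = power_w * search_time_s * 1e9

print(f"clock period : {period_ns:.2f} ns")
print(f"search cycles: {cycles:.0f}")
print(f"energy/search: {energy_nj:.1f} nJ")
```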
The fully digital implementation enables a low-voltage operation mode down to 0.8 V. It attains an operation frequency of 72.4 MHz and a power dissipation of 15.1 mW at 0.9 V. The associative processing ensures Manhattan distance computation with faultless precision. Fig. 11 shows the worst-case search time for wide-range Manhattan distance computation. The present search engine is capable of a continuous search operation to detect all data in the sorted order of the exact Manhattan distance in addition to the nearest-match data. It efficiently realizes a wide-range search operation, as shown by (a) in Fig. 11. On the other hand, the conventional architectures require considerable search operations. Fig. 11 (b) is estimated based on [6] as a conventional digital technique. Fig. 11 (c) is estimated based on [9] as a conventional mixed-signal technique, assuming that it is scalable to the same capacity as the present coprocessor, since no such long-distance search by mixed-signal techniques has been reported so far. Capacity scalability is also one of the advantages of the present digital implementation. Table I shows the core area and SRAM ratio for various data capacities. The integration ratio of SRAMs is almost equivalent to the ratio of 19 % of the conventional digital processor [6]. Furthermore, the present architecture has the possibility of a large database capacity in a practical die size, since it makes device scaling easier than the conventional mixed-signal techniques. Table II summarizes the chip specifications.
Conclusions
We have proposed a new word-parallel digital architecture and circuit implementation for accurate and wide-range Manhattan distance computation employing a hierarchical search path and a weighted search clock technique. It is capable of detecting all data in the sorted order of the exact Manhattan distance in addition to the nearest-match data. The weighted search clock technique performs the wide-range associative processing with fewer additional cycles. Furthermore, the digital implementation enables low-voltage operation for SoC applications in future process technologies. It also makes device scaling easier and provides the possibility of a large data capacity with unlimited search distance. An associative engine with 64 words of 8 bits × 32 elements has successfully performed the Manhattan distance computation. The worst-case search time for sorting all data is 5.85 µs at a supply voltage of 1.8 V.
