An Efficient Image-Vector-Generation Processor for Edge-Based Complementary Feature Representations
Naoya Yamashita and Tadashi Shibata
Department of Electrical Engineering and Information Systems, the University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
Phone: +81-3-5841-6730 E-mail: naoya@if.t.u-tokyo.ac.jp, shibata@ee.t.u-tokyo.ac.jp
Introduction
In image recognition, generating image feature vectors is the most essential step, but it is computationally very expensive. A number of feature representation algorithms [1] [2] [3] [4] have been developed based on oriented edges extracted from images. Inspired by biological principles [5], we have developed two oriented-edge-based feature representations: projected principal-edge distribution (PPED) [3] and averaged principal-edge distribution (APED) [4]. They are complementary to each other, and their concurrent use was successfully applied to robust image recognition [4]. However, because of the high computational cost, it is difficult for general-purpose processors to achieve real-time performance and low power consumption when generating these two vectors at every pixel site of an input image.
To accelerate the processing, a dedicated processor was developed [6], employing arrayed shift registers directly connected to parallel adders. The chip generates a single PPED or APED vector every 64 clock cycles. To further accelerate the processing, another dedicated processor was developed, employing a 65×256-bit functional SRAM, a crossbar switch module, and 128 SIMD processing elements [7]. That chip generates a single PPED vector every two clock cycles, but it cannot generate APED vectors because the vector generation algorithms for PPED and APED are quite different.
In this work, we present a versatile processor capable of efficiently generating both PPED and APED vectors, employing a 65×256-bit regular SRAM and a 65-bit shift register along with only 16 SIMD processing elements (PEs) of enhanced functionality. A proof-of-concept chip was designed in a 0.18-µm CMOS technology, and its operation at 62.5 MHz was verified by measurement.
Complementary Feature Representations
Four directional edges are extracted from an input image by scanning the entire image pixel by pixel with 5×5-pixel kernel filters, generating four edge flag maps (Fig. 1(a)). Within a 64×64-pixel recognition window, each of the four edge flag maps is divided into 16 bins, and the edge flag count in each bin constitutes a single element of a 64-dimension feature vector (4 edge directions × 16 bins). In PPED, the bins are formed parallel to the edge orientation (Fig. 1(b)), and in APED each edge flag map is divided into 4×4 square bins (Fig. 1(c)). The two representations are complementary in the following sense. PPED represents the overall shape of an object well and is not very sensitive to distortion of the entire image, while APED represents the spatial relationship among the constituent parts of an object well and discriminates differences in shape while remaining indifferent to local variations. Robust image recognition was demonstrated using both PPED and APED concurrently [4]. To recognize multiple objects in a scene, we must scan the entire edge flag maps pixel by pixel with the 64×64-pixel recognition window and generate both PPED and APED vectors at every pixel site. Such computation is extremely expensive, but it is implemented efficiently in a digital architecture in this work.
Fig. 2(a) illustrates the efficient calculation of 45-degree PPED vector elements, where the 64×64-pixel recognition window is replaced by a 4×4-pixel recognition window for simplicity of explanation. By repeating the add-and-shift operation four times as shown in Fig. 2(a), a PPED vector is generated. The next PPED vector, for the window shifted down by one pixel, is easily obtained by adding the edge flag bits in the incoming row to, and subtracting those in the outgoing row from, the previous vector data left in the PEs. For the 64×64-pixel recognition window, 64-dimension PPED vectors are efficiently generated every two clocks.
Fig. 2(b) gives a simplified illustration of efficient APED vector generation from a 4×4-pixel recognition window. Two neighboring flag bits are added beforehand and then shifted in to every two PEs (16 neighbors are added and shifted in to every four PEs in the case of the real 64×64-pixel window). By repeating the process four times, one APED vector is generated. When the recognition window moves down by one pixel, only the data in the incoming row and the outgoing row are used to generate the next APED vector. For the 64×64-pixel recognition window, 64-dimension APED vectors are efficiently generated every four clocks.
System Architecture
Processing Elements
Figs. 3(a, b) show the architecture of the 16 SIMD PEs and of a single PE, respectively. Each PE is composed of eight up/down counters. The registers in the counters are connected to their neighbors, thus forming a shift register. The up/down count direction and the right/left shift direction are determined by the edge direction. Execution of the up/down count is controlled using the edge flags as enable signals.
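The row-wise update described for Fig. 2 (adding the incoming row and subtracting the outgoing row) can be sketched in software. The following is a minimal illustration only, not the chip's datapath: it assumes bins that are vertical strips of four columns, so that the bin index does not depend on the row position, and it does not model the diagonal add-and-shift of the 45-degree PPED or the pre-addition used for APED. The helper names (`column_counts`, `slide_down`, `strip_bins`, `aped_bins`) and the synthetic edge flags are hypothetical.

```python
WIN = 64    # recognition window size in pixels (from the text)
BINS = 16   # bins per edge direction (from the text)

def column_counts(edge_map, top, left):
    """Directly count the edge flags in each of the 64 window columns."""
    rows = edge_map[top:top + WIN]
    return [sum(row[left + c] for row in rows) for c in range(WIN)]

def slide_down(counts, edge_map, top, left):
    """Counts for the window at row top + 1, given the counts at row top.
    Only the incoming and the outgoing rows are touched."""
    incoming = edge_map[top + WIN]
    outgoing = edge_map[top]
    return [counts[c] + incoming[left + c] - outgoing[left + c]
            for c in range(WIN)]

def strip_bins(counts):
    """Group 64 column counts into 16 vertical-strip bins (assumed layout)."""
    w = WIN // BINS
    return [sum(counts[b * w:(b + 1) * w]) for b in range(BINS)]

def aped_bins(edge_map, top, left):
    """Direct APED-style counts: 4x4 square bins of 16x16 pixels each."""
    s = WIN // 4
    return [sum(edge_map[top + r][left + c]
                for r in range(br * s, (br + 1) * s)
                for c in range(bc * s, (bc + 1) * s))
            for br in range(4) for bc in range(4)]

# Synthetic edge flags for illustration only (not data from the chip).
edge_map = [[1 if (3 * r + 5 * c) % 7 == 0 else 0 for c in range(WIN)]
            for r in range(WIN + 8)]

counts = column_counts(edge_map, 0, 0)
for top in range(8):
    counts = slide_down(counts, edge_map, top, 0)
    # The incremental result must match a full recount of the shifted window.
    assert counts == column_counts(edge_map, top + 1, 0)
```

Each slide touches 2 × 64 flags instead of recounting all 64 × 64 flags in the window, which is the saving the add/subtract scheme exploits.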
On-Chip Memory
In order to efficiently store the edge flags sent from an external chip and transfer them seamlessly to the PEs, a new data read/write scheme is introduced for the on-chip memory. As a result, both the memory capacity and the overhead time for starting the processing are minimized, with a much simpler configuration than in the previous work [7]. Fig. 4 shows the edge-flag store and read-out scheme in the on-chip memory. First, only the edge flags in a limited area of 64×256 pixels taken from the input edge map are stored in the on-chip SRAM, leaving the single right-most column empty. Overhead time arises only at this stage. Then, 64-bit edge flags are read out from the top row and sent to the PEs. At the same time, the 64-bit edge flags are loaded into the shift register at the top, a newly arriving 1-bit edge flag from the input edge map is inserted at the left, and the resulting data are written back to the top row. By repeating this process all the way down to the bottom row, the SRAM contents are updated to the edge flag data shifted one column to the right. By further repeating this process, all edge flags in the edge map are converted seamlessly to feature vectors at every pixel site.
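The read-and-write-back scheme above can be modeled behaviorally in a few lines. This is a rough sketch under stated assumptions, not the chip's circuit: each SRAM row is modeled as a list of 65 slots, the 64 left-most slots are assumed to be the ones read out on every pass, and the scan direction over the input edge map is not meant to match the chip. The names `init_sram` and `read_and_update` and the synthetic edge map are hypothetical.

```python
ROWS = 256   # number of SRAM rows (from the text)
WIDTH = 64   # edge flags read per row; each row has WIDTH + 1 slots

def init_sram(edge_map, col0):
    """Store a 64-column strip of the edge map; the right-most slot of
    every row is left empty (None)."""
    return [list(row[col0:col0 + WIDTH]) + [None] for row in edge_map]

def read_and_update(sram, r, new_flag):
    """Read 64 flags from row r for the PEs, then write the row back
    shifted one column to the right with the new flag at the left."""
    to_pes = sram[r][:WIDTH]          # 64 flags sent to the PEs
    sram[r] = [new_flag] + to_pes     # 65-bit shift register contents
    return to_pes

# Synthetic edge flags for illustration only (not data from the chip).
edge_map = [[1 if (r * 7 + c * 3) % 5 == 0 else 0 for c in range(66)]
            for r in range(ROWS)]

sram = init_sram(edge_map, 1)         # strip holds map columns 1..64
for r in range(ROWS):                 # one pass from the top row to the bottom
    flags = read_and_update(sram, r, edge_map[r][0])
    assert flags == edge_map[r][1:65]
```

After one full pass, every row holds the strip shifted by one column, so the next pass reads the neighboring window position without reloading the memory.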
Measurement Results and Discussion
To verify the concept, a chip was designed in a 0.18-µm CMOS technology, and efficient generation of both PPED and APED vectors was demonstrated. Fig. 5(a) shows the chip photomicrograph, and Figs. 5(b, c) show the measurement results. The chip operates at 62.5 MHz and generates PPED vectors every two clocks and APED vectors every four clocks. Figs. 6(a, b) compare this work with previous works in terms of the frame rate and the energy per frame when scanning VGA images. The chip in this work generates the two complementary edge-based feature representations 13.8 times faster than the previous work [6] and, for the first time, generates these representations at a rate compatible with real-time recognition of motion images (> 30 fps). The energy consumed per frame is 3.44 times smaller than that of the related work [8], achieving low power consumption.
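As a rough plausibility check of the frame-rate claim, the stated figures can be combined in a back-of-envelope estimate. The calculation below is an assumption-laden sketch, not the authors' derivation: it assumes a 640×480 VGA frame, one PPED and one APED vector per pixel site generated one after the other (2 + 4 clocks), and no memory-loading or control overhead. The function name `estimated_fps` is hypothetical.

```python
CLOCK_HZ = 62.5e6        # measured operating frequency (from the text)
PPED_CLOCKS = 2          # clocks per PPED vector (from the text)
APED_CLOCKS = 4          # clocks per APED vector (from the text)
VGA_W, VGA_H = 640, 480  # assumed VGA resolution

def estimated_fps(clock_hz, clocks_per_site, width, height):
    """Frames per second if every pixel site costs clocks_per_site cycles."""
    cycles_per_frame = width * height * clocks_per_site
    return clock_hz / cycles_per_frame

# Assumes PPED and APED are generated sequentially at every pixel site.
fps = estimated_fps(CLOCK_HZ, PPED_CLOCKS + APED_CLOCKS, VGA_W, VGA_H)
```

Under these assumptions the estimate falls slightly above 30 fps, which is consistent with the real-time claim; the measured comparison in Figs. 6(a, b) remains the authoritative figure.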
