A compact CMOS vision sensor for the detection of higher-level image features, such as corners, junctions (T-, X-, Y-type) and linestops, is presented. On-chip detection of these features significantly reduces the amount of data and hence facilitates subsequent pattern recognition processing. The sensor performs a series of template matching operations in an analog/digital mixed mode to implement various image filtering operations, including thinning, orientation decomposition, error correction, set operations, and others. The analog operations are done in the current domain. A design procedure, based on a formulation of transistor mismatch, is applied to fulfill both accuracy and speed requirements. The architecture resembles a CNN-UM that can be programmed by a 30-bit word. The results from an experimental 16 × 16 pixel chip demonstrate that the sensor is able to detect features at high speed owing to its pixel-parallel operation. Over 270 individual processing operations are performed in about 54 µsec.
Introduction
The conventional approach to pattern recognition, consisting of image capture and subsequent analysis by a host CPU, has two time-consuming steps: image transfer and image analysis. The former is determined by the video rate (about 33 msec per frame), while the latter is determined by the amount of data and the algorithm employed. These two steps make it difficult to use the conventional approach for real-time pattern recognition tasks such as high-speed product inspection on an assembly line, character recognition, and autonomous navigation. The time-consuming steps originate from the inherent separation between image capture and image analysis.
The above observations have motivated a new type of image sensor that incorporates a processing element at each pixel. The on-chip processing element extracts relevant information for subsequent pattern recognition, resulting in a reduced amount of data. Implementations have been done in both the analog and digital domains using CMOS technology. The analog approach performs computation by exploiting the analog interaction between pixels [1] [2] [3] [4]. No external clock is required for computation. The detected features have been primarily edges and orientations, which are relatively low-level features corresponding to those detected at the early stages of the visual pathway. A chip that has more general image processing capability has also been proposed [5].
On the other hand, the digital implementation provides programming flexibility [6] [7] [8]. The sensor operates in an iterative fashion by updating the memory content based on a specified program. In Ref. [8], Ishikawa has presented a sensor in which the pixels achieve both compactness and programmability by incorporating an ALU (arithmetic logic unit) at each pixel. This approach, however, has limitations for image processing that is based on neighborhood interaction, e.g. a 3 x 3 window, since each instruction takes only two inputs.
The concept of the cellular neural network (CNN) has also been extensively employed [9] [10] [11], in which computation is performed in an analog mode. The CNN concept has been further extended to the CNN universal machine (CNN-UM) architecture to include programming capabilities [12]. Domínguez-Castro has built a sensor based on this principle and demonstrated its functionality for image processing applications [13]. However, the operation is not very fast, usually on the order of microseconds, due to the settling behavior of the analog interaction between pixels.
Despite all these research activities, no attempt has been made so far to detect higher-level features such as corners and junctions in a discriminative fashion. Detection of these features is very important for object recognition [14]. Even in the digital implementation, which has programming flexibility, existing software is too complicated to be mapped onto hardware. With the final hardware implementation in mind, we have developed an efficient feature detection algorithm that is partly inspired by biology [15]. The algorithm first decomposes an input image into four orientations and then performs an iterative computation resulting in the detection of corners, three types of junctions (T-, X-, Y-type), and linestops. Each operation is carried out by template matching.
The paper presents the hardware implementation of the above algorithm. Template matching is mapped onto configurable hardware by an analog current-mode operation. An operation can be iteratively executed as many times as required according to an external digital control. Thus, the proposed sensor is a simplified form of a CNN-UM architecture, optimized for high-speed operation.
A key aspect of the design of current-mode circuits is transistor mismatch. We will present a systematic procedure to determine the design parameters, including transistor dimensions and currents. Limited information is available in the literature on design procedures based on transistor mismatch. In a paper dealing with transistor mismatch [16], Lakshmikumar et al. have presented a method for determining the required accuracy of current sources in order to satisfy a specific yield. A procedure for determining the sensor's design parameters for the required accuracy is presented in this paper.
The paper is organized as follows. Section 2 describes the overall design of the sensor, including a brief discussion of the algorithm. Section 3 presents the design procedure based on transistor mismatch analysis. Section 4 explains the details of the implementation, focusing on mechanisms to improve the speed. The experimental results are given in section 5, followed by a discussion and conclusion.
Overall sensor design
Algorithm
Fig. 1 shows the processing flow of the feature detection algorithm for the letter 'A' [15].
The image is decomposed into four orientation planes. Then a series of operations are performed for each orientation plane. The results of processing in each orientation plane are combined to generate the final five features (T-, X-, Y-type junctions, true corners, and true linestops). All these operations are carried out by template matching in a 3 x 3 window by using a set of weights specific for each operation:
where $x_{ij}(n)$ is the binary status of the pixel at position $(i,j)$ at discrete instant $n$, $r_{ij}$ is the element of the template, and $f$ is the function that generates a binary output with the threshold $I$, given in the form below:
As discussed in Ref [15] the template matching can be implemented as a convolution, using current distribution and summation, where the convolution kernel (distribution pattern) is the flipped version of the template. 
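The equivalence between template matching and current distribution with a flipped kernel can be checked numerically. The following is a minimal Python sketch under stated assumptions, not a model of the chip circuit: the function names, the zero padding outside the array, the strict comparison against the threshold, and the example template and image are all chosen here for illustration only.

```python
def match_template(image, template, threshold):
    """Gather form: every pixel sums its template-weighted 3 x 3 neighborhood."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for k in (-1, 0, 1):
                for l in (-1, 0, 1):
                    y, x = i + k, j + l
                    if 0 <= y < rows and 0 <= x < cols:  # assumed: zero outside the array
                        total += template[k + 1][l + 1] * image[y][x]
            out[i][j] = 1 if total > threshold else 0
    return out


def distribute_and_sum(image, template, threshold):
    """Scatter form: every active pixel distributes the flipped template to its neighbors."""
    rows, cols = len(image), len(image[0])
    kernel = [row[::-1] for row in template[::-1]]  # template rotated by 180 degrees
    sums = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if image[i][j]:
                for k in (-1, 0, 1):
                    for l in (-1, 0, 1):
                        y, x = i + k, j + l
                        if 0 <= y < rows and 0 <= x < cols:
                            sums[y][x] += kernel[k + 1][l + 1]
    return [[1 if v > threshold else 0 for v in row] for row in sums]


# Illustrative template: +1 on one diagonal, -1 in the upper-left triangle, 0 elsewhere.
TEMPLATE_45 = [[-1, -1, 1],
               [-1,  1, 0],
               [ 1,  0, 0]]
# Illustrative 5 x 5 binary image containing a single diagonal line.
IMAGE = [[1 if i + j == 4 else 0 for j in range(5)] for i in range(5)]

gathered = match_template(IMAGE, TEMPLATE_45, 1.5)
scattered = distribute_and_sum(IMAGE, TEMPLATE_45, 1.5)
print(gathered == scattered)
```

The scatter form mirrors the hardware view, in which each pixel only needs to know its own memory content and the distribution pattern, while the gather form mirrors the algorithmic description of template matching.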
Chip overall architecture

Chip pixel architecture
The conceptual architecture of the pixel circuit is shown in Fig. 3. Each pixel consists of a phototransduction stage (shown on the right-hand side) and a processing circuit. In the phototransduction stage, the output current of the phototransistor, $I_{ph}$, is compared to a threshold current $I_{th1}$ to produce a binary output $V_{ph}$, which is transferred to the memory when the signal photo is set to logic High.
The processing circuit is constructed in a mixed-mode fashion, in the sense that the internal operation is performed in the analog domain while the control of the operation is performed in the digital domain. The circuit consists of the memory and the convolution unit with programmable kernels (distribution patterns). The basic function of this circuit is to distribute a current to the 3 x 3 neighborhood pixels if the memory content is logic High. The distribution pattern is specified by setting switches along the connection paths to the neighbors, which are not shown in Fig. 3. The currents from the neighboring pixels, in turn, are summed to generate $I_{sum}$, which is compared with $I_{th2}$ to generate a binary output $V_o$. This binary voltage is then transferred to the memory by a two-phase clock ($\phi_1$ and $\phi_2$) and a parasitic capacitor.
A typical processing sequence is as follows. Signal photo is set to logic High to store the result of phototransduction V ph into the memory. Next, the signal photo is set to a logic Low to start a series of operations. Different kernels and threshold values are specified for each operation to perform various types of processing functions.
Processing circuit
The complete schematic of the processing circuit is shown in Fig. 4, which is a detailed representation of the conceptual architecture shown in Fig. 3. Note that there are six memories. Four of them ($M_a$, $M_b$, $M_c$, $M_d$) are used to store the image in the four orientation planes, while the other two are used as working memories. Associated with each memory are six reference current sources, whose magnitude is determined by the bias voltage $V_{ref}$. A set of signals (a1, b1, c1, d1, x1, y1) determines which memory is accessed. When one of these signals is on, the corresponding memory stores a logic High, and signal RE is logic High, the reference current $I_{ref}$ pulls down the voltage of node Nc, resulting in current spreading to the neighboring pixels. While the current is distributed to the neighboring pixels, the currents generated at the neighboring pixels are accumulated on a thresholding node Nc. The accumulated current $I_{sum}$ is compared to the threshold current $I_{th}$, specified by signals $V_{th1}$ and $V_{th2}$. With these switches set High. The operation can be repeated as many times as needed.
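The condition under which a pixel drives its reference current can be written as a small piece of Boolean logic. The sketch below is only an illustration of the rule stated above; the function name, the dictionary-based interface, and the labels "x" and "y" for the two working memories are assumptions, and all analog behavior is ignored.

```python
MEMORY_NAMES = ("a", "b", "c", "d", "x", "y")  # "x" and "y": assumed labels for the working memories


def distributes_current(select, memory, re):
    """Return True when the pixel spreads I_ref to its neighbors.

    select: dict mapping memory name -> state of its select signal (a1, b1, ...)
    memory: dict mapping memory name -> stored bit
    re:     state of the RE signal
    """
    return bool(re) and any(select[name] and memory[name] for name in MEMORY_NAMES)


stored = {"a": 1, "b": 0, "c": 0, "d": 0, "x": 0, "y": 0}
select_a = {name: int(name == "a") for name in MEMORY_NAMES}
select_b = {name: int(name == "b") for name in MEMORY_NAMES}

print(distributes_current(select_a, stored, re=1))  # memory a selected and High
print(distributes_current(select_b, stored, re=1))  # memory b selected but Low
print(distributes_current(select_a, stored, re=0))  # RE Low blocks the current
```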
Design procedure
A procedure to determine the channel length and width, and the magnitude of the reference current, is given below to satisfy a given requirement for accuracy and speed. The specification for the present sensor has been defined as an operational clock frequency higher than 5 MHz and an error rate smaller than 0.1 %.
[Fig. 4 signal labels: C, N1, N2, NE1, NE2, E1, E2, SE1, SE2, S1, S2, SW1, SW2, W1, W2, NW1, NW2, a1, b1, c1, d1, x1, x2, y1, y1n, y2, y2, Vth1, Vth2, RE]
Formulation of the current variation
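The analysis starts from the following basic relationship:

$$I = \frac{\beta}{2}\,(V_{GS} - V_T)^2, \qquad \beta = \mu C_{ox}\frac{W}{L} \tag{3}$$

where $I$ is the drain current, $V_{GS}$ is the gate-source voltage, $V_T$ is the threshold voltage of the transistor, $\beta$ is the current factor, $\mu$ is the carrier mobility, $C_{ox}$ is the oxide capacitance per unit area, and $W$ and $L$ are the transistor width and length, respectively. For a set of transistors that are biased with a constant gate voltage $V_{GS}$, the variance of $I$ is expressed as a function of the variances of $\beta$ and $V_T$. One can easily prove that

$$\frac{\sigma_I^2}{\bar{I}^2} = \frac{\sigma_\beta^2}{\bar{\beta}^2} + \frac{4\,\sigma_{V_T}^2}{(V_{GS} - \bar{V}_T)^2} \tag{4}$$

in which $\bar{I}$, $\bar{\beta}$, and $\bar{V}_T$ are the mean values of $I$, $\beta$, and $V_T$, respectively.

The variances of $\beta$ and $V_T$ are known to be inversely proportional to the transistor area [17]:

$$\sigma_{V_T}^2 = \frac{A_{V_T}^2}{WL} \tag{5}$$

$$\frac{\sigma_\beta^2}{\bar{\beta}^2} = \frac{A_\beta^2}{WL} \tag{6}$$

where $A_{V_T}$ and $A_\beta$ are the mismatch proportionality constants for $V_T$ and $\beta$, respectively, and $W$ and $L$ are the width and length of the transistor channel. Substituting equations (5) and (6) into (4) yields

$$\frac{\sigma_I^2}{\bar{I}^2} = \frac{A_\beta^2}{WL} + \frac{4A_{V_T}^2}{WL\,(V_{GS} - \bar{V}_T)^2} \tag{7}$$

For reported values of $A_{V_T}$ and $A_\beta$, the effect of the second term is much larger than that of the first term unless the transistor is too heavily biased. Hence, by dropping the first term, the above equation is simplified as

$$\frac{\sigma_I^2}{\bar{I}^2} = \frac{4A_{V_T}^2}{WL\,(V_{GS} - \bar{V}_T)^2} \tag{8}$$

Using the relationship in equation (3), which gives $(V_{GS} - \bar{V}_T)^2 = 2\bar{I}L/(\mu C_{ox} W)$, the following formula is obtained:

$$\frac{\sigma_I^2}{I_{ref}^2} = \frac{2\mu C_{ox} A_{V_T}^2}{L^2 I_{ref}} \equiv s^2 \tag{9}$$

in which $s^2$ is defined as the relative variance of the current when its magnitude is equal to $I_{ref}$.

It should be noted that the relative variance of the current is inversely proportional to $L^2$ and is independent of $W$. $\sigma_I^2$ can also be explicitly expressed as a function of $I_{ref}$ and $L$ for the more general case in which the current is $mI_{ref}$ and the channel length is $bL$:

$$\sigma_I^2 = \frac{2\mu C_{ox} A_{V_T}^2}{(bL)^2}\, m I_{ref} = \frac{m}{b^2}\, s^2 I_{ref}^2 \tag{10}$$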
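The scaling of the mismatch with channel length and current can be evaluated numerically. The sketch below assumes that the relative variance takes the form $s^2 = 2\mu C_{ox} A_{V_T}^2 / (L^2 I_{ref})$, which follows from the square-law model with area-scaled threshold mismatch when the $\beta$ term is dropped, and that a current $mI_{ref}$ in a transistor of length $bL$ has variance $(m/b^2)s^2I_{ref}^2$. The function names are hypothetical, and the value $\mu C_{ox}$ = 100 µA/V² is an assumed, typical-order number that is not taken from the paper; only $A_{V_T}$ = 15 mV·µm corresponds to the value assumed in the simulation.

```python
import math


def relative_sigma(i_ref, length_um, k_prime, a_vt=15e-3):
    """Relative current spread s = sigma_I / I_ref for one transistor.

    i_ref:     reference current in A
    length_um: channel length in micrometers
    k_prime:   mu * C_ox in A/V^2 (assumed process value)
    a_vt:      threshold mismatch constant in V*um
    """
    return math.sqrt(2.0 * k_prime * a_vt ** 2 / (length_um ** 2 * i_ref))


def scaled_sigma(s, i_ref, m=1.0, b=1.0):
    """Absolute sigma (in A) for a current m*I_ref and a channel length b*L."""
    return math.sqrt(m) / b * s * i_ref


K_PRIME_NMOS = 100e-6  # assumed value, for illustration only

s = relative_sigma(4e-6, 3.7, K_PRIME_NMOS)
print(round(s, 4))
print(scaled_sigma(s, 4e-6, m=1.5))  # spread of a 1.5 * I_ref current
```

Doubling the channel length halves $s$, while changing the width has no effect in this approximation, which is why the channel length is the main design handle for accuracy.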
The variation of the current for a set of multiple current mirror circuits is analyzed next.
When $V_{GS}$ is treated as a random variable as well as $V_{T1}$ in equation (3), the following formula is obtained (the effect of $\beta$ is neglected):
The gate voltage $V_{GS}$ is expressed as
(12)
The above formula can be used to obtain
or equivalently
By substituting equation (14) into (11), we get
Since
, and the second and the third terms on the right-hand side are equal to $s^2$, equation (15) can be written as (16)
Note that the current mirror operation increases the variance of the current by $2s^2 I_{ref}^2$. When the current is $mI_{ref}$ and the channel length is $bL$, the above formula is more generally represented as
Formulation of accuracy requirement
Based on the formulations obtained above (equations (10) and (17)), the relationship between the design parameters and the accuracy of the operation is given below for the detection of 45° line segments. This is the most demanding operation in the feature detection algorithm because the number of neighbors used in the computation is larger than in the other operations. The template consists of three 1's at the right diagonal positions, three -1's in the upper or the lower triangle, and three 0's in the opposite triangle [15]. When the current output of I 5 is larger than the threshold current $I_{th} = 1.5I_{ref}$, the local orientation is determined to be 45°. The variance is also given for each of them.
corresponding to the output current of $2I_{ref}$, the variance of current $I_{sum}$ (represented in Fig. 4) is
When the pixel is not categorized as 45° by the presence of three 1's and two -1's as shown in Fig. 6 (d Fig. 7 where the x-axis is scaled to $I_{ref}$.
For $s = 0.03$, shown on the left-hand side, the probability distributions are narrow and almost no overlap exists between $p_1(x)$ and $p_{1.5}(x)$, or between $p_2(x)$ and $p_{1.5}(x)$. On the other hand, for $s = 0.05$, shown on the right-hand side, the distributions are broader, resulting in a considerable overlap between $p_1(x)$ and $p_{1.5}(x)$, and between $p_2(x)$ and $p_{1.5}(x)$.
These overlap regions correspond to classification errors. Table 1 shows the error rate broken down into two components, one resulting from the overlap of the two right curves ($p_{right}$) and one from the overlap of the two left curves ($p_{left}$), as estimated by Monte-Carlo simulations for different relative current variations. As expected, the error rate is lower for smaller values of $s$. The error rate on the left-hand side is higher than that on the right-hand side because $p_1(x)$ is broader than $p_2(x)$. The data in the table indicate that for the error rate to be smaller than 0.1%, $s$ should be smaller than 0.03.
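The kind of Monte-Carlo estimate described above can be sketched in a few lines of Python. This is a simplified illustration under explicit assumptions and is not the simulation used for Table 1: every unit current is modeled as an independent Gaussian with mean $I_{ref}$ and standard deviation $sI_{ref}$, the threshold current $1.5I_{ref}$ is given a standard deviation of $\sqrt{1.5}\,sI_{ref}$, and the additional variance introduced by current mirroring is ignored, so the absolute numbers will be lower than those in the table. The function name and trial count are hypothetical.

```python
import random


def error_rates(s, trials=200_000, seed=0):
    """Estimate (p_left, p_right) with currents expressed in units of I_ref.

    p_left:  a net current of 1 (three +1, two -1) wrongly exceeds the threshold.
    p_right: a net current of 2 (three +1, one -1) wrongly stays below it.
    """
    rng = random.Random(seed)

    def unit():
        # Assumed model: one mirrored unit current with relative spread s.
        return rng.gauss(1.0, s)

    false_alarms = 0
    misses = 0
    for _ in range(trials):
        threshold = rng.gauss(1.5, s * 1.5 ** 0.5)
        net_two = unit() + unit() + unit() - unit()
        net_one = unit() + unit() + unit() - unit() - unit()
        if net_one > threshold:
            false_alarms += 1
        if net_two <= threshold:
            misses += 1
    return false_alarms / trials, misses / trials


for s in (0.03, 0.05, 0.10):
    p_left, p_right = error_rates(s, trials=50_000)
    print(s, p_left, p_right)
```

Even in this reduced model the left-hand error exceeds the right-hand error for large $s$, because the net current of one unit is built from five noisy currents while the net current of two units is built from four.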
Formulation of the operational speed requirement
In addition to accuracy, the operational speed is another important specification for the sensor circuits. The response is largely determined by charging and discharging of the capacitance associated with the gate of the current mirror transistors as is schematically shown in Fig. 8 .
The capacitance drawn as a dashed line represents the parasitic capacitance connected to that node, which consists of the gate and drain capacitance of M 1 and the gate capacitance of M 2 .
Analysis of the circuit of Fig. 8 gives the rise time $t_r$ and fall time $t_f$ as
where $t_r$ is defined as the time required for the current to rise to I 1 = I ref , and $t_f$ is defined as the (9)). The speed requirement is specified using the fall time since the fall time is larger than the rise time when =0.95 and =0.05, the conventionally used values for the rise time and the fall time.
Determination of the design parameters
The relationship between the reference current and the channel length that satisfies various accuracy and speed requirements is plotted in Fig. 9. The channel width is chosen to be 1.5 µm instead of the minimum dimension of 0.9 µm for the 0.5 µm technology. The minimum width is avoided since the current-sinking NMOS transistor operates close to the edge of the saturation region. It is clear from the graph in Fig. 9
In the present design, the current level was chosen as 4 µA to satisfy both speed and accuracy requirements without consuming too much power. For a current of 4 µA, the channel lengths of the NMOS and PMOS transistors are chosen as 3.7 µm and 2.2 µm, respectively, to satisfy the accuracy requirement (shown as closed circles on the graphs).
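The accuracy side of this trade-off can be illustrated by inverting the single-transistor mismatch relation $s^2 = 2\mu C_{ox} A_{V_T}^2/(L^2 I_{ref})$ for the channel length. The sketch below is a rough estimate only: it ignores the variance added by current mirroring and the speed constraint, both of which enter the actual curves of Fig. 9. The function name is hypothetical, and the $\mu C_{ox}$ values for NMOS and PMOS are assumed, typical-order numbers that are not given in the paper, so any agreement with the chosen lengths depends entirely on those assumptions.

```python
import math


def min_channel_length_um(i_ref, s_max, k_prime, a_vt=15e-3):
    """Shortest channel length (um) keeping the relative spread below s_max.

    i_ref:   reference current in A
    s_max:   required relative standard deviation of the current
    k_prime: mu * C_ox in A/V^2 (assumed process value)
    a_vt:    threshold mismatch constant in V*um
    """
    return math.sqrt(2.0 * k_prime * a_vt ** 2 / i_ref) / s_max


K_PRIME_NMOS = 100e-6  # assumed, for illustration only
K_PRIME_PMOS = 35e-6   # assumed, for illustration only

l_nmos = min_channel_length_um(4e-6, 0.03, K_PRIME_NMOS)
l_pmos = min_channel_length_um(4e-6, 0.03, K_PRIME_PMOS)
print(round(l_nmos, 2), round(l_pmos, 2))
```

The required length scales as $1/(s\sqrt{I_{ref}})$, so raising the reference current relaxes the length requirement at the cost of power, which is the trade-off resolved by the 4 µA operating point.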
Implementation
Mechanisms for high speed operation
It is clear from the circuit shown in Fig. 4 that the operational speed is mainly determined by the time required for charging and discharging node Nc, which has a large capacitance due to the many PMOS gates connected to it. In order to decrease the time for charging and discharging this node, two additional circuits have been implemented, as shown in Fig. 10.
The second mechanism to improve the speed is shown in the darkly shaded block in Fig. 10. This circuit is a simple self-feedback inverter with two switches that keeps the voltage at the thresholding node constant. The inverter is designed to have a switching voltage of 1.15 V, which is one half of the 2.3 V that appears at the output of an NMOS switch after transfer of a logic High (= $V_{DD}$) signal. The voltage is kept constant during the $\phi_2$ phase and for an extended period set by an additional control signal $\phi_{2d}$, which is a delayed version of $\phi_2$. By this mechanism, the initial voltage from which the node voltage starts changing can always be set to 1.15 V. Even a small change in the voltage then drives the next inverter INV1 to either logic High or Low, which achieves high-speed operation. The additional duration controlled by $\phi_{2d}$ ensures that all transistors sourcing a current towards the thresholding node, or sinking a current from it, have completely settled by the end of $\phi_{2d}$. As a result, the outcome of the current comparison is immediately reflected as a change in the voltage of the thresholding node.
Fig. 11 shows the simulation result for the modified circuit at a frequency of 10 MHz. The delay between $\phi_{2d}$ and $\phi_2$ was chosen as 30 nsec, which is the time required for transistor $M_1$ to be turned off completely. All the operations are executed correctly. The voltage at the thresholding node, $V_o$, is kept at 1.15 V until $\phi_{2d}$ goes low. Immediately after $\phi_{2d}$ goes low, the voltage $V_o$ starts decreasing for the op1, op2, and op3 operations to produce the correct output.
Control Circuits
Fig. 12 shows the schematic of the timing control circuit. The inputs to the circuit are a master clock (CLK) and 30-bit control signals (CS). The circuit generates two non-overlapping clocks ($\phi_1$ and $\phi_2$). $\phi_{2d}$ is then generated by delaying $\phi_2$. The amount of delay is externally controlled by setting the number of inverters the signal has to go through. The control signals CS, whose status changes when CLK rises, are gated with a signal derived from $\phi_2$ and $\phi_{2d}$ to produce truncated control signals (CS1). The new control signals CS1 are generated so that: (1) the onset of the signals CS1 is synchronized with the time when the operation phase is initiated ($\phi_2$ goes Low), and (2) when the processing result stored at the intermediate node is transferred to the memory ($\phi_1$ goes Low, then $\phi_1$ goes High), the control signal is still maintained (until $\phi_{2d}$ goes High).
Layout and Fabrication
Fig. 13 (a) shows the layout of the pixel. Care was taken to make the pixel shape which consumes more than half the entire area, was used for the common signal lines, which were laid out both horizontally and vertically. The first and the second metal layers were used for signal lines while the third metal layer was used for shielding the circuit from light illumination.
An array of 16 × 16 pixels was placed on a chip area of 3.2 mm × 3.2 mm. Fig. 13 (b) shows the layout of the entire chip. Based on this layout, the sensor was fabricated using the HP 0.5 µm technology through MOSIS.
Experimental Results
The fabricated sensor was mounted on a test board. Various images were projected onto the sensor through a lens. An additional light source was used to produce a clear binary image. A power supply voltage of 4 V was used for the experiments.
Mismatch measurement
The degree of transistor mismatch was estimated by measuring the current output (I C in .
(24)
Since I C is obtained as a result of PMOS current mirroring, its variance is represented as 
This value is almost one half of the value assumed in the simulation (15 mV·µm). However, the assumed value was a rather conservative estimate, and the measured value agrees well with the reported result extrapolated toward smaller feature sizes [19]. As a result of this rather conservative estimate for $A_{V_T}$, the error rate should be lower than expected.
Speed measurements
The maximum operating frequency was investigated as a function of the reference current and the internal delay between $\phi_{2d}$ and $\phi_2$. For this purpose, the linestop detection operation was applied to an image consisting of a set of line segments, each two pixels long. If the operation is performed without error, both pixels of each segment are detected as linestops. The maximum operating frequency is defined as the frequency below which no detection error occurs (error rate of 0 %). Because the frequency can be set only to certain discrete values, with a maximum of 5 MHz, the minimum required current for a given frequency was determined instead of the maximum operating frequency for a given current. The experiment was carried out for different settings of the internal delay between $\phi_2$ and $\phi_{2d}$.
The obtained relationship between the maximum operating frequency and the reference current for different settings of internal delay is shown in Fig. 15 . It is obvious from the graph that the maximum frequency almost linearly increases as a function of the reference current.
The maximum frequency also increases as the delay of $\phi_{2d}$ increases. It should be noted that the improvement in speed diminishes as the delay increases, and appears to approach saturation when the delay is 35.4 ns. This is the time required for the sourcing and sinking currents to become almost zero, which is close to the estimate of 30 nsec obtained in the simulation. No further improvement in the operational speed would be expected for larger delays. From this experiment, it is concluded that a frequency of 5 MHz is
Responses to letter images
In certain cases the sensor produces unexpected results. For example, for the slanted letter "R", two pixels are detected as linestops instead of corners. This is because the 45° line segment constituting the loop is removed by the EIP (Elimination of Isolated Points) operation, since its length is only two pixels. This type of removal of short line segments also explains why only seven corners are detected instead of eight for the letter "O", and why unexpected linestops appear along the stroke of the letter "S". This issue originates from the low resolution of the sensor: an array of 16 × 16 pixels is not large enough for the present feature detection algorithm to be applied to some letters. Hence, the problem would be easily solved if a larger number of pixels were used in the future.
which is short considering the complexity of the operations involved. By increasing the functionality of the memory, the above-mentioned operations could also be performed in a single step, further reducing the overall processing time.
Discussion
The fabricated feature detection sensor operated successfully at high speed. The sensor has considerably more computational power than an architecture based on a single conventional signal processor. Suppose that the processor operates at a clock frequency of 1 GHz, so that each arithmetic computation requires 1 ns, and that each term of the weighted sum is computed in one clock cycle. These assumptions lead to a computation time of 9 ns for template matching over 3 × 3 neighbors, and hence to a total computation time of $9N^2$ ns for a pixel array of size $N \times N$. If $N$ is equal to 16, which is the size of the present sensor, the computation time is 2.3 µs. This is more than ten times longer than the operation time of 200 ns obtained with the sensor described in this paper. The high speed of the present sensor comes from the parallel computation of the processing elements at each pixel. The speed advantage of the sensor scales as $N^2$, since the time required for the single-processor architecture is proportional to $N^2$ while that required for the present sensor is independent of $N$.
Table 3 summarizes the specifications of the sensor. Since the sensor is a prototype to demonstrate the principle of on-chip feature detection, the pixel size is still large and the performance can be further improved by several modifications. First, the channel length can be made shorter since the experimentally derived value of
Second, the communication between pixels may be simplified. For example, instead of directly distributing a current to the diagonal neighbors, it may be possible to distribute a current to the horizontal neighbors and have these neighbors further distribute the current toward their vertical neighbors. Such a scheme would significantly simplify the wiring for connecting neighboring pixels and hence reduce the pixel area. Third, the use of a technology with smaller feature sizes would enable higher pixel densities in future versions.
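The arithmetic behind this comparison is easy to reproduce. The short Python sketch below only restates the assumptions given in the text (1 ns per term, nine terms per pixel, 200 ns per operation on the sensor); the function names are hypothetical.

```python
def serial_time_ns(n, window=9, ns_per_term=1.0):
    """Time for one template-matching pass over an n x n image on a single processor."""
    return window * ns_per_term * n * n


def speedup(n, sensor_op_ns=200.0):
    """Ratio of serial time to the pixel-parallel operation time, which does not depend on n."""
    return serial_time_ns(n) / sensor_op_ns


print(serial_time_ns(16))      # 2304.0 ns, i.e. about 2.3 us
print(speedup(16))             # 11.52
print(speedup(128))            # the advantage grows as n squared
print(270 * 200.0 / 1000.0)    # 54.0 us for 270 operations at 200 ns each
```

The last line shows that 270 operations at 200 ns each account for the total processing time of about 54 µsec.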
If a 0.25 µm technology is used, the pixel size would be reduced to about 80 µm × 80 µm, enabling an array of 128 × 128 pixels in a chip area of 10 mm × 10 mm. This is a practically useful resolution and is technically feasible to implement. With these modifications, the feature detection sensor may become practical for high-speed feature detection applications.
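The projected array size can be checked with a quick calculation. The sketch below assumes, as a simplification not stated in the paper, that the pixel pitch shrinks linearly with the feature size; the function names are hypothetical, and pad and peripheral circuit area are ignored.

```python
def scaled_pitch_um(pitch_um, old_feature_um, new_feature_um):
    """Pixel pitch after a technology shrink, assuming linear scaling with feature size."""
    return pitch_um * new_feature_um / old_feature_um


def array_side_mm(pixels, pitch_um):
    """Side length of a square pixel array in millimeters."""
    return pixels * pitch_um / 1000.0


print(array_side_mm(16, 150.0))             # present array: 2.4 mm
print(scaled_pitch_um(150.0, 0.5, 0.25))    # ideal shrink: 75.0 um
print(array_side_mm(128, 75.0))             # 9.6 mm
print(array_side_mm(128, 80.0))             # 10.24 mm
```

Under ideal linear scaling a 128 × 128 array occupies 9.6 mm on a side, whereas a pitch of exactly 80 µm gives 10.24 mm, so the quoted figures are consistent when read as approximate values.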
Conclusion
The paper describes the design and implementation of a new type of VLSI computational sensor. The sensor consists of an array of 16 × 16 processing elements, each measuring 150 µm × 150 µm, in a chip area of 3.2 mm × 3.2 mm. The sensor detects important image features, including corners, three types of junctions (T-type, X-type, Y-type), and linestops, for a binary image in a discriminative fashion. To realize fast operation while maintaining accuracy, a design procedure based on transistor mismatch has been proposed and successfully employed. Since these features are detected on-chip in about 50 µsec, the sensor can be used for various types of applications requiring high-speed feature extraction.
