ABSTRACT
Introduction
After the cellular neural network (CNN) was introduced by Chua [1], [2], many potential applications were reported in the literature. Although the CNN architecture has great potential, its hardware implementation has many problems, mainly due to the large array size required. The main advantage of the CNN is that the function of the array is determined by the template, so the CNN can be considered a general-purpose computing architecture. Therefore, the templates, including the resistor, should be fully programmable. The difficulty in CNN design is the low accuracy of analog circuits, especially when they are designed under small silicon area and low power consumption constraints. For time-multiplexed image processing operations [3], an optimal array size should be sought based on accuracy, computational speed, and IC processing cost, because the cost increases rapidly with the array size.
3x3 CNN Test Chip
A fully programmable 3x3 CNN test chip has been fabricated in a 2 µm double-poly, double-metal CMOS process through MOSIS to investigate the problems that will be encountered when a larger array is fabricated. The designed CNN circuit has not only programmable templates but also a programmable resistor.
The size of a cell is 0.16 mm² and the static power consumption is 1.5 mW/cell.
The main circuits in our proposed CNN are multipliers and integrators. Figure 1(a) shows the circuit diagram of the limiter and the multiplier. The limiter provides some tolerance against charge leakage through the nonideal switches during the initialization stage. The same circuit topology is used to implement a programmable resistor; in that case, the differential pair in the limiter is replaced with a linearized differential pair [4] for wider dynamic range. The integrator uses an opamp to reduce the loading effect of the multipliers. A transconductor (not shown in the figure) is provided at the summing node of each cell. Even though a single-ended implementation has higher offset than a fully differential one, the multiplier and the integrator are single-ended to avoid complex interconnections and common-mode feedback circuits. A current mirror is provided for each multiplier instead of a single large one at the summing node. This scheme distributes the devices over the silicon and helps keep the mean offset near zero.
Several control switches are provided around the integrator (see Figure 1(b)). The switches operate as follows:
1. Turn off the start switch and reset all the cells at the same time.
2. Set the address to turn on the select switch, turn on the state-loading switch, and apply the initial state x(0) to the input node.
3. Turn on the input-loading switch and apply the input u to the input node.
4. Repeat 2 and 3 for each cell.
5. Turn on the start switch.
6. Read out the result cell by cell.
The start switch should be kept on until all the results are read out, because turning it off causes feedthrough.
0-7803-3261-X/96/$5.00 © 1996 IEEE
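The six-step switch sequence above can be written out as an ordered list of control actions. The following is only an illustrative sketch: the function name and the action strings are hypothetical, and nothing here models the actual chip interface.

```python
from typing import List


def programming_sequence(rows: int, cols: int) -> List[str]:
    """Return the ordered control actions for loading and running the array.

    The action strings are hypothetical labels, not signal names from the chip.
    """
    ops = ["start=off; reset all cells"]                           # step 1
    for r in range(rows):                                          # step 4: repeat per cell
        for c in range(cols):
            ops.append(f"select ({r},{c}); load initial state")    # step 2
            ops.append(f"cell ({r},{c}); load input")              # step 3
    ops.append("start=on")                                         # step 5
    for r in range(rows):
        for c in range(cols):
            ops.append(f"read ({r},{c})")                          # step 6
    # The start switch is deliberately not turned off here: the text requires
    # it to stay on until every result has been read out.
    return ops


if __name__ == "__main__":
    for op in programming_sequence(3, 3):
        print(op)
```

For the 3x3 test chip this yields one reset action, eighteen load actions, one start action, and nine reads.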
Yield Analysis
A functional yield analysis was carried out based on the layout of our design. This analysis allows us to study the manufacturing feasibility of large CNN arrays taking into account the environmental conditions of the silicon fab. Parametric faults are not considered and only the functional behavior is inspected.
In mature manufacturing lines, spot defects are the main detractors from IC yield. They manifest as local disturbances of the silicon layer structures, caused mainly by dust particles, process variability, and contamination of the fabrication equipment. Spot defects are in essence random phenomena that occur on the wafer with a stochastic size and a certain frequency per unit area (the defect density). To verify the robustness of the design when it is exposed to defects in a real manufacturing environment, it is necessary to extract its critical areas. The critical areas are the places in the layout where a defect can induce catastrophic behavior in the IC. For instance, an extra-material spot defect can cause a short, and a missing-material spot defect can cause an open circuit. A figure of merit for the design's vulnerability, known as the defect sensitivity, is the ratio of the total critical area to the total layout area. By combining the defect sensitivity with the actual defect size distribution in the manufacturing line, one can predict the design's average probability of failure, evaluated as the integral of the sensitivity times the defect size distribution. Finally, yield is the probability of manufacturing ICs without faults, taking the previously discussed issues into account. In our projections we use the negative binomial yield formula
$$Y = \left(1 + \frac{n A_c D \phi}{\alpha}\right)^{-\alpha} \qquad (1)$$
where $A_c$ is the cell area, n is the number of cells, D is the defect density, $\alpha$ is a defect clustering parameter, and $\phi$ ($\phi$ = 0.15) is the probability of failure taking into account design and environmental conditions. For our simulations we assumed a defect size distribution that follows the 1/x³ law [5], a peak defect size of 1 µm, a defect density of 1 defect/cm², and $\alpha$ = 2, which is a reasonable clustering. These parameters are representative of "clean" manufacturing.
Figure 2(c) presents the corresponding yield projections up to an array size of 10,000 cells (a 100 x 100 array). These results show that, with the current technology and the present design, a 100 x 100 array is infeasible, since only a 21% yield can be attained. Other approaches, such as MCMs or multiplexing with smaller chip sizes, should be sought.
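The projection can be checked numerically. The sketch below assumes the common negative binomial form Y = (1 + n·A_c·D·φ/α)^(−α), which is one reading consistent with the symbols listed in the text, together with the stated parameters (cell area 0.16 mm² = 0.0016 cm², D = 1 defect/cm², α = 2, φ = 0.15). The function name and argument names are illustrative.

```python
def negative_binomial_yield(n_cells: float,
                            cell_area_cm2: float = 0.0016,
                            defect_density: float = 1.0,
                            alpha: float = 2.0,
                            phi: float = 0.15) -> float:
    """Yield of an n-cell array under the assumed negative binomial model.

    Assumed form: Y = (1 + n * A_c * D * phi / alpha) ** (-alpha).
    """
    expected_faults = n_cells * cell_area_cm2 * defect_density * phi
    return (1.0 + expected_faults / alpha) ** (-alpha)


if __name__ == "__main__":
    for side in (3, 10, 50, 100):
        y = negative_binomial_yield(side * side)
        print(f"{side} x {side} array: yield = {y:.3f}")
```

Under these assumptions a 100 x 100 array evaluates to a yield of about 0.21, in line with the 21% figure quoted above.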
Optimal Multiplexing Level
Originally, the CNN is intended to be a fully parallel processing architecture. However, the hardware implementation of an array large enough to process a practical image size is extremely difficult and costly with current integration technology. Fully parallel processing does realize high processing speed, but its speedup over multiplexed pseudo-parallel processing is not significant when the image is large, unless a fully parallel input and output scheme is provided. In a fully parallel architecture with sequential I/O data, the processing time, $\tau_p$, for an image of n pixels can be approximated as
$$\tau_p = n\,t_{io} + t_d \qquad (2)$$
where $t_{io}$ is the time to read and write a pixel, and $t_d$ is the time spent on the dynamics until all the cells have converged. The n-pixel image can be divided into k sub-images and then processed by an array of n/k cells that sweeps over the whole image once. If the feature size is much smaller than the image size and the degree of multiplexing is not high, then the I/O time for overlaps is negligible. Therefore, the total processing time of a multiplexing implementation, $\tau_m$, can be roughly evaluated as
$$\tau_m = n\,t_{io} + k\,t_d \qquad (3)$$
[Figure 2 caption fragment: ... layer; (c) expected yield.]
The efficiency, s, of multiplexed processing relative to fully parallel processing can be expressed as a factor that measures how much slower one approach is than the other. We call this factor the "slow-down" and define it by (4), under the assumption that $t_{io} = t_d$. Clearly, the slow-down factor due to multiplexing gets smaller as the image size increases.
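The timing comparison can be sketched as follows. The processing times are modeled as sequential I/O for all n pixels plus one settling period per array load (one for the fully parallel array, k for the multiplexed one). The slow-down is taken here as the relative extra time, (τ_m − τ_p)/τ_p, which is an assumed reading of (4); with equal pixel I/O and settling times it reduces to (k − 1)/(n + 1). All function names are illustrative.

```python
def parallel_time(n_pixels: int, t_io: float, t_d: float) -> float:
    """Sequential I/O for every pixel plus one settling period."""
    return n_pixels * t_io + t_d


def multiplexed_time(n_pixels: int, k: int, t_io: float, t_d: float) -> float:
    """Sequential I/O for every pixel plus k settling periods (one per sub-image)."""
    return n_pixels * t_io + k * t_d


def relative_slowdown(n_pixels: int, k: int,
                      t_io: float = 1.0, t_d: float = 1.0) -> float:
    """Assumed slow-down measure: extra time of multiplexing relative to full parallelism."""
    tp = parallel_time(n_pixels, t_io, t_d)
    tm = multiplexed_time(n_pixels, k, t_io, t_d)
    return (tm - tp) / tp


if __name__ == "__main__":
    for n, k in ((100 * 100, 4), (200 * 200, 16)):
        print(f"n = {n}, k = {k}: slow-down = {100 * relative_slowdown(n, k):.4f} %")
```

Under this reading, both a 50x50 array on a 100x100 image (k = 4) and on a 200x200 image (k = 16) stay below the 0.04% slow-down quoted later in this section.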
The manufacturing cost of an integrated circuit is roughly inversely proportional to its yield. Based on the yield model in (1), the cost of an array of n/k cells can be normalized by the cost for k = 1, as given by (5).
The optimal multiplexing level, k, for a given image size, n, can be found by the L-curve optimization method [6], considering processing time and implementation cost. Figure 3 shows L-curves for n = 100x100 and n = 200x200. The corner of each curve is the point where the speedup starts to saturate as the cost increases; in other words, it is the point where significant cost savings are obtained without a significant processing time penalty. From Figure 3, the optimal multiplexing levels are k = 3 for a 100x100 image and k = 5 for a 200x200 image. The cost of a 50x50 array (corresponding to k = 4 for n = 100x100 and k = 16 for n = 200x200) is around 30% of that of a 100x100 array and around 2% of that of a 200x200 array, while the slow-down factor stays below 0.04%. The advantage of multiplexing is clearer for large images. Of course, this is a crude approximation: the time evaluation depends on the architecture of the circuits and the required overlap size, and the cost evaluation depends on the process. If a detailed time evaluation and cost function are available, this method can determine the optimal degree of multiplexing. One advantage of the L-curve method is that the time and cost evaluations need not be analytical functions; they can be numerical data with respect to the array size.
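Because the L-curve method only needs numerical time and cost data, the trade-off is easy to tabulate. The sketch below rests on several assumptions that are not taken from the paper: cost is modeled as exactly 1/yield normalized to k = 1, yield follows the assumed negative binomial form with the stated parameters, the slow-down is (k − 1)/(n + 1), and the corner is located with a simple heuristic (the point nearest the origin after scaling both axes to [0, 1]) rather than the method of [6]. The resulting values therefore need not match those read from Figure 3.

```python
from typing import List, Sequence


def array_yield(n_cells: float, cell_area_cm2: float = 0.0016,
                defect_density: float = 1.0, alpha: float = 2.0,
                phi: float = 0.15) -> float:
    """Assumed negative binomial yield of an n-cell array."""
    return (1.0 + n_cells * cell_area_cm2 * defect_density * phi / alpha) ** (-alpha)


def normalized_cost(n_pixels: int, k: int) -> float:
    """Assumption: cost is proportional to 1/yield, normalized to the k = 1 array."""
    return array_yield(n_pixels) / array_yield(n_pixels / k)


def relative_slowdown(n_pixels: int, k: int) -> float:
    """Assumed slow-down with equal pixel I/O time and settling time."""
    return (k - 1) / (n_pixels + 1)


def scale_unit(values: Sequence[float]) -> List[float]:
    """Scale a sequence linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def l_curve_corner(n_pixels: int, k_values: Sequence[int]) -> int:
    """Heuristic corner: the k whose scaled (time, cost) point is nearest the origin."""
    times = scale_unit([relative_slowdown(n_pixels, k) for k in k_values])
    costs = scale_unit([normalized_cost(n_pixels, k) for k in k_values])
    distances = [(t * t + c * c) ** 0.5 for t, c in zip(times, costs)]
    return k_values[distances.index(min(distances))]


if __name__ == "__main__":
    ks = list(range(1, 17))
    for n in (100 * 100, 200 * 200):
        print(f"n = {n}: heuristic corner at k = {l_curve_corner(n, ks)}")
```

The corner found this way depends on the range of k that is scanned, which is one reason a tabulated L-curve is usually inspected by eye as well.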
Offset Cancellation
The CNN is a system of nonlinear differential equations with a sparse coefficient matrix, expressed as (6), where C, R, $a_i$, $b_i$, and I are constants, f is a nonlinear function, $u_i$ is an external excitation, $\Omega$ is an index set, and $N_j$ is a subset of $\Omega$. A large number of scientific and engineering problems belong to this family of systems. Even though the equation itself has a very simple form, its hardware is subject to random process variations. Considering these variations, equation (6) becomes a system of stochastic nonlinear differential equations. Suppose a building block is modeled by $(a + \Delta a)\,s + a_o$, where s is the input, a is the desired gain, $\Delta a$ is the variation of the gain, and $a_o$ is the output-referred offset. Then (6) for the jth cell can be modified to (7), and some manipulation of (7) leads to (8).
The first part of the right-hand side of (8) is identical to (6). The second part is the dynamic perturbation, which depends on the trajectory of the state and the input; it is mainly caused by the transconductance variation of the devices. The third part is the static offset, which is caused mainly by device mismatches. It can be canceled by providing an additional compensation current for each cell.
Once the array is fabricated, the offset is not directly observable. If the input of the hard limiter can be disconnected from the state node, then the following indirect cancellation scheme will cancel the static offset.
1. Ground the input of the hard limiter. Set $u_i$, R, and I to zero. Set the desired templates, $a_i$ and $b_i$, for all cells.
2. Start the dynamics and tune the compensation current, $I_{comp}$, so that $x_j$ stays equal to zero.
This eliminates the first two parts in (8).
Even though this scheme cannot cancel the dynamic perturbation, it cancels the entire static offset. It requires tuning circuits in every cell; this additional circuitry may increase the silicon area and power consumption and make the implementation of a large array more difficult. A global compensation method was discussed in [7], but as the area becomes very large the global compensation becomes less effective.
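The tuning step can be illustrated with a toy model of a single cell. The assumptions are loud ones: with the limiter input grounded and the input, bias, and resistor path disabled, the state capacitor is taken to be driven only by the unknown static offset current plus the compensation current, so the state drifts at a rate (I_offset + I_comp)/C. The tuning loop, its gain, and all component values are hypothetical and do not describe the tuning circuit on the chip.

```python
def state_drift(i_offset: float, i_comp: float,
                capacitance: float = 1e-12, t_run: float = 1e-6) -> float:
    """State change after t_run, starting from x = 0, in the assumed toy model.

    Only the offset and compensation currents charge the capacitor.
    """
    return (i_offset + i_comp) * t_run / capacitance


def tune_compensation(i_offset: float, capacitance: float = 1e-12,
                      t_run: float = 1e-6, gain: float = 0.5,
                      iterations: int = 40) -> float:
    """Adjust the compensation current until the observed drift vanishes.

    The offset itself is never read directly; only the drift is observed,
    mirroring the indirect nature of the scheme.
    """
    i_comp = 0.0
    for _ in range(iterations):
        dx = state_drift(i_offset, i_comp, capacitance, t_run)
        # Convert the observed drift back to a current and remove part of it.
        i_comp -= gain * dx * capacitance / t_run
    return i_comp


if __name__ == "__main__":
    offset = 1e-7  # hypothetical 100 nA static offset
    comp = tune_compensation(offset)
    print(f"compensation current = {comp:.3e} A")
    print(f"residual drift = {state_drift(offset, comp):.3e} V")
```

In this model each iteration halves the residual current, so the compensation current converges to the negative of the static offset.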
Optimal Template Scaling with Programmable Resistor
The allowable absolute state voltage, $x_{allowed}$, is determined by the power supply. As a result, for proper operation with fixed R, the allowable absolute current, $I_{allowed}$, that can be injected into the lossy integrator is limited by (9), while the maximum absolute current actually injected into the integrator, $I_{max}$, is given by (10). Since the actual $I_{max}$ depends on the templates, the linear range of the hard limiter must be small enough to secure saturation for various templates. These situations are undesirable because the maximum possible signal swing should be used to minimize the effect of offset. There are two solutions to this problem: a variable power supply for the integrator, or a variable resistor. The variable resistor may be the easier solution. With a programmable resistor, template scaling becomes straightforward, as follows.
3. Calculate $I_{max}$ by (10).
4. Choose R by (9).
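The last two steps can be sketched numerically. Both relations used below are assumptions, since (9) and (10) are only referenced by number here: the steady state of a lossy integrator is taken as x = R·I, so the current limit is x_allowed/R, and the worst-case injected current is taken as the sum of the absolute template entries plus the absolute bias, with limiter outputs and inputs bounded by 1 in magnitude. The template values and the normalized units are hypothetical.

```python
from typing import Sequence


def max_injected_current(a_template: Sequence[float],
                         b_template: Sequence[float],
                         bias: float) -> float:
    """Assumed worst-case current: every |output| and |input| is at most 1."""
    return (sum(abs(v) for v in a_template)
            + sum(abs(v) for v in b_template)
            + abs(bias))


def choose_resistance(x_allowed: float, i_max: float) -> float:
    """Assumed rule: largest R for which R * i_max stays within the allowed state swing."""
    return x_allowed / i_max


if __name__ == "__main__":
    # Hypothetical 3x3 templates in normalized units (row-major order).
    a = [0, 0, 0, 0, 2, 0, 0, 0, 0]
    b = [-1, -1, -1, -1, 8, -1, -1, -1, -1]
    i_max = max_injected_current(a, b, bias=-1.0)
    r = choose_resistance(x_allowed=1.9, i_max=i_max)
    print(f"I_max = {i_max}, R = {r:.4f}")
```

Choosing the largest admissible R uses the full available state swing, which is the stated motivation for making the resistor programmable.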
Experimental Results
Four 3x3 array chips were fabricated in a 2 µm double-poly, double-metal CMOS process. In our design, only the state of the center cell is observable in continuous time. The four chips had random offsets, which significantly degraded their performance.
Conclusions
The fabricated test chips show that only a few of the defect-free chips will work properly. This means that if the pass rate in testing and the testing cost are included, the actual cost may be higher than predicted by the cost model used in this paper.
Considering a square image, the optimization analysis suggests an optimal 50x50 array for a 100x100 image. This array size is suboptimal for a larger image. The projected yield for this 50x50 array is around 70%. The implementation cost is only 2% of that of a 200x200 array, while the slow-down factor is less than 0.04%. Considering all these points, a 50x50 array is deemed an optimal array size with the current technology. Since the multiplexing architecture lends itself to pipelining, real-time image processing may be achieved at low cost. If a new integration technology is introduced, the optimal array size and the attainable image resolution will increase.
