Reconstruction of given noisy data is an ill-posed problem and a computationally intensive task. Nonlinear regularization techniques are used to find a unique solution under certain constraints. In our contribution we present a parallel mixed-signal architecture which solves this nonlinear problem within microseconds. By connecting all parallel cells in a circular manner it is possible to process noisy data vectors of infinite length. This is achieved by virtually shifting the nonlinear adaptive filter kernel over the noisy data vector. Additionally, we focus on the interaction between theory, discretization, numerical simulations, macro-modeling, and analog VLSI implementation for a theoretically well understood class of computer vision methods in an exemplary and paradigmatic way. A one-dimensional (1D) experimental chip has been fabricated using 0.8 µm CMOS technology. On-chip measurements are shown to agree with results from numerical simulations. Results from applying the 1D chip to nonlinear smoothing of two-dimensional image data are also given.
Introduction

Overview
An important research topic in computer vision is the variational approach to nonlinear preprocessing (adaptive smoothing) of images. The goal is to overcome current techniques which are based on weak heuristics and hardly manageable process parameters. Any progress in this research has immediate positive consequences for solving real-life applications of computer vision.
Adaptive smoothing of images means computation of smooth regions and preservation of intensity transitions using raw image data. Since local computer vision algorithms for segmentation of regions and detection of transitions behave complementarily, the necessity of integrating both processes into one approach is obvious. A recent formulation of this problem is an optimization scheme in the context of variational calculus, a view which subsumes conventional approaches in computer vision (e.g., "split-and-merge").
Since all approaches from this class result in iterative solutions of sparse nonlinear equations, analog VLSI implementations allow for real-time application, i.e. at (video) frame rates.
The most powerful, in the sense of most general, variational approach was suggested by Mumford and Shah [1]. The theory of this model (e.g. properties of solutions) is the subject of intensive mathematical research and, with regard to efficient algorithms, a lot of work remains to be done. A drawback of this continuously formulated approach and its discretization is that the results do not depend continuously on the input data. Furthermore, as yet, an analog VLSI implementation seems to be impossible to achieve.
A class of variational approaches which are less powerful but have globally favorable properties was proposed by Schnörr [2] (see the Problem statement section). As a special case this class subsumes a continuously formulated approach which resembles an approach by Harris et al. [3], suggested for an analog implementation, but avoids some of its drawbacks (e.g. dependence on coordinates, non-controllable smoothing behavior). Furthermore, our underlying mathematical model allows for the design of approaches given arbitrary data types (color, motion, etc.).
Analog VLSI technology is experiencing a remarkable renaissance in the context of hardware implementation of cellular neural networks (CNNs), opto-electrical systems (vision chips), and micro systems. Advantages are, for example, time- and amplitude-continuous processing or the possibility of massive parallelism with a current dynamic range of seven decades. The main problems of an analog implementation are the device imperfections (mismatch) of MOS components, the storage of analog signals, the stability of feedback circuits, and the CAD support for the design of circuits with higher complexity. The mismatch of MOS components can be handled, as far as possible, with symmetrical component layout and furthermore with an individual adjustment of single circuit units. Storage of analog signals can be achieved for short periods with on-chip capacitors, for medium periods with amplitude quantization along with refreshing, and for long periods with isolated charge packets in floating gates. Up to now, the design of complex analog circuits cannot be fully automated. Macro-models are created for the hierarchical design; nevertheless, the designer has to know details about the model structure and physical parasitic effects. At this point, analog design differs from digital design and remains for the time being in the domain of electronic engineers. For the implementation of resistive nonlinear networks one has to carefully consider the monotone behavior of the nonlinear characteristics, which is mandatory for stability since otherwise global oscillation may occur. This monotonicity condition must also not be violated in a parasitic or dynamic manner by mismatch of circuit units, and consequently represents a non-trivial boundary condition for the circuit designer.
Related work
Low-level feature extraction and image smoothing are key issues in image processing and computer vision. Variational approaches [1, 4-6] provide a mathematically concise problem formulation (cf. the survey [7]) that is superior to ad-hoc smoothing schemes. A common problem with these approaches, however, is their computational complexity from an optimization point of view. Stochastic optimization [4] is not feasible for typical image sizes, and deterministic annealing procedures [5, 6] cannot guarantee to obtain a "good" local minimum. Therefore, the use of non-quadratic but convex functionals has been advocated to simplify image smoothing from a computational viewpoint [2, 8, 9]. Furthermore, although much simpler, convex functionals nevertheless provide reasonable approximations (cf. [10]) to the prototypical but mathematically and computationally sophisticated variational smoothing approach of Mumford and Shah [1]. Despite efficient digital implementations of numerical schemes for solving variational equations [11], the computational effort prohibits applications for which frame-rate performance is required. As an alternative, analog hardware concepts make it possible to map classes of early vision problems onto time-continuous high-speed analog circuitry by the direct use of voltage, current and charge relationships of the physical devices [12].
The behavioral characteristic of electronic circuits as well as the solution of variational equations can be described by nonlinear differential equations [2] . An analog circuit can solve the equations in fractions of a second. The speed is limited only by the unavoidable parasitic capacitors and the finite power dissipation [13] . This is the major motivation for the design of analog VLSI networks suitable for solving such early vision problems.
Inspired by biology, Koch and Poggio [14, 15] investigated the relationship between mathematically stated regularization schemes and analog networks.
K. WIEHLER ET AL.
Kirchhoff's current and voltage laws, which represent conservation and continuity restrictions satisfied by each network component, solve for the regularization. They derived a direct relationship between the generalized energy of a nonlinear resistor (co-content) and the theoretical regularization expression. From this viewpoint, the properties of the electronic devices can be analyzed regarding their effects on the solution. Mathematically well defined characteristics like convexity, consistency, and uniqueness can be controlled directly by the features of the electronic devices used. Therefore, it is possible to describe prerequisites on the circuit level which were originally defined on higher mathematical levels.
For the realization of those networks, substantial work has been carried out by Mead et al. [12, 16, 17] by utilizing sub-threshold standard MOS technology for massively-parallel analog signal processing. Good examples are optical motion sensors, silicon retinas, etc. By including optical sensors (CMOS-compatible photo diodes and transistors) vision systems can be implemented on one single chip. For current successful implementations see, for example, [18] .
Building upon Mead's fundamental research, Harris [3, 19] implemented nonlinear resistive networks for early vision tasks. Utilizing resistive fuse elements he implemented the non-convex approach of Blake and Zisserman [5] (weak membrane model). He achieved a robust solution by a graduated convexity mechanism which is inherent to his resistive fuse circuit. The so-called tiny-tanh network is a further development of Sivilotti's [20] nonlinear saturating resistor. By introducing a high-gain positive feedback, the network resembles a discrete TV-norm realization [21]. Nevertheless, the implementation lacks both an efficient input-output mechanism and the possibility of controlling the regularization parameters.
The most recent implementations of resistive fuse algorithms, which utilize floating-gate MOS transistors both for controlling the transfer characteristic and for long-term error compensation, can be found in [22].
Also, in the last decade a lot of research in the field of massively-parallel hardware has been triggered by the Cellular Neural Network (CNN) theory introduced by Chua and Yang [23, 24]. A CNN is defined on a time-continuous, discrete grid and consists of a uniform array of analog nonlinear dynamic computing cells. The cells are interconnected in a local neighborhood and contain no learning scheme or adaptation mechanism, and are therefore well suited for direct analog implementation. Discretized formulations of the regularization equation proposed in this paper (see Theory section) can be restated in the CNN framework [25] as well. The model has to be altered to the hardware-friendly full-range model [26] using nonlinear feedback templates. Nevertheless, the CNN framework is defined on a discrete grid and is therefore limited in its applications.
Various CMOS VLSI implementations of CNNs have been reported [27] . They can be discriminated by fixed [28] or variable [29, 30] templates, current [26] or voltage mode [31] , full [32] or limited [33] CNN-model, on-chip [34] or off-chip sensors. They are referenced here in detail as a starting point for further investigations in the field of analog VLSI systems for signal processing.
Contribution and organization
In our paper, we present a massively-parallel VLSI hardware implementation of a nonlinear smoothing approach (one-dimensional (1D) case, 32 nodes) for applications in computer vision as well as a special architecture for processing 1D data streams. By analog VLSI technology, limits in performance of iterative solutions on digital computers have been overcome. In particular, the interaction between theory, discretization, numerical simulations, macro-modeling, and analog VLSI implementation for a theoretically well understood class of computer vision methods has been investigated in an exemplary and paradigmatic way (see also Figure 1). Characteristic of our work is the thorough validation on all levels of description and implementation.
In the second section we propose a convex variational approach for adaptive smoothing and sketch a 1D FEM discretization. We focus on the nonlinear characteristic transfer function whose analog implementation proved to be the hardest task for an analog circuit design. The feedback of circuit-mismatch for the case of the characteristic transfer function on the mathematical model is described in the third section (Transfer function constraints). Detailed investigation results in constraints for analog circuit design. Then (in Analog cells) we follow on with a brief review of technology imperfections in the context of analog VLSI circuits. Based on this, we present a constraint-driven top-down design for the core analog cell used for the 1D implementation. In
VLSI IMPLEMENTATION 3
the fourth section a systolic architecture tailored to time-continuous analog signal processing is described in detail. The results achieved with a 0.8 µm CMOS implementation of the so-called dynamic circular network (dCN) are shown in the fifth section. We conclude our paper in the final section.
Theory
Problem statement
A convex variational approach for adaptive smoothing was proposed by Schnörr [2] which results in the unique minimizer $u$ of the strictly convex functional
for given data $g$ and the non-quadratic function
The approach comprises essentially two parameters: $\lambda_h$ determines the degree of smoothing, whereas $c_\rho$ controls the adaption to signal variations. For the latter, the smoothing process becomes anisotropic at loci with high gradients and gradually stops at such loci along the direction of the image gradient. Hence, essential signal structures are preserved. Finally, $\lambda_l$ is a small positive constant ($\lambda_l = 0.1$, for example), the meaning of which is not relevant for the investigation presented here.
The global minimizer of Eqn (1) is the unique solution to the following variational equation [2] :
where the function $\rho(t) := \lambda'(t)/(2t)$ (see Figure 2) characterizes the adaptive smoothing process (1).
Partial integration shows that, at least formally, u may be considered as the steady-state solution of the nonlinear diffusion equation [2, 35] :
with Neumann boundary conditions.
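For orientation, the model of this section can be written out as follows. This is a sketch that assumes the standard form of the approach in [2] together with the definitions given in the text; the domain $\Omega$, the test function $v$, the space $H$, and the stationary way of writing the diffusion equation are notational assumptions.

```latex
\begin{align}
J(u) &= \frac{1}{2}\int_\Omega \Big\{ (u-g)^2 + \lambda(|\nabla u|) \Big\}\,dx, \\
\lambda(t) &=
  \begin{cases}
    \lambda_h^2\, t^2, & 0 \le t < c_\rho,\\[2pt]
    \lambda_l^2\, t^2 + (\lambda_h^2-\lambda_l^2)\, c_\rho\, (2t - c_\rho), & t \ge c_\rho,
  \end{cases} \\
\int_\Omega \Big\{ (u-g)\,v + \rho(|\nabla u|)\,\nabla u^{T}\nabla v \Big\}\,dx &= 0
  \quad \forall\, v \in H, \\
u - \nabla\cdot\big(\rho(|\nabla u|)\,\nabla u\big) &= g \ \text{ in } \Omega,
  \qquad \partial_n u = 0 \ \text{ on } \partial\Omega .
\end{align}
```

With $\rho(t) = \lambda'(t)/(2t)$ this form gives $\rho(t) = \lambda_h^2$ for $t < c_\rho$ and $\rho(t) = \lambda_l^2 + (\lambda_h^2 - \lambda_l^2)\,c_\rho/t$ otherwise, so the smoothing strength falls off once the gradient magnitude exceeds $c_\rho$.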
Discretization
For the discretization of variational problems, the finite element method (FEM) is the first choice. FEM can be applied in a mechanistic way, boundary conditions are incorporated automatically, and the discrete approach obtained is consistent in the sense that, under certain conditions [2], discrete solutions $u_h$ to (3), for example, converge to the continuous solution $u$ for vanishing mesh width. In this way discrete versions of our approach make it possible to maintain favorable properties of the underlying continuous problem formulation, like the rotational invariance of
the smoothness term, for example. For a sound introduction to FEM we refer to, e.g. [36] .
The basic idea of the finite element method is the restriction of optimization problems to finite-dimensional subspaces. Let $\{\phi_0, \ldots, \phi_n\}$ be basis functions of a finite-dimensional subspace $H_h \subset H$. Then, the restriction of (3) to $H_h$ reads:
with minimizer $u_h \in \operatorname{span}\{\phi_0, \ldots, \phi_n\}$. If we define the isomorphism
and the mappings
then the solution of (5) is equivalent to the solution of the nonlinear system:
As mentioned above, we know from FEM theory that the solutions $u_h$ converge to the solution $u$ of (3) if the formal discretization parameter $h$ approaches zero [2, 36].
Now we apply the FEM to the case of 1D discrete signal data, with a constant sampling rate of 1. Thus, our domain reads $\Omega = [0, n]$. The first step is to interpolate the signal data in a linear manner. Next, we assign to each sample $x_i$ a basis function $\phi_i$ which is uniquely defined by the following conditions: $\phi_i(x)$ is linear within each interval $[x_i, x_{i+1}]$:
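The piecewise-linear basis can be illustrated numerically. The sketch below assumes the usual nodal condition $\phi_i(x_k) = \delta_{ik}$ on the integer grid $x_i = i$; the function names `hat` and `interpolate` are hypothetical and not taken from the paper.

```python
import numpy as np

def hat(i, x):
    """Piecewise-linear basis function: 1 at node i, 0 at every other integer node."""
    return np.maximum(0.0, 1.0 - np.abs(np.asarray(x, dtype=float) - i))

def interpolate(samples, x):
    """Linear interpolant u_h(x) = sum_i samples[i] * hat(i, x) on [0, n]."""
    return sum(s * hat(i, x) for i, s in enumerate(samples))

samples = [0.0, 1.0, 0.5]                      # sample values at x = 0, 1, 2
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])        # nodes and interval midpoints
u_h = interpolate(samples, x)
```

Each basis function overlaps only its two neighbors, which is why integrals involving $\phi_i$ and $\phi_k$ vanish for $|i - k| > 1$ and the resulting system is sparse.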
Eqn (7) now reads:
The integral terms in (9) vanish for all $|i - k| > 1$. The remaining integral terms can be computed analytically to obtain a sparse system of nonlinear equations in terms of the sample variables $u$. Evaluating (9), we get
The discretization of (4) reads
with the transfer function $f_{nl}(t) = \rho(|t|)\,t$ (see Figure 3).
For solving (8) in case of 2 or 3 dimensions on digital computers, we have developed different efficient numerical schemes [11] . Although efficient implementation on parallel computers is possible, the computational effort 
prohibits applications for which (near) real-time performance is required. As an alternative, analog hardware concepts allow to map Eqn (8) onto time-continuous high-speed analog circuitry (see next section).
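To make the structure of the discrete problem concrete, the following sketch relaxes a simplified 1D system on a digital computer. It assumes a lumped nodal form, $u_i - g_i = f_{nl}(u_{i+1}-u_i) + f_{nl}(u_{i-1}-u_i)$, rather than the exact FEM equations, takes $f_{nl}$ as a linear gain $\lambda_h^2$ that saturates at $\pm\lambda_h^2 c_\rho$ (the case $\lambda_l = 0$), and uses explicit Euler steps as the iteration. The name `relax_1d` and its parameters are hypothetical.

```python
import numpy as np

def f_nl(t, lam_h2, c_rho):
    # Limiter characteristic (lambda_l = 0): linear gain, then saturation
    return lam_h2 * np.clip(t, -c_rho, c_rho)

def relax_1d(g, lam_h2=8.0, c_rho=0.05, n_iter=5000):
    """Iterate du/dt = g - u + (net nonlinear flux into each node) to steady state."""
    g = np.asarray(g, dtype=float)
    u = g.copy()
    dt = 1.0 / (1.0 + 4.0 * lam_h2)        # conservative explicit-Euler step size
    for _ in range(n_iter):
        flux = f_nl(np.diff(u), lam_h2, c_rho)   # flux through each link i -> i+1
        div = np.zeros_like(u)
        div[:-1] += flux                   # contribution of the right-hand link
        div[1:] -= flux                    # contribution of the left-hand link
        u += dt * (g - u + div)            # end nodes have one link only (zero flux)
    return u

step = np.concatenate([np.zeros(16), np.ones(16)])
u_edge = relax_1d(step, c_rho=0.01)        # small c_rho: the step survives
u_blur = relax_1d(step, c_rho=10.0)        # large c_rho: behaves like linear diffusion
```

In this sketch the analog network replaces the loop: the node capacitors integrate the same right-hand side continuously, so the iteration count corresponds to the settling time of the circuit.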
A Constraint-Driven Design
Transfer function constraints
In this section, we report on modeling the variational Eqn (11) with focus on the circuit implementation. Due to the limited accuracy of analog circuits (see Analog cells section) a proper analysis of deviations from the theoretical case is necessary. Since the nonlinear part of (1) dominates the image smoothing behavior, we emphasize the modeling of the transfer function $f_{nl}(t)$.
The nonlinear transfer function $f_{nl}(t)$ is implemented by a simple amplifier-limiter circuit (see Figure 4) which is specified as follows:
The approximate transfer function (12) corresponds to (2) for the case of $\lambda_l = 0$. Due to the real circuit behavior, the transfer function is monotonically increasing. Our experiments have shown that small variations of the shape of the transfer function $f_{nl}(t)$ do not decisively influence the results of the image smoothing process.
However, the results are more sensitive to offset errors and inaccuracies of the smoothing parameters $\lambda_h$ and $c_\rho$ in the transfer function (see Figure 5). These errors, which are caused by imperfections of the underlying semiconductor manufacturing process, can be categorized further as follows (see also Table 1):
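The relation between $\rho$ and the amplifier-limiter shape can be checked with a few lines of code. The sketch assumes the piecewise form $\rho(t) = \lambda_h^2$ for $t < c_\rho$ and $\rho(t) = \lambda_l^2 + (\lambda_h^2 - \lambda_l^2)\,c_\rho/t$ otherwise, which follows from $\rho(t) = \lambda'(t)/(2t)$ for the non-quadratic function of [2]; the function names and default values are hypothetical.

```python
import numpy as np

def rho(t, lam_h=1.0, c_rho=0.05, lam_l=0.0):
    """Diffusivity rho(|t|), assumed piecewise form (see lead-in)."""
    t = np.abs(np.asarray(t, dtype=float))
    safe_t = np.maximum(t, c_rho)          # avoids division by zero below c_rho
    high = lam_l**2 + (lam_h**2 - lam_l**2) * c_rho / safe_t
    return np.where(t < c_rho, lam_h**2, high)

def f_nl(t, lam_h=1.0, c_rho=0.05, lam_l=0.0):
    """Transfer function f_nl(t) = rho(|t|) * t."""
    t = np.asarray(t, dtype=float)
    return rho(t, lam_h, c_rho, lam_l) * t

# lam_l = 0: linear gain lam_h**2 up to c_rho, constant output beyond it
small = f_nl(0.01, lam_h=2.0, c_rho=0.05)   # inside the linear region
large = f_nl(1.0, lam_h=2.0, c_rho=0.05)    # saturated at lam_h**2 * c_rho
```

For $\lambda_l > 0$ the saturated branch keeps a small residual slope $\lambda_l^2$, so the limiter with a flat top is the $\lambda_l = 0$ special case.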
(1) the input offset error, which is mainly caused by threshold voltage deviations of the input differential pair; (2) the output error, which is caused by current mirror mismatch in the output current limiter; (3) errors in the saturation level (also caused by current mirror mismatch); and (4) gain errors of the input stage.
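The four error categories can be expressed as perturbations of an ideal limiter. The parametrization below is an illustrative assumption (relative gain and saturation errors, additive input and output offsets), not the model used for Tables 1 and 2; all names are hypothetical.

```python
import numpy as np

def f_nl_ideal(t, lam_h2=8.0, c_rho=0.05):
    """Ideal amplifier-limiter: gain lam_h2, output saturating at +/- lam_h2 * c_rho."""
    return lam_h2 * np.clip(np.asarray(t, dtype=float), -c_rho, c_rho)

def f_nl_perturbed(t, lam_h2=8.0, c_rho=0.05,
                   in_offset=0.0, out_offset=0.0, sat_error=0.0, gain_error=0.0):
    """Limiter with the four error types of the list above (assumed parametrization)."""
    t = np.asarray(t, dtype=float)
    gain = lam_h2 * (1.0 + gain_error)                  # (4) gain error of the input stage
    sat = lam_h2 * c_rho * (1.0 + sat_error)            # (3) error in the saturation level
    core = np.clip(gain * (t + in_offset), -sat, sat)   # (1) offset at the input
    return core + out_offset                            # (2) offset at the output

t = np.linspace(-0.2, 0.2, 9)
nominal = f_nl_perturbed(t)                             # all errors zero
shifted = f_nl_perturbed(0.0, in_offset=0.002)          # nonzero output at zero input
```

Using one function with four error arguments makes it easy to apply the same perturbation to every node (a global error) or to draw separate values per node (a local error).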
Regarding the analog implementation we expect two types of errors: on the one hand, global (or systematic) errors, which affect all nodes in the same way, and on the other hand, local (or stochastic) errors, which affect each node individually (for a review of errors in analog circuits see the next section).
A thorough investigation of all errors, and also of the interaction between them, has been carried out; we summarize it in Table 2. In the first row, limits for the global error $\delta$ are given. The second row lists the limits for the local error $\sigma$ (normally distributed offsets and parameters with standard deviation $\sigma$). It can be seen that the influence of the local error is stronger than the influence of the global error.
Analog cells
As described in the previous section, imperfections of the underlying microelectronic devices may cause a serious degradation of the results. Especially for analog cells, the exact performance cannot be predicted and thus the design process must consider technology imperfections at an early stage.
In analog circuit design the resulting errors can be categorized as shown in Figure 6. Systematic errors are caused by different operating points of the various transistors (e.g. a simple current mirror copies the current exactly only if the source-drain voltage is equal for both transistors). They can be reduced by carefully choosing the circuit topology and/or transistor parameters, or by adding auxiliary transistors which keep the essential transistors in a constant bias condition (cascode transistors). These errors can be simulated very accurately with standard simulation tools. Therefore, their impact on the results can be predicted by appropriate back annotation into high-level simulation tools (see preceding section). For the application of adaptive signal smoothing the most important systematic errors are the offset error and the saturation level of the nonlinear transfer characteristic (see Figure 5). The error should be as small as possible and independent of the data input and output.
Random errors are caused by technology imperfections [37]. Due to this random influence even equal transistors in the schematic behave slightly differently in their corresponding realization (mismatch). Moreover, one can distinguish between intra-die and inter-die variations. The former takes only the transistors on one chip into account, whereas the latter considers the variations between transistors on different chips. For most applications of analog computational circuits, the performance is determined by the interaction of matched neighboring transistors (differential pairs, current mirrors, translinear loops, etc.). The accuracy is reduced by the mismatch of these devices. Therefore, the intra-die variation is the most important error source. The larger errors due to inter-die variations have to be small enough to guarantee a sufficient bias condition on all chips (yield) for proper performance. For the case of cascading multiple analog chips an appropriate compensation method has to be implemented [38].
To reduce the impact of the imperfections, different levels of the design hierarchy have to be taken into consideration: (i) algorithmic level; (ii) schematic level; and (iii) layout level. The more abstract the design level, the higher the possibility for essential reductions. First, at the algorithmic level, the system has to be insensitive to mismatch; this can be achieved, for example, by reducing the input/output value range (increasing the signal-to-noise ratio, e.g. the CNN full-range model [26]). In our case the algorithm depends continuously on the regularization parameters and is therefore inherently insensitive to slight variations of the transfer function characteristic (see second section).
On the schematic level, the type of signal representation (voltage, current, pulses) has to be considered carefully. The choice depends on both the preferred input and output signals and the mandatory signal transformations. Furthermore, the operation mode, weak or strong inversion, has to be taken into account. Strong inversion circuits offer better accuracy and higher speed compared to circuits in weak inversion mode [39]. Nevertheless, the latter have a bipolar-like transfer characteristic which could be used to build efficient circuits for special applications (translinear circuits [40]).
On the layout level careful placement and routing techniques have to be used to reduce mismatch as far as possible [41] .
By introducing the physical normalization values $I_{unit}$, $V_{unit}$ and $g_{unit} = I_{unit}/V_{unit}$ for the current, voltage and transconductance respectively, the system's Eqn (11) can be mapped to a voltage/current relationship:
where $v_{g_i} = V_{unit}\, g_i$ and $v_{u_i} = V_{unit}\, u_i$ are the corresponding voltage values of the input and state variables used in the implementation. $f^{ota}_{nl}$ corresponds to $f_{nl}$ in Eqn (11), taking a voltage as an input and a current as an output signal. $C$ is the capacitor which determines the time constant $\tau$ of the system: $\tau \approx C/g_{unit}$.
The key element of the nonlinear relaxation scheme is the controllable piecewise-linear transfer function $f^{ota}_{nl}$ in the above equation. The circuitry for this element should have minimum offset and saturation value errors. The smaller these variations, the larger the range of the relaxation parameter can be made (see previous section). Considering Eqn (12) again, an architecture based on operational transconductance amplifiers (OTAs) has been chosen (Figure 7; the switches are used for the reconfiguration as described in the next section). Input and output signals are voltages while the summation as well as the nonlinear operator are realized with currents.
The circuit used for the nonlinearity is depicted in Figure 8. It is a standard OTA followed by a current limiter. The systematic errors are further reduced by introducing cascode transistors in the output stage. The random offset value of the circuit is dominated by the offset voltage error of the input differential stage. For low variations, the signal-to-bias ratio is an important figure in circuit design [30]. Since the bias current of the limiter is in the range of the saturation current itself, the variation of the saturation current is as low as possible. Furthermore, the layout was done after careful consideration of the matching topology in the circuit. All circuits operate in strong inversion. The current levels are in the range of microamperes, which results in relaxation time constants of microseconds for the capacitors used ($\approx 1$ pF).
Figure 6. Errors in analog VLSI.
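The order of magnitude of the relaxation time constant can be checked with a short calculation. The sketch assumes that the time constant is the node capacitance divided by the unit transconductance (the only combination of the two with units of seconds), a unit current of 1 µA, and a unit voltage of 1 V; the unit voltage is not stated in the text and is purely an assumption.

```python
C = 1e-12        # node capacitance in farads (about 1 pF, from the text)
I_unit = 1e-6    # assumed unit current: 1 microampere
V_unit = 1.0     # assumed unit voltage: 1 volt (not stated in the text)

g_unit = I_unit / V_unit   # unit transconductance in siemens
tau = C / g_unit           # time constant in seconds

print(f"tau = {tau * 1e6:.2f} us")
```

Under these assumptions the time constant comes out at about one microsecond, which is consistent with the microsecond relaxation times quoted above.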
Based on an analytical error analysis, the simple core circuits were optimized regarding their minimum possible variations [42] . With these results, area and power consumption can be calculated for the specification of the application. The result of a Monte-Carlo simulation for the used nonlinear circuit is depicted in Figure 9 . The left and right histograms show the offset voltage error before (at point A in the schematic in Figure 8 ) and after the current limiter (at the output node). As expected from theory the error is dominated by the input stage. With these results and the constraints given by the high-level analysis of the algorithm the impact of mismatch can be predicted accurately for the system level.
Analog architecture-dynamic Circular Network (dCN)
Based on the 1D discretization of the convex functional (11) an architecture has been developed. The design has been driven by the required range of the parameter space ($c_\rho$ and $\lambda_h^2$) and in addition relies on a perturbation analysis indicating the minimum accuracy necessary to achieve satisfactory performance (see previous section).
A dynamic circular architecture for processing infinite signals
A major problem in designing analog circuits for signal processing is the interfacing to the signal source and the subsequent (digital) processing unit. When the signal is generated off-chip, the signal must be fed into the parallel structure of the analog unit. Usually this is done by simply writing the input to the isolated cells. During this time all cells act as memory elements, and afterwards the array "computes" the result within microseconds, which is subsequently read out (e.g. [30]). Such a design yields a rather poor computational performance since the effective computation time is low in comparison to the time used for data input/output. Another problem arises when the signal vector is larger than the chip cell array. In this case the signal vector has to be broken apart into overlapping blocks which have to be processed sequentially. As a consequence, the parts have to be merged in a further post-processing step [43].
Our solution to these problems is based on a dynamically reconfigurable circular architecture which is shifted along the data vector at a very high rate [44]. For the 1D case the neighborhood $N_r(i)$ of the cell $C_i$ within a radius $r$ on a grid $D$ is defined by (Figure 10(a)).
$\langle \cdot \rangle_N$ is the modulo $N$ operation.
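A circular neighborhood of this kind can be computed with modular index arithmetic. The sketch assumes that the neighborhood consists of all cells whose index differs from $i$ by at most $r$ modulo $N$, including the cell itself; the function name `neighborhood` is hypothetical.

```python
def neighborhood(i, r, n):
    """Indices of the cells within radius r of cell i on a ring of n cells."""
    # Python's % always returns a non-negative result for positive n,
    # so indices wrap around both ends of the ring.
    return [(i + k) % n for k in range(-r, r + 1)]

first = neighborhood(0, 1, 32)    # wraps to the last cell on the left
last = neighborhood(31, 2, 32)    # wraps to the first cells on the right
```

Because of the wrap-around, no cell on the ring is structurally a boundary cell; boundaries are created only by opening a link, which is what allows them to move.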
The shift mechanism of the network first replaces the static cell links with programmable switches and, second, allows each cell to operate in different working modes (see (16)-(18) below).
Let $u_i$ denote the state variable of the cell $C_i$. Following Eqn (11) the dynamic behavior of the cell is defined by
$N(i)$ is the local neighborhood modulo the network boundaries as defined in (14). The terms remaining constant during relaxation for node $C_i$ are denoted by $\tilde g_i$ for inner cells and by $\tilde g_i^0$ for boundary cells. To operate the network in the dynamic manner as described above, each cell can be programmed to satisfy one of the following conditions (all index calculations are modulo $N$):
(16) fully connected cell;
(17) left (right) boundary cell with von Neumann (zero flux) condition;
(18) constant boundary cell with Dirichlet (fixed) condition.
Furthermore, a cell can be disconnected from the active network for calibration purposes. No long-term 
analog storage devices are necessary since the calibration process is repeated periodically.
The operation of a 1D dynamic network is sketched in Figure 10(b) . The analog values reside in the corresponding cell while only the connections of input, output, and boundary cells are reconfigured. The analog network processes the data while it is dynamically reconfigured. Moreover, the reconfiguration process is simple because only a maximum number of six cells have to be accessed per cycle.
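The shifting scheme can be mimicked in software with a ring of cells and a moving boundary. The sketch below is a strongly simplified model under several assumptions: a lumped nodal relaxation with a saturating link characteristic, zero-flux conditions at both ends of the active chain, no calibration pool, and a fixed number of explicit iterations standing in for the analog settling within one sample period. The names `relax` and `dcn_stream` are hypothetical.

```python
import numpy as np

def relax(u, g, lam_h2, c_rho, n_iter):
    """Relax a logically ordered chain of cells with zero-flux ends."""
    u = u.copy()
    dt = 1.0 / (1.0 + 4.0 * lam_h2)
    for _ in range(n_iter):
        flux = lam_h2 * np.clip(np.diff(u), -c_rho, c_rho)
        div = np.zeros_like(u)
        div[:-1] += flux
        div[1:] -= flux
        u += dt * (g - u + div)
    return u

def dcn_stream(samples, n_cells=16, lam_h2=8.0, c_rho=0.05, n_iter=300):
    """Process a stream of any length on a ring of n_cells cells."""
    g = np.zeros(n_cells)                  # stored input sample of each hardware cell
    u = np.zeros(n_cells)                  # state of each hardware cell
    out = []
    for k, s in enumerate(samples):
        h = k % n_cells                    # hardware cell reconfigured in this cycle
        if k >= n_cells:
            out.append(u[h])               # read the oldest state before overwriting it
        g[h] = u[h] = s                    # load the new sample into the freed cell
        filled = min(k + 1, n_cells)
        # hardware indices in logical order, oldest sample first; the link
        # between the newest and the oldest cell stays open
        order = [(h - j) % n_cells for j in range(filled - 1, -1, -1)]
        u[order] = relax(u[order], g[order], lam_h2, c_rho, n_iter)
    return np.array(out)

stream = np.concatenate([np.zeros(30), np.ones(30)])
smoothed = dcn_stream(stream, n_cells=8, c_rho=0.01, n_iter=50)
```

Only the index `h` changes from cycle to cycle, while the stored values stay in place, which mirrors the idea that the connections are reconfigured rather than the data being moved. The output lags the input by `n_cells` samples.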
When no processing is performed, the network acts like an analog shift register with a delay of N' multiplied by the sample time period. The effective length N' is the distance between the input and output node. It depends on the size of the calibration pool and available boundary conditions (see Figure 10(b) ). For optimal Dirichlet boundary condition
Á at the end of the network, the effective length N' can be chosen as large as possible since there will be no reflection from that boundary node.
Signal propagation
The architecture depicted in Figure 10 can be seen as a systolic architecture. In contrast to conventional systolic systems, where the data is "pumped" through the computational units, in this approach the connections between the cells are altered to achieve the systolic behavior. In the following paragraphs the process of reconfiguration is reviewed in an analytical way. For simplicity the logical network length $N'$ is set to $N$, the number of cells.
Let $u_i^k$ denote an index operation modulo $N$ in the $k$th step: $u_i^k = u_{\langle i - k \rangle}$ with $i$ the logical node number and $\langle i - k \rangle$ the hardware node number (see Figure 10(b)). By applying Eqns (16)-(18) to the simplified 1D discrete case of Eqn (11), the system equations can be formulated as follows
for $kT < t \le (k+1)T$ while the cell inputs are g. Furthermore, an "area of influence" is defined in a neighborhood of the cell $C_j$: $i - \Lambda < j < i + \Lambda$. $\Lambda$ describes the maximum distance between two mutually influencing cells. The value of this crucial distance is
determined by the characteristic space constant (e.g. diffusion constant) for the underlying signal processing task. As a matter of fact, $\Lambda$ is also limited by the size of the network and the required or achievable accuracy of the analog circuit used.
In a conventional parallel processing mode the diffusion constant has to be chosen to satisfy $\Lambda_p < N/2$, with $N$ being the number of parallel working nodes. This is the minimum condition for a correct result at the center position. For the dynamic circular mode $\Lambda$ can be doubled ($\Lambda_c < N$) since $N$ is the size of memory in the network (Figure 11). Of course this holds only for systems where the time constant $\tau$ is smaller than the sample time $T$. For analog systems, this can be guaranteed by the extremely fast settling time of the network. Figure 12 shows the impulse response for different network architectures. The dotted line is the correct propagation for a linear diffusion ($\Lambda \approx 30$) computed on a 1D network of 200 nodes operating in parallel. The dashed line is valid for a parallel network of only 16 nodes, whereas the solid one holds for a network of the same size operating in the dynamic circular manner. It can be seen that the difference in impulse response
between loop-mode and the correct propagation result is much smaller than between the parallel and ideal mode. Due to the fixed value boundary on the right, the memory length is extended to infinity resulting in an asymmetric impulse response.
1D 0.8 µm CMOS realization and 1D/2D experiments
Based upon the algorithm and architecture described in the previous sections, a 1D prototype consisting of 32 identical analog cells was fabricated (see Figure 13 ). The major challenge was the design of the nonlinear function which has to be very accurate to guarantee minimum impact of transistor mismatch (see Analog cells section).
The dynamic routing resources are provided by MOS switches which are controlled by a digital unit ([44], Figure 7).
Measured results for different values of $c_\rho$ are depicted in Figure 14. $c_\rho$ ranges from 0.01 to 0.05 for an approximately constant value of $\lambda_h^2$. It can be seen that for smaller values of $c_\rho$ the salient signal structure is preserved while for increasing $c_\rho$ the process degenerates to a linear diffusion.
The measured signal can be compared to the simulation results computed by solving the nonlinear equation for $c_\rho = 0.01$ and $\lambda_h^2 = 8$ in Figure 15. The difference between these signal vectors is mainly due to offset voltage deviations between neighboring cells. Whereas in smooth areas the difference is as low as the offset voltage itself (a few millivolts), in steep portions of the signal the difference can be large since the adaptive behavior is slightly different.
In further experiments an overall random error of about $\sigma = 2.5$ mV (standard deviation of the voltage differences between input and output of band-limited signals) was measured. This first experimental design can successfully and efficiently perform computationally intensive nonlinear smoothing of 1D signals.
In addition to the 1D experiments, we carried out several experiments with 2D gray-value images (see Figure 16 , [45] ). The goal of these experiments was to investigate from a practical perspective to which extent even only the 1D chip with its reduced design complexity is capable of satisfactorily smoothing 2D image data in a nonlinear fashion. In this case, the image data was fed into the 1D chip in a consecutive row-by-row and column-by-column manner (note that for this input 
scheme the result is not fully consistent with the 2D theory since only an orientation-constrained dual nonlinear diffusion along orthogonal data vectors can be achieved). The promising results shown in Figures 17 and 18 demonstrate the validity of this pragmatic approach since adaptive nonlinear smoothing as well as preservation of contrast edges can be achieved. For this experiment, the partial derivatives $u_x$ and $u_y$ have been computed via simple finite differences, whereas the threshold has been set to values slightly above the theoretical value of $c_\rho$ so as to reduce the influence of the inevitable noise of the analog components. Our current experience with the 1D experimental chip indicates a peak pixel rate of 500 kHz (equivalent to 130 ms/image). Since this performance is limited not by the relaxation time of the 1D network of cells itself but solely by the rather weak performance of the currently used readout circuitry, it can be improved further in the future through appropriate design measures.
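The row-by-row and column-by-column scheme can be emulated in software by applying a 1D relaxation along both image axes. The sketch assumes that the column pass operates on the result of the row pass, and it uses a lumped 1D relaxation with a saturating link characteristic in place of the chip; `relax_1d` and `smooth_2d_separable` are hypothetical names.

```python
import numpy as np

def relax_1d(g, lam_h2=8.0, c_rho=0.05, n_iter=500):
    """Lumped 1D nonlinear relaxation with zero-flux ends (software stand-in for the chip)."""
    g = np.asarray(g, dtype=float)
    u = g.copy()
    dt = 1.0 / (1.0 + 4.0 * lam_h2)
    for _ in range(n_iter):
        flux = lam_h2 * np.clip(np.diff(u), -c_rho, c_rho)
        div = np.zeros_like(u)
        div[:-1] += flux
        div[1:] -= flux
        u += dt * (g - u + div)
    return u

def smooth_2d_separable(img, **kw):
    """Apply the 1D relaxation to every row, then to every column of the result."""
    img = np.asarray(img, dtype=float)
    rows = np.apply_along_axis(relax_1d, 1, img, **kw)
    return np.apply_along_axis(relax_1d, 0, rows, **kw)

img = np.zeros((8, 8))
img[:, 4:] = 1.0                                  # vertical contrast edge
out = smooth_2d_separable(img, c_rho=0.01)
```

As a plausibility check of the quoted throughput, 130 ms at a pixel rate of 500 kHz corresponds to 0.130 × 500,000 = 65,000 pixel transfers; the image size behind this figure is not stated in the text.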
Overall, the experiments showed that the 1D chip has the potential to serve as a powerful but cheap image preprocessing device.
Conclusion and further work
An efficient analog 1D hardware implementation of a nonlinear diffusion algorithm has been proposed and fabricated. Substantial problems (e.g. input/output of the data, limited number of cells) in the design of massively-parallel analog hardware have been overcome. The analog network cells are connected in a circular structure. Due to the dynamic reconfiguration of the connections a nonlinear adaptive filter kernel can be shifted virtually over a signal vector of "infinite" length. The boundary conditions have to be configured as von Neumann (zero flux) or Dirichlet (fixed) boundaries. Furthermore, our design has the potential for calibrating cells that are not in the active part of the network without disrupting the data stream. The storage duration is constant for all analog signal samples. Consequently, linear systematic error effects resulting from leakage currents can be compensated as well. A prototype containing a 1D nonlinear network has been fabricated in 0.8 µm CMOS and is fully functional. The chip can also be used for 2D image smoothing provided that the result of successive 1D relaxation along the rows and columns of the image is acceptable for a given task.
Further work will focus on a 2D realization being fully consistent with the multi-dimensional theory. In addition, effective error reduction methods and efficient architectures for higher dimensions have to be investigated in order to exploit the potential of parallel analog hardware for computer vision.
