Introduction
Although most CNN applications and the corresponding VLSI implementations concern two-dimensional arrays, one-dimensional Cellular Neural Networks have recently received increasing attention. For the latter, quite different applications have been reported, ranging from 1-D signal processing [1-3] to instrumentation and control [4]. In particular, a 1-D CNN architecture able to emulate the behavior of FIR filters and to perform the Daubechies wavelet transform has been thoroughly discussed in [1-3]. Moreover, unlike 2-D CNNs for image processing, the applications reported for 1-D CNNs require only a small number of cells [1-4]. This paper presents a VLSI implementation of a new one-dimensional discrete-time programmable CNN suitable for the above-mentioned applications. It is based on the well-known S²I technique. In addition, a time-multiplexing scheme allows a more efficient use of the hardware.
A block diagram of the proposed architecture is shown in Fig. 1. The main blocks are an analog shift register (i.e. a tapped delay line) and a set of locally connected cells. The shift register is a cascade of eight delay elements (sr0-sr7). Input data enters the shift register on the left side (through sr0) and the samples are shifted to the right, passing from sr0 to sr1 and so on. The cells of the CNN receive their inputs from the shift register and from the outputs of their neighbors. The cloning templates are provided externally, which makes the proposed architecture programmable. The state equation describing the behavior of the cell at position c is:
where x_c(n+1) is the updated state of the cell, y_c(n) = f(x_c(n)) is its output, u_c(n) is the output of the shift register stage sr_c (e.g. u_1 is the output of sr1), A_{c,d} are the feedback templates, B_{c,d} are the control templates, and N(c) is the neighbor set of cell c, defined as follows:
As shown in Fig. 1, the proposed architecture includes six CNN cells and eight delay cells (the self-feedback is implicit in the cells of Fig. 1 and has not been shown to avoid clutter). In the actual implementation, the first delay stage sr0 is not necessary and the input (Data In in Fig. 1) can be provided directly in its place. Fig. 2. A replica of the current at the output of the cell is needed for the next delay cell and for any other circuit requiring it as input. Therefore, a current mirror with multiple output branches is placed between any two full-delay blocks. In particular, cascode current mirrors have been used in order to obtain a very high output impedance. The circuit of a shift register cell is shown in Fig. 3. An impedance is placed between the output and the input node of the half-delay cells in Fig. 3. It represents an NMOST identical to the ones used for the switches, with its gate connected to the positive power supply. It compensates for the IR drop due to the output switch [6].
The Shift Register
where Iofl(z) represents a constant output offset very close to I o~( z ) . Therefore, due to the sign inversion, when two of these cells are cascaded in order to obtain a full delay, the two offsets practically cancel each other. In our case, the first half-delay cell is loaded by the second one, while the current mirror loads the second one. Therefore the total transfer function of the two blocks in cascade is given by:
This implies that the delay cell slightly attenuates the delayed signal. This attenuation could be minimized by using cascode transistors in the half-delay cell [5]. However, in the proposed design it has been preferred to compensate for it by choosing a suitable current gain for the current mirror following the delay cell. In fact, a Monte Carlo analysis has shown that the latter solution is more robust to parameter variations. Finally, in order to compensate for a possible residual output offset current, another delay cell with zero input has been created. The output current of this cell is the residual offset. It is copied and subtracted from the output of all the other delay cells by means of cascode mirrors (it is added to the intermediate node of the current mirrors using the terminal Of-c, as shown in Fig. 3). This is very similar to the well-known replica circuit technique [a].
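The compensation scheme described above can be illustrated with a minimal behavioral sketch. It assumes a simplified half-delay model, i_out = -ALPHA * i_in + I_OFF, in which the gain magnitude ALPHA and the offset I_OFF are hypothetical values chosen only for illustration; the function names and the order in which the offset subtraction and the mirror gain are applied are likewise assumptions, not the circuit's actual transfer functions.

```python
# Behavioral sketch of one full-delay cell (two cascaded half-delay cells).
# ALPHA and I_OFF are illustrative assumptions, not values from the paper.
ALPHA = 0.98      # assumed gain magnitude of one half-delay cell (slightly < 1)
I_OFF = 2e-7      # assumed constant output offset of one half-delay cell, in A


def half_delay(i_in, alpha=ALPHA, i_off=I_OFF):
    """Inverting memory cell: attenuates, inverts and adds an offset."""
    return -alpha * i_in + i_off


def full_delay_raw(i_in):
    """Two half-delay cells in cascade; the two sign inversions cancel."""
    return half_delay(half_delay(i_in))


def full_delay_compensated(i_in):
    """Full delay with the replica offset removed and a mirror gain of
    1/ALPHA**2 restoring unity gain (ordering assumed for this sketch)."""
    mirror_gain = 1.0 / ALPHA**2
    residual_offset = full_delay_raw(0.0)      # replica cell, zero input
    return mirror_gain * (full_delay_raw(i_in) - residual_offset)


if __name__ == "__main__":
    sample = 10e-6  # 10 uA test current
    print("raw        :", full_delay_raw(sample))
    print("compensated:", full_delay_compensated(sample))
```

In this model the cascade output is ALPHA**2 * i_in + I_OFF * (1 - ALPHA), so the residual offset is much smaller than the offset of a single half-delay cell, and the replica subtraction followed by the mirror gain recovers the input sample.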
CNN Cells
The cell's behavior is characterized by the state equation (1). The output of the cell is:
This is easily obtained with the two cascode current mirrors shown in Fig. 4. The bias current Idd of the NMOSTs working as cascode current sources fixes the saturation current.
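As a rough illustration of the cell behavior, the sketch below performs one discrete-time update of the six-cell array fed by the eight-tap delay line. It assumes the state is the sum of the template-weighted neighbor outputs and the template-weighted delay-line outputs, as described in the multiplexing section of this paper, with no bias term. The neighborhood radius of one, the zero boundary condition, the unit-slope saturation, the level I_SAT and all function names are assumptions made for the example.

```python
# Sketch of one discrete-time update of a 1-D CNN with six cells fed by an
# eight-tap delay line. Neighborhood radius, boundary handling and the
# saturation level are assumptions made for illustration only.
N_CELLS = 6
N_TAPS = 8
I_SAT = 1.0  # assumed saturation level of the output nonlinearity


def saturate(x, i_sat=I_SAT):
    """Piecewise-linear output: unit slope, clipped to [-i_sat, +i_sat]."""
    return max(-i_sat, min(i_sat, x))


def cnn_step(y, u, a, b):
    """Return the list of updated states x(n+1), one per cell.

    y : list of N_CELLS cell outputs y_d(n)
    u : list of N_TAPS delay-line outputs u_d(n) (taps sr0..sr7)
    a, b : per-cell lists of three weights for neighbors d = c-1, c, c+1
    """
    y_padded = [0.0] + list(y) + [0.0]   # no cells beyond the array ends
    x_next = []
    for c in range(N_CELLS):
        # Cell c is aligned with tap c+1, so its window covers taps c..c+2.
        feedback = sum(a[c][k] * y_padded[c + k] for k in range(3))
        control = sum(b[c][k] * u[c + k] for k in range(3))
        x_next.append(feedback + control)
    return x_next
```

With all feedback weights set to zero, each cell reduces to a three-tap FIR filter of the delay-line samples, which is the FIR-emulation use case mentioned in the Introduction.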
S²I Multiplier
Four-quadrant S²I multipliers [7] have been used to implement the programmable templates. The circuit of the multiplier is shown in Fig. 5. The multiplication of any two given currents x and y is accomplished by evaluating their quadratic terms:
In order to accomplish this, a current squarer circuit is used to obtain the three squares on the left-hand side. The squarer characteristic is:
where i_o and i_i are the output and input currents respectively, I_b is a constant current related to the bias voltage Vbrm, e_1 is the output offset error, and e_2 is the input offset error. A four-step algorithm (corresponding to the four phases θ1-θ4) is used to evaluate the left-hand side of (7). The control signals for the switches are shown in Fig. 6. During θ1, both x and y are added at the input node of the current squarer, and the squared result is stored in memory 1. During θ2, only x is fed in; the result of the squaring operation is added at node A to the previous result, which is fed (sign inverted) by memory 1. This sum is stored in memory 2. During θ3, no input is presented to the squarer, which feeds only its residual to node A. At the same time, the latest partial sum is fed (sign inverted) to node A by memory 2, and the total current is stored in memory 1. Finally, during θ4, the current y is squared and added to the previous sign-inverted subtotal given by memory 1. The resulting total current is provided at the output. Applying this algorithm to (7) by using (8), and assuming ideal behavior for the circuits and devices (infinite output/input impedance ratio of the building blocks constituting the circuit), the following first-order relationship is obtained:
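The four-phase sequence can be sketched behaviorally as follows. The squarer law i_o = K * (i_i + E2)**2 + E1, the constants K, E1 and E2, the parameter eta (gain magnitude of a memory cell) and the function names are assumptions introduced for illustration; the paper relates the actual scale factor to the bias current I_b.

```python
# Behavioral sketch of the four-phase squaring sequence of the multiplier.
# K, E1, E2 and the squarer law i_o = K*(i_i + E2)**2 + E1 are assumptions
# for illustration; the paper ties the scale factor to the bias current Ib.
K = 0.5       # assumed squarer scale factor
E1 = 0.03     # assumed output offset error
E2 = -0.02    # assumed input offset error


def squarer(i_in):
    """Current squarer with input and output offset errors."""
    return K * (i_in + E2) ** 2 + E1


def multiply(x, y, eta=1.0):
    """Run phases theta1..theta4; eta is the memory-cell gain magnitude
    (eta = 1 is the ideal case, eta slightly below 1 models attenuation)."""
    mem1 = squarer(x + y)                 # theta1: store (x + y)^2
    mem2 = squarer(x) - eta * mem1        # theta2: add x^2 to inverted memory 1
    mem1 = squarer(0.0) - eta * mem2      # theta3: squarer residual, no input
    return squarer(y) - eta * mem1        # theta4: add y^2, deliver the output
```

With eta = 1 the offset terms E1 and E2 drop out and this model returns -2 * K * x * y, so the result is proportional to the product of the operands (the overall sign follows from the inverting-memory assumption). With eta slightly below 1 the cancellation is no longer exact, which is the non-ideality discussed next.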
Thanks to the third term, in which no input is applied to the squarer, the errors due to the offsets in the squarer are canceled out, so the final result is proportional to the product of x and y. Because the transfer function of any single S²I memory cell is of the inverting type, the order in which the various terms are evaluated and the intermediate results are retrieved is important. In reality, the samples retrieved from the memory cells are slightly attenuated. This implies an inexact cancellation of the quadratic terms in (7). In practice, most of the error comes from the fact that during θ4 the output of the squarer is added to the attenuated partial result coming from memory 1. The first two terms on the right-hand side of (10b) represent an offset, the last term is the desired result, and the remaining terms constitute the nonlinear distortion. Note that η is very close to, but less than, 1. Therefore the third and fifth terms, being multiplied by a factor containing (η - 1), can be neglected. Relationship (10b) can be rewritten as:
where i_os is the offset current. The nonlinear error is canceled by setting the current mirror ratio as:
Assuming that (W/L), = (W/L),, the aspect ratio of Wb is obtained as follows: (wIL),, = (~I L )~~
(1 -p) 1 p.
However, the final design of h&b also depends on the actual voltage drop on the switch in series with Wb, The final expression is therefore:
It is worth noting that, because η ≈ 1, the offset is almost canceled as well. So finally:
The experimental characteristic curves of a prototype of the multiplier are shown in Fig. 7. The multiplier's gain is η³p/(2I_b) = 5000 A⁻¹. It is worth noting that the operands have to be kept constant at the inputs of the multiplier while the result of the multiplication is released on
Ancillary circuits
In order to bias the current squarer, the multiplier's input node (I3 in Fig. 6) must be kept at Vdd/2 = 1.5 V. Therefore a common-gate amplifier has been used as a level shifter. Moreover, in order to have the same order of magnitude for the two operands of the multiplier, a current amplifier is placed at the signal input. Finally, on-chip linear I-V converters allow external observation of the data in the delay line.
Cell Behavior and Hardware Multiplexing
From the discussion in the above sections, two remarks can be made. First, the shift register cells accept new data and release the stored data only during φ1. On φ2, data is exchanged internally between the two half-delay cells, and the full delay cell is completely isolated from the rest of the system. During this second phase the rest of the hardware would be essentially idle. These two phases determine the time synchronization represented by the time index of the state equation (1). Second, the result of the multiplication process is required before the end of a clock cycle. Therefore, a strategy that exploits the available hardware during the idle phase and saves area is outlined next.
Let us refer to the block diagram shown in Fig. 8, depicting a CNN cell. The programmable synaptic connections using the multipliers are drawn on the left-hand side of the figure. On φ1, these multipliers accept the outputs of the shift register u_d, with the corresponding weights (namely the control templates B_{c,d}) provided by off-chip currents. On φ2, instead, the neighbor outputs y_d are fed to the inputs of the multipliers in place of u_d, while the feedback template A_{c,d} is fed instead of the control template B_{c,d}. In other words, during φ1 the weighted sum Σ B_{c,d} u_d(n+1) is present on the summing node at the output of the multipliers. This sum enters the full delay cell. At the same time, the previous value of this sum is present at the output of the delay cell (because it entered the delay cell during the previous period n). Moreover, let us assume that the half-delay cell depicted below the full delay cell is providing the weighted sum Σ A_{c,d} y_d(n) at its output (this assumption is proven below). Therefore, the state x_c(n+1) is obtained at the summing node of the two previous outputs, in accordance with the state equation (1). This state enters the second half-delay cell, whose output is instead zero, and so is the output of the nonlinearity.
On φ2, the weighted sum Σ B_{c,d} u_d(n+1) is stored in the full delay cell: it is passing from the first half-delay cell to the second half-delay cell of the full delay cell itself, and is therefore completely isolated from the rest of the system. The cell outputs y_d and the feedback templates are fed to the synapses, so the weighted sum Σ A_{c,d} y_d(n+1) is fed to the input of the half-delay cell on the left-hand side of the figure. This value is available at its output on φ1 of the next clock period (namely period n+2), which proves the above assumption on the output of this half-delay cell. On φ2 there are no currents at the summing node on the right of the full delay cell. The half-delay cell on the left-hand side releases its stored value (namely x_c(n+1)) and so, through the nonlinear block, the output y_c(n+1) of the cell is available. This is consistent with the fact that the outputs y_d(n+1) are provided as inputs of the cells during this phase. The above approach allows us to i) exploit the rest of the hardware during the idle phase φ2 of the shift register and ii) use only three multipliers instead of six, saving area and power. To obtain this, only two half-delay cells have been added to the classical CNN cell architecture. From this scheme it can be seen that, while the outputs of the delay cells are available during the whole of the corresponding phases, the sampling is accomplished only during [...]
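The two-phase schedule can be summarized with a short behavioral sketch in which the same three multipliers per cell are used once on φ1 (with B and u) and once on φ2 (with A and y). The array size, the neighborhood radius of one, the zero boundary condition, the unit saturation level and all function and variable names are assumptions made for the example; the analog memories are modeled as plain variables.

```python
# Sketch of the two-phase time multiplexing: the same three multipliers per
# cell evaluate the B*u sum on phi1 and the A*y sum on phi2. Array size,
# neighborhood radius and saturation level are illustrative assumptions.
def saturate(x, i_sat=1.0):
    """Piecewise-linear output nonlinearity with an assumed unit level."""
    return max(-i_sat, min(i_sat, x))


def weighted_sum(weights, window):
    """The three shared multipliers of one cell plus the summing node."""
    return sum(w * v for w, v in zip(weights, window))


def windows(values, n_cells):
    """Three-sample neighbor window (d = c-1, c, c+1) for each cell."""
    return [values[c:c + 3] for c in range(n_cells)]


def run_multiplexed(u_seq, a, b, n_cells=6):
    """Return the cell outputs y after each clock period."""
    bu_stored = [0.0] * n_cells   # full-delay cells holding the B*u sums
    ay_stored = [0.0] * n_cells   # half-delay cells holding the A*y sums
    history = []
    for u in u_seq:
        # phi1: multipliers see (B, u); state = delayed B*u sum + A*y sum.
        bu_new = [weighted_sum(b[c], w)
                  for c, w in enumerate(windows(u, n_cells))]
        x = [bu_stored[c] + ay_stored[c] for c in range(n_cells)]
        bu_stored = bu_new
        # phi2: the state is released through the nonlinearity; the same
        # multipliers now see (A, y) and refill the half-delay cells.
        y = [saturate(v) for v in x]
        y_padded = [0.0] + y + [0.0]
        ay_stored = [weighted_sum(a[c], w)
                     for c, w in enumerate(windows(y_padded, n_cells))]
        history.append(y)
    return history
```

Each pass through the loop body calls weighted_sum twice per cell, once per phase, which mirrors the reuse of three physical multipliers for the six template products of a cell.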
Simulation results
HSPICE simulation results, at the transistor and functional levels, are presented here. Let us first consider some transistor-level simulation results. In a first case, a sinusoidal input is fed to the input of the tapped delay line and shifted along it. The outputs of the on-chip linear I-V converters corresponding to the delay stages sr1, sr2, sr3 and sr7, respectively, are shown in Fig. 9a. The initial values present in the delays before the first sample reaches each stage can also be seen.
The two input currents of one of the multipliers are shown in Fig. 9c. One of the two inputs is fed alternately with B and A, while the other one is fed with u and y. The different signals on the distinct phases can be distinguished. It is seen that on phase φ1 the corresponding B template coefficient is provided at one of the inputs, while the output current coming from the delay line (u) is fed to the other one. Moreover, the four phases θ1-θ4 of the multiplier can be distinguished by the fact that one of the inputs (the template coefficient) enters on θ1+θ2 while the other input is accepted on θ1 and θ4. Conversely, the template coefficient A is provided on phase φ2 together with the corresponding output of the cell. A practical example of the functionality of the proposed architecture is now given. The voice of one of the authors has been sampled at 8 kHz. White noise has been added, obtaining a noisy signal (4000 samples) with signal-to-noise ratio SNR = 1.1965. A SPICE macromodel of the chip has been simulated to compute a wavelet decomposition of this signal (top of Fig. 9b) according to the algorithm described (see [...] for the detailed description of the algorithm). The filtered signal (bottom of Fig. 9b) has SNR = 6.4288, an improvement of 14.6043 dB. In Fig. 9b the time shift between the two signals cannot be seen at this level of magnification. The noise reduction, instead, is visible, particularly where the low-frequency components of the vocal signal are dominant (at the beginning, in the middle and at the end).
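The quoted improvement can be cross-checked with a few lines of arithmetic. The sketch below treats the two SNR figures as amplitude ratios, so the gain in decibels is 20 log10 of their quotient; this reading is an assumption, adopted because it is the one that reproduces the quoted figure. The function name is hypothetical.

```python
import math

# SNR figures quoted in the text, treated here as amplitude ratios
# (an assumption: this is the reading that reproduces the quoted dB figure).
SNR_NOISY = 1.1965
SNR_FILTERED = 6.4288


def improvement_db(snr_in, snr_out):
    """Improvement in dB for SNRs expressed as amplitude ratios."""
    return 20.0 * math.log10(snr_out / snr_in)


print(round(improvement_db(SNR_NOISY, SNR_FILTERED), 2))
```

The result is approximately 14.6 dB, in agreement with the figure reported above.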
Conclusions
The VLSI implementation of a discrete-time one-dimensional Cellular Neural Network has been discussed. One of the peculiarities of the proposed architecture is a hardware-multiplexing strategy. It allows the hardware to be used efficiently, halving the number of multipliers and storing the intermediate results in temporary memories. Simulation results at the transistor and functional levels have been reported. A CMOS N-well MOSIS Orbit 2 µm chip is currently in fabrication.
