Introduction
Two of the main problems in the hardware implementation of Cellular Neural Networks (CNNs) are the required number of communication paths and the size of the cell. However, the main bottleneck in reducing computation time is the input/output communication delay with each cell, which can be several times higher than the actual computation time of the CNN array. Several attempts to carry out the I/O communications in the frequency domain have already been pursued [2-4]. The technique that seems most viable for implementation in a CMOS IC with a relatively large number of CNN cells is the Wave-Parallel Computing approach proposed by Yuminaka et al. [4]. This technique is based on Frequency Division Multiple Access Amplitude Modulation (FDMA-AM). In this scheme, all the information in a communications channel is modulated by a finite set of different carriers. Additionally, all arithmetic operations can be performed on the modulated waveforms, which yields an inner product computation in real time without the need to demodulate. This permits a high degree of parallelism within the system, since transmission, multiplication and accumulation take place at the same time. This leads to the concept of Transmit Multiply and Accumulate (TMAC), analogous to the multiply-accumulate concept used in DSP terminology. The approach is not only simple, but also leads to a much simpler hardware implementation than the one required with traditional techniques.
For near real-time video applications, a continuous-time CNN usually requires that its inputs be present at all times. However, the communication links can take up a large proportion of the integrated circuit area, which is why current implementations do not have a very large number of CNN cells even though the cells themselves can be implemented efficiently. A time-multiplexing architecture has been proposed to alleviate the problem of communicating the external input data to the array [5]. Nevertheless, under these conventional implementations full parallel access to the entire CNN array is not possible. It will be shown that with the scheme proposed here, full parallel access is possible using a reduced number of wires by assigning multiple frequency bands to each connection wire. The main advantages of the proposed system are the following:
Reconnection programmability. CNN cells are not required to occupy a physical matrix array. Each cell has its own characteristic frequency of operation, which allows any cell to be interconnected with any other cell in the array. By assigning contiguous frequency channels to both the templates and the cells, a simple implementation that mimics the conventional CNN array can be obtained.
Easy expandability. The CNN array can be easily expanded by adding another CNN chip and programming its frequencies. This is an advantage over conventional CNN implementations, in which a large array on a single chip has no direct external pin connections to the neighboring cells, which makes it impossible to add a second array interacting in real time with the first one.
Dynamic programming of templates with different sizes. The neighborhood can be expanded and reprogrammed dynamically, and neighborhoods with radius greater than 1 can be easily programmed. This is a major advantage over conventional arrays, whose connections cannot be programmed because they are fixed by the physical hardware connections.
It can be seen that the first term has a baseband component ($\cos 0$). Notice that the coefficients of the baseband component are $a_{ij} y_{ij}$, which are the terms corresponding to the inner product $\langle A, Y \rangle$. By using a low-pass filter with a cutoff frequency such that $\omega_{pass} \le \omega_0/2$, where $\omega_0$ is the spacing between carrier frequencies, we obtain only the baseband elements, which is what we want.
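The step from the product of the modulated waveforms to the inner product can be written out as a short derivation. This is a sketch under stated assumptions: each coefficient $a_{ij}$ and the matching output $y_{ij}$ share a carrier $\cos(\omega_{ij} t)$, distinct carriers are at least $\omega_0$ apart, and the names $s_A$, $s_Y$ and $\mathrm{LPF}$ are notation introduced here for illustration only.

```latex
\begin{align*}
s_A(t)\, s_Y(t)
  &= \Big(\sum_{i,j} a_{ij}\cos\omega_{ij}t\Big)\Big(\sum_{k,l} y_{kl}\cos\omega_{kl}t\Big) \\
  &= \frac{1}{2}\sum_{i,j}\sum_{k,l} a_{ij}\,y_{kl}
     \big[\cos\big((\omega_{ij}-\omega_{kl})t\big) + \cos\big((\omega_{ij}+\omega_{kl})t\big)\big] \\
  &= \underbrace{\frac{1}{2}\sum_{i,j} a_{ij}\,y_{ij}\cos 0}_{\text{baseband}}
     \;+\; \big(\text{terms at frequencies } \ge \omega_0\big), \\
\mathrm{LPF}\{s_A(t)\,s_Y(t)\} &= \frac{1}{2}\langle A, Y\rangle .
\end{align*}
```

Only the pairs with $(i,j) = (k,l)$ contribute a difference frequency of zero, so an ideal low-pass filter keeps half of the inner product and rejects every other term.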
A second inner product can be computed simultaneously if the required output is $\langle A, Y \rangle + \langle B, U \rangle$. This can be done as long as the frequency bands assigned to A and Y do not overlap those assigned to B and U. As a special case, consider a negative coefficient. Then we have $-k\cos(\omega t) = k\cos(\omega t + \pi)$.
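The two points above (disjoint bands for the two inner products, and the sign carried in the phase) can be checked numerically. The following is a minimal sketch, not the authors' simulator: it assumes integer carrier frequencies, replaces the low-pass filter with an ideal average over one time unit, and assumes that both products share one channel by summing the waveforms. The function names and all numeric values are hypothetical.

```python
import math

def modulate(coeffs, freqs, t):
    """FDMA-AM waveform at time t; a negative coefficient k is sent as |k| with a phase of pi."""
    total = 0.0
    for k, f in zip(coeffs, freqs):
        phase = math.pi if k < 0 else 0.0
        total += abs(k) * math.cos(2.0 * math.pi * f * t + phase)
    return total

def tmac(left, right, freqs, n=1000):
    """Multiply two modulated waveforms, average over one time unit, and scale by 2."""
    acc = 0.0
    for s in range(n):
        t = s / n
        acc += modulate(left, freqs, t) * modulate(right, freqs, t)
    return 2.0 * acc / n

# Hypothetical values: A and Y occupy carriers 1..3, B and U occupy carriers 4..5.
a, y, band_ay = [1.0, -2.0, 0.5], [-1.0, 1.0, 1.0], [1, 2, 3]
b, u, band_bu = [0.5, 1.0], [1.0, -1.0], [4, 5]

single = tmac(a, y, band_ay)                      # estimates <A,Y>
combined = tmac(a + b, y + u, band_ay + band_bu)  # estimates <A,Y> + <B,U>
```

Because the bands do not overlap, the cross terms between A and U (and between B and Y) sit at nonzero frequencies and average out, so `combined` approximates the sum of the two direct dot products.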
Notice that the sign information is contained in the phase. This is how the proposed modulation scheme differs from conventional amplitude modulation, in which a DC offset higher than the maximum possible value of the constant is added. The block diagram of the proposed modified system is shown in Figure 1. In this diagram, Templates A and B, the Input U and the Output Y are modulated by a set of different frequencies. Because the neighborhood radius was chosen to be one, the number of different frequencies needed per cell is nine. Special attention must be paid to the fact that the sums and differences of the carrier frequencies can produce terms that fall near the low-pass filter bandwidth. Therefore, a practical frequency assignment separates the carrier frequencies by a proportional factor $\omega_0$, e.g. $f_1 = \omega_0$, $f_2 = 2\omega_0$, $f_3 = 3\omega_0$, .... This factor is chosen as a low-pass filter design constraint, which leads to the specification of the filter order. The number of frequencies required for each individual TMAC cycle depends on the size of the neighborhood, e.g. for radius 1, nine frequencies are required.
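The frequency assignment described above can be illustrated with a short sketch. It assumes carriers at integer multiples of a spacing (called `omega0` here, a hypothetical parameter name) and a square neighborhood of $(2r+1)^2$ cells for radius $r$, which gives the nine frequencies quoted for radius 1. The helper names are illustrative only.

```python
from itertools import combinations_with_replacement

def carrier_plan(radius, omega0=1.0):
    """Carriers at omega0, 2*omega0, ... for a (2r+1) x (2r+1) neighborhood."""
    count = (2 * radius + 1) ** 2
    return [k * omega0 for k in range(1, count + 1)]

def lowest_cross_term(carriers):
    """Smallest nonzero sum or difference frequency produced by multiplying two carriers."""
    terms = set()
    for fi, fj in combinations_with_replacement(carriers, 2):
        terms.add(fi + fj)           # sum frequency, always nonzero
        if fi != fj:
            terms.add(abs(fi - fj))  # difference frequency of distinct carriers
    return min(terms)

carriers = carrier_plan(radius=1)
nearest = lowest_cross_term(carriers)
```

With this plan the unwanted term closest to baseband sits at the spacing itself, which is what lets the spacing act as the design constraint for the low-pass filter.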
In Figures 2(a) and 2(b), template B and input U are modulated by a set of carriers, leading to an FDMA-AM-DSB system; this is shown as a set of weighted impulses. These waveforms are then multiplied to realize the inner product computation. A similar procedure is applied to Template A and the output values Y. Figure 2(c) shows that the cross product has a large DC component (the desired result) and a series of cross terms.
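The spectrum of the product described above can be reproduced with a small numerical sketch. It assumes three integer-frequency carriers and hypothetical values for B and U, and it measures each cosine component by direct correlation over one time unit rather than with an FFT. It is meant only to show the DC term next to the cross terms.

```python
import math

def fdma_am(values, freqs, t):
    """Sum of amplitude-modulated carriers at time t."""
    return sum(v * math.cos(2.0 * math.pi * f * t) for v, f in zip(values, freqs))

def cosine_spectrum(signal, max_freq, n=1024):
    """Amplitude of the cos(2*pi*m*t) component for m = 0..max_freq, from n samples over one time unit."""
    samples = [signal(s / n) for s in range(n)]
    spectrum = []
    for m in range(max_freq + 1):
        corr = sum(x * math.cos(2.0 * math.pi * m * s / n) for s, x in enumerate(samples)) / n
        spectrum.append(corr if m == 0 else 2.0 * corr)
    return spectrum

# Hypothetical template and input values on carriers 1, 2 and 3.
b = [0.5, 1.0, -0.25]
u = [1.0, -1.0, 1.0]
freqs = [1, 2, 3]

def product(t):
    return fdma_am(b, freqs, t) * fdma_am(u, freqs, t)

spec = cosine_spectrum(product, max_freq=6)
inner = sum(bi * ui for bi, ui in zip(b, u))
```

Here `spec[0]` holds half of the inner product, while `spec[1]` to `spec[6]` hold the cross terms that the low-pass filter has to remove.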
Some of the advantages of using this technique are: the oscillators can be shared, the required filters are very simple, the technique is all analog, and the building blocks are simple. Some of the disadvantages are as follows: the required number of frequencies is high; the frequency assignment depends on the effect of the harmonics in the system (quality of the oscillators); the time response of the system is a function of the frequency assignment (RC filtering); the system is sensitive to offsets; and the oscillator reference must be shared among cells.
Modeling System Non-idealities
All derivations use normalized units, both to determine the best architecture for the system and to avoid assuming that the processing is done in voltage or current mode. Time constants are treated in a similar way. To evaluate the approach, we considered the following figures of merit: i) algorithm convergence, a qualitative measure indicating when the CNN algorithm reaches its solution; ii) state convergence, which indicates when the system arrives at steady state; and iii) filter convergence, which corresponds to the steady-state convergence of the low-pass filter. For simulation purposes, the integration step was set to 0.1 time units and R = C = 1. As an example, for an edge detection algorithm applied to the image of Fig. 3a and without ...
$A_{offs}$ is the offset added at the output of the modulators, $M_{offs}$ is the offset added at the output of the multipliers, and THD represents the total harmonic distortion added at the third harmonic.
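The simulation settings quoted above (integration step of 0.1 time units, R = C = 1) can be illustrated with a forward-Euler sketch of a single RC stage. This is an assumption made for illustration: the paper does not state its integration method, and a first-order RC stage stands in for the actual filter. The function names and the 1% settling tolerance are hypothetical.

```python
def rc_lowpass(samples, dt=0.1, r=1.0, c=1.0):
    """Forward-Euler integration of dv/dt = (u - v) / (R*C), starting from v = 0."""
    v = 0.0
    out = []
    for u in samples:
        v += dt * (u - v) / (r * c)
        out.append(v)
    return out

def settling_steps(output, target, tol=0.01):
    """Index of the first sample from which the output stays within tol of target."""
    last_bad = -1
    for i, v in enumerate(output):
        if abs(v - target) > tol:
            last_bad = i
    return last_bad + 1

# Unit step input held for 200 integration steps (20 time units).
step_response = rc_lowpass([1.0] * 200)
steps_to_settle = settling_steps(step_response, target=1.0)
```

A measure of this kind is one way to quantify the "filter convergence" figure of merit: the number of integration steps the filter output needs before it stays close to its steady-state value.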
We have considered a 6th-order and a 2nd-order low-pass Butterworth filter with cutoff frequencies at $\omega_0/4$ and $\omega_0/16$, respectively, with $\omega_0$ the carrier spacing. Observe that the constant frequency scaling of the carriers allows us to relax the filter specifications. The only difference between the two filter specifications is that the algorithm convergence for the second-order filter increases to 12 cycles, but the complexity of the filter is greatly reduced. AM demodulation is coherent, and thus only a very small frequency deviation is allowed. Our simulations indicate maximum deviations in the range of ±0.1% for correct algorithm convergence. Similarly, the phase difference must be within a small range. This implies that only one set of carriers must be used. It follows that these reference frequencies must be supplied by one module to the other, or they must be supplied externally to both (the latter option is preferred to avoid design complexity in either module).
where for simplicity Ads= . Notice that the inner product needs to be scaled by 2 to normalize values. We have assumed that the nominal amplitude of the modulating waveforms is 1. From simulations we found that higher values help to reach algorithm convergence faster, while lower values tend not to converge to the correct solution. The modulating signal can have amplitude variations in a range of ±10% without an appreciable change in the algorithm convergence (for edge detection). These values determine the amplitude quality of the modulator. These variations can be modeled as
$\tilde{y}_{ij} = \gamma\, y_{ij} \cos(\omega_{ij} t), \quad -1 \le \gamma \le 1 \qquad$ (11b)
where $\delta$ and $\gamma$ are uniformly distributed random variables. There is also the possibility of an offset at the output of the activation function. However, our simulations indicate that this perturbation does not have a large impact on algorithm convergence; instead, it affects the convergence to the ±1 saturation levels. The perturbation can be modeled by adding a uniformly distributed random variable, $y_{off}$, between -1 and 1, to $y_{ij}$ in (8b).
To process the inner product, say $\langle A, Y \rangle$, the modulated waveforms of A and Y need to be multiplied together. A DC offset at the output of the multiplier has a direct impact on the correct convergence of the algorithm. It is also important to note that this perturbation has a different impact depending on which template is used. This new offset effect can be modeled as an offset added to the multiplier output.
The nonlinear distortion added by clipping the signals at the power supplies affects the algorithm convergence. Essentially, the clipping effect adds unwanted harmonics to the signal. A value of $\beta = 0.01$ (i.e. scaling by 100) is enough to keep the signals within a safe power supply range of 5 units. This scaling must also be evaluated with respect to the noise floor. Observe that a carrier of 1 unit amplitude is scaled to 0.1 units, and after multiplication by a carrier of similar value the signal is further reduced to 0.01 units. Therefore, once the signal reaches a level comparable to that of the noise, it is imperative to consider the signal-to-noise ratio. In other words, a noise analysis must be made to determine the noise floor and thus the minimum scaling that can be set in the system. The result of this analysis sets the power supply range; if this range is fixed, the other parameters need to be modified accordingly and re-evaluated. To evaluate the SNR, an additive white Gaussian noise (AWGN) vector is added to the FDMA channel.
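The effect of the two offsets discussed above on the recovered inner product can be sketched numerically. The example below is an illustration under assumptions: integer carrier frequencies, an ideal average over one time unit in place of the low-pass filter, the same offset on both modulators, and hypothetical values for A and Y. It does not reproduce the paper's convergence simulations.

```python
import math

def recovered_inner_product(a, y, freqs, mod_offset=0.0, mult_offset=0.0, n=1000):
    """Scaled average of the multiplier output, with optional DC offsets.

    mod_offset is added to each modulated waveform (assumed equal on both modulators);
    mult_offset is added to the multiplier output.
    """
    acc = 0.0
    for s in range(n):
        t = s / n
        wave_a = sum(v * math.cos(2.0 * math.pi * f * t) for v, f in zip(a, freqs)) + mod_offset
        wave_y = sum(v * math.cos(2.0 * math.pi * f * t) for v, f in zip(y, freqs)) + mod_offset
        acc += wave_a * wave_y + mult_offset
    return 2.0 * acc / n

# Hypothetical coefficients on carriers 1, 2 and 3; the exact inner product is -2.
a = [1.0, 2.0, -1.0]
y = [1.0, -1.0, 1.0]
freqs = [1, 2, 3]

ideal = recovered_inner_product(a, y, freqs)
with_mult_offset = recovered_inner_product(a, y, freqs, mult_offset=0.1)
with_mod_offset = recovered_inner_product(a, y, freqs, mod_offset=0.1)
```

In this idealized model a multiplier offset passes straight through the low-pass filter and shifts the result by twice its value, whereas equal modulator offsets contribute only through their product, which is consistent with the multiplier offset being the more critical perturbation.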
Because the wanted portion of the signal is the DC level, zero-mean noise has very little effect on algorithm convergence, and the SNR can be as low as -5 dB. Total harmonic distortion is evaluated by sweeping the amplitudes of the 2nd through the 9th harmonics. We performed this sweep over a range from 0% to 30% of the nominal amplitude. We found that the largest distortion that can be tolerated in odd harmonics is 20%. The system cannot tolerate even harmonics, so a fully differential architecture is suggested. Table 2 lists the individual parameter variations and the ranges for which algorithm convergence was obtained.
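The claim that zero-mean noise barely disturbs the DC result can be illustrated with a seeded Monte Carlo sketch. The assumptions are stated loudly: noise is added independently to both modulated waveforms, the SNR is defined per waveform as signal power over noise power, the low-pass filter is replaced by an ideal average, and the values of A and Y are hypothetical. The function name `noisy_tmac` is illustrative.

```python
import math
import random

def noisy_tmac(a, y, freqs, snr_db, n=200000, seed=0):
    """Estimate <A,Y> from two FDMA-AM waveforms, each corrupted by zero-mean Gaussian noise."""
    rng = random.Random(seed)
    gain = 10.0 ** (-snr_db / 10.0)                           # noise power / signal power
    sigma_a = math.sqrt(gain * sum(v * v for v in a) / 2.0)   # waveform power is sum(v^2) / 2
    sigma_y = math.sqrt(gain * sum(v * v for v in y) / 2.0)
    acc = 0.0
    for s in range(n):
        t = s / n
        wave_a = sum(v * math.cos(2.0 * math.pi * f * t) for v, f in zip(a, freqs))
        wave_y = sum(v * math.cos(2.0 * math.pi * f * t) for v, f in zip(y, freqs))
        acc += (wave_a + rng.gauss(0.0, sigma_a)) * (wave_y + rng.gauss(0.0, sigma_y))
    return 2.0 * acc / n

# Hypothetical coefficients; the exact inner product is -2.
estimate = noisy_tmac([1.0, 2.0, -1.0], [1.0, -1.0, 1.0], [1, 2, 3], snr_db=-5.0)
```

Even with the noise power above the signal power, the averaged product stays close to the noise-free value because the noise terms have zero mean and are uncorrelated with the carriers.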
Simulation Results
All parameter variations represent a worst-case scenario in which only the parameter in the respective analysis is modified and all other parameters remain at their nominal values. The most relevant parameters obtained from the simulations were the offsets at the output of the modulators, the offsets at the output of the multipliers, and the total harmonic distortion that can be tolerated on the third harmonic. These parameters define the required quality of the block implementation at the transistor level. A combined simulation without offset compensation was therefore run for $-0.1 \le A_{offs} \le 0.1$, $-0.5 \le M_{offs} \le 0.5$ and $0 \le THD \le 20\%$, and a 3-D parameter variation volume was generated, with the mass density representing the number of cycles required for the algorithm to converge. The completely black areas represent convergence to an incorrect solution. The volume can be interpreted in a simplified way, by fixing one variable at its nominal value and determining the range of variation of the other two (a transversal slice of the volume), or as the density that satisfies most of the design requirements with the best compromise among all three variables. From Figure 4a it can be seen that a reasonable THD tolerance is up to 7.5% when the offset of the multiplier varies from -0.5 to -0.2 units. From Figure 4b it can be seen that a reasonable tolerance is up to 10% of THD when the offset of the modulators varies from -0.025 to 0.025 units. From Figure 4c it can be seen that a reasonable tolerance is obtained when the offset of the multiplier varies from -0.5 to 0 units and the offset of the modulators varies from -0.05 to 0.05 units.
Conclusions
A modification of the Wave-Parallel Computing technique is proposed to address the communication and parallel processing needs of a real-time CNN. Exploiting these characteristics, a parallel processing system was simulated using the concept of Transmit Multiply and Accumulate (TMAC), leading to a system that can carry out most of the signal processing algorithm during the communication phase. The simulation resulted in a complete specification of the parameters for the different blocks that compose the system.
