I. INTRODUCTION
The Cellular Neural Network (CNN) was invented in 1988 [1]. The nonlinear CNN paradigm was developed by Roska and Chua [2]. In a nonlinear CNN the template values are defined by a nonlinear function of the input variables (nonlinear B template) and the output variables (nonlinear A template). Most image processing problems can be solved using linear CNN templates, but several complex tasks can be solved more easily with nonlinear templates, such as histogram generation [3], Hamming distance computation [3] and grayscale skeletonization [4]. Nonlinear templates are particularly useful when only the feedforward template is nonlinear, because less time is required to solve the problem.
The linear CNN has several implementations: software simulation, emulated digital VLSI (ASIC/FPGA) and analog VLSI. Although several studies have demonstrated the effectiveness of nonlinear CNN templates, they are not supported by recent CNN implementations. An analog VLSI implementation of a programmable nonlinear CNN is difficult with present-day technology, so the only option is software simulation, which performs even worse than in the linear case. Emulated digital implementations can emulate linear CNN arrays very efficiently [5], and the flexibility of an FPGA implementation can additionally be exploited to handle nonlinear CNN templates.
In the next section two classes of nonlinear templates are introduced. After a brief introduction of the Falcon emulated digital CNN-UM architecture, the modifications required to run nonlinear CNN templates on this architecture are described.
II. TYPES OF NONLINEARITY
In a nonlinear CNN some template values are defined by a nonlinear function. The value depends on the difference between the currently processed cell and the cell that belongs to the actual template element. By surveying the CNN Template Library, two types of nonlinear templates were identified: zero order and first order nonlinear templates. The classification of the different nonlinear templates is shown in Table I.
In zero order nonlinear templates the nonlinear function contains horizontal segments only, as shown in Fig. 1. Zero order nonlinear templates are used, for example, in grayscale contour detection [6].
In first order nonlinear templates the nonlinearity consists of straight line segments, as shown in Fig. 1. Such templates are used, for example, in global maximum finding [7]. Naturally, there are also nonlinear templates whose values are defined by two or more different nonlinearities, for example grayscale diagonal line detection [8].
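The two classes can be illustrated with a minimal Python sketch. It assumes that the argument `d` is the difference between the neighbouring cell and the currently processed cell (the sign convention is an assumption), and the function names, breakpoints and levels are hypothetical values chosen for illustration, not templates taken from the CNN Template Library.

```python
from bisect import bisect_right

def zero_order(breakpoints, levels):
    """Piecewise-constant nonlinearity (horizontal segments only).

    breakpoints: sorted segment boundaries (hypothetical values)
    levels: one template value per segment, len(breakpoints) + 1 entries
    """
    def f(d):
        # bisect_right returns the index of the segment containing d
        return levels[bisect_right(breakpoints, d)]
    return f

def first_order(xs, ys):
    """Piecewise-linear nonlinearity through the points (xs[i], ys[i]).

    Outside [xs[0], xs[-1]] the value is held constant (an assumption).
    """
    def f(d):
        if d <= xs[0]:
            return ys[0]
        if d >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, d) - 1
        t = (d - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])
    return f

# Hypothetical example nonlinearities
step = zero_order([-0.5, 0.5], [0.0, 1.0, 0.0])
ramp = first_order([-1.0, 0.0, 1.0], [0.0, 1.0, 0.0])
```

In both cases the template element is evaluated per neighbour from the cell difference, so a zero order element needs only a lookup, whereas a first order element also needs a multiplication by the segment slope.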
IV. THE MODIFIED FALCON ARCHITECTURE
To run nonlinear templates on the emulated digital CNN-UM architecture, the original Falcon structure was modified as follows. The Memory, the Mixer and the ALU are the same as in the original Falcon processor, but the Template memory was changed to handle nonlinear templates and their nonlinearity. These changes are introduced in the next two subsections. As noted above, the actual nonlinear template value is defined by a nonlinear function of the difference between the currently processed cell and the cell that belongs to the actual template element. Therefore not only the data from the Memory but also the data from the Mixer are required to determine the actual template value, so the outputs of the Mixer were also connected to the Template memory.
V. ZERO ORDER NONLINEAR TEMPLATE MEMORY
In the original Falcon processor the template operations are performed row-wise. In the Template memory one RAM belongs to every column of the template. The actual template values are read from the RAMs and transmitted to the input of the ALU. In the Template memory of the modified Falcon processor a RAM also belongs to every column of the template, but the RAMs store the values of the segments of the nonlinearity. Thus for n segments the RAMs are n times larger. In the example the nonlinearity was partitioned into four segments as shown in Fig. 3. The segments are loaded into the RAMs as shown in Table II. This ordering is used because the two most significant bits (MSBs) of the difference mentioned above are used to address the RAMs. In general the number of segments is a power of two, and with more than four segments more MSBs are required.
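The MSB-based addressing can be sketched in Python as follows. The sketch assumes that the difference is a two's-complement integer of `width` bits; the function names, the 8-bit width and the segment values are hypothetical, and the actual loading order is the one given in Table II, which is not reproduced here.

```python
def segment_index(diff, width=8, addr_bits=2):
    """Return the top addr_bits of the width-bit two's-complement diff."""
    raw = diff & ((1 << width) - 1)      # reinterpret as unsigned bit pattern
    return raw >> (width - addr_bits)    # keep only the MSBs

def load_ram(levels, addr_bits=2):
    """Reorder segment values into RAM address order.

    levels lists the segment values from the most negative to the most
    positive difference range. With two's-complement MSB addressing the
    non-negative ranges occupy the low addresses, so the list is rotated.
    """
    n = 1 << addr_bits
    assert len(levels) == n, "number of segments must be a power of two"
    half = n // 2
    return levels[half:] + levels[:half]

def lookup(ram, diff, width=8, addr_bits=2):
    """Zero order template value for a given cell difference."""
    return ram[segment_index(diff, width, addr_bits)]

# Hypothetical four-segment nonlinearity, values listed low to high
ram = load_ram([10, 20, 30, 40])
```

Under these assumptions no comparator is needed to find the active segment: the address bits are taken directly from the difference, which is why the segments cannot simply be stored in ascending order.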
functional elements. The Parallel Port interface connects the processor to the parallel port and is used to configure it. The VGA interface connects a monitor as the output of the processor, and the Camera interface connects a camera as its input. These interfaces are part of the Platform Abstraction Layer (PAL) API from Celoxica. The working system on the FPGA is shown in Fig. 6.
B. Area requirement
Similarly to the original Falcon processor, the precision and the number of fraction bits of the state value, the constant value (which is determined by the feedforward equation) and the template value can be configured in the zero and first order nonlinear Falcon processors. The large number of possible configurations makes it impossible to investigate all cases, so we only illustrate how the area requirement changes with the bit width and precision of the data. In this test the precision of g_ij is set to 10 bits with 6 fraction bits. The precision of the template values is set to 9 bits with 7 fraction bits, which seems to be enough for most applications. The area requirements of the zero (A) and first (B) order processors for different state widths are shown in Fig. 7.
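To make the chosen bit widths concrete, the following Python sketch computes the representable range and resolution of a fixed-point format. It assumes a signed two's-complement encoding, which is not stated in the text; the function names are hypothetical.

```python
def fixed_point_format(total_bits, frac_bits):
    """Return (min, max, step) of a signed two's-complement fixed-point
    format with total_bits bits, frac_bits of which are fraction bits."""
    step = 2.0 ** -frac_bits
    lo = -(1 << (total_bits - 1)) * step
    hi = ((1 << (total_bits - 1)) - 1) * step
    return lo, hi, step

def quantize(x, total_bits, frac_bits):
    """Round x to the nearest representable value, saturating at the ends."""
    lo, hi, step = fixed_point_format(total_bits, frac_bits)
    q = round(x / step) * step
    return min(max(q, lo), hi)

template_fmt = fixed_point_format(9, 7)    # template values in the test
constant_fmt = fixed_point_format(10, 6)   # g_ij in the test
```

Under this assumption the 9-bit template format covers [-2, 2) in steps of 1/128 and the 10-bit g_ij format covers [-8, 8) in steps of 1/64.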
The investigation shows that the general resource requirement of the zero and first order Falcon processors depends linearly on the precision of the state value. The requirement for dedicated resources also depends on the precision of the state value. If the precision of the template values is changed, similar behavior can be observed. Next, we investigated how many zero (A) and first (B) order Falcon processors can be placed on different FPGAs when the precision of the state value is increased from 4 to 36 bits. The results are shown in Fig. 8. On the Virtex-II 3000 FPGA nineteen zero order processors can be implemented. The maximum clock frequency of the processor is limited to 133 MHz by the speed of the memories on our RC203 card. The cumulative computing performance of these processors is 842 million cell iterations/s, which is 421 times faster than the software simulation. If the largest currently available FPGA, the Virtex-4 SX55, is used, 39 processors can be implemented and a 400 MHz clock frequency can be achieved. The resulting computing performance of the processor array is 5.2 billion cell iterations/s, which is 2600 times faster than the software simulation.
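The reported throughput figures can be checked with a short calculation. The sketch below assumes that each processor completes one cell update every three clock cycles (one cycle per row of a 3x3 template); this value is not stated in the text and is inferred only because it reproduces the reported numbers. The function name is hypothetical.

```python
def array_throughput(n_processors, clock_hz, cycles_per_cell=3):
    """Cell iterations per second of a processor array.

    cycles_per_cell=3 is an assumption (row-wise processing of a 3x3
    template), chosen because it matches the figures in the text.
    """
    return n_processors * clock_hz / cycles_per_cell

virtex2 = array_throughput(19, 133e6)   # Virtex-II 3000, zero order
virtex4 = array_throughput(39, 400e6)   # Virtex-4 SX55, zero order

# Software baseline implied by the reported 421-fold speedup
software = 842e6 / 421

print(f"Virtex-II 3000: {virtex2 / 1e6:.0f} M cell iterations/s")
print(f"Virtex-4 SX55:  {virtex4 / 1e9:.1f} G cell iterations/s")
print(f"Implied software rate: {software / 1e6:.0f} M cell iterations/s")
```

Both speedup figures correspond to the same implied software rate of about 2 million cell iterations/s, so the two reported results are mutually consistent.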
Implementation of the first order nonlinear Falcon processor requires additional flip-flop and LUT resources compared to the zero order case. However, these elements can be packed more densely, and in our test case fewer slices are required. This makes it possible to implement 16 processors on the Virtex-II 3000 and 40 processors on the Virtex-4 SX55 FPGA. The maximum clock frequency of the first order nonlinear processor is not affected by the higher order nonlinearity, so 133 MHz and 400 MHz can be achieved on the Virtex-II 3000 and Virtex-4 SX55, respectively. The computing performance of the first order processors is similar to the zero order case. However, the performance of the conventional microprocessor is lower in this case, so our processors provide a higher speedup. Finally, the computing performance of the zero (A) and first (B) order Falcon processors was compared to the software simulation on the Virtex-II 3000 and the Virtex-4 SX55. This is shown in Fig. 9.
Speed of the processor nonlinearity. To compute the zero and first order template elements, the template memory was significantly modified. In the zero order case the area requirement of the new processor does not increase significantly, while in the first order case additional dedicated resources (MULT18x18) are required. The new architectures were implemented on our RC203 prototyping board, and a 421-fold speedup was measured compared to an AMD Athlon64 3200+ microprocessor. Using the largest currently available FPGA, a 2600-fold speedup can be achieved, which makes our architecture ideal for real-time image processing tasks.
