Abstract: This paper describes a new architecture for a cellular processor array integrated circuit, which operates in both discrete- and continuous-time domains. Asynchronous propagation networks, enabling trigger-wave operations, distance transform calculation, and long-distance inter-processor communication, are embedded in an SIMD processor array. The proposed approach results in an architecture that is efficient in implementing both local and global image processing algorithms.

I. INTRODUCTION

Massively parallel, fine-grain processor arrays can provide a powerful solution for image pre-processing applications [1, 2]. Due to the regularity of image data, the pixel-per-processor approach is particularly efficient in low-level image processing applications, providing high performance, small area and low power consumption. This approach has led to the development of so-called 'vision chips', which combine a processing element (PE) with a photodetector [3, 4] [...] architectures implementing asynchronous image processing have been proposed [5, 6], but while these solutions provide a significant performance increase compared to synchronous analogues, their practical application is limited. There are also a few general-purpose 'smart-sensor' systems that include some form of global, continuous-time operation [4, 7, 8].

In this paper we introduce a novel VLSI cellular processor array architecture with mixed asynchronous/synchronous operation. Each processing cell of the Asynchronous-Synchronous Processor Array (ASPA) has a universal synchronous digital architecture, and also enables generation and processing of asynchronous trigger-waves, continuous-time execution of a distance transform, global feature extraction, etc. When the ASPA operates synchronously, it resembles an SIMD array, since every processing cell executes the same instruction. However, it is possible to reconfigure the circuits in such a way that the ASPA will behave like a combinatorial circuit. Thanks to [...]

[...] Bi-directional shifting provides a flexible tool for shift-and-add multiplication and shift-and-subtract division operations. All the memory is based on dynamic latches (Fig. 1c) with a shared pre-charged bus (i.e. level sensitive), so there is no edge-sensitive logic within a PE. A simple ALU and a FLAGS register (Zero, Carry, Propagation) enable a range of arithmetic and logic operations to be processed. By using a subtractor in the ALU we can execute both subtraction and addition without any additional hardware cost (if an adder were used, an additional inversion would be required). The result of an arithmetic operation is stored in the accumulator (ACC). The inputs of the ALU are connected to the local write bus (LWB) and the ACC.

The propagation chain block (Fig. 2a) is used for performing global trigger-wave propagations across the array [9]. The E_latch is used for storing the marker. Once the stored value is '0', a PE will not allow further signals to propagate through and its output will remain '0'. The P_Latch is assigned to store the propagation result for further processing. The minimum propagation routine consists of the following steps: 1) define the propagation space; 2) define the propagation start points; 3) set the 'start' bit to '1'; 4) load the result into P_Latch, which can then be used as a flag.

C. Local and Asynchronous Communication

The communication between PEs is organised via the shared LRB. Every PE can access the LRB of its four neighbours, so the result of any local operation can be directly transferred to the neighbourhood, eliminating intermediate local operations (i.e. both synchronous and continuous-time transfers are possible). The data in the LWB is managed by the BC. Essentially the BC is a multiplexer, which enables the following set of transfer operations: GPR -> GPR, Global I/O Bus -> GPR, Neighbour -> GPR. Apart from multiplexing, the BC performs a NAND function, which is used to continuously compute a minimum value among nearest neighbours. The circuit for a single bit calculation is shown in Fig. 2b. This 'minimum' function is used in the continuous-time distance transform operation, as explained in the following section.

III. OPERATION

A. Distance transform

There are a number of algorithms that involve a distance [...] loaded with 0xFF (background) or 0x00 (object). The shift register is set for "left shift" operation with the carry-in signal set to ''. The multiplexing unit of the BC is set to provide a neighbour value instead of the value from the LRB.

Let us consider the processing of a border pixel. After loading initial values to the PEs, the register 'E' is read, therefore inverted values appear on the LRB accessible by neighbours. After that, the 'Load' signal for register 'E' in all object pixels goes high, configuring the register as a transition gate (with the only possible transition '0' to '1'). In this case the left-shifted value calculated by the BC passes through the register 'E' to the LRB and correspondingly to its nearest neighbours (Fig. 3) and so on. Since distance values are represented according to Fig. 4a, the NAND operator in the BC calculates the minimum distance value and the left shift performs an increment function. As the result of this propagation, we obtain the exact Manhattan distances for pixels within the eight-pixel distance range (Fig. 4b). This process is robust and is not sensitive to any non-uniformity of propagation velocity.

In order to calculate distances to all the object pixels, it is necessary to perform [R/8]+1 iterations, where R is the object radius. A single distance transformation requires 5 instructions (subject to initial conditions).

It is also possible to calculate the global minimum and to transfer data between distant pixels with a single instruction (a process similar to the one described above, but without shifting). In general, the architecture supports chaining pixels in various ways and performing other global operations such as those described in [7].

B. Matrix Multiplication

The advantage of using the proposed architecture for synchronous operations is demonstrated on matrix multiplication, the basis of linear discrete image transforms. The ability to perform such an operation will significantly simplify real-time compression procedures.

Assume we have an array P, and we need to derive a new matrix B as a product of P and another matrix A, i.e. B = P x A, where

$$B_{ij} = \sum_{k=1}^{n} P_{ik} \times A_{kj}$$

Let us refer to a pixel associated with a PE at grid location (i,j) as $P_{ij}$, and to a register A (B, C, ...) of this PE as $A_{ij}$ ($B_{ij}$, $C_{ij}$, ...). Then to get the product of two matrices it is necessary to accomplish the three following steps: 1) load to $A_{ij}$ the corresponding value of matrix $A^T$; 2) perform multiplication $C_{ij} = P_{ij} \times A_{ij}$; 3) n times perform $B_{ij} = C_{ij} + C_{i,j+1}$; $C_{ij} = C_{i,j+1}$. As a result of this operation we obtain the 1st column of matrix B. In total, we have n multiplications, and $n^2$ additions and register-transfer operations.

In the case of a transform such as the DCT with 8x8 image blocks, a significant performance increase will be achieved because all the operations are performed in parallel. Moreover, with an advanced addressing mechanism (e.g. the address xxxx000 will indicate all the columns out of 128 that are divisible by 8) [10] it is possible to have a fixed number of operations (for the above case only 8 multiplications) irrespective of image size. The multiplication, based on shift-and-add operations, is implemented by 40 instructions. The calculation of a single row takes 64 instructions. So the 8x8 linear discrete transform will require 1152 instructions (including loading two matrices) for any image size, using only 3 memory elements (8-bit registers).

C. Watershed Transform

The watershed transform is an example of a complex, computationally expensive image processing algorithm that comprises different morphological operations (binary and grey-scale geodesic reconstruction, regional minima extraction, etc.). Watershed segmentation helps to extract objects of interest from the background on grey-scale images.

Essentially, watershed segmentation consists of two main procedures: basin marking and flooding. For efficient implementation we have decomposed both of them into simple wave-propagation operations. Basin marking, which is based on a grey-scale reconstruction, has been implemented as a set of binary reconstructions (254 single-step iterations). Thanks to a propagation unit in the ASPA, it is possible to perform binary reconstruction over an entire image in a single iteration.

After the basin marking operation, an additional step is required. During this step it is necessary to find, for every PE, the neighbour(s) with the minimum brightness value. Then an appropriate mask is applied to the propagation chain. As a result of this operation, utilising the local autonomy feature, the propagating signal in each cell will be accepted only from the lowest neighbours. This implements the required "drop effect", i.e. if we leave a drop of water on a physical landscape, at each point it will roll downhill along the maximum decline, finally reaching the basin's origin.

After this procedure, propagation is initiated at the basin markers (one at a time). Due to the configuration of the processing array, propagation will only spread within a single basin and along the watershed lines. After each propagation operation the border pixels will form a watershed line. By sequentially performing such an operation, basin by basin, the watershed line of the entire image is obtained. By using asynchronous propagations, the number of operations involved in the flooding is reduced and is proportional to the number of basins.

In our simulation experiments, the complete processing of a 64x64 image with nine areas of interest required around 2900 instructions on the ASPA. Assuming the iteration time to be 100 ns, the total processing time is 0.29 ms. The simulation results are presented in Fig. 5.

The operation of the ASPA has been verified by simulations and FPGA implementation. Currently, we are working on a full-custom VLSI circuit design.
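As an illustration of the distance transform propagation of Section III.A, the following Python sketch emulates the settling of the asynchronous network in software. It is not the circuit itself and rests on several assumptions that are not stated in the text: the distance code of Fig. 4a is taken to be 0xFF shifted left d times (so 0xFF means distance 0 and 0x00 means eight or more), the left shift is assumed to shift in '0', the NAND on inverted bus values is modelled as a bitwise OR of the neighbour values, and neighbours outside the array are assumed to contribute nothing. The function names are hypothetical.

```python
NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # four nearest neighbours


def encode(d):
    """Assumed 8-bit code for distance d: 0xFF shifted left d times."""
    return (0xFF << min(d, 8)) & 0xFF


def decode(value):
    """Count the cleared low-order bits; a result of 8 means 'eight or more'."""
    d = 0
    while d < 8 and not (value >> d) & 1:
        d += 1
    return d


def distance_transform(obj):
    """obj: rows of 0/1 flags (1 = object pixel).

    Returns Manhattan distances to the nearest background pixel,
    saturated at 8, as obtained in one propagation pass.
    """
    h, w = len(obj), len(obj[0])
    # Initial load: 0xFF for background, 0x00 for object pixels.
    val = [[0x00 if obj[y][x] else 0xFF for x in range(w)] for y in range(h)]
    changed = True
    while changed:  # repeat until the emulated network has settled
        changed = False
        for y in range(h):
            for x in range(w):
                if not obj[y][x]:
                    continue  # background pixels keep 0xFF
                nearest = 0
                for dy, dx in NEIGHBOURS:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        nearest |= val[ny][nx]  # OR selects the minimum distance
                # Left shift increments the distance; bits only go from 0 to 1.
                new = val[y][x] | ((nearest << 1) & 0xFF)
                if new != val[y][x]:
                    val[y][x] = new
                    changed = True
    return [[decode(v) for v in row] for row in val]
```

Because every update can only set bits, the final values do not depend on the order in which pixels are visited, which mirrors the statement that the process is insensitive to non-uniform propagation velocity. Pixels further than seven steps from the background remain at 0x00 in this model, which is why further iterations are needed for larger objects.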

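The three matrix multiplication steps of Section III.B can also be emulated in software. The Python sketch below is one possible reading of those steps rather than the authors' instruction sequence: it assumes that step 3 accumulates the shifted products into register B, that zeros are shifted in at the array edge, and that n - 1 shifts follow an initial copy of C into B (with zero fill, n shifts give the same result). The function and variable names are hypothetical.

```python
def first_column_of_product(P, A):
    """Emulate steps 1) to 3) on an n x n array to obtain column 0 of B = P x A."""
    n = len(P)
    # Step 1: load the transposed matrix value into register A of each PE,
    # so that the PE in column j holds A[j][0] (assumed broadcast to all rows).
    reg_a = [[A[j][0] for j in range(n)] for _ in range(n)]
    # Step 2: one array-wide multiplication, C_ij = P_ij * A_ij.
    reg_c = [[P[i][j] * reg_a[i][j] for j in range(n)] for i in range(n)]
    # Step 3: shift C one column to the left and accumulate into B.
    reg_b = [row[:] for row in reg_c]
    for _ in range(n - 1):
        reg_c = [row[1:] + [0] for row in reg_c]  # zero fill at the edge (assumption)
        reg_b = [[reg_b[i][j] + reg_c[i][j] for j in range(n)] for i in range(n)]
    # The leftmost PE of each row now holds one element of the first column of B.
    return [reg_b[i][0] for i in range(n)]
```

For example, with P = [[1, 2], [3, 4]] and A = [[5, 6], [7, 8]] the function returns [19, 43], the first column of P x A. Every list comprehension over the whole array stands for a single array-wide instruction, which is where the parallel speed-up described in the text comes from.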