Abstract-This paper describes a new mesh-connected SIMD architecture, called a Sliding Memory Plane (SliM) Array Processor. On SliM, the inter-processing element (inter-PE) communication, using the sliding memory plane, and the data input/output (I/O), using two I/O planes, can occur without interrupting the PE's, which greatly diminishes the communication and I/O overhead. SliM is unique in its ability to overlap inter-PE communication with computation, regardless of window size and shape and without using a coprocessor or an on-chip DMA controller. In addition, SliM uses four rather than eight links per PE to provide eight-way connectivity using the by-passing path, thus reducing the diagonal communication time and eliminating the necessity of diagonal links. The realization of these virtual links for diagonal communication without instruction overhead is another novel feature of SliM. An alternative method to achieve diagonal communication is to use two sliding memory plane shifts that can be overlapped with computation. The by-passing path can also accomplish nonlocal communication and broadcast. This paper illustrates the unique advantages of these inter-PE and diagonal communication schemes and proposes new parallel algorithms for image processing on SliM that have a zero or an O(1) communication complexity. With these salient features, SliM shows a significant performance improvement, illustrated with several tasks including the DARPA low level vision benchmarks.
edge storage element (i-1), lower edge storage element (i+1), and the CLIP7 processor (i) [18]. This external memory read/write overhead cannot be overlapped with computation, which creates communication overhead through the external RAM. The CLIP7 coprocessor is not complete in itself as it requires external devices (latch, transceiver, and buffer) to isolate or connect the various data buses. In addition, this scheme provides only a 3 by 3 neighborhood of data. Since the processor and the coprocessor can handle the computation and data I/O separately, I/O can be overlapped with computation.
AMT DAP 510 and 610 have two processors: a 1-b processor for communication and computation, and an 8-b coprocessor for computation only [19]. Inter-PE communication can occur through the 1-b processor during coprocessor computation. Since the 1-b processor and the 8-b coprocessor reside on different chips, communication between the two can occur through the array memory. This causes external memory read/write overhead which may degrade the performance. The 1-b processor executes simple computations, such as Boolean operations and integer addition, since there is no benefit in routing them to the external memory and coprocessor [19]. Inter-PE communication cannot occur during such computations. Above all, every PE on DAP requires the coprocessor for partial overlapping.
SliM requires neither a coprocessor nor an on-chip Direct Memory Access (DMA) controller to overlap inter-PE communication without interrupting PE's. During computation, the contents of all register cells on the sliding memory plane can be shifted simultaneously and in the same direction to the neighboring cells. This inter-PE communication overlapping regardless of the window size and shape is unique to SliM.
Many mesh-connected SIMD architectures mentioned above are not capable of I/O overlapping. As a result, I/O overhead may degrade their performance. In contrast, SliM has two I/O planes that provide I/O overlapping without interrupting PE's. Since communication, I/O, and computation occur simultaneously, communication and I/O overhead can be overlapped with computation, significantly diminishing communication and I/O overhead.
Moreover, architectures such as CLIP4, CLIP7, BAP, and NTT have six or eight communication links per PE to reduce the overhead for diagonal communication. The realization of virtual links for diagonal communication using only four links per PE without instruction overhead is another unique feature of SliM. Although the XNet three-state interconnect on MasPar [30] (similar to the BLITZEN [31] grid network) has eight connectivity using four links, each connection requires 3 instruction cycles for setup alone. In addition, MasPar requires 1 instruction cycle for communicating each bit. Thus, MasPar has inter-PE communication overhead for all communication. BLITZEN also needs 1 instruction cycle for any 1-b inter-PE communication [31] , as does MPP. In contrast, virtual diagonal communication using the by-passing path on SliM needs several nanoseconds for gate delay, which can be overlapped with computation. An alternative method to achieve diagonal communication is to use two sliding memory plane shifts that also can be overlapped with computation. Therefore, four rather than eight links are sufficient for eight connectivity, greatly reducing the diagonal communication time and eliminating the necessity of diagonal links.
The by-passing path can also perform nonlocal communication and broadcast. SliM provides various types of communication (local communication, nonlocal communication, and broadcast). Each PE in SliM can operate separately based on three autonomies (operation autonomy, addressing autonomy, and connection autonomy). As in CLIP7A, DAP, and MasPar, SliM uses bit-serial communication and bit-parallel computation, which saves VLSI area and reduces the number of pins.
Fang et al. [35] describe the inter-PE communication requirements for a general 2-D convolution on a typical mesh-connected architecture. In their paper, a general 2-D convolution algorithm on a mesh-connected architecture has an O(W^2) communication [38], which is illustrated by several examples of image processing tasks, including the DARPA low level vision benchmarks.
The remainder of this paper is organized as follows. Section II introduces the SliM architecture and compares its features to existing mesh-connected architectures. The section addresses virtual connectivity, various types of communication, and local autonomies. Section III establishes the analytical model of SliM for performance evaluation and compares it to existing mesh-connected architectures. Section IV discusses the applications to image processing tasks. Section V describes the performance evaluation based on timing analysis using the analytical model and presents computation and communication complexities for image processing tasks. Finally, Section VI contains concluding remarks.
II. THE ARCHITECTURE OF SliM
This section describes the architecture of SliM and presents the overall system and structure of a PE. The section then addresses the issues of connectivity, communication, and autonomy. Fig. 1 shows the logical diagram of SliM. The processor plane consists of N x N processors and the total number of PE's is N^2. The sliding memory plane S consists of N x N shift registers, connected by a grid network. The S plane, instead of the processor plane, forms a mesh topology. The top row of the sliding memory plane is connected to the bottom row to form a wrap-around mesh connection scheme. Similarly, the leftmost column is connected to the rightmost column (torus interconnection
A. The Overall System

B. A Processing Element
The processing element (PE) shown in Fig. 3 consists of an ALU (Arithmetic Logic Unit) providing Boolean functions as well as arithmetic functions, registers, multiplexers (MUX's), a demultiplexer (DMUX), and a 4 x 2 switching element (SW). We can use two three-state drivers instead of a MUX. In practice, SW can be realized by two 4 x 1 MUX's with the same input lines and different output lines. The shift register s is an element of the sliding memory plane S shown in Fig. 1. SOLOMON, ILLIAC IV, CLIP6, etc., on the other hand, are based on bit-parallel communication and computation. MasPar, CLIP7A, and DAP are based on bit-serial communication and bit-parallel computation. Since most operations in image processing are performed on grey-level rather than binary data, bit-parallel computation is better suited to image processing [16]-[18]. Bit-parallel links between PE's occupy a large portion of the VLSI area and contribute to an increase in the number of output pins on a VLSI chip. Hence, the link between neighboring PE's on SliM is one bit wide, so that more VLSI area can be saved and the number of pins can be reduced. In Fig. 3, the thick lines represent 8-b parallel datapaths while the thin lines represent bit-serial datapaths. Each register cell contains one pixel and the 8-b ALU operates in bit-parallel. Thus, SliM operates in bit-serial communication and bit-parallel computation as in MasPar, CLIP7A, and DAP.
For fast sliding, shifting, and I/O operations, two clock rates are used: one for normal operations and the other for sliding, shifting, and I/O operations. Sliding and I/O operations are between-chip operations, and the faster clock rate is completely dependent on the delay between chips. As discussed, sliding and I/O operations are basically the data transfer between shift registers. Hence, it may be possible to realize 8-b inter-PE communication for the sliding operation, and 8-b data transfer for I/O can be completed within one instruction cycle. In this case, the faster clock rate is eight times faster than that for instruction. If the delay between chips is not fast enough, then a slower clock should be used for sliding and I/O operations.
The number of transistors in the 16-b PE of CLIP7A is approximately twice as large as that in the 8-b PE of SliM. 6800 transistors are used in the PE of CLIP7A [18], while 6000 transistors are used for eight bit-serial PE's in one chip on MPP [14]. The current VLSI technology achieves over one million transistors on a single chip [39]. To be conservative, it may be assumed that the 8-b PE of SliM requires approximately the same number of transistors as the 16-b PE of CLIP7A or as eight 1-b PE's of MPP. Therefore, it would be possible to build a number of the PE's of SliM on one VLSI chip.
C. Connectivity and Communication
As shown in 1) receiving mode: one of the neighboring pixels is received in the sliding register s. 2) by-passing mode: one of the neighboring pixels is passed to another neighboring PE without receiving in s. 3) receiving/by-passing mode: one of the neighboring pixels is received in s and this pixel or another neighboring pixel is passed to one of the neighboring PE's. Even though three connection modes are provided, only the receiving mode is necessary for the sliding operation. The other two connection modes are used for different functions which will be described later. The connection modes are determined by the status of the condition register C in each PE. The processor control subunit can globally change the status of C in every PE to achieve centralized control. On the other hand, each PE can locally change the status of C and thus control connectivity independently to achieve distributed control. The distributed control strategy for each PE depends on data and algorithms which will be discussed later.
Using three connection modes makes it possible to produce the virtual communication links for diagonal neighboring PE's, shown in Fig. 5 . For example, if the west PE sets the bypassing mode for the center PE, and if the southwest PE sets the receiving mode for the west PE, then the virtual link from center to southwest is realized. The dotted arrow line is referred to as a virtual link for diagonal communication.
Similarly, other virtual links can be realized. However, two neighboring PE's in the same row (or column) cannot send data to their diagonal PE's at the same time because of communication link conflicts. All virtual diagonal links for all PE's (NE, NW, SE, SW) can be realized simultaneously by two sliding memory plane shifts that can be overlapped with computation. Even with four communication links, eight connectivity (four physical links and four virtual links) can be achieved by adding the by-passing path. These virtual links are especially advantageous for computations along the border pixels of regions. This type of computation, which is commonly used in image processing [40] , will be discussed in detail later.
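The two-shift route to a diagonal neighbor can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the hardware: the `slide` helper, the list-of-lists representation of the S plane, and the convention that row 0 is the north edge are all introduced here for the example.

```python
def slide(plane, direction):
    """Return a copy of `plane` with every cell moved one step in `direction`.

    Assumptions of this sketch: row 0 is the north edge and the edges
    wrap around, modeling the wrap-around mesh of the S plane.
    """
    rows, cols = len(plane), len(plane[0])
    dr, dc = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}[direction]
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[(r + dr) % rows][(c + dc) % cols] = plane[r][c]
    return out


plane = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# One slide to the south followed by one slide to the west delivers
# every pixel to its southwest neighbor at the same time.
shifted = slide(slide(plane, "S"), "W")
print(shifted[2][0])  # the southwest cell now holds the former center pixel
```

In this model the center pixel (5) ends up in the southwest cell after two slides, which is the effect attributed to the virtual diagonal links when they are realized for all PE's at once.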
SliM employs three different communication schemes: local communication between nearest neighbors, nonlocal communication between nonnearest neighbors, and broadcast. As shown in the previous section, concurrent local communication in the same direction can be realized by using the sliding memory plane (sliding operation). Any two PE's can communicate by using three connection modes. For instance, the left uppermost PE can communicate with the right lowermost PE by forming a virtual communication link. The PE's located right above or right below the diagonal direction set the bypassing mode. Then, the nonlocal communication link between the left uppermost PE and the right lowermost PE is formed. Hence, using the connection modes accomplishes any nonlocal communication.
In addition, SliM provides three different broadcast schemes (shown in Fig. 6): row-broadcast, column-broadcast, and broadcast. For row-broadcast, every PE in a row, except the one issuing broadcast, sets the receiving/by-passing mode. In each PE, the broadcast data is simultaneously stored in the s register and passed to a neighboring PE. All by-passing paths in the row form a bus via SW's and MUX's, and each s register in a PE is connected to the formed bus, thus achieving the row-broadcast scheme. Similarly, column-broadcast can be realized.
For broadcast, all the PE's in the same row, except the one issuing broadcast, set the receiving/by-passing mode as in row-broadcast. Every PE in the rows above sets the receiving mode for its south PE and the by-passing mode for its north PE. Every PE in the rows below sets the receiving mode for its north PE and the by-passing mode for its south PE. Then, all the PE's can receive the broadcast information. The control strategy for these communication schemes is determined by the processor control subunit or by each PE according to the data and algorithms (data or algorithm driven control strategy). Because propagation and gate delays for nonlocal communication and broadcast are not negligible [9], these communications may require several clock cycles.

These times are functions of the program length (L) for a specific algorithm, image size (I^2), and the number of PE's employed (N^2). If the image size is larger than the array processor, then the size of a subimage becomes N^2. The total number of subimages (n_s) is ⌈I^2/N^2⌉. For simplicity, n_s is assumed to be one; in other words, the size of SliM is larger than or equal to the size of the input image. If n_s is larger than one, the following times are multiplied by n_s. The total time to process a whole image can be expressed by
D. Three Autonomies
where n, is the number of bits to be transferred to neighboring PE's for a specific algorithm, and t, is the communication time for one bit between neighboring PE's. Note that t , on SliM is about 8 times faster than existing architectures because of SliM's separate fast clock for sliding operations. Since the width of SliM's communication link is one bit, the number of bits to be transferred must be considered instead of the number of bytes. Therefore, the total time for a whole image is again expressed by
Since SliM has a buffering capability, I/O can be overlapped with processing. In this case, the total processing time reduces to
In addition, SliM is capable of inter-PE communication during computation, thus, communication can be overlapped with computation. However, for some tasks inter-PE communication cannot be fully overlapped and some portions of Tpp may still exist. This further reduces the total processing time.
The total time TA can be expressed as follows:
where p is the nonoverlapped portion of Tpp with Tcp.
As (7) shows, SliM's total processing time can be expressed by one of three components, a small portion of Tpp, or both. In general, the computation time is larger than the I/O time or the inter-PE communication time. Hence, on SliM the total processing time is composed of only pure computation time, with little or no communication time. In contrast, the total processing time for most bit-serial mesh-connected SIMD architectures is expressed by (5), with larger ni and longer t,.
IV. APPLICATIONS TO IMAGE PROCESSING
Because it performs window operations with little or no communication overhead, SliM is well suited to image processing, where excessive data exchange occurs between neighboring PE's. For instance, 2-D convolution, median filtering, average value, template matching, zero-crossing, etc., are suitable applications for the proposed architecture. Edge detection can be performed by using 2-D convolution algorithms. The Gradient, Laplacian, difference of Gaussians, Laplacian of Gaussian, and Sobel operators are some examples. After the convolution of an image with these operators, SliM can efficiently detect edges.
This section presents parallel algorithms for a general 2-D convolution and the Sobel operator and demonstrates how communication overhead for convolution is entirely overlapped. In tasks such as the Sobel operator, communication overhead cannot be entirely overlapped. In many image processing tasks, computation takes place along the border pixels of regions [40] . The K-curvature, 1-D Gaussian smoothing along the border, and perimeter and area calculations are a few examples. The virtual communication links are advantageous for these tasks. The 1-D Gaussian smoothing along the border is illustrated to show the suitability of SliM's virtual links for these tasks. On SliM, these tasks can be implemented without communication overhead. More details will be discussed later. 
A. 2-D Convolution
A parallel 2-D convolution algorithm [37] is highly suited for implementation on SliM. Fig. 7 shows a 3 by 3 convolution window. The arrow represents the direction of sliding operations. If the direction starts at the center pixel and ends at the southwest pixel in a counterclockwise direction, the direction is: O → S → E → N → N → W → W → S → S. The sequence of pixels to be accessed in each PE is opposite to the direction of sliding operations: its own pixel, north, northwest, west, southwest, south, southeast, east, and northeast pixels. Every PE can receive its neighboring pixels in this order. The direction of sliding operations is like a Hamiltonian path that starts at any node and visits every node only once. After sliding into all neighbors within the window, every PE can get its final result concurrently.
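The relation between the sliding directions and the order in which a PE sees its neighbors can be checked with a short simulation. This is a sketch under assumptions made only for the example: the `slide` helper, the 5 by 5 plane, and the coordinate convention (row 0 at the north edge) are hypothetical and are not part of the SliM design.

```python
def slide(plane, direction):
    """Move every cell one step in `direction`, with wrap-around edges."""
    rows, cols = len(plane), len(plane[0])
    dr, dc = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}[direction]
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[(r + dr) % rows][(c + dc) % cols] = plane[r][c]
    return out


# (row, col) offsets of a source pixel relative to the observing PE.
NAMES = {
    (0, 0): "own", (-1, 0): "north", (-1, -1): "northwest",
    (0, -1): "west", (1, -1): "southwest", (1, 0): "south",
    (1, 1): "southeast", (0, 1): "east", (-1, 1): "northeast",
}

SIZE = 5
# Each cell initially stores its own coordinates so its origin can be traced.
plane = [[(r, c) for c in range(SIZE)] for r in range(SIZE)]
pe = (2, 2)  # an interior PE, so wrap-around never reaches it


def source_name(cell):
    """Name the position, relative to `pe`, that `cell` originally came from."""
    return NAMES[(cell[0] - pe[0], cell[1] - pe[1])]


seen = [source_name(plane[pe[0]][pe[1]])]
for direction in "SENNWWSS":  # the sliding sequence given in the text
    plane = slide(plane, direction)
    seen.append(source_name(plane[pe[0]][pe[1]]))

print(seen)
```

Under these assumptions the eight slides bring all nine window pixels to the PE exactly once, in the access order stated in the text.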
The equation of the general 2-D convolution is expressed as follows:
$$\hat{I}_{ij} = \sum_{k=0}^{W-1} \sum_{l=0}^{W-1} w_{kl}\, I_{i+k,\,j+l}$$

where $I_{ij}$ is the value at the input pixel (i, j); $w_{kl}$ is a window coefficient; and $\hat{I}_{ij}$ is the value at the output pixel (i, j).
Since the time for a sliding operation is much less than the time for one multiplication and one addition, a sliding operation can be completed within the computation time. If SliM is employed, W^2 multiplications and 2(W^2 - 1) additions are needed, regardless of the image size. The inter-PE communication overhead is completely overlapped, and the I/O overhead can also be overlapped if the computation time is larger than the I/O time. Thus, the computation complexity is O(W^2) and the communication complexity is zero. Therefore, the total processing time for a whole image consists of only the computation time. Section V presents a more detailed algorithm. Since the direction of sliding operations is programmable and flexible, any shape and any size of a window can be employed on SliM with little or no communication overhead. In contrast, existing mesh-connected SIMD architectures may suffer from performance degradation [3]-[5], particularly when the window size is larger than 3 by 3 or the shape of the window is not square.

B. The Sobel Operator

Fig. 8 shows the Sobel operator. While the direction of sliding operations for X-magnitude passes through the center and its north and south coefficients, no computation is required because of zero coefficients in the window. Thus, the communication overhead cannot be entirely overlapped. However, if other necessary operations (e.g., storing intermediate results) are executed during the nonoverlapped communication time, the communication overhead can be further reduced. Section V describes this case in detail.
C. The One-Dimensional Gaussian Smoothing Along the Border
SliM's virtual communication links can be efficiently used for the computation along the border pixels of regions. Fig.  9 shows the border of a region including virtual links and physical links. The 1-D Gaussian smoothing along the border is completed by summing the products of each pixel along the border with each coefficient. The route for this computation, shown in Fig. 9 , is formed by the connection modes.
One possible routing strategy is determined by the following procedure:
Each PE checks its eight nearest neighbors using sliding operations to determine if it has a border pixel (called a border PE) or not (a nonborder PE).

After setting the route, the sliding operation is used in the clockwise or counterclockwise direction along the border within one cycle. Hence, the distributed control strategy is determined by the data. As in the 2-D convolution, during a multiplication

Our performance evaluation of SliM is based on the following conservative assumptions. First, memory access time (1 byte), which is 100 ns, is defined as a nominal instruction cycle time. In practice, the time for 1-b memory access is equal to the time for 1-byte memory access. Only one memory access, with no operation, can be executed in one cycle. Second, two operations at most can be merged into one instruction if no conflict exists, and this instruction can be executed within one cycle. Thus, t_i is 100 ns. Third
V. ALGORITHM COMPLEXITIES AND EXPECTED TIMES FOR IMAGE PROCESSING TASKS
This section discusses the algorithm complexity, including the computation complexity and the communication complexity for image processing on SliM. The section then presents estimates of the expected times for image processing tasks. In many papers, the algorithm complexity on mesh-connected architectures is based on the overall time complexity, including the computation complexity and the communication complexity. Fang et al. describe the communication complexity of the generalized 2-D convolution on array processors [35]. In unit time, each PE can send or receive a word of data from each of its neighbors [35], [41]. Also, standard arithmetic and Boolean operations can be executed in unit time. In other words, an O(n) complexity means that an algorithm requires at most C1 * n inter-PE communication steps and C2 * n instructions, where C1 and C2 are positive constants.
Fang et al. [35] deal with the communication complexity and the computation complexity separately. In their paper, a general 2-D convolution algorithm on a mesh-connected architecture requires an O(W^2) communication complexity

The assumptions used for our performance evaluation of SliM are based upon the figures for MPP. In MPP, one memory access and several operations can be merged into one instruction which can be executed within one instruction cycle (100 ns) [12]-[14]. The actual memory access time is about 50 ns [13]. Moreover, 1-b communication between neighboring PE's takes 100 ns, namely, one instruction cycle [32]. In the following, the term cycle means the instruction cycle.
words, an 8-b shift is the maximum for one cycle. Fifth, most operations, such as addition, shift, compare, etc., are assumed to be executed within one cycle, except multiplication and division. The multiplication of two 8-b integers requires eight additions and seven 1-b shifts. One addition and one 1-b shift can be combined into an instruction, which then can be executed in one cycle. If two operands are from registers, and the result is stored into registers, 8 cycles are needed for a multiplication. But if two operands are from memory, and the result is stored into memory, 12 cycles are needed. Sixth, to simplify the performance evaluation, n_s is assumed to be 1; in other words, the size of SliM is equal to that of the image (512 x 512). If n_s is not 1, the total processing time becomes the processing time for a subimage multiplied by n_s. Since a set of image processing tasks is subsequently performed on the same image, the I/O time is assumed to be less than the total computation time and is overlapped.
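The count of eight additions and seven 1-b shifts for an 8-b multiplication can be reproduced with a plain shift-and-add loop. The following is a minimal sketch; the function name and the step counters are assumptions made for illustration and do not describe the actual SliM microcode.

```python
def shift_add_multiply(a, b, bits=8):
    """Multiply two unsigned `bits`-wide integers by shift-and-add.

    Returns (product, additions, shifts). One addition step is counted
    per multiplier bit and one 1-b shift between consecutive bits, so
    `bits` additions and `bits - 1` shifts are counted in total.
    """
    assert 0 <= a < (1 << bits) and 0 <= b < (1 << bits)
    product = additions = shifts = 0
    for i in range(bits):
        # The multiplicand is added only when the current multiplier bit
        # is set, but the step is counted either way, since a SIMD PE
        # spends the cycle regardless of its local data.
        if (b >> i) & 1:
            product += a
        additions += 1
        if i < bits - 1:
            a <<= 1  # 1-b shift of the multiplicand
            shifts += 1
    return product, additions, shifts


product, additions, shifts = shift_add_multiply(255, 255)
# If each addition is paired with a shift in one instruction, the number
# of cycles equals the larger of the two counts.
cycles = max(additions, shifts)
print(product, additions, shifts, cycles)
```

With one addition and one shift merged per instruction, the eight addition steps determine the 8-cycle figure used for register operands.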
The method for the performance evaluation is as follows. Register-transfer-level algorithms are written first. Then, the number of instructions required for these algorithms is counted. Based on the analytical model of SliM, expected times are estimated. Since sliding (in other words, inter-PE communication) can occur during computation, sliding operation and computing operation statements are put on the same line. In the following description of the algorithm, wherever more than one statement occurs on the same line, these statements can be considered to overlap. As described, on completing the sliding operation, the content in s is transferred in parallel into the latch l to prevent the data in s from conflicting with the new data coming from a neighbor. The step for transferring the content in s into l is omitted.
A. A Convolution Algorithm with a Zero Communication Complexity
Assume that window coefficients are in the broadcast instructions. Fig. 10 shows the general algorithm, in which s represents a shift register on the sliding memory plane S, and T is a set of registers in a PE. As Fig. 10 shows, W^2 multiplications and 2(W^2 - 1) additions are required, and inter-PE communication is entirely overlapped. The portion of Tpp that does not overlap with Tcp, that is, p, is zero. Thus, the algorithm requires an O(W^2) computation complexity and a zero communication complexity. If the window size is 3 by 3, and the number of bits per pixel is 8, the algorithm requires 9*n_m + 16 cycles, where n_m is the number of cycles for the multiplication of two 8-b integers. Since the register set T is used instead of memory, n_m is assumed to be 8 (8 shift/additions). Thus, 88 cycles are required, n_i is 88, and t_i is assumed to be 100 ns. From (3) and (7), the total time estimated is 8.8 μs.
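The cycle count and time estimate above follow directly from the operation counts. The short sketch below restates that arithmetic; the function name and parameter names are hypothetical, and the formula simply combines the W^2 multiplications and 2(W^2 - 1) additions quoted in the text with one cycle per addition.

```python
def convolution_estimate(window=3, n_m=8, t_i_ns=100):
    """Estimate cycles and time for a window x window convolution.

    cycles = W^2 * n_m + 2 * (W^2 - 1), where n_m is the number of cycles
    per multiplication and each addition is assumed to take one cycle.
    Communication is taken to be fully overlapped, so it adds nothing.
    """
    w2 = window * window
    cycles = w2 * n_m + 2 * (w2 - 1)
    time_us = cycles * t_i_ns / 1000.0
    return cycles, time_us


cycles, time_us = convolution_estimate()
print(cycles, time_us)  # 88 cycles at 100 ns per cycle
```

Changing `n_m` to 12 in this sketch gives the corresponding figure for the case in which operands and results are kept in memory rather than in the register set.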
B. A Sobel Operator Algorithm with an O(1) Communication Complexity
Some algorithms cannot be implemented without the communication overhead. The Sobel operator is one example. Fig. 7 shows the Sobel operator and Fig. 11 shows the algorithm. Assume that its own pixel is initially in the s register. As shown in Fig. 7, when the direction of sliding operations passes the center, north, and south, the computation is not required. Thus, other necessary steps can be executed. For example, moving its own pixel into memory and storing its pixel into T for an X-magnitude computation can be executed in the first two statements. If no step needs to be executed, the communication overhead cannot be overlapped. In the algorithm shown in Fig. 11, the sixth statement for each magnitude is not overlapped.

As Fig. 12 shows, during comparison and addition, each pixel in the s register can be passed into one of its neighboring s registers and the communication overhead is entirely overlapped. Each iteration within each for loop requires two operations. Thus, this algorithm requires an O(N) computation complexity and a zero communication complexity. If N is assumed to be 512 and the maximum intensity value is assumed to be 511, this task requires about 1539 cycles and takes about 153.9 μs. The detailed counting is omitted.
D. An Average Value Algorithm with a Zero Communication Complexity
The average value algorithm requires 2(W^2 - 1) additions and 1 division. The algorithm described below is similar to the 2-D convolution algorithm, except for the calculation. Since the inter-PE communication can be overlapped with computation, this algorithm requires an O(W^2) computation complexity and a zero communication complexity. If a 3 by 3 window is used, this task requires 17 + n_d cycles, where n_d is the number of cycles for a division.
In the worst case, in other words, when the intensity values of all pixels within the window are 255, the expected maximum value of T is 2295, which can be expressed by 12 b. Thus, the division may consist of 12 additions and 12 shifts if the nonrestoring division technique is applied, since one addition and one 1-b shift can be executed in one cycle. This algorithm requires about 29 cycles and takes about 2.9 μs.
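The worst-case figures for the average value task can be reproduced with a textbook nonrestoring divider. The sketch below is an illustration only: the function name, the step counter, and the use of Python integers for the partial remainder are assumptions, and the constant 17 is taken from the text rather than derived.

```python
def nonrestoring_divide(dividend, divisor, bits=12):
    """Unsigned nonrestoring division of a `bits`-wide dividend.

    Returns (quotient, remainder, steps). Each step performs one 1-b
    shift and one addition or subtraction, so `bits` steps are needed.
    """
    assert divisor > 0 and 0 <= dividend < (1 << bits)
    mask = (1 << bits) - 1
    a, q, steps = 0, dividend, 0
    for _ in range(bits):
        was_negative = a < 0
        msb = (q >> (bits - 1)) & 1
        a = 2 * a + msb            # shift the (A, Q) pair left by one bit
        q = (q << 1) & mask
        # Add back if the previous partial remainder was negative,
        # otherwise subtract; no separate restoring step is issued.
        a = a + divisor if was_negative else a - divisor
        if a >= 0:
            q |= 1                 # quotient bit is 1 when A is non-negative
        steps += 1
    if a < 0:
        a += divisor               # final correction of the remainder
    return q, a, steps


WINDOW, MAX_PIXEL = 3, 255
worst_sum = WINDOW * WINDOW * MAX_PIXEL        # largest possible value of T
sum_bits = worst_sum.bit_length()              # bits needed to hold it
average, remainder, n_d = nonrestoring_divide(worst_sum, WINDOW * WINDOW,
                                              bits=sum_bits)
total_cycles = 17 + n_d                        # 17 is the count given in the text
print(worst_sum, sum_bits, average, total_cycles, total_cycles * 100 / 1000)
```

For the all-255 window the sketch returns an average of 255 after 12 steps, consistent with the 29-cycle estimate at 100 ns per cycle.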
E. A Median Filtering Algorithm with a Zero Communication Complexity
In general, median filtering requires a sorting algorithm after all neighboring pixels are collected. Since the sorting algorithm itself may take a long time, and since sorting and collecting neighbors cannot occur simultaneously, median filtering is
C. A Histogramming Algorithm with a Zero Communication Complexity
Histogramming, which is not a window operation, is a time-consuming task on MPP due to the inter-PE communication overhead [32]. The parallel algorithm for histogramming on SliM is based on the algorithm proposed in [32], which consists of two main steps: histogramming columns (voting) and totalling rows (summing). In the first step, every pixel is passed to the north (or the south) cyclically using the wraparound feature. Whenever the gray-level of the received pixel is the same as the row number of a PE, the counter in that PE is incremented. After the voting, every value of a counter is passed to the west for summing. The leftmost column of PE's sums the value of a counter and, finally, contains the histogram for the image.
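The voting and summing steps can be modeled sequentially to see why the leftmost column ends up holding the histogram. This is a sketch under assumptions made for the example: the function name and data layout are hypothetical, the image is N by N with one pixel per PE, and gray levels are assumed not to exceed N - 1 so that every level has a matching row.

```python
def mesh_histogram(image):
    """Histogram an N x N image by column voting and row summing.

    Assumes one pixel per PE and gray levels in the range 0..N-1.
    Returns a list whose entry r is the number of pixels with gray
    level r, as accumulated by the PE in row r of the leftmost column.
    """
    n = len(image)
    s = [row[:] for row in image]            # contents of the s registers
    counter = [[0] * n for _ in range(n)]    # one counter per PE

    # Voting: after N cyclic slides every PE has seen its whole column.
    for _ in range(n):
        for r in range(n):
            for c in range(n):
                if s[r][c] == r:             # gray level matches row number
                    counter[r][c] += 1
        s = s[1:] + s[:1]                    # slide north with wraparound

    # Summing: counters move west; the leftmost PE of each row accumulates.
    total = [counter[r][0] for r in range(n)]
    moving = counter
    for _ in range(n - 1):
        moving = [row[1:] + [0] for row in moving]
        for r in range(n):
            total[r] += moving[r][0]
    return total


image = [[0, 1, 1, 3],
         [2, 2, 0, 3],
         [3, 1, 0, 0],
         [1, 1, 2, 3]]
print(mesh_histogram(image))
```

The model runs the PE's one after another, whereas on the array all PE's in a step act at once, so only the data movement, not the timing, is represented here.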
The differences between the Kushner et al. algorithm on MPP [32] and our algorithm are that inter-PE communication overhead can be overlapped with computation and that bit-parallel computation can be performed on SliM. Assume that each PE has two numbers which represent its row and column numbers based on a mesh topology. N is the number of PE's in a row or a column. Fig. 12 shows the algorithm on SliM, where T1 and T2 are registers in the register set T. Each pixel is assumed to be in the s register. T1 contains the row number of the PE if the row number is less than or equal to the maximum intensity value. T2 acts as a counter.

a time-consuming task. In contrast, on SliM, collecting and sorting procedures are not required in the new proposed algorithm.
The ordered, singly linked list shown in Fig. 14 is used for median filtering, where pixel i represents the ith received pixel. The order in the list is pixel 1 ≤ pixel 3 ≤ pixel 4 ≤ pixel 2. Each PE has its own list. After shifting the sliding memory plane, each PE can access its neighboring pixel and insert it into its list in order. While this insertion occurs, the sliding memory plane can be shifted. Thus, collecting can be overlapped with inserting.
After each PE completes its list, the median value in each list can be easily found in the middle of the list simultaneously. Thus, the total processing time consists of only the time needed for making the list. In contrast with a typically used median filtering algorithm that consists of collecting and sorting procedures, a sorting procedure is not required, and a collecting procedure (inter-PE communication) is invisible in the new algorithm. The worst case is one in which the pixel is greater than the pixels in the list, every time it is received from a neighbor. Thus, the pixel just received should be compared with all the pixels in the list. In this case, the new algorithm requires an O(W^2) computation complexity and a zero communication complexity. For a 3 by 3 window and 8 b/pixel, the estimated time is 11.2 μs in the worst case on SliM. In contrast, a typical median filtering algorithm using collecting and sorting requires about 16.6 μs in the worst case on SliM. Further details of this calculation are omitted.
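The insertion-based median can be sketched for a single PE as follows. This is a minimal illustration, assuming a Python list in place of the singly linked list and a linear scan for the insertion point; the function names and the comparison counter are hypothetical and are not taken from the algorithm figures.

```python
def insert_in_order(ordered, pixel):
    """Insert `pixel` into the ascending list `ordered` by a linear scan.

    Returns the number of comparisons used, mirroring a walk along a
    singly linked list from its head.
    """
    comparisons = 0
    pos = 0
    while pos < len(ordered):
        comparisons += 1
        if pixel <= ordered[pos]:
            break
        pos += 1
    ordered.insert(pos, pixel)
    return comparisons


def median_by_insertion(pixels):
    """Build the ordered list pixel by pixel and read the middle entry.

    `pixels` is the stream a PE receives while the plane slides; its
    length is assumed to be odd (W^2 for an odd window size W).
    """
    ordered, total = [], 0
    for pixel in pixels:
        total += insert_in_order(ordered, pixel)
    return ordered[len(ordered) // 2], total


print(median_by_insertion([7, 2, 9, 4, 4, 1, 8, 3, 6]))
print(median_by_insertion(list(range(1, 10))))  # worst case: ascending input
```

In the ascending case every new pixel is compared with the whole list, which is the worst case described in the text; in the model each insertion would take place while the next slide is in progress.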
In summary, Table I presents the computation and communication complexities and the expected times for the algorithms previously mentioned.
F. The DARPA Low Level Vision Benchmarks
The estimated performance figures of SliM for the DARPA low level image understanding benchmarks [40], presented in [3], show significant improvements compared with those of existing architectures [3], [38]. Tables II and III, can be overlapped with the computation. Second, bit-parallel processing is faster than bit-serial processing. Third, in most existing mesh-connected SIMD machines, PE's store pixels into memory after pixels are received. During processing, the pixels stored in memory must be accessed for computation. This memory access overhead is significant. In contrast, SliM's S register plane contains all pixels, which can be transferred to neighbors during computation and directly accessed by the ALU. Thus, the overhead for memory access can be reduced.
Fourth, a set of registers (T) can be effectively used in place of memory for storing intermediate results, further reducing the overhead for memory access. In summary, SliM is a flexible and reconfigurable architecture that has unique features which can alleviate the drawbacks of existing mesh-connected SIMD architectures. Performance degradation due to those drawbacks is minimized to allow higher throughput. The concept of the sliding memory plane may also be applicable to other special purpose VLSI architectures. Future research will investigate more detailed VLSI implementation issues and the applicability of the sliding memory plane idea for special purpose architectures.
