Abstract: A novel reconfigurable architecture based on a multiring multiprocessor network is described. The reconfigurability of the architecture is shown to result in a low network diameter and also a low degree of connectivity for each node in the network. The mathematical properties of the network topology and the hardware for the reconfiguration switch are described. Primitive parallel operations on the network topology are described and analyzed. The architecture is shown to contain 2D mesh topologies of varying sizes and also a single one-factor of the Boolean hypercube in any given configuration. A large class of algorithms for the 2D mesh and the Boolean n-cube are shown to map efficiently on the proposed architecture without loss of performance. The architecture is shown to be well suited for a number of problems in low- and intermediate-level computer vision such as the FFT, edge detection, template matching, and the Hough transform. Timing results for typical low- and intermediate-level vision algorithms on a transputer-based prototype are presented.
INTRODUCTION
Problems in computer vision are known to be computationally intensive. Inherent limitations on the computational power of uniprocessor architectures, especially under real-time constraints, have led to the development of multiprocessor architectures for computer vision problems. From a theoretical point of view, a multiprocessor architecture should facilitate the design and implementation of efficient parallel algorithms for computer vision problems. These algorithms should optimally exploit the capabilities of the architecture. From an architectural point of view, the multiprocessor architecture should have low hardware complexity and, preferably, be composed of components that can be easily replicated, thus making it suitable for VLSI implementation. Additionally, the architecture should exhibit good scalability of computational performance, hardware complexity, and cost with increasing number of processors.
Many multiprocessor architectures have been proposed for computer vision, such as the hypercube [29], butterfly [7], systolic array [9], 2D mesh [10], [34], and the pyramid [35]. Several researchers have attempted to map computer vision problems on these architectures. However, the fixed topology of the interconnection networks describing these architectures leads to an inevitable tradeoff between the need for low network diameter and the need to limit the number of interprocessor communication links. Moreover, vision problems are especially difficult since they place different requirements on the underlying interconnection network topology and the mode or granularity of parallelism (i.e., SIMD, SPMD, or MIMD) depending on whether the problem can be classified as one of low-, intermediate-, or high-level vision. No fixed topology interconnection network with a given mode or granularity of parallelism has proven effective at tackling computer vision problems at all levels of abstraction, i.e., low-, intermediate-, and high-level vision.
Reconfigurable networks attempt to address the aforementioned tradeoff between the need for low network diameter and the need to limit the number of interprocessor communication links. In a reconfigurable network each node has a bounded degree of connectivity, but the network diameter is kept low by allowing the network to reconfigure itself into different configurations. Examples of reconfigurable multiprocessor systems include the Polymorphic Torus [23], [24], Gated-Connection Network (GCN) [18], [36], CLIP7 [13], PAPIA2 [1], Reconfigurable Bus Architecture (RBA) [26], and the Reconfigurable Mesh Architecture (RMA) [27]. A considerable amount of recent research has been devoted to showing that these reconfigurable systems are well suited for computer vision problems.
Broadly speaking, a reconfigurable system needs to satisfy the following properties in order to be considered practically feasible:
1) In each configuration the nodes in the network should have a reasonable degree of connectivity with respect to the number of processors in the network (i.e., network size). This is to ensure that the number of interprocessor communication links does not grow very rapidly with network size. It is desirable that the number of interprocessor communication links scale subquadratically (preferably linearly) with respect to the network size.
2) The network diameter should be kept low via the reconfiguration mechanism. The network diameter should scale sublinearly (preferably logarithmically) with respect to the network size.
3) The hardware for the reconfiguration mechanism (i.e., reconfiguration switch) should be of reasonable complexity.
4) The algorithmic complexity of the reconfiguration operation should be low.
In this paper, we describe a novel reconfigurable architecture which we term the Reconfigurable MultiRing Network (RMRN) [2], [5], [6]. The RMRN is shown to be highly scalable and amenable to VLSI implementation. The RMRN results in a physically compact multiprocessor system, which is an especially important criterion for computer vision on mobile and autonomous robotic platforms. We prove some important properties of the RMRN topology. As a result, we show that a broad class of algorithms for the 2D mesh and the n-cube can be mapped to the RMRN in a simple and elegant manner. We design and analyze a class of procedural primitives for the RMRN and show how these primitives can be used as building blocks for more complex parallel operations. We show that the RMRN can support major parallelization strategies (and their various combinations) such as functional parallelism (i.e., pipelining), data parallelism, and control parallelism and is effective in both the SIMD (Single Instruction Multiple Data) and the SPMD (Single Program Multiple Data) modes of parallelism. We demonstrate the usefulness of the RMRN for problems in low- and intermediate-level computer vision by considering typical operations such as the Fast Fourier Transform (FFT), convolution, template matching, and the Hough Transform. Timing results for these operations on a transputer-based prototype are also presented.
The remainder of this paper is organized as follows. In Section 2, we describe the basic topology of the RMRN and state and prove some of its important properties. In Section 3, we design some basic procedural primitives on the RMRN which can be used as building blocks for more complex parallel algorithms. In Section 4, we present algorithms on the RMRN for typical low- and intermediate-level vision operations such as the FFT, convolution, template matching, and the Hough Transform. In Section 5, we describe a transputer-based prototype RMRN and the switching hardware used, and present a performance evaluation of the prototype. In Section 6, we conclude the paper and outline future directions.
THE BASIC TOPOLOGY AND PROPERTIES OF THE RMRN
In this section, we define the RMRN topology in precise mathematical terms and also give a functional description of the hardware needed for enabling reconfigurability and providing the input/output connections.
Basic Definitions
Let RMRN$_n$ denote an RMRN with $N = 2^n$ processors. The processors are numbered $0, 1, 2, \ldots, (N - 1)$. Each processor $p$ in the RMRN is uniquely specified using an $n$-bit address $(p_0, p_1, \ldots, p_{n-1})$. The RMRN$_n$ has $n + 1$ different configurations where each configuration is denoted by config(RMRN$_n$, $i$), $0 \le i \le n$. In config(RMRN$_n$, $i$) the processors are organized into $r = 2^i$ rings $R_0, R_1, \ldots, R_{r-1}$ in a [...] fashion such that each ring has $k = 2^{n-i}$ processors. For $0 < j < r - 1$ every processor in $R_j$ is connected to a processor in $R_{j+1}$ and $R_{j-1}$. The input of each processor in $R_0$ is multiplexed between an external input channel and the output of a processor in $R_{r-1}$. Analogously, the output of each processor in $R_{r-1}$ is demultiplexed between an external output channel and the input of a processor in $R_0$.
Given that the RMRN$_n$ is in configuration config(RMRN$_n$, $i$), a processor $p$ is in ring $R_j$ iff $p \bmod r = j$. Also, processor $p$ is connected to processors $((p + r) \bmod N)$ and $((p - r) \bmod N)$ in ring $R_j$ via bidirectional links. Furthermore, we say that processor $p$ is in position $q$ in the ring $R_j$ iff $p \ \mathrm{div}\ r = q$. If $0 < j < r - 1$ then there are bidirectional links between processors $p$ and $p + 1$ and between processors $p$ and $p - 1$. [...] (Fig. 1a). In config(RMRN$_4$, 2) we have four rings $R_0$, $R_1$, $R_2$, and $R_3$ where $R_0$ consists of processors {0, 4, 8, 12}, $R_1$ consists of processors {1, 5, 9, 13}, $R_2$ consists of processors {2, 6, 10, 14}, and $R_3$ consists of processors {3, 7, 11, 15} (Fig. 1b).
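The ring membership rule above can be illustrated with a minimal sketch. The function names `rings` and `ring_neighbors` are hypothetical helpers introduced here for illustration; only the rules $p \bmod r = j$ and $(p \pm r) \bmod N$ come from the text.

```python
def rings(n, i):
    """Rings R_0..R_{r-1} of config(RMRN_n, i), each as a list of processor addresses."""
    N, r = 1 << n, 1 << i
    # Processor p lies in ring p mod r; its position in that ring is p div r.
    return [[p for p in range(N) if p % r == j] for j in range(r)]


def ring_neighbors(p, n, i):
    """Intra-ring neighbors of p in config(RMRN_n, i): (p + r) mod N and (p - r) mod N."""
    N, r = 1 << n, 1 << i
    return (p + r) % N, (p - r) % N


if __name__ == "__main__":
    for j, ring in enumerate(rings(4, 2)):
        print(f"R_{j}: {ring}")
```

For config(RMRN$_4$, 2) this reproduces the four rings listed above, each with $k = 2^{n-i} = 4$ processors.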
Connections to the External Input/Output Channels
The input and output control switches that provide connections to the external input and output channels, respectively, are straightforward to design. These switches check whether the last $i$ bits of the processor address $p$ are all 0 or all 1, respectively. In config(RMRN$_n$, $i$) with $r = 2^i$, there is a link between the external input channel and processor $p$ iff $p \bmod r = p \bmod 2^i = 0$ (i.e., $p$ is in $R_0$) and the input control signal IN is high; otherwise $p$ is connected to the output of processor $(p - 1) \bmod N$. Conversely, iff $p \bmod r = p \bmod 2^i = r - 1$ and the output control signal OUT is high, processor $p$ is connected to the external output channel; otherwise processor $p$ is connected to the input of processor $(p + 1) \bmod N$. Calculating $p \bmod 2^i$ is equivalent to examining the last $i$ bits of the $n$-bit address of processor $p$, i.e., $p \bmod 2^i = 0$ or $p \bmod 2^i = r - 1$ iff the last $i$ bits of $p$ are all 0 or all 1, respectively. Thus, the input control switch needs to check whether or not the least significant $i$ bits of the processor address are all 0, whereas the output control switch needs to check whether or not the least significant $i$ bits of the processor address are all 1. The hardware for the input/output switches is described in Section 5.1.
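The switch conditions above can be sketched in software as follows. This is only a behavioral model under the stated rules, not the hardware of Section 5.1; the function names and the string marker "external" are assumptions made for illustration.

```python
def low_bits_all_zero(p, i):
    """True iff the least significant i bits of p are all 0 (p mod 2^i == 0)."""
    mask = (1 << i) - 1
    return (p & mask) == 0


def low_bits_all_one(p, i):
    """True iff the least significant i bits of p are all 1 (p mod 2^i == 2^i - 1)."""
    mask = (1 << i) - 1
    return (p & mask) == mask


def input_source(p, n, i, IN):
    """What feeds processor p in config(RMRN_n, i): the external channel or processor p - 1."""
    if low_bits_all_zero(p, i) and IN:
        return "external"
    return (p - 1) % (1 << n)


def output_target(p, n, i, OUT):
    """Where processor p's output goes: the external channel or processor p + 1."""
    if low_bits_all_one(p, i) and OUT:
        return "external"
    return (p + 1) % (1 << n)
```

The masks make explicit why no arithmetic is needed: the modulus by a power of two reduces to inspecting the low-order address bits.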
The Reconfiguration Switching Network
A multistage switching network enables the RMRN to reconfigure itself from config(RMRN$_n$, $i$) to config(RMRN$_n$, $j$) where [...]
The function that the switching network needs to perform is related to that of a perfect shuffle network [37] and the barrel shifting network [25], but it is not identical to either of them. The switching network is built up in the same way a shuffle network is and, in fact, does contain some shuffle and unshuffle constituents. However, since it does not perform at all like a shuffle, it is unrelated functionally to the shuffle network. The relation to the barrel shifter is functional. The switching network serves to configure connections in a manner similar to how a barrel shifter would configure the bits of an operand. However, it is not constructed at all like a barrel shifter, so it is unrelated structurally to the barrel shifter. Section 5.2 details the hardware implementation of the reconfiguration switching network. The switching network bears structural and functional resemblance to the other hypercubic multistage interconnection networks such as the butterfly network [4], omega network [15], [19], [21], [30], delta network [28], baseline network [39], and the banyan network [14]. It can be shown in a straightforward manner that the interconnection links in a single stage of the butterfly network are contained within a specific configuration of the RMRN topology. This property can be proved in a manner analogous to the proof of Property 4 presented in the following subsection. Leighton [22] has shown the other commonly encountered multistage interconnection networks, i.e., omega, baseline, delta, and banyan, to be variants of the butterfly network. In [22], Leighton has presented a class of graph similarity transformations for systematically transforming the other networks into an equivalent representation in terms of the butterfly network. This implies that the various stages of the omega, baseline, delta, and banyan networks can also be shown to be contained within specific configurations of the RMRN topology.
Since the reconfiguration switching network for the RMRN is a hierarchical multistage network like the butterfly network, the number of elementary switching elements needed to implement a reconfiguration switching network for $N$ processors scales as $O(N \log_2 N)$. Although the asymptotic hardware complexity of $O(N \log_2 N)$ is not attractive for very large values of $N$, it is manageable for moderate values of $N$. The $O(N \log_2 N)$ hardware complexity is offset by the fact that the switching network is modular and hierarchically expandable and the expansion can be done using a single logical component as a building block. For example, one could use the switching network for RMRN$_4$ as a building block for larger systems. One of the most important properties of the switching network is that each processor in RMRN$_n$ needs four bidirectional channels irrespective of the value of $N = 2^n$. Since the connectivity of each node in RMRN$_n$ is fixed, the number of interprocessor communication links grows linearly with the system size $N = 2^n$. The RMRN thus satisfies an important criterion for a general purpose reconfigurable multiprocessor. In the following sections we explore some other important properties of the RMRN which underscore its viability for computer vision problems.
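To make the two growth rates concrete, the following sketch tabulates them for small systems. The constant factors are assumptions: the switch count is taken as exactly $N \log_2 N$ (the text gives only the order), and the link count assumes every bidirectional link joins two processor channels and ignores the external channels.

```python
def switch_elements(n):
    """N * log2(N) for N = 2^n; an order-of-magnitude figure with the constant factor omitted."""
    return (1 << n) * n


def interprocessor_links(n, channels=4):
    """Links when each of N = 2^n processors has `channels` channels and each link uses two of them."""
    return (1 << n) * channels // 2


if __name__ == "__main__":
    for n in range(2, 11, 2):
        print(f"N = {1 << n:5d}  switches ~ {switch_elements(n):6d}  links ~ {interprocessor_links(n):5d}")
```

Under these assumptions the ratio of switching elements to links grows only as $\log_2 N$, which is the sense in which the cost is manageable for moderate $N$.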
Basic Properties of the RMRN Topology
We state and prove some important properties of the RMRN topology. [...]

PROOF. Consider a grid with $2^i$ rows and $2^j$ columns such that $i + j = n$. Consider a position $(k, l)$ in the grid. Under a row-major mapping, the processor address at position $(k, l)$ is given by

$$p = k \cdot 2^j + l.$$

The processor $p$ is connected to its four nearest neighbors at locations $(k - 1, l)$, $(k + 1, l)$, $(k, l - 1)$, and $(k, l + 1)$ with the corresponding processors at these locations denoted by $p_n$, $p_s$, $p_e$, and $p_w$, respectively. Under the row-major mapping the processor addresses $p_n$, $p_s$, $p_e$, and $p_w$ are given by:

$$p_n = (k - 1) \cdot 2^j + l = p - 2^j, \qquad p_s = (k + 1) \cdot 2^j + l = p + 2^j,$$
$$p_e = k \cdot 2^j + (l - 1) = p - 1, \qquad p_w = k \cdot 2^j + (l + 1) = p + 1.$$

[...]
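The row-major argument of this proof can be checked numerically. The sketch below assumes a grid whose number of columns equals the number of rings, $2^c$, so that each column is one ring of config(RMRN$_n$, $c$); the parameter name `c` and the function name are hypothetical. It verifies the vertical links with wrap-around and only the interior horizontal links, since the source text for the horizontal wrap-around case is not recoverable.

```python
def mesh_links_present(n, c):
    """Check that a row-major 2^(n-c) x 2^c grid maps onto links of config(RMRN_n, c)."""
    N, cols = 1 << n, 1 << c
    rows = N // cols
    for k in range(rows):
        for l in range(cols):
            p = k * cols + l                          # row-major address of (k, l)
            up = ((k - 1) % rows) * cols + l          # vertical neighbors, toroidal
            down = ((k + 1) % rows) * cols + l
            # Vertical neighbors must be the ring neighbors (p -/+ r) mod N with r = 2^c.
            if up != (p - cols) % N or down != (p + cols) % N:
                return False
            if l + 1 < cols:
                nxt = k * cols + (l + 1)              # interior horizontal neighbor (k, l + 1)
                # It must be p + 1 and lie in the adjacent ring.
                if nxt != p + 1 or nxt % cols != p % cols + 1:
                    return False
    return True
```

Every grid position passes for all tested sizes, which matches the claim that a single configuration contains a 2D mesh.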
[...] Since $p \ \mathrm{div}\ 2^{n-m} = p'$ and $2$ [...]
[...] respectively, in config(RMRN$_n$, $i$). The statements $A(p) \rightarrow A(p(i))$ and $A(p) \leftarrow A(p(i))$ in an algorithm for the n-cube which move the contents of the A register of $p$ to the A register of $p(i)$ and vice versa along the $i$th dimension edge $(p, p(i))$ are, respectively, equivalent to the following statements on the RMRN$_n$: [...]
This completes the proof of Property 4.
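Property 4 can be confirmed exhaustively for small $n$. The sketch assumes that $p(i)$ denotes the n-cube neighbor of $p$ along dimension $i$, i.e., $p$ with bit $i$ complemented; the helper names are hypothetical.

```python
def cube_neighbor(p, i):
    """n-cube neighbor of p along dimension i: complement bit i of the address."""
    return p ^ (1 << i)


def ring_neighbors(p, n, i):
    """Intra-ring neighbors of p in config(RMRN_n, i)."""
    N, r = 1 << n, 1 << i
    return {(p + r) % N, (p - r) % N}


def dimension_edges_in_config(n, i):
    """True iff every dimension-i edge of the n-cube is a ring link of config(RMRN_n, i)."""
    return all(cube_neighbor(p, i) in ring_neighbors(p, n, i) for p in range(1 << n))
```

Complementing bit $i$ either adds or subtracts $2^i = r$, so the cube neighbor always coincides with one of the two ring neighbors.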
In summary, Property 1 states that the RMRN$_n$ can be configured into a variety of mesh topologies. In fact, we have proved that the RMRN$_n$ in config(RMRN$_n$, $i$) contains as its subset a $2^i \times 2^{n-i}$ processor 2D wrap-around (i.e., toroidal) mesh. Property 2 shows how the n-cube can be used to simulate the behavior and function of the RMRN$_n$. In fact, any given configuration of the RMRN$_n$ is a subset of the n-cube. Property 3 shows that the RMRN$_n$ possesses an elegant recursive property with regard to its structure in a manner similar to the n-cube. In general, config(RMRN$_n$, $i$) would contain $2^{n-m}$ subconfigurations of the type config(RMRN$_m$, [...]). This property is important because it enables recursive decomposition of a procedure into independent subprocedures which could then be executed concurrently on individual subconfigurations. The usefulness of this property will be brought out in Sections 3 and 4 wherein we describe certain imaging operations that are to be performed in parallel on subimages or windows within the entire image. A subimage or window will be shown to essentially define a subconfiguration within the RMRN. Property 4 shows that the edges along a given dimension of the n-cube are contained within a specific configuration of the RMRN$_n$. This result is of special significance since it allows a wide class of algorithms designed for the n-cube to be mapped onto the RMRN$_n$ without loss of performance, assuming that the overhead in reconfiguration is not excessive. Any algorithm which uses the communication links along a single dimension of the n-cube at any given point in time can be mapped to the RMRN$_n$ in $O(1)$ time.
Thus, the RMRN$_n$ in any single configuration is a proper subset of the n-cube, whereas the edges along a specific dimension of the n-cube are a proper subset of the RMRN$_n$. This implies that the RMRN$_n$ in any given configuration is more restrictive and hence less powerful than the n-cube. However, it should be noted that a vast majority of algorithms that use the n-cube in the SIMD or SPMD (and sometimes even in the MIMD) modes of parallelism use the edges of the n-cube along one specific dimension at a given time. One can therefore conclude that the RMRN topology, on account of its reconfigurability, provides the same generality in practice as the n-cube for a large class of problems. In fact, the RMRN can be seen to offer a cost-effective alternative to the n-cube architecture without substantial loss in performance.
SOME BASIC OPERATIONS AND ALGORITHMS ON THE RMRN
The RMRN that has been designed and simulated to date can be operated in the SIMD and the SPMD modes of parallelism. We show that a wide class of basic operations and algorithms can be easily implemented on RMRN$_n$. Each processor in the RMRN has four bidirectional channels which are denoted as left, right, next, and previous. Each processor has its own local memory for data storage and, in the case of the SPMD mode, also program storage.
In the SIMD mode, a single control unit broadcasts a common instruction stream to all the processors. Each processor can either execute the current instruction or ignore it altogether depending on the state of the variables in its local memory. The control unit also issues the command(s) to the reconfiguration switch to place the RMRN multiprocessor in a given configuration. In the SIMD mode, all the processors and the reconfiguration switch are constrained to operate in lock-step synchronism.
In the SPMD mode of parallelism, each processor runs its local program asynchronously on its local data. However, all the processors are synchronized at each reconfigure command using barrier synchronization. Failure to do so would cause an inconsistency if different processors assumed different states of the interconnection network, and switch-level contention if the reconfiguration switch were forced to route messages in two different reconfiguration states at the same time.
We denote intraprocessor assignments by := whereas $\rightarrow$ is used to denote interprocessor assignments. Interprocessor assignments utilize the interprocessor links in the RMRN. The term unit hop is used to denote communication between processors in the RMRN that are directly connected.
The asymptotic complexity of any algorithm for the RMRN is decided by the number of unit hops in the algorithm if each unit hop involves O(1) (i.e., constant) amount of data transfer [22] .
General Message Passing Operations

Broadcast Operation
Assume that the data in register X of processor 0 needs to be broadcast to all the other processors. Reconfigurability permits the broadcast operation to be performed in $O(n)$, that is $O(\log_2 N)$, unit hops where each unit hop entails $O(1)$ data transfer. In fact, the diameter of the RMRN$_n$ network with reconfigurability can be shown to be $n = \log_2 N$. The broadcast algorithm is shown in Fig. 2. This algorithm describes a simple broadcast wherein a data item in processor 0 is broadcast to all the other processors in the RMRN. The function MSONE($p$) denotes the position of the most significant 1 in the binary processor address $p$. For example, MSONE(5) = MSONE(0101) = 2, and MSONE(9) = MSONE(1001) = 3. We define MSONE(0) = −1. The simple broadcast operation for RMRN$_3$ is depicted in Fig. 3. We describe below each step in the broadcast operation for RMRN$_3$. For each step we list the addresses of the source and destination processor(s). Note that for step $i$, $0 \le i < n$, the corresponding configuration is config(RMRN$_n$, $i$), the destination processor(s) $p$ is (are) such that MSONE($p$) = $i$, and the corresponding source processor(s) is (are) $p - 2^i$.
Step 0: Configuration 0, MSONE(p) = 0
0 (000) → 1 (001)
Step 1: Configuration 1, MSONE(p) = 1
0 (000) → 2 (010)
1 (001) → 3 (011)
Step 2: Configuration 2, MSONE(p) = 2
0 (000) → 4 (100)
1 (001) → 5 (101)
2 (010) → 6 (110)
3 (011) → 7 (111)
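The step-by-step listing can be reproduced with a short simulation. This is a sketch rather than the algorithm of Fig. 2 itself: it assumes that at step $i$ each destination $p$ with MSONE($p$) = $i$ receives from its ring neighbor $p - 2^i$, which is inferred from the listed steps.

```python
def msone(p):
    """Position of the most significant 1 in p; msone(0) is -1 by definition."""
    return p.bit_length() - 1


def broadcast(n, value):
    """Simulate the simple broadcast on RMRN_n; returns (X registers, number of unit hops)."""
    N = 1 << n
    X = [None] * N
    X[0] = value
    hops = 0
    for i in range(n):                 # step i runs in config(RMRN_n, i), r = 2^i
        r = 1 << i
        for p in range(N):
            if msone(p) == i:
                X[p] = X[p - r]        # source is the ring neighbor (p - r) mod N
        hops += 1                      # all transfers of a step happen in parallel
    return X, hops
```

Sources at step $i$ all have addresses below $2^i$ and destinations lie in $[2^i, 2^{i+1})$, so the two sets never overlap and each step is a single unit hop.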
One can also design variants of the simple broadcast operation for image processing or computer vision applications. These operations could be listed as:
1) Image Tile Broadcast: In this operation, an image is decomposed into tiles. All the tiles are initially resident on a single processor of the RMRN. These tiles are then distributed via the broadcast operation to individual processors such that each processor has a single tile. The image tile broadcast is depicted in Fig. 4.
2) All-To-All Broadcast (Gossiping): A data item $d_i$ is resident on each of the processors $p_i$ of the RMRN$_n$. The all-to-all broadcast enables each processor $p_i$ to have a copy of all the data items $d_i$, $0 \le i < N$. The all-to-all broadcast is depicted in Fig. 5.
In the case of the all-to-all broadcast, at step $i$, where $0 \le i < n$, $2^i$ image tiles are transmitted over each of the interprocessor communication links (Fig. 5). The total time taken for the all-to-all broadcast is given by:

[...] (2)

Although the image tile broadcast and the all-to-all broadcast take the same amount of time on the RMRN, in the case of the all-to-all broadcast, unlike the image tile broadcast, each interprocessor communication link is busy all the time (Figs. 4 and 5). The complexity of the image tile broadcast and all-to-all broadcast is $O(N)$ on the RMRN$_n$. In practice, techniques like wormhole routing or cut-through routing could be used to cut down on the actual broadcast time. The algorithms for the image-tile broadcast and the all-to-all broadcast are similar to the algorithm for the simple broadcast in Fig. 2.
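The $O(N)$ bound follows from summing the per-step traffic, $\sum_{i=0}^{n-1} 2^i = N - 1$ items per link. The sketch below illustrates this with one plausible exchange pattern, assumed here and not taken from Fig. 5: at step $i$ every processor $p$ swaps everything it holds with its ring neighbor $p \oplus 2^i$ in config(RMRN$_n$, $i$).

```python
def all_to_all(n):
    """Simulate an all-to-all broadcast; returns (items held per processor, items sent per link per step)."""
    N = 1 << n
    held = [{p} for p in range(N)]          # processor p starts with data item d_p
    per_link = []
    for i in range(n):                      # step i runs in config(RMRN_n, i)
        r = 1 << i
        sent = [set(h) for h in held]       # snapshot: what each processor transmits
        per_link.append(len(sent[0]))       # every processor sends the same amount
        for p in range(N):
            held[p] |= sent[p ^ r]          # receive from the partner along bit i
    return held, per_link
```

For $n = 3$ the per-link traffic is 1, 2, and 4 items over the three steps, i.e., $N - 1 = 7$ item transfers in total.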
Combine Operation
Let $\oplus$ denote any associative binary operation such as MAX, MIN, logical AND, logical OR, sum, or product. Let each processor contain a data item in its X register. We are [...]

Combine(X, n, F);
[...]
{F is an associative binary operation}
[...]
ENDFOR;
END;

The combine operation for RMRN$_3$ is depicted in Fig. 7. We describe below each step in the combine operation for RMRN$_3$. For each step we list the addresses of the source and destination processor(s). Note that for step $i$, $0 \le i < n$, the corresponding configuration is config(RMRN$_n$, $i$), the source processor(s) $p$ is (are) such that LSONE($p$) = $i$, and the corresponding destination processor(s) is (are) $q = p - 2^i$. The destination processor $q$ evaluates the associative binary operation F(x, y) at each step.
Step 0: Configuration 0, LSONE(p) = 0
1 (001) → 0 (000)
3 (011) → 2 (010)
5 (101) → 4 (100)
7 (111) → 6 (110)
Step 1: Configuration 1, LSONE(p) = 1
2 (010) → 0 (000)
6 (110) → 4 (100)
Step 2: Configuration 2, LSONE(p) = 2
4 (100) → 0 (000)
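A simulation of the listed steps is sketched below. It assumes that LSONE($p$) is the position of the least significant 1 of $p$ (by analogy with MSONE) and that the destination of source $p$ at step $i$ is $p - 2^i$, both inferred from the step listing; the result is taken to accumulate in processor 0.

```python
import operator


def lsone(p):
    """Position of the least significant 1 in p; -1 for p == 0 (assumed by analogy with MSONE)."""
    return (p & -p).bit_length() - 1


def combine(values, op):
    """Reduce one value per processor with an associative op; the result ends up in processor 0."""
    N = len(values)                    # N must be a power of two
    n = N.bit_length() - 1
    X = list(values)
    for i in range(n):                 # step i runs in config(RMRN_n, i)
        r = 1 << i
        for p in range(N):
            if lsone(p) == i:
                q = p - r              # destination is the ring neighbor of the source
                X[q] = op(X[q], X[p])  # destination applies the operation
    return X[0]


if __name__ == "__main__":
    print(combine(list(range(8)), operator.add))
    print(combine([3, 9, 1, 7, 4, 8, 2, 6], max))
```

Because each destination always merges the block immediately to its right, operand order is preserved, so the operation needs to be associative but not commutative.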
Data Circulation
Let us consider the operation of circulating the data in the X register of each processor in RMRN$_n$ through the remaining $N - 1$ processors. We define an exchange sequence $X_n$ as follows [11]:

$$X_1 = 0, \qquad X_i = X_{i-1},\ i - 1,\ X_{i-1} \quad \text{for } 1 < i \le n. \qquad (3)$$

Note that $X_n$ is a palindromic sequence of length $2^n - 1 = N - 1$ where each integer in the sequence is in the range $[0, n - 1]$. The sequence $X_n$ is such that by successively complementing the processor address along each bit in the sequence, each data item in each processor is made to pass through all the remaining processors in the RMRN$_n$. Also, since each $X_i$ is a palindrome, $X_n$ can be computed in $O(n) = O(\log_2 N)$ time and stored in a stack of height $N - 1$ [11]. Let $f(i, j)$ denote the $j$th member in the sequence $X_i$ (from left to right). The procedure for data circulation in RMRN$_n$ is given in Fig. 8. The data circulate operation takes $O(N)$ unit hops on the RMRN$_n$. Alternatively, the data circulation operation can be accomplished using a single configuration of the RMRN$_n$, i.e., config(RMRN$_n$, 0), which consists of a single ring of $N = 2^n$ processors. The $O(N)$ algorithm for the data circulate operation which uses a single configuration of the RMRN$_n$ is outlined in Fig. 9. The advantage of the algorithm in Fig. 9 is that it involves a single call to config($\cdot$, $\cdot$), whereas the algorithm in Fig. 8 entails $O(N)$ calls. Since reconfiguration in a practical system involves some overhead, the algorithm in Fig. 9 could be expected to run faster than the one in Fig. 8, although both algorithms have an asymptotic complexity of $O(N)$.
Circulate(X, n)
BEGIN
  config(n, 0);
  FOR i := 0 TO (2**n) - 1 DO
    X(p) -> X(right(p));
END;

Fig. 9. Data circulation using a single configuration.
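Both circulation strategies can be compared with a small simulation. The recursive rule used for the exchange sequence, $X_1 = 0$ and $X_i = X_{i-1},\ i-1,\ X_{i-1}$, is an assumption consistent with the stated properties (palindromic, length $2^n - 1$, entries in $[0, n-1]$). Likewise, right($p$) = $(p + 1) \bmod N$ in the single ring is assumed. The ring version below uses $N - 1$ hops, which already suffices for every item to visit every processor.

```python
def exchange_sequence(n):
    """Assumed construction: X_1 = [0], X_i = X_{i-1} + [i-1] + X_{i-1}."""
    seq = []
    for i in range(n):
        seq = seq + [i] + seq
    return seq


def circulate_by_exchange(n):
    """For each data item, the set of processors it visits when following X_n."""
    N = 1 << n
    where = list(range(N))                     # where[d]: processor currently holding item d
    visited = [{p} for p in range(N)]
    for b in exchange_sequence(n):             # one reconfiguration + one unit hop per entry
        where = [q ^ (1 << b) for q in where]  # complement bit b of the holder's address
        for d, q in enumerate(where):
            visited[d].add(q)
    return visited


def circulate_on_ring(n):
    """For each processor, the set of items it has held after N - 1 hops in config(n, 0)."""
    N = 1 << n
    X = list(range(N))
    seen = [{x} for x in X]
    for _ in range(N - 1):
        X = [X[(p - 1) % N] for p in range(N)]  # X(p) -> X(right(p))
        for p in range(N):
            seen[p].add(X[p])
    return seen
```

The exchange-sequence walk is a Gray-code traversal of the addresses, which is why $N - 1$ single-bit moves reach all $N$ processors.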
Some Basic Imaging Operations on the RMRN
Consider a 2D image $G = \{G(i, j);\ i, j \in [0, N - 1]\}$ and an RMRN with $N^2 = 2^{2n}$ processors, where the value of the pixel $(i, j)$ is stored in register $R(p)$ of processor $p = iN + j$ (i.e., row-major mapping). We will occasionally refer to the processor $p = iN + j$ by the ordered pair $(i, j)$, keeping in mind that the row-major mapping is a bijection since $i = p \ \mathrm{div}\ N$ and $j = p \bmod N$. Similarly, we will also occasionally refer to the register $R(p)$ as $R(i, j)$.
Rotate Operation Within a Window
One of the basic imaging operations that could be performed on the RMRN is the rotate or the cyclic shift operation performed in a subimage or window within the entire image. The window essentially defines a subconfiguration within the RMRN. The rotate operation is then carried out in parallel in each window (i.e., by the processors in each subconfiguration). If we are interested in config($w$, $k$) in a window of size $W = 2^w$, then RMRN$_n$ has to be placed in config($n$, $n - w + k$). This ensures that config($w$, $k$) is a proper subconfiguration of config($n$, $n - w + k$) (Property 3). In RMRN$_n$ with $N = 2^n$ processors, a window of size $W = 2^w$ is defined by appropriately assigning values to $(n - w)$ bits in the address of the processors. The address of the processor within the window is given by appropriately masking the lower order $(n - w)$ bits in the address of the processor (Property 3). Let $p\{w\}$ denote the address of the processor after the masking operation for a window of size $W = 2^w$ has been carried out for RMRN$_n$. We can see that $p\{w\}$, right($p\{w\}$), and left($p\{w\}$) in config($w$, $k$) refer to processors $p$, right($p$), and left($p$), respectively, in config($n$, $n - w + k$) since the connectivity pattern is preserved in subconfigurations of the RMRN$_n$ (Property 3). In RMRN$_{2n}$ with $N^2 = 2^{2n}$ processors a window of size $W \times W = 2^{2w}$ is defined by appropriately assigning values to $2(n - w)$ bits in the address of the processors. The address of the processor within the window is given by appropriately masking the lower order $2(n - w)$ bits in the address of the processor. The address of the processor after the masking operation for a window of size $W \times W = 2^{2w}$ has been carried out is denoted by $p\{2w\}$. The $k$th bit in the masked address is then denoted as $p\{2w\}[k]$.
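The correspondence between window-level and network-level neighbors can be checked directly. The sketch rests on two assumptions not stated explicitly in the text: masking the lower order $(n - w)$ bits is read as discarding them, so $p\{w\} = p \ \mathrm{div}\ 2^{n-w}$, and right($p$) is taken to be the ring neighbor $(p + r) \bmod N$. All function names are hypothetical.

```python
def window_address(p, n, w):
    """p{w}: address inside the window, obtained by dropping the low (n - w) bits."""
    return p >> (n - w)


def window_id(p, n, w):
    """The (n - w) low-order bits that select which window p belongs to."""
    return p & ((1 << (n - w)) - 1)


def right(p, n, i):
    """Assumed ring neighbor in config(RMRN_n, i): (p + 2^i) mod 2^n."""
    return (p + (1 << i)) % (1 << n)


def subconfig_consistent(n, w, k):
    """True iff right() in config(n, n-w+k) stays in the window and equals right() in config(w, k)."""
    for p in range(1 << n):
        q = right(p, n, n - w + k)
        if window_id(q, n, w) != window_id(p, n, w):
            return False
        if window_address(q, n, w) != right(window_address(p, n, w), w, k):
            return False
    return True
```

A hop of $2^{n-w+k}$ leaves the low $(n - w)$ bits untouched and advances the remaining $w$ bits by $2^k$, which is exactly a hop in config($w$, $k$).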
The rotate operation within a window could be described thus:
A 1D rotate operation by $2^k$ pixels in a window of size $W^2 = 2^{2w}$ can be described as: [...]
A 2D rotate operation by $2^k$ pixels can be described as: [...]
A generic rotate operation within a window is described in Fig. 10 . The algorithms for the broadcast, combine and circulate operations within a window are similar to the ones already described and hence will not be repeated here.
Accumulate Operation
Each processor $j$ has an array $A[0 \ldots M - 1]$ of size $M$. $A[i](j)$ refers to the element $A[i]$ in processor $j$. In addition, each processor has a value in its I register. After the accumulate operation, the $M$ elements of the array $A$ in each processor $j$ are such that: [...]
w is the window size.
The accumulate operation can be performed as a minor modification of the circulate operation within a window using the following lemma from Ranka and Sahni [33] 
LOW-AND INTERMEDIATE-LEVEL VISION OPERATIONS ON THE RMRN
The primitive operations defined in the previous section can be extended to design operations for low- and intermediate-level vision on the RMRN. Typical examples of low-level vision operations are image transforms such as the Fast Fourier Transform (FFT), and convolution for edge detection, image filtering, image smoothing, and feature detection via template matching. A typical example of intermediate-level vision is feature extraction via the Hough Transform. We consider the implementation of these operations on the RMRN.
The Fast Fourier Transform
For the purpose of discussion, we have selected one of the most widely known decimation-in-frequency FFT algorithms described in [16]. First, we describe how the one-dimensional FFT algorithm can exploit the RMRN system and then address the problem of extending the algorithm to handle the two-dimensional case. A process known as the butterfly (Fig. 12a) [...] Fig. 12b where the data is considered to flow from left to right. The FFT algorithm described in Fig. 13 performs the $M$-point FFT calculations using $\log_2(M/2) = \log_2 N$ parallel data transfers (unit hops) on the RMRN$_n$ with $N = 2^n$ processors. This is a lower bound on the number of parallel data transfers required to perform an $M$-point FFT when the $M$ points are initially distributed over $M/2$ processors [16]. The number of parallel butterfly operations performed is $\log_2 M$, where each butterfly involves two complex additions and one complex multiplication in each processor. The asymptotic complexity of the algorithm is $O(\log_2(M/2)) = O(\log_2 N)$. The 2D FFT of an $N \times N$ image can be computed by first computing the 1D FFT of each row of the image followed by the 1D FFT of each column of the resulting image.
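For reference, a generic sequential radix-2 decimation-in-frequency FFT is sketched below. It is not the distributed algorithm of Fig. 13 and does not model the two-points-per-processor data layout; it only illustrates the $\log_2 M$ butterfly stages, each of which pairs elements whose indices differ in exactly one bit, the communication pattern that a single RMRN configuration supplies by Property 4.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT of a sequence whose length M is a power of two."""
    a = [complex(v) for v in x]
    M = len(a)
    span = M // 2
    while span >= 1:                              # log2(M) butterfly stages
        for start in range(0, M, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))  # twiddle factor
                u, v = a[start + k], a[start + k + span]        # indices differ in one bit
                a[start + k] = u + v              # butterfly: two complex additions ...
                a[start + k + span] = (u - v) * w  # ... and one complex multiplication
        span //= 2
    bits = M.bit_length() - 1
    out = [0j] * M
    for i in range(M):                            # DIF leaves results in bit-reversed order
        out[bit_reverse(i, bits)] = a[i]
    return out
```

A quick consistency check is to compare the output with a direct $O(M^2)$ evaluation of the DFT sum for a short input.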
Convolution for Edge and Feature Detection
The convolution operation is a fundamental operation used in both edge and feature detection. In both cases, the underlying image is convolved with a template or a set of templates and the edges or features of interest are deemed to be points in the image where the output of the convolution has a maximum. A one-dimensional convolution is given by the relation: 
where
[...] the $N \times N$ image resulting from the convolution. The two-dimensional convolution can be decomposed into a series of 1D convolutions and summations [31], [32]: [...]
For the 1D convolution operation we assume that there are N processors in the RMRN and the vector I is mapped onto the RMRN using the identity mapping, i.e., I(i) is mapped onto processor i. We also assume that there are (N/M) copies of the template T on the RMRN with one copy in each subconfiguration of M processors. Within each subconfiguration, the mapping of T is identical to the mapping of I. Since each processor has O(M) memory, the most effective strategy is to perform a data accumulate operation on the I values such that each processor has all the I values necessary to compute a single value of the output vector C. The mapping of the output vector C on the RMRN is identical to the mapping of the input vector I. Let f(n, i) denote the ith number in the palindromic exchange sequence in (3). The code for 1D convolution with O(M) memory per processor is given in Fig. 14 . The template is stored in the T register and the final result in the C register of each processor.
For the 2D convolution operation we assume that an RMRN$_{2n}$ with $N^2 = 2^{2n}$ processors is available. We assume an identity mapping for the image I, i.e., $I(i, j)$ is contained in the I register of the $(i, j)$th processor. We also assume that there are $(N/M)^2$ copies of the template T, one copy in each subconfiguration of $M^2$ processors. Within each subconfiguration the mapping of $T(i, j)$ is identical to the mapping of $I(i, j)$. The mapping of the output image $C(i, j)$ is also identical to that of $I(i, j)$. The code for the 2D convolution can be derived by decomposing it into a series of 1D convolutions and summations using (6). Both the 1D and 2D convolutions can be performed on the RMRN with the same asymptotic order of complexity as would result from implementing them on the hypercube [31], [32], [20], [12] in an SIMD or SPMD mode of parallelism. The 1D convolution can be performed in $O(M + \log_2 M + \log_2 N)$ unit hops whereas the 2D convolution can be performed in $O(M^2 + \log_2 M + \log_2 N)$ unit hops.
The Hough Transform
The Hough Transform is known to be a very important though computationally intensive operation in computer vision and image processing. The conventional Hough Transform is used to detect and extract features with well defined parametric descriptions such as lines, circles, ellipses, etc., in an image. Serial implementation of the Hough Transform on a uniprocessor architecture is computationally intensive and is not feasible for real-time applications. Thus, parallelization of the Hough Transform is imperative and has been attempted by several researchers on a variety of architectures, such as the mesh [34], [10], hypercube [33], pyramid [35], butterfly network [7], tree network [17], systolic array [9], and shared memory architecture [8]. In this section, we consider the parallelization of the Hough Transform for line detection on the RMRN. [...] $0 \le y < Y$, and E(i, j) = 1. We assume that q has been initialized to zero. The triple (x, y, q) is stored in the VOTES register of each processor. The elements of the triple are referred to as VOTES.x, VOTES.y, and VOTES.q, respectively. Let W denote the window corresponding to the subconfiguration $\mathrm{RMRN}_{n+m+1}$. The key factor in designing an efficient algorithm for Phase 1 is to realize that not all pixels E(i, j) in a given row of the $Y \times 2N$ Hough accumulator array h(x, y) contribute towards the vote count of a given value of (x, y) (i.e., a given bin in the Hough accumulator array) [10], [33]. We subdivide Phase 1 of the Hough Transform algorithm into the following subphases:
IMPLEMENTATION AND PERFORMANCE EVALUATION OF THE RMRN
We have built a prototype RMRN using a reconfigurable network of INMOS T400 transputers. The switches needed for input/output and reconfiguration were built using off-the-shelf standard TTL components and crossbar switches. The transputers were used in the SPMD mode of parallelism and were programmed using Logical Systems C. Each node in the prototype RMRN is an INMOS T400 Transputer which has a 32-bit 10 MIPS RISC processor, 2 KB of internal (i.e., on-chip) RAM, and an external memory interface for off-chip memory access. The image data is one byte per pixel and resides entirely in the external memory.
Hardware Implementation of the Input/Output Switches
As mentioned in Section 2.2, in $\mathrm{config}(\mathrm{RMRN}_n, i)$ the input control switch needs to check whether or not the least significant i bits of the processor address are all 0. Similarly, in $\mathrm{config}(\mathrm{RMRN}_n, i)$ the output control switch needs to check whether or not the least significant i bits of the processor address are all 1. The circuit for the input control switch is given in Fig. 17. The bit select signal $(s_{n-1}, s_{n-2}, \ldots, s_0)$ in Fig. 17 determines the value of i (i.e., the number of bits to be tested). In $\mathrm{config}(\mathrm{RMRN}_n, i)$ we set $s_{i-1} = s_{i-2} = \cdots = s_0 = 1$ and $s_{n-1} = s_{n-2} = \cdots = s_i = 0$. It is easy to see that the circuit in Fig. 17 determines whether the last i bits of p are all 0 or not. If the last i bits of p are all 0 and the input control signal IN is high, then the circuit in Fig. 17 connects p to the input channel; otherwise it connects p to the output of $(p - 1) \bmod N$. Similarly, the output control switch shown in Fig. 18 checks whether the last i bits of p are all 1 or not. If the last i bits of p are all 1 and the control signal OUT is high, then the output control switch connects p to the output channel; otherwise it connects p to the input of $(p + 1) \bmod N$. The setting for the bit select signal $(s_{n-1}, s_{n-2}, \ldots, s_0)$ is the same as that for the input control switch. In the current prototype of the RMRN, the input/output control switches are implemented using standard TTL components.
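The decision made by the two control switches can be modeled in software as a bit-mask test. The following Python sketch is a behavioral model only, not a description of the TTL circuits in Fig. 17 and Fig. 18. The function names and the returned tuples are hypothetical, and the bit select signal is represented as the integer mask whose low i bits are 1.

```python
def bit_select_mask(i):
    """Bit select signal for config(RMRN_n, i): s_{i-1} = ... = s_0 = 1, rest 0."""
    return (1 << i) - 1


def input_switch(p, i, n, in_high):
    """Where processor p takes its input from (model of the input control switch)."""
    num_procs = 1 << n
    if in_high and (p & bit_select_mask(i)) == 0:
        return ("input_channel", None)         # last i bits of p are all 0
    return ("ring", (p - 1) % num_procs)       # output of the ring predecessor


def output_switch(p, i, n, out_high):
    """Where processor p sends its output (model of the output control switch)."""
    num_procs = 1 << n
    mask = bit_select_mask(i)
    if out_high and (p & mask) == mask:
        return ("output_channel", None)        # last i bits of p are all 1
    return ("ring", (p + 1) % num_procs)       # input of the ring successor


if __name__ == "__main__":
    # 16 processors (n = 4) with i = 2: list the processors attached to I/O.
    ins = [p for p in range(16) if input_switch(p, 2, 4, True)[0] == "input_channel"]
    outs = [p for p in range(16) if output_switch(p, 2, 4, True)[0] == "output_channel"]
    print(ins, outs)
```

With n = 4 and i = 2 this model attaches processors 0, 4, 8, and 12 to the input channel and processors 3, 7, 11, and 15 to the output channel, i.e., one input and one output per group of $2^i$ consecutive processors.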
Hardware Implementation of the Reconfiguration Switching Network
As mentioned in Section 2.3, the reconfiguration switching network for the RMRN is a modular, multistage network similar to the butterfly network. Consider the circuit $SR_1$ shown in Fig. 19a made up of standard multiplexer elements. The truth table of $SR_1$ can be seen to be:
With $Z_0 = A_0$, $Z_1 = A_1$, and $X = A_0$, the switch $SR_1$ can be converted to $SW_1$ which is, in fact, the switching network for $\mathrm{RMRN}_1$ (Fig. 19b). We now give a recursive definition for the switch $SW_n$ that could be used in the switching network for $\mathrm{RMRN}_n$.
We assume that we have constructed a switch $SR_n$ by cascading $SR_1$ switches as shown in Fig. 20. We also assume that we have constructed a switch $SW_{n-1}$, which is used in the switching network for $\mathrm{RMRN}_{n-1}$ with $N/2 = 2^{n-1}$ processors. Fig. 21 shows how the switch $SW_n$ could be constructed from a single $SR_n$ switch and two $SW_{n-1}$ switches.
Here, SHFL(n) denotes a perfect shuffle of $2^n$ elements. The switch $SW_n$ followed by a perfect unshuffle of its $2^n$ outputs (denoted by UNSHFL(n)) constitutes the reconfiguration switch $RSW_n$ for $\mathrm{RMRN}_n$ as shown in Fig. 22. The switching network for the RMRN provides point-to-point connections without contention, and is modular, hierarchical, incrementally upgradable, and composed of fairly simple components. We have sized the $RSW_4$ switch made from TTL gates (50 packages, 80 nsec total delay) and from TTL MSI logic [38] (16 packages, 72 nsec total delay). Since increased integration makes the switching network both smaller and faster, we have examined the possibility of a custom integrated circuit to provide the switching function [3]. According to the timing analysis from the simulation of the VLSI implementation, the total delay introduced by the switching network is less than 15 nsec from input to output for a 16 processor RMRN. The speed, simplicity, and capability of the VLSI implementation of the switching network make a cost-effective reconfigurable parallel multiprocessor system possible. The reconfiguration switching network in the current prototype has been constructed using TTL MSI logic and standard crossbar switches and therefore leaves room for improvement using a custom CMOS VLSI implementation.
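For readers unfamiliar with the perfect shuffle and unshuffle permutations, the following Python sketch models them on a list of $2^n$ elements. It uses the common convention in which the element at position i moves to the position obtained by rotating the n-bit address of i left by one bit. Whether Fig. 21 and Fig. 22 use this direction or its inverse is an assumption here, and the function names `shfl` and `unshfl` are hypothetical.

```python
def _address_bits(size):
    """Return n for a list of 2**n elements, rejecting other sizes."""
    n = size.bit_length() - 1
    if size < 1 or size != 1 << n:
        raise ValueError("length must be a power of two")
    return n


def _rotate_left(i, n):
    """Rotate the n-bit address i left by one bit (n >= 1)."""
    return ((i << 1) | (i >> (n - 1))) & ((1 << n) - 1)


def shfl(items):
    """Perfect shuffle: the two halves of the list are interleaved."""
    n = _address_bits(len(items))
    if n == 0:
        return list(items)
    out = [None] * len(items)
    for i, item in enumerate(items):
        out[_rotate_left(i, n)] = item
    return out


def unshfl(items):
    """Perfect unshuffle: the inverse permutation of shfl."""
    n = _address_bits(len(items))
    if n == 0:
        return list(items)
    return [items[_rotate_left(i, n)] for i in range(len(items))]


if __name__ == "__main__":
    print(shfl(list(range(8))))    # halves [0..3] and [4..7] interleaved
    print(unshfl(list(range(8))))  # even positions first, then odd positions
```

Under this convention `unshfl(shfl(x))` returns `x` for any list whose length is a power of two.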
Performance Evaluation
We have obtained the timings for various operations on the prototype RMRN. These are tabulated in Tables 1 through 7. The tile size in the case of the image-tile broadcast and the all-to-all broadcast, and the window size in the case of the 1D rotate operation and the Hough Transform, were determined by the size of the image and the number of processors in the RMRN. For example, for a $1\mathrm{K} \times 1\mathrm{K}$ image and a 16 processor RMRN, the resulting tile (window) size was $256 \times 256 = 64\mathrm{K}$ pixels. In the case of the Hough Transform, the angle $\theta \in [0, \pi)$ was quantized in steps of 1 degree, or $\pi/180$ radians, resulting in a total of 180 quantization levels.
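The tile-size arithmetic and the angle quantization above can be checked with a short Python sketch. The voting routine is the conventional serial Hough Transform for lines with the normal parametrization $\rho = x\cos\theta + y\sin\theta$. It is included only as a reference point and is an assumption, not the phased RMRN algorithm or its accumulator layout. All function names are hypothetical.

```python
import math
from collections import Counter


def tile_side(image_side, num_procs):
    """Side of the square tile when a square image is split evenly."""
    pixels, remainder = divmod(image_side * image_side, num_procs)
    side = math.isqrt(pixels)
    if remainder or side * side != pixels:
        raise ValueError("image does not split into equal square tiles")
    return side


def theta_levels(step_degrees=1):
    """Quantized angles in [0, pi), in radians."""
    return [math.radians(d) for d in range(0, 180, step_degrees)]


def hough_votes(edges, thetas):
    """Serial voting: each edge pixel votes once per quantized angle."""
    votes = Counter()
    for y, row in enumerate(edges):
        for x, is_edge in enumerate(row):
            if is_edge != 1:
                continue
            for t, theta in enumerate(thetas):
                rho = round(x * math.cos(theta) + y * math.sin(theta))
                votes[(rho, t)] += 1
    return votes


if __name__ == "__main__":
    print(tile_side(1024, 16), len(theta_levels()))
    # A horizontal line of 5 edge pixels in row 2 of a 5 x 5 edge map.
    edges = [[1 if r == 2 else 0 for _ in range(5)] for r in range(5)]
    votes = hough_votes(edges, theta_levels())
    print(votes[(2, 90)])  # bin for rho = 2, theta = 90 degrees
```

In the example edge map, all five pixels of the horizontal line fall into the bin with $\rho = 2$ and $\theta = 90$ degrees, which is how a line shows up as a peak in the accumulator.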
We note that the timings that we have obtained from our prototype have been fairly encouraging in spite of the fact that the hardware used in the prototype is not the latest or the fastest available. We estimate that if each node of the RMRN is a T9000 transputer (the latest in the INMOS transputer series), the timings for each of the operations discussed in the previous subsections would reduce by a factor in the range [8, 20] . Operations such as the Hough Transform that are computation intensive would exhibit speedup factors close to 20 by switching over to the T9000 transputer whereas broadcast operations that are interprocessor communications intensive would exhibit speedup factors that are closer to 8. The RMRN need not be limited to a transputer-based implementation; one could also envisage an RMRN implemented using a cluster of workstations or personal computers interconnected via the reconfiguration switch shown in Fig. 22 . However, if one is interested in a physically compact multiprocessor system that could be used for computer vision on a mobile, autonomous or dextrous robotic platform, a transputer-based implementation of the RMRN seems to be an appropriate choice.
In summary, the RMRN offers an interconnection network for parallel/distributed processing for computer vision problems that is: 1) general and flexible since the compute nodes can be processors or computing platforms of any type, manufacture or make and since software written for other prototypical architectures such as the hypercube and the 2D toroidal mesh can be readily implemented on the RMRN, 2) powerful since the RMRN can implement the standard parallelization strategies such as process decomposition, data decomposition, and functional decomposition (i.e., pipelining) and any combination(s) thereof, 3) efficient since the diameter of the RMRN scales as $O(\log_2 N)$, ensuring that the algorithms designed for the RMRN exhibit good speedup characteristics and high efficiency, 4) scalable since the number of interprocessor communication links scales as O(N), thereby ensuring that the cost of the system also scales well with increasing network size, 5) cost-effective since the RMRN can be built to exploit readily available state-of-the-art processor and switching technology, and 6) upgradable since the RMRN can be readily upgraded to take advantage of advances in VLSI and processor technology and faster interprocessor communication links.
A fixed topology architecture such as a mesh or an n-cube would, in general, prove more efficient than the RMRN for those image processing and computer vision operations that are tailored for that specific architecture, since no reconfiguration overhead is entailed. The RMRN, however, on account of its reconfigurability, would provide greater flexibility in algorithm design than would a fixed topology architecture. [...] for a system of size N. In the case of the n-cube and the mesh, one is confronted with a trade-off between computational efficiency and cost effectiveness, which one can circumvent in the case of the RMRN.
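The trade-off for the two fixed topologies can be made concrete with their standard link counts and diameters. The Python sketch below uses well-known textbook formulas for the Boolean n-cube and the square 2D toroidal mesh; the function names are hypothetical. The RMRN is deliberately not modeled, because this section gives only its asymptotic behavior ($O(\log_2 N)$ diameter and O(N) links) and not the exact constants.

```python
def hypercube_metrics(n):
    """Boolean n-cube with N = 2**n nodes: each node has n neighbors."""
    num_nodes = 1 << n
    return {
        "nodes": num_nodes,
        "degree": n,
        "links": n * num_nodes // 2,  # grows as (N/2) * log2(N)
        "diameter": n,                # log2(N)
    }


def torus_metrics(side):
    """Square 2D toroidal mesh with N = side**2 nodes (side >= 3)."""
    if side < 3:
        raise ValueError("side must be at least 3 for distinct wraparound links")
    num_nodes = side * side
    return {
        "nodes": num_nodes,
        "degree": 4,
        "links": 2 * num_nodes,       # grows as 2N
        "diameter": 2 * (side // 2),  # grows as sqrt(N)
    }


if __name__ == "__main__":
    for n in (4, 6, 8, 10):
        cube = hypercube_metrics(n)
        torus = torus_metrics(1 << (n // 2))
        print(cube["nodes"], cube["links"], cube["diameter"],
              torus["links"], torus["diameter"])
```

For N = 1024 the n-cube needs 5120 links for a diameter of 10, whereas the toroidal mesh needs 2048 links but has a diameter of 32. The n-cube buys its short diameter with a link count that grows faster than N, and the mesh keeps the link count linear at the price of a larger diameter.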
CONCLUSIONS
In this paper, we have described a reconfigurable multiring network (RMRN) and highlighted some important properties of the RMRN structure. We have shown that a broad class of algorithms for the n-cube can be mapped to the $\mathrm{RMRN}_n$ in a simple and elegant manner. We have designed and analyzed a general class of procedural primitives for the RMRN and shown how these primitives can be used as building blocks for more complex operations for computer vision. The RMRN is shown to be a highly scalable architecture with an $O(\log_2 N)$ diameter and requiring O(N) communication links for a network of size N. We have shown that the RMRN can support major parallelization strategies such as functional parallelism (i.e., pipelining), data parallelism, and control parallelism, and various combinations thereof. The RMRN can be used in both the SIMD (Single Instruction Multiple Data) and the SPMD (Single Program Multiple Data) modes of parallelism. We have demonstrated the usefulness of the RMRN for computer vision by considering important operations in low- and intermediate-level vision such as the FFT, edge detection, and the Hough Transform. A prototype RMRN using T400 transputers as compute nodes was discussed. Timing results for the aforementioned low- and intermediate-level vision operations on the prototype RMRN indicate that the RMRN is a viable architecture for problems in computer vision. The RMRN results in a cost-effective, versatile and physically compact multiprocessor system amenable to VLSI implementation. Efforts are currently underway towards building the hardware for the RMRN using INMOS T9000 transputers and custom VLSI. Complex problems in intermediate- and high-level vision such as binocular stereo, surface reconstruction, image segmentation, and object recognition are also being considered.
