Abstract: A matrix $A$ of size $m \times n$ containing items from a totally ordered universe is termed monotone if, for every $i$, $j$, $1 \le i < j \le m$, the minimum value in row $j$ lies below or to the right of the minimum in row $i$. Monotone matrices, and variations thereof, are known to have many important applications. In particular, the problem of computing the row minima of a monotone matrix is of import in image processing, pattern recognition, text editing, facility location, optimization, and VLSI. Our first main contribution is to exhibit a number of nontrivial lower bounds for matrix search problems. These lower bound results hold for arbitrary, infinite, two-dimensional reconfigurable meshes as long as the input is pretiled onto a contiguous $n \times n$ submesh thereof. Specifically, in this context, we show that every algorithm that solves the problem of computing the minimum of an $n \times n$ matrix must take $\Omega(\log\log n)$ time. The same lower bound is shown to hold for the problem of computing the minimum in each row of an arbitrary $n \times n$ matrix. As a byproduct, we obtain an $\Omega(\log\log n)$ time lower bound for the problem of selecting the $k$th smallest item in a monotone matrix, thus extending the best previously known lower bound for selection on the reconfigurable mesh. Finally, we show an $\Omega(\sqrt{\log\log n})$ time lower bound for the problem of computing the row minima of a monotone $n \times n$ matrix. … for some constant $\epsilon$ $(0 < \epsilon \le 1)$, our algorithm runs in $O(\log\log n)$ time.
INTRODUCTION
Recently, in an attempt to reduce its large computational diameter, the mesh-connected architecture has been enhanced with various broadcasting capabilities. Some of these involve endowing the mesh with static buses, that is, buses whose configuration is fixed and cannot change; more recently, researchers have proposed augmenting the mesh architecture with reconfigurable broadcasting buses: These are high-speed buses whose configuration can be dynamically changed in response to specific processing needs. Examples include the bus automaton [25], [26], the reconfigurable mesh [21], the mesh with bypass capability [12], the content addressable array processor [31], the reconfigurable network [7], the polymorphic processor array [16], [20], the reconfigurable bus with shift switching [15], the gated-connection network [27], [28], the PARBS [30], and the polymorphic torus network [13], [17]. We refer the interested reader to the comprehensive survey paper of Nakano [22].
Among these architectures, the reconfigurable mesh and its variants have turned out to be valuable theoretical models that allowed researchers to fathom the power of reconfiguration and its relationship with the PRAM. From a practical standpoint, however, the reconfigurable mesh and its variants [21], [30] omit important properties of physical architectures and, consequently, do not provide a complete and precise characterization of real systems. Moreover, these models are so flexible and powerful that it has turned out to be impossible to derive from them high-level programming models that reflect their flexibility and intrinsic power [16], [20]. Worse yet, it has recently been shown that the reconfigurable mesh and the PARBS do not scale and, as a consequence, do not immediately support virtual parallelism [18], [19].
Motivated by the goal of developing algorithms in a scalable model of computation, we adopt a restricted version of the reconfigurable mesh that we call the basic reconfigurable mesh (BRM, for short). Our model is derived from the Polymorphic Processor Array (PPA) proposed in [16], [20]: The BRM shares with the PPA all the restrictions on the reconfigurability and the directionality of the bus system. The BRM differs from the PPA in that we do not allow torus connections. As a result, the BRM is potentially weaker than the PPA. It is very important to stress that the programming model developed in [16], [20] for the PPA also applies to the BRM. In particular, all the broadcast primitives developed in [16], [20], with the exception of those using torus connections, can be inherited by the BRM. In fact, all the algorithms developed in this paper could have been just as easily written using the extended C language primitives of [16], [20]. We opted for specifying our algorithm in a more conventional fashion only to make the presentation easier to follow.
Consider a two-dimensional array (i.e., a matrix) $A$ of size $m \times n$ with items from a totally ordered universe. Matrix $A$ is termed monotone if, for every $i$, $j$, $1 \le i < j \le m$, the smallest value in row $j$ lies below or to the right of the smallest value in row $i$, as illustrated in Table 1, where the row minima are highlighted. A matrix $A$ is said to be totally monotone if every submatrix of $A$ is monotone. The concepts of monotone and totally monotone matrices may seem artificial and contrived at first. Rather surprisingly, however, these concepts have found dozens of applications to problems in optimization, VLSI design, facility location problems, string editing, pattern recognition, computational geometry, and cellular system design, among many others. The reader is referred to [1], [2], [3], [4], [5], [6], where many of these applications are discussed in detail.
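The definition of monotonicity can be made concrete with a short check. The following Python sketch is an illustration only: the function names and the example matrix are ours, and we assume that ties within a row are broken in favor of the leftmost minimum, a convention the text above does not spell out.

```python
def leftmost_row_minima(matrix):
    """Return, for each row, the column index of its leftmost minimum."""
    return [min(range(len(row)), key=lambda j: row[j]) for row in matrix]


def is_monotone(matrix):
    """A matrix is monotone if the columns of the row minima never move left
    as we go down the rows (assumption: leftmost minimum on ties)."""
    cols = leftmost_row_minima(matrix)
    return all(a <= b for a, b in zip(cols, cols[1:]))


# Hypothetical example: the row minima sit in columns 1, 1, 2, 3.
example = [
    [3, 1, 4, 9],
    [5, 2, 6, 8],
    [7, 6, 0, 5],
    [9, 8, 7, 1],
]

# A matrix that is not monotone: the minimum moves left in the second row.
counterexample = [
    [3, 1],
    [0, 5],
]
```

Checking total monotonicity would require testing every submatrix, which this sketch does not attempt.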
One of the recurring problems involving matrix searching is referred to as row-minima computation [6]. In particular, Aggarwal et al. [2] showed that the task of computing the row-minima of an $m \times n$ monotone matrix has a sequential lower bound of $\Omega(n \log m)$. They also showed that this lower bound is tight by exhibiting a sequential algorithm for the row-minima problem running in $O(n \log m)$ time. In the case where the matrix is totally monotone, the sequential complexity is reduced to $\Theta(m + n)$.
To the best of our knowledge, no time lower bound for the row-minima problem has been obtained in parallel models of computation, in spite of the importance of this problem. The first main contribution of this paper is to propose a number of nontrivial time lower bounds for matrix search problems. These lower bounds hold for general two-dimensional reconfigurable meshes of infinite size, as long as the input is pretiled onto a contiguous submesh of size $n \times n$. Specifically, in this context, we show that every algorithm that solves the problem of computing the smallest item of an $n \times n$ matrix must take $\Omega(\log\log n)$ time. The same lower bound is shown to hold for the problem of computing the minima in each row of an arbitrary $n \times n$ matrix. As a byproduct, we obtain an $\Omega(\log\log n)$ time lower bound for the problem of selecting the $k$th smallest item in a monotone matrix. Previously, Hao et al. [10] obtained an $\Omega(\log\log n)$ lower bound for selection in arbitrary matrices on finite reconfigurable meshes. Thus, our lower bound extends the result of [10] in two directions: We show that the same lower bound applies to selection on monotone matrices and on a reconfigurable mesh of infinite size. Finally, we show an almost tight $\Omega(\sqrt{\log\log n})$ time lower bound for the task of computing the row minima of a monotone $n \times n$ matrix.

The remainder of this work is organized as follows: Section 2 introduces the model of computation adopted in this paper; Section 3 discusses a number of relevant lower-bound results; Section 4 presents basic algorithms that will be key in our subsequent row-minima algorithm; Section 5 gives the details of our row-minima algorithm; finally, Section 6 offers concluding remarks and poses open problems.
THE BASIC RECONFIGURABLE MESH
A basic reconfigurable mesh (BRM, for short) of size $m \times n$ consists of $mn$ identical SIMD processors positioned on a rectangular array with $m$ rows and $n$ columns. As usual, it is assumed that every processor knows its own coordinates within the mesh: We let $P(i, j)$ denote the processor placed in row $i$ and column $j$, with $P(1, 1)$ in the north-west corner of the mesh.
Each processor $P(i, j)$ is connected to its four neighbors $P(i-1, j)$, $P(i+1, j)$, $P(i, j-1)$, and $P(i, j+1)$, provided they exist, and has four ports N, S, E, and W, as illustrated in Fig. 1. Local connections between these ports can be established, subject to the following restrictions:
1) In each time unit, at most one of the pairs of ports (N, S) or (E, W) can be set;
2) All the processors that connect a pair of ports must connect the same pair;
3) Broadcasting on the resulting subbuses is unidirectional.
For example, if the processors set the (E, W) connection, then, on the resulting horizontal buses, all broadcasting is done either "eastbound" or else "westbound," but not both.
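To illustrate the unidirectional broadcast discipline, the following Python sketch simulates a single eastbound broadcast step on one row of the mesh. It is a simplified model under our own assumptions, not a specification taken from [16]: a processor either fuses its E and W ports (the signal passes through), leaves them open (the subbus ends there), or writes a value on its E port; every processor observes whatever arrives at its W port. The function and parameter names are hypothetical.

```python
def broadcast_eastbound(connected, writers):
    """Simulate one eastbound broadcast step on a single row.

    connected[j] -- True if processor j fuses its E and W ports.
    writers      -- dict mapping a column index to the value that processor
                    writes on its E port (assumed to override any incoming value).
    Returns a list whose entry j is the value arriving at the W port of
    processor j, or None if nothing arrives.
    """
    received = [None] * len(connected)
    signal = None  # value currently travelling east on the subbus
    for j in range(len(connected)):
        received[j] = signal          # what processor j sees on its W port
        if j in writers:
            signal = writers[j]       # processor j drives its E port
        elif not connected[j]:
            signal = None             # open switch: the subbus ends here
        # otherwise the fused ports let the signal pass unchanged
    return received


# Two subbuses on a row of six processors, driven from columns 0 and 3.
row_result = broadcast_eastbound(
    connected=[False, True, True, False, True, True],
    writers={0: "a", 3: "b"},
)
```

A westbound broadcast is the mirror image; under restriction 3, a given step would use one direction only.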
We refer the reader to Figs. 2a and 2b for an illustration of several possible unidirectional subbuses. The BRM is very much like the recently proposed PPA multiprocessor array, except that the BRM does not have the torus connections present in the PPA. In a series of papers [16], [18], [19], [20], Maresca and his coworkers demonstrated that the PPA architecture and the corresponding programming environment are not only feasible and cost effective to implement, but also enjoy additional features that set them apart from the standard reconfigurable mesh and the PARBS. Specifically, these researchers have argued convincingly that the reconfigurable mesh is too powerful and unrestricted to support virtual parallelism under present-day technology. By contrast, the PPA architecture has been shown to scale and, thus, to support virtual parallelism [16], [18]. The BRM is easily shown to inherit all these attractive characteristics of the PPA, including the support of virtual parallelism and the C-based programming environment, making it eminently practical. As in [16], we assume ideal communications along buses (no delay). Although this assumption is inexact, a series of recent experiments with the PPA [16] and the GCN [27], [28] seems to indicate that it is a reasonable working hypothesis.
LOWER BOUNDS
The main goal of this section is to demonstrate nontrivial lower bounds for several matrix search problems. Our lower bound arguments do not use the restrictions of the BRM, holding for more powerful reconfigurable meshes that allow any local connections. In fact, our arguments hold for arbitrary two-dimensional reconfigurable meshes of infinite size, provided that the input is placed into a contiguous $n \times n$ submesh thereof.
Formally, this section deals with the following problems:

PROBLEM 1. Given an $n \times n$ matrix pretiled one item per processor onto an $n \times n$ submesh of a reconfigurable mesh, find the minimum item in the matrix.

PROBLEM 2. Given an $n \times n$ matrix pretiled one item per processor onto an $n \times n$ submesh of a reconfigurable mesh, find the minimum item of each row.

PROBLEM 3. Given an $n \times n$ monotone matrix pretiled one item per processor onto an $n \times n$ submesh of a reconfigurable mesh, find the minimum item of each row.

PROBLEM 4. Given an $n \times n$ totally monotone matrix pretiled one item per processor onto an $n \times n$ submesh of a reconfigurable mesh, find the minimum item of each row.
We propose to show that Problems 1 and 2 have an $\Omega(\log\log n)$-time lower bound and that Problem 3 has an $\Omega(\sqrt{\log\log n})$-time lower bound. The proofs are based on a technique detailed in [11], [29] that uses the following graph-theoretic result of Turán [8].
Recall that an independent set in a graph is a set of pairwise nonadjacent vertices.
This lemma is used in an implicit adversary argument to bound from below the number of items in the matrix that are possible choices for the minimum. Let $V$ be the set of candidates for the minimum at the beginning of the current iteration and let $E$ stand for the set of pairs of candidates that are compared within the current iteration. It is convenient to model this situation by a graph $G = (V, E)$, with $V$ and $E$ representing, respectively, the vertices and the edges of the graph. It is intuitively obvious that an adversary can choose the outcome of the comparisons in such a way that the next set of candidates is no smaller than an independent set $U$ in $G$. In other words, for a set $V$ of candidates and for a set $E$ of pairs that are compared by a minimum-finding algorithm, the items in the independent set $U$ have the potential of becoming the minimum. Consequently, all items in $U$ are still candidates for the minimum after comparing all pairs in $E$.
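The role of the independent set in the adversary argument can be illustrated with a small experiment. The Python sketch below is ours and is not taken from [11], [29]: it builds an independent set greedily by repeatedly picking a vertex of minimum remaining degree, a strategy known to yield at least $|V|^2 / (|V| + 2|E|)$ vertices, which is the classical form of Turán's bound (stated here from general knowledge rather than quoted from the lemma above). All names are hypothetical.

```python
def greedy_independent_set(vertices, edges):
    """Greedy minimum-degree independent set.

    vertices -- iterable of hashable, mutually comparable labels (candidates).
    edges    -- iterable of pairs (u, v): candidates compared in this round.
    """
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    remaining = set(adjacency)
    independent = []
    while remaining:
        # Pick a vertex with the fewest neighbours still in play
        # (ties broken by label so the result is deterministic).
        v = min(remaining, key=lambda x: (len(adjacency[x] & remaining), x))
        independent.append(v)
        remaining -= adjacency[v] | {v}
    return independent


def turan_bound(num_vertices, num_edges):
    """Classical guarantee on the size of some independent set."""
    return num_vertices ** 2 / (num_vertices + 2 * num_edges)


# Eight candidates compared along a path: 0-1, 1-2, ..., 6-7.
candidates = list(range(8))
comparisons = [(i, i + 1) for i in range(7)]
survivors = greedy_independent_set(candidates, comparisons)
```

No two survivors were compared with each other, so an adversary is free to declare any one of them the eventual minimum.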
To make the presentation easier to follow, we assume that each time unit is partitioned into the following three phases:
PHASE 1: Bus reconfiguration. The processors set local connections;
PHASE 2: Broadcasting. The processors send at most one data item to each port, and receive one data item from each port;
PHASE 3: Local computation. Every processor selects two elements stored in its local memory, compares them, and changes its internal status.
We begin by proving the following lemma. … To complete the algorithm at the end of $T$ time units, $c_T$ must be less than or equal to one. Therefore, $2 \log n \le 2^{T}(\log 35 + \log T)$ must hold. In turn, this implies $T \in \Omega(\log\log n)$.
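To see how slowly the resulting bound grows, one can compute the smallest $T$ that satisfies the inequality $2 \log n \le 2^{T}(\log 35 + \log T)$ for a few values of $n$. The Python sketch below is illustrative only; it assumes base-2 logarithms and takes $\log_2 n$ as its argument so that very large $n$ can be tried. The function name is ours.

```python
import math


def min_rounds(log2_n):
    """Smallest integer T >= 1 with 2*log2(n) <= 2**T * (log2(35) + log2(T)).

    log2_n -- the base-2 logarithm of n (assumption: logs are base 2).
    """
    T = 1
    while 2 * log2_n > 2 ** T * (math.log2(35) + math.log2(T)):
        T += 1
    return T


# Squaring n repeatedly (doubling log2 n) raises the threshold by about one.
thresholds = {k: min_rounds(k) for k in (16, 64, 256, 1024)}
```

Each doubling of $\log n$ increases the threshold by at most one, which is the $\log\log n$ behavior the argument predicts.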
It is worth mentioning that Lemma 3.2 implies a similar lower bound for the task of selection in monotone matrices. To see this, note that, given an arbitrary matrix $A$ of size $n \times n$, we can construct a monotone matrix $A'$ of size $n \times (n + 1)$ by simply adjoining to $A$ a column vector all of whose entries are $-\infty$. It is now clear that the minimum item in $A$ is precisely the $(n + 1)$th smallest item in $A'$. Thus, we have the following result.

LEMMA 3.3. Every algorithm that selects the $k$th smallest item in a monotone matrix of size $n \times n$ requires $\Omega(\log\log n)$ time.
Previously, Hao et al. [10] obtained an $\Omega(\log\log n)$ lower bound for selection in arbitrary matrices on finite reconfigurable meshes. Thus, Lemma 3.3 extends the result of [10] in two directions: First, it shows that $\Omega(\log\log n)$ remains the lower bound for selection on monotone matrices and, second, it shows that the lower bound must hold even for infinite reconfigurable meshes.
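The reduction behind Lemma 3.3 is easy to check on a small instance. In the Python sketch below, which is an illustration with hypothetical names, the column of $-\infty$ entries is adjoined on the left; the text does not say on which side the column is placed, so this is our assumption.

```python
import math


def adjoin_minus_infinity(matrix):
    """Return A' obtained by prepending a column of -infinity to A
    (assumption: the column is adjoined on the left)."""
    return [[-math.inf] + list(row) for row in matrix]


def kth_smallest(matrix, k):
    """k-th smallest item (1-based) of a matrix, by plain sorting."""
    return sorted(x for row in matrix for x in row)[k - 1]


# An arbitrary (not monotone) 3 x 3 matrix.
A = [
    [5, 2, 9],
    [4, 7, 1],
    [8, 6, 3],
]
A_prime = adjoin_minus_infinity(A)
n = len(A)
```

Every row minimum of `A_prime` lies in the same column, so `A_prime` is monotone, and its $(n + 1)$th smallest item is the minimum of `A`.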
LEMMA 3.4. Every algorithm that solves Problem 2 requires $\Omega(\log\log n)$ time.
PROOF. Suppose, to the contrary, that Problem 2 can be solved in $o(\log\log n)$ time. Then, by using the algorithm of Proposition 4.1 in Section 4, the minimum in the matrix can be computed in $O(1)$ further time, contradicting Lemma 3.2.

…

PROOF. Since there is an algorithm that solves Problem 3 in $O(\log\log n)$ time (see Section 5), we can assume that the upper bound for Problem 3 is $O(\log\log n)$. Assume that a row-minima algorithm has spent $t - 1$ time units and has found no row minima so far and that it is now about to execute Phase 3 of time unit $t$, where $t < \epsilon\sqrt{\log\log n}$ for some small fixed $\epsilon > 0$. Proceeding as in the proof of Lemma 3.2, we see that at most $17n$ … Assume that the topmost row was assigned at most … By applying the logarithm, we have … Hence, for some small fixed $\epsilon > 0$, $c_{\epsilon\sqrt{\log\log n}} > 1$ for large $n$. Therefore, at least …
PRELIMINARIES
Data movement operations are central to many efficient algorithms for parallel machines constructed as interconnection networks of processors. The purpose of this section is to review a number of basic data movement techniques for basic reconfigurable meshes. Consider a sequence of $n$ items $a_1, a_2, \ldots, a_n$. We are interested in computing the prefix maxima $z_1, z_2, \ldots, z_n$, defined for every $j$ $(1 \le j \le n)$ by setting $z_j = \max\{a_1, a_2, \ldots, a_j\}$. Recently, Olariu et al. [23] showed that the task of computing the prefix maxima of a sequence of $n$ numbers stored in the first row of a reconfigurable mesh of size $m \times n$ can be solved in $O(\log n)$ time if $m = 1$, and in $O\left(\frac{\log n}{\log m}\right)$ time if $2 \le m \le n$. Since their algorithm is crucial for understanding our algorithm for computing the row minima of a monotone matrix, we now present an adaptation of the algorithm in [23] for the BRM.

To begin, we exhibit an $O(1)$ time algorithm for computing the prefix maxima of $n$ items on a BRM of size $n \times n$. The idea of this first algorithm involves checking, for all $j$ $(1 \le j \le n)$, whether $a_j$ is the maximum of $a_1, a_2, \ldots, a_j$. The details are spelled out by the following sequence of steps. The reader is referred to Figs. 3a through 3f, where the algorithm is illustrated on the input sequence 7, 3, 8, 6.
Algorithm Prefix-Maxima-1;
STEP 1. Establish a vertical bus in every column $j$ $(1 \le j \le n - 1)$ from $P(1, j)$ to $P(n + 1 - j, j)$; every processor $P(1, j)$ $(1 \le j \le n - 1)$ broadcasts the item $a_j$ southbound along the vertical bus in column $j$;
STEP 2. Establish a horizontal bus in every row $i$ $(1 \le i \le n - 1)$ from $P(i, n + 1 - i)$ to $P(i, 1)$; every processor $P(i, n + 1 - i)$ $(1 \le i \le n - 1)$ broadcasts the item $a_{n+1-i}$ westbound along the horizontal bus in row $i$;
STEP 3. At the end of Step 2, every processor $P(i, j)$ $(i + j \le n + 1)$ stores the items $a_{n+1-i}$ and $a_j$; every processor $P(i, j)$ …

The correctness of the algorithm above is easily seen. Thus, we have the following result.
PROPOSITION 4.1. The prefix maxima of $n$ items from a totally ordered universe stored one item per processor in the first row of a basic reconfigurable mesh of size $n \times n$ can be computed in $O(1)$ time.
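The idea behind Prefix-Maxima-1 can be mimicked sequentially. The Python sketch below is not a transcription of the algorithm (whose later steps are not fully recoverable here) but a minimal illustration of its stated idea: decide, by independent pairwise comparisons, whether each $a_j$ is the maximum of $a_1, \ldots, a_j$, and then let each position take the value of the nearest such item at or to its left. The function name is ours, and the final left-to-right pass stands in for broadcasts on the mesh.

```python
def prefix_maxima_by_comparisons(a):
    """Prefix maxima via 'is a[j] the maximum of a[0..j]?' flags."""
    n = len(a)
    # Each flag depends only on comparisons of a[j] with a[k], k <= j; on an
    # n x n mesh these comparisons are spread over distinct processors.
    is_prefix_max = [all(a[k] <= a[j] for k in range(j + 1)) for j in range(n)]

    z = []
    for j in range(n):
        if is_prefix_max[j]:
            current = a[j]      # a[0] is always flagged, so current is set
        z.append(current)
    return z


# The input sequence used in Figs. 3a through 3f.
z = prefix_maxima_by_comparisons([7, 3, 8, 6])  # [7, 7, 8, 8]
```

The quadratic number of comparisons is what the $n \times n$ mesh absorbs in constant time.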
Next, following [23], we briefly sketch the idea involved in computing the prefix maxima of $n$ items $a_1, a_2, \ldots, a_n$ on a BRM of size $m \times n$ with $2 \le m \le n$. …

For later reference, we now solve a particular instance of the row-minima problem, which we call the selective row minima problem. Consider an arbitrary matrix $A$ of size $K \times N$ stored, one item per processor, in $K$ consecutive rows of a BRM of size $M \times N$. For simplicity of exposition, we assume that $A$ is stored in the first $K$ rows of the platform, but this is not essential. The goal is to compute the minima in rows 1, …
We proceed as follows: … (see Fig. 4); further partition each submesh $R_i$ … be the minima in the first row of …
THE ALGORITHM
The goal of this section is to present the details of an efficient algorithm for computing the row-minima of an $m \times n$ monotone matrix $A$. The matrix is assumed pretiled one item per processor onto a BRM of the same size, such that for …
We begin by stating a few technical results that will come in handy later on. To begin, consider a subset $i_1, i_2, \ldots, i_p$ of the rows of $A$ and let $j(i_1), j(i_2), \ldots, j(i_p)$ be such that, for all $k$ $(1 \le k \le p)$, $A(i_k, j(i_k))$ is the minimum in row $i_k$. Since the matrix $A$ is monotone, we must have $j(i_1) \le j(i_2) \le \cdots \le j(i_p)$.
Let $A_1, A_2, \ldots, A_p$ be the submatrices of $A$ defined as follows: …

The following result will be used again and again in the remainder of this section. … A perfectly similar argument shows that $A_1$ and $A_p$ are also monotone, completing the proof of the lemma. □
The matrices $A_k$ $(1 \le k \le p)$ defined above pairwise share a column. The following technical result shows that one can always transform these matrices such that they involve distinct columns. For this purpose, consider the matrix $A'_k$ obtained from $A_k$ by replacing, for every $i$, … and by dropping column $j(i_k)$. In other words, $A'_k$ is obtained from $A_k$ by retaining the minimum values in its first and last columns and then removing the last column. The last matrix $A'_p$ is taken to be $A_p$. The following result, whose proof is omitted, will be used implicitly in our algorithm.
LEMMA 5.2. Every matrix $A'_k$ $(1 \le k \le p)$ is monotone.
In outline, our algorithm for computing the row-minima of a monotone matrix proceeds as follows. First, we solve an instance of the selective row minima problem whose result is used to partition the original matrix into a number of monotone matrices, as described in Lemmas 5.1 and 5.2. This process is continued until the row minima in each of the resulting matrices can be computed directly. If $m = 1$, then the problem has a trivial solution running in $\Theta(\log n)$ time, which is also best possible even on the more powerful reconfigurable mesh [23].
We shall, therefore, assume that $m \ge 2$.
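The partitioning idea in the outline above has a familiar sequential counterpart, which may help fix intuition. The Python sketch below is not the BRM algorithm of this section: it is a standard divide-and-conquer routine that finds the minimum of the middle row and then confines the rows above and below to the column ranges allowed by monotonicity. We assume ties are broken in favor of the leftmost minimum; the names are ours.

```python
def monotone_row_minima(matrix):
    """Column index of the leftmost minimum of every row of a monotone matrix."""
    m = len(matrix)
    result = [None] * m

    def solve(top, bottom, left, right):
        # Rows top..bottom have their minima within columns left..right.
        if top > bottom:
            return
        mid = (top + bottom) // 2
        row = matrix[mid]
        best = min(range(left, right + 1), key=lambda j: row[j])
        result[mid] = best
        solve(top, mid - 1, left, best)       # rows above: minima at or left of best
        solve(mid + 1, bottom, best, right)   # rows below: minima at or right of best

    if m:
        solve(0, m - 1, 0, len(matrix[0]) - 1)
    return result


# Hypothetical monotone example with row minima in columns 1, 1, 2, 3.
monotone_example = [
    [3, 1, 4, 9],
    [5, 2, 6, 8],
    [7, 6, 0, 5],
    [9, 8, 7, 1],
]
minima_columns = monotone_row_minima(monotone_example)
```

Note that consecutive subproblems share the column `best`, mirroring the shared columns of the submatrices $A_k$ discussed above.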
CONCLUSIONS AND OPEN PROBLEMS
We have shown that the problem of computing the row-minima of a monotone matrix can be solved efficiently on the basic reconfigurable mesh (BRM), a weaker variant of the recently proposed Polymorphic Processor Array [16]. … for some fixed constant $\epsilon$ $(0 < \epsilon \le 1)$, our algorithm runs in $O(\log\log n)$ time. One of our main contributions was to propose a number of nontrivial time lower bounds for matrix search problems. These lower bounds hold for general two-dimensional reconfigurable meshes of infinite size, as long as the input is pretiled onto an $n \times n$ submesh thereof. Specifically, in this context, we showed that every algorithm that solves the problem of computing the smallest item of an $n \times n$ matrix, or the smallest item in each row of an $n \times n$ matrix, must take $\Omega(\log\log n)$ time. This result implies an $\Omega(\log\log n)$ time lower bound for the problem of selecting the $k$th smallest item in a monotone matrix, extending the result of [10] in two directions: We showed that the same lower bound applies to selection on monotone matrices and on a reconfigurable mesh of infinite size. Finally, we showed an $\Omega(\sqrt{\log\log n})$ time lower bound for the task of computing the row minima of a monotone $n \times n$ matrix. These are the first nontrivial lower bounds of this kind known to the authors.
A number of problems remain open. First, there is a discrepancy between the time lower bound we obtained for the task of computing the row-minima of a monotone matrix and the upper bound provided by our algorithm. Narrowing this gap will be a hard problem that we leave for future research. Second, no nontrivial lower bounds for the problem of computing the row-minima of a totally monotone matrix are known to us. This promises to be an exciting area for future research. Yet another problem of interest would be to solve the row-minima problem for the special case of totally monotone matrices. Trivially, our algorithm for monotone matrices also works for totally monotone ones. Unfortunately, to date, we have not been able to find a nontrivial lower bound for this problem.
