Sorting is a fundamental problem with applications in all areas of computer science and engineering. In this work we address the problem of sorting on mesh connected computers enhanced by endowing each row and each column with its own dedicated high-speed bus. This architecture, commonly referred to as a mesh with multiple broadcasting, is commercially available and has been adopted by the DAP family of multiprocessors. Somewhat surprisingly, the problem of sorting m, (m n), elements on a mesh with multiple broadcasting of size p n p n has been studied, thus far, only in the sparse case, where m 2 ( p n) and in the dense case, where m 2 (n). Yet, many applications require using an existing platform of size p n p n for sorting m elements, with p n < m n.
Introduction
With the tremendous advances in VLSI, it is technologically feasible and economically viable to build parallel machines featuring tens of thousands of processors 4, 22, 39, 40, 42, 47] . However, practice indicates that this increase in raw computational power does not always translate into performance increases of the same order of magnitude. The reason seems to be twofold: rst, not all problems are known to admit e cient parallel solutions; second, parallel computation requires interprocessor communication which often acts as a bottleneck in parallel machines.
In this context, the mesh has emerged as one of the platforms of choice for solving problems in image processing, computer vision, pattern recognition, robotics, and computational morphology, with the number of application domains that bene t from this simple and intuitive architecture growing by the day 2, 4, 39, 40, 42] . Its regular and intuitive topology makes the mesh eminently suitable for VLSI implementation, with several models built over the years. Examples include the ILLIAC V, the STARAN, the MPP, and the MasPar, among many others 4, 5, 11, 39] . Yet, the mesh is not for everyone: its large computational diameter makes the mesh notoriously slow in contexts where the computation involves global data movement.
To address this shortcoming, the mesh has been enhanced by the addition of various types of bus systems 1, 2, 16, 18, 20, 23, 39, 43] . Early solutions involving the addition of one or more global buses shared by all the processors in the mesh, have been implemented on a number of massively parallel machines 2, 4, 19, 39] . Yet another popular way of enhancing the mesh architecture involves endowing every row with its own bus. The resulting architecture is referred to as mesh with row buses and has received a good deal of attention in the literature. Recently, a more powerful architecture has been obtained by adding one bus to every row and to every column in the mesh, as illustrated in Figure 1 . In 20] an abstraction of such a system is referred to as mesh with multiple broadcasting, (MMB, for short). The MMB has been implemented in VLSI and is commercially available in the DAP family of multicomputers 20, 38, 41] . In turn, due to its commercial availability, the MMB has attracted a great deal of attention. Applications ranging from image processing 21, 38, 41] , to visibility and robotics 7, 14, 15, 37] , to pattern recognition 9, 10, 20, 25, 37] , to optimization 17], to query processing 9, 13], and to other fundamental problems 3, 6, 8, 9, 10, 16] have found e cient solutions on this platform and some of its variants 18, 25, 29, 30] .
As we shall discuss in Section 2, we assume that the MMB communicates with the outside world via I/O ports placed along the leftmost column of the platform. This is consistent with the view that enhanced meshes can serve as fast dedicated coprocessors for present-day computers 28]. In such a scenario, the host computer passes data, in a systolic fashion, to the enhanced mesh in batches of p n and so, in the presence of m elements to be processed, the leftmost m p n columns will be used to store the input. From now on, we will not be concerned with I/O issues, assuming that the input has been pretiled in the leftmost m p n columns of the platform.
Sorting is unquestionably one of the most extensively investigated topics in computer science. Somewhat surprisingly, thus far, the problem of sorting m, (m n), elements on an MMB of size p n p n has been addressed only in the sparse case, where m 2 ( p n) or in the dense case, when m 2 (n). For the sparse case, Olariu et al. 35 ] have proposed a time-optimal (log n) time algorithm to sort p n elements stored in one row of an MMB of size p n p n. Yet, many applications require using an existing platform of size p n p n for sorting m elements, with p n < m n. For example, in automated visual inspection one is interested in computing a similarity measure coe cient for a vertical strip of the image 27]. A similar situation arises in the task of tracking a mobile target across a series of frames 31, 46] . In this latter case, the scene consists typically of static objects and one is interested only in evaluating movement-related parameters of the target. In order to perform this task in real-time it is crucial to use the existing platform to process (sort, in particular) parameter values at pixels in a narrow rectangular subimage only. The width of the subimage of interest depends, of course, on the speed with which the target moves across the domain. Therefore, in order to be able to meet read-time requirements one has to use adaptive sorting algorithms that are as fast as possible. The main contribution of this paper is to propose the rst know adaptive time-and VLSI-optimal sorting algorithm on the MMB. Speci cally, we show that once we x a constant 0 < 1 2 , we can sort m; n 1 2 + m n, elements pretiled in the leftmost m p n columns of an MMB of size p n p n in O( m p n ) time. We show that this is time-optimal for this architecture. It is also easy to see that this achieves the AT 2 lower bound in the word model. At the heart of our algorithms lies a novel deterministic sampling scheme reminiscent of the one developed recently by Olariu and Schwing 36] . The main feature of our sampling scheme is that, when used for bucket sorting, the resulting buckets are well balanced, making costly rebalancing unnecessary.
The remainder of this paper is organized as follows: Section 2 discusses the model of computation used throughout the paper; Section 3 presents our time-and VLSI lower bound arguments; Section 4 reviews a number of basic data movement results; Section 5 presents the details of our optimal sorting algorithm. Finally, Section 6 summarizes the results and poses a number of open problems.
The Model of Computation
An MMB of size M N, hereafter referred to as a mesh when no confusion is possible, consists of MN identical processors positioned on a rectangular array overlaid with a highspeed bus system. In every row of the mesh the processors are connected to a horizontal bus; similarly, in every column the processors are connected to a vertical bus as illustrated in Figure 1 . We assume that the processors in the leftmost column serve as I/O ports, as illustrated, and that this is the only way the MMB can communicate with the outside world. Processor P(i; j) is located in row i and column j, (1 i M; 1 j N), with P(1; 1) in the northwest corner of the mesh. Each processor P(i; j) has local links to its neighbors P(i?1; j), P(i+1; j), P(i; j?1), and P(i; j+1), provided they exist. Throughout this paper, we assume that the MMB operates in SIMD mode: in each time unit, the same instruction is broadcast to all processors, which execute it and wait for the next instruction. Each processor is assumed to know its own coordinates within the mesh and to have a constant number of registers of size O(log MN); in unit time, the processors perform some arithmetic or boolean operation, communicate with one of their neighbors using a local link, broadcast a value on a bus, or read a value from a speci ed bus. Each of these operations involves handling at most O(log MN) bits of information.
Due to physical constraints, only one processor is allowed to broadcast on a given bus at any one time. By contrast, all the processors on the bus can simultaneously read the value being broadcast. In accord with other researchers 3, 16, 17, 20, 21, 23, 38, 39] , we assume constant broadcast delay. Although inexact, experiments with the DAP, the PPA, and the YUPPIE multiprocessor array systems seem to indicate that this is a reasonable working hypothesis 23, 26, 29, 38, 39] .
The Lower Bounds
The purpose of this section is to show that every algorithm that sorts m; m n, elements T is the running time. Thus, in this case, the time lower bound of ( m p n ) also translates into VLSI-optimality.
In the remaining part of the paper, we show that the lower bounds derived above are tight by providing an algorithm with a matching running time. For sake of better understanding, we rst present a preliminary discussion on some data movement techniques used throughout the paper.
Data Movement
Data movement operations constitute the basic building blocks that lay the foundations of many e cient algorithms for parallel machines constructed as an interconnection network of processors. The purpose of this section is to review a number of data movement techniques for the MMB that will be instrumental in the design of our sorting algorithm.
Merging two sorted sequences is one of the fundamental operations in computer science. Olariu et al. 35 ] have proposed a constant time algorithm to merge two sorted sequences of total length p n stored in one row of an MMB of size p n p n. More Since merging is an important ingredient in our algorithm, we now give the details of the merging algorithm 35]. To begin, using vertical buses, the rst row is replicated in all rows of the mesh. Next, in every row i, (1 i r), processor P(i; i) broadcasts a i horizontally on the corresponding row bus. It is easy to see that for every i, a unique processor P(i; j), (r + 1 j p n), will nd that b j?1 < a i b j . Clearly, this unique processor can now use the horizontal bus to broadcast j back to P(i; i). In turn, this processor has enough information to compute the position of a i in the sorted sequence. In exactly the same way, the position of every b j in the sorted sequence can be computed in O(1) time. Knowing their positions in the sorted sequence, the elements can be moved to their nal positions in O(1) time.
Next, we consider the problem of merging multiple sorted sequences with a common length. Let a sequence of p n elements a 1 , a 2 , : : :, a p n be stored, one per processor, in the rst row of an MMB of size p n p n. Suppose that the sequence consists of k sorted subsequences and each subsequence consists of p n k consecutive elements of the original sequence. The goal is to sort the entire sequence.
For de niteness, we assume that subsequence j, (1 j k), contains the elements a (j?1) p n k +1 , : : :, a j p n k . Our sorting algorithm proceeds by merging the subsequences two at a time into longer and longer subsequences. The details are as follows. We set aside submeshes of size We now view the mesh as consisting of horizontal submeshes R 1 ; R 2 ; : : : ; R x , each of size p n x p n. In a submesh R p , (1 p x), each processor P(l; j), (1 j p n; l = d j x e), broadcasts its value along column bus j and P(j; j) records it as shown in Figure 3 (c). Again, in constant time, each processor P(j; j) broadcasts its value along row bus j to processor P(p; j). The above can be repeated for each submesh R p with 1 p x, thus accomplishing the required data movement in O(x) time. To summarize our ndings we state the following result.
Lemma 4.3. Given a mesh with multiple broadcasting of size p n p n with input elements stored in the leftmost x columns in sorted row-major order, the data can be moved into sorted column-major order in the leftmost x columns in O(x) time. 
The Algorithm
We are now in a position to present our time-and VLSI-optimal sorting algorithm for the MMB. Essentially, our algorithm implements the well-known bucket sort strategy. The novelty of our approach resides in the way we de ne the buckets, ensuring that no bucket is overly full. Throughout, we assume an MMB R of size p n p n. avoid tedious, but otherwise inconsequential, details we assume that m p n is an integer. The goal is to sort these elements in column-major order, so that they can be output from the mesh in O( m p n ) time. We propose to show that with the above assumptions the entire task of sorting can be performed in O( m p n ) time. Thus, from our discussion in Section 3, we can conclude that our algorithm is both time-and VLSI-optimal. It is worth mentioning yet another interesting feature of our algorithm, namely, that the time to input the data, the time to sort, and the time to output the data are essentially the same.
To make the presentation more transparent and easier to follow we refer to the submesh consisting of the leftmost m p n columns of R as M. In other words, M is the submesh that initially contains the input. Further, a slice of size k of the input consists of the elements stored in k consecutive rows of M.
We will rst present an outline of our algorithm and then proceed with the details. We start by partitioning M into slices of size m p n and sort the elements in each such slice in row-major in O( m p n ) time using an optimal sorting algorithm for meshes 32, 45] . Next, we use bucketsort to merge consecutive m p n of these into slices of size ( m p n ) 2 sorted in row-major order. Using the same strategy, these slices are again merged into larger slices sorted in row-major order. We continue the merging process until we are left with one slice of size p n, sorted in row-major order. Finally, employing the data movement discussed in Lemma 4.3, the data is converted into column-major order.
We We refer to submeshes R k;k as diagonal { see Figure 4 for an illustration. Notice that the diagonal submeshes can be viewed as independent MMBs, since the same task can be performed, in parallel, in all of them without broadcasting con ict. The algorithm begins by moving the elements in every R k;1 to the diagonal submesh R k;k . This can be accomplished, column by column, in O( m p n ) time. We now present the details of the processing that takes place in parallel in every diagonal submesh R k;k . The rightmost element in every row of R k;k will be referred to as the leader of that row as shown in Figure 4 . To begin, the sequence of leaders q 1 ; q 2 ; : : : ; q ( m p n ) i+1 in R k;k is sorted in increasing order. Note that by virtue of our grouping, the sequence of leaders consists By de nition, the leaders a(j?1)m p n +1 through ajm p n belong to bucket B j . This observation motivates us to call a row in R k;k regular with respect to bucket B j if its leader belongs to B j . Similarly, a row of R k;k is said to be special with respect to bucket B j if its leader belongs to a bucket B t with t > j, while the leader of the previous row belongs to a bucket B s with s j. To handle the boundary case, we also say that a row is special with respect to B j if it is the rst row in a slice and its leader belongs to B t with t > j. Note that, all elements in B j must be in either regular rows or special rows with respect to B j .
At this point, we make a key observation. Notice that as a result of the previous data movement operations, each processor in Q j;j holds at most two elements: one from a regular row with respect to B j and one from a special row. Next, we sort the elements in each submesh Q j;j in overlaid row-major order. In case the number of elements in Q j;j does not exceed m 2 n , after sorting the elements can be placed one per processor. If the number of elements exceeds m 2 n , the rst m 2 n of them are said to belong to generation-1 and the remaining elements are said to belong to generation-2. The elements belonging to generation-1 are stored one per processor in row-major order, overlaid with those from generation-2, also in row-major order.
This task is performed as follows. Using one of the optimal sorting algorithm for meshes It is easy to con rm that the recurrence describing the behavior of T(i + 1) is
T(1) = m p n ;
The algorithm terminates at the end of t iterations, when m p n t+1 = p n: 
Conclusions and Open Problems
The mesh-connected computer architecture has emerged as one of the most natural choices for solving a large number of computational tasks in image processing, computational geometry, and computer vision. Its regular structure and simple interconnection topology makes the mesh particularly well suited for VLSI implementation. However, due to its large communication diameter, the mesh tends to be slow when it comes to handling data transfer operations over long distances. In an attempt to overcome this problem, mesh-connected computers have been augmented by the addition of various types of bus systems. Among these, the mesh with multiple broadcasting (MMB) is of a particular interest being commercially available, being the underlying architecture of the DAP family of multiprocessors.
The main contribution of this paper is to present the rst known adaptive time-and VLSI-optimal sorting algorithm for the MMB. Speci cally, we have shown that once we x a constant 0 < A number of problems remain open. First, it would be of interest to see whether the bucketing technique used in this paper can be applied to the problem of selection. To this day, no time-optimal selection algorithms for meshes with multiple broadcasting are known. Also, it is not known whether the technique used in this paper can be extended to meshes enhanced by the addition of k global buses 1, 12] . Further, we would like to completely resolve these issues concerning optimal sorting over the entire range p n m n. Note that the results of Lin and others 24] show that for m near p n, (log n) is the time lower bound for sorting in this architecture. Their results imply that a sorting algorithm cannot be VLSI-optimal for m near p n.
Quite recently, Lin et al. 24 ] proposed a novel VLSI architecture for digital geometry { the Mesh with Hybrid Buses (MHB) obtained by enhancing the MMB with precharged 1-bit row and column buses. It would be interesting to see whether the techniques used in this paper extend to the MHB. This promises to be an exciting area for future work.
