any given time a message to all the other processors sharing the same subbus.
INTRODUCTION
Reconfigurable networks have received much attention in the last few years, due to technological developments that allowed some experimental and commercial reconfigurable chips with thousands of processors to be built [2, 9].
In a reconfigurable network, each node can dynamically connect and disconnect its adjacent edges in various patterns. Specifically, each node of the network consists of a processing unit, a small local memory, and a switch, while each edge is viewed as a building block for a larger bus. Each switch has some I/O ports and each port is directly connected to at most one edge. While the edges outside the switch are fixed, the internal connections between the I/O ports of each switch can be locally configured by the processor itself into any combination of pairs and singletons. In this way, during the execution of an algorithm, the edges of the network are dynamically partitioned into edge-disjoint paths. Every such path forms a subbus, and allows (only) one processor of the subbus to broadcast at indices in the array. It is also assumed that switch reconfiguration can be done in O(1) time by local decisions taken by the processors themselves. It is worth noting that, under the assumptions that processors, switches, and edges occupy O(1) space, the reconfigurable mesh can be laid out on a rectangle of $O(pq)$ area in the VLSI grid model [9].
The remainder of this section shows how a reconfigurable mesh can be used to efficiently solve a merging problem: namely, given two sorted sequences $X = x_0 \ldots x_{p-1}$ and $Y = y_0 \ldots y_{q-1}$, find a sorted sequence $Z = z_0 \ldots z_{p+q-1}$ containing exactly every element of $X$ and every element of $Y$. It is assumed that the elements to be merged are drawn from an ordered set of finite size and that each element is represented by O(1) bits. For the sake of simplicity, it is assumed that three additional elements $x_p$, $y_q$, and $z_{p+q}$ are appended, respectively, to the ends of the sequences $X$, $Y$, and $Z$. These elements all have a bigger value, say $+\infty$, than any other element in $X$ and $Y$.
To solve the merging problem, a $(p+1) \times (q+1)$ mesh is employed such that the generic processor $P_{ij}$ at the $i$th row and the $j$th column of the mesh holds $x_i$ and $y_j$, $0 \le i \le p$ and $0 \le j \le q$.
DEFINITION 2.1. Given a $(p+1) \times (q+1)$ mesh, the $k$th antidiagonal is the set of processors
.. $x_{p-1}$ and $Y = y_0 \ldots y_{q-1}$ be two sorted sequences to be merged. A processor $P_{ij}$ of a $(p+1) \times (q+1)$ mesh is called active when:
See, for example, Fig. 2.
mesh contains exactly one active processor $P_{ij}$, and $z_k = \min\{x_i, y_j\}$.
Proof. The proof is by induction on $p + q$. When $p + q = 0$, the basis of the induction holds true. Indeed, $x_0 = y_0 = +\infty$, a $1 \times 1$ mesh is used, $P_{00}$ is obviously active, and $z_0 = \min\{+\infty, +\infty\}$.
Assume the lemma is true for $|X| + |Y| < p + q$, and let $|X| = p$ and $|Y| = q$. Without loss of generality, assume that $x_0 > y_0$ (the case $x_0 \le y_0$ can be proved in a similar way, by swapping rows and columns). In this case, the processors of the $(p+1) \times (q+1)$ mesh can be partitioned into three submeshes $P_{00}$, v, and S as shown in Fig. 3 (note that the submesh v can be empty).
By definition, $P_{00}$ is active and the processors in the submesh v are inactive. Since $x_0 > y_0$, the merged sequence $Z$ is such that $z_0 = y_0$ and $Z' = z_1 \ldots z_{p+q-1}$ is given by merging $X = x_0 \ldots x_{p-1}$ and $Y' = y_1 \ldots y_{q-1}$. Consider now of the related dictionary data structure have been proposed in the literature (e.g., see [11, 14]), to our knowledge a VLSI implementation of a priority queue is given for the first time in the present paper.
Briefly, the remainder of the paper is organized as follows. In Section 2 the reconfigurable mesh architecture is reviewed and a very simple and efficient way of merging two sorted sequences on a reconfigurable mesh is presented. Such a merging algorithm will be used in the next sections for the implementation of the priority queue operations. In Section 3 the reconfigurable tree of meshes architecture is defined and its H-shaped layout in the VLSI grid model is shown. Section 4 shows how to simulate a heap implementation for the P-bandwidth priority queue by means of the reconfigurable tree of meshes architecture previously introduced. For the sake of simplicity, the three operations for the priority queue are first implemented by means of three distinct networks, each capable of performing only one operation. Subsequently, it is shown how to perform all the operations on the same network. Finally, concluding remarks terminate the paper in Section 5.
MERGING ON A RECONFIGURABLE MESH
A reconfigurable mesh of size $p \times q$ consists of a classical $p \times q$ processor array with additional reconfigurable capabilities [2, 9, 15]. Specifically, there are $p$ rows and $q$ columns of nodes, with edges connecting each node to its four neighbors (or fewer, for borderline nodes). Each edge is a building block for a larger bus, while each node has a switch with four I/O ports (E, W, N, S), which can be configured in 10 possible ways, as shown in Fig. 1. Each node also has a processor, capable of performing the basic arithmetic and logic operations, and a small local memory. Processors operate in a single-instruction multiple-data (SIMD) mode, and only one processor can broadcast at any time to a subbus shared by multiple processors.
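The count of 10 switch configurations follows from the rule that the four ports may be joined into any combination of pairs and singletons. The sketch below enumerates those combinations; the function name `switch_configurations` and the representation of a configuration as a set of port pairs are illustrative assumptions, not notation from the paper.

```python
PORTS = ("N", "E", "W", "S")

def switch_configurations(ports=PORTS):
    """Return every way to split the ports into connected pairs and singletons.

    A configuration is a frozenset of two-port frozensets; ports that do not
    appear in any pair are left unconnected.
    """
    if not ports:
        return [frozenset()]
    first, rest = ports[0], ports[1:]
    # Case 1: the first port stays unconnected (a singleton).
    configs = list(switch_configurations(rest))
    # Case 2: the first port is fused with one of the remaining ports.
    for partner in rest:
        remaining = tuple(p for p in rest if p != partner)
        for sub in switch_configurations(remaining):
            configs.append(sub | {frozenset((first, partner))})
    return configs

configs = switch_configurations()
for c in configs:
    # Print each configuration as a list of fused port pairs ("O" when none are fused).
    print(sorted("".join(sorted(pair)) for pair in c) or "O")
```

The empty set corresponds to the configuration with no internal connection, and the two-pair sets correspond to configurations such as EW together with NS.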
The word model is assumed, where each local memory has a size of $\Theta(\log s)$ bits and each bus can carry $\Theta(\log s)$ bits of data, where $s$ is the size of the network. However, if keys are of constant length, then the priority queue works on the bit model too, where each local memory has a size of O(1) bits and each bus can carry O(1) bits of data; note that this implies that the processors do not know their
contains the active processor $P_{00}$ only and, since $z_0 = y_0 = \min\{x_0, y_0\}$, the proof follows. If $k > 0$, then $A_k$ contains the inactive processor $P_{k0}$ of the submesh v, and all the processors in the $(k-1)$th antidiagonal of the submesh S. The status (active/inactive) of the processors in S is that obtained by merging $X$ and $Y'$ on the $(p+1) \times q$ submesh S. By inductive hypothesis, since $|X| + |Y'| < p + q$, the $(k-1)$th antidiagonal of S has exactly one active processor and the minimum between the two elements in such a processor is equal to the $(k-1)$th element of $Z'$. Therefore such a minimum is also the $k$th element of $Z$, and the lemma is proved.
The next lemma refers to a well-known technique for finding the leftmost or rightmost 1 (or 0) in a sequence of bits. This has been used before in Ref. [9] and others in the context of finding the OR, and will be useful in the remainder of this paper, too. Obviously, the sequence of bits could be replaced by a sequence of boolean values that can be locally calculated by each processor in constant time.
Proof. Assume that the $p+1$ ($q+1$) values $x_0 \ldots x_p$ ($y_0 \ldots y_q$) are stored one per processor in column (row) zero. Configure all the switches EW, NS so as to have one subbus per column and one subbus per row. Then every processor $P_{i0}$ ($P_{0j}$) broadcasts its value $x_i$ ($y_j$) on its subbus, after which every processor $P_{ij}$ knows the values $x_i$ and $y_j$.
Successively, the status (active/inactive) of each processor has to be set. By Definition 2.2, P ij is inactive if and only if at least one of the following two conditions is true:
Consider row $i$ of the mesh, and let $h_i = \min\{j : x_i \le y_j\}$. Note that $h_i$ is well defined, since $y_q = +\infty$. Processors $P_{i0}, \ldots, P_{ih_i}$ do not satisfy condition (1), and hence can be either active or inactive, whereas processors $P_{i,h_i+1}, \ldots, P_{iq}$ are surely inactive. To set such processors inactive, $h_i$ has to be found; this can be done applying Lemma 2.4 to row $i$ of the mesh. By performing the above procedure in parallel on all the rows, it is possible to set all the processors which are inactive because of condition (1). In a similar way, considering column $j$ and letting $h_j = \min\{i : x_i > y_j \text{ or } x_i = +\infty\}$ (so that $h_j$ is well defined), it is possible to set all the processors which are inactive because of condition (2). In this way all the inactive processors are detected, and the remaining processors can thus be set active.
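Finding $h_i$ amounts to locating the leftmost 1 in the row of bits $[x_i \le y_j]$. The statement of Lemma 2.4 is not reproduced here, so the following is only a sequential sketch of the usual bus-splitting idea, under the assumption that processors holding a 1 cut the row bus and send a signal eastwards; the names `leftmost_one` and `row_threshold` are hypothetical.

```python
def leftmost_one(bits):
    """Index of the leftmost 1 in `bits`, or None if there is no 1.

    Simulates one bus-splitting step: a processor holding a 1 opens its
    switch (cutting the row bus) and writes a signal on its east port; a
    processor holding a 0 keeps its E and W ports fused.
    """
    n = len(bits)
    signal_on_west_port = [False] * n
    for j, b in enumerate(bits):
        if b:
            # The signal travels east through fused switches and stops
            # at the next open switch, which still observes it.
            k = j + 1
            while k < n:
                signal_on_west_port[k] = True
                if bits[k]:
                    break
                k += 1
    # The leftmost 1 is the only 1-processor that hears nothing from the west.
    winners = [j for j, b in enumerate(bits) if b and not signal_on_west_port[j]]
    return winners[0] if winners else None

def row_threshold(x_i, ys):
    """h_i = min{j : x_i <= y_j}, with ys already extended by the sentinel."""
    return leftmost_one([x_i <= y for y in ys])

INF = float("inf")
print(row_threshold(5, [1, 4, 7, 9, INF]))
```

On the mesh all rows would run this step simultaneously, which is what keeps the number of broadcasts constant.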
Finally, configure all the switches ES, NW to have a subbus per antidiagonal, as shown in Fig. 4. Each active processor, say $P_{ij}$, of each antidiagonal $A_k$, broadcasts $z_k = \min\{x_i, y_j\}$ on the subbus. Thus the merged sequence $Z = z_0 \ldots z_{p+q-1}$ is available on the borderline processors, one element per processor in column 0 and row $p$ (or, equivalently, in row 0 and column $q$).
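The three phases of the proof (distribute the values, mark inactive processors by rows and columns, read one value per antidiagonal) can be checked with a small sequential simulation. This is a sketch under stated assumptions: the thresholds $h_i$ and $h_j$ are taken exactly as defined in the text, the kth antidiagonal is assumed to be the set of processors with $i + j = k$, and the function name `mesh_merge` is hypothetical. Inputs are assumed finite and sorted.

```python
INF = float("inf")

def mesh_merge(xs, ys):
    """Merge sorted lists xs and ys by simulating the active-processor rule."""
    p, q = len(xs), len(ys)
    x = list(xs) + [INF]   # sentinel x_p
    y = list(ys) + [INF]   # sentinel y_q
    active = [[True] * (q + 1) for _ in range(p + 1)]
    # Rows: processors to the right of h_i = min{j : x_i <= y_j} are inactive.
    for i in range(p + 1):
        h_i = min(j for j in range(q + 1) if x[i] <= y[j])
        for j in range(h_i + 1, q + 1):
            active[i][j] = False
    # Columns: processors below h_j = min{i : x_i > y_j or x_i = +inf} are inactive.
    for j in range(q + 1):
        h_j = min(i for i in range(p + 1) if x[i] > y[j] or x[i] == INF)
        for i in range(h_j + 1, p + 1):
            active[i][j] = False
    # Antidiagonals: the single active processor on A_k supplies min{x_i, y_j}.
    z = []
    for k in range(p + q + 1):
        cells = [(i, k - i) for i in range(p + 1)
                 if 0 <= k - i <= q and active[i][k - i]]
        assert len(cells) == 1, "expected exactly one active processor per antidiagonal"
        i, j = cells[0]
        z.append(min(x[i], y[j]))
    return z[:-1]  # drop the sentinel z_{p+q}

print(mesh_merge([2, 5, 8], [1, 5, 9]))  # [1, 2, 5, 5, 8, 9]
```

The internal assertion doubles as a check of the one-active-processor-per-antidiagonal property on whatever inputs are supplied.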
Since the above steps require broadcasts on subbuses which are either $O(p)$ or $O(q)$ long, the overall computa-
reviewed in Section 2 will be used for processors within a node of the tree. It is worth noting that no processor needs to know the level or the heap number of the tree node to which it belongs. It is only assumed that a processor knows whether it belongs to the root, and whether it is in the last row of the upper submesh or in the lower submesh, by means of properly set bits. The proposed reconfigurable tree of meshes architecture can be laid out in the VLSI grid model. Indeed, it is well known that a complete binary tree of $n$ nodes can be laid out in an H-shaped manner to occupy $O(n)$ area, provided that nodes have O(1) area and links are O(1) wide. The proposed tree of meshes can be seen as a complete binary tree in which nodes have $O(P^2)$ area and links can carry $O(P)$ keys. Therefore, it is easy to see that an H-shaped layout of the reconfigurable tree of meshes requires $O(nP^2)$ area.
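The area bound can be traced in one line. The derivation below is a sketch that assumes the H-shaped layout of a complete binary tree with unit-area nodes fits in a square of side $O(\sqrt{n})$, and that enlarging every node to side $O(P)$ and every link to width $O(P)$ scales both dimensions of the layout by a factor $O(P)$.

```latex
\[
\text{side} \;=\; O(\sqrt{n}) \cdot O(P) \;=\; O(P\sqrt{n}),
\qquad
\text{area} \;=\; O(P\sqrt{n}) \times O(P\sqrt{n}) \;=\; O(nP^{2}).
\]
```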
HEAP IMPLEMENTATION ON A RECONFIGURABLE TREE OF MESHES
This section gives a heap implementation of a P-bandwidth priority queue on a reconfigurable tree of meshes (e.g. see [4] for a sequential heap implementation when $P = 1$ and [13] for a PRAM implementation when $P > 1$). Each node of the tree either is empty or stores P keys, in nonincreasing order, in the first column of the upper submesh. The root, if nonempty, stores its P keys also in the first row, in order to be ready to transmit them to the external world. The keys are stored in the nodes so as to maintain the following heap properties:
(a) if the node $N_{l,i}$ is empty, then every node $N_{k,h}$ with $k \ge l$ and $h > i$ is also empty;
(b) all the keys stored in a node are greater than or equal to the keys stored in its parent node.
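Properties (a) and (b) can be stated as an executable check. The sketch below assumes a hypothetical representation in which a dictionary maps each heap number to the list of keys stored in that node, with a missing or empty entry standing for an empty node; the name `satisfies_heap_properties` is illustrative.

```python
def satisfies_heap_properties(nodes, n):
    """Check heap properties (a) and (b) on a tree of n nodes.

    `nodes` maps a heap number h (1..n) to the list of keys stored in that
    node; the parent of node h is node h // 2.
    """
    def keys(h):
        return nodes.get(h) or []

    # Property (a): once a node is empty, every node with a larger
    # heap number is empty as well.
    seen_empty = False
    for h in range(1, n + 1):
        if not keys(h):
            seen_empty = True
        elif seen_empty:
            return False
    # Property (b): every key in a node is >= every key in its parent.
    for h in range(2, n + 1):
        if keys(h) and keys(h // 2) and min(keys(h)) < max(keys(h // 2)):
            return False
    return True

print(satisfies_heap_properties({1: [1, 2], 2: [3, 4], 3: [2, 5]}, 7))
```

Because heap numbers grow with the level, the condition "$k \ge l$ and $h > i$" in property (a) reduces to comparing heap numbers, which is what the first loop does.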
Clearly, the P smallest keys are stored in the root, and an error message must be provided when a DELETEMIN (INSERT(X)) operation is required on an empty (full) heap. For the sake of simplicity, the operations MIN, DELETEMIN, and INSERT(X) are first implemented by means of three distinct networks, each capable of performing only one operation. Subsequently, it is shown how to perform all the operations on the same network. In the implementation, the insertion node knows its status by means of a bit properly set in each processor of the node. Similarly, each node in the insertion path knows its status and the switch configuration to connect itself to the next node in the path. In this way, a subbus can be created along the insertion path in O(1)
THE RECONFIGURABLE TREE OF MESHES ARCHITECTURE
The proposed parallel implementation of a P-bandwidth priority queue is based on a reconfigurable tree of meshes architecture. Such an architecture consists of a complete binary tree of $n = 2^L - 1$ nodes, each of which is a reconfigurable mesh of size $2(P+1) \times (P+1)$.
As shown in Fig. 5, the generic node of the tree consists of two submeshes. The upper submesh (from row 0 to row $P$) is a reconfigurable mesh of $(P+1)^2$ processors, has the purpose of storing and processing keys, and is connected to its parent node. In particular, processors at row 0 of the root can communicate with the external world. The lower submesh (from row $P+1$ to row $2P+1$), instead, has only the $P+1$ processors on the main diagonal, which allow the node to communicate either with the left or with the right son, depending on the configurations of its switches. More precisely, when the $P+1$ switches are all configured NW (EN), then the node is connected to its left (right) son node, while when they are all configured EW, then the two son nodes are directly connected to each other. The above configurations, along with all switches configured O, are the only feasible configurations for the lower submesh and will be denoted, respectively, PL, PR, LR, and O. Thus, for the sake of conciseness, we shall talk about ''nodes configured PL, PR, LR, or O,'' instead of ''all the $P+1$ switches of the lower submesh configured NW, EN, EW, or O.'' Moreover, for ease of reference, a generic node of the tree will be denoted in the paper by $N_{l,h}$, where $l$ is its level ($0 \le l \le L-1$) and $h$ its heap number ($1 \le h \le n$). In this way, the root of the tree is $N_{0,1}$, the left (right) son of node $N_{l,i}$ is $N_{l+1,2i}$ ($N_{l+1,2i+1}$), and the rightmost leaf is $N_{L-1,n}$. However, the usual mesh notation empty (full) heap is checked after each DELETEMIN (INSERT(X)) operation and the resulting information is maintained by the root. Moreover, each node knows whether it stores keys or not and, in the former case, is configured O. When the computation starts, the heap is empty, and the root is the insertion node.
Proof. The root transmits to the external world either an error message (if it is the insertion node) or the P keys stored in its first row (if it is not the insertion node).
The time required is obviously O(1). Finally, when $i = 2^{l+1} - 1$ and $l = L - 1$, the same actions as in the case $i = 2^{l+1} - 1$ and $l < L - 1$ are performed. In the last step, however, there is no node configured O which receives the signal broadcast by the root, since the heap is full. This situation is detected by the root, and the insertion path is not changed.
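The heap numbering $N_{l,h}$ and the PL/PR configurations of the tree make the insertion path easy to describe arithmetically. The sketch below is an abstraction, not the signal-based procedure of the proof: it assumes the insertion node is the nonempty-to-empty frontier in heap-number order, so that advancing it means adding one to its heap number, and it reports a full heap with `None`. All function names are hypothetical.

```python
def insertion_path(h):
    """Heap numbers on the path from the root N_{0,1} down to node h."""
    path = []
    while h >= 1:
        path.append(h)
        h //= 2
    return path[::-1]

def path_configurations(h):
    """Lower-submesh configuration (PL or PR) of every proper ancestor of node h.

    An ancestor is configured PL when the path continues to its left son
    (heap number 2i) and PR when it continues to its right son (2i + 1).
    """
    path = insertion_path(h)
    return [(parent, "PL" if child == 2 * parent else "PR")
            for parent, child in zip(path, path[1:])]

def next_insertion_node(i, L):
    """Heap number of the next insertion node in a tree with L levels,
    or None when node i was the rightmost leaf and the heap is now full."""
    n = 2 ** L - 1
    return i + 1 if i < n else None

print(path_configurations(5))
```

When $i = 2^{l+1} - 1$ the node is the rightmost one of level $l$, so `i + 1` is the leftmost node of level $l + 1$, which matches the case distinction on $i$ and $l$ in the text.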
The time complexity of the above algorithm is due to broadcasts on subbuses which are $O(P \log n)$ long. Therefore it is O(1), using the unit-time delay model, and $O(\log P + \log\log n)$, using the log-time delay model.
Proof. See, for example, the algorithm shown in [10]. Note that the algorithm of [10] works on the bit-model, too, with minor and straightforward changes.
Configure the switches to build the insertion path. Let $l$ be the level of the insertion node. It is possible to consider a big $(l+1)P \times (P+1)$ submesh by selecting the processors in the first P rows belonging to the upper submeshes of the nodes in the insertion path. Note that this ''big submesh'' is distributed over the tree of meshes; i.e., it does not consist of adjacent processors. In this way, merging can be performed between the P keys at row 0 of the root, and the $lP$ keys at column 0 of the big submesh. An example is provided in Fig. 7, where active and inactive processors of the big submesh are white and shaded, respectively. By Theorem 2.5, this requires O(1) time, using the unit-time delay model, and $O(\log P + \log\log n)$ time, using the log-time delay model, since the submesh has size A 1 and M 2 such that M 1 belongs to the path.
Observe that a path of minima always exists but is not unique. In the following implementation of the DELETEMIN operation, it is sufficient to detect any path of minima.
Proof. Each node either transmits to its parent the maximum key it stores or communicates to its parent the fact that it is empty. Such a procedure can be performed in two phases, one for the left sons, and the other for the right sons. Each parent keeps track of the son containing the maximum key, merges the 2P keys coming from its sons, and rearranges these keys into the sons. Finally, each parent configures its switch to connect itself to the proper son. The overall time complexity is dominated by that of merging $O(P)$ keys and therefore is O(1), using the unit-time delay model, and $O(\log P)$, using the log-time delay model.
Proof. Configure the switches so as to build the insertion path and move the P keys from the insertion node to the processors at row 0 and column 0 of the root. In this way, $N_{INS}$ becomes empty and the P smallest keys are lost. According to Lemma 4.10, update prev($N_{INS}$). If $N_{INS} = N_{0,1}$, then the root keeps track that the heap is now empty. Otherwise, in order to maintain property (b) of the heap, configure a subbus along a path of minima and rearrange the keys as described in the proof of Lemma 4.12. In this way, it is possible to move upwards the keys contained in the path of minima without violating property (b) of the heap. Specifically, each node in the path stores the P keys previously contained in its son. Then the root contains in column 0 the P keys coming from its son, and in row 0 the P keys of the old insertion node. Therefore, merging can be performed between the P keys at row 0 of the root, and the keys at column 0 of the nodes in the path of minima, as described in Theorem 4.8. Since all the steps of the above procedure require O(1) time, using the unit-time delay model, and $O(\log P + \log\log n)$ time, using the log-time delay model, the proof follows.
For the sake of simplicity, the operations MIN, DELETEMIN, and INSERT(X) were implemented by means of three distinct networks, each capable of performing only one operation. We now show how to perform all the operations on the same network. To do this, proper commands
Observe that the keys in the merged sequence are available one per processor at column 0 of the big submesh and, in particular, the P largest keys correctly go into the first P processors at column 0 of the insertion node.
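The effect of the merging step on the stored keys can be illustrated sequentially. The following sketch is not the mesh algorithm itself: it assumes, for convenience, that each node keeps its P keys as an ascending Python list in a dictionary indexed by heap number, and it uses `heapq.merge` in place of the constant-time mesh merge. The name `insert_keys` is hypothetical.

```python
from heapq import merge

def insert_keys(nodes, ins, new_keys):
    """Insert P sorted keys into the heap whose insertion node is `ins`.

    `nodes` maps heap number -> ascending list of P keys (assumed layout).
    """
    P = len(new_keys)
    # Heap numbers on the insertion path, from the root down to `ins`.
    path = []
    h = ins
    while h >= 1:
        path.append(h)
        h //= 2
    path.reverse()
    # By heap property (b), the keys stored along the path (all nodes but
    # the insertion node) form a sorted sequence when read top-down.
    stored = [key for h in path[:-1] for key in nodes[h]]
    merged = list(merge(new_keys, stored))
    # Redistribute P keys per node: the root receives the P smallest keys,
    # the insertion node the P largest.
    for depth, h in enumerate(path):
        nodes[h] = merged[depth * P:(depth + 1) * P]
    return nodes

heap = {1: [1, 4], 2: [5, 8], 3: [6, 7]}
print(insert_keys(heap, ins=4, new_keys=[2, 9]))
```

Nodes off the insertion path are untouched; since every path node ends up with keys no larger than before, their sons outside the path still satisfy property (b).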
It is readily seen that properties (a) and (b) of the heap are maintained. Indeed, only the insertion node is no longer empty among the previously empty nodes, while the keys stored in each node of the insertion path are smaller than or equal to the previously stored keys. The procedure terminates by updating the insertion node, as described in the proof of Lemma 4.6, during which the root can detect whether the heap is full or not. Since this requires O(1) time, using the unit-time delay model, and $O(\log P + \log\log n)$ time, using the log-time delay model, the proof follows.
have to be received by the processors without increasing the overall time complexity. Observe that commands cannot be broadcast by the root to all the remaining nodes, since otherwise the same $O(\log n)$ time complexity of the sequential implementation would result.
Proof. All the processors in the tree configure the switches according to the following cyclic algorithm:
Step 1. The nodes in the insertion path create a subbus along the path itself, while the remaining nodes configure themselves PR to perform the actions described in Lemma 4.10.
Step 2. The nodes in the tree create a subbus along the path of minima.
Step 3. The nodes in the insertion path create a subbus along the path itself, while the remaining nodes configure themselves PL so as to perform the actions described in Lemma 4.6.
Assume that each step of the above algorithm is slow enough to allow a constant number of broadcasts along subbuses of O(P log n) length to be performed, as well as a merging on an insertion path or on a path of minima.
If the root receives a MIN operation code, the root itself performs the operation, as shown in Proposition 4.3, without involving any other node.
If the root receives an INSERT operation code, along with the P keys to be inserted, two cases arise. If the queue is full, the root itself outputs an error message. Otherwise, when Step 1 begins, the root broadcasts to all the processors in the insertion path a proper code, so that such processors perform the merging algorithm of Theorem 4.8. Then, when Step 3 begins, the root broadcasts to all the processors in the insertion path another code, so that the insertion path is changed as described in Lemma 4.6.
Finally, two cases arise also when the root receives a DELETEMIN operation code. If the queue is empty, the root itself outputs an error message. Otherwise, when Step 1 begins, the root broadcasts to all the processors in the insertion path a proper code, in order to update the insertion path as described in Lemma 4.10 and move the P keys from the insertion node to the root. Finally, during Step 2, when the path of minima is created, the merging procedure described in Theorem 4.13 is performed.
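The way the root maps each operation code onto the three-step cycle can be summarized as a small dispatch table. This is only a descriptive sketch: the function name `schedule`, its flags, and the dictionary layout are illustrative assumptions, while the step numbers and the cited results are those named in the text.

```python
def schedule(op, queue_empty=False, queue_full=False):
    """Map an operation code received by the root to the action taken
    in each step of the cycle (keys 1, 2, 3) or by the root alone."""
    if op == "MIN":
        # Handled by the root alone, without involving any other node.
        return {"root": "perform MIN (Proposition 4.3)"}
    if op == "INSERT":
        if queue_full:
            return {"root": "error"}
        return {
            1: "merge along the insertion path (Theorem 4.8)",
            3: "change the insertion path (Lemma 4.6)",
        }
    if op == "DELETEMIN":
        if queue_empty:
            return {"root": "error"}
        return {
            1: "update the insertion path (Lemma 4.10) and move the "
               "P keys from the insertion node to the root",
            2: "merge along the path of minima (Theorem 4.13)",
        }
    raise ValueError(f"unknown operation code: {op}")

for op in ("MIN", "INSERT", "DELETEMIN"):
    print(op, schedule(op))
```

Each operation touches at most two of the three steps, so one pass through the cycle is enough to complete it.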
It is easy to see that the above actions can be performed so as to take O(1) time for all the priority queue operations, using the unit-time delay model, or O(1) for MIN and $O(\log P + \log\log n)$ for both DELETEMIN and INSERT, using the log-time delay model.
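The log-time bound follows directly from the maximum subbus length. The derivation below is a sketch that assumes the usual reading of the log-time delay model, namely that a broadcast on a subbus of length $\ell$ takes $O(\log \ell)$ time.

```latex
\[
\ell = O(P \log n)
\quad\Longrightarrow\quad
T = O(\log \ell) = O\bigl(\log (P \log n)\bigr) = O(\log P + \log\log n).
\]
```

MIN stays O(1) in both models because it is performed by the root alone, without any broadcast along the tree.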
CONCLUSIONS
A parallel implementation of a P-bandwidth priority queue of size $nP$ by a tree of meshes network with reconfigurable buses of $O(nP^2)$ processors and $O(P \log n)$ maximum subbus length has been presented. It is worth noting that the proposed tree of meshes network can be embedded on an $O(Pn^{1/2}) \times O(Pn^{1/2})$ reconfigurable mesh. However, the maximum subbus length becomes $O(Pn^{1/2})$, thus increasing the computational time in the log-time delay model. Since reconfigurable meshes are scalable, i.e., can run algorithms for large problem instances on small machines (see Theorem 4.1 of [1] for the LRN model), the parallel implementation of the priority queues proposed in the present paper is scalable as well. In particular, an $m \times m$ reconfigurable mesh can simulate the O(Pn
Several questions, however, remain open. In particular, it would be interesting to find implementations with smaller subbus length and/or to prove $AT^2$ optimality using the log-time delay model. Moreover, it would be interesting as well to find further applications of the reconfigurable tree of meshes proposed in this paper.
