Abstract
We consider randomized simulations of shared memory on a distributed memory machine (DMM) where the n processors and the n memory modules of the DMM are connected via a reconfigurable architecture. We first present a randomized simulation of a CRCW PRAM on a reconfigurable DMM having a complete reconfigurable interconnection. It guarantees delay $O(\log^* n)$, with high probability.
Next we study a reconfigurable mesh DMM (RM-DMM). Here the n processors and n modules are connected via an $n \times n$ reconfigurable mesh. It was already known that an $n \times m$ reconfigurable mesh can simulate in constant time an n-processor CRCW PRAM with shared memory of size m. In this paper we present a randomized step by step [...]
1 Introduction
The parallel random access machine (PRAM) is an idealized model for parallel computation. It strips away problems that result from synchronisation, latency, memory contention, communication capacity and reliability. It is therefore very convenient for designing parallel algorithms, because the programmer does not have to deal with hardware limitations. On the other hand, it is very unrealistic from a technological point of view. In this paper we deal with the efficient realization of the shared memory of the PRAM. With current technology a parallel shared memory can only be realized for a small number of processors. A more realistic parallel computation model is the distributed memory machine (DMM). Here the memory is distributed among a limited number of memory modules, and the processors and memory modules are connected via a routing interconnection network. In this paper we study DMMs with n processors and n modules.
In an effort to understand the relative power of the PRAM compared with other parallel computation models, several authors have described simulations between them (Upfal, 1984; Karlin and Upfal, 1986; Wang and Chen, 1990; Ranade, 1991; Leighton, 1992a; Leighton, 1992b; Karp et al., 1993; Dietzfelbinger and Meyer auf der Heide, 1993; Meyer auf der Heide et al., 1995; Czumaj et al., 1995d). For example, it is known that the n-processor PRAM can be simulated (with high probability) with $O(\log n)$ delay on the butterfly network (Ranade, 1991), with $O(\log \log n)$ delay on the optical communication parallel computer (Goldberg et al., 1994), and with $O(\log \log \log n \cdot \log^* n)$ delay on the DMM with the complete interconnection network between processors and modules (Czumaj et al., 1995c; Czumaj et al., 1995d).
In recent years interest in parallel computation models based on reconfigurable architectures has grown rapidly. As very powerful and physically realizable models of parallel computation (see (Li and Stout, 1991; ElGindy and Prasanna, 1995)), reconfigurable networks have become the object of extensive research (Wang and Chen, 1990; Li and Stout, 1991; Olariu et al., 1993; Ben-Asher et al., 1994). These authors considered many fundamental operations and problems on this model, especially on the reconfigurable mesh, namely data reduction, ranking, sorting, and parity. Ben-Asher et al. (1994) studied the parallel complexity of reconfigurable network models. They examined the computational power by focusing on the set of problems computable in constant time on some variants of the model.
In this paper we investigate relations between the PRAM and DMMs with reconfigurable networks as routing mechanisms. All results stated in the following are randomized and hold with high probability (w.h.p.), i.e., with probability at least $1 - n^{-\alpha}$ for any constant $\alpha > 1$. We focus on simulations that minimize the delay, i.e., the time needed to simulate a parallel memory access of a PRAM on a DMM. Furthermore, we are interested in optimal simulations. We say a simulation of a p-processor PRAM on an n-processor DMM is time-processor optimal if the delay is $O(p/n)$.
The first, very powerful model we analyze is the reconfigurable DMM (R-DMM), where the routing network is a complete reconfigurable network. This model can be viewed as a DMM where the processors and modules are connected by a complete graph (the so-called Standard-DMM) with the additional facility of combining links into buses. In each step of the R-DMM each processor can combine two adjacent links into one and then read from or write to this new link. This defines paths and cycles that form buses, which can be used for broadcasting.
A further step towards a more realistic model is to assume, instead of a complete network, a reconfigurable mesh as the routing network. (See Li and Stout (1991) for more motivation behind the model of the reconfigurable mesh.) The model of the reconfigurable mesh DMM (RM-DMM) takes into account the issue of memory contention and assumes a technologically feasible interconnection network between processors and modules. The interconnection network of the n-processor RM-DMM is formed by an $n \times n$ reconfigurable mesh. In a reconfigurable mesh each node can, in each step, combine pairs of adjacent links such that the combined links create buses. As in the R-DMM, each processor can read from or write to adjacent buses and use them for broadcasting. Wang and Chen (1990) presented a deterministic simulation with constant delay of a CRCW PRAM with n processors and m shared memory cells by an $m \times n$ processor array with a reconfigurable bus system. This result requires a large hardware overhead to simulate a large shared memory. We note, however, that it makes it possible to simulate deterministically an n-processor Standard-DMM on an n-processor RM-DMM with constant delay.
Very recently, independently of our work, Matias and Schuster (1995) presented a PRAM simulation on an n-processor variant of an RM-DMM with O(log log log n) delay using a result from (Czumaj et al., 1995d). They assume a weaker collision resolution rule for concurrent access of processors to a bus than we do in this paper.
Outline of Results
Our PRAM simulations follow the idea of hashing the shared memory of the PRAM into the modules of the DMM, as in (Karp et al., 1993; Dietzfelbinger and Meyer auf der Heide, 1993; Goldberg et al., 1994; MacKenzie et al., 1994). One or more copies of the shared memory cells are distributed among the memory modules using a constant number of hash functions. The hash functions are chosen uniformly at random from a high-performance $\log^2 n$-universal class of hash functions (see e.g. (Siegel, 1989; Karp et al., 1993; Czumaj et al., 1995d)). To achieve a consistent simulation, the majority technique due to Upfal and Wigderson (1987) is used. It ensures that it suffices to always access a majority of all copies of a key to get a consistent shared memory simulation.
Our first result is a simulation of an n-processor CRCW PRAM on an n-processor reconfigurable DMM (R-DMM) with $O(\log^* n)$ delay, w.h.p. This result compares favourably with the best known PRAM simulation on the Standard-DMM, which has delay $O(\log \log \log n \cdot \log^* n)$, w.h.p. (Czumaj et al., 1995d). The simulation by an R-DMM can be made time-processor optimal for EREW PRAMs.
Our second result shows that an n-processor RM-DMM is as powerful as an n-processor CRCW PRAM. More precisely, we present a step-by-step simulation that runs in real time, i.e., guarantees constant delay for each simulated PRAM step, w.h.p. Hence we combine the advantages of the simulations of Wang and Chen (1990) and Czumaj et al. (1995d), and significantly improve the result of Matias and Schuster (1995). The main idea of the simulation is to transfer the $O(\log^* n)$-delay simulation on the R-DMM to the RM-DMM and then redesign all non-constant-time steps.
The paper is organized as follows. In Section 2 we give the precise definitions of the computation models. Section 3 presents the general idea of hashing-based PRAM simulations, and states two graph-theoretic lemmas from (Czumaj et al., 1995d) that are the basis for the analysis of our algorithms. Section 4 gives two algorithmic tools used in the simulations. In Section 5 we present a simulation of a CRCW PRAM on an R-DMM. Finally, Section 6 contains the real-time simulation of a CRCW PRAM on an RM-DMM and shows the optimality of this result.
Computation Models
A parallel random access machine (PRAM) consists of p processors $P_1, \ldots, P_p$ and a shared memory with cells $U = \{1, \ldots, m\}$. The processors work synchronously and have random access to the shared memory cells, each of which can store an integer. We consider two models of the PRAM: an exclusive read exclusive write (EREW) PRAM, in which concurrent reads and writes are forbidden, and a concurrent read concurrent write (CRCW) PRAM, which allows concurrent reads and writes. Among the many variants of the CRCW PRAM model, we only deal with two rules for resolving conflicts when several processors want to write to the same shared memory cell simultaneously: the Priority CRCW PRAM, in which the processor with the highest priority succeeds, and the weaker Arbitrary CRCW PRAM, in which an arbitrary processor succeeds.
A distributed memory machine (DMM) has n processors $Q_1, \ldots, Q_n$ connected via a routing network with a distributed memory consisting of n memory modules $M_1, \ldots, M_n$ (see Figure 1). A module has a communication window and can read [...] one succeeds. This is the same conflict resolution rule mentioned above for the Arbitrary CRCW PRAM.
If we specify the routing network we can distinguish between several models. If the routing network is a complete network we call the model the Standard-DMM, as introduced by Karp et al. (1993). Note that an n-processor Standard-DMM can be simulated with constant delay on an n-processor Arbitrary CRCW PRAM with O(n) shared memory cells and vice versa.
By adding the capability of reconfiguration to the complete network as routing network, we get the reconfigurable distributed memory machine (R-DMM). Roughly speaking, the capability of reconfiguration allows a processor to combine two adjacent links to other processors into a bus. Because in the bipartite graph there are no direct links between the processors, we identify processor $Q_i$ with module $M_i$, for $i = 1, \ldots, n$. Hence, the complete bipartite graph between the processors and modules can be viewed as a complete network connecting the processor/module pairs. Thus, we can view each processor as having a link to all other processors. Each processor $Q_l$ can combine a link to $Q_k$ with a link to $Q_m$ into a bus. These combined links are viewed as (hardware) connected. Hence, such combined links are building blocks for larger bus components. These buses are restricted to node-disjoint cycles or paths (see Figure 2). The R-DMM dynamically reconfigures itself at each time step. Each processor of the R-DMM acts locally in each step, combining two adjacent links into one. In each step of the R-DMM one or more processors connected by a bus can try to transmit a message on the bus. If more than one does, an arbitrary one succeeds. This is the same Arbitrary write conflict resolution rule as described for the Standard-DMM. All processors connected by the bus can read the message transmitted on the bus. Clearly, this means that it is possible to broadcast information in one step to more than one processor. The basic assumption concerning the behavior of the reconfigurable model is that the time to transmit a message along any bus is constant, regardless of the length of the bus.
If we use an $n \times n$ reconfigurable mesh for the topology of the routing network, we get the n-processor reconfigurable mesh DMM (RM-DMM). The reconfigurable mesh (Li and Stout, 1991) consists of a two-dimensional mesh in an $n \times n$ square grid with one switch per grid point and a reconfigurable bus system. Each switch is connected to the reconfigurable bus system through four ports, denoted by N, S, W, and E. The configuration of the bus system can be changed by connecting different pairs of ports within each switch. Hence, the global reconfiguration is a partition of the network into edge-disjoint paths and cycles. The computational power of a switch is very limited. It can store one integer and can perform only very simple computations: basic operations on two numbers, changing the connections between ports, and reading from or writing to the bus it is connected to.
In the RM-DMM, the n processors are assigned to the first column and the n memory modules to the first row of the mesh. An example of a 4-processor RM-DMM is shown in Figure 3. Each processor and each switch can communicate with other switches and processors by broadcasting a message through the bus. All processors and switches connected to the bus can read the message. If more than one tries to send a message on a bus, an arbitrary one of them succeeds. Again, this is the same Arbitrary write conflict resolution rule as described for the Standard-DMM.
Simulation Techniques
Our description is based on the approach presented by Czumaj et al. (1995d). We consider shared memory simulations on a DMM that are based on hashing. In a preprocessing phase each processor $P_i$ of the PRAM is mapped to processor $Q_i$ of the DMM. The memory of the PRAM is hashed using three hash functions $h_1, h_2, h_3 : U \to \{1, \ldots, n\}$. Each memory cell $u \in U$ of the PRAM (we say key for short) will be stored in the modules $M_{h_1(u)}$, $M_{h_2(u)}$, and $M_{h_3(u)}$ of the DMM. We will call the representations of $u$ in the $M_{h_i(u)}$'s the copies of $u$. A class $H$ of hash functions mapping $U$ into $\{1, \ldots, n\}$ is k-universal (Carter and Wegman, 1979) if, for each $u_1 < \cdots < u_j \in U$ and $l_1, \ldots, l_j \in [n]$ with $j \le k$, a hash function $h$ drawn uniformly at random from $H$ satisfies
$$\Pr(h(u_1) = l_1, \ldots, h(u_j) = l_j) \le \frac{2}{n^j}.$$
For our purposes we require a $\log^2 n$-universal class of hash functions H, such that a random $h \in H$ can be constructed fast, stored using little space, and evaluated in constant time. For example, we can use a $\sqrt{n}$-universal class of hash functions described by Siegel (1989), or a class developed by Karp et al. (1993). For the simulation of a PRAM step we use the majority technique due to Upfal and Wigderson (1987). It ensures that it suffices to access any two of the three copies of a shared memory cell to guarantee a correct simulation. Each copy of a key contains a time stamp indicating the update time. To write to a memory cell, a processor of the DMM accesses at least two of the copies, updates them, and adds a time stamp to them indicating the (PRAM) time of the update. To read a memory cell, a processor has to access two of the copies. This guarantees that at least one up-to-date copy is accessed. It can be recognized by its time stamp.
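The two-out-of-three rule can be made concrete with a small sequential model. The following Python sketch is only an illustration under stated assumptions: the class names `Copy` and `ReplicatedCell` and the argument `which` (the indices of the copies that were actually reached) are hypothetical and do not appear in the paper, and the sketch ignores hashing and routing entirely.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    value: int = 0
    stamp: int = -1  # PRAM time of the last update; -1 means never written


class ReplicatedCell:
    """One PRAM memory cell stored as three time-stamped copies (illustrative)."""

    def __init__(self):
        self.copies = [Copy(), Copy(), Copy()]

    def write(self, value, time, which):
        # `which` lists the indices of the copies reached; at least two are required.
        assert len(set(which)) >= 2
        for i in set(which):
            self.copies[i] = Copy(value, time)

    def read(self, which):
        # Any two copies share at least one copy with the latest write set,
        # so the accessed copy with the largest time stamp is up to date.
        assert len(set(which)) >= 2
        newest = max((self.copies[i] for i in set(which)), key=lambda c: c.stamp)
        return newest.value


cell = ReplicatedCell()
cell.write(7, time=1, which=(0, 1))
cell.write(9, time=2, which=(1, 2))
```

After these two writes, every pair of copies contains a copy with time stamp 2, so a read of any two copies returns the value of the second write.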
We modify this two-out-of-three idea and split the schedule into three steps, each of which tries to access one out of two copies, with a different pair of hash functions in each step. In this way we always access at least two different copies out of the three possible. Using the majority technique we get a consistent simulation. Therefore, in the following we will focus on the analysis of accessing one out of two possible copies of a shared memory cell, i.e., an access schedule that uses two hash functions $h_1$ and $h_2$. Let us call such a schedule a one-out-of-two-schedule.
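The claim that three one-out-of-two steps always reach at least two different copies can be checked by enumeration. The sketch below assumes that the three steps use the pairs $(h_1, h_2)$, $(h_2, h_3)$, and $(h_1, h_3)$; the order of the pairs and the names `PAIRS` and `copies_reached` are illustrative assumptions.

```python
from itertools import product

# Indices of the two hash functions used in each of the three steps (assumed order).
PAIRS = [(1, 2), (2, 3), (1, 3)]


def copies_reached(choice):
    """choice[k] in {0, 1} selects which copy of PAIRS[k] answered in step k."""
    return {PAIRS[k][c] for k, c in enumerate(choice)}


# Smallest number of distinct copies reached over all 2^3 possible outcomes.
min_distinct = min(len(copies_reached(ch)) for ch in product((0, 1), repeat=3))
```

No copy index occurs in all three pairs, which is why a single copy can never answer all three steps.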
For technical reasons, we do not perform all n accesses to the shared memory simultaneously but split the requests into batches of size $n/2^{2c+6}$, for some constant $c \ge 1$ to be specified later. Since we only have a constant number of batches, this will slow down our algorithm only by a constant factor. We will focus in this section only on the requests to the memory of an EREW PRAM, so that all requested PRAM memory cells are pairwise distinct. In Section 4.2 we describe a result that enables us to generalize our simulations to the CRCW PRAM.
Let S denote a batch of $n/2^{2c+6}$ requests to the memory of the PRAM and let $h_1$ and $h_2$ be chosen uniformly at random from the $\log^2 n$-universal class of hash functions H. Let $H = (\{1, \ldots, n\}, E)$ be the labeled undirected graph defined by $h_1$, $h_2$ and the set of requests S. The nodes are the memory modules of the DMM.
For each $u \in S$ there is an edge $(M_{h_1(u)}, M_{h_2(u)})$ labeled $u$ in H. Note that parallel edges and self-loops are allowed in H; however, all labels are distinct.
One can view the one-out-of-two-schedule as the following process on the graph H. Each processor that wants to access a shared memory cell $u \in U$ asks in each step either $M_{h_1(u)}$ or $M_{h_2(u)}$. This corresponds to directing the edge labeled $u$ in H to $M_{h_1(u)}$ or $M_{h_2(u)}$, respectively. Then, if a module $M_j$ answers the request to cell $u$, the edge labeled $u$ is removed from H. Summarizing, we direct in each step every edge in H, and then every node removes one edge (if any) that points to it. Before the next step starts, the orientations of the remaining edges are erased. The simulation ends when all the edges of H are removed.
Note that, initially, each processor only knows one edge of H, namely the edge labeled with its request. Because the hash functions are chosen uniformly at random from a $\log^2 n$-universal class of hash functions, H has properties similar to those of a random graph (here we use the assumption that all elements of S are pairwise distinct). The simulation we present relies on these properties of the graph H.
Define the size of a connected component C, denoted by $|C|$, to be the number of nodes it contains. We restate the following two lemmas proved by Czumaj et al. (1995d). We note that Lemma 3.1 implies the existence of a constant-time algorithm for removing all edges in H: Let C be a connected component in H and let T be an arbitrary spanning tree of C. Fix one node r in T and make it the root of T. Direct every edge of T to its endpoint farther from the root, so that each node receives at most one tree edge, and direct all other edges in C in an arbitrary way. Because only a constant number of edges in C do not belong to T, only a constant number of edges in C will remain after the step. Thus a constant number of steps suffices to remove all the edges in C. For future reference we will call such a schedule an off-line schedule.
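The off-line schedule can be sketched sequentially as follows. This is a minimal illustration under assumptions, not the algorithm used in the simulation: the function names `offline_schedule` and `rounds_needed` are hypothetical, the spanning tree is found by breadth-first search, and each tree edge is directed to the endpoint farther from the root, which is the orientation under which every node receives at most one tree edge. With a fixed orientation, the number of steps needed equals the maximum number of edges directed to one module.

```python
from collections import defaultdict, deque


def offline_schedule(edges):
    """edges: dict label -> (module_a, module_b). Returns dict label -> answering module."""
    adj = defaultdict(list)
    for label, (a, b) in edges.items():
        adj[a].append((b, label))
        adj[b].append((a, label))
    target, seen = {}, set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:  # breadth-first search builds a spanning tree of the component
            v = queue.popleft()
            for w, label in adj[v]:
                if w not in seen:
                    seen.add(w)
                    target[label] = w  # tree edge: directed to the child endpoint
                    queue.append(w)
    for label, (a, _) in edges.items():
        target.setdefault(label, a)  # non-tree edge: arbitrary endpoint
    return target


def rounds_needed(target):
    """Each module answers one request per step, so the maximum load is the step count."""
    load = defaultdict(int)
    for module in target.values():
        load[module] += 1
    return max(load.values(), default=0)


# A triangle with an extra self-loop: 3 nodes, 4 edges.
edges = {"u1": (1, 2), "u2": (2, 3), "u3": (1, 3), "u4": (1, 1)}
target = offline_schedule(edges)
```

For a component that is a tree the schedule finishes in one step; each additional edge can add at most one further step.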
4 Algorithmic Tools
4.1 Log-star and Constant-time Algorithms
In this section we outline the main algorithmic tools used by our algorithms.
If an array of size 2n contains at least n objects, we will call the array padded-consecutive.
Given n integers $x_1, x_2, \ldots, x_n$, the strong semisorting problem (Bast and Hagerup, 1993) is to store them in a padded-consecutive array, such that all variables with the same value occur in a padded-consecutive subarray.
Given n bits $x_1, x_2, \ldots, x_n$, the chaining problem (Berkman and Vishkin, 1993; Ragde, 1993) is to find, for each $x_i$, the nearest 1's both to its left and to its right.
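To fix the input and output of the chaining problem, a sequential reference version may help. The sketch below is not the parallel algorithm referred to in the paper; it assumes that "nearest" means strictly to the left or right of position i, and the name `chaining` and the use of `None` for "no such 1" are illustrative choices.

```python
def chaining(bits):
    """Return (left, right): indices of the nearest 1 strictly left/right of each position."""
    n = len(bits)
    left, right = [None] * n, [None] * n
    last = None
    for i in range(n):  # sweep left to right, remembering the last 1 seen
        left[i] = last
        if bits[i] == 1:
            last = i
    last = None
    for i in range(n - 1, -1, -1):  # sweep right to left
        right[i] = last
        if bits[i] == 1:
            last = i
    return left, right


left, right = chaining([0, 1, 0, 0, 1])
```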
Given m tasks distributed among n processors, the processor allocation problem is to redistribute the tasks so that each processor gets $O(\lceil m/n \rceil)$ tasks.
As we mentioned in the last section, the n-processor DMM is essentially equivalent to the n-processor Arbitrary CRCW PRAM with O(n) shared memory.
Hence we can use algorithms designed for the CRCW PRAM to obtain the following lemma (for the proof see (Czumaj et al., 1995d)).
Lemma 4.1 The following problems can be solved on the n-processor Standard-DMM (and therefore also on the R-DMM) in $O(\log^* n)$ time with probability at least $1 - 2^{-n^{\varepsilon}}$ for some constant $\varepsilon > 0$:
(1) strong semisorting (2) chaining
Sorting is a very important and convenient subroutine that we use in many places in our algorithms for the RM-DMM. Olariu et al. (1993) obtained the following result for integer sorting.
Lemma 4.2 A sequence of n integers in the range from 0 to $n^c$, for a constant c, can be sorted deterministically in constant time on an n-processor RM-DMM.
We will also use the following lemma given in (Wang and Chen, 1990; Ben-Asher et al., 1991).
Reduction from the CRCW PRAM to the EREW PRAM
In Section 3 we assumed that an EREW PRAM is to be simulated and thus the elements in S are pairwise distinct. In this subsection we present a reduction that allows us to focus only on an EREW PRAM to be simulated. Czumaj et al. (1995d) showed the following.
Lemma 4.4 If an n-processor EREW PRAM can be simulated on an n-processor Standard-DMM with delay $\delta$, w.h.p., then an n-processor Priority CRCW PRAM can be simulated on an n-processor Standard-DMM with delay $O(\delta + \log^* n)$, w.h.p.
This result suffices for our simulation on an R-DMM. To achieve a constant-time simulation on an RM-DMM we develop a stronger reduction with constant delay for the RM-DMM.
Suppose that each processor $P_i$ of the CRCW PRAM wants to access memory cell $u_i \in U$. Let q be a prime, $q \ge m$, and $s \ge 2$. Choose randomly two integers $a, b \in \{0, \ldots, q-1\}$ and define a function $h(x) = ((a + bx) \bmod q) \bmod s$, for $x \in U$.
The following lemma is well known (see e.g. (Dietzfelbinger et al., 1994)). Observe that a function h can be stored in O(1) cells, and can be generated and evaluated in constant time by one processor. After a and b have been chosen by one processor, h can be distributed to all other processors using two broadcasting steps on the RM-DMM. Then each processor can evaluate h in constant time.
In order to extend our simulations to CRCW PRAMs we have to show how to deal with duplicate requests to the same memory cell. Fix $\alpha$ to be a constant in Lemma 4.5 such that the required success probability $1 - n^{-\alpha}$ is large enough. We first perform integer sorting on the pairs $(h(u_1), 1), \ldots, (h(u_n), n)$. By Lemma 4.2 this can be done in constant time on the n-processor RM-DMM. Then, by Lemma 4.5, with high probability two values $h(u_i)$ and $h(u_j)$ are equal only if $u_i = u_j$. Hence, with high probability the addresses with the same value are stored in a contiguous subsequence of the sorted sequence. We choose the first element from each such contiguous subsequence, call it the leader of this subsequence, and then proceed only with it as in the EREW PRAM case. Finally, the leader has to broadcast the answer to read requests to the duplicates. The broadcasting can also be performed in constant time on an RM-DMM. Hence we obtain the following lemma.
Lemma 4.6 If one can simulate an n-processor EREW PRAM on an n-processor RM-DMM with delay $\delta$, then one can also simulate an n-processor Priority CRCW PRAM on an n-processor RM-DMM with delay $O(\delta)$, w.h.p.
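The reduction can be illustrated sequentially: hash every requested address, sort the pairs (hash value, processor index), and let the first processor of each run of equal hash values act as leader. The Python sketch below is an illustration under assumptions. The names `make_hash`, `random_hash`, and `elect_leaders` are hypothetical, an ordinary comparison sort stands in for the constant-time integer sorting of Lemma 4.2, and the example values of q, s, a, and b are chosen only so that the three distinct addresses receive distinct hash values.

```python
import random


def make_hash(q, s, a, b):
    """h(x) = ((a + b*x) mod q) mod s, with q prime and a, b in {0, ..., q-1}."""
    return lambda x: ((a + b * x) % q) % s


def random_hash(q, s, rng=random):
    # In the reduction a and b are drawn at random by one processor.
    return make_hash(q, s, rng.randrange(q), rng.randrange(q))


def elect_leaders(requests, h):
    """requests[i] is the cell requested by processor i; returns the leader of each i."""
    order = sorted(range(len(requests)), key=lambda i: (h(requests[i]), i))
    leader = [None] * len(requests)
    current = None
    for i in order:
        # A new run of equal hash values starts: its first processor is the leader.
        if current is None or h(requests[i]) != h(requests[current]):
            current = i
        leader[i] = current
    return leader


h = make_hash(q=101, s=97, a=3, b=7)  # fixed a, b for a reproducible example
requests = [5, 9, 5, 2, 9, 5]
leaders = elect_leaders(requests, h)
```

Only the leaders take part in the EREW simulation; for read requests each leader afterwards passes the answer to the processors that share its address. If two different addresses collide under h, the sketch groups them together, which is the low-probability failure event excluded by the lemma.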
An $O(\log^* n)$-Delay Simulation on an R-DMM
In this section we show a nearly constant-time simulation of a PRAM obtained by adding the power of reconfiguration to the Standard-DMM model. We achieve a delay of $O(\log^* n)$ for a simulation of an n-processor EREW PRAM on an n-processor R-DMM. Using Lemma 4.4, this result extends to CRCW PRAMs.
Given an access graph H, assume the properties of Lemmas 3.1 and 3.2. Our algorithm first finds the decomposition of H into its connected components, and then works on each of them independently. For each component we perform in parallel many virtual access experiments to find the best schedule, which is in fact a constant-time off-line schedule. Finally, we execute this schedule.
A high-level description of the algorithm is as follows: [...]
We now describe the steps and their implementations in detail.
Step 1: Our first goal is that the processors of each connected component of H agree on a leader. We divide this step into four substeps.
Step (1.1) finds a decomposition of each connected component into a constant number of "Euler cycles". In Step (1.2) we reconfigure the R-DMM according to the Euler cycles, and in Step (1.3) all edges of each cycle agree on one leader. In Step (1.4) all Euler cycles in each connected component are combined into one Euler cycle and a leader for the connected component is found.
Step 1.1: We replace each undirected edge $(i, j)$ of H by two directed edges in opposite directions, $[i, j]$ and $[j, i]$. This guarantees that each component contains an Euler cycle. We assign a processor to each directed edge (also called an arc). Hence, in the following we identify an arc with its processor. The capability of reconfiguration is only used for finding leaders in the Euler cycles. We order the arcs by their first coordinates, i.e., the nodes they want to access. We use here the $O(\log^* n)$-time strong semisorting algorithm (Lemma 4.1).
Then we use the chaining algorithm (Lemma 4.1) to find, for each node v, its adjacency list. Now all processors that want to access a node v are stored in a consecutive adjacency array of v.
Step 1.2: Now we reconfigure the links between processors and modules of the R-DMM according to the cycles. That is, if arc $e_1$ precedes arc $e_2$ and arc $e_2$ precedes arc $e_3$ in a cycle, then the processor assigned to arc $e_2$ combines its link to the processor assigned to arc $e_3$ with its link to the processor assigned to arc $e_1$. Hence, the processors assigned to the arcs $e_1$, $e_2$, and $e_3$ are connected via one bus. Looking at the connected component, we have established the Euler cycle as a bus connecting the processors assigned to the arcs of this Euler cycle.
Step 1.3: Now each arc sends its identifier through the assigned bus. Because of the Arbitrary rule for conflict resolution on the bus, all arcs on each cycle receive one identifier, which is called the leader of the cycle.
Step 1.4: Next we combine all the cycles within each connected component. Each edge of H whose two directed arcs belong to different cycles chooses the smaller of their two leaders and sends the identifier of the assigned processor on the cycle of the larger leader. If it succeeds (according to the Arbitrary rule on a bus of the R-DMM), it combines these two cycles by swapping the successors of the two arcs belonging to the edge.
Note that, by Lemma 3.1, there is no connected component C with more than $|C| + O(1)$ edges, w.h.p. Therefore, we only have to combine cycles a constant number of times to join all of them into one Euler cycle for each connected component. Thus we can perform this step in constant time, w.h.p., and we have finally found a leader for each connected component.
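Steps 1.1 to 1.4 can be illustrated by a sequential sketch. It is not the parallel R-DMM procedure and rests on several assumptions: it uses the standard Euler-tour successor rule (the successor of arc [i, j] is the arc that follows [j, i] in the adjacency list of j), it takes the first arc encountered on a cycle as its leader in place of the Arbitrary bus rule, and it merges cycles one edge at a time. The names `euler_successors`, `cycle_leaders`, and `merge_cycles` are hypothetical.

```python
from collections import defaultdict


def euler_successors(edges):
    """edges: list of (i, j). Arc 2*e is [i, j] and arc 2*e + 1 is its reverse [j, i]."""
    arcs = []
    for i, j in edges:
        arcs += [(i, j), (j, i)]
    out = defaultdict(list)  # adjacency lists: arcs grouped by first coordinate
    for a, (i, _) in enumerate(arcs):
        out[i].append(a)
    pos = {a: k for lst in out.values() for k, a in enumerate(lst)}
    succ = {}
    for a, (_, j) in enumerate(arcs):
        twin = a ^ 1  # index of the reverse arc
        lst = out[j]
        succ[a] = lst[(pos[twin] + 1) % len(lst)]
    return arcs, succ


def cycle_leaders(succ):
    """Label every arc with the first arc found on its cycle of the successor map."""
    leader = {}
    for start in succ:
        a = start
        while a not in leader:
            leader[a] = start
            a = succ[a]
    return leader


def merge_cycles(edges, succ):
    """Join cycles by swapping the successors of the two arcs of an edge."""
    leader = cycle_leaders(succ)
    for e in range(len(edges)):
        a, b = 2 * e, 2 * e + 1
        if leader[a] != leader[b]:
            succ[a], succ[b] = succ[b], succ[a]  # two cycles become one
            leader = cycle_leaders(succ)
    return succ


edges = [(0, 1), (1, 2), (2, 0), (3, 4)]  # a triangle and a separate edge
arcs, succ = euler_successors(edges)
cycles_before = len(set(cycle_leaders(succ).values()))
merge_cycles(edges, succ)
cycles_after = len(set(cycle_leaders(succ).values()))
```

In this example the triangle initially splits into two cycles of three arcs each; one swap joins them, so that afterwards there is exactly one cycle, and hence one leader, per connected component.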
Step 2: Let C be a connected component and let $S(C)$ denote the number of edges in C. Note that $S(C) = O(|C|)$ by Lemma 3.1. We use the strong semisorting algorithm (Lemma 4.1) to group all edges of C in a subarray $B_C$ of size $\tilde{S}(C)$, $S(C) \le \tilde{S}(C) \le 2S(C)$, and then compute the approximate size $\tilde{S}(C)$ using the chaining algorithm (Lemma 4.1). Then we allocate in $O(\log^* n)$ time for each edge in C exactly $2^{\tilde{S}(C)}$ processors, w.h.p., using Lemma 4.1. Hence, altogether we allocate [...] processors, for some constant b. We can do this within our resources because Lemma 3.1 and Lemma 3.2 ensure that the total number of allocated processors is linear. We can view these assignments as providing $2^{\tilde{S}(C)}$ DMMs of size $\tilde{S}(C)$ for each connected component C.
Step 3: We systematically test all $2^{\tilde{S}(C)} = 2^{|C| + O(1)}$ orientations of the edges of C in parallel. Thus, in $O(\log^* n)$ time, using strong semisorting, we can compute one with indegree at most $s + 1$, if C contains not more than $|C| + s$ edges. (Note that by the discussion at the end of Section 3 such an orientation does exist.)
Step 4: Apply the access protocol indicated by the orientation of the edges computed above. All accesses are done after $s + 1 = O(1)$ iterations.
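The idea behind Steps 3 and 4 can be shown with a sequential stand-in for the parallel virtual access experiments: enumerate all orientations of the edges of one component and keep one with the smallest maximum indegree, which is the number of iterations the resulting schedule needs. The function name `best_orientation` is a hypothetical choice, and the exhaustive loop replaces the exponentially many DMM copies that test the orientations in parallel.

```python
from collections import Counter
from itertools import product


def best_orientation(edges):
    """edges: list of (a, b) endpoint pairs of one component.

    Returns (max_indegree, targets), where targets[k] is the module that
    answers the request of edge k in the selected orientation.
    """
    best = None
    for choice in product((0, 1), repeat=len(edges)):
        targets = [edge[c] for edge, c in zip(edges, choice)]
        indegree = max(Counter(targets).values(), default=0)
        if best is None or indegree < best[0]:
            best = (indegree, targets)
    return best


triangle = [(1, 2), (2, 3), (3, 1)]            # |C| = 3 nodes, 3 edges
triangle_plus = [(1, 2), (2, 3), (3, 1), (1, 2)]  # |C| = 3 nodes, 4 edges
```

For the triangle an orientation with indegree 1 exists, so all accesses finish in one iteration; with the additional parallel edge some module must answer two requests, so two iterations are needed.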
Summarizing, only Step 1.1, Step 2, and Step 3 of Simulation R-DMM need $O(\log^* n)$ time, w.h.p. All other steps can be done in constant time. Hence, we get the following theorem.
Theorem 5.1 Simulation R-DMM simulates an n-processor EREW PRAM on an n-processor R-DMM with $O(\log^* n)$ delay, w.h.p.
This result can be extended in two directions. First, using Lemma 4.4 we can reformulate the result for a CRCW PRAM:
Theorem 5.2 An n-processor CRCW PRAM can be simulated on an n-processor R-DMM with $O(\log^* n)$ delay, w.h.p.
Second, we can transform our simulation into a time-processor optimal one. We use a result from (Czumaj et al., 1995d):
Lemma 5.3 If there exists a simulation of an n-processor EREW PRAM on an n-processor DMM based on a '1 out of 2' protocol with delay bounded by $\delta$, w.h.p., then, using a constant number of hash functions, a $(\delta n)$-processor EREW PRAM can be simulated on an n-processor DMM with delay $O(\delta + \log^* n)$, w.h.p.
As Simulation R-DMM fulfills the assumptions of this lemma, we achieve a time-processor optimal simulation.
Theorem 5.4 An $(n \log^* n)$-processor EREW PRAM can be simulated on an n-processor R-DMM with $O(\log^* n)$ delay, w.h.p.
Simulation with Constant Delay on a Reconfigurable Mesh DMM
The reconfigurable model of parallel computation most widely studied in the literature is the reconfigurable mesh. In this section we transfer the simulation on an R-DMM with $O(\log^* n)$ delay to a simulation with constant delay on an n-processor RM-DMM.
For the simulation we use the algorithm Simulation R-DMM. We show that each step of Simulation R-DMM can be performed in constant time on an n-processor RM-DMM.
We proceed in two steps. First, we simulate the reconfigurable DMM with n processors on an n-processor RM-DMM with constant delay. This implies that we can perform any step of Simulation R-DMM that takes constant time on the n-processor R-DMM (that is, Steps 1.2, 1.3, 1.4, and 4) in constant time on the n-processor RM-DMM. Then we show how to perform the steps of Simulation R-DMM that take $O(\log^* n)$ time on the R-DMM in constant time on the RM-DMM. As a main tool we use an algorithm for sorting n integers on an $n \times n$ reconfigurable mesh in constant time (Lemma 4.2).
Simulation Between Reconfigurable Architectures
The relationship between an n-processor R-DMM and an n-processor RM-DMM is stated in the following lemma.
Lemma 6.1 Each step of an n-processor R-DMM can be simulated deterministically with constant delay on an n-processor RM-DMM.
Proof: First we show how to simulate, on a reconfigurable mesh, the communication of the R-DMM that does not use the reconfiguration, i.e., we simulate read and write steps of a Standard-DMM. Then we extend this simulation to cover the capability of reconfiguration.
The simulation of a Standard-DMM (that is, the simulation of a step of the R-DMM which does not use reconfiguration of the links) is equivalent to the simulation of a CRCW PRAM with n memory cells, as mentioned in Section 2. Therefore, we can use here the constant-time simulation of the n-processor CRCW PRAM with n memory cells on the $n \times n$ reconfigurable mesh from Lemma 4.3.
It remains to show how to simulate the capability of reconfiguration of the R-DMM, that is, the capability of each processor to combine two links into a bus.
Assume a processor $Q_l$ of the R-DMM wants to combine the link to processor $Q_m$ with the link to processor $Q_k$ of the R-DMM. Let us denote the processors of the RM-DMM by $R_1, \ldots, R_n$. Processor $Q_l$ of the R-DMM will be simulated by processor $R_l$ of the reconfigurable mesh DMM, $1 \le l \le n$ (see Figure 4).
Because the buses in the R-DMM are edge- and processor-disjoint, each link of the reconfigurable mesh is only used once. A communication step on a bus consists of two parts. First, the processor $R_l$ simulating processor $Q_l$ of the R-DMM sends the request to the $(l, l)$-switch. This can be done if all switches are connected (W-E). Then all switches are reconfigured with respect to the bus structure of the R-DMM to be simulated, as described above. Now, the $(l, l)$-switches can read from or write to the bus and send back the result to the processor $R_l$. For the last step again all switches have to be configured (W-E). □
6.2 Analysis of the Simulation on an RM-DMM
As the main result of this section, we obtain the following theorem.
Theorem 6.2 An n-processor Priority CRCW PRAM can be simulated on an n-processor RM-DMM with constant delay, w.h.p.
Proof: By Lemmas 4.6 and 6.1 it remains to show that we can perform each step of Simulation R-DMM that needs non-constant time on the R-DMM in constant time on the RM-DMM, that is, Step 1.1, Step 2, and Step 3.
Step 1.1: The algorithm for Step 1.1 is essentially the same as the one described in Section 5. To find an Euler tour for each connected component we replace each undirected edge $(i, j)$ by two directed edges $[i, j]$ and $[j, i]$, and assign a processor to each directed edge. Now we use Lemma 4.2 to sort the directed edges with respect to the first coordinate. This gives the adjacency list for each node. Next we can perform the standard Euler tour construction as described in Simulation R-DMM. As a result every processor knows its successor in the Euler tour. If every processor sends its ID to its successor, then every processor also knows its predecessor.
Step 2: In the analysis of this step we use the knowledge about the structure of the access graph, especially Lemma 3.2. First, we sort in constant time (Lemma 4.2) the processors with respect to the identifier of the Euler cycle they belong to. Because we also specified a leader, we can compute the size of each connected component in constant time. Then, we broadcast it to all members of a connected component using the bus in each connected component. We have to allocate to each processor a number of processors that is exponential in the size of the connected component it belongs to. Because we cannot use the $O(\log^* n)$-time allocation algorithm as in the last section, we sort the processors by their number of requested processors.
Lemma 3.1 ensures that the numbers to be sorted are in the range of O(n). Lemma 3.1 gives an upper bound on the number of processors requesting at least $2^k$ processors. With high probability, at most $4^{c+1} n k / 2^{ck}$ processors are in connected components of size at least k, and hence request at least $2^k$ processors. We allocate processors with respect to the distribution stated by Lemma 3.2. More precisely, this means that we allocate $2^k$ processors, $1 \le k \le \log n$, to the requesting processors that are in the interval
$$\left[\, \sum_{i=k+1}^{\log n} \frac{4^{c+1} n\, i}{2^{ci}},\ \ \sum_{i=k}^{\log n} \frac{4^{c+1} n\, i}{2^{ci}} - 1 \,\right]$$
of the sorted array. This deterministic allocation ensures that with high probability we always allocate enough processors, and Lemma 3.2 also ensures that the total number of allocated processors is linear. Altogether this step needs constant time.
Step 3: We perform the virtual access experiments on the copies of each connected component in constant time using the sorting algorithm from Lemma 4.2. This allows us to sort in parallel the accesses in each copy of a connected component in constant time and to determine a constant-time off-line schedule as described in the previous section.
□
Remark: In the simulation it seems that we need in some places an O(n)-processor RM-DMM. There are two ways to circumvent this problem. The first way is to use the self-simulation described in (Ben-Asher et al., 1993). It allows us to simulate an O(n)-processor RM-DMM on an n-processor RM-DMM with constant delay. The second (and simpler) way is to set the value of c in Lemmas 3.1 and 3.2 such that each batch of size $n/2^{2c+6}$ will not require more than n processors and n modules. Then, we access the batches one after the other and need only an n-processor RM-DMM for each batch. Therefore, we can assume a linear size of the model without loss of generality.
Finally, we can show that our simulation is optimal.
Theorem 6.3 Any randomized constant-time simulation of an n-processor EREW PRAM on an N-processor RM-DMM requires that $N = \Omega(n)$.
Proof: Consider the permutation routing problem, that is, the problem where each processor $P_i$ has a packet to send and each processor is the destination of one packet. This problem can easily be solved on the n-processor EREW PRAM in constant time. We can use the following bisection argument to get a lower bound on the area-time-squared ($A \cdot T^2$) product.
Halve the $N \times N$ reconfigurable network by a horizontal line. Let all packets from the bottom half have destinations in the upper half and vice versa. Therefore, at least n packets have to cross the line. On the other hand, the line cuts only N links. Hence, to ensure that all n packets cross the line in time t, $N \cdot t$ must be $\Omega(n)$. This shows that the area-time-squared product on a mesh is at least $n^2$ even in the randomized case, i.e., $A \cdot T^2 = \Omega(n^2)$. Thus $A = \Omega(n^2)$, i.e., $N = \Omega(n)$, is necessary for $T = O(1)$. □
