Abstract. This paper presents and analyses efficient implementations of a so-called direct process on distributed memory machines (DMMs) that yield
- a simulation of an n-processor PRAM on an n-processor optical crossbar DMM with delay O(log log n),
- a simulation of an n-processor PRAM on an n-processor arbitrary DMM with delay O(log log n / log log log n),
- an implementation of a static dictionary on an n-processor arbitrary DMM with parallel access time O(log* n).
We further prove a lower bound for executing the above process, showing that our implementations are optimal.
Introduction
Parallel machines that communicate via a shared memory, so-called parallel random access machines (PRAMs), represent an idealization of a parallel computation model.
The user does not have to worry about synchronization, locality of data, communication capacity, delay effects or memory contention.
On the other hand, PRAMs are very unrealistic from a technological point of view; large machines with shared memory can only be built at the cost of very slow shared memory access. A more realistic model is the distributed memory machine (DMM), where the memory is partitioned into modules, one per processor. In this case a parallel memory access is restricted in so far as only one access to each module can be performed per parallel step. Thus memory contention occurs if a PRAM algorithm is run on a DMM; parallel accesses to cells stored in one module have to be sequentialized.
Many authors have already investigated methods for simulating PRAMs on DMMs. If one focuses on a complete network between processors and modules, the main problem is the distribution of the shared memory cells over the modules to allow fast accesses. A standard method is to use universal hashing for distributing the shared memory among the memory modules of the DMM. In this paper we consider both simulations of PRAMs and implementations of parallel static dictionaries on DMMs, based on distributing the shared memory cells among the modules using not only one but several hash functions.
* email: fmadh@uni-paderborn.de. Supported in part by DFG-Forschergruppe "Effiziente Nutzung massiv paralleler Systeme, Teilprojekt 4", by Volkswagen Foundation and by the Esprit Basic Research Action Nr 7141 (ALCOM II)
** email: chrsch@uni-paderborn.de. Supported by the DFG-grant Me 872/6-1 (Leibniz Preis)
*** email: vost@hni.uni-paderborn.de. Supported by the DFG-Graduiertenkolleg "Parallele Rechnernetzwerke in der Produktionstechnik", ME 872/4-1
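The idea of distributing the shared memory cells with several hash functions, as outlined in the introduction above, can be illustrated with a minimal sketch. The hash class used here (linear functions modulo a prime) is only an assumed stand-in for a universal class; the names PRIME, make_hash and place_copies are hypothetical and the parameter values are arbitrary.

```python
import random

# Assumption: a fixed prime larger than every key; 2^31 - 1 is a Mersenne prime.
PRIME = 2_147_483_647


def make_hash(n_modules, rng):
    """Draw one function x -> ((alpha*x + beta) mod PRIME) mod n_modules."""
    alpha = rng.randrange(1, PRIME)
    beta = rng.randrange(0, PRIME)
    return lambda x: ((alpha * x + beta) % PRIME) % n_modules


def place_copies(key, hash_functions):
    """Return, for each hash function, the module index storing a copy of key."""
    return [h(key) for h in hash_functions]


rng = random.Random(0)   # seeded only to make the sketch reproducible
n = 8                    # number of memory modules
a = 3                    # number of independently chosen hash functions
hashes = [make_hash(n, rng) for _ in range(a)]

# Every key (shared memory cell) gets one copy per hash function.
placement = {key: place_copies(key, hashes) for key in range(1, 21)}
```

With a hash functions each key is stored a times, so the placement table above holds a module indices per key; two copies of the same key may still fall into the same module.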
Computation Models
A parallel random access machine (PRAM) consists of processors P1,...,Pn and a shared memory with cells U = {1,...,p}, each capable of storing one integer. The processors work synchronously and have random access to the shared memory cells. In this paper we will only consider the exclusive-read exclusive-write PRAM (EREW PRAM) model, that is, no two processors are allowed to access the same shared memory cell at the same time during a read or write step.
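The EREW restriction can be stated as a simple check on one parallel step. The following sketch is only illustrative; the function name is_erew_step and the representation of a step as a list of accessed cell indices are assumptions made here.

```python
from collections import Counter


def is_erew_step(accessed_cells):
    """True if no shared memory cell is accessed by more than one processor.

    accessed_cells holds one cell index per active processor in this step.
    """
    return all(count == 1 for count in Counter(accessed_cells).values())
```

For example, a step in which the processors access cells 1, 2 and 3 is admissible, while a step in which two processors access cell 1 is not.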
A distributed memory machine (DMM) consists of n processors Q1,...,Qn and n memory modules M1,...,Mn. Each processor has a link to each module. A basic communication step of such a DMM consists of the processors sending read or write requests to the memory modules, at most one request per processor. Each module processes some of the requests directed to it and sends an acknowledgement to each processor whose request was chosen to be processed.
We distinguish between the following rules for choosing requests for processing.
(c ≥ 1 is a fixed integer. For a discussion of the models see [DM93] or [M92].)
- arbitrary DMM: In this case, one arbitrarily chosen request out of all requests arriving at one module is processed per step. The answer given by a module is accessible by all processors accessing the module.
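A communication step under the arbitrary rule can be sketched as follows. This is a simplified model under stated assumptions: it tracks only which processors are acknowledged, ignores the fact that the answer is readable by all processors accessing the module, and resolves the arbitrary choice by a random pick. The names arbitrary_dmm_step and steps_until_served are hypothetical.

```python
import random
from collections import defaultdict


def arbitrary_dmm_step(requests, rng):
    """One communication step under the arbitrary rule.

    requests maps a processor index to the module index it addresses
    (at most one request per processor). Returns the set of processors
    whose request was chosen and acknowledged.
    """
    by_module = defaultdict(list)
    for proc, module in requests.items():
        by_module[module].append(proc)
    # Each module processes exactly one of its arriving requests.
    return {rng.choice(procs) for procs in by_module.values()}


def steps_until_served(requests, rng):
    """Repeat steps, re-sending unacknowledged requests, until none remain."""
    pending = dict(requests)
    steps = 0
    while pending:
        for proc in arbitrary_dmm_step(pending, rng):
            del pending[proc]
        steps += 1
    return steps
```

In this model the number of steps equals the largest number of requests directed to a single module, which is the memory contention that the distribution of cells over modules tries to keep small.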
Dictionaries and Shared Memory Simulations
Shared memory simulations on a DMM based on hashing begin with a preprocessing phase. In this phase each processor Pi of the PRAM is mapped to processor Qi of the DMM and the shared memory cells (we say keys for short) of the PRAM are distributed among the modules of the DMM via a > 1 randomly and independently chosen hash functions from some suitable universal class of hash functions (see below), i.e. each shared memory cell has a copies. This redundant storage representation needs space a · |U|. In this paper we will only deal with a ≥ 2. The basic access distribution phase (we say basic process for short) will be organized in such a way that each processor Pi that wants to access a key xi ∈ U tries to send requests to the modules containing copies of xi until it has accessed at least b of the a copies of xi, b < a. To resolve conflicts arising from colliding requests, the modules will work according to the c-collision rule or the arbitrary rule. This process is direct in a sense introduced by Goldberg et al. [GJL93]. A process for distributing the requests of the processors to the modules is called direct if it runs in rounds and in each round the only messages allowed are requests for an arbitrary number of copies of each key. If we choose, for example, b > 2 then the basic process yields a simulation of an n-processor EREW PRAM on an n-processor DMM using the trick introduced in [UW87], which we will call the majority trick: If each shared memory cell possesses a > 2 copies distributed among the memory
