Two techniques for managing memory on a parallel random access machine (PRAM) are presented. One is a scheme for an n/log n processor EREW PRAM that dynamically allocates and deallocates up to n records of the same size in O(log n) time. The other is a simulation of a PRAM with initialized memory by one with uninitialized memory. A CREW PRAM variant of the technique justifies the assumption that memory is appropriately initialized, with no asymptotic increase in time but a factor of n increase in space. An EREW PRAM solution incurs a factor of O(log n) increase in time but only a constant factor increase in space.
Introduction
Procedures for memory management are commonly assumed tools for algorithms that maintain dynamic data structures. Such tools have been thoroughly studied for sequential machines for decades. This paper presents two techniques for managing memory on a parallel random access machine (PRAM). One is a scheme for an n/log n processor Exclusive Read Exclusive Write PRAM that dynamically allocates and deallocates memory to any subset of the processors in O(log n) time. The other is a scheme for simulating PRAMs possessing initialized memory with corresponding PRAMs possessing uninitialized memory.
A PRAM is a common abstraction of a parallel machine that is useful for the design and analysis of parallel algorithms. It is a collection of synchronized independent sequential processors with unique identifiers and a shared global memory. In each time step, each processor can read a location in the global memory, perform a local computation, and then write to a location in the global memory. In an EREW (Exclusive Read Exclusive Write) PRAM no two processors may simultaneously access the same memory location for either reading or writing; in a CREW (Concurrent Read Exclusive Write) PRAM, simultaneous reading, but not writing, is permitted. (See Karp and Ramachandran [1] for an overview of PRAM models and results.)
PRAM algorithms that manipulate dynamic data structures need to be able to secure and release memory from the global store in parallel. For example, Paul, Vishkin and Wagener [2] present algorithms for maintaining 2-3-trees that depend upon dynamic allocation and deallocation of memory. An extension to general B-trees [3] similarly depends on parallel memory management. Section 2 describes a general scheme for an n/log n processor EREW PRAM that dynamically allocates and deallocates up to n memory segments of the same size in O(log n) time.
When designing PRAM algorithms it is sometimes convenient to assume that all memory is appropriately initialized. For example, Schenk [4] uses this assumption to detect whether a given memory cell has been written during the course of the computation. Since it is conceivable that this assumption may make a problem easier to solve than it would otherwise be, it is important to determine its cost in terms of time, processors and memory. A technique from the folklore of computer science [5] simulates initialized memory with uninitialized memory for a sequential random access machine with only a constant factor increase in time and space. In Section 3 we adapt this technique to the n processor CREW PRAM setting with no asymptotic increase in time but a factor of O(n) increase in memory size, and to an EREW PRAM setting with a factor of O(log n) increase in time and a constant factor increase in space.
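The sequential folklore technique can be illustrated with a minimal Python sketch. The class name `LazyArray`, its parameters, and the use of seeded random junk to stand in for uninitialized memory are assumptions made for illustration; indices are 0-based here, and Python itself still spends time building the lists, so only the logic of the trick is modeled.

```python
import random

class LazyArray:
    """Array that reads as `default` everywhere until written, with no initialization pass."""

    def __init__(self, size, default=0, seed=0):
        rng = random.Random(seed)
        # Random junk stands in for uninitialized memory (an assumption of this sketch).
        self.A = [rng.randrange(size) for _ in range(size)]  # stored values
        self.B = [rng.randrange(size) for _ in range(size)]  # B[l]: slot of C claimed by l
        self.C = [rng.randrange(size) for _ in range(size)]  # C[k]: location that owns slot k
        self.next_space = 0  # slots 0 .. next_space-1 of C are valid (0-based here)
        self.default = default

    def written(self, l):
        # l was written iff B[l] names a valid slot of C that points back to l.
        k = self.B[l]
        return 0 <= k < self.next_space and self.C[k] == l

    def read(self, l):
        return self.A[l] if self.written(l) else self.default

    def write(self, l, value):
        if not self.written(l):
            # First write to l: claim the next free slot of C.
            self.B[l] = self.next_space
            self.C[self.next_space] = l
            self.next_space += 1
        self.A[l] = value
```

Each read or write touches a constant number of cells, and the three arrays account for the constant factor increase in space.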
Memory Management
This section describes algorithms and data structures that maintain PRAM memory in such a way that deallocated memory is captured and reused rather than new memory being consumed. We present algorithms for dynamically allocating and deallocating sets of n or fewer records of a fixed size on an n/log n processor EREW PRAM in O(log n) time.
Data Structures
The algorithms maintain all available memory in two data structures, using an additional array $w_1, \ldots, w_{3n}$ for working space. A contiguous array, $\mathit{mem}_{\mathit{start}}, \mathit{mem}_{\mathit{start}+1}, \ldots$, contains the memory that has never been allocated. A linked list of balanced binary trees called the free-list contains deallocated records. The first tree in the free-list contains at most $2n - 1$ records, and if there are two or more trees, then the first tree has at least $n - 1$ records. All other trees in the list contain exactly $n - 1$ records. Initially all available memory is in the array; that is, start is set to one and the free-list is empty.
We assume that the size of the record, say record-size, is large enough to contain two pointers, left and right, which are used to maintain the free-list structure. For each tree in the free-list, one additional record is used to maintain the linked list of trees by having left point to the tree stored in the linked list, and right point to the next record in the linked list. Records that form a tree use left and right as usual to point to the roots of left and right subtrees. The variable head points to the head of the free-list, and size counts the number of records in the tree at the head of the list (including the extra record that forms the linked list).
Subroutines
Two tree manipulation routines are required by the allocation algorithm: one constructs a balanced binary tree out of an array, the other maps a tree into an array.
A procedure, called construct-tree($w_1, \ldots, w_m$), constructs a balanced binary tree from the records pointed to by $w_1, \ldots, w_m$. This is done by mapping element 1 onto the root, and elements $2i$ and $2i + 1$, if they exist, onto the children of $i$. If $m$ is known in advance, this can be performed optimally in O(m/p) time on a $p$ processor EREW PRAM, by assigning at most $\lceil m/p \rceil$ elements of the array to each processor. If $m$ is not known, an extra $\log p$ term is required to broadcast the value of $m$ to all processors. Thus, procedure construct-tree takes O(m/p + log p) time.
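The index arithmetic of construct-tree can be sketched sequentially in Python. The representation of a record as a dict with `left` and `right` fields is an assumption for illustration; on the PRAM each processor would handle about m/p of the loop iterations, which are independent of one another.

```python
def construct_tree(records):
    """Link records (dicts with 'left'/'right' fields) into a balanced binary tree.

    Position 1 becomes the root; positions 2i and 2i+1, if present,
    become the children of position i. Returns the root, or None if empty.
    """
    m = len(records)
    for i in range(1, m + 1):  # 1-based positions, as in the text
        node = records[i - 1]
        node["left"] = records[2 * i - 1] if 2 * i <= m else None
        node["right"] = records[2 * i] if 2 * i + 1 <= m else None
    return records[0] if m else None
```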
A procedure, called map-tree($T, w_1, \ldots, w_m$), maps the nodes of the balanced binary tree $T$ of size $m$ onto an array $w_1, \ldots, w_m$ in O(m/p + log p) time on a $p$ processor EREW PRAM as follows. First, the top $\lfloor \log p \rfloor$ levels of the binary tree are mapped onto the array using $\lfloor \log p \rfloor$ iterations of a parallel loop. The $i$th pass through the loop maps, in parallel, all nodes at level $i$ onto the next available positions in the array. Then each node at level $\lfloor \log p \rfloor$ is assigned to a processor which recursively (and independently) maps the subtree found at its assigned node onto the array.
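The two phases of map-tree can be sketched sequentially in Python. The dict-based node representation and the function name are assumptions for illustration. The sketch simply appends to one output list, whereas a PRAM implementation would have each processor write into its own disjoint range of the array.

```python
def map_tree(root, p):
    """Return the nodes of the tree rooted at `root` as a list.

    Phase 1 copies the top floor(log2 p) levels, one level per pass.
    Phase 2 walks each remaining subtree independently, modeling the
    one-processor-per-subtree step of the text.
    """
    out = []
    frontier = [root] if root is not None else []
    top_levels = p.bit_length() - 1  # floor(log2 p) for p >= 1
    for _ in range(top_levels):
        out.extend(frontier)
        frontier = [c for node in frontier
                    for c in (node["left"], node["right"]) if c is not None]
    for subtree_root in frontier:  # at most p subtrees remain
        stack = [subtree_root]
        while stack:
            cur = stack.pop()
            out.append(cur)
            for child in (cur["right"], cur["left"]):
                if child is not None:
                    stack.append(child)
    return out
```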
Allocation and Deallocation
Allocation of $k \le n$ records, if $k > \mathit{size}$, is from the front of the array mem. Otherwise the $k$ required records are allocated from the tree at the head of the free-list. If this tree contains fewer than $n$ records after $k$ have been removed, then it is merged with the next tree (if any) in the list. Using the procedures described in subsection 2.2, this requires O(log n) time on an n/log n processor EREW PRAM. Deallocation of $k \le n$ records is accomplished by adding them to the tree at the head of the free-list. If the resulting tree has more than $2n$ records, $n$ records are formed into a separate tree and inserted just after the head of the free-list. Using the procedures given in subsection 2.2, this requires O(log n) time on an n/log n processor EREW PRAM. Taking these results together, we have the following theorem.
Theorem 1 For $k \le n$, $k$ records of the same size, say $s$, can be allocated or deallocated in O(log n) time on an n/log n processor EREW PRAM. Furthermore, for all times $t$, the total space used is $(A_t - D_t)s + O(n)$, where $A_t$ (respectively $D_t$) is the total number of records allocated (respectively deallocated) at or before time $t$.
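The allocation and deallocation policy can be modeled with a simplified sequential Python sketch. It is not the parallel algorithm: plain lists stand in for the balanced trees, the extra link record per tree is ignored (so the thresholds are n and 2n rather than n - 1 and 2n - 1), and the class and method names are hypothetical.

```python
class RecordPool:
    """Sequential model of the free-list policy for batches of at most n records."""

    def __init__(self, n):
        self.n = n
        self.start = 0   # index of the first never-allocated record
        self.free = []   # free[0] is the head chunk; later chunks hold exactly n ids

    def allocate(self, k):
        assert 0 <= k <= self.n
        head = self.free[0] if self.free else []
        if k > len(head):
            # Head chunk too small: take fresh records from the front of mem.
            ids = list(range(self.start, self.start + k))
            self.start += k
            return ids
        ids = [head.pop() for _ in range(k)]
        if len(head) < self.n and len(self.free) > 1:
            head.extend(self.free.pop(1))  # merge with the next chunk
        return ids

    def deallocate(self, ids):
        assert len(ids) <= self.n
        if not self.free:
            self.free.append([])
        head = self.free[0]
        head.extend(ids)
        if len(head) > 2 * self.n:
            # Split off n records as a separate chunk placed just after the head.
            chunk = [head.pop() for _ in range(self.n)]
            self.free.insert(1, chunk)
```

In this model fresh memory is consumed only when the head chunk is the sole chunk and holds fewer than k records, so at most n freed records sit unused at that moment, which mirrors the additive O(n) term of the theorem.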
To manage different record sizes, separate free-lists can be maintained for each size of record. It is not necessary to maintain more than one array mem. Allocations of each record size still require the cooperation of all n/log n processors.
By replacing the linked list of trees with a linked list of arrays, our algorithms can be transformed into constant time n processor CREW PRAM algorithms for allocating and deallocating up to n records.
Initialization
This section describes two simulations of an n processor PRAM with initialized memory by an n processor PRAM with uninitialized memory. Let $\alpha$ be an algorithm for an n processor EREW PRAM with initialized memory and let $t_\alpha(n)$ be the time complexity and $s_\alpha(n)$ be the space complexity of $\alpha$ on inputs of size $n$. The first simulation, $S(\alpha)$, simulates $\alpha$ on an n processor EREW PRAM with uninitialized memory in $O(t_\alpha(n) \log n)$ time and $3s_\alpha(n) + O(n)$ space. Let $\beta$ be an algorithm for an n processor CREW PRAM with initialized memory and let $t_\beta(n)$ be the time complexity and $s_\beta(n)$ be the space complexity of $\beta$ on inputs of size $n$. The second simulation, $R(\beta)$, simulates $\beta$ on an n processor CREW PRAM with uninitialized memory in $O(t_\beta(n))$ time and $(n + 3)s_\beta(n) + O(n)$ space. Both simulations proceed in a step by step manner and maintain a data structure in such a way that it is easy to distinguish between a value that has been written and uninitialized garbage.
EREW PRAM simulation
Denote the memory array of the n processor EREW PRAM with initialized memory by $M$. On the n processor EREW PRAM with uninitialized memory, create one extra integer variable, next-space, and O(n) cells of working space, and partition the remaining memory into three arrays, $A$, $B$ and $C$, by interleaving. Let $X[l]$ denote location $l$ of array $X$. Simulation $S$ will use array $A$ to simulate memory $M$ by ensuring at every step $t$ of algorithm $\alpha$:
1: if $M[l]$ has been written by $\alpha$ at or prior to step $t$, then, after $S(\alpha)$ simulates step $t$, $A[l]$ contains the last value written to $M[l]$ by $\alpha$.
Simulation S will use arrays B and C and the variable next-space to keep track of those positions of M that have been written by ensuring at every step t of algorithm α:
2: $M[l]$ has been written by $\alpha$ at or prior to step $t$ if and only if, after $S(\alpha)$ simulates step $t$, $B[l] \in \{1, \ldots, \text{next-space} - 1\}$ and $C[B[l]] = l$.
Property 2 is the key to determining whether a memory location has been previously written; property 1 ensures that the correct value is available for any previously written location. We show inductively how parallel reads and writes are executed while maintaining properties 1 and 2. Property 1 holds initially; setting next-space to one ensures that 2 holds initially. Suppose, at step $t + 1$ of algorithm $\alpha$, processors $1, \ldots, n$ access the unique memory locations $M[l_1], \ldots, M[l_n]$ respectively, and suppose both 1 and 2 hold at time $t$. Assume, for the moment, that we have an EREW PRAM procedure write-check($l_1, \ldots, l_n, x_1, \ldots, x_n, y_1, \ldots, y_n$) that, for $j \in \{1, \ldots, n\}$, does the following: sets $x_j$ to true if $B[l_j] \in \{1, \ldots, \text{next-space} - 1\} \wedge C[B[l_j]] = l_j$ and to false otherwise, and then sets $y_j$ to the size of the set $\{i \le j : \neg x_i\}$. (Procedure write-check will be described shortly.) Given write-check, reads and writes are easily simulated. First, each processor $j$ determines whether location $M[l_j]$ has already been written by executing write-check and checking the returned value of $x_j$. According to 2, $M[l_j]$ has been written if and only if $x_j$ is true.
If the memory access instruction is a read, then for each processor $j$, if $x_j$ is true then $j$ reads $A[l_j]$; otherwise $M[l_j]$ has not been written so $j$ assumes the initial value for the read. According to 1, if $M[l_j]$ has been written then $A[l_j] = M[l_j]$. Since no contents of memory are changed by a read, properties 1 and 2 remain true after the read step.
If the memory access instruction is a write, each processor $j$ writes to $A[l_j]$. Then each processor $j$ that wrote to a memory location $l_j$ that had not been previously written updates $B[l_j]$ and a location of $C$ to ensure that invariant 2 is maintained. This is done with the help of $y_j$, which is also returned by write-check. Since there are $y_j - 1$ processors with identifiers less than $j$ that also wrote to a new location, the number $n_j = \text{next-space} + y_j - 1$ is unique for processor $j$. Processor $j$ sets $B[l_j]$ to $n_j$ and $C[n_j]$ to $l_j$. Finally, processor 1 updates the value of next-space to $\text{next-space} + y_n$. Thus properties 1 and 2 hold after the write step provided they held before the write.
In summary, given write-check, the following procedures perform, respectively, one parallel read or write step of the simulation $S$. Procedure read sets $v_j$ to the value in $M[l_j]$. Procedure write sets $A[l_j]$, which simulates $M[l_j]$, to the value $v_j$ and then updates arrays $B$ and $C$ to maintain the invariants 1 and 2.
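procedure read($(l_1, v_1), \ldots, (l_n, v_n)$)
1. write-check($l_1, \ldots, l_n, x_1, \ldots, x_n, y_1, \ldots, y_n$)
2. for $j \in \{1, \ldots, n\}$ pardo
     if $x_j$ then $v_j \leftarrow A[l_j]$ else $v_j \leftarrow$ the initial value of $M[l_j]$
procedure write($(l_1, v_1), \ldots, (l_n, v_n)$)
1. write-check($l_1, \ldots, l_n, x_1, \ldots, x_n, y_1, \ldots, y_n$)
2. for $j \in \{1, \ldots, n\}$ pardo
     $A[l_j] \leftarrow v_j$
     if not $x_j$ then $n_j \leftarrow \text{next-space} + y_j - 1$; $B[l_j] \leftarrow n_j$; $C[n_j] \leftarrow l_j$
3. $\text{next-space} \leftarrow \text{next-space} + y_n$.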
Clearly step two of read and steps two and three of write complete in constant time on an n processor EREW PRAM. It remains to provide the details for write-check.
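The read and write steps of simulation S can be modeled sequentially in Python. This is a sketch of the bookkeeping only: the class name is hypothetical, locations are 0-based while slots of C are 1-based as in the text (so C needs one more cell than A), garbage cells are assumed to hold integers, and `write_check` here reads next-space and C directly rather than observing the exclusive-read restriction.

```python
from itertools import accumulate

class UninitMemorySim:
    """Sequential model of one-step-at-a-time simulation over garbage-filled arrays."""

    def __init__(self, A, B, C, initial=0):
        self.A, self.B, self.C = A, B, C  # may contain arbitrary integers
        self.next_space = 1               # slots 1 .. next_space-1 of C are valid
        self.initial = initial

    def write_check(self, locs):
        # x[j]: location locs[j] has been written (property 2).
        x = [1 <= self.B[l] < self.next_space and self.C[self.B[l]] == l
             for l in locs]
        # y[j]: number of i <= j with x[i] false (a prefix sum).
        y = list(accumulate(0 if xi else 1 for xi in x))
        return x, y

    def read(self, locs):
        x, _ = self.write_check(locs)
        return [self.A[l] if xj else self.initial for l, xj in zip(locs, x)]

    def write(self, locs, vals):
        x, y = self.write_check(locs)
        for j, (l, v) in enumerate(zip(locs, vals)):
            self.A[l] = v
            if not x[j]:
                slot = self.next_space + y[j] - 1  # unique per new location
                self.B[l] = slot
                self.C[slot] = l
        if locs:
            self.next_space += y[-1]
```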
Since the EREW PRAM algorithm being simulated satisfies exclusive memory access, all $l_j$'s are distinct. However, some care is still needed to ensure exclusive access in the calculation of write-check. According to property 2, to compute $x_1, \ldots, x_n$, each processor $j$ must (a) determine whether $B[l_j] \in \{1, \ldots, \text{next-space} - 1\}$, and (b) if so, determine whether $C[B[l_j]] = l_j$. Step (a) is achieved by broadcast, which distributes the value of next-space to each processor. Broadcast is a basic technique [6] that uses O(log n) time on an n processor EREW PRAM, and does not require any initialized memory.
Step (b) is more involved because processors $j$ and $k$ reading uninitialized locations may happen to have $B[l_j] = B[l_k] \in \{1, \ldots, \text{next-space} - 1\}$. As a consequence, both $j$ and $k$ have to access the same location in $C$. We deal with this problem as follows. To determine (b), first all processors cooperate to lexicographically sort the set of pairs $\{(B[l_j], j) : j \text{ is a processor number}\}$. Denote the result by $(B[l_{\sigma(1)}], \sigma(1)), \ldots, (B[l_{\sigma(n)}], \sigma(n))$. For each maximal run of positions $i, \ldots, k$ whose pairs share the same first component, processor $i$ alone reads the corresponding location $C[B[l_{\sigma(i)}]]$. Processor $i$ then initiates a broadcast of this value to each processor $\sigma(p)$ such that $i \le p \le k$, by reporting it to processor $\sigma(i)$. Cole's parallel merge sort [7] can be used to sort in O(log n) time on an n processor EREW PRAM. As before, the broadcast takes O(log n) time. The values $y_1$ through $y_n$ are computed from $x_1$ through $x_n$ in O(log n) time on an EREW PRAM using Prefix Sums [1]. All other operations of write-check take just constant time. Finally notice that all subroutines used here (broadcast, prefix sums, and Cole's sort) are unaffected by uninitialized memory.
Theorem 2 Let $\alpha$ be an algorithm for an n processor EREW PRAM with initialized memory taking $t_\alpha(n)$ time and using $s_\alpha(n)$ space. Then $S(\alpha)$ simulates $\alpha$ on an n processor EREW PRAM with uninitialized memory in time $O(t_\alpha(n) \log n)$ and space $3s_\alpha(n) + O(n)$.
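The way sorting removes read collisions on C inside write-check can be sketched sequentially in Python. The sketch assumes that one representative per group of equal B values performs the single read of C and that the value is then passed along the group; the function name, its arguments, and the `reads` counter (included only to make exclusivity visible) are hypothetical.

```python
def check_slots_exclusively(B_vals, locs, C, next_space):
    """Compute x[j] = (1 <= B_vals[j] < next_space and C[B_vals[j]] == locs[j]).

    Each cell of C is read at most once: the pairs (B_vals[j], j) are sorted,
    and only the first member of each run of equal B values reads C.
    Returns (x, reads), where reads[b] counts the reads of C[b].
    """
    n = len(locs)
    order = sorted(range(n), key=lambda j: (B_vals[j], j))  # lexicographic sort
    x = [False] * n
    reads = {}
    i = 0
    while i < n:
        b = B_vals[order[i]]
        k = i
        while k + 1 < n and B_vals[order[k + 1]] == b:
            k += 1                      # positions i..k form one run
        if 1 <= b < next_space:
            reads[b] = reads.get(b, 0) + 1
            owner = C[b]                # single read by the run's first member
            for pos in range(i, k + 1):  # models the broadcast along the run
                j = order[pos]
                x[j] = (owner == locs[j])
        i = k + 1
    return x, reads
```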
CREW PRAM simulation
In the EREW PRAM simulation, the function write-check uses O(log n) time in order to avoid read collisions of the variable next-space and (possibly) of locations in array $C$. If concurrent reads are permitted then determining $x_1$ through $x_n$ can be achieved in constant time with no change in data structures. Determining $y_1$ through $y_n$, however, requires more than constant time even with concurrent reads. To circumvent this problem, and exploit concurrent reads to achieve constant time for each read and write, we replace array $C$ and variable next-space with a separate array $C_j$ and variable $\text{next-space}_j$ for each $j \in \{1, \ldots, n\}$.
An abbreviated description of the CREW PRAM simulation follows.
On the n processor CREW PRAM with uninitialized memory, create $n$ extra integer variables, $\text{next-space}_j$ for $j \in \{1, \ldots, n\}$, and O(n) cells of working space, and partition the remaining memory into $n + 2$ arrays, $A$, $B$ and $C_j$ for $j \in \{1, \ldots, n\}$, by interleaving. Array $A$ is used to simulate the memory, say $M$, of the n processor CREW PRAM with initialized memory. Entries in $B$ are now a pair $(p, i)$ where $p$ is an integer between 1 and $n$ and $i$ is an index into array $C_p$. Simulation $R$ maintains the following two part invariant for every location $l$ in $M$:
1: if $M[l]$ has been written, then $A[l]$ contains the last value written to $M[l]$;
2: $M[l]$ has been written if and only if $B[l] = (p, i)$ with $p \in \{1, \ldots, n\}$, $i \in \{1, \ldots, \text{next-space}_p - 1\}$ and $C_p[i] = l$.
To initialize the structure, each processor $j$ sets $\text{next-space}_j$ to 1, thus ensuring that the invariant holds initially. Suppose, as previously, at some time step of algorithm $\beta$, processors $1, \ldots, n$ access memory locations $M[l_1], \ldots, M[l_n]$ respectively, and suppose the invariant holds before simulating this step. Let write-check$'$($l_1, \ldots, l_n, x_1, \ldots, x_n$) be a procedure that sets $x_j$ to the value of the predicate $P_j$, which holds when $B[l_j] = (p, i)$ with $p \in \{1, \ldots, n\}$, $i \in \{1, \ldots, \text{next-space}_p - 1\}$ and $C_p[i] = l_j$, for $j \in \{1, \ldots, n\}$. Notice that since the model permits concurrent reads, write-check$'$ takes only constant time by having each processor $j$ compute $P_j$ independently.
A CREW PRAM read procedure is obtained from the procedure read by replacing write-check with write-check$'$. One parallel write step of the simulation $R$ is as follows.
For each $j \in \{1, \ldots, n\}$, $v_j$ is the value to be written to $A[l_j]$, which simulates $M[l_j]$.
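procedure crew-write($(l_1, v_1), \ldots, (l_n, v_n)$)
1. write-check$'$($l_1, \ldots, l_n, x_1, \ldots, x_n$)
2. for $j \in \{1, \ldots, n\}$ pardo
     $A[l_j] \leftarrow v_j$
     if not $x_j$ then $B[l_j] \leftarrow (j, \text{next-space}_j)$; $C_j[\text{next-space}_j] \leftarrow l_j$; $\text{next-space}_j \leftarrow \text{next-space}_j + 1$
It is straightforward to check that the invariant holds after crew-write provided it held immediately before its execution. Since the model is exclusive write, all $l_j$ are distinct and thus step two of crew-write takes constant time on a CREW PRAM.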
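The per-processor bookkeeping of the CREW simulation can be modeled sequentially in Python. The sketch assumes that a new location written by processor j is recorded in that processor's own array and counter, so no prefix sums are needed; the class name is hypothetical, garbage entries of B are assumed to be pairs of integers, and processors and slots are 1-based while locations are 0-based.

```python
class CrewUninitMemorySim:
    """Sequential model of the CREW variant with one slot array per processor."""

    def __init__(self, n, A, B, C, initial=0):
        self.n = n
        self.A, self.B = A, B             # B[l] is a pair (p, i), possibly garbage
        self.C = C                        # C[p] is processor p's array; C[0] unused
        self.next_space = [1] * (n + 1)   # next_space[p] for p = 1..n
        self.initial = initial

    def _written(self, l):
        p, i = self.B[l]
        return (1 <= p <= self.n and 1 <= i < self.next_space[p]
                and self.C[p][i] == l)

    def read(self, locs):
        # Locations may repeat: concurrent reads are allowed.
        return [self.A[l] if self._written(l) else self.initial for l in locs]

    def write(self, locs, vals):
        # locs[j-1] is the (distinct) target of processor j.
        x = [self._written(l) for l in locs]
        for j, (l, v) in enumerate(zip(locs, vals), start=1):
            self.A[l] = v
            if not x[j - 1]:
                i = self.next_space[j]
                self.B[l] = (j, i)        # remember which processor claimed l
                self.C[j][i] = l
                self.next_space[j] = i + 1
```

Because each of the n arrays must be able to hold an entry for every location, this layout reflects the factor of n increase in space stated for the CREW simulation.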
Theorem 3
Let $\beta$ be an algorithm for an n processor CREW PRAM with initialized memory taking $t_\beta(n)$ time and using $s_\beta(n)$ space. Then $R(\beta)$ simulates $\beta$ on an n processor CREW PRAM with uninitialized memory in time $O(t_\beta(n))$ and space $(n + 3)s_\beta(n) + O(n)$.
Discussion and Open Problems
In our simulations, each instruction executed by the PRAM with initialized memory is emulated by a short sequence of instructions on the PRAM with uninitialized memory. Of course, for any memory cell not written during the run of the algorithm, the contents of the original and simulated memories may differ. To extend the simulations so that the final outputs are the same, we have to assume that the final output is actually written by the PRAM processors rather than just being declared as some arbitrary part of the contents of memory.
According to simulation R, it is justifiable to assume that memory is preinitialized when determining a problem's time complexity on a CREW PRAM. We do not know if such an assumption can be justified for an EREW PRAM. Furthermore, we do not know how to justify the assumption for a CREW PRAM without increasing the size of memory by a linear factor.
