The emergence of low-latency, high-throughput routers means that network locality issues no longer dominate the performance of parallel algorithms. One of the key performance issues is now the even distribution of work across the machine, as the problem size and number of processors increase. This paper describes the implementation of a highly scalable shared queue, supporting the concurrent insertion and deletion of elements. The main characteristics of the queue are that there is no fixed limit on the number of outstanding requests and the performance scales linearly with the number of processors (subject to increasing network latencies). The queue is implemented using a general purpose computational model, called the WPRAM. The model includes a shared address space which uses weak coherency semantics. The implementation makes extensive use of pairwise synchronisation and concurrent atomic operations to achieve scalable performance. The WPRAM is targeted at the class of distributed memory machines which use a scalable interconnection network.
Introduction
The emergence of low-latency, high-throughput routers reduces the dependence of parallel algorithms on network locality issues. The research issues are now concerned with the design of scalable systems and algorithms which address such issues as: the maximisation of data locality, efficient access of remote data, the removal of substantial sequential components, and the minimisation of synchronisation overheads. An important problem for highly parallel machines is how to distribute data across the processors in a scalable and efficient manner. Shared memory machines solve this problem by atomically accessing a shared data structure using a lock based mechanism. The problem with this method is that it does not scale with the number of processors, due to the fixed maximum rate at which a lock can be acquired or released (together with the period of queue access between). This leads to a fixed maximum performance for such a solution, which is independent of the number of processors in the machine. When used for supporting fine-grain problems, this type of solution can quickly become the bottleneck. At a low level, the scalable access of single shared words can be supported on a distributed memory machine through the use of concurrent atomic operations. The simplest operation is the concurrent reading of a word. Some machines directly support the combining of such operations within the network (Pfister, 1985; Gottlieb et al, 1985). In such machines, the network switches implement the atomic operations and are able to combine separate requests and distribute subsequent results. Machines which do not support such operations directly can use the processors themselves, together with software combining trees (Valiant, 1992) which make use of simple point-to-point message passing operations. Such trees have been used to implement a barrier operation across a set of processors (Mellor-Crummey and Scott, 1991), by setting the atomic operation to pairwise synchronisation.
Other concurrent operations include integer addition and the exchange of the contents of a word with a new value. An attempt at a more general purpose concurrent data access method is the concurrent shared queue proposed by Gottlieb et al (1983). This builds on the presence of a concurrent addition operation to gain scalable access to the elements of a static shared array. Each array element can be used to insert a single queue item until it is subsequently deleted. However, the queue is of a fixed size, defined by the number of elements in the static array. A concurrent stack data type, proposed by Wilson (1992), again uses concurrent addition, but this time to distribute the stack operations across all of the processors directly (rather than using a shared memory abstraction). This paper presents an implementation of a shared concurrent queue for highly parallel multiprocessors, which has no fixed limit on the number of outstanding queue operations. The queue is also implemented within a shared address space, with only the processors involved in queue operations maintaining its state. The implementation and performance of the queue are explained using the WPRAM computational model. The WPRAM (Nash et al, 1995; Nash and Dew, 1994) is a general model of parallel computation, targeted at the class of scalable distributed memory machines. The term scalable is used here to characterise networks which can accept packets at a constant rate from each processor (support a constant granularity), independent of the number of processors in the machine. Figure 1 demonstrates the approach. Different parallel platforms are characterised by an abstraction of a distributed memory machine called the "global router", consisting of a set of homogeneous processing elements (or nodes) using a flat communications structure. Examples are the Cray T3D, IBM SP2 and Intel Paragon, which scale in performance up to hundreds or thousands of nodes.
The results presented here use an IMS- (May et al, 1993). The WPRAM model is supported above the global router, with each operation having a well defined asymptotic performance. The model provides a shared address space with weak coherency semantics (Gharachorloo et al, 1992), and a set of concurrent atomic operations. In a weakly coherent shared memory, data is only guaranteed to be in a coherent state when the associated processes accessing that data take part in some explicit form of synchronisation. Example consistency models range from sequential consistency (Lamport, 1979) through to release and entry consistency (Gharachorloo et al, 1992). Weaker data models offer the potential for improved data access performance through the pipelining of requests, and through delaying requests to take advantage of coarser grain access methods (Gharachorloo et al 1991; Keleher et al 1992; Zucker et al 1992; Bershad et al, 1991). A set of synchronisation primitives in the WPRAM coordinate processing and maintain shared data consistency. The next section presents an overview of the WPRAM and global router models, and of the general methods used in the scalable support of the WPRAM above the global router. The following section describes the main concurrent access operations used in the WPRAM to support the queue. The semantics of the queue are then given, together with its implementation and performance on the WPRAM, using complexity analysis and simulation results. A simulation of the queue using T9000 performance figures, and C104 simulation results from SGS-Thomson (May et al, 1993), is provided.
Modelling Scalable Message Passing Machines
This section describes the WPRAM model on which the queue is implemented. An overview of the support of the WPRAM on a scalable distributed memory machine, characterised by the global router abstraction, is also provided.
The WPRAM Model
The structure of the WPRAM memory model is given in figure 2. The WPRAM supports a shared address space in which data may take one of two forms:
global: This is uniformly accessible by all processors, and used when data locality is not an important issue.
local: Also accessible across all processors, but can be located on a specific processor, resulting in improved performance.
The shared address space in the WPRAM is weakly coherent (Gharachorloo et al, 1992). In other words, newly written shared data is only guaranteed to be visible to other processes when some form of explicit synchronisation takes place between them. The following forms of synchronisation are available:
process creation : A process group, consisting of identical processes, can be created either on the local processor, or across the machine.
barrier synchronisation : A barrier guarantees that all processes within a group have reached the same stage in execution.
tag synchronisation : A tag provides an event-based mechanism between an arbitrary pair of processes. One process will set the tag, and continue execution immediately. The other process will wait on the tag until it has been set (if it is already set then the process will immediately unblock).
In addition, the WPRAM includes the operation switch(), which allows a process to suspend until all of its outstanding remote accesses have completed. This supports the generation of shared data dependencies within a process without the need to synchronise with other processes in order to guarantee the completion of the accesses. The model supports a small set of concurrent atomic operations (Pfister, 1985; Gottlieb et al, 1985) which allow the scalable access of shared variables. These operations take the general form read_and_op(I, c, &i), in which c is some constant value used to update a shared variable I, using the operation op, and i is a variable which will hold the value of I before the operation takes effect. Example operations are given by op = add (addition) and op = swap_int (integer exchange). Further details of these operations are presented in section 3.
Operation                  Time Complexity
Process Management
Creation of N processes    $\log P \cdot \log(\min(N, P')) + N/P'$

In these costs, P' is the number of processors actually taking part in the operation, N is a measure of the amount of work involved (for example, the number of words accessed or processes created) and k is a small integer constant. A more detailed description can be found in (Nash et al, 1995; Nash, 1993).
The Global Router Abstraction
The global router (figure 3) is used to characterise the class of distributed memory machines with scalable interconnection networks. The model consists of a set of P processor/memory pairs, or nodes, which are able to communicate through the use of a global router. The router provides a point-to-point message passing service between nodes and is modelled by two parameters:
Latency (D): The latency between a packet entering the network and arriving at its destination.
Granularity (g): The minimum period between successive packets being injected into the network by each node.
This represents a similar approach to that taken by the BSP computational model (Valiant, 1990). A packet is a fixed number of bytes into which a message is broken up before being transmitted. The class of scalable interconnection networks is then defined as a global router with the parameters D = O(log P) and g = O(1). The requirement that g is independent of the number of nodes means that the ratio of communication to computation remains the same, and so the granularity of parallelism exploited by an algorithm is not affected. For example, if a binary hypercube interconnection network accepts packets from all nodes at a constant rate then there will be O(P log P) packets in transit at any one time, since the latency of the hypercube is O(log P). This packet requirement matches the available network bandwidth of O(P log P). In contrast, a two dimensional grid would have $O(P\sqrt{P})$ packets in transit, which surpasses the available O(P) bandwidth. The value of g in this case must therefore be raised to $O(\sqrt{P})$ to restore balance (effectively giving a reduction in the useful amount of parallelism exploitable). Underpinning these methods is the requirement that the network have a predictable latency under general traffic patterns (both randomised and systematic). This can be achieved through the use of randomised routing (Valiant, 1983), which essentially decouples the data access patterns from the resulting network traffic. Adaptive routing (Klein, 1991) has been proposed as a more practical alternative since it uses free network links more efficiently. Results for both of these methods (Klein, 1991) have shown that predictable network latencies can indeed be obtained. Software combining techniques (Jones, 1993; Mellor-Crummey and Scott, 1991; Valiant, 1992; Yew et al, 1987) can be supported above the global router to provide scalable concurrent data access.
Increased performance may be gained by using routing technology which supports combining directly in hardware (Gottlieb et al, 1985; Pfister, 1985). It is assumed here that concurrent access has the same cost as exclusive access. Finally, the global router supports a set of local operations, each with a constant cost. These include basic arithmetic and data transfer operations, process management and message handling (Nash, 1993). Particular costs for these operations can be fitted to a given processor. Table 2 provides the performance of each class of global router operation. These basic costs are assumed for the costing of the WPRAM model. The communication of a message of X words between two nodes is costed by the time required to inject the packets into the network and the latency incurred by the final packet. The same is true of the access of X independent words in which there are no data dependencies.
Supporting a Shared Address Space
The global router can be used to provide the abstraction of a shared address space across the distributed memories of the processing nodes, as required by the WPRAM. Read and write requests to a shared variable V can be converted into message passing operations directed at the node holding V in its local memory. The two requirements of such an abstraction are that:
A node requiring access to V must be able to determine its location.
Predictable data access times must be preserved.
The first can be solved, for example, by the use of a linear hash function (Mehlhorn et al, 1984; Jones, 1993). This will convert a shared address to a memory module on some global router node. The address of V, say $V_a$, is passed through the function $\{(x V_a + y) \bmod N\} \bmod P$, for $x, y \in 0..N-1$ and $x, N$ co-prime.
Predictable performance then comes from the even distribution of shared address accesses across the memory modules. Theoretical analysis has shown the class of universal hash functions (Mehlhorn et al, 1984) to provide this even distribution in an amortized sense. However, performance studies (Ranade et al, 1988) have shown that linear hashing works well in practice. In terms of the WPRAM requirements, the selective choice of a hash function on a page basis (Jones, 1993) can be used to support global and local data in the same shared address space.
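The linear hash described above can be sketched in C as follows. This is a minimal illustration only: the names home_node and addresses_on_node are hypothetical, the multiplier and offset are left to the caller, and the sketch assumes N is at most 2^32 so that the product cannot overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a shared address va to one of p nodes using
 * ((x * va + y) mod n) mod p. The caller is assumed to choose x co-prime
 * with n and x, y in 0..n-1. Assumes n <= 2^32 so x * (va mod n) fits. */
unsigned home_node(uint64_t va, uint64_t x, uint64_t y, uint64_t n, unsigned p)
{
    assert(n > 0 && n <= (UINT64_C(1) << 32));
    assert(p > 0 && x < n && y < n);
    uint64_t h = (x * (va % n) + y) % n;   /* linear hash over 0..n-1 */
    return (unsigned)(h % p);              /* fold onto the p nodes */
}

/* Count how many of the addresses 0..n-1 land on the given node. */
uint64_t addresses_on_node(unsigned node, uint64_t x, uint64_t y,
                           uint64_t n, unsigned p)
{
    uint64_t count = 0;
    for (uint64_t va = 0; va < n; va++)
        if (home_node(va, x, y, n, p) == node)
            count++;
    return count;
}
```

Because x and N are co-prime, the inner step is a permutation of 0..N-1, so when P divides N each node receives the same number of addresses from any full block of N consecutive addresses.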
Supporting Pairwise Synchronisation
Simple point-to-point synchronisation between processes can be supported by the use of tag variables, as included in the WPRAM model. A tag can be in a set or unset (default) state. A process waiting on an unset tag will block until another process changes the state to being set. The code assumes that a process identifier is an integer value which is uniquely defined across the machine. Each process has access to its identifier through the variable my_pid. The value no_pid is used to represent an undefined process identifier. The concurrent exchange operation is used to atomically update the pid field of the tag variable with the identifier of the process making the access. The global router operation deschedule() will suspend the calling process until a corresponding schedule() is received (or one has already arrived). The switch() operation will suspend a process until all outstanding data accesses by that process have completed. It is assumed that the t_unset() operation will not be executed while other operations on the same tag variable might be in progress (avoiding a possible race condition).
Supporting Barrier Synchronisation
The creation of a process group across the machine requires a distributed solution. This is due to the potential interaction of large numbers of nodes. Given a root node i, at level 0 of a binary spanning tree, a node j in the tree at level l supports the subtrees rooted at nodes j and $(j + P/2^{l}) \bmod P$, for P processors.
In the case of P = 8 and i = 3, this would produce the sequence {3}, {3,7}, {3,5,7,1}, {3,4,5,6,7,0,1,2}.
Given that i is the node of the parent of a process group, the initiation of the group then leads to a sequence of messages travelling down towards the leaves of the tree, with each message including the current level l of that node. The factor $\log \min(N, P')$ in the associated WPRAM operation cost represents the depth of the tree, with the $\log P$ multiplying factor giving the latency between tree nodes. The factor of $N/P'$ represents the sequentialisation of process creation within each node when $N > P'$. The barrier synchronisation of a process group can make use of the spanning tree information constructed when the group was created. A barrier operation would consist of the synchronisation within each node of any processes in the same group, followed by synchronisation first up and then down the spanning tree. As messages travel down the tree, the processes within each node can be woken.
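The growth of the spanning tree can be sketched in C as below. The function name tree_level is hypothetical, P is assumed to be a power of two, and the stride used when creating level l is taken to be P/2^l, which is the reading that reproduces the P = 8, i = 3 sequence given earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: list the nodes that hold a tree vertex at the given level of a
 * binary spanning tree rooted at `root` over p nodes (p a power of two).
 * out[] must have room for 2^level entries. Returns the number of nodes. */
size_t tree_level(unsigned root, unsigned p, unsigned level, unsigned *out)
{
    assert(p > 0 && (p & (p - 1)) == 0);   /* power of two (assumption) */
    assert((1u << level) <= p);            /* level <= log2(p) */
    size_t count = 1;
    out[0] = root % p;
    for (unsigned l = 1; l <= level; l++) {
        unsigned stride = p >> l;          /* P / 2^l */
        /* Expand in place from the back: each parent j is replaced by
         * the pair j, (j + stride) mod p. */
        for (size_t k = count; k-- > 0; ) {
            unsigned j = out[k];
            out[2 * k] = j;
            out[2 * k + 1] = (j + stride) % p;
        }
        count *= 2;
    }
    return count;
}
```

Each level doubles the number of participating nodes, so the tree reaches all P nodes after log P levels, which matches the depth factor in the process creation cost.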
Concurrent Atomic Access
An example of a concurrent operation supported by the WPRAM shared address space is the addition of an integer value c to a variable I. A process executing read_and_add(I, c, &i), where i is used to hold a return value, leads to the following atomic sequence being executed:

    i = I
    I = I + c

This is illustrated in figure 4(a), where each of P processors gains access to a unique array index. Each processor atomically increments an integer variable, denoted by Array_Index. The result of this operation is a unique integer which can be used as an array index to reference an element (when taken modulo the number of array elements). Under the concurrent execution of this operation, results are returned according to some arbitrary ordering. In the WPRAM pseudocode above, each process concurrently updates I, placing the initial value in its variable i. If this result is below a constant WORK_PACKETS (representing the number of computational tasks remaining) then the corresponding element of A can be initialised. This method is useful, for example, when the initialisation of elements of the array A can take varying amounts of time, and so cannot be statically allocated to processes.

Another example is read_and_swap_addr(I, c, &i), which exchanges a shared address value of a word with a new address. This will atomically execute:

    i = I
    I = c

Figure 4(b) shows the use of this operation. Each processor can append a new item onto the end of a linked list by swapping the address of its new item with the current value of the tail pointer. Under concurrent access, the tail pointer is updated with one of the new addresses and the other addresses are returned to the processors as the current value of the tail pointer. The code segment below demonstrates this method.
If the address returned from I is non-null then the list must contain at least one item, and so the pointer V at the current tail of the list must be updated with the address of the new item it (allocated in global memory using the function new). The T field within each list item can be used to block the process removing items from the list until the next item is available. This can be supported by the use of a tag variable.
    global struct item *I = null;          /* linked list tail pointer */
    for j = 1 to P in parallel Process_j;  /* create P processes */

    Process_j :
    struct item *i;                        /* pointer to current end of list */
    struct item *it;                       /* new item to insert */
    it = (struct item*) new (sizeof(struct item));

The operation read_and_swap_int(I, c, &i) is also available for the exchange of integer values, rather than a shared address. It should be noted that both synchronous and asynchronous combining are possible. Under synchronous combining, all similar atomic operations are guaranteed to be combined, and some assumption on the ordering of the results can potentially be made. This is only possible if the processors themselves synchronise when performing the combining operations. Under asynchronous combining, the processors execute independently and no ordering of the results is assumed. It is also likely that not all requests will be combined, producing higher contention at the destination node than with synchronous combining. Since the WPRAM treats data access and synchronisation orthogonally, the concurrent access operations operate asynchronously, in the same manner as normal reads and writes.
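The two atomic operations of this section, and the tail-swap list append built on them, can be emulated on a conventional shared memory machine with C11 atomics. The sketch below is an analogy rather than the WPRAM implementation: the operations complete synchronously (so no switch() is needed), there is no network combining, and the struct layout and the append function are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Emulates read_and_add(I, c, &i): i = I; I = I + c, as one atomic step. */
void read_and_add(atomic_int *I, int c, int *i)
{
    *i = atomic_fetch_add(I, c);
}

struct item {
    _Atomic(struct item *) v;   /* address of the next item in the list */
    int payload;                /* hypothetical data field */
};

/* Emulates read_and_swap_addr(I, c, &i): i = I; I = c, as one atomic step. */
void read_and_swap_addr(_Atomic(struct item *) *I, struct item *c,
                        struct item **i)
{
    *i = atomic_exchange(I, c);
}

/* Append `it` to the list whose tail pointer is *tail. Returns the previous
 * tail, or NULL if the list was empty. */
struct item *append(_Atomic(struct item *) *tail, struct item *it)
{
    struct item *prev;
    assert(it != NULL);
    atomic_store(&it->v, NULL);
    read_and_swap_addr(tail, it, &prev);   /* claim the tail position */
    if (prev != NULL)
        atomic_store(&prev->v, it);        /* link the old tail to the new item */
    return prev;
}
```

Between the swap and the store to prev->v, a reader walking the list can briefly see a null next pointer on an item that is no longer the tail, which is why the text suggests a tag in each item to block the consumer until the next item is available.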
A Scalable Shared Queue
This section provides a description of the implementation of the shared queue on the WPRAM model. The solution was developed in cooperation with Andy Jones of SGS-Thomson.
Semantics of the Queue
A shared data type which occurs frequently in parallel applications is the first-in first-out (FIFO) queue. The queue implementation described here allows multiple processes to both insert and delete items concurrently. It is assumed that a deleter will block until an item is available in the queue but that an inserter will return once the item has been placed in the queue. For the WPRAM, the weak FIFO condition holds: a specific ordering of the queue insertions and deletions between processes is only guaranteed by the explicit synchronisation of those processes involved. In other words, if a group of processes carry out insertions and deletions on the queue without synchronising then the ordering of those operations between the processes is arbitrary.
Implementation of the Queue on the WPRAM Model
Overview of the Implementation
It is required that the performance of the queue scales linearly with the number of processors P, subject to a degradation incurred by the network latency. More formally, $f = cP/\log P$, for a constant c and frequency f of queue insertion and deletion. The logarithmic factor represents the network latency. The solution described here is based on the concurrent access of a static array with a number of elements equal to the number of processors (Gottlieb et al, 1983). The concurrent operation read_and_add makes it possible to gain access to a unique element of the array, by incrementing an integer which keeps track of the current array index (initially set to zero). The actual index number should be taken modulo the size of the array (this is assuming that the ...). This static array is one element used to support the queue, and is shown in figure 5.
Although each processor can access the shared array concurrently, using read_and_add, the array itself cannot be used to directly manage the transfer of data in and out of the queue, since the number of outstanding insert or delete requests may exceed the size of the array. So there must be some means of dynamically allocating space at each array element in order to store references to inserted data, and requests for the deletion of data from the queue.
The problem is solved by each array element supporting two linked lists, as demonstrated in figure 5. One list will hold references to inserted data items, and the other pending delete operations. The two sets of lists can be accessed independently through the array by using two array indices: I for the insertion requests, and D for the deletions. Each array item holds pointers to the tails of its two lists, so that new items may be appended, and a tag variable to coordinate exchanges between each list pair (explained in more detail later on). Each list must be able to support atomic insertions, otherwise a lock would be required to protect each list (and some means for processes to block on the lock). The solution is based on the use of the read_and_swap_addr operation, as described in sections 2.1 and 3, to atomically update shared addresses within the array elements and linked list items.

(c) A subsequent deletion operation sets the delete list pointer. The list item contains its own tag variable t, which can be used to block the associated process until an inserted item is available.
(d) Since the delete item is also the first in the list, it waits on the tag variable T to coordinate with the insert list. In this case, T is already set with the shared address of the insert item, and so this is returned to the delete item. This allows the Data field of the insert item to be accessed.
(e) The delete item can now remove the insert item, which may uncover new items in the insert list. It can also unset the tag T.
(f) Finally, the delete item can remove itself from the delete list, again possibly uncovering other items in the list.

These issues are dealt with in more detail in the following sections. The data types shown above are used to support the queue implementation. Structure queue_t describes the queue in terms of a static array of linked list pairs, pointed to by list_array. These lists are accessed in the array using the integer index I for the insert lists and D for the delete lists. A list pair is implemented by the structure llist_pair_t. This holds pointers to the most recently inserted items in both lists, and a pointer to a tag variable used to coordinate communication between the lists. Each item within a list, given by the data type item_t, holds the address of the next item in the list in v. An item in an insert list will also point to the data which was inserted into the queue through the field data. An item in a delete list will hold a pointer to a tag variable, used to block the associated deleting process until a queue item is available.

The Creation of a Queue
The creation of a shared queue means setting the I and D variables to zero, and allocating the static array of P elements of type llist_pair_t. Each of the P nodes may then initialise its corresponding array element concurrently. This consists of setting the insert list and delete list pointers to NULL, and setting T to point to a tag variable.
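The data types and the creation step described above might look as follows in plain C. This is a reconstruction from the prose, not the paper's listing: only I, D, list_array, v and data appear in the surviving code, so the field names insert_list, delete_list, T, t and size, the contents of tag_t, and the functions create_queue and destroy_queue are assumptions. The sketch also initialises the array in a sequential loop, where the WPRAM version lets each node initialise its own element concurrently.

```c
#include <assert.h>
#include <stdlib.h>

struct tag_t { void *addr; };         /* stand-in for a WPRAM tag variable */

struct item_t {
    struct item_t *v;                 /* next item in the list */
    void *data;                       /* insert list: address of inserted data */
    struct tag_t *t;                  /* delete list: tag the deleter blocks on */
};

struct llist_pair_t {
    struct item_t *insert_list;       /* most recently appended insert item */
    struct item_t *delete_list;       /* most recently appended delete item */
    struct tag_t *T;                  /* coordinates the two lists */
};

struct queue_t {
    int I, D;                         /* insert and delete array indices */
    int size;                         /* number of array elements (assumed field) */
    struct llist_pair_t *list_array;  /* static array of list pairs */
};

void destroy_queue(struct queue_t *q)
{
    if (q == NULL) return;
    if (q->list_array != NULL)
        for (int k = 0; k < q->size; k++)
            free(q->list_array[k].T);
    free(q->list_array);
    free(q);
}

struct queue_t *create_queue(int p)
{
    assert(p > 0);
    struct queue_t *q = calloc(1, sizeof *q);   /* I and D start at zero */
    if (q == NULL) return NULL;
    q->size = p;
    q->list_array = calloc((size_t)p, sizeof *q->list_array);
    if (q->list_array == NULL) { destroy_queue(q); return NULL; }
    for (int k = 0; k < p; k++) {               /* one element per node */
        q->list_array[k].insert_list = NULL;
        q->list_array[k].delete_list = NULL;
        q->list_array[k].T = calloc(1, sizeof(struct tag_t));
        if (q->list_array[k].T == NULL) { destroy_queue(q); return NULL; }
    }
    return q;
}
```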

Insertion into a Queue
    void Insert_Queue_ADT (struct queue_t *q, void *addr)

    struct llist_pair_t *array;   /* points to the static array */
    int ind;                      /* used to access I index value */
    struct item_t *i;             /* new item to place in queue */
    struct item_t *last,*prev;    /* pointers to other queue items */
    int active = FALSE;           /* TRUE : previous item in list active */
The above code lists the interface and internal variables of the insertion procedure. The procedure will place the address specified by addr into the queue and return immediately. An insertion operation requires that the process append an item onto an insert list so that it may be paired off with an item on the corresponding delete list. The operation can be split into three main procedures.
1. An insertion item is created, with the data field set to the address of the data to insert into the queue, and the v pointer field set to null. The operation new() supports the dynamic allocation of shared data blocks. A call to nodes() returns the number of processors being used. A read_and_add operation, aimed at I and using the constant 1, can then be used to determine which array element to use.
    read_and_add (q->I,1,&ind);   /* find new index value */
    array = q->list_array;        /* find address of static array */
    i = (struct item_t*) new (sizeof(struct item_t));   /* new item */
    switch();                     /* guarantees ind, array and data allocation */
    i->data = addr;               /* set address of data to insert */
    ind = ind % nodes();          /* find element of static array to use */
    i->v = NULL;                  /* set pointer to next item to NULL */
    switch();

2. The item must then be appended onto the associated insertion list pointed to by the array element.
The method to achieve this uses the read_and_swap_addr operation, which atomically exchanges the list tail pointer with the address of the new list item.

The code above describes the interface to the deletion procedure. The procedure will return a shared address from the queue into addr, blocking until a queue item is available. The initial deletion procedure is very similar to parts 1 to 3 of the insertion method. The only differences in the code are that:
D is used to access the static array, rather than I.
A tag is created as part of the new list item (to block the deleting process until it is the first active item in the list).
When the item becomes the first active list item, it attempts to read the address held in the tag, rather than writing to it, since this is the address of a new item on the insert list.
Once the address in the tag has been read, this can then be used to access the insert list item, and hence the address of the data inserted into the queue. An extra phase present in the deletion procedure is then to update the insertion and deletion lists. This involves uncovering subsequent list items, if available, so that other exchanges between the pair of lists may go ahead. A potential race condition exists at this point, in which new items append themselves to the list at the same time as earlier delete items test for their presence. Figure 7 demonstrates how this race condition can be avoided. There are two possible interleavings of events which can occur:
1. The process appending a new item to a list executes a read_and_swap_addr on v, setting v to the address of the new item, and returning a null value (figure 7(a) parts (i) and (ii)). The deletion procedure subsequently also executes read_and_swap_addr on v (using any non-null address) and finds the return value is non-null (figure 7(a) parts (ii) and (iii)). It then knows that the new item has already been appended and can set the tag in the array element to the address of the item (see stage c for the insertion procedure).
2. The deletion procedure executes the read_and_swap_addr before the process appending the new item and so finds that v is null (figure 7(b) parts (i) and (ii)). Since there are no new items the deletion procedure is complete. A process subsequently appending a new item onto the list using read_and_swap_addr will then find that v is non-null and so knows that the deletion procedure has already tested the pointer (figure 7(b) parts (ii) and (iii)). The process therefore sets the tag within the array element with the address of this new item (see stage c for the insertion procedure).
The use of the read_and_swap_addr operation on the pointer v removes the race condition, because the operation atomically changes the state of v. In fact, the atomicity characteristic of the operation is more important than its combining potential in this example, since the queue requests will be spread evenly across the array elements (and so across the separate lists). The last phase is for the deletion procedure to unblock any waiting processes. The deletion procedure must test the pointer within its item to check if another process is waiting in the list to use the queue (and if it is then to unblock it by setting the tag in this new item). The function DONE() is simply used to return any non-null shared address with which to update the pointers within each list item. The description below also uses the following variables for each process:

    struct item_t *next;       /* a pointer to other list items */
    struct item_t *i;          /* set to point to new list item */
    struct item_t *enq_item;   /* set to inserter list item */
The code to accomplish this stage is as follows.

    /* free inserter item */
    }

Discussion
Gottlieb et al (1983) describe an implementation of a concurrent shared queue using a static shared array and concurrent addition. However, the description assumes the presence of some form of lock at each array element, to deal with contention for the array. Wilson (1992) describes the support of a concurrent stack using concurrent addition to determine the destination processor to send a push or pop request. This requires that all processors in the machine participate in the support of the shared data type, even if only a small fraction of them actually make use of it. The implementation of the queue provided here maintains its state directly within the shared address space (leading to the need for the concurrent exchange operation, to support atomicity).
A side-effect of the queue implementation provided here is that insertion and deletion requests which have been serviced are still required to remain on the linked list until a new request is appended to the list. This is due to the added coherency problems if a serviced request attempts to alter the list pointer in the associated array item while a new request is also accessing this pointer. There is some scope for the optimisation of the queue implementation provided here. New list items and tag variables could be dynamically allocated and initialised on the calling node, provided that remote nodes can still access them. This is precisely the semantics offered by local shared memory in the WPRAM (see section 2.1).
Performance of the Queue
It is assumed here that the processes accessing the queue are spread evenly across the processors, as are the elements of the static array, so that they can be accessed concurrently.
The procedure to insert an item into the queue uses the combinable operations to access the array and linked list and so does not synchronise with any other process. There is a constant number of operations, each costing O(log P), resulting in a total cost of O(log P). Assuming that the deletion procedure does not have to wait to reach the head of the linked list and that there is a corresponding insertion item already available (i.e. the tag in the array item is set), then the same conditions hold as for the insertion costing, resulting in a total cost of O(log P).
The above results mean that the insertion of N items, and their deletion, on a P processor machine will complete in time O(log P + N/P), ignoring the creation of the queue, since each queue access can proceed independently.
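The shape of this bound can be explored with a small sketch that evaluates a model of the form a*ceil(log2 P) + b*N/P. The constants a and b, the choice of base 2 for the logarithm, and the function names tree_depth and model_time are assumptions made purely for illustration; they are not taken from the simulator.

```c
/*
 * Depth of a binary combining tree over p processors, i.e. ceil(log2(p)).
 * Assumes p >= 1. A 64-bit span avoids overflow for large p.
 */
unsigned tree_depth(unsigned p)
{
    unsigned depth = 0;
    unsigned long long span = 1;
    while (span < p) {
        span *= 2;
        depth++;
    }
    return depth;
}

/*
 * Illustrative completion-time model for n inserts and deletes on
 * p processors: a latency term growing with log P plus a term for the
 * n/p requests handled per processor. The weights a and b are example
 * constant factors, not measured values. Assumes p >= 1.
 */
double model_time(unsigned n, unsigned p, double a, double b)
{
    return a * (double)tree_depth(p) + b * ((double)n / (double)p);
}
```

With a = b = 1, for example, model_time(100, 4, 1.0, 1.0) is dominated by the N/P term, whereas model_time(100, 100, 1.0, 1.0) is dominated by the logarithmic term.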
Simulation Results for the Queue
Operation                                 Cost
Network Parameters
  network latency D                       414*log(P)
  network granularity g                   10,000
Local Parameters
  data movement / integer arithmetic      20
  floating point arithmetic               48
  process management and scheduling       333
  setup time to send a packet             1000
  send / receive X bytes                  100*X
  cache line size                         16
  page size                               512
Table 3: Global Router Costs for the IMS-T9000 and C104
A discrete-event simulation of the WPRAM model has been written so that the parallel execution time of an algorithm can be found. The simulator implements the mapping from the WPRAM to the global router, with the performance characteristics taken from figures for the IMS-T9000 processor and C104 packet router. Table 3 summarises the global router costs (in nanoseconds). The values for the network granularity and latency are derived from SGS-Thomson simulation results (May et al, 1993; Jones, 1993), and an Esprit working paper (Klein, 1991). The results are based on randomised traffic patterns and on the use of randomised and adaptive routing methods with systematic traffic patterns. Full details of the simulator and mapping can be found in (Nash, 1993).
Performance results for the queue are given in figure 8(a), and show the completion time of the algorithm as the number of processors and the number of pairs of inserts and deletes each vary between 1 and 100. The results assume that each inserter simply adds a single item and terminates, and similarly that each deleter removes a single item. The results correspond closely with the expected complexity of O(log P + N/P), and a plot of this complexity result for N = 100 (and example constant factors) is included in the graph for comparison. When P dominates, the expected time complexity is O(log P). This is shown in the graph by the results converging to the same point. When N dominates, the expected complexity is O(N/P). The results show that in this case the completion time increases rapidly, because the large number of processes per processor sequentialises the access to the queue. The uneven decrease in completion times shown in many of the plots results from the more even distribution of processes achieved with some numbers of processors than with others. For example, 100 operations can use the machine more effectively using 50 processors than 40.
Figure 8(b) demonstrates how the performance of the algorithm differs when the number of processes per processor is changed. This was carried out to measure the practical use of exploiting parallel slackness (Valiant, 1990) in the algorithm, where the parallelism in the problem surpasses the number of processors. In the example, the amount of work (i.e. the number of insert and delete operations) is kept constant for a particular number of processors, but the number of processes (or contexts) per processor is changed by altering the amount of work which each must complete. One context per processor means that each will carry out four queue insert and delete operations, whereas four contexts per processor will reduce this to one set of queue operations per context.
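The remark that 100 operations use 50 processors more effectively than 40 can be illustrated by counting the heaviest load on any one processor. The sketch below assumes that operations are dealt out as evenly as possible, which is an assumption about the experiment rather than a documented property of the simulator; the function name max_ops_per_processor is hypothetical.

```c
/*
 * Largest number of operations assigned to any single processor when
 * n operations are spread as evenly as possible over p processors,
 * i.e. ceil(n / p) computed in integer arithmetic. Assumes p >= 1.
 */
unsigned max_ops_per_processor(unsigned n, unsigned p)
{
    return (n + p - 1) / p;
}
```

Under this assumption, 50 processors each receive exactly 2 of the 100 operations, while with 40 processors some receive 3. The heaviest load stays at 3 for every processor count from 34 to 49, which is consistent with completion times that fall in steps rather than smoothly.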
The results show that changing from one to two contexts per processor brings a reduction in the completion time of around 30 percent. The reduction in execution time results from the increased overlap of communication and computation by a processor. Although increasing the number of contexts still further to four does slightly improve the completion time, the decrease is much less marked than before.
Figure 9(a) demonstrates the scalability of the queue through the use of the throughput measure. This measures the degree of concurrency which can be exploited as P increases, and is computed as T(P, 1)/T(P, P), where T(N, P) is the execution time of N inserts and deletes on P processors. This can be viewed as a special case of the speedup measure, in which the problem size increases linearly with the number of processors. Because each insert and delete operation can proceed independently (there are P queue operations across P processors), the theoretical maximum throughput is O(P). The actual result shows the timings produced using the WPRAM simulator. This shows that the throughput of the queue always increases with the number of processors, but that, due to network delays, this increase is smaller each time. Calculating the expected throughput from the complexity results above gives the result O((P + N)/(log P + N/P)). This simplifies to O(P/log P), since N equals P in this case. The results seem to match the simulator output closely. This also corresponds with the required access frequency f = c P/log P defined in section 4.2.
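The simplification of the expected throughput can be written out step by step. The following is a sketch of the substitution N = P only, using the symbols already defined in the text:

```latex
\begin{align*}
\frac{T(P,1)}{T(P,P)}
  &= O\!\left(\frac{P + N}{\log P + N/P}\right) \\
  &= O\!\left(\frac{2P}{\log P + 1}\right) && \text{(substituting } N = P\text{)} \\
  &= O\!\left(\frac{P}{\log P}\right).
\end{align*}
```

The constant 2 in the numerator and the additive 1 in the denominator are absorbed by the O-notation as P grows.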
If the queue implementation used a locking mechanism then the throughput graph would flatten out once a certain number of processors was reached, because a lock can only be acquired and released at a fixed maximum frequency. In fact, if the acquire and release operations for a lock were not implemented scalably then the throughput graph might actually decline after a certain point, due to network and memory module contention increasing the latency of message traffic.
Finally, figure 9(b) shows the resulting throughput achieved through the use of a single linked list (supporting concurrent insertions). The throughput is calculated as O(P/(P log P)) = O(1/log P), since the sequential time is linear in the input size (P) and the parallel time is dominated by the sequential deletion of list items. However, this does not seem to match the simulation results presented in the graph. The reason for this is that the factor log P, representing the network latency, carries only a small multiplying constant compared to the other constant factors making up the result. Thus, for large values of P, and assuming the logarithmic factor can be treated as a constant value, the complexity result reduces to O(P/P) = O(1). This closely matches the simulation results, and demonstrates the need to account for key constants in the performance results. The plot of "projected" throughput in the graph uses the fitted form 1.5 {2P/(P + 1)}.
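The behaviour of the fitted curve can be checked with a short sketch, reading the fitted form as 1.5 multiplied by 2P/(P + 1). The function name projected_throughput is hypothetical and the code only evaluates the formula; it does not reproduce any simulator output.

```c
/*
 * Evaluate the fitted "projected" throughput, read here as
 * 1.5 * (2P / (P + 1)), for a machine of p processors.
 * The value rises from 1.5 at p = 1 and approaches, but never
 * reaches, 3 as p grows, i.e. it is O(1).
 */
double projected_throughput(unsigned p)
{
    double procs = (double)p;
    return 1.5 * (2.0 * procs / (procs + 1.0));
}
```

Because the expression is bounded above by 3, it is consistent with the O(1) behaviour argued for the single linked list.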
Conclusions and Future Work
The major contribution of this paper has been to show how data can be shared on a highly parallel multiprocessor, in a scalable and efficient manner, using a highly concurrent queue. The implementation of the queue is based on a simple distributed memory machine abstraction, called the global router, allowing pairwise process synchronisation and a limited set of atomic concurrent data access operations. A shared address space programming abstraction was used to describe the implementation. This abstraction has been simulated on a system of IMS-T9000 processors communicating using C104 packet routers, and has been defined more formally in the WPRAM computational model.
It is interesting to note that a direct implementation of the queue on a distributed memory machine would require precisely the functionality provided by the WPRAM model. This includes the ability to combine concurrent accesses to a shared variable (in the access of a static array index or a list pointer) and the support of pairwise synchronisation between processes. Also, the ability to spread remote data accesses across the memory modules of the machine is required (used in the access of distinct array items and lists).
A priority of future work is to build implementations of the WPRAM model. A grant from the EPSRC Parallel Software Tools Initiative (GR/K40697) will allow an initial implementation on a PowerPC/T805 platform. The aim is to make this highly portable, so that efficient implementations can then be supported on other platforms, such as the Cray T3D. A research grant for the use of shared abstract data types for high level sharing within parallel systems has also been awarded. As an example, the shared queue has been used to design a scalable load balancing method for highly irregular search trees (Nash et al, 1996).
