This paper describes a scheme to implement a shared address space of size $m$ on an $n$-node mesh, with $m$ polynomial in $n$, where each mesh node hosts a processor and a memory module. At the core of the simulation is a Hierarchical Memory Organization Scheme (HMOS), which governs the distribution of the shared variables, each replicated into multiple copies, among the memory modules, through a cascade of bipartite graphs. Based on the expansion properties of such graphs, we devise a protocol that accesses any [...] $n^{1/2} \log n$, using $O(\log^{1.59} n)$ copies per variable. In both cases the access time is close to the natural $O(\sqrt{n})$ lower bound imposed by the network diameter. A key feature of the scheme is that it can be made fully constructive when $m$ is not too large, thus providing in this case the first efficient, constructive, deterministic scheme known in the literature for bounded-degree processor networks. For larger memory sizes, the scheme relies solely on a nonconstructive graph of weak expansion. Finally, the scheme can be efficiently ported to other architectures, as long as they exhibit certain structural properties. In the paper we discuss the porting to the pruned butterfly, an area-universal network which is a variant of the fat-tree, and to multi-dimensional meshes.
Introduction
A desirable feature of a parallel computer is the provision of a shared address space that can be accessed concurrently by all the processors of the machine. Indeed, the manipulation of shared data provides a powerful and uniform mechanism for interprocessor communication, and constitutes a valuable tool for the development of simple and portable parallel software. Unfortunately, when the number of processors exceeds a certain (modest) threshold, any efficient hardware realization of shared memory is either prohibitively expensive or out of reach of current technology. Therefore, a shared address space must be provided virtually on hardware platforms consisting of a set of processor/memory module pairs which are connected through a network of point-to-point communication links.
This problem has received considerable attention over the past two decades, and has been the target of a large number of investigations, both theoretical and applied. In the theoretical community, the problem is best known as the PRAM simulation problem. An $(n, m)$-PRAM is an abstraction of a shared-memory machine consisting of $n$ synchronous RAM processors that have direct access to $m$ shared variables. In a PRAM step, executed in unit time, any set of $n$ variables can be read or written in parallel by the processors. A solution to the PRAM simulation problem is a scheme to perform any computation of an $(n, m)$-PRAM on a target machine consisting of a network of $n$ processor/memory pairs. A typical PRAM simulation scheme distributes the PRAM shared variables among the $n$ modules local to the machine processors, and recasts a parallel access to the shared memory into the routing of messages from the processors requesting the variables to the processors storing such variables.
Several randomized PRAM simulation schemes have been proposed in the literature. In all these schemes, the shared variables are distributed among the memory modules via one (or more) hash functions randomly drawn from a suitable universal class. Among the most relevant results, we recall that a PRAM step can be simulated, with high probability, in $O(\log\log\log n \log^* n)$ time on the complete network [CMS95], in $O(\log n)$ time on the butterfly [Ran91] and in $O(\sqrt{n})$ time on the mesh [LMRR94].
In contrast, the development of efficient deterministic schemes, that is, schemes that guarantee a fast worst-case simulation time for any PRAM step, appears to be much harder. A simple argument shows that in order to avoid trivial worst-case scenarios, where all the variables requested in the PRAM step are stored in a small region of the network, one has to use several copies for each variable, so that only a subset of "convenient" copies needs to be reached by each operation. The number of copies used for each variable is called the redundancy of the scheme.
The idea of replicating each variable into multiple copies dates back to the pioneering work of Mehlhorn and Vishkin [MV84]. In their approach, a read operation need only access one (the most convenient) copy. For $m = O(n^R)$, the authors obtain a scheme for the complete interconnection which uses $R$ copies per variable and allows any set of $n$ reads to be satisfied in time $O(n^{1-1/R})$. However, the execution of $n$ write operations, where all copies of the variables must be accessed, is penalized and requires $O(Rn)$ time in the worst case.
Later, Upfal and Wigderson [UW87] proposed a more balanced protocol requiring that, in order to read or write a variable, only a majority of its copies be accessed. They also represent the allocation of the copies to the modules by means of a Memory Organization Scheme (MOS). An MOS is a bipartite graph $G = (V, U)$, where $V$ is the set of shared variables, $U$ is the set of memory modules of the underlying machine, and $R$ edges connect each variable to the modules storing its copies. For $m$ polynomial in $n$ and $R = \Theta(\log n)$, the authors show that there exist suitable expanding graphs that guarantee a worst-case $O(\log n (\log\log n)^2)$ time to access any $n$ variables on the complete interconnection. This bound was later improved to $O(\log n)$ in [AHMP87]. Several authors pursued the ideas in [UW87] to develop simulation schemes for bounded-degree networks of various topologies. In particular, schemes have been devised to simulate an arbitrary step of an $(n, m)$-PRAM, with $m$ polynomial in $n$, in time $O(\log^2 n / \log\log n)$ on a Mesh-of-Trees (MoT) with $n$ processors and $\Theta(n^2)$ switching elements [LPP90], or in time $O(\log n \log m / \log\log n)$ on an $n$-processor expander-based network [HB94], or in time $O(\log n \log\log n \log\log(m/n))$ on a suitably augmented MoT [Her96].
All of the aforementioned deterministic schemes (except for the one in [MV84] which, however, is not general since write accesses are heavily penalized) suffer from two major limitations.
1. The MOS graphs must exhibit maximum expansion relative to the $m/n$ ratio. Although the existence of such graphs can be proved through standard counting arguments, no efficient constructions are yet available. In addition, it is unlikely that the (few) constructions known for expanders may be of use when $m$ is much larger than $n$.
2. The expansion properties of the MOS are exclusively used to curb memory contention. Network congestion issues are either ignored, as in the case of simulations on the complete network, or solved by means of separate mechanisms tailored to the specific network's topology (see footnote 1).
Recently, constructive deterministic schemes exhibiting nontrivial performance have been developed for the complete interconnection. In [PP97] three schemes are presented for $m = O(n^{3/2})$, $m = O(n^2)$ and $m = O(n^3)$ variables, which attain $O(n^{1/3})$, $O(n^{1/2})$ and $O(n^{2/3})$ access time, respectively, for any $n$-tuple of variables using constant redundancy.
These schemes rely on MOS graphs that admit efficient explicit constructions but exhibit weak expansion. In this paper we will exploit the same constructions in a more complex framework to achieve efficient implementations of shared data on realistic, low-bandwidth machines. Specifically, we will develop a novel approach where the inefficiencies caused by the weak expansion of the memory map are absorbed into the inherent bandwidth limitations of the interconnection, and where both memory contention and network congestion are controlled through a single mechanism.
Overview of Results
This paper presents a deterministic scheme for implementing a shared address space of size $m$ on an $n$-node square mesh, with $m$ polynomial in $n$, where each node consists of a processor with direct access to a local memory module. The scheme provides a protocol to access an arbitrary set of $n$ shared variables in nearly-optimal time, for all values of $m$. The scheme is fully constructive for $m = O(n^{3/2})$, whereas for larger values of $m$ it embodies only a nonconstructive component graph of constant degree whose expansion properties, however, are much weaker than those required of the graphs used in previous works. Full constructiveness can also be attained for memory sizes up to $m = O(n^{9/2})$, at the expense of a progressive degradation in performance when $m$ gets closer to the upper bound.
The scheme adopts a novel redundant representation of the shared variables and is centered around the Hierarchical Memory Organization Scheme (HMOS), which provides a structured distribution of the copies of the variables among the memory modules. The HMOS consists of $k + 1$ levels of logical modules built upon the set of shared variables. The modules of the first level (level 0) store copies of variables, whereas modules of level $i > 0$ store replicas of modules of level $i - 1$. The HMOS is represented by a cascade of bipartite graphs, where the first graph governs the distribution of the copies of the variables to the modules of level 0, and the other graphs govern the distribution of the replicas of modules at higher levels. Each level of the HMOS corresponds to a tessellation of the mesh into submeshes of appropriate size, with each module of that level assigned to a distinct submesh.

Footnote 1. In fact, in [HB94] an MOS with slightly less than maximum expansion is employed in order to reduce the redundancy and, consequently, network congestion, at the expense of an increase in memory contention. However, such an MOS does not embody any specific mechanism to explicitly control network congestion.
We devise an access protocol to satisfy n arbitrary read/write requests issued by the n processors, which takes advantage of the hierarchical structure of the HMOS. As customary in any scheme that uses multiple copies, the access relative to each variable is executed on a selected subset of its copies. A suitable copy selection mechanism is developed to limit the number of copies to be accessed in each submesh, and, ultimately, in each individual module. In this sense, the HMOS provides a single mechanism to cope with both memory contention and network congestion, which represents a novelty with respect to previous works, where the two issues were dealt with separately.
In order to achieve low memory contention and network congestion, thus guaranteeing fast access time, the HMOS component graphs must exhibit certain expansion properties. Compared to the graphs used in previous schemes, our graphs have much weaker expansion, attainable using only constant (as opposed to logarithmic) input degree. This makes the HMOS more amenable to explicit construction. Indeed, all HMOS graphs but the first one are defined as subgraphs of a well-known combinatorial structure, the BIBD, for which an explicit and simple construction is known. As for the first graph, an explicit construction can be provided when $m$ is not too large, thus making the HMOS fully constructive, while for large values of $m$, the graph can be shown to exist through a standard counting argument.
Our results are reported in detail below, and summarized in Table 1.

Theorem 1 For any constant $\alpha > 1$, there exists a scheme to distribute $m = n^\alpha$ shared variables among the local memory modules of an $n$-node mesh with redundancy $R$ so that any $n$ variables can be read/written in time [...]

Prior to the present work, no efficient deterministic schemes for implementing shared memory explicitly designed for the mesh topology were known in the literature. However, the schemes designed for the complete interconnection can be implemented on the mesh through sorting and routing. Finally, it is important to observe that our scheme is not specifically tailored to the mesh topology, but can be ported, with minor adjustments, to other topologies. In particular, the same access times reported in Theorems 1 and 2 can be attained on an $n$-leaf pruned butterfly, an area-universal network which is a variant of the fat-tree, and Theorem 1 can be generalized to hold for $d$-dimensional meshes, with constant $d$, by substituting $n^{1/d}$ for $n^{1/2}$ in the formulas.
The rest of this paper is organized as follows. Section 2 defines the machine model and introduces the routing and sorting primitives used by the access protocol. Section 3 describes the HMOS (Subsection 3.1) and its implementation on the mesh (Subsection 3.2). A suitable construction for the BIBDs used in the HMOS is given in an appendix to the paper. Section 4 presents the protocol for accessing an arbitrary $n$-tuple of shared variables. This section is subdivided into two subsections that describe the selection of the copies and the routing protocol, respectively. In Section 5, suitable values for the design parameters of the HMOS are selected, and Theorems 1 and 2 are proved. Section 6 shows how the scheme can be generalized to other architectures, such as the pruned butterfly and multi-dimensional meshes. Section 7 closes with some final remarks.
Machine Model
We present our shared memory implementation on a mesh, consisting of an array of $\sqrt{n} \times \sqrt{n}$ processor-memory pairs, connected through a two-dimensional grid of communication links. The machine operates in lock-step, where in each step a processor can perform a constant amount of local computation (including accesses to its local memory) and can exchange a constant number of words with one of its direct neighbors. Our objective is to devise a distributed representation of $m \ge n$ shared variables on the mesh so that any $n$-tuple of read/write accesses to these variables can be served efficiently. The approach will be generalized to other architectures in Section 6.
The access protocol will make use of the following primitives, for which optimal algorithms are known in the literature. We call $\ell$-sorting a sorting instance in which at most $\ell$ keys are initially assigned to each processor and are to be redistributed so that the $\ell$ smallest keys will be held by the first processor, the next $\ell$ smallest ones by the second processor, and so on, with the processors numbered in row major order. We have:

Fact 1 ([Kun93]) Any $\ell$-sorting can be performed on an $n$-node mesh in $O(\ell\sqrt{n})$ time.

We call $(\ell_1, \ell_2)$-routing a routing problem in which each mesh processor is the source of at most $\ell_1$ packets and the destination of at most $\ell_2$ packets. We have:

Fact 2 ([SK94]) Any $(\ell_1, \ell_2)$-routing can be performed on an $n$-node mesh in $O(\sqrt{\ell_1 \ell_2 n})$ time.
A simple bisection-based argument shows that this result is optimal in the general case. However, for a special class of $(\ell_1, \ell_2)$-routings, a better performance can be achieved as follows. Fix a tessellation of the mesh into $n/s$ submeshes of $s$ nodes each, and consider an $(\ell_1, \ell_2)$-routing where at most $\delta s$ packets are destined for each submesh. We first use $\ell_1$-sorting and $(\ell_1, \delta)$-routing to spread the packets evenly among the nodes of their destination submeshes, and then complete the routing by running $n/s$ independent instances of $(\delta, \ell_2)$-routing within each submesh. By Facts 1 and 2, the overall routing time becomes:

$$O\left(\ell_1\sqrt{n} + \sqrt{\ell_1 \delta n} + \sqrt{\delta \ell_2 s}\right).$$

Comparing the $O(\sqrt{\ell_1 \ell_2 n})$ complexity of the general $(\ell_1, \ell_2)$-routing algorithm with the above routing time, we see that the new algorithm is profitable when $\delta, \ell_1 = o(\ell_2)$ and $\delta s = o(\ell_1 n)$. This fact will be exploited in our access protocol, where packet routing is used to access selected copies of the variables. In particular, we will employ several nested tessellations of the mesh and provide strong bounds on the congestion within the submeshes of each tessellation, so that the above strategy can be applied. The packets will then be routed gradually to their destinations through a sequence of smaller and smaller submeshes.
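The comparison between the two routing strategies can be illustrated numerically. The sketch below assumes that the staged strategy costs the sum of the three terms given by Facts 1 and 2 (the $\ell_1$-sorting, the $(\ell_1, \delta)$-routing on the whole mesh, and the $(\delta, \ell_2)$-routing inside an $s$-node submesh), with all hidden constants set to 1. The function names and the sample parameters are hypothetical and chosen only for illustration.

```python
import math

def general_routing_time(l1, l2, n):
    """Fact 2 bound with the hidden constant dropped: sqrt(l1 * l2 * n)."""
    return math.sqrt(l1 * l2 * n)

def staged_routing_time(l1, l2, n, s, delta):
    """Two-phase bound with hidden constants dropped (an assumed cost model).

    Phase 1: l1-sorting plus (l1, delta)-routing on the whole n-node mesh.
    Phase 2: (delta, l2)-routing inside each s-node submesh.
    """
    spread = l1 * math.sqrt(n) + math.sqrt(l1 * delta * n)
    local = math.sqrt(delta * l2 * s)
    return spread + local

# Illustrative instance: few packets per source, many per destination,
# small submeshes (delta and l1 much smaller than l2, delta*s much smaller than l1*n).
n, s = 2**20, 2**8
l1, l2, delta = 1, 2**10, 4
general = general_routing_time(l1, l2, n)
staged = staged_routing_time(l1, l2, n, s, delta)
print(general, staged)
```

On this instance the staged estimate is a factor of 8 below the general one, which matches the regime in which the text calls the staged strategy profitable.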
The Hierarchical Memory Organization Scheme
This section introduces the Hierarchical Memory Organization Scheme (HMOS), a mechanism through which m shared variables are distributed among the n memory modules of a processor network. The section is organized in two subsections: Subsection 3.1 presents the logical structure of HMOS, while Subsection 3.2 describes its actual implementation on the mesh.
Logical structure of the HMOS
The HMOS is structured as $k + 1$ levels of logical modules built upon the shared variables, where $k = O(\log\log n)$ is a nonnegative integer function of $n$, to be specified by the analysis. More specifically, starting from $m = n^\alpha$ shared variables, for a fixed constant $\alpha > 1$, the HMOS comprises $m_i$ modules at level $i$, called $i$-modules, for $0 \le i \le k$, where the $m_i$'s are strictly decreasing values that will be specified later. Modules are nested collections of variables, obtained as follows. First, each variable is replicated into $r = \Theta(1)$ copies, which are assigned to distinct 0-modules. The contents of each 0-module, viewed as an indivisible unit, are in turn replicated into 3 copies, which are assigned to distinct 1-modules. In general, the contents of each $(i-1)$-module, viewed as an indivisible unit, are replicated into 3 copies, which are assigned to distinct $i$-modules, for $0 < i \le k$. It is easy to see that the above process will eventually create $3^{k-i}$ replicas of each $i$-module and $r3^k$ copies per variable. In the rest of this paper, we will reserve the term copy to denote the replica of a variable, and $i$-block to denote the replica of an $i$-module.
The difference between an $i$-module and one of its $i$-blocks is akin to the difference between a variable and one of its copies. Namely, an $i$-module represents an abstract entity, of which several physical replicas, its $i$-blocks, exist. Since the contents of $k$-modules are not replicated, the terms $k$-module and $k$-block will be used interchangeably. Note that a $k$-block is made of $(k-1)$-blocks, which in turn are made of $(k-2)$-blocks, and so on until 0-blocks are reached. The latter contain copies of variables.
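The replication counts of the HMOS ($3^{k-i}$ blocks for each $i$-module and $r3^k$ copies for each variable) can be checked with a few lines of code. The following is a minimal sketch; the function names and the encoding of a copy as a tuple of replica indices are assumptions made for illustration only.

```python
from itertools import product

def replica_counts(r, k):
    """Return (blocks_per_module, copies_per_variable).

    blocks_per_module[i] is the number of i-blocks of each i-module,
    for 0 <= i <= k; a k-module has exactly one k-block.
    """
    blocks_per_module = [3 ** (k - i) for i in range(k + 1)]
    copies_per_variable = r * 3 ** k
    return blocks_per_module, copies_per_variable

def copy_labels(r, k):
    """Label each copy by its replica index at level 0 (r choices)
    followed by its replica index at each of the levels 1..k (3 choices each)."""
    return list(product(range(r), *([range(3)] * k)))

blocks, copies = replica_counts(r=3, k=2)
labels = copy_labels(r=3, k=2)
print(blocks, copies, len(labels))
```

With $r = 3$ and $k = 2$ each variable has 27 copies, and the explicit enumeration of labels agrees with the closed formula.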
Let $V$ denote the set of shared variables, and $U_i$ the set of $i$-modules, for $0 \le i \le k$. The HMOS is represented as a leveled directed acyclic graph (dag) $H$, which is defined by a cascade of $k + 1$ bipartite graphs, namely $(V, U_0)$ and $(U_{i-1}, U_i)$, $0 < i \le k$, whose edges are directed left-to-right. In $(V, U_0)$, each variable $v \in V$ is adjacent to $r$ 0-modules, denoted by $\gamma_0(v, j)$, $1 \le j \le r$. For $0 < i \le k$, in $(U_{i-1}, U_i)$, each $(i-1)$-module $u$ is adjacent to three $i$-modules denoted by $\gamma_i(u, j)$, $1 \le j \le 3$. For notational convenience, we will number the levels in the HMOS starting from $-1$, the level of the variables. As a consequence, nodes at level $i$, $0 \le i \le k$, are $i$-modules.
In the HMOS, each variable $v$ uniquely identifies a single-source subdag $H_v$ induced by all nodes reachable from $v \in V$. A straightforward property of $H_v$ is that it contains $r3^k$ distinct source-sink paths which are in one-to-one correspondence with the $r3^k$ copies of $v$.
Each source-sink path in $H_v$ (hence, each copy of $v$) is uniquely identified by the string of nodes traversed by the path. Moreover, for $0 \le i \le k$, the suffix of any such string starting with a node $u$ at level $i$ of $H_v$ identifies a specific $i$-block storing a copy of $v$. Note that several source-sink paths in $H_v$ may correspond to strings with a common suffix starting from level $i$. In this case, the $i$-block corresponding to the common suffix will store several distinct copies of $v$. A small HMOS for 8 shared variables is shown in Figure 1.

The component bipartite graphs of the HMOS must be carefully chosen in order to guarantee a good distribution of the copies of the variables, once the HMOS is mapped onto the processors' memory modules. More specifically, we require that the graphs exhibit good expansion, according to the following definition.
Definition 1 Let $G = (X, Y)$ be a bipartite graph, where each input node in $X$ has degree $d$. For $0 < \kappa \le 1$, $0 < \epsilon < 1$ and $1 \le \mu \le d$, $G$ is said to have $(\kappa, \epsilon, \mu)$-expansion if for any subset $S \subseteq X$, $|S| \le \kappa|X|$, and for any set $E$ of $\mu|S|$ edges, $\mu$ outgoing edges for each node in $S$, the set $\Gamma_E(S) \subseteq Y$ reached by the chosen edges has size $|\Gamma_E(S)| = \Omega(|S|^{1-\epsilon})$.
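On very small graphs, the quantity bounded in Definition 1 can be computed by exhaustive search. The sketch below is only an illustration: it enumerates every subset of a given size and every way of choosing a fixed number `mu` of outgoing edges per node, and returns the smallest number of output nodes reached. The function name, the parameter name `mu` for the number of chosen edges per node, and the toy graph are assumptions, and the search is exponential, so it is not meant for graphs of realistic size.

```python
from itertools import combinations, product

def worst_case_reach(adj, s, mu):
    """Minimum number of outputs reached over all input subsets of size s
    and all choices of mu outgoing edges for each node in the subset.

    adj maps each input node to the list of its output neighbors.
    """
    best = None
    for subset in combinations(sorted(adj), s):
        # All ways to pick mu of the outgoing edges of each chosen input.
        options = [list(combinations(adj[x], mu)) for x in subset]
        for choice in product(*options):
            reached = set().union(*choice)
            if best is None or len(reached) < best:
                best = len(reached)
    return best

# Toy graph: 4 inputs of degree 3, 6 outputs; any two inputs share one output.
toy = {0: [0, 1, 2], 1: [0, 3, 4], 2: [1, 3, 5], 3: [2, 4, 5]}
reach = [worst_case_reach(toy, s, mu=2) for s in range(1, 5)]
print(reach)
```

For the toy graph the worst-case reach for subsets of sizes 1 to 4 is 2, 3, 3 and 4, respectively.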
We let $|U_0| = n^\beta$, where $0 < \beta < 3/2$ is a parameter to be fixed by the analysis. We require that $(V, U_0)$ has input degree $r$, for a fixed odd constant $r > 0$, output degree $|V|r/|U_0|$, and exhibits $(\kappa, \epsilon, \mu)$-expansion, where $\kappa = n/m$, $\epsilon$ is a positive constant less than 1, and $\mu = (r + 1)/2$. Clearly, a necessary condition for the existence of such a graph is $(\kappa|V|)^{1-\epsilon} = n^{1-\epsilon} \le n^\beta$, which implies $\beta + \epsilon - 1 \ge 0$. The analysis will determine suitable values for $r$, $\beta$ and $\epsilon$ that guarantee the existence of $(V, U_0)$. Moreover, an explicit construction for such a graph will be available when $m = O(n^{9/2})$.
The graphs $(U_{i-1}, U_i)$, for $0 < i \le k$, are derived from instances of varying size of the same combinatorial structure, the Balanced Incomplete Block Design, defined below.

Definition 2 A $(w, q)$-BIBD is a bipartite graph $G = (X, Y)$, with $|Y| = w$, such that:

The degree of each node in $X$ is $q$;

For any two nodes $y_1, y_2 \in Y$ there is exactly one node $x \in X$ adjacent to both.
From the definition, it immediately follows that $|X| = w(w-1)/(q(q-1))$ and that the degree of each node in $Y$ is $(w-1)/(q-1)$. One important property of the BIBD, which we will heavily exploit, is stated in the following lemma.
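The two counting identities can be verified on a concrete design. The sketch below does not reproduce the construction given in the appendix; as an assumption made for illustration, it uses the affine plane over the integers modulo a prime $q$, a standard family in which the $w = q^2$ points play the role of $Y$ and the lines play the role of $X$. For $q = 3$ every node of $X$ has degree 3, as in the HMOS graphs. All function names are hypothetical.

```python
from itertools import combinations

def affine_plane_bibd(q):
    """Points and lines of the affine plane over Z_q (q must be prime).

    Returns (points, lines): points are pairs (x, y); each line is a
    frozenset of q points.
    """
    points = [(x, y) for x in range(q) for y in range(q)]
    lines = []
    for slope in range(q):
        for intercept in range(q):
            lines.append(frozenset((x, (slope * x + intercept) % q)
                                   for x in range(q)))
    for x0 in range(q):  # vertical lines
        lines.append(frozenset((x0, y) for y in range(q)))
    return points, lines

def is_bibd(points, blocks, q):
    """Check that every block has q points and that every pair of points
    lies in exactly one common block."""
    if any(len(b) != q for b in blocks):
        return False
    for p1, p2 in combinations(points, 2):
        if sum(1 for b in blocks if p1 in b and p2 in b) != 1:
            return False
    return True

q = 3
points, lines = affine_plane_bibd(q)
w = len(points)
point_degrees = {sum(1 for b in lines if p in b) for p in points}
print(w, len(lines), point_degrees, is_bibd(points, lines, q))
```

For $q = 3$ this gives $w = 9$, $|X| = 12 = w(w-1)/(q(q-1))$, and every node of $Y$ has degree $4 = (w-1)/(q-1)$.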
Lemma 1 Let $G = (X, Y)$ be a $(w, q)$-BIBD. Consider a node $y \in Y$, and a subset $S \subseteq X$ such that any node in $S$ is adjacent to $y$. Let [...]

[...] that the remaining edges are evenly distributed among the outputs, so that each node of $U_i$ becomes adjacent to $n_i = 3m_{i-1}/m_i$ nodes of $U_{i-1}$. An efficient construction of such subgraphs is described in the appendix.
Mapping the HMOS onto the Mesh
The HMOS is physically mapped onto the mesh by storing each $i$-block in a distinct submesh of appropriate size. For some values of the parameter $\beta$, the number of 0-blocks exceeds the number of mesh nodes, hence a single mesh node must store more than one 0-block. There are [...] storing the component $(k-1)$-blocks of the $k$-block assigned to the $k$-submesh. [...] storing the $i$-block. Finally, the organization of the $3^k n^\beta$ 0-blocks depends on the parameter $\beta$. When $3^k n^\beta < n$, we assign each 0-block to a submesh of $t_0 = n^{1-\beta}/3^k$ nodes and evenly partition the contents of the block among these nodes. Otherwise, when $3^k n^\beta > n$ there are more 0-blocks than processors, so we assign $3^k n^{\beta-1}$ 0-blocks to each processor. In either case, each processor stores $r3^k m/n$ copies of variables.
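The two cases for the placement of the 0-blocks can be made concrete with a small calculation. The following sketch takes the number of 0-modules as an integer argument `m0` (an assumed stand-in for $|U_0|$) and assumes that all the divisions involved are exact; the function name and the sample values are hypothetical.

```python
def zero_block_layout(n, m, m0, r, k):
    """Describe how the 0-blocks are placed on an n-node mesh.

    n: mesh nodes, m: shared variables, m0: number of 0-modules,
    r: copies of each variable at level 0, k: top level of the HMOS.
    Integer division is used; the sketch assumes the quotients are exact.
    """
    zero_blocks = 3 ** k * m0              # 3^k replicas of each 0-module
    copies_per_processor = r * 3 ** k * m // n
    if zero_blocks < n:
        # Each 0-block is spread over a submesh of t_0 nodes.
        return {"case": "spread",
                "nodes_per_block": n // zero_blocks,
                "copies_per_processor": copies_per_processor}
    # Otherwise several 0-blocks share a processor.
    return {"case": "packed",
            "blocks_per_processor": zero_blocks // n,
            "copies_per_processor": copies_per_processor}

spread = zero_block_layout(n=576, m=4608, m0=16, r=3, k=2)
packed = zero_block_layout(n=576, m=4608, m0=256, r=3, k=2)
print(spread)
print(packed)
```

In both sample configurations each processor ends up with the same number of copies, $r3^k m/n = 216$, as stated in the text; only the granularity of the 0-blocks differs.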
The Access Protocol
Suppose that $m$ shared variables are distributed among the $n$ mesh nodes according to the HMOS. In this section, we present the protocol that realizes a parallel access to any $n$-tuple of variables, where each processor issues a read/write request for a distinct variable. (The case of concurrent accesses can be reduced to the case of exclusive accesses in time $O(\sqrt{n})$ by means of standard sorting-based techniques for leader election and data distribution [Lei92].) Let $S$ denote the set of variables to be accessed. The access protocol consists of a copy selection phase followed by a routing phase. In the first phase, a suitable set of copies for the variables in $S$ is chosen, so that accessing these copies will enforce data consistency and generate low memory contention and network congestion. In the subsequent phase, the selected copies are effectively accessed through an appropriate routing strategy. In case of read operations, the accessed data are returned to the requesting processors along the reverse routing paths.
Copy Selection Phase
Copy selection achieves the double objective of controlling both memory contention and network congestion by means of a single mechanism. The hierarchical structure of the HMOS provides a geographical distribution of the copies into nested regions of the network. By carefully limiting the number of copies that have to be accessed in any block at any level, we reduce the number of packets that will ever be routed to any such region, which allows us to adopt the efficient routing strategy illustrated in Section 2.
A New Consistency Rule
Recall from Section 3.1 that the $r3^k$ copies of a variable $v$ are associated with the source-sink paths of $H_v$, the subdag of $H$ induced by $v$ and by all of its descendants. Suppose that we want to read/write $v$. In order to guarantee consistency, the copies of $v$ to be accessed are selected according to a new rule, which extends the majority protocol of [UW87] to fit the structure of the HMOS. Specifically, we require that the selected copies form a target set, which is defined as follows. Let $C_v$ be a set of copies of $v$ and let $N(C_v)$ be the set of nodes of $H_v$ belonging to the source-sink paths associated with these copies. Recall that $r$ is odd, and let $\mu = (r + 1)/2$. [...]

An easy inductive argument shows that any two target sets for the same variable have nonempty intersection. Based on such a property we can guarantee consistency, that is, ensure that a read always returns the most updated value, as follows. As customary in any multi-copy approach, we equip each copy with a time-stamp, which is set to the current step whenever the copy is written. A read or write operation on a variable $v$ is simulated by accessing a target set of its copies. By the intersection property of target sets, the copies accessed for reading a variable $v$ must include at least one of the most recently written copies for $v$, which can be identified by looking for the most recent time-stamp. It should be noted that a target set contains only $\mu 2^k$ copies out of the $r3^k$ total copies of a variable. Therefore, unlike previous protocols, we maintain consistency by accessing far fewer than a majority of the copies.
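The intersection property and the time-stamp rule can be exercised on a simplified model. The sketch below rests on two assumptions that are not taken verbatim from the text: a target set is modeled as the set of copies obtained by choosing $(r+1)/2$ of the $r$ level-0 branches and then 2 of the 3 branches at each of the $k$ higher levels, and each copy is identified by its tuple of branch indices (so $H_v$ is treated as a tree). All function names are hypothetical.

```python
import random

def random_target_set(r, k, rng):
    """Pick a majority (r+1)/2 of the r level-0 branches, then 2 of the 3
    branches below every chosen node, for k further levels (assumed model)."""
    mu = (r + 1) // 2
    paths = [(j,) for j in rng.sample(range(r), mu)]
    for _ in range(k):
        paths = [p + (j,) for p in paths for j in rng.sample(range(3), 2)]
    return set(paths)

def write(store, target, value, step):
    """Stamp every copy in the target set with the current step."""
    for copy in target:
        store[copy] = (step, value)

def read(store, target):
    """Return the value carrying the most recent time-stamp in the target set.
    Copies never written count as having time-stamp -1."""
    stamped = [store.get(copy, (-1, None)) for copy in target]
    return max(stamped, key=lambda entry: entry[0])[1]

rng = random.Random(0)
r, k = 3, 3
store = {}
for step, value in enumerate(["a", "b", "c"]):
    write(store, random_target_set(r, k, rng), value, step)
latest = read(store, random_target_set(r, k, rng))
print(latest)
```

Because any two target sets in this model share at least one copy, the read returns the value of the last write regardless of which target sets are drawn, while touching only 16 of the 81 copies.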
The Selection Procedure
The copy selection phase determines a target set $C_v$ for each $v \in S$. This is accomplished in $k + 1$ iterations, numbered from 0 to $k$, during which the nodes of the $H_v$'s are marked in a top-down fashion from the sources to the sinks. More specifically, for every $H_v$, with $v \in S$, [...]

It is important to notice that a node of $H$, say an $i$-module $u$, may belong to several subdags $H_v$ corresponding to variables in $S$. During the copy selection procedure, we keep track of $u$ independently in each such subdag; hence, $u$ may end up marked in some of the subdags and unmarked in the others. Suppose that at the end of Iteration $i$, $u$ is marked in some subdags, and that a total of $h$ marked paths in these subdags reach $u$. This implies that at the end of copy selection there will be $h2^{k-i}$ marked paths that pass through $u$, that is, $h2^{k-i}$ copies in $\bigcup_{v \in S} C_v$ stored in $i$-blocks of $u$. In Iteration $i + 1$ the two successors of $u$ to be marked are chosen to be the same for all subdags in which $u$ is marked; hence, for each chosen successor node, $u$ will contribute $h$ paths to the total number of marked paths that will pass through that node at the end of the iteration. The main idea behind the copy selection phase is to control congestion in $(i+1)$-blocks by choosing the nodes to be marked in Iteration $i + 1$ in such a way as to keep the number of marked paths passing through each such node under some reasonable bound.
In order to describe the copy selection procedure in detail, we introduce the following notations:
Definition 4 For $0 \le i \le k$, $A_i \subseteq U_i$ denotes the set of $i$-modules that are marked in some $H_v$, with $v \in S$, during Iteration $i$. The modules in $A_i$ are called active modules.

Definition 5 The weight $w(u)$ of an active module $u \in A_i$ is the sum, over all variables $v \in S$, of the number of marked paths from $v$ to $u$ in $H_v$.

From the previous discussion we conclude that exactly $w(u)$ copies in $\bigcup_{v \in S} C_v$ will reside in each of the selected $2^{k-i}$ $i$-blocks of $u$, while the other $i$-blocks of $u$ will not contain any copy in $\bigcup_{v \in S} C_v$. Using the expansion properties of the HMOS, we will be able to guarantee suitably low values for the $w(u)$'s.
The actions performed by the copy selection procedure are reported below. Since Iteration 0 is different from the others, it is described separately. In order to understand the parameters used in Iteration 0, recall that we chose $(V, U_0)$ to have $(n/m, \epsilon, \mu)$-expansion, where $0 < \epsilon < 1$ and $\mu = (r + 1)/2$. In other words, the graph guarantees that for any set $S$ of $|S| \le n$ variables and any choice of $\mu$ 0-modules adjacent to each variable, the overall number of chosen 0-modules is at least $\sigma|S|^{1-\epsilon}$, for some constant $\sigma > 0$. Finally, recall that each processor is in charge of a distinct variable.

[...]

2. Let $CP_0$ denote an initially empty set. The following three substeps are executed until $\mu$ copy-packets for each variable are put in $CP_0$.

(a) Sorting: Sort all unmarked copy-packets by their third component.

[...]

The analysis will show that the set $CP_0$ is determined in at most $\log n + 1$ iterations of Step 2.

[...]

6. Assume that a leader processor $p_u$ receives $h \le 3$ module-packets back. If $h < 2$, then $p_u$ selects $2 - h$ extra module-packets from the $3 - h$ that were not chosen in the previous step. Otherwise, $p_u$ selects two module-packets among those received;

7. Let $[p_u, u, \gamma(u, j_1), w(u)]$ and $[p_u, u, \gamma(u, j_2), w(u)]$ be the two module-packets selected by $p_u$. Processor $p_u$ sends the names $\gamma(u, j_1)$ and $\gamma(u, j_2)$ to the processors storing the copy-packets in its group;

[...]

When Iteration $k$ terminates, for each $v \in S$, $p_v$ determines the set $C_v$ of copies to be accessed as those corresponding to the $\mu 2^k$ source-sink marked paths in $H_v$. It is easily seen that for each $v \in S$, the set $C_v$ computed by the copy selection procedure is indeed a target set for $v$.
Analysis of the Selection Procedure
We now determine the running time of the selection procedure described above. 
Lemma 2 After $\log n + 1$ executions of Step 2 in Iteration 0, the set $CP_0$ contains exactly $\mu n$ copy-packets. Moreover, for each active 0-module $u \in A_0$, $w(u) \le (2r/\sigma)n^\epsilon$.
Proof: Let $S_j$ be the number of variables for which fewer than $\mu$ copy-packets have been selected by the end of the $j$-th execution of Step 2. For the sake of convenience, set $S_0 = |S| = n$. We now show by induction that

$$S_j \le \frac{n}{2^j},$$

for any $j \ge 0$, and this will imply that for $T = \log n + 1$, $S_T = 0$. The inequality for $S_0$ is immediate, establishing the basis. Assume that the inequality holds for $j - 1$, and suppose, for a contradiction, that $S_j > n/2^j$. By using the expansion properties of $(V, U_0)$, it is easy to see that in the $j$-th execution of the selection substep, at least $\sigma S_j^{1-\epsilon}$ 0-modules are addressed by unmarked copy-packets. All such 0-modules must have been congested in the previous iteration (i.e., addressed by more than $(2r/\sigma)n^\epsilon$ copy-packets), which accounts for a total of at least $2r n^\epsilon S_j^{1-\epsilon} > rS_{j-1}$ unmarked copy-packets involved in that iteration. However, this is impossible since $S_{j-1}$ variables account for at most $rS_{j-1}$ copy-packets.
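The chain of inequalities behind the contradiction can be spelled out as follows. This is a sketch of the missing step, writing $\sigma$ for the constant hidden in the expansion bound (a symbol assumed here) and using $S_j \le n$, the contradiction hypothesis $S_j > n/2^j$, and the inductive hypothesis $S_{j-1} \le n/2^{j-1}$:

```latex
\[
  \sigma S_j^{1-\epsilon} \cdot \frac{2r}{\sigma}\, n^{\epsilon}
  \;=\; 2r\, n^{\epsilon} S_j^{1-\epsilon}
  \;\ge\; 2r\, S_j^{\epsilon} S_j^{1-\epsilon}
  \;=\; 2r\, S_j
  \;>\; 2r\, \frac{n}{2^{j}}
  \;=\; r\, \frac{n}{2^{j-1}}
  \;\ge\; r\, S_{j-1}.
\]
```

The left-hand side is a lower bound on the number of unmarked copy-packets addressed to the congested 0-modules, so it exceeds the $rS_{j-1}$ copy-packets that the $S_{j-1}$ unfinished variables can supply.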
The bound on $w(u)$ is easily established by observing that for each 0-module $u$, all copy-packets with third component $u$ that are added to $CP_0$ are marked during the same iteration of Step 2. Therefore

$$w(u) \le \frac{2r}{\sigma}\, n^\epsilon.$$
$\Box$

It must be remarked that the copy selection phase can be improved in a number of ways to obtain a faster running time at the expense of a more complex implementation. However, to avoid further complications to the presentation, we chose to describe a simpler yet slightly less efficient implementation, since, as shown in Section 5, its complexity does not influence the overall running time of the access protocol.
To complete the analysis, it remains to establish the bound on the weight $w(u)$ of any $u \in A_i$, at the end of Iteration $i$. Recall that the sum of the multiplicities of the copy-packets in $CP_i$ with third component $u$ yields $w(u)$. Therefore, for $0 \le i \le k$, [...]

Suppose that the inequality holds for $i - 1$ and let $x$ be an $i$-module. The weight of $x$ is determined by Steps 4 and 6 of Iteration $i$. More precisely, recall that $P_x$ is the set of module-packets of kind $[p_u, u, x, w(u)]$ selected in Step 4, and let $P'$ [...]

After copy selection is completed, the copies in $\bigcup_{v \in S} C_v$ have to be accessed. Each request is encapsulated in a distinct packet, routed from the requesting processor (origin) to the processor storing the copy (destination), and back to the origin. The idea is to route the packets in stages so that they are moved gradually closer to their destinations through smaller and smaller submeshes, in accordance with the tessellations defined on the mesh. As argued in Section 2, when the number of packets destined for any submesh is not too large, such a strategy yields more profitable results than sending the packets directly to their destinations.
The origin-destination part of a packet's journey consists of $k + 2$ routing stages, numbered from $k + 1$ down to 0. Stage $i$, with $k + 1 \ge i \ge 1$, is executed in parallel and independently in every $i$-submesh (here, the whole mesh is viewed as a $(k+1)$-submesh). In this stage the packets are routed to arbitrary positions in the $(i-1)$-submeshes hosting their destination $(i-1)$-blocks, in such a way that the processors of each submesh receive approximately the same number of packets. This can be achieved by first sorting the packets according to their destination submeshes, and then ranking the packets destined for the same submesh. Observe that when 3 k n < n, a 0-block is assigned to a 0-submesh of t 0 = n 1? =3 k nodes. By the end of Stage 1, each packet reaches a processor within its destination 0-submesh, and in Stage 0 it is sent to its final destination. Instead, when 3 k n n, there are n ?1 3 k 0-blocks stored within a single processor, hence each packet is at its final destination by the end of Stage 1, and Stage 0 is not needed. In either case, once the packet reaches its final destination, the request it carries is satisfied.
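The sort-then-rank idea used within each stage can be illustrated sequentially. The sketch below is not the mesh algorithm itself: it is a plain single-machine illustration, and the function name `balance_packets`, its parameters, and the round-robin use of the rank are assumptions made for the example. It shows how ranking packets with the same destination submesh spreads them almost evenly over the processors of that submesh.

```python
from collections import defaultdict


def balance_packets(dest_submesh, procs_per_submesh):
    """Assign each packet a processor slot inside its destination submesh.

    dest_submesh[p] is the destination submesh id of packet p.
    Returns a dict: packet index -> (submesh id, processor index).
    """
    # Step 1: sort packet indices by destination submesh.
    order = sorted(range(len(dest_submesh)), key=lambda p: dest_submesh[p])

    # Step 2: rank packets that share a destination, then spread the
    # ranks round-robin over the processors of that submesh.
    next_rank = defaultdict(int)
    placement = {}
    for p in order:
        s = dest_submesh[p]
        rank = next_rank[s]
        next_rank[s] += 1
        placement[p] = (s, rank % procs_per_submesh)
    return placement


if __name__ == "__main__":
    print(balance_packets([0, 1, 0, 0, 1, 0, 0], procs_per_submesh=2))
```

With this assignment, the loads of two processors in the same submesh differ by at most one packet, which is the balance property the stage relies on.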
In order to estimate the time complexity of the above protocol, we need to determine the maximum number of packets sent and received by any processor in each stage. More formally, let i , for $k + 1 \ge i \ge 0$, denote the maximum number of packets held by any processor at the beginning of Stage $i$. Let also ?1 be the maximum number of packets received by a processor at the end of Stage 0, when such a stage is needed (i.e., when 3 k n < n). ; for $k \ge i \ge 0$:
When 3 k n < n, we also have ?1 = O ( n ) :
Proof: The statement is immediately evident for k+1 , since every target set contains $2^k$ copies. By Lemma 3, an $i$-block is addressed by at most c 2 i n 1?(1? )=2 i packets, for $k \ge i \ge 1$.
Since there are t i = 3 i?k n 1? =2 i processors storing an $i$-block, we have Furthermore, recall that $m = |V|$ = n , for some constant > 1 and that = (r + 1)=2.
The goal of this section is to determine suitable values of the above parameters that guarantee the existence of the graph $(V, U_0)$ and yield good performance of the access protocol.
Such performance is closely related to the redundancy of the HMOS, that is, the number of copies ($r3^k$) used per variable. On the one hand, using many copies per variable yields better access times, while, on the other, lower redundancy yields simpler and more space-efficient schemes. We will consider two scenarios: in the first scenario, we optimize the parameters under the assumption that the number of copies $r3^k$ for each variable can grow arbitrarily large. In the second scenario, we optimize under the restriction that the scheme uses no more than a constant number of copies for each variable. We need the following technical result, which is a straightforward adaptation of Lemma 4 in [PP97a]:
Lemma 6 Let m = n , with constant > 1. There is a suitable constant $c > 0$ such that for any odd constant r c log , a random bipartite graph $G = (V, U_0)$ with $|V| = m$, $|U_0| = n$, input degree $r$ and output degree $mr/n$ has (n=m; ; )-expansion with = (r + 1)=2 and = ( ? 1)= , with high probability.
We are now ready to prove one of the main results of this paper, which was stated in Subsection 1.1.
Proof of Theorem 1: We fix = 1 and choose $r$ to be the smallest odd integer greater than maxfc log ; 6( ? 1)g. For such values, Lemma 6 ensures the existence of $(V, U_0)$ with = , whence T = O n 1=2+ and R = r3 k = O 1= log 3 2 = O ? 1= 1:59 . By instead fixing $k = \log_2 \log_2 n + O(1)$, so that $2^{k+1} \ge \log_2 n$, we have $T = O(n^{1/2} \log n)$ and $R = O(\log^{1.59} n)$. □
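The second parameter choice in the proof can be made concrete with a small calculation. The sketch below is illustrative only: the helper names `choose_k` and `copies_per_variable` are hypothetical, and picking the smallest admissible $k$ is an assumption (the proof only requires $k = \log_2 \log_2 n + O(1)$). It shows why the redundancy $r3^k$ stays within $O(\log^{1.59} n)$, since $3^k = (2^k)^{\log_2 3}$ and $\log_2 3 \approx 1.585$.

```python
import math


def choose_k(n: int) -> int:
    """Smallest k >= 0 such that 2**(k + 1) >= log2(n)."""
    target = math.log2(n)
    k = 0
    while 2 ** (k + 1) < target:
        k += 1
    return k


def copies_per_variable(n: int, r: int) -> int:
    """Redundancy r * 3**k for the k returned by choose_k."""
    return r * 3 ** choose_k(n)


if __name__ == "__main__":
    for exp in (8, 16, 32, 64):
        n = 2 ** exp
        bound = math.log2(n) ** math.log2(3)
        print(exp, choose_k(n), 3 ** choose_k(n), round(bound, 1))
```

For this choice of $k$ we have $2^k < \log_2 n$ whenever $k \ge 1$, hence $3^k < (\log_2 n)^{\log_2 3}$.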
As already noted, the HMOS underlying the above result is fully constructive, except for the first graph $(V, U_0)$, for which Lemma 6 only guarantees existence. In practice, one can resort to a random graph for $(V, U_0)$, which, as the lemma shows, will exhibit the required expansion property with high probability. Although no explicit construction for $(V, U_0)$ is known in the general case, this graph needs only weak expansion, which makes it more amenable to explicit constructions than the graphs employed in previous schemes (e.g., [UW87, AHMP87]).
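As a rough illustration of the random-graph option, the sketch below draws a bipartite graph with $m$ inputs of degree $r$ and $n$ outputs of degree $mr/n$. The sampling procedure (shuffling a list of output "stubs") is an assumption chosen for simplicity and is not taken from the lemma; the function name `random_bipartite` is hypothetical. The sketch may place two copies of a variable in the same module, and it does not verify the expansion property.

```python
import random


def random_bipartite(m: int, n: int, r: int, seed: int = 0):
    """Random bipartite multigraph with m inputs and n outputs.

    Every input has degree r and every output has degree m*r/n.
    Returns a list: neighbours[v] is the list of r outputs of input v.
    """
    if (m * r) % n != 0:
        raise ValueError("m * r must be divisible by n")
    out_degree = m * r // n

    # One stub per edge endpoint on the output side.
    stubs = [u for u in range(n) for _ in range(out_degree)]
    rng = random.Random(seed)
    rng.shuffle(stubs)

    # Input v takes r consecutive stubs of the shuffled list.
    return [stubs[v * r:(v + 1) * r] for v in range(m)]


if __name__ == "__main__":
    graph = random_bipartite(m=12, n=6, r=3)
    print(graph)
```

Each input corresponds to a shared variable and each output to a 0-module, so `neighbours[v]` lists the modules that would hold the copies of variable `v` at the first level.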
In fact, an explicit construction for $(V, U_0)$ can be obtained when the shared memory size $m$ is within certain ranges. For example, [PP97] shows how to construct a bipartite graph with $m = n^{3/2}$ inputs, $n$ outputs and input degree $r = 3$, which has $(n/m, 1/3, 2)$-expansion. This graph can be efficiently represented using constant storage per node. Thus, using this graph as $(V, U_0)$ when $m = n^{3/2}$, the result of Theorem 1 still holds and the HMOS becomes fully constructive.
A larger range of values for $m$ for which the HMOS can be made fully constructive, still yielding nontrivial performance, can be obtained by employing other graphs for $(V, U_0)$. This is shown below, thus proving Theorem 2, which was stated in the introduction.
Proof of Theorem 2: Let us consider first the case 3=2. We assume $m = x(x-1)/6$, where $x$ is an even power of three. The argument for different values of $m$ requires only trivial modifications. Fix n = x = n =2 and choose $(V, U_0)$ as an (n ; 3)-BIBD. By Corollary 1, such a graph has (1; ; 2)-expansion, with = 1=2. Since 2 + 3 ?3 = O (1= log n), the complexity of the access protocol is still given by Equation (3), and the same argument used to prove Theorem 1 carries through. Consider now the range 3=2 < 13=6 and choose n and $(V, U_0)$ as before. By plugging = 1=2 and n = n =2 in the complexity formula given in Theorem 3 and choosing $k = O(1)$ large enough and even, the complexity of the access protocol becomes
Note that the access time of the constructive scheme tends to $O(n)$ as $m$ approaches $n^{9/2}$, a performance that can be obtained through a straightforward scheme.
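To make the block count $m = x(x-1)/6$ tangible, the sketch below builds one standard design with block size three on $x = 3^d$ points: the lines of the affine space over the field with three elements, where the third point of the block through $a$ and $b$ is $-(a+b)$ taken coordinate-wise modulo 3. This is a well-known construction used here purely as an illustration; it is an assumption that it matches the design intended in the proof, and the name `affine_triples` is hypothetical.

```python
from itertools import combinations, product


def affine_triples(d: int):
    """Points of F_3^d and the triples {a, b, c} with a + b + c = 0.

    Every pair of distinct points lies in exactly one triple, so the
    triples form a design with block size 3 on 3**d points.
    """
    points = list(product(range(3), repeat=d))
    blocks = set()
    for a, b in combinations(points, 2):
        # Third point of the line through a and b.
        c = tuple((-(x + y)) % 3 for x, y in zip(a, b))
        blocks.add(frozenset((a, b, c)))
    return points, blocks


if __name__ == "__main__":
    points, blocks = affine_triples(2)
    x = len(points)
    print(x, len(blocks), x * (x - 1) // 6)
```

Reading blocks as variables and points as modules gives a bipartite graph with input degree 3 in which every module receives $(x-1)/2$ copies.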
Extension to Other Architectures
A closer look at the access protocol developed in the previous sections for the mesh reveals that it relies solely upon a recursive decomposition of the network into subnetworks of the same type, and upon $\ell$-sorting and $(\ell_1, \ell_2)$-routing primitives. As a consequence, our scheme can be ported to any network topology that exhibits a suitable decomposition into subnetworks, and for which an efficient implementation of the above primitives is available. In this section we briefly discuss the porting of the scheme to the pruned butterfly and to multi-dimensional meshes.
An $n$-leaf pruned butterfly, introduced in [BB95], is a variant of Leiserson's fat-tree [Lei85]. Its coarse structure may be interpreted as an $n$-leaf complete binary tree where the leaves represent the processor-memory nodes of the machine, the internal nodes represent clusters of routing switches, and the edges represent channels whose bandwidth doubles every other level from the leaves to the root. More precisely, each subtree of $n'$ leaves is connected to its parent through a channel of capacity $\sqrt{n'}$. The pruned butterfly is an important interconnection since it is area-universal, in the sense that it can route any set of messages almost as efficiently as any circuit of similar area.
It follows from the definition that an $n$-leaf pruned butterfly can be decomposed into $4^i$ $(n/4^i)$-leaf pruned butterflies connected through channels of capacity $\sqrt{n/4^i}$, a decomposition similar to the one of the mesh employed in our scheme. Moreover, it is shown in [HPP95] that $\ell$-sorting and $(\ell_1, \ell_2)$-routing can be performed on the pruned butterfly in the same running time as on the mesh. This immediately implies that both Theorems 1 and 2 also hold for the pruned butterfly.
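The decomposition just described reduces to simple arithmetic. The sketch below assumes that $n$ is a power of four, so that all quantities are integers, and uses the hypothetical helper name `pruned_butterfly_parts`; it merely tabulates the number of sub-butterflies, their size, and the capacity of the channel attached to each.

```python
import math


def pruned_butterfly_parts(n: int, i: int):
    """Return (number of parts, leaves per part, channel capacity).

    Level i splits an n-leaf pruned butterfly into 4**i parts with
    n / 4**i leaves each; a part with L leaves has a channel of
    capacity sqrt(L) towards its parent.
    """
    parts = 4 ** i
    if n % parts != 0:
        raise ValueError("n must be divisible by 4**i")
    leaves = n // parts
    capacity = math.isqrt(leaves)  # exact when n is a power of four
    return parts, leaves, capacity


if __name__ == "__main__":
    for i in range(4):
        print(i, pruned_butterfly_parts(4096, i))
```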
We now consider the extension of the scheme to $d$-dimensional meshes, with $d$ constant. It has to be remarked that the bandwidth of a $d$-dimensional mesh increases with $d$; hence, in order to achieve access time close to the natural $n^{1/d}$ lower bound, the expansion required of $(V, U_0)$ must also increase with $d$. For this reason, the graphs for which an explicit construction is currently available do not exhibit sufficient expansion to grant a generalization of Theorem 2; however, they can still be used to yield fully constructive schemes with nontrivial O n 1=d+ d access time, for suitable constants d < (d?1)=d. The details follow from tedious yet trivial arithmetic manipulations, which are omitted for the sake of brevity.
Conclusions
In this paper, we devised a scheme for implementing a shared address space on a mesh of processor/memory pairs. The scheme enables the processors to read/write any $n$-tuple of shared variables concurrently and yields a quasi-optimal access time in the worst case. One of the most relevant novelties of our implementation is the hierarchical memory organization scheme, the HMOS, which provides a structured distribution of copies of the shared variables among the memory modules. In particular, the HMOS attains the following objectives, which were not attained by the memory organizations known in the literature: (i) it provides a single mechanism to cope with both memory contention and network congestion, so that copy selection can be employed to reduce both; (ii) it yields fast access time by using a cascade of bipartite graphs with weak expansion, rather than one graph of maximum expansion, which greatly simplifies the implementation. Indeed, the HMOS is fully constructive and yields quasi-optimal performance for any memory size $m = O(n^{3/2})$, which is sufficient, for example, to run any NC algorithm. For larger memory sizes, the HMOS embodies only one nonconstructive graph of weak expansion.
The design of the HMOS is not specifically cast for the mesh topology. We showed that it can be implemented on the pruned butterfly and on $d$-dimensional meshes, yielding good performance. More generally, our scheme is efficiently portable to any low-bandwidth interconnection where routing takes advantage of partitions of the processors into subnetworks, in the sense that it achieves higher performance by moving messages gradually closer to their destinations through smaller and smaller subnetworks, rather than by sending them directly to their destinations. A challenging and long-standing open problem remains the construction of bipartite graphs that exhibit good expansion. The availability of explicit constructions and concise representations for such graphs is crucial for attaining simple and efficient deterministic shared memory implementations for all memory sizes. Recent developments in this area [PP97] seem to indicate that the construction of graphs with a linear number of edges and moderate expansion, such as those required in our scheme, is easier than the construction of the highly expanding graphs used in previous schemes. If this is true, our scheme could become a general and constructive tool for the implementation of shared memory on distributed memory machines based on low-bandwidth interconnections.
Finally, we wish to point out that in a recent paper [HPP95], which appeared after the results in the present paper were first presented [PPS94, PP95], a shared memory implementation scheme for the mesh is devised that, through a novel and complex protocol, achieves O ?p n log n access time. However, this scheme relies on a nonconstructive graph of maximum expansion, hence it suffers from the same limitations affecting other schemes in the literature, as discussed in the introduction. The paper also proves an p n log(m=n 2 )= log log(m=n 2 ) lower bound on the access time of any deterministic scheme for implementing m = ? n 2 shared variables. The lower bound assumes that variables are accessed through a point-to-point protocol, which requires that a processor dispatch a separate message for each copy it wants to update. The assumption is satisfied by the scheme presented in this paper, which implies that our access time is only a sublogarithmic factor away from optimal.
$X_3 = \{(h, A, B) : h = \ell,\ 0 \le A < z,\ B = w\}.$
It is easy to verify that $|X_1| + |X_2| + |X_3| = m$.
Proof: Let $u$ be associated with the vector $(a_{d-1}, \ldots, a_0)$. We determine the value of by separately counting the contributions of the nodes in the three subsets $X_1$, $X_2$ and $X_3$. Consider $X_1$ and fix $h < \ell$. Using the properties of field operations, one can easily show that for any $B$, $0 \le B < q^h$, there exists exactly one value $A$ such that the node $(h, A, B)$ is connected to $u$. Therefore, there are exactly $\sum_{h=0}^{\ell-1} q^h = (q^\ell - 1)/(q - 1)$ nodes of $X_1$ connected to $u$. A similar argument shows that exactly $w$ nodes of $X_2$ are connected to $u$. Finally, it can be seen that the $z$ nodes of $X_3$ are connected to $qz$ distinct output nodes; therefore, according to whether $u$ is one of such nodes or not, we know that either = $(q^\ell - 1)/(q - 1) + w$ or = $(q^\ell - 1)/(q - 1) + w + 1$. By (4) we conclude that qm q d qm q d :
Note that when $m$ is a power of $q$, it must be $z = 0$ and therefore = $qm/q^d$ for every output node.
In [PP93] it is proved that the above rule is correct, i.e., no two $(i-1)$-blocks of $(i-1)$-modules are assigned the same location within the same $i$-module. Moreover, it is not difficult to show that 0 `< .
Observe that the structure of any $(U_{i-1}, U_i)$ is completely determined by the parameter $d_i$. Since each $d_i$ can be derived from $n$, we conclude that, in order to represent $(U_{i-1}, U_i)$, a processor only needs to know $n$. From this parameter, the processor can determine the exact location of any copy of any $(i-1)$-module by performing $O(\log n)$ operations (arithmetic or in $\mathbb{F}_3$).
