Abstract. We present three explicit schemes for distributing M variables among N memory modules, where $M = \Theta(N^{1.5})$, $M = \Theta(N^2)$, and $M = \Theta(N^3)$, respectively. Each variable is replicated into a constant number of copies stored in distinct modules. We show that N processors, directly accessing the memories through a complete interconnection, can read/write any set of N variables in worst-case time $O(N^{1/3})$, $O(N^{1/2})$, and $O(N^{2/3})$, respectively, for the three schemes. The access times for the last two schemes are optimal with respect to the particular redundancy values used by such schemes. The address computation can be carried out efficiently by each processor without recourse to a complete memory map and requiring only O(1) internal storage.
1. Introduction
Consider a parallel system with N processors and N memory modules collectively storing M > N variables that are available for access by the processors. A scheme is sought to distribute the variables among the modules so that any set of N variables can be efficiently accessed by the processors in parallel. This problem, originally referred to as the granularity problem, naturally arises in the design and implementation of parallel systems (such as PRAMs and parallel databases) and has received considerable attention in the literature. An early survey [11] quotes 14 papers that deal with some special cases. More recently, it has become the main focus of the large body of work concerning the simulation of the PRAM on more feasible machines.

* This paper was partially supported by NSF Grants CCR-91-96152 and CCR-94-00232, by ONR Contract N00014-91-J-4052, ARPA Order 8225, and by the ESPRIT III Basic Research Programme of the EC under Contract No. 9072 (Project GEPPCOM). Results reported here were presented in preliminary form at the 10th Symposium on Theoretical Aspects of Computer Science (Würzburg, Germany, 1993), and at the 5th ACM Symposium on Parallel Algorithms and Architectures (Velen, Germany, 1993).
Often, the problem is studied on a synchronous system where each processor is directly connected to all the memories, and each memory module is able to fulfill at most one access request (read/write) per time unit. This model is known as the Module Parallel Computer (MPC) [13]. Thus, the time needed to access a set of variables is mainly determined by the maximum number of requests that a single module must fulfill. With this modeling, one can focus on reducing memory congestion without dealing with routing problems, which arise when processors and memories are connected through a Bounded-Degree Network (BDN).
A number of efficient randomized schemes have been developed for both the MPC and BDNs, based on the use of universal classes of hash functions to distribute the variables among the modules. It has been shown that N variables can be accessed in O(log N) time on a BDN [17], and in sublogarithmic time on an MPC [4], with high probability. On the other hand, the development of efficient deterministic schemes, which is the focus of this paper, appears to be much harder. The pioneering work of Mehlhorn and Vishkin [13] introduced the idea of representing each variable by several copies so that a read operation needs to access only one (the most convenient) copy. This is necessary to avoid the worst case when all the requests are addressed to the same module. For $M \in O(N^r)$, they present a memory organization scheme for the MPC that uses r copies per variable and allows a set of N read requests to be satisfied in time $O(rN^{1-1/r})$. However, this use of the copies penalizes the execution of write operations where all the copies of the variables must be accessed, thus requiring O(rN) time in the worst case.
Later, Upfal and Wigderson [19] proposed a more balanced use of multiple copies exploiting the majority concept previously adopted for databases [5], [18]. Each variable is represented by r copies, where r is called the redundancy of the scheme. Each copy contains the value of the variable and a timestamp indicating the last time that particular copy has been accessed. A read/write operation needs to access only a majority $\lfloor r/2 \rfloor + 1$ of the copies to assure that the most recent value of the variable is always retrieved. The assignment of the copies to the memory modules is governed by a bipartite graph G = (V, U; E), where V denotes the set of variables, U the set of modules, and r edges connect each variable to the modules that store its copies. For M polynomial in N and $r \in \Theta(\log N)$, Upfal and Wigderson show that there exists a graph G, with suitable expansion, which allows the MPC processors to access any N variables in $O(\log N (\log\log N)^2)$ worst-case time. They do not provide an explicit construction for G but show that a random graph exhibits the desired property, with high probability. The access time was later improved to O(log N) in [1].
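The majority rule can be illustrated with a small sketch. The names `Copy`, `write`, and `read` are hypothetical and chosen only for this illustration; the sketch assumes that a timestamp is a plain integer supplied by the caller and that the set of reachable copies is given explicitly.

```python
from dataclasses import dataclass

@dataclass
class Copy:
    value: object = None
    timestamp: int = 0  # assumption: integer logical time of the last write

def majority(r):
    """Number of copies that must be accessed: floor(r/2) + 1."""
    return r // 2 + 1

def write(copies, reachable, value, now):
    """Write `value` with timestamp `now` to the reachable copies (a majority)."""
    if len(reachable) < majority(len(copies)):
        raise ValueError("a majority of the copies must be accessed")
    for i in reachable:
        copies[i] = Copy(value, now)

def read(copies, reachable):
    """Return the value carried by the most recent timestamp among a majority."""
    if len(reachable) < majority(len(copies)):
        raise ValueError("a majority of the copies must be accessed")
    newest = max((copies[i] for i in reachable), key=lambda c: c.timestamp)
    return newest.value

copies = [Copy() for _ in range(5)]      # redundancy r = 5
write(copies, [0, 1, 2], "a", now=1)     # first write touches copies 0, 1, 2
write(copies, [2, 3, 4], "b", now=2)     # second write touches copies 2, 3, 4
print(read(copies, [0, 1, 2]))           # copy 2 carries the newest timestamp
```

Any two majorities of the same r copies intersect, which is why the read above still finds a copy written by the second write.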
Subsequently, several authors have adopted a similar framework to devise schemes for specific BDNs, trading the advantage of using more practical interconnections with a sublogarithmic increase in the access time [7], [12], [8], [9]. It has to be noted that all these schemes, as well as those for the MPC, aim at a fast access time and, for this purpose, need logarithmic redundancy. In [2] it is shown how to reduce the amount of global redundancy to a constant by using a suitable coding of the PRAM memory, at the expense of a more involved access protocol. In [16], instead, the redundancy is regarded as a parameter and a class of schemes is devised, generalizing the one of [19]. The access time associated with each scheme is expressed as a function of the redundancy, showing a close relationship between these two quantities.
All the schemes presented in the aforementioned papers rely on expanding graphs for which no efficient implementation is known, other than resorting to a random graph. This represents the basic shortcoming (maybe fatal from the practical standpoint) of this class of approaches, for the following reasons. No efficient way is known of testing the expansion property of a random graph. As pointed out in [15] , the only known technique, based on the second eigenvalue of a certain matrix related to the adjacency matrix, cannot be applied when the sizes of the two sets V and U differ by more than a constant factor, which is the case for nontrivial memory organizations. Furthermore, the representation of the memory map poses substantial implementation problems. How can a processor determine, for any variable, the modules storing its copies and the physical address of each copy within its module? The hypothesis of a complete memory map stored internally in each processor appears eminently impractical due to memory blowup. On the other hand, the approach proposed in [8] , where the memory map is distributed among the processors with only polylogarithmic memory blow-up, has a rather involved implementation, which makes it less attractive for practical applications.
In this paper we present three schemes to distribute $\Theta(N^{1.5})$, $\Theta(N^2)$, and $\Theta(N^3)$ variables, respectively, among the N modules of the MPC. The schemes are presented in a framework similar to the one of [19]; however, the graphs used for the distribution of the variables are given explicitly and the redundancy, in all three cases, is a small constant. To read/write any given set of N variables, a simple access protocol is provided, and its worst-case performance is analyzed exploiting the expansion properties of the graphs. The results are summarized in Table 1, where for each scheme, identified by the number of variables M, we indicate the redundancy r, the time to satisfy N worst-case data requests, and the storage required in each processor to represent the memory map.
The important feature of these schemes is represented by the construction and implementation of the bipartite graphs governing the variable distribution, and the analysis of their expansion properties. The graphs are constructed by associating the two node sets, V and U, with certain quotients of $PGL_2(q^n)$ (the group of nonsingular $2 \times 2$ matrices over the field $\mathbb{F}_{q^n}$, modulo its center), or with suitably chosen subsets of them. The edges are then defined between cosets with nonempty intersection. This technique was introduced in [14] for the construction of bounded concentrators, which are bipartite graphs whose node sets have almost equal size. It has the advantage of providing the graphs with a rich algebraic structure and a remarkable "isotropy" which make them very attractive for a number of applications. We exploit such structures to determine their expansion properties and to devise an efficient implementation.
The relevance of our schemes is twofold: (1) They are the first constructive approaches known to achieve sublinear worst-case access time for both read and write operations. (2) Their implementation is simple and involves only elementary algebra; in particular, a processor can efficiently determine the physical location of any copy with a limited use of resources. Although the time performance appears less attractive than that of the nonconstructive schemes cited before, it must be pointed out that this is essentially caused by the use of constant redundancy (which is desirable from a practical standpoint).
In [16] it is shown that, by using a fixed number r of copies per variable, the time complexity of any memory organization scheme is at least $\Omega(\min\{N/r, (M/N)^{1/(\lfloor r/2 \rfloor + 1)}\})$, for $M \in \Omega(N^{1+\epsilon})$. This result proves that our schemes for $M = \Theta(N^2)$ and $M = \Theta(N^3)$ are optimal, with respect to the specific values of the redundancy that they use.

The rest of the paper is organized as follows. In Section 2, we describe the basic structure of our schemes and present the access protocol used to satisfy a set of variable requests. The protocol's running time is given in terms of the expansion property of the graph governing the variable distribution. Section 3 introduces most of the notations and background facts concerning finite fields and the group $PGL_2(q^n)$, which are used throughout the rest of the paper. Sections 4 and 5 present the three schemes by defining the underlying graphs and studying their structural properties. In particular, for each graph we determine its expansion and provide a suitable representation that allows a processor to calculate the addresses of the copies of any given variable efficiently. For convenience, a number of technical facts, needed for the implementation of the first scheme, are reported in the Appendix.
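The optimality claim can be checked with a few lines of exact arithmetic. The sketch below takes the lower bound of [16] in the form $(M/N)^{1/(\lfloor r/2 \rfloor + 1)}$ (for constant r the term N/r is linear in N and does not attain the minimum) and compares exponents of N. The function name is hypothetical, and the redundancy r = 3 assumed for the first scheme is an assumption of this sketch; the values r = 3 and r = 5 for the other two schemes correspond to the choices q = 3 and q = 5 made later in the paper.

```python
from fractions import Fraction

def lower_bound_exponent(m_exp, r):
    """Exponent e with (M/N)^(1/(floor(r/2)+1)) = N^e when M = N^m_exp."""
    return (Fraction(m_exp) - 1) / (r // 2 + 1)

# (exponent of M, redundancy r, exponent of the access time stated in the abstract)
schemes = [
    (Fraction(3, 2), 3, Fraction(1, 3)),  # r = 3 is an assumption for this scheme
    (Fraction(2), 3, Fraction(1, 2)),
    (Fraction(3), 5, Fraction(2, 3)),
]

for m_exp, r, achieved in schemes:
    bound = lower_bound_exponent(m_exp, r)
    status = "matches the bound" if bound == achieved else "leaves a gap"
    print(f"M = N^({m_exp}), r = {r}: bound N^({bound}), time N^({achieved}), {status}")
```

Under these assumptions the second and third schemes meet the bound, while the first leaves a gap between $N^{1/4}$ and $N^{1/3}$, in agreement with the statement that only the last two schemes are optimal.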
2. Framework
As mentioned in the introduction, the parallel model used for our memory organization schemes is the MPC, consisting of N processors and N memory modules fully interconnected. (Equivalently, one may think of each module as being assigned to a distinct processor, and of each processor as being directly connected to every other processor.) In one MPC step, each processor can send one read/write request to any module, and each module satisfies one request arbitrarily selected among the incoming ones (if any). Thus, in order to guarantee efficient parallel access to a set of variables, we must ensure that the variables are well spread among the modules.
In order to distribute M > N shared variables among the N modules, we adopt the standard approach originally proposed in [19], based on a Memory Organization Scheme (MOS) structured as follows. Each variable is replicated into r copies, r odd, stored in distinct modules, only a majority $\lfloor r/2 \rfloor + 1$ of which need to be accessed to perform a read/write operation. Each copy is provided with a timestamp which is updated every time the copy is written, so that a majority of the copies is always guaranteed to contain at least one most recently updated copy. The distribution of the copies of the variables among the modules is governed by a bipartite graph G = (V, U; E), where V represents the set of variables, U the set of modules, and r edges connect each variable to the modules that store its copies.

Suppose each processor issues an access (read/write) request for a distinct variable. (The case of fewer processors issuing requests can be handled with minor modifications, obtaining the same access time, where N is replaced by the actual number of requests. The case of multiple requests for the same variable also requires minor changes, and introduces only an additive logarithmic term in the access time.) In order to satisfy the requests, the following protocol is executed by the processors, in parallel. The N processors are subdivided into N/r clusters, with r processors per cluster. Let P(i, j) denote the jth processor in cluster i, and let v(i, j) denote the variable it wants to access, for $1 \le i \le N/r$ and $1 \le j \le r$. The protocol consists of r phases. In Phase k the processors of each cluster cooperate to access the variable requested by their kth companion. More specifically, processor P(i, j) is in charge of the jth copy of v(i, k), for any i and j. A number of iterations are executed. In each iteration every processor tries to access its assigned copy unless it previously succeeded, or $\lfloor r/2 \rfloor + 1$ other copies of the same variable have already been accessed.
Since a memory module can satisfy at most one request per iteration, the number of copies accessed in one iteration is equal to the number of modules receiving requests in that iteration. At any point during the execution of a given phase, a copy is said to be alive if it has not been accessed yet; a variable is said to be alive if fewer than $\lfloor r/2 \rfloor + 1$ of its copies have been accessed, which implies, since r is odd, that at least $\lfloor r/2 \rfloor + 1$ of its copies are still alive. (This terminology is used only for variables and copies requested in the phase under consideration.) The code for the entire protocol is shown in Figure 1. For each variable, a flag is used to indicate whether the variable is alive or not.
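A single phase of the protocol can be simulated sequentially as follows. This is only a sketch under stated assumptions, not the code of Figure 1: the function name `run_phase` and the dictionary `placement` are hypothetical, the list of requested variables is assumed nonempty, and a module that receives several requests in one iteration is assumed to serve the first one it sees (the model allows an arbitrary choice).

```python
def run_phase(placement, requested):
    """Count the iterations of one phase.

    placement[v] lists the r distinct modules holding the copies of variable v;
    requested lists the distinct variables accessed in this phase.
    """
    r = len(placement[requested[0]])
    need = r // 2 + 1                          # majority of the copies
    accessed = {v: set() for v in requested}   # copy indices accessed so far
    iterations = 0
    while any(len(done) < need for done in accessed.values()):
        iterations += 1
        served = {}                            # module -> (variable, copy index)
        for v in requested:
            if len(accessed[v]) >= need:
                continue                       # variable is no longer alive
            for j, module in enumerate(placement[v]):
                if j not in accessed[v]:
                    # a module satisfies one request per iteration
                    served.setdefault(module, (v, j))
        for v, j in served.values():
            accessed[v].add(j)
    return iterations

# Three variables with r = 3; variables 0 and 1 collide on every module.
placement = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 3, 4]}
print(run_phase(placement, [0, 1, 2]))  # prints 2
```

The iteration count returned here is the quantity that the analysis below bounds through the expansion of the graph: the better the live copies are spread over the modules, the fewer iterations a phase needs.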
As we will see later, all the schemes presented in this paper use constant redundancy, and their implementations allow each processor to determine the physical address of any copy in O(log N) time. Thus, letting $\Phi$ denote the maximum number of iterations of the while loop, executed in any of the r phases, it can easily be seen that the entire access protocol takes $O(\Phi + \log N)$ steps on the MPC. Next, we devise a general expression for $\Phi$ based on the expansion properties of the graph G.
Let $S \subseteq V$ be a set of variables. A c-bundle for S is defined as a subset of copies of variables in S containing at least c copies for each variable (similar terminology is used in [9]). For a c-bundle $\eta$ of S, let $\Gamma_\eta(S)$ denote the set of modules storing the copies in $\eta$. The following lemma is similar to Lemma 3.3 of [19].

Proof. The proof is by induction on k. Let k = 1. At the beginning there are X live copies, which means that there are at least X/r live variables. By using the expansion property of G, we conclude that the live copies reside in at least $\mu (X/r)^{1-\nu}$ modules, and, therefore, after the first iteration

$$R_1 \le X - \mu (X/r)^{1-\nu} = X\left(1 - \frac{\mu}{r^{1-\nu} X^{\nu}}\right)$$

copies are still alive. This establishes the basis. Assuming that the lemma holds for k - 1, by a similar reasoning, we can show that

$$R_k \le R_{k-1}\left(1 - \frac{\mu}{r^{1-\nu} R_{k-1}^{\nu}}\right).$$

Since $R_{k-1} \le X$, we have that

Choosing $R_y$ as an upper bound to $k_i$ and applying Lemma 1, we have

Hence,
The analysis of the MOSs presented in later sections is aimed at determining the expansion properties of the specific graphs used in such schemes, so that the above theorem can be used to obtain their time performance.
3. Definitions and Notations
Let q be a prime power and let n be an integer. Let $\mathbb{F}_{q^n}$ denote the finite field with $q^n$ elements, let $\mathbb{F}^*_{q^n}$ be its multiplicative group, and let $\gamma$ be a primitive element of $\mathbb{F}_{q^n}$. As is well known from Galois theory, $\langle\gamma\rangle = \mathbb{F}^*_{q^n}$, where $\langle\gamma\rangle$ is the cyclic group generated by $\gamma$, and, furthermore, the elements of $\mathbb{F}_{q^n}$ can be represented as polynomials in $\gamma$ of degree less than n, with coefficients in $\mathbb{F}_q$. Throughout this paper, unless differently specified, lower-case roman letters are used to denote the elements of $\mathbb{F}_q$ and lower-case greek letters to denote those of $\mathbb{F}_{q^n}$.
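As a concrete instance of this representation, the sketch below works in $\mathbb{F}_{2^3}$, with elements stored as 3-bit masks of coefficients and reduction modulo $x^3 + x + 1$. The choice of field, the modulus, and the helper name `gf_mul` are assumptions made only for illustration.

```python
def gf_mul(a, b, modulus=0b1011, n=3):
    """Multiply two elements of F_{2^n} given as bitmasks of F_2 coefficients.

    The default modulus encodes x^3 + x + 1, which is irreducible over F_2.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in characteristic 2 is XOR
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= modulus         # reduce as soon as the degree reaches n
    return result

gamma = 0b010                    # the class of x, taken as primitive element
powers = []
x = 1
for _ in range(2**3 - 1):
    powers.append(x)
    x = gf_mul(x, gamma)

# The powers of gamma run through every nonzero element exactly once.
assert sorted(powers) == list(range(1, 8))
print(powers)
```

Each power of $\gamma$ appears here as a polynomial of degree less than 3 in $\gamma$, which is exactly the representation used throughout the paper.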
The Projective Linear Group of degree 2 over $\mathbb{F}_{q^n}$, denoted $PGL_2(q^n)$, is the group (under matrix multiplication) of $2 \times 2$ nonsingular matrices with entries in $\mathbb{F}_{q^n}$, modulo its center, the group of scalar matrices (i.e., scalar multiples of the identity) [6]. In other words, matrices that differ by a scalar multiple represent the same group element. It is well known that $|PGL_2(q^n)| = q^n(q^{2n} - 1)$. It is easy to see that
Finally, we need to distinguish two particular subsets of $\mathbb{F}_{q^n}$. One is the set $P_\gamma$ of all the polynomials in $\gamma$ with constant term equal to 0, that is, $P_\gamma = \{\sum_{i=1}^{n-1} c_i\gamma^i : c_i \in \mathbb{F}_q\}$. Note that $|P_\gamma| = q^{n-1}$, and we denote these polynomials as

$$\{\rho_i : 0 \le i < q^{n-1}\}.$$

The other set is that of monic polynomials in $\gamma$ of degree less than n. Note that there are $(q^n - 1)/(q - 1)$ such polynomials, which we denote as

$$\left\{\pi_i : 0 \le i < \frac{q^n - 1}{q - 1}\right\}.$$
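The two cardinalities can be confirmed by enumerating coefficient vectors. In the sketch below a polynomial is a tuple $(c_0, \ldots, c_{n-1})$ of labels for the elements of $\mathbb{F}_q$, with 0 and 1 labelling the additive and multiplicative identities; no field arithmetic is needed for counting, so q may be any prime power. The function names are hypothetical.

```python
from itertools import product

def zero_constant_term(q, n):
    """Coefficient vectors (c_0, ..., c_{n-1}) with c_0 = 0."""
    return [c for c in product(range(q), repeat=n) if c[0] == 0]

def monic_polynomials(q, n):
    """Nonzero coefficient vectors whose highest nonzero coefficient is 1."""
    result = []
    for c in product(range(q), repeat=n):
        nonzero = [x for x in reversed(c) if x != 0]
        if nonzero and nonzero[0] == 1:
            result.append(c)
    return result

for q, n in [(2, 3), (3, 2), (4, 2), (5, 3)]:
    assert len(zero_constant_term(q, n)) == q**(n - 1)
    assert len(monic_polynomials(q, n)) == (q**n - 1) // (q - 1)
    print(q, n, len(zero_constant_term(q, n)), len(monic_polynomials(q, n)))
```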
4. Memory Organization Scheme for $M \in \Theta(N^{1.5})$ Variables

4.1. The Graph
Let q be a prime power and let n > 3 be an integer, such that either q is even, or both q and n are odd. The graph G = (V, U; E) that specifies how the copies of the variables are distributed among the memory modules is defined as follows:

$$V = PGL_2(q^n)/H_0, \qquad U = PGL_2(q^n)/H_{n-1}.$$

Thus, the variables are associated with the left cosets of $H_0$ and the modules with the left cosets of $H_{n-1}$. By (1), (2), and (3), it follows that

$$M = |V| = q^{n-1}\,\frac{q^{2n} - 1}{q^2 - 1}, \qquad N = |U| = \frac{q^{2n} - 1}{q - 1}.$$

Therefore, for fixed q, we have $M \in \Theta(N^{1.5})$. The edge set is defined as follows:

$$E = \{(AH_0, BH_{n-1}) : A, B \in PGL_2(q^n) \text{ and } AH_0 \cap BH_{n-1} \neq \emptyset\}.$$
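The relation between M and N can be checked numerically. The sketch below evaluates $M = q^{n-1}(q^{2n}-1)/(q^2-1)$ and $N = (q^{2n}-1)/(q-1)$ for q = 2 and a few values of n, and prints the ratio $M/N^{1.5}$; the function name and the sample values of n are chosen only for illustration.

```python
def scheme_sizes(q, n):
    """Number of variables M and of modules N for the first scheme."""
    M = q**(n - 1) * (q**(2 * n) - 1) // (q * q - 1)
    N = (q**(2 * n) - 1) // (q - 1)
    return M, N

for n in (4, 5, 9, 17):
    M, N = scheme_sizes(2, n)
    print(n, M, N, M / N**1.5)
```

For q = 2 the ratio stays close to 1/6 as n grows, which is what $M \in \Theta(N^{1.5})$ predicts for fixed q.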
We now show that the edges are in one-to-one correspondence with the cosets of the subgroup $H_0 \cap H_{n-1}$. It is easily seen that

thus, $|H_0 \cap H_{n-1}| = q(q - 1)$. By (2) and (3)

and, for any $B \in PGL_2(q^n)$,
Proof The proof is similar to that of Lemma 2. For the first part, since L~-L ~ H,,_ l, Before proceeding with the analysis of the structure of the graph, we need a convenient representation for the nodes of U, given in the following lemma.
Proof. It is not difficult to see that all of the above cosets are distinct and, therefore, since their number is $(q^n + 1)((q^n - 1)/(q - 1))$, they form a partition of U. □

The theorem below shows that the copies of any two variables share at most one memory module. In order to determine the expansion property of the graph, we need to introduce the function $\Gamma^2$, defined as

$$\Gamma^2(u) = \Gamma(\Gamma(u)) - \{u\}, \qquad u \in U.$$

We have

Proof. By definition, $\Gamma^2(AH_{n-1}) = \Gamma(\Gamma(AH_{n-1})) - AH_{n-1}$.
We are now ready to determine the expansion property of G. In particular, we want to find the values $\nu$ and $\mu$ for which G has $(\nu, \mu)$-expansion (see Definition 1).
Proof. Let q-,
Combining the hypothesis $|\Gamma_\eta(S)| = |S|^{2/3}\mu$ with the above inequality we obtain
Note that the above theorem holds for any prime power q. In particular we can choose q = 2. Assuming that the graph G can be implemented in such a way that a processor is able to compute the physical address of any copy in O(log N) time using constant internal storage, which we prove in the next section, we can combine the results of Theorems 4 and 1 and get 
4.2. Implementation
A crucial aspect of the design of an MOS concerns its implementation, an issue that has often been ignored in the past. In particular, a processor that wants to access a specific copy of a variable must be able to determine efficiently the module storing that copy and the physical address of the copy within the module. This subsection explains how this can be accomplished when the variables are distributed among the modules according to the graph G presented in the preceding subsection.
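Let $v_0, \ldots, v_{M-1}$ denote the variables and $u_0, \ldots, u_{N-1}$ the memory modules. Recall that $M = q^{n-1}((q^{2n} - 1)/(q^2 - 1))$ and $N = (q^{2n} - 1)/(q - 1)$. We first need to associate variables and modules with the appropriate cosets. That is, we need to establish the following bijections:

$$v_i \leftrightarrow A_iH_0, \quad 0 \le i < M; \qquad u_j \leftrightarrow B_jH_{n-1}, \quad 0 \le j < N.$$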
The definition of the $A_i$'s involves a number of technical details, which, for convenience of presentation, are dealt with in the Appendix. As for the $B_j$'s, Lemma 4 already suggests a set of possible candidates, which, however, we need to modify slightly as follows. For nonnegative integers $s < (q^n - 1)/(q - 1)$ and $t < q^n + 1$, we define the integer

$$J(s,t) \triangleq s(q^n + 1) + t.$$

Recall that $\{\pi_s : 0 \le s < (q^n - 1)/(q - 1)\}$ denotes the set of monic polynomials in $\gamma$ of degree less than n, and let $\mathbb{F}_{q^n} = \{\alpha_0, \ldots, \alpha_{q^n - 1}\}$. For $0 \le s < (q^n - 1)/(q - 1)$ and $0 \le t < q^n + 1$, the matrix $B_{J(s,t)}$ is defined by cases in (7), according to whether $t = 0$ or $t > 0$. The edge joining module $u_{J(s,t)}$ to the variable associated with $B_{J(s,t)}L_h^{n-1}H_0$ is identified with coset $B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1})$.

We adopt the following convention: the kth copy of $v_i$ is stored at the address h in module $u_{J(s,t)}$ if and only if $A_iL_k^0(H_0 \cap H_{n-1}) = B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1})$, that is, if and only if $A_iL_k^0 \in B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1})$.
Therefore, we seek a method to compute (s, t, h) from (i, k). To this purpose, a processor proceeds as follows:
Step 1. From i and k compute $A_i$ and $L_k^0$.

Step 2. Find s, t, and h such that $A_iL_k^0 \in B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1})$.
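Once s and t are known, the module index is obtained from the map $J(s,t) = s(q^n + 1) + t$, which is a mixed-radix encoding of the pair (s, t). The sketch below shows the encoding and its inverse; the function names are hypothetical, and the matrix computations of Steps 1 and 2 are not modelled.

```python
def J(s, t, q, n):
    """Module index for 0 <= s < (q^n - 1)/(q - 1) and 0 <= t < q^n + 1."""
    return s * (q**n + 1) + t

def J_inverse(j, q, n):
    """Recover (s, t) from a module index j."""
    return divmod(j, q**n + 1)

q, n = 2, 4
N = (q**(2 * n) - 1) // (q - 1)
indices = [J(s, t, q, n)
           for s in range((q**n - 1) // (q - 1))
           for t in range(q**n + 1)]

# The pairs (s, t) cover the module indices 0, ..., N - 1 exactly once.
assert sorted(indices) == list(range(N))
assert all(J(*J_inverse(j, q, n), q, n) == j for j in range(N))
```

Both directions take a constant number of integer operations, so the encoding itself does not affect the O(log N) bound on address computation.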
The following lemma identifies the matrices in each coset $B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1})$.

Lemma 6. If $t = 0$, then $B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1}) = \{\ldots\}$. Otherwise ($t > 0$), $B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1}) = \{\ldots\}$.

We are now ready to describe in more detail the two-step procedure that computes the physical address of a copy. We assume that each processor has a local work space consisting of a constant number of registers, and that, in addition to the ordinary arithmetic operations, it is able to perform addition, multiplication, and inverse in the fields $\mathbb{F}_q$ and $\mathbb{F}_{q^n}$. We also assume that arithmetic operations and operations in $\mathbb{F}_q$ take constant time. Representing the elements of $\mathbb{F}_{q^n}$ as polynomials, the operations in $\mathbb{F}_{q^n}$ can be implemented using feedback shift registers, each operation taking O(n) = O(log N) time. Let x, y, z, and v denote the entries of the matrix $A_iL_k^0$, where x, y, z, and v are elements of $\mathbb{F}_{q^n}$, and the usual notation for matrices of $PGL_2(q^n)$ is adopted, i.e., $v = 1$, or $z = 1$ and $v = 0$. The procedure in Figure 3, which is entirely based on Lemma 6, computes the indices s, t, and h such that $A_iL_k^0 \in B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1})$, that is, $A_iL_k^0(H_0 \cap H_{n-1}) = B_{J(s,t)}L_h^{n-1}(H_0 \cap H_{n-1})$.
In the code given in Figure 3, symbols $\oplus$, $\ominus$, $\otimes$, and INV denote, respectively, addition, subtraction, multiplication, and inverse in $\mathbb{F}_{q^n}$. We also use the following four macros, all executable in O(log N) time. Let x denote a generic element of $\mathbb{F}_{q^n}$.

• INDEX1(x) = t, where $x = \alpha_t$.
• INDEX2(x) = h, where $x = \rho_h + b$, for some $b \in \mathbb{F}_q$.
• INDEX3(x) = s, where $x = a\pi_s$, for some $a \in \mathbb{F}_q$.
• MONIC(x) = $\pi_s$, where $x = a\pi_s$.
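The behaviour of the four macros can be illustrated on a small field. The sketch below takes q = 3 and n = 2, represents an element as the tuple $(c_0, c_1)$ of its coefficients, and orders the $\alpha_t$, $\rho_h$, and $\pi_s$ lexicographically; this ordering is an assumption of the sketch. The macros are realized here by table lookup, which is simple but does not reflect the constant-storage, O(log N)-time implementation claimed in the text.

```python
from itertools import product

q, n = 3, 2   # illustrative values; q prime, so F_q is the integers mod q

def monic(x):
    """MONIC(x): divide nonzero x by its leading (highest nonzero) coefficient."""
    lead = next(c for c in reversed(x) if c != 0)
    inverse = pow(lead, -1, q)             # inverse in F_q (Python 3.8+)
    return tuple((inverse * c) % q for c in x)

# Assumed orderings: lexicographic on coefficient tuples (c_0, ..., c_{n-1}).
ALPHA = list(product(range(q), repeat=n))            # all field elements
RHO = [x for x in ALPHA if x[0] == 0]                # constant term 0
PI = [x for x in ALPHA if any(x) and monic(x) == x]  # monic polynomials

def index1(x):
    """INDEX1(x) = t, where x = alpha_t."""
    return ALPHA.index(x)

def index2(x):
    """INDEX2(x) = h, where x = rho_h + b for some b in F_q."""
    return RHO.index((0,) + tuple(x[1:]))

def index3(x):
    """INDEX3(x) = s, where x = a * pi_s for some nonzero a in F_q."""
    return PI.index(monic(x))

x = (1, 2)    # the element 1 + 2*gamma
print(index1(x), index2(x), index3(x), monic(x))
```

Here `monic((1, 2))` returns `(2, 1)` because multiplying by the inverse of the leading coefficient 2 gives $2 + \gamma$, and both `(1, 2)` and `(2, 1)` therefore share the same INDEX3 value.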
We conclude with the following theorem, whose proof follows immediately from Lemma 6 and the above discussion.

Theorem 6. The code in Figure 3 correctly executes Step 2 in O(log N) time. Therefore, a processor is able to compute the physical address of any copy of any variable in O(log N) time using O(1) internal storage.
5. Memory Organization Schemes for $M \in \Theta(N^2)$ and $M \in \Theta(N^3)$ Variables
We jointly present the MOS for $M \in \Theta(N^2)$ and $M \in \Theta(N^3)$ variables, because their underlying graphs are derived from the same graph G = (V, U; E) defined below. Let

$$V = PGL_2(q^n)/H_0, \qquad U = PGL_2(q^n)/H_n.$$

By (1), (2), and (4), $|V| = q^{n-1}((q^{2n} - 1)/(q^2 - 1))$ and $|U| = q^n + 1$.
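We have

$$H_n = \bigcup_{s=0}^{(q^n-1)/(q-1)-1}\ \bigcup_{t=0}^{q^{n-1}-1} L^n_{s,t}(H_0 \cap H_n)$$

and, for any $B \in PGL_2(q^n)$,

$$BH_n = \bigcup_{s=0}^{(q^n-1)/(q-1)-1}\ \bigcup_{t=0}^{q^{n-1}-1} BL^n_{s,t}(H_0 \cap H_n).$$

To prove equality (on the basis of cardinalities), we must prove that the cosets of the right-hand set are distinct. Suppose

$$L^n_{s_1,t_1}(H_0 \cap H_n) = L^n_{s_2,t_2}(H_0 \cap H_n).$$

Then, there is a matrix $\begin{bmatrix} a & b \\ 0 & 1 \end{bmatrix} \in H_0 \cap H_n$ such that

$$L^n_{s_2,t_2} = \begin{bmatrix} \pi_{s_2} & \pi_{s_2}\rho_{t_2} \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} \pi_{s_1} & \pi_{s_1}\rho_{t_1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a & b \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} a\pi_{s_1} & b\pi_{s_1} + \pi_{s_1}\rho_{t_1} \\ 0 & 1 \end{bmatrix},$$

which implies $\pi_{s_2} = a\pi_{s_1}$ and $\pi_{s_2}\rho_{t_2} = \pi_{s_1}(\rho_{t_1} + b)$. Since $\pi_{s_1}$ and $\pi_{s_2}$ are monic polynomials and $\rho_{t_1}, \rho_{t_2} \in P_\gamma$, we conclude that $s_1 = s_2$ and $t_1 = t_2$. □

Thus, the degree of each node of V is $q + 1$ and the degree of each node of U is $q^{n-1}((q^n - 1)/(q - 1))$.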
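The distinctness argument in the proof can be replayed exhaustively on a small case. The sketch below takes q = 2 and n = 3, so that $\mathbb{F}_{q^n}$ has 8 elements stored as 3-bit masks, and represents an upper-triangular matrix with bottom row (0, 1) by its top row (x, y). The field modulus $x^3 + x + 1$, the helper `gf_mul`, and the identification of $H_n$ with the matrices having nonzero x and bottom row (0, 1) are assumptions of this sketch.

```python
def gf_mul(a, b, modulus=0b1011, n=3):
    """Multiply in F_{2^3}; elements are bitmasks, modulus is x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= modulus
    return result

field = range(2**3)
PI = [x for x in field if x != 0]          # over F_2 every nonzero poly is monic
RHO = [x for x in field if (x & 1) == 0]   # polynomials with constant term 0

def coset(pi, rho):
    """Top rows of [pi, pi*rho; 0, 1] * [a, b; 0, 1] with a = 1 and b in F_2."""
    return frozenset((pi, gf_mul(pi, rho ^ b)) for b in (0, 1))

cosets = [coset(pi, rho) for pi in PI for rho in RHO]
covered = set().union(*cosets)

assert len(set(cosets)) == len(cosets) == 28          # all cosets are distinct
assert sum(len(c) for c in cosets) == len(covered)    # and pairwise disjoint
assert covered == {(x, y) for x in PI for y in field}  # they cover all 56 matrices
```

The count 28 equals $q^{n-1}(q^n - 1)/(q - 1)$ for q = 2 and n = 3, which is the degree of each node of U stated above.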
With an argument similar to the one used to prove Lemma 4, it can easily be shown that

In Theorem 4.10 of [10] it is shown that G is a 3-$(q^n + 1, q + 1, 1)$-design, which has the property that for any distinct $u_1, u_2, u_3 \in U$ there exists exactly one $v \in V$ adjacent to all three of them. (The parameters $q^n + 1$ and $q + 1$ indicate the output size and the input degree of the graph, respectively, whereas the parameters 3 and 1 indicate that for any three outputs there is one input adjacent to them.)
5.1. Memory Organization Scheme for $M \in \Theta(N^2)$ Variables

5.1.1. The Graph. A Balanced Incomplete Block Design with parameters $q^n$, $q$, and 1 (a $(q^n, q, 1)$-BIBD) is a bipartite graph with $q^n$ outputs, input degree q, and such that for any pair of outputs there exists exactly one input adjacent to both. This immediately implies that there are $q^{n-1}((q^n - 1)/(q - 1))$ inputs and that the output degree is $(q^n - 1)/(q - 1)$.

Let $N = q^n$ and $M = q^{n-1}((q^n - 1)/(q - 1))$ and note that, if $q \in O(1)$, $M \in \Theta(N^2)$.

We use a $(q^n, q, 1)$-BIBD to distribute M variables among N memory modules, with each variable represented by q copies and each module storing $(q^n - 1)/(q - 1)$ copies of distinct variables. We call such a graph $G_1 = (V_1, U_1; E_1)$, where the set $V_1$ denotes the variables and $U_1$ the modules. We will see in the next subsection how $G_1$ is obtained, using a standard technique, as a subgraph of the 3-$(q^n + 1, q + 1, 1)$-design G defined earlier.
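The BIBD parameters can be observed on a small explicit design. The sketch below builds the lines of the affine space $\mathbb{F}_q^n$ for prime q, which form a $(q^n, q, 1)$-BIBD with points as outputs (modules) and lines as inputs (variables). It is used here only as a stand-in with the same parameters; it is not the coset construction of the paper, and the function name is hypothetical.

```python
from itertools import combinations, product

def affine_lines(q, n):
    """Points and lines of the affine space F_q^n, for prime q."""
    points = list(product(range(q), repeat=n))
    lines = set()
    for p in points:
        for d in points:
            if any(d):  # nonzero direction vector
                lines.add(frozenset(
                    tuple((pi + t * di) % q for pi, di in zip(p, d))
                    for t in range(q)))
    return points, lines

q, n = 3, 2
points, lines = affine_lines(q, n)

# Number of inputs (variables) and input degree (copies per variable).
assert len(lines) == q**(n - 1) * (q**n - 1) // (q - 1)
assert all(len(line) == q for line in lines)
# Any pair of outputs (modules) is shared by exactly one input (variable).
for a, b in combinations(points, 2):
    assert sum(1 for line in lines if a in line and b in line) == 1
# Output degree: copies stored in each module.
for p in points:
    assert sum(1 for line in lines if p in line) == (q**n - 1) // (q - 1)
```

With q = 3 and n = 2 this gives 12 variables with 3 copies each over 9 modules, each module holding 4 copies.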
The following theorem establishes the expansion properties of $G_1$. As usual, given a set of variables S, and a c-bundle $\eta$ for S, let $\Gamma_\eta(S)$ denote the set of modules storing the bundle copies.

Since a $(p+1)$-bundle $\eta$ for S contains at least $|S|(p + 1)$ copies and, by hypothesis, these are stored in $|\Gamma_\eta(S)| = |S|^{1/2}\mu$ modules, there must be a module $u \in \Gamma_\eta(S)$ storing at least $|S|^{1/2}(p + 1)/\mu$ such copies. Clearly, these copies belong to distinct variables, which cannot share any module other than u because, otherwise, we would have more than one variable connected to the same pair of modules, violating the BIBD property. Since each such variable accounts for at least p other bundle copies, besides the one stored in u, we conclude that

$$|\Gamma_\eta(S)| > |S|^{1/2}\,\frac{(p + 1)p}{\mu},$$

which, combined with the hypothesis $|\Gamma_\eta(S)| = |S|^{1/2}\mu$, yields $\mu^2 > p(p + 1)$.
In the next section we show how $G_1$ can be implemented for any prime power q and integer n, so that a processor is able to compute the physical address of any copy of any variable in O(log N) time using constant internal storage (Theorem 9). Fixing q = 3 and combining the results of Theorems 7 and 1, we get

Theorem 8. $M \in \Theta(N^2)$ variables can be distributed among N processors of an MPC, with redundancy 3, so that any set of N distinct variables can be accessed, using the access protocol of Section 2, in time $O(N^{1/2})$. Moreover, each processor can determine the physical address of any copy in O(log N) time using O(1) internal storage.

5.1.2. Implementation. A $(q^n, q, 1)$-BIBD $G_1 = (V_1, U_1; E_1)$ can be obtained as a subgraph of the 3-$(q^n + 1, q + 1, 1)$-design G = (V, U; E), using the following standard technique. Let u be an arbitrary node of U and define

$$V_1 = \{v \in V : (v, u) \in E\}, \qquad U_1 = U - \{u\}, \qquad E_1 = E|_{V_1, U_1},$$

where $E|_{V_1, U_1}$ denotes the edges of E between $V_1$ and $U_1$. As shown in Theorem 1.14 of [10], the graph $G_1 = (V_1, U_1; E_1)$ is a $(q^n, q, 1)$-BIBD, with

$$|V_1| = q^{n-1}\,\frac{q^n - 1}{q - 1}, \qquad |U_1| = q^n,$$

where each node in $V_1$ has degree q, and each node in $U_1$ has degree $(q^n - 1)/(q - 1)$.
For convenience, we choose u to be the node corresponding to coset $H_n$. The following lemma identifies the cosets associated with the nodes in $V_1$ and $U_1$, according to the above construction.

Proof. Immediate from the definition of $U_1$ and $V_1$, Lemma 7, and equality (10). □

The edge set $E_1$ is, by definition, a subset of E consisting of all those edges incident to both $V_1$ and $U_1$. For each node x in $G_1$, denote the set of its neighbors by $\Gamma(x)$.

Lemma 9. Let $AH_0 \in V_1$ and $BH_n \in U_1$, for some $A, B \in PGL_2(q^n)$ of the kind given in Lemma 8. Then equalities (11) and (12) hold.

Proof. Equality (11) follows immediately from (9), due to the exclusion of $H_n = AL^0_{-1}H_n$ from $U_1$. To prove (12)

is given by the lexicographic order of the n-tuples of coefficients. By virtue of Lemma 9, we can establish that the kth copy of a variable $v_i$ is stored in the module associated with coset $A_iL_k^0H_n$, $0 \le k < q$. The item stored at the address h of module j is a copy of the variable associated with coset $B_jL^n_{h,0}H_0$. Since the edges in the graph are cosets of $H_0 \cap H_n$ we adopt, as before, the following convention: the kth copy of $v_i$ is stored at the address h in module $u_j$ if and only if $A_iL_k^0(H_0 \cap H_n) = B_jL^n_{h,0}(H_0 \cap H_n)$, that is, if and only if $A_iL_k^0 \in B_jL^n_{h,0}(H_0 \cap H_n)$.
Suppose a processor wants to compute the address of the kth copy of $v_i$. A two-step procedure, similar to the one used in Section 4.2, is executed:

Step 1. From i and k compute $A_i$ and $L_k^0$.

Step 2. Find j and h such that $A_iL_k^0 \in B_jL^n_{h,0}(H_0 \cap H_n)$.

It is easy to see that Step 1 takes O(log N) time. Consider Step 2. We have that

Note that $A_iL_k^0$ can be written in terms of two elements x and y of $\mathbb{F}_{q^n}$. The processor has to determine the indices j and h such that

for some $a, b \in \mathbb{F}_q$ and $a \neq 0$. This implies b = 0, and, therefore, $\alpha_j = x$ and $(a\pi_h)^{-1} = y$, which allows a processor to determine j and h in O(log N) time. We have

Theorem 9. A processor computes the physical address of any copy in O(log N) time using O(1) internal storage.
5.2. Memory Organization Scheme for $M \in \Theta(N^3)$ Variables

5.2.1. The Graph. The graph that we use in this case is a 3-$(q^n + 1, q, q - 2)$-design with q > 3, which is a bipartite graph $G_2 = (V_2, U_2; E_2)$ such that $|U_2| = q^n + 1$, each node in $V_2$ has degree q, and for any three distinct nodes $u_1, u_2, u_3 \in U_2$ there exist exactly $q - 2$ nodes of $V_2$ adjacent to all three of them. The properties of such a graph are similar to those of the graph G defined at the beginning of Section 5, which is a 3-$(q^n + 1, q + 1, 1)$-design and would yield a comparable time performance for the MOS. However, $G_2$ allows us to devise a simpler implementation. From its definition, it immediately follows that $|V_2| = q^{n-1}((q^{2n} - 1)/(q - 1))$, and that the degree of each node of $U_2$ is $q^n((q^n - 1)/(q - 1))$. Thus, we use $G_2$ to distribute $M = q^{n-1}((q^{2n} - 1)/(q - 1))$ variables among $N = q^n + 1$ modules, with each variable replicated in q copies, and each module storing $q^n((q^n - 1)/(q - 1))$ copies of distinct variables.
Observe that for q = 3 the graph is trivial because $M = \binom{N}{3}$ and the nodes of $V_2$ are in one-to-one correspondence with the triplets of nodes of $U_2$. In this case the expansion property of the graph, proved below, does not hold, and this is why we require q > 3.
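The parameters of $G_2$ can be cross-checked by counting triplets: in a 3-$(v, k, \lambda)$-design every triplet of outputs is covered $\lambda$ times and every input covers $\binom{k}{3}$ triplets, so the number of inputs is $\lambda\binom{v}{3}/\binom{k}{3}$. The sketch below compares this count with the closed form for M given above; the function names are hypothetical.

```python
from math import comb

def g2_sizes(q, n):
    """Number of variables M and of modules N in G_2."""
    N = q**n + 1
    M = q**(n - 1) * (q**(2 * n) - 1) // (q - 1)
    return M, N

def inputs_of_3_design(v, k, lam):
    """Number of inputs of a 3-(v, k, lam)-design, by counting triplets."""
    total = lam * comb(v, 3)
    assert total % comb(k, 3) == 0
    return total // comb(k, 3)

for q, n in [(3, 2), (4, 2), (5, 2), (5, 3), (7, 2)]:
    M, N = g2_sizes(q, n)
    assert M == inputs_of_3_design(N, q, q - 2)
    print(q, n, M, N, M == comb(N, 3))
```

The last column is true only for q = 3, the trivial case in which every triplet of modules corresponds to exactly one variable.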
Theorem 10. For q > 3, $G_2$ has $(2/3, q^{2/3}/3)$-expansion.

Proof. Let $p = \lfloor q/2 \rfloor$. We must prove that, for any set $S \subseteq V_2$ and any $(p + 1)$-bundle $\eta$ for S,

$$|\Gamma_\eta(S)| \ge \frac{q^{2/3}}{3}\,|S|^{1/3}.$$

Without loss of generality, suppose $|\Gamma_\eta(S)| = |S|^{1/3}\mu$ for some $\mu > 0$. Note that if q > 3, then $p \ge 2$. Since $\eta$ contains at least $|S|(p + 1)$ copies, there must be a module $u \in \Gamma_\eta(S)$ storing at least

$$\frac{p + 1}{\mu}\,|S|^{2/3}$$

bundle copies, which clearly belong to distinct variables. Let $S' \subseteq S$ be the set of these variables. Each such variable accounts for at least p other bundle copies, besides the one stored in u, which, in turn, account for at least $\binom{p}{2}$ pairs of modules, not including u. By adding u to each pair, we obtain $\binom{p}{2}$ triplets. Let

$$t = |\Gamma_\eta(S') - \{u\}|.$$

There are $\binom{t}{2}$ triplets formed by two modules in $\Gamma_\eta(S') - \{u\}$ and u. By the definition of a 3-$(q^n + 1, q, q - 2)$-design, each triplet occurs exactly $q - 2$ times, and, therefore, we conclude that

$$|S'|\binom{p}{2} \le (q - 2)\binom{t}{2}.$$

Recalling that $|S'| \ge ((p + 1)/\mu)|S|^{2/3}$, the above relation implies

$$t^2 > \frac{(p + 1)p(p - 1)}{(q - 2)\mu}\,|S|^{2/3}.$$

Since $|\Gamma_\eta(S)| \ge |\Gamma_\eta(S')| > t$, combining the inequality for t with the hypothesis $|\Gamma_\eta(S)| = |S|^{1/3}\mu$, we obtain

$$\mu^3 > \frac{(p + 1)p(p - 1)}{q - 2}.$$
The next section shows how to construct $G_2$ for any prime power q > 3 and integer n, and how the copies of the variables can be organized among the modules according to $G_2$, so that a processor is able to compute the physical address of any copy in O(log N) time using constant internal storage (Theorem 13). Choosing q = 5 and combining the results of Theorems 10 and 1 we obtain

5.2.2. Implementation. We construct $G_2$ by combining $q^n + 1$ copies of the graph $G_1$, studied in Section 5.1, as explained below. Recall that $G_1$ was derived from G = (V, U; E) by choosing an arbitrary node $u \in U$ and considering set $U - \{u\}$ as the output set, set $\{v \in V : (v, u) \in E\}$ as the input set, and the edges of E between these two sets. Repeating the construction for each node $u_j \in U$, $-1 \le j < q^n$, we obtain a graph $G_{u_j} = (V_{u_j}, U_{u_j}; E_{u_j})$ with $V_{u_j} = \{v \in V : (v, u_j) \in E\}$ and $U_{u_j} = U - \{u_j\}$.
The following lemma generalizes the result of Lemma 8.

Lemma 10. We have

$$V_{u_{-1}} = \left\{L^n_{s,t}H_0 : 0 \le s < \frac{q^n - 1}{q - 1} \text{ and } \rho_t \in P_\gamma\right\}, \qquad (13)$$

and, for $0 \le j < q^n$,

Proof. Equalities (13) and (14) follow from Lemma 8, since $V_{u_{-1}} = V_1$ and $U_{u_{-1}} = U_1$. Equality (15) follows from Lemma 7 and the definition of $V_{u_j}$. As for $U_{u_j}$, by using (10), it is easy to see that
For any $j \ge 0$ the graph $G_{u_j}$ is isomorphic to $G_{u_{-1}}$, since the cosets associated with the nodes of $G_{u_j}$ are obtained by multiplying those of $G_{u_{-1}}$ by the same matrix, and the adjacencies are preserved because the edges, which are associated with the cosets of $H_0 \cap H_1$, are also multiplied by the same matrix. Furthermore, since $G_{u_{-1}} = G_1$, we conclude that each $G_{u_j}$ is a $(q^n, q, 1)$-BIBD.
We are now ready to construct the 3-$(q^n + 1, q, q-2)$-design $G_2 = (V_2, U_2; E_2)$. Recall that $|U_2| = q^n + 1$ and $|V_2| = q^{n-1}((q^{2n} - 1)/(q - 1))$. Set $U_2 = U$, and partition the set $V_2$ into $q^n + 1$ disjoint subsets of $q^{n-1}((q^n - 1)/(q - 1))$ nodes each, namely $V_2^j = V_{u_j}$, for $-1 \le j < q^n$. Note that the $V_{u_j}$'s are not disjoint; therefore distinct $V_2^j$'s may include the same cosets which, however, will be reckoned as distinct nodes. Each $V_2^j$ is connected to $U$ by the same edges that connect $V_{u_j}$ to $U_{u_j}$ in $G_{u_j}$ (see Figure 4).
It is easy to see that, in $G_2$, each node of $V_2$ has degree $q$ and each node of $U_2$ has degree $q^n((q^n - 1)/(q - 1))$.
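The node counts and degrees of $G_2$ quoted above can be cross-checked with a few lines of arithmetic. The sketch below recomputes them from $q$ and $n$ and also tests the triple-counting identity that any 3-$(q^n+1, q, q-2)$-design must satisfy, namely $|V_2|\binom{q}{3} = (q-2)\binom{q^n+1}{3}$. The function names are illustrative only.

```python
from math import comb

def g2_parameters(q, n):
    """Return (|U_2|, |V_2|, degree of a U_2 node) for the graph G_2."""
    blocks_per_copy = q ** (n - 1) * (q ** n - 1) // (q - 1)  # |V_2^j|
    num_copies = q ** n + 1                                    # |U_2|
    v2 = num_copies * blocks_per_copy                          # |V_2|
    u_degree = v2 * q // num_copies   # total edges divided by |U_2|
    return num_copies, v2, u_degree

def satisfies_triple_count(q, n):
    """Check |V_2| * C(q,3) == (q-2) * C(q^n+1, 3)."""
    u2, v2, _ = g2_parameters(q, n)
    return v2 * comb(q, 3) == (q - 2) * comb(u2, 3)
```

For example, `g2_parameters(4, 2)` describes a graph with 17 modules and 340 variables, each module holding 80 copies.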
Proof. We must prove that for any three distinct nodes $x, y, z \in U_2$, there are exactly $q - 2$ nodes in $V_2$ adjacent to all three of them. Fix a triplet $(x, y, z)$ and consider these nodes in the graph $G$. Since $G$ is a 3-$(q^n + 1, q + 1, 1)$-design, there is exactly one node $v \in V$ adjacent to $x$, $y$, and $z$. Since the degree of $v$ is $q + 1$, there are $q - 2$ other nodes of $U$ adjacent to $v$ in $G$. Call these nodes $v_i$, for $1 \le i \le q - 2$. Thus, for any $i$ we must have that $v \in V_{v_i}$, $x, y, z \in U_{v_i}$, and, clearly, $(v, x), (v, y), (v, z) \in E_{v_i}$. The same edges occur in $G_2$ and, therefore, we have found $q - 2$ nodes of $V_2$ adjacent to $x$, $y$, and $z$ in $G_2$. Since each node of $V_2$ accounts for $\binom{q}{3}$ triplets, we must have
$$|V_2| \binom{q}{3} \ge (q - 2) \binom{q^n + 1}{3}.$$
The theorem follows by observing that the above is, in fact, an equality. □
The implementation of $G_2$, for the purposes of address computation, is easy, once we regard this graph as decomposed into the $G_{u_j}$'s. As observed before, each $G_{u_j}$ is isomorphic to $G_1$ and can be implemented as explained in Section 5. The variables are subdivided in groups according to the partition of $V_2$ into the subsets $V_2^i$, $-1 \le i < q^n$. Suppose a processor wants to compute the physical address of the $k$th copy of a variable $v$, and suppose $v$ belongs to $V_2^i$. We first compute the address of the copy in $G_{u_i}$, with the same procedure described before for $G_1$. Suppose the copy resides in module $u_j$. Then, once we have the address within $u_j$ with respect to the graph $G_{u_i}$, we just add the appropriate multiple of $(q^n - 1)/(q - 1)$ to account for the number of blocks in $u_j$ preceding the one where the copies of the variables of $V_2^i$ are stored. The following theorem easily follows.
Theorem 13. A processor computes the physical address of any copy in $O(\log N)$ time using $O(1)$ internal storage.
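The final offset step of the address computation can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not fix the order in which the blocks are laid out inside a module, so the sketch assumes the blocks of module $u_j$ are stored by increasing $i$, skipping $i = j$ (module $u_j$ is not an output of $G_{u_j}$). The local address within $G_{u_i}$ is taken as an input, since its computation belongs to the $G_1$ procedure; all function names are hypothetical.

```python
def block_size(q, n):
    """Number of copies a module holds for one subgraph G_{u_i}."""
    return (q ** n - 1) // (q - 1)

def block_index(i, j):
    """Rank of block i inside module u_j (assumed layout: increasing i, i != j).

    Indices i and j range over -1, 0, ..., q^n - 1.
    """
    if i == j:
        raise ValueError("module u_j stores no copies for G_{u_j}")
    rank = i + 1                 # rank of i within -1, 0, 1, ...
    return rank if i < j else rank - 1

def physical_address(q, n, i, j, local_addr):
    """Address inside module u_j of a copy whose address in G_{u_i} is local_addr."""
    if not 0 <= local_addr < block_size(q, n):
        raise ValueError("local address out of range")
    return block_index(i, j) * block_size(q, n) + local_addr
```

With $q = 4$ and $n = 2$, `physical_address(4, 2, 3, 1, 2)` skips three blocks of five copies each before adding the local address. Only the parameters $q$ and $n$ are needed, so the step uses constant storage.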
The choice of the representatives for $PGL_2(2^n)/H_0$ is made much simpler if we regard the rows of the matrices of $PGL_2(2^n)$ as elements of the extension field $\mathbb{F}_{2^{2n}}$, as explained below. Let $\lambda$ be a generator of the multiplicative group $\mathbb{F}^*_{2^{2n}}$, and define
$$\rho = \frac{2^{2n} - 1}{3}.$$
We have $\mathbb{F}^*_{2^2} = \{\lambda^{i\rho} : 0 \le i < 3\}$, so $\omega = \lambda^{\rho}$ is a generator for $\mathbb{F}^*_{2^2}$. Note that $\mathbb{F}_{2^2} \subset \mathbb{F}_{2^{2n}}$ but, since $n$ is odd, $\mathbb{F}_{2^2} \not\subseteq \mathbb{F}_{2^n}$, as pictured in the diagram below.
where, since the matrix is nonsingular, both $\phi(x, y)$ and $\phi(z, v)$ are nonzero. From now on, either the usual matrix notation or the above pair notation is used wherever appropriate, interchangeably.
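Partition the elements of $\mathbb{F}^*_{2^{2n}}$ (i.e., $\lambda^i$, for $0 \le i < 2^{2n} - 1$) into cosets of $\mathbb{F}^*_{2^{2n}}/\mathbb{F}^*_{2^2}$. For $0 \le i < \rho$ let $[\lambda^i] = \{\lambda^i \omega^k : 0 \le k < 3\} = \{\lambda^{i + k\rho} : 0 \le k < 3\}$.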
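Because every nonzero field element is a power of $\lambda$, the coset partition can be handled purely on exponents. The sketch below lists the exponents in a coset $[\lambda^i]$ and maps an arbitrary exponent back to its coset index. It works with exponents only and performs no field arithmetic; the function names are illustrative.

```python
def rho(n):
    """rho = (2^(2n) - 1) / 3, the number of cosets."""
    return (2 ** (2 * n) - 1) // 3

def coset_exponents(i, n):
    """Exponents e such that lambda^e lies in [lambda^i], for 0 <= i < rho."""
    r = rho(n)
    if not 0 <= i < r:
        raise ValueError("coset index out of range")
    return [i + k * r for k in range(3)]

def coset_index(e, n):
    """Index i, 0 <= i < rho, of the coset [lambda^i] containing lambda^e."""
    return e % rho(n)
```

For $n = 3$ there are 21 cosets of three elements each, covering the 63 exponents $0, \ldots, 62$ exactly once.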
We need to identify the cosets that contain the elements of $\mathbb{F}^*_{2^n}$. Define
The following lemma shows that the partition of $\mathbb{F}^*_{2^{2n}}$ into the cosets of $\mathbb{F}^*_{2^{2n}}/\mathbb{F}^*_{2^2}$ is, in some sense, preserved among the $\alpha_i$'s. Suppose $\alpha_1 \in [\lambda^k]$, for some $k$, $0 \le k < \rho$.
Before proceeding with the choice of representatives of $PGL_2(2^n)/H_0$ we need a technical fact. For $1 \le i \le (2^{n-1} - 1)/3$ and $0 \le j < 2^n - 1$ define
Fact 1. For any $1 \le i \le (2^{n-1} - 1)/3$ and $0 \le j < 2^n - 1$ we have:
Since $i' < \rho$, the above equation can be rewritten as (17), $(j - j')\sigma \bmod \rho = h\tau$ for some $h$, $0 \le h < 2^n - 1$. Since $i < \tau$, $i + h\tau < (2^n - 1)\tau = \rho$. Thus we have $i' = i + h\tau$.
The condition $i' < \tau$ implies $h = 0$ and, hence, $i = i'$. Also, $h = 0$ implies $(j - j')\sigma = d\rho$, for some $0 \le d < 2^n - 1$. Now, since $n$ is odd, $\mathrm{lcm}(\sigma, \rho) = 2^{2n} - 1$ and, thus, $(j' - j)\sigma = d\rho$ implies that $(j' - j)$ is a multiple of $\mathrm{lcm}(\sigma, \rho)/\sigma = 2^n - 1$. However, by definition, $(j' - j) < 2^n - 1$, so we must have $(j' - j) = 0$, that is, $j' = j$. □
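The arithmetic facts used in this argument can be checked numerically for small odd $n$. The definitions of $\sigma$ and $\tau$ are not legible in this part of the text, so the sketch assumes $\sigma = 2^n + 1$ and $\tau = \sigma/3$; these values are consistent with the identities $(2^n - 1)\tau = \rho$ and $\mathrm{lcm}(\sigma, \rho)/\sigma = 2^n - 1$ used above, but they remain an assumption.

```python
from math import gcd

def check_fact1_arithmetic(n):
    """Verify the identities used in the proof, for odd n.

    Assumes sigma = 2^n + 1 and tau = sigma / 3 (not quoted from the text).
    """
    if n % 2 == 0:
        raise ValueError("n must be odd")
    sigma = 2 ** n + 1
    tau = sigma // 3
    rho = (2 ** (2 * n) - 1) // 3
    lcm = sigma * rho // gcd(sigma, rho)
    return (
        sigma % 3 == 0                  # 3 divides 2^n + 1 when n is odd
        and (2 ** n - 1) * tau == rho   # (2^n - 1) * tau = rho
        and lcm == 2 ** (2 * n) - 1     # lcm(sigma, rho) = 2^(2n) - 1
        and lcm // sigma == 2 ** n - 1  # lcm(sigma, rho) / sigma = 2^n - 1
    )
```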
We are now ready to define the matrix representatives of $PGL_2(2^n)/H_0$. The matrices are given in the pair notation and, for convenience, are partitioned into four sets, $\mathcal{L}_1$, $\mathcal{L}_2$, $\mathcal{L}_3$, and $\mathcal{L}_4$.
The next theorem shows that the matrices in the above four sets are indeed a set of representatives for $PGL_2(2^n)$
with the matrices of $\mathcal{L}$ bijectively. We first explain how, given an integer $r$, $0 \le r < M$, the $r$th matrix of $\mathcal{L}$, say $A_r$, can be computed in the pair notation. Assume that a primitive element $\lambda \in \mathbb{F}_{2^{2n}}$ is known.
Recall that $\mathcal{L} = \mathcal{L}_1 \cup \mathcal{L}_2 \cup \mathcal{L}_3 \cup \mathcal{L}_4$.
Thus we must find a bijection between the indices in each range and the matrices of the appropriate set. For the first three ranges, the mapping can be easily established based on the definition of $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$, and involves only a constant number of operations. The case of $\mathcal{L}_4$ is slightly more complicated. Suppose $(2^n - 1)^2 \le r < M$ and set $r' = r - (2^n - 1)^2$, so that $0 \le r' < (2^n - 1)((2^{n-1} - 1)/3)(2^n - 3)$. $A_r$ will be of the form
$$A_r = (\lambda^{k(i,0)}, \lambda^t \omega^s)$$
for some $i$, $t$, and $s$ such that $1 \le i \le (2^{n-1} - 1)/3$, $1 \le t < \rho$, $\tau \nmid t$, $0 \le s < 3$, and $\lambda^{k(i,0)}(\lambda^t \omega^s)^{-1} \notin \mathbb{F}_{2^n}$. We must associate $r'$ with the appropriate triplet $(i, t, s)$. The
that is, $t' \ne i + j\sigma$ for any $0 \le j < 2^n - 1$. Among the integers $0, \ldots, 2^{2n} - 2$, those that are multiples of $\tau$ or are equal to $i + j\sigma$ (i.e., the "forbidden" indices), occupy fixed positions, as shown in Figure 5, where the integers $0, \ldots, 2^{2n} - 2$ are arranged consecutively into $2^n - 1$ rows and $\sigma$ columns. Note that the multiples of $\tau$ are those in columns $0$, $\tau$, and $2\tau$, and the values $i + j\sigma$, with $0 \le j < 2^n - 1$, are those in column $i$. It is not hard to see that $t'$ can be computed with a constant number of operations. Once the matrix $A_r$ is known in pair notation, we need to transform it into the usual form. Specifically, let $A_r = (\lambda^i, \lambda^j)$, for some $i$ and $j$. We want to find $\alpha_i, \beta_i, \alpha_j, \beta_j \in \mathbb{F}_{2^n}$ such that $\alpha_i \omega + \beta_i = \lambda^i$ and $\alpha_j \omega + \beta_j = \lambda^j$.
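The constant-time computation of $t'$ can be sketched by skipping the four forbidden columns of the row-by-row arrangement described above. The sketch makes several assumptions that the text does not confirm: $\sigma = 2^n + 1$, $\tau = \sigma/3$, $t' = t + s\rho$ (so that $\lambda^t\omega^s = \lambda^{t'}$), and a particular enumeration order in which $r'$ first selects $i$ and then the position among the allowed values of $t'$. The function name `unrank_l4` is hypothetical.

```python
def unrank_l4(n, r_prime):
    """Map r' to a triplet (i, t, s) under the assumed enumeration order."""
    sigma = 2 ** n + 1               # assumed number of columns
    tau = sigma // 3                 # assumed; multiples lie in columns 0, tau, 2*tau
    rho = (2 ** n - 1) * tau
    rows = 2 ** n - 1
    allowed_cols = sigma - 4         # all columns except 0, i, tau, 2*tau
    per_i = rows * allowed_cols      # allowed values of t' for a fixed i
    num_i = (2 ** (n - 1) - 1) // 3
    if n % 2 == 0 or not 0 <= r_prime < num_i * per_i:
        raise ValueError("n must be odd and r' within range")

    i = r_prime // per_i + 1
    row, col = divmod(r_prime % per_i, allowed_cols)
    # Shift the column past each forbidden column at or before it.
    for forbidden in sorted((0, i, tau, 2 * tau)):
        if forbidden <= col:
            col += 1
    t_prime = row * sigma + col
    s, t = divmod(t_prime, rho)      # t' = t + s * rho
    return i, t, s
```

The loop runs over exactly four forbidden columns, so the number of operations does not depend on $n$.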
Suppose we know that $\lambda = \alpha_1 \omega + \beta_1$ (each processor has to store the two values $\alpha_1$ and $\beta_1$). Then given $i$ and $j$ and using the fact $\omega^2 = \omega + 1$, $\alpha_i$, $\beta_i$, $\alpha_j$, and $\beta_j$ can be easily computed with $O(n)$ operations over $\mathbb{F}_{2^n}$. Recalling that $n \in O(\log N)$, where $N$ is the number of processors in the MOS, we have proved
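The conversion from $\lambda^i$ to the form $\alpha_i\omega + \beta_i$ can be illustrated by repeated squaring, which uses $O(n)$ multiplications because the exponent has at most $2n$ bits. The sketch below is a toy instance under explicit assumptions: it fixes $n = 3$, represents $\mathbb{F}_{2^3}$ by bit masks reduced modulo $x^3 + x + 1$, and stores an element $a\omega + b$ as the pair `(a, b)`. The function names and the choice of reduction polynomial are illustrative, not taken from the text.

```python
N_BITS = 3          # assumed n = 3
POLY = 0b1011       # assumed reduction polynomial x^3 + x + 1 for F_{2^3}

def gf_mul(a, b):
    """Multiply two elements of F_{2^3} given as 3-bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> N_BITS) & 1:
            a ^= POLY
    return result

def ext_mul(x, y):
    """Multiply (a*w + b) by (c*w + d) using w^2 = w + 1."""
    (a, b), (c, d) = x, y
    ac = gf_mul(a, c)
    return (ac ^ gf_mul(a, d) ^ gf_mul(b, c), ac ^ gf_mul(b, d))

def ext_pow(x, e):
    """Compute x^e by square-and-multiply; the pair (0, 1) is the element 1."""
    result = (0, 1)
    while e:
        if e & 1:
            result = ext_mul(result, x)
        x = ext_mul(x, x)
        e >>= 1
    return result
```

Given the stored pair for $\lambda$, a call such as `ext_pow(lam, i)` would return $(\alpha_i, \beta_i)$; here `lam` stands for whichever primitive element the processors store.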
