and (c) the choice between DiriB, Coarse Vector, and Gray-software depends on whether one wants to optimize for few sharers (DiriB), many sharers (Coarse Vector), or hedge one's bets between both alternatives (Gray-software).
1 Introduction
This paper considers medium-scale parallel computers, which we define as having 32 to 128 processors. Small-scale machines differ from medium-scale ones because they can have centralized resources (e.g., main memory) and are often designed primarily to run independent serial programs.
In contrast, large-scale machines must use distributed resources (e.g., processor-memory nodes) and are designed for asymptotic scalability, which may compromise performance on small versions of these systems. Medium-scale machines fall in between. They probably use processor-memory nodes to avoid the bottlenecks of small-scale machines, but they may occasionally use unscalable solutions, such as broadcasts, that large-scale machines avoid. Of course, others might pick different numbers for the exact boundaries of medium scale.
We expect that many medium-scale computers will support cache-coherent shared memory in hardware. Relative to message-passing multicomputers, hardware shared memory makes it easier to provide operating system support for multiple users, is a more straightforward target for automatic parallelization of serial programs, and allows programmers of explicitly-parallel programs to use pointers and ignore per-processor memory limits. Per-processor caches reduce average memory latency and bandwidth demand when some locality is present. Hardware cache coherence makes the caches functionally invisible so that compilers and operating systems can optimize for common cases rather than managing worst-case data sharing. For these reasons, we assume cache-coherent shared memory in this paper.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
Many protocols have been proposed for implementing cache coherence.
We assume that medium-scale computers are too large to rely on snooping a shared bus [2] but small enough that they need not be concerned about asymptotic scalability [10, 12].
Gray-hardware works exactly like Tristate except that processors are enumerated using a binary-reflected gray code, so that consecutive processor numbers differ by one bit.
Gray-software uses the same hardware as Tristate but shows how software can redistribute the work so that neighboring work is assigned to processors whose numbers differ in only one bit.
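The effect of gray-coded processor numbers can be illustrated with a minimal sketch. It assumes a Tristate-style sharing code in which each of the logN digits records 0, 1, or "both", so that the invalidation set grows by a factor of two for every bit position in which the sharers disagree. That encoding is not spelled out in this section and is an assumption here, as are the function names `gray` and `superset_size`.

```python
def gray(n: int) -> int:
    """Binary-reflected gray code of n."""
    return n ^ (n >> 1)


def superset_size(ids):
    """Processors covered by a tristate-style code for the given sharers.

    Assumption: a digit becomes 'both' whenever the sharers disagree in
    that bit position, so the covered set has 2**(disagreeing bits) members.
    """
    differing = 0
    for a in ids:
        differing |= a ^ ids[0]
    return 1 << bin(differing).count("1")


# Neighbors 7 and 8 disagree in four bits under plain binary numbering,
# but in only one bit once both numbers are gray-coded.
print(superset_size([7, 8]))              # 16
print(superset_size([gray(7), gray(8)]))  # 2
```

Under these assumptions, two consecutive sharers always map to a covered set of exactly two processors when numbered with a gray code, whereas plain binary numbering can cover many more.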
Finally, Home uses gray-coded processor numbers like Gray-hardware, but has a sharing code of only logN bits, where the j-th bit is set if the j-th bit of any sharer differs from the j-th bit of the home node number. Protocol features are summarized in [...]
[...] performs best of four with the same hardware as Tristate. For ocean, appbt, and barnes, respectively, Gray-software sends 1.0, 1.3, and 4.7 times as many invalidation messages as DirN. The barnes number is large due to a high degree of dynamic sharing. Coarse Vector performs better for barnes but worse for the other two applications, while Dir2B and Dir4B perform very poorly whenever there are more than two or four sharers, respectively (as is to be expected).
Thus, the choice of protocol between DiriB, Coarse Vector, and Gray-software will depend on whether one wants to optimize for few sharers (DiriB), many sharers (Coarse Vector), or hedge one's bets between both alternatives (Gray-software).
We see two key contributions for this paper. First, we introduce three new protocols: Gray-hardware, Gray-software, and Home [...]
In the simplest case, one can form a multi-dimensional gray code by concatenating gray codes from each dimension.
The multi-dimensional gray code for an N = 2^4 x 2^4 x 2^8-node mesh uses 16 bits: 4 from the first index, 4 from the second, and 8 from the third.
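The concatenation for the power-of-two case can be sketched as follows. This is an illustrative sketch rather than the authors' implementation; the helper name `mesh_gray` and the convention of placing the first index in the most significant bits are assumptions.

```python
def gray(n: int) -> int:
    """Binary-reflected gray code of n."""
    return n ^ (n >> 1)


def mesh_gray(index, widths):
    """Concatenate per-dimension gray codes into one processor number.

    index  -- mesh coordinates, one per dimension
    widths -- bits per dimension (each dimension has 2**width nodes)
    """
    code = 0
    for i, w in zip(index, widths):
        if not 0 <= i < (1 << w):
            raise ValueError("index out of range for its dimension")
        code = (code << w) | gray(i)
    return code


WIDTHS = (4, 4, 8)  # the 2^4 x 2^4 x 2^8 mesh: 4 + 4 + 8 = 16 bits
a = mesh_gray((3, 5, 127), WIDTHS)
b = mesh_gray((3, 5, 128), WIDTHS)
print(bin(a ^ b).count("1"))  # prints 1: mesh neighbors differ in one bit
```

Because each coordinate is gray-coded independently, stepping by one along any single dimension changes exactly one bit of the concatenated number.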
Forming a multi-dimensional gray code is more complex, however, if most dimensions are not powers of two.
In general, the problem is equivalent to the following graph embedding problem: 
New Protocols
Here we discuss specific implementation issues related to the three new protocols: Gray-hardware, Gray-software, and Home.
2.2.1
Gray-hardware
Table 3 gives raw invalidation message counts for most runs presented in this section. Figure 3 is an example of a graph triple we will use several times. The horizontal axis shows the number of processors, while the vertical axis shows the total number of invalidation messages with a protocol divided by the total number of invalidations for DirN.
4.1
DirN and Dir1B
Table 3 shows that for DirN the number of invalidation messages sent grows with the number of processors. For ocean, the increase is roughly a factor of four when we double the number of processors. The first factor of two comes from having near-neighbor sharing of twice as many boundary elements, because columns are now divided between twice as many processors. The second factor of two occurs because using more processors maps less data to each per-processor cache. Data not replaced by finite cache effects must instead be recalled with invalidation messages. When we double the number of processors and halve the cache size (not shown), ocean's invalidations just double.
For appbt, the number of invalidation messages grows with the number of processors because, although the sharing pattern distribution does not change, the number of boundary elements increases when the same 3D grid is divided among a greater number of processors. For barnes, the frequency of many dynamic sharers increases with the number of processors.
[...] processors, the number of invalidation messages for Dir1B is five to ten times that of DirN, while for 128 processors, the invalidation messages blow up to 40 times DirN for ocean and appbt. In ocean, sharing is predominantly between two neighboring processors, while in appbt it is primarily between a maximum of three processors.
Since the number of sharers does not increase with the number of processors, Dir1B sends more invalidation messages than necessary for a greater number of processors. In barnes, the frequency of many dynamic sharers increases with the number of processors.
As a result, Dir1B sends fewer unnecessary messages relative to DirN, resulting in a 15-times increase for 128 processors.
Tristate
The question now is: can the multicast protocols get close to DirN with much less state? Figure 4 displays the answer. Note that the vertical axis in this figure extends to 10 rather than 50, as in Figure 3. Figure 4 shows that Tristate is successful in keeping the invalidation message count closer to DirN. Unlike Dir1B, the number of messages does not grow rapidly with an increasing number of processors. For 128 processors, Tristate results in less than four and two times the invalidation messages of DirN for ocean and appbt, respectively. Results are relatively good because these benchmarks have a low degree of sharing, for which Tristate is optimized.
Interestingly, for a dynamic benchmark like barnes, whose potentially random sharing patterns could degrade the performance of Tristate, the invalidation messages are within a factor of five of DirN for 128 processors. It appears that sharing in barnes is not completely random in practice, and that the sharers are largely consecutive processors.
[Figure caption fragment: ... Dir1B and DirN relative to DirN itself for the same number of ...]
4.3
Gray-hardware, Gray-software, and Home
Gray-hardware improves upon Tristate when neighboring processors are involved in sharing (Section 2.2). This effect is predominant in ocean (Figure 4), where two consecutive processors share a column (Section 3). Gray-hardware reduces the number of messages sent by Tristate by a factor of two to three and sends almost exactly as many messages as DirN. For appbt, sharing is between neighboring processors in three dimensions (Section 3). Since Gray-hardware is targeted towards sharing in one dimension, it does not show any spectacular improvement over Tristate in this case. The improvement is about 4% over Tristate for 128 processors. In barnes, we have two effects: (a) the frequency of many sharers grows with the number of processors, and (b) the sharing pattern is dynamic. These imply that the sharers might not always be neighboring processors. Figure 4 shows that the improvement is roughly 8% for 128 processors.
Gray-software sends almost the same number of invalidation messages as Gray-hardware, or fewer. Thus, the extra hardware for gray coding and taking its inverse can be eliminated.
For ocean, Gray-software is almost identical to Gray-hardware because both protocols use one-dimensional gray coding. The results are more interesting for appbt, where three-dimensional gray coding is achieved in software, which exploits the 3D near-neighbor sharing pattern of the benchmark.
Here, for 128 processors, Gray-hardware sends 79% more invalidation messages than DirN.
Gray-software closes almost two-thirds of this gap to use only 30% more invalidation messages than DirN.
For barnes, there was no direct way to [...] The results are similar to ocean, in that there is almost no difference in the invalidation messages relative to Gray-hardware.
Home uses the same number of bits for the sharing code as Dir1B by using the home node number of a block as its reference number to do the encoding.
Home can perform as well as Tristate or Gray-software if data is placed so that the home node is one of the sharers, but Home will perform worse otherwise.
Since we did not control data placement, Figure 4 displays the latter case. Results show Home should not be used when data placement is not controlled.
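The sensitivity of Home to data placement can be seen in a small sketch of its sharing code, in which bit j is set when any sharer differs from the home node number in bit j. The decoding step (invalidating every node that matches the home node in all unset positions) and the names `home_code` and `home_targets` are assumptions made for illustration; processor numbers are taken to be already gray-coded.

```python
def home_code(home, sharers):
    """logN-bit sharing code: bit j is set if any sharer differs from
    the home node number in bit j."""
    code = 0
    for s in sharers:
        code |= s ^ home
    return code


def home_targets(home, code, bits):
    """Assumed decoding: every node that agrees with the home node in
    all bit positions where the code is 0."""
    return [p for p in range(1 << bits) if (p ^ home) & ~code == 0]


sharers = [0b101, 0b100]

# Home node 0b000 is not a sharer: two set bits, four targets.
print(home_targets(0b000, home_code(0b000, sharers), 3))  # [0, 1, 4, 5]

# Home node 0b100 is one of the sharers: one set bit, two targets.
print(home_targets(0b100, home_code(0b100, sharers), 3))  # [4, 5]
```

In this toy case the same pair of sharers costs two unnecessary invalidations when the home node lies elsewhere and none when the home node is itself a sharer.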
Coarse Vector, Dir2B, and Dir4B
Figure 5 displays the results for Coarse Vector, Dir2B, and Dir4B versus the just-discussed Gray-software.
To be fair, we use the same number of bits for the sharing code of Coarse Vector as in Gray-software: 2 x logN. For regular applications with well-defined sharing patterns and a low number of sharers, like ocean and appbt, Coarse Vector is worse than Gray-software (Figure 5), and the difference grows with an increasing number of processors. For 128 processors, the deterioration is around a factor of four for these benchmarks.
However, for dynamic sharing patterns like those in barnes, with a large number of sharers, Coarse Vector shows a slower degradation rate with an increasing number of processors, and is consistently better than Gray-software (Figure 5). We found that for barnes (not shown), Coarse Vector is worse than Gray-software when the number of sharers equals two, but it becomes progressively better than Gray-software as the number of sharers increases. Even though Coarse Vector does better than Gray-software for barnes, both perform much worse than DirN due to the high degree of sharing.
We assume notifying protocols and no special network support for broadcasts or multicasts.
We measure performance using the total number of invalidation messages rather than total execution time to focus on how the protocols differ and to avoid having to vary network topology and link capacity assumptions. Coarse Vector performs worse than Gray-software for ocean and appbt, which have few dynamic sharers, but better for barnes, which more frequently has many dynamic sharers. This "more stable" behavior of Coarse Vector occurs because it never sends more than (K - 1) x i unnecessary invalidation messages for i sharers, with each bit representing K processors. Not surprisingly, DiriB is less stable than both Coarse Vector and Gray-software, because it sends N messages when there are more than i sharers. This rarely occurs in ocean, occurs significantly in appbt for Dir2B but not for Dir4B, and occurs significantly in barnes for both Dir2B and Dir4B.
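The stability argument can be made concrete with a short sketch that counts invalidation targets under each scheme. It assumes that Coarse Vector groups are contiguous blocks of K processors and that a limited-pointer broadcast protocol falls back to all N processors once its pointers overflow. The function names, the group size K = 8, and the example sharer sets are hypothetical values chosen for illustration, not results from the study.

```python
def coarse_vector_targets(sharers, k):
    """Invalidations sent when each directory bit covers k processors.

    Assumption: groups are contiguous blocks of k processor numbers.
    """
    groups = {s // k for s in sharers}
    return len(groups) * k


def dir_i_b_targets(sharers, i, n):
    """Invalidations sent by a limited-pointer broadcast directory:
    exact while the sharers fit in i pointers, otherwise all n nodes."""
    return len(sharers) if len(sharers) <= i else n


N, K = 128, 8
sharers = [0, 8, 16]  # three sharers, each in a different group

# Worst case for Coarse Vector: (K - 1) unnecessary messages per sharer.
print(coarse_vector_targets(sharers, K) - len(sharers))  # 21

# Two pointers overflow with three sharers, so all N nodes are targeted.
print(dir_i_b_targets(sharers, 2, N))  # 128
```

The Coarse Vector overhead stays bounded by the number of sharers times K - 1, while the pointer-based scheme jumps from an exact count to N as soon as one extra sharer appears.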
Thus, the choice of protocol between DiriB, Coarse Vector, and Gray-software will depend on whether one wants to optimize for few sharers (DiriB), many sharers (Coarse Vector), or hedge one's bets between both alternatives (Gray-software).
The scope of any experimental study is finite. Our study compares directory protocols that have very similar implementations.
Specifically, we examined implementations that differ primarily in how the sharing code is encoded. We chose to exclude protocols that use traps [5, 11], distributed directories [10], directory caching [9], and several other optimizations [13, 14], because setting the plethora of implementation assumptions needed for these alternatives would have compromised the generality of our study. We did not examine DiriNB because it performs poorly without a special mechanism for handling read-only data [17]. Nevertheless, in some situations, the protocols we did not study may perform better than the ones we did study.
