We demonstrate the bene ts of software shared memory protocols that adapt at run-time to the memory access patterns observed in the applications. This adaptation is automatic | no user annotations are required | and does not rely on compiler support or special hardware. We investigate adaptation b e t ween single-and multiple-writer protocols, dynamic aggregation of pages into a larger transfer unit, and adaptation b e t ween invalidate and update. Our results indicate that adaptation b e t ween single-and multiple-writer and dynamic page aggregation are clearly bene cial. The results for the adaptation b e t ween invalidate and update are less compelling, showing at b e s tgains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.
I. Introduction
Many di erent protocols have been proposed for implementing a software shared memory abstraction on distributed memory hardware. The relative performance of these protocols is application-dependent: the memory access patterns of the application determine which protocols exhibit g o o dperformance. It is therefore appealing to build a system with multiple protocols, and let the system choose b e t ween the di erent protocols based on the access patterns it observes in the application. In this paper we present the design of such an adaptive software distributed shared memory system and evaluate its performance.
Speci cally, this paper focuses on protocols that implement the lazy release consistency (LRC) memory model 16] . We furthermore assume that shared memory accesses are detected using virtual memory protection. This paper explores the bene ts of LRC protocols that adapt to the memory access patterns of the applications, by comparing their performance to non-adaptive versions of the protocols. In particular, we investigate: 1. adaptation b e t ween single-and multiple-writer protocols, including adaptation to migratory access patterns, 2. dynamic aggregation of pages into larger transfer units, and 3. adaptation b e t ween invalidate and update protocols. The adaptations considered in this paper are triggered automatically: the run-time system detects certain access patterns and switches b e t ween protocols accordingly. This automated adaptation distinguishes our work from so called multi-protocol software shared memory implementations (e.g. 8]), in which the user has to annotate the program to select the appropriate protocol. In our experience, removing the need for annotation leads to much improved usability.
The adaptive protocols were implemented in TreadMarks 1]. Our experimental platform is a switched 100Mbps Ethernet consisting of eight 166Mhz Pentium Pro machines running FreeBSD. We use eight applications to demonstrate the performance of the adaptive protocols: 3D-FFT, CG, MG and IS from the NAS b e n c hmark suite 4], Water and Barnes-Hut from the SPLASH b e n c hmark suite 22], Gauss from the TreadMarks distribution, and ILINK from the FASTLINK package 21]. The results indicate that: 1. Adaptation b e t ween single-and multiple-writer and dynamic aggregation perform well, in some cases showing substantial performance improvement, and never decreasing performance. 2. Adaptation b e t ween invalidate and update is less successful, with performance improvements that match dynamic aggregation in some cases and substantial p e r f o rmance losses in others.
The outline of the rest of this paper is as follows. Section I Ipresents the necessary background information about LRC. Section I I Ipresents the possible protocol choices for implementing LRC, and the policies and mechanisms by which the adaptive protocols choose b e t ween their alternatives. Section IV describes the experimental environment. Section V describes the applications used. Section VI presents the results of the performance comparison. Section VII discusses related work. Section VIII presents our conclusions.
I I .Programming Model
We assume an explicitly parallel programming model, with primitives for process creation and destruction, synchronization, and shared memory allocation and deallocation. Synchronization primitives include mutual exclusion locks and barriers. Shared memory is accessed through load and store instructions. The memory consistency model presented to the user is release consistency (RC), a relaxed memory model 13].
In RC, ordinary shared memory accesses are distinguished from synchronization accesses, with the latter category subdivided into acquire and release accesses. Lock synchronization maps onto acquires and releases in the obvious way: a lock operation corresponds to an acquire, and an unlock corresponds to a release. With barriers, a barrier arrival corresponds to a release, whereas a barrier departure corresponds to an acquire. Roughly speaking, RC requires that, b e f o r ea release by a processor p becomes visible to another processor q, all ordinary shared memory modi cations by processor p become visible to processor q. The Lazy Release Consistency (LRC) algorithm 16], one of the possible RC implementations, delays the propagation of shared memory modi cations by processor p to processor q until q executes an acquire corresponding to a release by p.
Programs without data races, i.e., programs with sufcient synchronization such that any pair of con icting memory accesses is separated by a release-acquire pair, produce the same results on an RC or an LRC memory system as on a conventional sequentially-consistent memory system 17]. Performance, however, can b emuch improved by the use of RC or LRC, especially for software implementations of shared memory, because the messages propagating the shared memory modi cations can b edelayed and coalesced with the synchronization messages, leading to a substantial reduction in communication 8], 16]. In addition to b e i n gdata-race-free, all synchronization in the program must b edone through the primitives supplied by the runtime system, so that it can take the required consistency actions at synchronization p o i n ts.
We assume that the shared memory is implemented as a global virtual memory segment, shared by all processors. The virtual memory protection hardware is used to detect access to individual pages. Some of these accesses may cause page faults, which then trigger protocol operations as described in the next section. In a single-writer protocol, there is a single writable copy of a page at any given time 14]. The processor currently holding the writable copy of a page is called the owner of that page. Several read-only copies of the page may coexist with the writable owner copy. According to the denition of RC, a read-only copy may b etemporarily inconsistent with the writable copy, but it must b ebrought upto-date when the processor on which it resides synchronizes with the owner. Assume, for instance, that an invalidate protocol is used and synchronization is by means of a barrier. If the owner has modi ed a particular page, it creates an owner write notice for that page, containing its processor id and a version numb e r .The barrier protocol causes the owner write notice to b etransmitted to the processors which have read-only copies of the page, and these processors then invalidate their copies. On a subsequent access miss, they retrieve the page from the owner. The owner write-protects his copy during the rst retrieval. Before a processor may write on a page, it must obtain ownership from the current owner. The owner is located by means of the write notice with the highest version numb e r ,possibly by forwarding if ownership has changed since this write notice was received. Once ownership is obtained, the page's version numb e ris incremented by one.
In contrast, in a multiple-writer protocol, there may b e several writable copies of a page on di erent processors 8]. Each processor with a writable copy records its own modi cations to the page by a technique called twinning and di ng. Pages are initially write-protected so that the rst write access to a page causes a protection violation. At this p o i n t, the system makes a copy of the page, the twin, and unprotects the original page. To detect what modi cations have been made to a page, the current copy is compared word-by-word to the twin, and a record of the modi cations, the di , is constructed. Continuing the above example, when an invalidate protocol is used and synchronization is by means of barriers, each processor that has modi ed a page constructs a write notice for that page, which is forwarded by the barrier protocol to all processors with copies of that page. 1 A processor might receive several write notices for a single page. These write notices cause the page to b einvalidated. On an access miss, the di s corresponding to these write notices have to b eretrieved and applied to the processor's current copy of the page.
The tradeo b e t ween single-and multiple-writer protocols is dependent on the access pattern to the page, and a ects both execution time and memory overhead. If multiple processors write concurrently to di erent parts of a page (write-write false sharing), then multiple-writer protocols achieve better performance, because they do not incur the cost of transferring the page over the network to the next writer. Even if there is only a single writer, it may b eadvantageous to use twinning and di ng. This scenario occurs when the writer modi es only a small portion of the page. The multiple-writer protocol transmits only those modi cations, while a single-writer protocol transmits the entire page.
If, however, only a single processor writes to a page at any given time, and this processor modi es a large part of the page, then the single-writer protocol avoids the cost of twinning, di ng, and di application, without much increase in communication. More importantly, it avoids a pitfall of the multiple-writer protocol, called di accumulation 18], a scenario in which a numb e rof partially or completely overlapping di s are transmitted, signi cantly increasing the amount of communication. While it is p o s s ible to modify the multiple-writer protocol to eliminate the overlap, there is a high computational cost to pruning useless data from older di s each time a new di is created. It is more e cient to manage the page in single-writer mode.
Finally, while the memory overhead for the single-writer protocol is negligible, the multiple-writer protocol has to allocate memory for the twins and the di s. This extra overhead may cause an application to page to disk with a multiple-writer protocol, while running in memory with a 1 The information in the write notices of the multiple-writer protocol is more complicated than the version numb e rpresent in the owner write notices of the single-writer protocol. In particular, it contains a vector timestamp that allows the write notice to b epartially ordered w.r.t. write notices from other processors.
single-writer protocol.
A.2 Adapting b e t ween Single-and Multiple-Writers
In the adaptive protocol used in this paper, all pages start out in multiple-writer mode. A page may switch to single-writer mode by one of two events: 1. A processor receives a di request for a page, and it has modi ed the entire page. In this case, the page is clearly single-writer, and there is no reduction in communication by sending a di . 2. A processor sends out di requests for a page, it receives no concurrent di s, and the sum of the sizes of the di s received is bigger than the page size. This is indicative of the di accumulation phenomenon discussed earlier.
Since there are no concurrent di s, there is no write-write false sharing. Looking ahead to the time where a di erent processor requests the di s for this page, keeping the page in multiple-writer mode would cause more data to b esent than a page. It is therefore more e cient to put the page in single-writer mode.
A page may switch back to multiple-writer mode at the onset of write-write false sharing, which is detected by the ownership refusal protocol, a modi cation to the single-writer protocol for locating and transferring ownership 2]. On a release (an unlock or a barrier arrival), a processor communicates both its owner write notices and its multiple-writer write notices. On a write fault to a page in single-writer mode, a processor requests ownership, as in the single-writer protocol: The owner is located, using the owner write notice with the highest version number. This version numb e ris included in the ownership request message. If the recipient of the message is no longer the owner, or if the version numb e rhas changed, write-write false sharing has b e e ndetected, the ownership request is refused, and the page is put into multiple-writer mode. Otherwise, ownership is granted, the old owner write protects its copy of the page, the requester becomes the new owner, the version numb e ris incremented, and the page stays in single-writer mode.
The essential aspect that needs to b eunderstood about this protocol is that it correctly detects the presence or absence of write-write false sharing. Consider the example of a data item protected by a lock, and assume that there is no write-write false sharing on the page containing that data item. When processor p acquires the lock, it receives the owner write notice from the previous owner q with version numb e rV . When p writes on the page, it incurs a page fault, and it tries to achieve ownership. It sends an ownership message to q, including the version numb e rV . By our assumption that there is no write-write false sharing on the page, no other processor has attempted to write on the page, and therefore q is still the owner and the page version's numb e ris still V . Therefore, the ownership is granted, and p becomes the new owner. Consider next the case where there is write-write false sharing on the page, either because q or some other processor wrote on a different part of the page. If q wrote to the page, it must have re-acquired ownership of the page, and thus it must have incremented the version numb e r . If a di erent processor wrote to the page, it must have acquired ownership, and q is no longer the owner. In either case, p's ownership request is refused, and the page is put in multiple-writer mode. For a more detailed description and a correctness argument, we refer the reader to Amza et al. 2].
A.3 Adapting to Migratory Access
Adaptation to migratory access only makes sense in the context of an adaptive protocol operating in single-writer mode (or a single-writer protocol) where its purpose is to eliminate the need for explicit ownership messages. Compared to the base multiple-writer protocol, the adaptive protocol requires an extra message to acquire ownership in the following scenario. A processor takes a read fault on an invalid page, obtains the di s to validate the page, and then later takes a write fault on the page. With the base multiple-writer protocol, a twin is created but no messages are sent at the time of the write fault. With the adaptive protocol, an ownership request is sent. The scenario described is that of a migratory access pattern: a sequence of reads followed by a sequence of writes by one processor with no intervening accesses by other processors 26].
Detecting migratory access and eliminating the explicit ownership message is straightforward 9], 24]. If a page is migratory, when a processor performs its rst read from the page, it will fault because the page is invalid. Its request for the page will go to the processor that still owns the page. If that processor accessed the page in a similar, migratory fashion, it will preemptively send ownership along with the page. Later, if the page changes access pattern, for example, to producer/consumer, the overhead to switch will b eone ownership request.
B. Adaptive Run-time Aggregation of Pages B.1 The Basic Protocols and Their Tradeo s
Software DSM systems based on virtual memory techniques traditionally use the hardware page as the unit of access detection and as the unit of transfer. The singlewriter, multiple-writer, and adaptive protocols discussed in Section III-A all follow this approach. Depending on whether a single-or multiple-writer protocol is used, a di or a whole page is transferred, but in b o t hcases, access detection is done on a per-page basis, and the data transferred in a page fault response always pertains to a single hardware page. For simplicity, the discussion in this section is cast in terms of the multiple-writer protocol, unless otherwise noted, but it can easily b eextended to the singlewriter protocol and the adaptive single-writer/multiplewriter protocol described in Section I I I -A .
Both the unit of access detection and the unit of transfer can b eincreased, for instance by using a multiple of the hardware page size. Doing so trades o aggregation vs. the potential for increased false sharing. Aggregation reduces the numb e rof messages exchanged. If a processor accesses several pages in succession, a single page fault request and reply now su ce, where before multiple exchanges were re-quired. As a secondary bene t, the numb e rof page faults is also reduced. These gains, however, come at the expense of potentially increased false sharing. False sharing may lead to an increase in the amount of data exchanged. Assume, for instance, that processor p writes to successive pages a and b, and processor q accesses only a. With the base page size, only the di s for a are transferred, but if the page size is doubled, the di s for a and b are transferred. Worse, false sharing may also lead to an increase in the numb e rof messages.
B.2 The Adaptive Protocol
In this section, we present a protocol that continues to use the hardware page as the unit of detection, but adaptively coalesces pages into page groups for the purpose of transfer. The algorithm monitors the access patterns on each processor, and tries to construct page groups so as to increase aggregation without incurring the harmful e ects of false sharing.
The di s for all of the pages in a group are requested at the rst fault on any page that is a memb e rof the group. Requests addressed to the same processor are combined into one message, resulting in fewer request messages and enabling the data transfer to occur in one message as well. Even if the di s must come from di erent processors, there is still an advantage to requesting the di s for all pages in the group at once, because those processors can return the di s in parallel rather than in sequence.
A processor uses two di erent mechanisms for grouping pages. The rst mechanism is based on the past accesses on that processor itself. Essentially, the processor groups pages that were accessed during the previous synchronization interval. In order to avoid packet loss in the network, the implementation limits the maximum numb e rof pages in a single group to eight. Thus, more than one group may b eformed at a synchronization p o i n t. If two or more groups are formed, the pages are assigned to groups in the order they were accessed. The second mechanism is based on past accesses of other processors. It comes into play only if the rst mechanism did not produce a group for the missing page. The faulting processor checks if the page was modi ed by a single processor during the previous synchronization interval, and, if so, it requests from that processor any contiguous pages that were modi ed during that interval. Again, the numb e rof pages in any group is limited to eight.
In order to allow the membership of a group to change over time, the algorithm keeps every page invalid until the rst access to that page o c c u r s .Thus, a page may b ekept invalid, even though it has been updated by an access to another page within the same group. When the page fault handler is triggered by an access to such a page, it can simply change the page's state to valid without requesting any data. In this case, the page will remain a part of its group. If, however, the page is never accessed, it will b e dropped from the group at the next synchronization p o i n t. Hence, this strategy allows the algorithm to adapt to any change in the program's access pattern over the course of its execution.
C. Invalidate vs. Update C.1 The Basic Protocols and Their Tradeo s
In an invalidate protocol, a page is invalidated when the processor becomes aware of a remote modi cation. In LRC, this happens at the time of a synchronization. A synchronization message, for instance, a lock grant or a barrier departure message, contains a numb e rof (owner) write notices, indicating which pages have b e e nmodi ed. When the processor later accesses one of these pages, it incurs an access miss. Depending on whether a single-or multiplewriter protocol is in use, either the whole page or the di s are fetched. In an update protocol, instead, the modi cations to the page are sent with the synchronization message. Pages are never invalidated.
The tradeo s b e t ween invalidate and update protocols are well known 12]. Update protocols send substantially more data, including data that the processor may never access or that may b eoverwritten by newer data before the processor accesses the data originally sent. Invalidate protocols only retrieve the data for the pages the processor accesses, but they pay the penalty of the access miss fault and the round-trip latency to get the modi cations. In addition, in release-consistent software DSM, update protocols naturally include aggregation: when a processor modi es several pages, all the modi cations are sent in a single message to the other processor(s).
C.2 The Adaptive Protocol
The adaptive invalidate/update protocol updates the pages that the processor is expected to access and invalidates the other pages. As with the aggregation for invalidate protocols described in Section III-B, we limit a single update message to contain data for no more than eight pages in order to avoid packet loss in the network. Prediction of future accesses may b edone in a variety of ways. For programs based on barriers, each processor p records the set of processors from which it receives a page fault request for a particular page. When p arrives at the next barrier, if it has modi ed a particular page, it sends updates for that page to the processors in the set it has computed during the interval before the barrier 15]. These processors return negative acknowledgements to these updates, if they receive a second update for a page and have not accessed the page since the rst update. For data protected by a lock, we use the method proposed by, among others, Monnerat and Applications, input data sets, synchronization (l=locks, b=barriers), and sequential execution time record which pages a processor modi es while it holds the lock. Updates for these pages are sent to the next acquirer of the lock, while any other modi ed pages are invalidated.
IV. Experimental Environment
Our experimental platform is a network of eight 166MHz Pentium Pros running FreeBSD 2.2.5. Each machine has a 256K byte secondary cache and a 64M byte memory. The hardware page size is 4K bytes. The network connecting the machines is a switched, full-duplex 100Mbps Ethernet.
TreadMarks uses the UDP/IP protocol for interprocessor communication. The round-trip latency for a 1-byte message using the UDP/IP protocol is 196 microseconds on this platform. The time to acquire a lock varies from 256 to 393 microseconds. The time for an eight processor barrier is 481 microseconds. The time to obtain a di varies from 387 to 1,225 microseconds.
V. Applications
We use eight applications in this study. Water and Barnes-Hut come from the SPLASH benchmark suite 22]. Integer Sort (IS), 3D-FFT, Multigrid (MG) and Conjugate Gradient (CG) come from the NAS b e n c hmark suite 4]. Gauss is a Gaussian elimination kernel distributed with TreadMarks. ILINK is part of the FASTLINK package 21] of genetic linkage analysis programs. Table I summarizes the relevant characteristics of the applications. It includes for each application, the data set size used, the method of synchronization (locks, barriers, or both), and the sequential running times. Sequential running times were obtained by removing all synchronization from the TreadMarks programs these times were used as the basis for the speedup gures reported later in the pap e r .
VI. Results
For each of the applications we show speedups under the following scenarios: 1. the single-and multiple-writer protocols and the adaptive single-writer/multiple-writer protocol, 2. the adaptive single-writer/multiple-writer protocol plus dynamic aggregation, and 3. the adaptive single-writer/multiple-writer protocol plus invalidate/update adaptation, including aggregation of the updates. The e ects of dynamic aggregation are independent of whether the base protocol is the single-writer, multiplewriter, or adaptive single-writer/multiple-writer protocol. Hence, we only present the results for dynamic aggregation using the base protocol with the b e s toverall performance, the adaptive single-writer/multiple-writer protocol.
Similarly, the e ects of adaptation b e t ween invalidate and update are the same for the single-writer, multiplewriter, and adaptive single-writer/multiple-writer protocol. Furthermore, since the \update" part of the adaptive invalidate/update protocol inherently includes aggregation, and since aggregation is always bene cial with invalidate protocols, we compare the invalidate-based adaptive singlewriter/multiple-writer protocol with aggregation to the adaptive invalidate/update, single-writer/multiple-writer protocol.
A. Single-vs. Multiple-Writer Protocol Figure 1 shows the speedup on eight processors for each of the applications using the single-writer protocol, the multiple-writer protocol, and the protocol that adapts b etween the two, including the adaptation to migratory accesses. An invalidate protocol using the hardware page size is used, as in the base TreadMarks system.
We rst compare the non-adaptive single-and multiplewriter protocols. As expected, the amount of write-write false sharing determines the tradeo . The single-writer protocol performs b e t t e rthan the multiple-writer protocol on applications with no write-write false sharing and large overlapping di s (IS), performs comparably on applications with low write-write false sharing (Water, 3D-FFT, Gauss), and worse for applications with high write-write false sharing (Barnes, MG, CG, and ILINK). Comparing the adaptive to the non-adaptive protocols, we see from Figure 1 that the adaptive protocol matches or exceeds the speedup of the b e s tof the non-adaptive protocols. The adaptation that optimizes migratory access only affects IS. None of the other programs, such as Water, that have migratory data modify the entire page or su er from signi cant di accumulation. Consequently, they do not switch to single-writer mode, and thus the migratory optimization is not needed. For IS, it limits the ownership messages to one p e rpage p e riteration, instead of eight.
We do not present the memory demands for the protocols here, but, we o er the following anecdote: for a larger 3D-FFT data set (256 128 128), the single-writer and adaptive protocols performed well, running completely in main memory, while the multiple-writer protocol paged b ecause of the twins and di s it stored, causing a 15-fold increase in execution time. Figure 2 shows the speedups achieved with dynamic page aggregation, in addition to adapting b e t ween singleand multiple-writer and adapting to migratory access. As a baseline for comparison, we reiterate in Figure 2 the speedups from Figure 1 for the adaptive singlewriter/multiple-writer protocol. Five out of the eight applications bene t from dynamic page aggregation: BarnesHut, 3D-FFT, IS, MG, and CG. The bene ts for IS derive from the aggregation based on write accesses by other processors, while the bene ts for the other four applications derive from the past access patterns by that processor itself. In IS, which sees the greatest bene ts, processors exchange a large amount of data, leading to a signi cant reduction in the numb e rof messages, with the attendant performance bene ts. A similar argument explains the somewhat smaller improvements for Barnes-Hut, 3D-FFT, CG and MG.
Surprisingly, three of the applications that b e n e tfrom aggregation, Barnes-Hut, MG, and CG, su er from writewrite false sharing. This illustrates the fact that dynamic page aggregation can reduce the numb e rof messages without increasing false sharing.
C. Invalidate vs. Update Figure 3 shows the speedups for the protocol that adapts b e t ween invalidate and update (including aggregation of the updates), adapts b e t ween single-and multiple-writer, and adapts to migratory access. The results are shown along side those for the invalidate protocol that p e r f o r m s dynamic page aggregation from Figure 2 .
The bene ts of automatic adaptation b e t ween invalidate and update are questionable when such a protocol is compared to a base protocol that performs aggregation. Its bene ts are limited to avoiding page faults and round-trip latencies, resulting in a small improvement. In many cases, these improvements are o set by the additional data transfer. Typically, the additional data transfer comes from changes in the sharing pattern. For example, IS consists of a numb e rof iterations, each of which is divided into a numb e rof migratory phases followed by a phase in which the data produced by any single processor is consumed by all other processors. This latter phase causes the adaptive algorithm to send updates to all processors in the rst migratory phase of the next iteration. The negative acknowledgements halt these updates after two migratory phases, but the large amount of unnecessary data sent in these two phases causes performance to deteriorate substantially. Adaptation b e t ween invalidate and update is, however, attractive in some cases if the invalidate mode of the base protocol does not support aggregation. 27] ). Although the work described here is done in the context of a speci c system, TreadMarks, many of the ideas are applicable to other systems as well. First, the adaptation b e t ween single-and multiple-writer protocols carries over to all page-based systems. Second, aggregation should prove to b ebene cial to all systems, especially the ones that use smaller consistency units. Finally, the tradeo b e t ween update and invalidate also applies to these other systems, although the nature of the tradeo may change substantially if compiler support is used to determine the choice b e t ween update and invalidate 5], 11].
The multiple-writer protocol described in this paper is the one in use with the current version of TreadMarks 1]. The single-writer protocol is a variation of the one presented by Keleher 14] . The adaptive singlewriter/multiple-writer protocol extends our earlier work 2] on this topic. In this earlier work, we chose a protocol that started out in single-writer mode, because of its reduced memory use (no twins are ever made for pages that remain in single-writer mode). We found that the same reduction in memory use can b eachieved by a protocol that starts out in multiple-writer mode, by not creating the initial twin, which contains all zeroes. Starting in multiple-writer mode allows for a straightforward adaptation according to the size of the di s.
The adaptive DSM system described by Monnerat and Bianchini 19] is most closely related to our work. They also investigate the adaptation b e t ween single-and multiplewriter protocols, and adaptation b e t ween invalidate and update. In their system, pages are classi ed as migratory, producer/consumer or falsely shared. Single-writer mode is used for migratory and producer/consumer pages, while the falsely shared pages are maintained in multiplewriter mode. Updates are used only for migratory and producer/consumer pages. Keleher et al. 15] and Espeight et al. 23 ] have also investigated the b e n e t sof allowing a software shared memory system the choice b e t ween invalidate and update. However, to the b e s tof our knowledge, all of these studies were conducted in the absence of aggregation for the invalidate protocol, in ating the perceived bene ts of update. We have demonstrated that communication aggregation is the key to improving performance in b o t hinvalidate and update protocols. Adding dynamic aggregation to the invalidate protocol provides the same bene ts as using an update protocol, without the risk of sending extra messages. Amza et al. 3] investigated the bene ts of dynamic page aggregation. They did not, however, combine aggregation with other forms of adaptation. Lu et al. 18] found that aggregation is the main reason that message-passing programs outperform (software) shared-memory programs. Overall, they found that for six out of their eight applications the speedup on TreadMarks was within 85% of that achieved by PVM. With the best static page aggregation for each of those six applications, the speedup on TreadMarks improved to within 95% of the speedup on PVM. These results were obtained on two platforms, one of which, the 155Mbps ATM network of eight SPARCstation-20 Model 61 workstations, is similar to the platform used in this paper.
Our adaptive single-writer/multiple-writer protocol addresses the most extreme cases of a less common problem, di accumulation, found by Lu et al. 18] . Di accumulation in IS contributed to the worst performance with respect to PVM. TreadMarks' speedup was only 42% of PVM's. With di accumulation manually removed, the speedup improved to within 71% of the speedup on PVM. Our adaptive protocol automatically achieves a similar improvement.
Several other systems both hardware and software have investigated con gurability or adaptivity as a means of improving performance.
Shasta 20] features con gurable consistency units to address the requirements of applications with ne-grain sharing at the expense of higher memory overheads.
Munin 8] uses multiple protocols to handle data with di erent access characteristics. The novelty in our work is that it chooses automatically b e t ween di erent protocols. In Munin, the choice of protocol was based on somewhat burdensome user annotations.
Cashmere 25] improves on the home-based protocol introduced by Zhou et al. 27] , allowing dynamic migration of the home node. The home-based protocol allows a singlewriter optimization that avoids di ng overhead when the home node is the only writer for the page. The downside is that whole pages are fetched on faults, even if the amount of data modi ed is small.
Dubnicki and LeBlanc 10] proposed a scheme to reduce the impact on performance due to a mismatch b e t ween the cache block size and the sharing patterns exhibited by a given application. They adjusted the amount of data stored in a cache block according to recent reference patterns. They found that the adjustable cache-block-size implementation did better than the b e s t xed-size implementations for most of the programs in their suite.
The adaptation to migratory behavior was rst suggested by Cox and Fowler 9] and Stenstrom et al. 24] in the context of hardware shared memory machines.
Another form of adaptivity that is important in networks of workstations is adapting to environmental characteristics such as processor and network load 6], 7]. This form of adaptivity is orthogonal to the one discussed in this paper.
V I I I .Conclusions
We have described software DSM protocols that automatically adapt, on a per-page basis, to the access patterns in the application. The protocols dynamically choose b etween single-and multiple-writer protocols. Pages can b e dynamically aggregated into larger page groups. Finally, the protocols choose dynamically b e t ween invalidate and update. All adaptation is automatic.
The choice b e t ween the single-and multiple-writer protocols is based on the presence of write-write false sharing and on write granularity. In addition, the protocol detects migratory behavior, and chooses a version of the protocol optimized accordingly. Aggregation uses records of earlier accesses by a processor to coalesce pages into page groups, in the expectation that those pages will b eaccessed again by the processor. The choice b e t ween invalidate and update is based on whether we expect the destination to access the modi ed data before it is overwritten or not. The three adaptations can easily b ecombined.
Adaptation b e t ween single-and multiple-writer and dynamic aggregation proved to b ethe most bene cial, never causing any deterioration and providing substantial improvement for some applications. Our automatic adaptation b e t ween invalidate and update was less successful, showing at best gains equal to the dynamic aggregation adaptation and at worst serious performance deterioration. We speculate that it may b edi cult to nd a fully automatic, purely run-time algorithm for adaptation b e t ween invalidate and update, and that either compiler or user input may b enecessary to achieve good performance.
