The Relative Importance of
Concurrent Writers and Weak Consistency Models by Keleher, Pete
Pete Keleher
Department of Computer Science
University of Maryland
College Park, MD 20742-3255
keleher@cs.umd.edu

Abstract

This paper presents a detailed comparison of the relative importance of allowing concurrent writers versus the choice of the underlying consistency model. Our comparison is based on single- and multiple-writer versions of a lazy release consistent (LRC) protocol, and a single-writer sequentially consistent protocol, all implemented in the CVM software distributed shared memory system.

We find that in our environment, which we believe to be representative of distributed systems today and in the near future, the consistency model has a much higher impact on overall performance than the choice of whether to allow concurrent writers. The multiple-writer protocol performs an average of 9% better than the single-writer LRC protocol, but 34% better than the single-writer sequentially consistent protocol. Set against this, MW-LRC required an average of 72% memory overhead, compared to 10% overhead for the single-writer protocols.

1 Introduction

Sophisticated page-based distributed shared memory (DSM) systems achieve high performance through a combination of weak memory models and multiple-writer protocols. Although these techniques are often cited as co-equal factors in good performance, no study has quantified their individual contributions to overall performance.

Hardware shared memory systems typically use single-writer protocols to keep caches coherent. These protocols allow multiple readers to access a given datum simultaneously, but require a writer to gain exclusive ownership of a datum before modifying it. Single-writer protocols are easy to implement because all copies of a given datum are usually identical. Faults can be satisfied by retrieving a new copy of the data from any other processor that has a copy.
Unfortunately, the simplicity comes at the expense of message traffic. Before a datum can be modified, all other copies must be invalidated, requiring those processors to take access faults and retrieve new copies of the page if they are still accessing it.

The unit of sharing in a page-based DSM is a virtual memory page, which is much larger than the cache lines used in hardware shared memory systems. DSM's larger coherence granularity causes it to suffer from increased coherence traffic because of false sharing, or simultaneous accesses by different processors to unrelated parts of the datum.

Because relaxed consistency models can hide the effects of false sharing, software DSMs such as TreadMarks [9], Munin [2], and CarlOS [12] support multiple-writer protocols. These protocols allow multiple nodes to simultaneously modify different sections of the same page. The modifications are later reconciled by creating summaries of each of the modifications, called diffs [2], and applying the diffs to all copies of the page.
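
The effect of coherence granularity on false sharing can be illustrated with a minimal sketch. The page size, cache line size, and addresses below are assumptions chosen for illustration; none of them come from the paper.

```python
# Illustrative sizes only: both values are assumptions, not figures from the paper.
PAGE_SIZE = 4096        # bytes per virtual memory page
CACHE_LINE_SIZE = 32    # bytes per hardware cache line


def coherence_unit(addr, unit_size):
    """Index of the coherence unit (page or cache line) that holds addr."""
    return addr // unit_size


def falsely_shared(addr_a, addr_b, unit_size):
    """True when two distinct addresses fall in the same coherence unit."""
    same_unit = coherence_unit(addr_a, unit_size) == coherence_unit(addr_b, unit_size)
    return addr_a != addr_b and same_unit


# Two processors update unrelated variables that lie 1 KB apart (hypothetical addresses).
x_addr = 0x10000
y_addr = 0x10400

line_conflict = falsely_shared(x_addr, y_addr, CACHE_LINE_SIZE)
page_conflict = falsely_shared(x_addr, y_addr, PAGE_SIZE)
print("same cache line:", line_conflict)
print("same page:", page_conflict)
```

Under these assumed sizes the two variables live in different cache lines but in the same page, so only the page-based system sees the accesses as conflicting.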

The advantages of multiple-writer protocols for software DSMs are clear: the effects of false sharing are minimized and processors can locally decide to write any valid page without communicating with other processors. The disadvantages are also clear: protocols are more complex, a diffing mechanism must be used to merge multiple modifications to the same page, and the memory overhead is high. To date, however, there has not been a careful analysis of this tradeoff.

This paper presents such an analysis in the context of the Coherent Virtual Machine (CVM) [10] software DSM. CVM is a portable, user-level follow-on to the TreadMarks DSM. The system was specifically written in a modular fashion in order to allow fair comparisons to be made between different protocols. CVM provides a set of basic classes that implement a generic protocol, lightweight threads, and network communication, complete with efficient end-to-end protocols that add reliability to the base UDP protocol. Additional protocols are created by deriving new classes from the base protocol class.

For the comparison described in this paper, we implemented three protocols: a multiple-writer LRC protocol (MW-LRC), a single-writer LRC protocol (SW-LRC), and a single-writer SC protocol (SW-SC). SW-LRC and MW-LRC are single- and multiple-writer protocols that implement the lazy release consistent (LRC) [8] memory model. We compare the performance of the two protocols in order to gauge the importance of allowing multiple simultaneous writers. LRC was chosen as the consistency model because it allows false sharing to be hidden more effectively than other memory models.

We then compare the performance of SW-LRC to that of a carefully tuned sequentially consistent protocol (SW-SC) in order to gauge the importance of the consistency model relative to the choice of single- or multiple-writer protocols.
Other than the fact that the weaker memory semantics allow SW-LRC to delay coherency actions longer than SW-SC, the protocols are very similar, and in fact share much code.

Our comparison of the protocols shows that the performance of SW-LRC trails MW-LRC's by nine percent overall, but SW-LRC actually averages three percent better than MW-LRC for six of the eight applications. This result has several causes. First, write-write false sharing is rarer than read-write sharing [4]. Second, the weak memory model hides most of the effects of read-write false sharing. Finally, communication in this environment has a high startup cost for each message, while the per-byte cost is relatively low. The single-writer protocol has higher bandwidth requirements than the multiple-writer protocol because it transfers entire pages instead of diffs. However, the high startup cost on messages means that the number of messages is usually more important than the total amount of data sent.

On the other hand, there was a large performance gap between SW-LRC and SW-SC, showing that the choice of memory model had a larger effect on performance than the choice of a single- or multiple-writer protocol.

While this study presents data on only a single point in the spectrum of possible system characteristics, we believe that our testbed is typical for current systems. We also expect future technology trends to favor the single-writer protocol. The widening disparity between memory bandwidth and processor speed will increase the cost of diff creation relative to network communication. The latter is quickly growing less costly as current architecture research has focused on the creation of zero-copy, memory-mapped network interfaces that make communication latency independent of the memory hierarchy.

Section 2 describes the CVM system and the protocols in detail. Section 3.1 describes our experimental setup and presents detailed cost breakdowns of the component parts of the protocols.
Section 3 describes the overall performance of the protocols on a suite of shared memory programs, and relates their performance back to the application characteristics. Finally, in Section 4, we present our conclusions.

2 CVM and Protocols

This section provides a brief overview of lazy release consistency, a description of the CVM system in which the protocols are implemented, and a description of the protocols themselves.

2.1 Lazy Release Consistency

Lazy Release Consistency [8] is a variant of eager release consistency (ERC) [6], a relaxed memory consistency model that allows the effects of shared memory accesses to be delayed until selected synchronization accesses occur. Simplifying matters somewhat, shared memory accesses are labeled either as ordinary or as synchronization accesses, with the latter category further divided into acquire and release accesses. Acquires and releases may be thought of as conventional synchronization operations on a lock, but other synchronization mechanisms can be mapped onto this model as well. Essentially, ERC requires ordinary shared memory accesses to be performed only when a subsequent release by the same processor is performed. ERC implementations can delay the effects of shared memory accesses as long as they meet this constraint.

Under LRC protocols, processors further delay performing modifications remotely until subsequent acquires by other processors, and the modifications are only performed at the other processor that performed the acquire. The central intuition of LRC is that competing accesses to shared locations in correct programs will be separated by synchronization. By deferring coherence operations until synchronization is acquired, we can piggyback consistency information on existing synchronization messages.
In comparison to ERC, LRC generally improves performance by eliminating consistency messages, further hiding the effects of false sharing, and enabling new optimizations, such as piggybacking data movement on synchronization.

We use lazy release consistent protocols for this study because they delay consistency actions longer than other protocols, and therefore are more successful at hiding the effects of false sharing as well.

2.2 CVM

The Coherent Virtual Machine (CVM) [10] system is a software DSM that supports multiple protocols and consistency models. Like commercially available systems such as TreadMarks [9], CVM is written entirely as a user-level library and runs on most UNIX-like systems. Unlike TreadMarks, CVM was created specifically as a platform for protocol experimentation.

The system is written in C++, and opaque interfaces are strictly enforced between different functional units of the system whenever possible. The base system provides a set of classes that implement a generic protocol, lightweight threads, and network communication. The latter functionality consists of efficient, end-to-end protocols built on top of UDP.

New shared memory protocols are created by deriving classes from the base Page and Protocol classes. Only those methods that differ from the base class's methods need to be defined in the derived class. The underlying system calls protocol hooks before and after page, synchronization, and I/O events take place.

Since many of the methods are inlined, the resulting system is able to perform within a few percent of a heavily optimized system, TreadMarks [9], running a nearly identical protocol. However, CVM was designed to take advantage of generalized synchronization interfaces, as well as to use multi-threading for latency toleration. We therefore expect the performance of the fully functional system to improve over the existing base. In order to simplify the comparison process, however, we do not use either of these techniques in this study.

2.3 Protocols

2.3.1 SW-SC

SW-SC is an implementation of a page-based, single-writer protocol. Either a single writer or multiple readers may have copies of a given page at any one time. Page ownership is migrated to each processor as a copy of the page is requested, regardless of whether the request was for a write or a read copy. Ownership could be retained by the current owner while servicing read faults, but the status of the owner's page still needs to be downgraded to read-only. We chose to migrate ownership as an optimization favoring migratory data.

Like Mirage [5], we address the ping-pong problem by guaranteeing a processor a minimum quantum of time with any newly retrieved page before it can be invalidated by another processor. The ping-pong problem occurs when multiple processors simultaneously attempt to write the same page. A processor may request and gain ownership of a page, but receive an ownership request for the page from another processor before the fault handler exits. In this case, the modification (however minor) is not completed before ownership is lost, and ownership must be re-requested.

Since modifications are often small, even a very small quantum completely hides the problem in most cases. We found that a quantum of two microseconds sufficed for both single-writer protocols. Execution times for SW-SC when the guaranteed quantum was not used went up by a factor of four or five for some applications.

2.3.2 SW-LRC

SW-LRC differs from SW-SC in that a single owner can co-exist with multiple readers. Pages only need to be invalidated when a processor receives a write notice via synchronization.

The other major difference is a consequence of this choice.
Since servicing read faults does not require the owner to downgrade its writable copy to a read-only copy, we do not migrate ownership on read misses, and writable pages are not downgraded to read-only copies. This optimization improves performance for all the applications that we tested.

2.3.3 MW-LRC

MW-LRC differs from SW-LRC in that multiple writers are allowed to concurrently modify the same page. These concurrent modifications are merged using diffs to summarize the updates. A diff is created by performing a page-length comparison between the current contents of the page and a twin of the page that was created at the first write access. If each concurrent writer summarizes its modifications as a diff, the system can create a copy that reflects all modifications by applying the concurrent diffs to the same copy. Concurrent diffs only overlap if the same location is written by multiple processors without intervening synchronization, which is probably a data race. All of our applications are free of data races.

The cost of creating a diff is substantial (approximately three fourths of the cost of an RPC in our system), and will probably grow relative to processor speed as memory latency falls further behind processor speed.
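
The twin and diff mechanism described above can be sketched as follows. This is a minimal illustration, not CVM's implementation: the page size, the word-sized comparison granularity, the diff encoding as (offset, bytes) pairs, and the function names `make_twin`, `create_diff`, and `apply_diff` are all assumptions made for the example.

```python
PAGE_SIZE = 4096   # bytes; an assumed page size
WORD = 4           # bytes; an assumed comparison granularity


def make_twin(page):
    """Snapshot the page contents, as would happen at the first write access."""
    return bytes(page)


def create_diff(twin, page):
    """Page-length comparison: record (offset, new_bytes) for every modified word."""
    diff = []
    for off in range(0, len(page), WORD):
        if page[off:off + WORD] != twin[off:off + WORD]:
            diff.append((off, bytes(page[off:off + WORD])))
    return diff


def apply_diff(page, diff):
    """Overwrite the recorded words in place."""
    for off, data in diff:
        page[off:off + len(data)] = data


# Two writers start from identical copies and modify disjoint parts of the page.
base = bytearray(PAGE_SIZE)
page_a, page_b = bytearray(base), bytearray(base)
twin_a, twin_b = make_twin(page_a), make_twin(page_b)

page_a[0:4] = b"AAAA"          # writer A touches the start of the page
page_b[2048:2052] = b"BBBB"    # writer B touches the middle of the page

diff_a = create_diff(twin_a, page_a)
diff_b = create_diff(twin_b, page_b)

# Applying both diffs to one copy yields a page that reflects both writers.
merged = bytearray(base)
apply_diff(merged, diff_a)
apply_diff(merged, diff_b)
print(len(diff_a), len(diff_b), merged[0:4], merged[2048:2052])
```

Because the two diffs touch different words, the order in which they are applied does not matter. Overlapping diffs would make the result order-dependent, which corresponds to the data race case mentioned in the text. The loop in `create_diff` also makes the cost visible: every word of the page is read from both the page and its twin, regardless of how little was changed.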

Systems such as Midway [1] avoid the page-length copies and comparison by using a modified compiler to annotate all shared writes with code that tracks accesses by using software dirty bits [17]. When the diff needs to be created, the software dirty bits are used to determine exactly which words have been modified. While the copy and comparison are avoided, the software dirty bit approach requires language support and adds overhead to every shared write. Recent work [3] shows that diffing outperforms software dirty bits in object-based systems, but the tradeoff is less clear for page-based systems.

2.3.4 Tradeoffs

The most obvious advantage of the multiple-writer protocol is that it allows concurrent modifications of the same page without network communication. However, write-sharing is less common than other forms of sharing, so this aspect of the multiple-writer protocol's performance is not likely to be significant for many applications.

A less obvious advantage of the multiple-writer protocol is that the decision to modify a page that is present in read-only state is purely a local decision. Any page that is readable locally may be written without undertaking any arbitration with other processors. Single-writer protocols, by contrast, require ownership of the page to be gained before a page can be modified. If ownership is not gained with the read fault that brought the page in, a further network RPC must be performed in order to get it. Furthermore, once ownership is achieved, it inevitably migrates away within a short time. Therefore, unless single-writer protocols are carefully crafted, producer-consumer interactions will require two network RPCs, each of which can consist of up to three messages.

Note that this is the case for any single-writer protocol, including the sequentially consistent protocol implemented in IVY [13] and the eager release consistent protocol implemented in Munin [2].

The obvious disadvantage of multiple-writer protocols is that they must use diffs to merge concurrent updates to the same page.
While the use of diffs also decreases the amount of data transmitted across the network, the number of messages sent is generally more important than the overall amount of data sent because of the high startup cost of message sends in our environment.

Only improved CPU speed or dedicated hardware support will improve diff creation cost, but improved network interfaces and OS communications systems are likely to decrease the cost of network communication at a significantly faster pace in the near future. Furthermore, byte-copying continues to get more expensive relative to floating point operations as the pace of CPU clock rate improvement continues to outpace memory access time improvements. These trends show that the cost of diffing, already high, is likely to rise in the future.

2.3.5 Page Location

The means of locating valid pages under the single-writer protocols needs to be explained a bit further. An access miss is serviced by sending a page request to the page's manager, which forwards the request to the current owner. This method requires three messages to satisfy a request in the usual case, or two when the manager is also the owner. Both IVY and Munin use a scheme based on following chains of probable owners until the real owner is located, collapsing the probable owner pointers as a request is forwarded.

Li and Hudak [13] showed that k faults on a page in an n-processor system can require O(n + k log n) hops in the worst case. Since k is in practice much larger than n for the system we are looking at, we can neglect the first term and see that k faults can require k log n hops, for an average of log n hops per fault. An algorithm based on static ownership has better worst-case performance even for a system of only 8 nodes, and the advantage grows larger as the size of the system increases.

We confirmed that the static ownership scheme performs better in practice by driving a simple simulation with a trace of faults incurred by the applications discussed in the next section. Our simulation shows that in the eight-processor case, our static ownership scheme requires an average of 1.83 messages per page miss, while the probable owner version required 1.86.

3 Results

3.1 Environment

Our experimental environment consists of a 16-node IBM SP-2, although all performance numbers reflect eight-processor executions. The SP-2 has a high-performance Omega switch in which each bi-directional link is capable of a sustained bandwidth of approximately forty megabytes per second. Each processor is a 66MHz RS/6000 Power2 processor.

The applications were run on a version of CVM ported to MPI [14]. MPI does not yet allow handlers to be called asynchronously on receipt of messages, so the system polls for incoming messages when outgoing messages are sent. Unfortunately, two of the applications, TSP and QS, get poor performance using automatic polling because they each have phases where the sharing is very coarse-grained. Explicit polls were inserted into these programs.

Table 1 lists latencies for various actions taken by the DSM system. A "1-Hop Lock" takes two messages, and refers to the case where the lock manager is also the owner. A "2-Hop Lock" takes three messages. Page fault terms are similar. The "segv handler" is the time consumed by AIX in calling a user-level handler for a page fault.

3.2 Applications

The applications used in this study include four applications from the SPLASH-2 [16] suite of shared-memory programs: Water-Nsquared (Water), Water-spatial (Spatial), FMM, and LU.
The other four programs were locally written: FFT, SOR, Quicksort (QS), and Traveling Salesman Problem (TSP). Table 2 summarizes the inputs and characteristics for the applications. Six of the applications use barriers, five use locks, and

Primitive            Time
Simple RPC            160
1-Hop Lock            228
2-Hop Lock            329
8-Proc Barrier        669
1-Hop Page Fault      939
2-Hop Page Fault     1376
mprotect               12
segv handler          128

Table 1 Primitive Costs (usecs)
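
The claim that message count matters more than data volume in this environment can be made concrete with a rough cost model. The sketch below is an assumption-laden illustration: it charges the Table 1 "Simple RPC" time per round trip, converts the approximately 40 MB/s link bandwidth to bytes per microsecond (taking 1 MB as 10^6 bytes), and assumes a 4096-byte page and a 200-byte diff. It ignores fault handling and protocol overhead, which is why its totals fall well below the measured page fault times in Table 1.

```python
RPC_US = 160.0                  # "Simple RPC" from Table 1, in microseconds
BANDWIDTH_BYTES_PER_US = 40.0   # ~40 MB/s link, taking 1 MB = 10**6 bytes
PAGE_BYTES = 4096               # assumed page size
DIFF_BYTES = 200                # assumed size of a small diff


def exchange_cost_us(round_trips, payload_bytes):
    """Startup cost per round trip plus wire time for the payload (simplified model)."""
    return round_trips * RPC_US + payload_bytes / BANDWIDTH_BYTES_PER_US


whole_page = exchange_cost_us(1, PAGE_BYTES)      # one round trip, full page
small_diff = exchange_cost_us(1, DIFF_BYTES)      # one round trip, diff only
diff_extra_rpc = exchange_cost_us(2, DIFF_BYTES)  # diff only, but one more round trip

saved_by_diff = whole_page - small_diff
print("whole page:", round(whole_page, 1), "us")
print("small diff:", round(small_diff, 1), "us")
print("diff + extra RPC:", round(diff_extra_rpc, 1), "us")
print("bytes saved are worth:", round(saved_by_diff, 1), "us")
```

Under these assumptions, shipping a diff instead of a whole page saves less time than a single additional round trip costs, which is consistent with the argument that avoiding messages matters more than shrinking them.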

Figure 3 Messages Per Second

than all applications other than TSP. TSP performs well despite fine-grained sharing because it synchronizes infrequently (see Table 2).

3.3.1 Sharing

We used our run-time system to generate traces showing all page protection changes during executions. These traces, timestamped by the globally synchronous clock on the switch of the SP-2 [15], drive a post-mortem analyzer that tracks how long individual pages are shared in various modes. Table 4 shows the results.

Several items are of interest. First, two applications, Spatial and QS, have substantial write sharing under MW-LRC. Pages are write-shared an average of 6% of the time in QS, and 61% of the time in Spatial. While the other applications spend 12% of their running time on write faults under SW-LRC, these two applications spend 39%. Unsurprisingly, MW-LRC performs markedly better than the single-writer protocols for these two applications, even though it performs an average of 3% worse than SW-LRC for the other six.

An average of 68% of all write faults under SW-LRC are to pages that are already valid, and hence would not require any network communication under a multiple-writer protocol. Overall, these promotions account for 40% of the total message count and 10% of running time.

Several other numbers are worth explaining. Under MW-LRC, thirty-two percent of the pages in QS are not valid anywhere in the system. Pages may become completely invalid under multiple-writer protocols when concurrent writers exchange invalidations at synchronization points. The pages are not re-validated in this specific implementation of Quicksort because the sorted array is never read.

One of the most interesting trends in the table is that not only is write-sharing converted to read-write sharing under SW-LRC, but write-only sharing is converted to read-write sharing as well.
The latter situation occurs because read faults are more likely to be serviced by an up-to-date copy of the page under SW-LRC than under MW-LRC. Hence, a subsequent synchronization between the read faulter and other processors is less likely to invalidate the page.

Metric                    FFT     FMM      LU      QS     SOR     TSP   Water  Spatial
Faults Per Second       847.3  2663.4   851.8   732.6   977.6   231.3   879.2    520.9
Avg Diff Size (bytes)  3782.4   146.0  3622.0  1687.5  2376.3    93.5   969.8    740.4

Table 3 Sharing Granularity (MW-LRC)

Application  Protocol   w-w   w-o   r-w   r-r   r-o   inv
FFT          MW         0.7  59.7   7.0  32.4   0.1   0.3
             SW         0.0  47.9  23.0  29.1   0.0   0.0
FMM          MW         0.0   0.0   0.0   0.0   0.0   0.0
             SW         0.0  84.7   7.2   7.7   0.3   0.0
LU           MW         0.1  63.7   6.0  30.0   0.2   0.0
             SW         0.0  34.4  36.0  29.6   0.1   0.0
QS           MW         5.9  77.2   9.9   5.4   1.6  32.5
             SW         0.5  76.1  18.3   3.7   1.4   0.0
SOR          MW         0.0  92.2   0.5   0.6   6.7   0.0
             SW         0.0  69.2   7.6   0.6  22.5   0.0
Spatial      MW        61.0   9.4  11.9  16.7   1.0   1.1
             SW         0.0  10.1  78.3  11.6   0.0   0.0
TSP          MW         0.4  20.0  28.9  50.6   0.1   0.0
             SW         0.0   7.4  28.6  63.9   0.1   0.0
Water        MW         0.1  11.8   5.9  82.0   0.2   0.0
             SW         0.0  19.1   8.1  72.7   0.0   0.0

Table 4 Sharing Statistics for 8-Processor LRC Runs: r-read, w-write, o-only

3.3.2 Diff Costs

The use of diffs adds four types of overhead: creation, application, handling, and garbage collection. The average diff creation cost during our tests was 125 microseconds. Diff creation consumed between 0.2% and 2.2% of total run time for our applications. Diff application and handling both consume only a small fraction of this time. As discussed in Section 3.3.3, garbage collection usually consumes between 1% and 4%, although the actual number is highly dependent on application behavior and system parameters.

3.3.3 Space Overhead

Table 5 shows the memory overhead cost of communication buffers, twins, and diffs for each application under MW-LRC. MW-LRC's space overhead for twins varies between 7% and 28% of the total amount of application data, between 1% and 227% for diffs, and 15% and 273% for all overhead combined.
By comparison, the average space overhead for the other two protocols is 10%.

                   FFT      FMM       LU       QS      SOR  Spatial      TSP    Water
Comm Buffer     100000   100000   100000   100000   100000   100000   100000   100000
Twin Space      389632   591360   485376   287232  1008128   125952    59392    99328
Diff Space      680367   563955   565491   513857    98913    45123   798503   227382
Total Ohead    1082499  1167815  1063367   813589  1119541   183575   870395   339210
% Shared Seg       37%      40%      54%      21%      15%      17%     273%     122%

Table 5 Memory Overhead for 8-Processor Runs Under MW-LRC
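
The arithmetic behind Table 5 can be checked with a few lines. The sketch below only recomputes values from the table. It shows that each "Total Ohead" entry exceeds twin space plus diff space by the same constant, and that the mean of the "% Shared Seg" row matches the 72% average quoted in the abstract. Reading that constant as a per-processor share of the communication buffer is a guess based on the numbers, not something the text states.

```python
apps = ["FFT", "FMM", "LU", "QS", "SOR", "Spatial", "TSP", "Water"]
twin = [389632, 591360, 485376, 287232, 1008128, 125952, 59392, 99328]
diff = [680367, 563955, 565491, 513857, 98913, 45123, 798503, 227382]
total = [1082499, 1167815, 1063367, 813589, 1119541, 183575, 870395, 339210]
pct_shared_seg = [37, 40, 54, 21, 15, 17, 273, 122]

COMM_BUFFER = 100000   # "Comm Buffer" row of Table 5
PROCESSORS = 8         # eight-processor runs

# What remains of each total after subtracting twin and diff space.
residual = [t - tw - d for t, tw, d in zip(total, twin, diff)]

# Average overhead as a percentage of the shared segment.
mean_pct = sum(pct_shared_seg) / len(pct_shared_seg)

for name, r in zip(apps, residual):
    print(name, "total - twin - diff =", r)
print("comm buffer / processors =", COMM_BUFFER // PROCESSORS)
print("mean % of shared segment =", mean_pct)
```

The residual is identical for all eight applications, so the per-application totals are internally consistent even though they are not the plain sum of the three rows above them.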

Note that diff space requirements can be arbitrarily reduced by garbage collecting, at the cost of increased CPU overhead. Since MW-LRC does not currently implement garbage collection, we summarize the mechanism implemented in TreadMarks [7].

Garbage collection is initiated whenever any processor notices that diff or write notice buffers are becoming exhausted. The initiating processor adds a garbage collection request to its next barrier arrival message, and the master re-distributes this request to all processors with the barrier release.

The request requires each of the processors to re-validate every page that had at one point been valid on that processor, and then to inform the barrier master. The barrier master waits for validation acknowledgments, and then distributes collection-release messages. Upon receipt of a collection-release message, each processor releases all resources used to hold diffs, write notices, or twins.

This mechanism validates more pages than strictly necessary, but spatial locality ensures that most of the re-validated pages will be accessed again. Despite the two extra rounds of communication required to validate the pages, garbage collection in the TreadMarks system never reduced performance by more than 4%, and usually by less than 1% [11].

Neither of the other protocols uses either diffs or twins, but SW-LRC does use write notices. However, the notices use a trivial amount of space (one word per notice) and can be garbage collected at each barrier without global co-ordination.

4 Conclusions

The primary contribution of this paper is a better understanding of the tradeoffs involved in allowing concurrent writers to the same page in DSM systems. We have implemented and compared the performance of three DSM protocols in the context of the CVM distributed shared memory system. The protocols are MW-LRC, a multiple-writer LRC protocol; SW-LRC, a single-writer LRC protocol; and SW-SC, a single-writer SC protocol.
Overall, the multiple-writer version of LRC performed 9% better than the single-writer variant and 34% better than the sequentially consistent protocol. Stated another way, the performance impact of the choice in consistency models is approximately three times greater than the choice of whether to allow concurrent writers.

The primary performance difference between the two LRC protocols is in their handling of write sharing. Contrary to our expectations, two of our eight applications exhibited significant write-sharing. Pages were write-shared 6% of the time in QS, and 61% of the time in Spatial. Write sharing in the other applications was at least an order of magnitude less. While MW-LRC performed an average of 3% worse than SW-LRC for the other six applications, it performed an average of 43% better for these two applications.

Set against this, MW-LRC required an average of 72% memory overhead, compared to 10% overhead for the single-writer protocols. Two thirds of this extra overhead is used for diff storage. Diff storage requirements can be greatly reduced by garbage collecting, but only at the cost of increased CPU overhead (see footnote 1).

The primary effect of MW-LRC's appetite for memory in future systems may be in cache and TLB pollution. Such effects are becoming more important as memory hierarchies deepen. Additionally, the diffing and twinning mechanisms needed by multiple-writer protocols make heavy demands on the memory system because of the large block comparisons and copies. Fortunately, these are exactly the types of memory accesses that non-blocking caches are designed to address.

We also found that applications spent much more time waiting on data than on synchronization, suggesting that run-time mechanisms that automatically prefetch data could be of significant benefit.

Our final contribution is the design and evaluation of SW-LRC. SW-LRC achieves performance comparable to MW-LRC in most cases, but has less space overhead and is less complex. The down side is that SW-LRC's performance is more sensitive to write-sharing, and the protocol has higher bandwidth requirements. Nonetheless, we feel that the simplicity and space advantages make SW-LRC a natural choice for the current generation of DSM systems that is even now making its way into the marketplace.

Footnote 1: We will include garbage collection performance in the final version of this paper.

References

[1] B.N. Bershad, M.J. Zekauskas, and W.A. Sawdon. The Midway distributed shared memory system. In Proceedings of the '93 CompCon Conference, pages 528-537, February 1993.
[2] J.B. Carter, J.K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152-164, October 1991.
[3] Alan Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. A comparison of entry consistency and lazy release consistency implementations. DRAFT: submitted for publication, August 1995.
[4] S.J. Eggers and R.H. Katz. A characterization of sharing in parallel programs and its application to coherency protocol evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 373-383, May 1988.
[5] B. Fleisch and G. Popek. Mirage: A coherent distributed shared memory design. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 211-223, December 1989.
[6] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.
[7] P. Keleher. Distributed Shared Memory Using Lazy Release Consistency. PhD thesis, Rice University, 1994.
[8] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13-21, May 1992.
[9] P. Keleher, S. Dwarkadas, A. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115-131, January 1994.
[10] Pete Keleher. The Coherent Virtual Machine. Technical Report Maryland TR93-215, Department of Computer Science, University of Maryland, September 1995.
[11] Pete Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. An evaluation of software-based release consistent protocols. To appear in The Journal of Parallel and Distributed Systems.
[12] Povl T. Koch, Robert J. Fowler, and Eric Jul. Message-driven relaxed consistency in a software distributed shared memory. In Proc. of the First Symposium on Operating Systems Design and Implementation, pages 75-85, Monterey, CA, November 1994. USENIX Assoc.
[13] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.
[14] Message Passing Interface Forum. MPI: A Message-Passing Interface, 1994.
[15] C. B. Stunkel, D. G. Shea, B. Abali, M. M. Denneau, P. H. Hochschild, D. J. Joseph, B. J. Nathanson, M. Tsao, and P. R. Varker. Architecture and implementation of Vulcan. In Proceedings of the 8th International Parallel Processing Symposium, pages 268-274, April 1994.
[16] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-37, June 1995.
[17] Mathew J. Zekauskas, Wayne A. Sawdon, and Brian N. Bershad. Software write detection for distributed shared memory. In Proceedings of the First USENIX Symposium on Operating System Design and Implementation, pages 87-100, November 1994.
