Coherence as an Abstract Type by Keleher, Pete
Sparks: Coherence as an Abstract Type

Pete Keleher
Department of Computer Science
University of Maryland
College Park, MD 20742
keleher@cs.umd.edu

Abstract

We are currently designing Sparks, a protocol construction library that we hope will allow us to improve the performance of DSM systems to within a few percent of tightly-coupled multiprocessors. Sparks' abstractions will allow us to cleanly and systematically explore the design space of high-level synchronization operations, rather than proposing and implementing new operations in an ad hoc fashion. Sparks' basic abstraction is the coherence history, an object that summarizes past coherence actions to shared segments. Our emphasis here is more on creating and investigating the abstractions that make a broad variety of optimizations possible, rather than on the individual optimizations themselves. However, we will thoroughly quantify the performance gains allowed by the synchronization types created via the Sparks library.

Our overall goal is to improve DSM performance. We will gauge our success by targeting applications from benchmark suites such as SPLASH-2, as well as representative applications from computational chemistry, biology, and satellite image analysis. Sparks' history abstraction will be used to make several important contributions towards our performance goal: (1) efficient techniques to implement high-level synchronization, (2) efficient automatic prefetching using prefetch playbacks, and (3) external interfaces to run-time libraries and automatically parallelized code sections. By improving DSM efficiency, we hope to make the shared memory paradigm more appealing, and therefore useful, to the research community.

1 Introduction

Shared memory is a more intuitive programming model than alternatives such as message-passing. Software distributed shared memory (DSM) systems provide the abstraction of shared memory to applications running on networks of workstations and distributed memory machines such as the SP-2, CM-5, and Paragon. Unfortunately, the latencies for global operations in either environment are several orders of magnitude higher than on tightly-coupled multiprocessors. As a result, early DSMs performed well for only a restricted class of applications.

Previous work [5, 7] addressed part of the problem by proposing weak memory consistency models. These memory models allow processors' views of shared memory to temporarily diverge, bringing them back into agreement only at subsequent synchronization. This work significantly broadened the class of applications that perform acceptably on DSMs, but falls short of allowing DSMs to rival the performance of multiprocessors in general.

We are currently designing Sparks, a protocol construction library that we hope will allow us to improve the performance of DSM systems to within a few percent of tightly-coupled multiprocessors.




























Figure 3 Lazy Release Consistency









[Figure: processors P1, P2, and P3; invalidation Inv(px)]









[Figure: processors P1, P2, and P3; invalidations Inv(px, py) and Inv(px)]


























Figure 7 Interval Partial Order

3.2 Programming with Sparks

The initial prototype of Sparks will be written as a C++ class library. Later versions may migrate to a language-based approach as we expand the scope of the research to include compiler-based analysis of synchronization and automatic protocol verification.

A simplified definition of the History class is shown in Figure 8. This definition allows histories to be added, subtracted, and applied. The set_* routines allow the extents to be directly modified. Additionally, some protocol implementations of get_data() will return all data present locally whose creation is described by the history's write notices. The apply_data routine can then be used to update pages when the history is applied elsewhere. The register routine is used to tell Sparks to begin recording shared writes in a given history.

    class History {
        TemporalExtent temporal;
        SegmentExtent  segment;
        ThreadExtent   thread;

        void register(int on_or_off);
        void operator += (History *);
        void operator -= (History *);
        void apply();
        UpdateData *get_data();
        void apply_data(UpdateData *);
        void set_temporal(TemporalExtent *, TemporalExtent *);
        void set_segment(SegmentExtent *);
        void set_thread(ThreadExtent *);
    };

Figure 8 History Class

Adding histories Hi and Hj results in a new history that contains all intervals named in either Hi or Hj. For example, the coherence operations that take place in a lock acquisition on an LRC system can be expressed by:

    Hacq += Hrel;
    apply Hacq;

The existence of a history detailing modifications to shared memory does not imply that any coherence operation has taken place. Consistency actions occur only when a history is applied to the local version of shared memory. In the above example, the first line merely creates a description of shared modifications seen by either the acquirer or releaser. No action is performed until the resulting history is applied in the second line. All three extents may be modified by an addition.

Histories may also be subtracted. Subtracting Hi from Hj limits the temporal scope of the resulting history to the interval of time seen by Hj but not by Hi. History subtraction can be used to create a compact representation of all shared updates to the extents covered by history Hin during a specific interval of time:

    History Hsave;
    extern History Hlocal;

    void begin_record() {
        Hsave = Hlocal;
    }

    History *end_record() {
        return Hlocal - Hsave;
    }

where we assume Hlocal is registered (recording is turned on). The history returned by end_record contains a complete record of the intervals that were created or learned about between the calls to begin_record and end_record. The next section presents possible uses of this type of construction.

3.3 High-Level Synchronization: Queues

As discussed above, unintended consequences can result from using constructs as powerful as locks to build high-level synchronization types. In the case of the lock-based queue in Figure 5, the unintended consequences are processor P2's invalidation of page px, and P3's invalidation of py. The only intended consequence is P3's invalidation of px.

The Sparks-based queue implementation in Figure 4 stores the history of the data producer with the object in the queue. When the data is consumed by P3, P1's history is applied at P3.

3.4 Reductions and Mutual Exclusion

Many operations in parallel programs can be described as reductions, or operations that are associative and commutative. The semantics only require mutual exclusion between consecutive reducers. However, reductions are typically implemented using locks. Locks are stronger than necessary because their implementation updates later reducers with all coherence actions taken by prior reducers. The only coherence actions that need to be performed are those to the data modified by the reduction.

Reductions can be implemented in Sparks similarly to locks, except that a segment extent is used to limit the scope of the histories transferred between consecutive reducers.
The code below presents the relevant aspects of a reduction acquisition:

    (1) reduce_acquire(SegmentExtent *object) {
    (2)     send request for object to current owner
    (3)     extract history Hlast from reply
    (4)     Hlast->set_segment(object);
    (5)     Hlast->apply();
    (6) }

Line 4 reduces the scope of the coherence actions contained in Hlast to only those that affect the pages in object, and Line 5 applies the result.

3.5 Producer-Consumer Sharing

A common behavior in parallel programs is a stable pattern of data creation by one thread and consumption by another. Such behavior is usually termed producer-consumer. We present two possible mechanisms to optimize data transfer when the user has indicated that producer-consumer interaction is possible.

Page-based DSMs have no explicit association between data and the synchronization used to guard it. However, a given program usually obeys a fairly simple mapping between the two. Our trace-driven simulation shows that 81% of all access misses on shared data in Water [17], a relatively complicated molecular simulation, can be avoided by replaying data transfers. The access misses of Jacobi, a coarse-grained application, can be completely covered with simple analysis. Data access latencies directly account for 17% of the runtime for Water on top of CVM on an eight-node SP-2, and indirectly account for more through synchronization delays and load imbalance [10]. Since our experience indicates that access miss latency is at least as important as synchronization latency, we expect prefetch mechanisms to provide significant performance benefits.

The above routines begin_record and end_record can be used to cleanly record data creation. The code below shows pseudo-code for a possible implementation:

    SegmentExtent *object;

    begin_record(Hlocal);
    ...
    History *Hrec = end_record(Hlocal);
    Hrec->set_segment(object);
    Msg *msg = new Msg(Hrec->get_data());
    msg->send(consumer_proc_id);

If Hlocal's SegmentExtent covers all of shared memory, the recording calls generate two snapshots of all local updates made to shared memory between the calls.
The routine end_record returns a history containing only those changes made to shared memory between the two calls. The scope of this history is then limited to object by calling set_segment with a SegmentExtent that covers only the shared pages that contain the object. The data corresponding to these modifications is accessed through the get_data() method, and pushed to the expected consumer of the data. At the consumer side, the data is applied to the local view of shared memory, circumventing the invalidations and access misses that would otherwise have been required to retrieve the data from the producer.

A second mechanism is useful when the sharing pattern is not stable. The producer uses begin_record and end_record to delimit creation of data. At the first request for any page of the new data by another processor, the DSM transfers the entire block of newly created data by using the history's get_data() method. Again, Sparks lets us easily capture and transfer the set of pages modified during the producer phase.
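The recording scheme of this section can be made concrete with a small toy model. The sketch below is not the Sparks implementation. It assumes a history is simply a set of write notices, each a hypothetical (interval id, page number) pair, so that addition becomes set union, subtraction becomes set difference, and limiting the segment extent becomes a filter on page numbers. The names ToyHistory, WriteNotice, and pages_to_push are invented for illustration.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical toy model, not the Sparks API: a write notice records that
// a page was modified during a given interval.
using Page = int;
using WriteNotice = std::pair<int, Page>;  // (interval id, page)

struct ToyHistory {
    std::set<WriteNotice> notices;

    // Addition: every notice named in either history.
    ToyHistory& operator+=(const ToyHistory& other) {
        notices.insert(other.notices.begin(), other.notices.end());
        return *this;
    }

    // Subtraction: notices seen by *this but not by other.
    ToyHistory operator-(const ToyHistory& other) const {
        ToyHistory out;
        for (const auto& n : notices)
            if (other.notices.count(n) == 0) out.notices.insert(n);
        return out;
    }

    // Limit the history to the pages of one object (its segment extent).
    void set_segment(const std::set<Page>& pages) {
        for (auto it = notices.begin(); it != notices.end();) {
            if (pages.count(it->second) == 0) it = notices.erase(it);
            else ++it;
        }
    }

    // Distinct pages whose contents would be pushed to the consumer.
    std::vector<Page> get_data() const {
        std::set<Page> pages;
        for (const auto& n : notices) pages.insert(n.second);
        return std::vector<Page>(pages.begin(), pages.end());
    }
};

// Record the writes made between two snapshots, then keep only the pages
// that belong to the shared object (assumed here to be pages 11 and 12).
std::vector<Page> pages_to_push() {
    ToyHistory local;
    local.notices = {{1, 10}, {1, 11}};  // writes made before recording
    ToyHistory saved = local;            // begin_record(): take a snapshot
    local.notices.emplace(2, 11);        // writes made while recording
    local.notices.emplace(2, 12);
    local.notices.emplace(3, 40);        // a write outside the object
    ToyHistory rec = local - saved;      // end_record(): the difference
    rec.set_segment({11, 12});           // restrict to the object's pages
    return rec.get_data();
}
```

In this model pages_to_push() yields pages 11 and 12: the write to page 10 precedes the recording window and is removed by the subtraction, while the write to page 40 falls outside the object and is removed by set_segment. A real history would also carry temporal and thread extents, which this sketch omits.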
3.6 Prefetch Playbacks

Prefetch playbacks allow us to record the access misses taken during one iteration and to replay the corresponding data transfers as updates during the next iteration. Section 3.5 describes a mechanism that allows a producer to update a known consumer. Prefetch playbacks build on this mechanism by allowing a producer to use past history to automatically identify the consumer.

Coherence histories are essentially a record of write faults. We can use a similar mechanism to record read faults. Routines analogous to begin_record and end_record are used to create a ReadHistory object that summarizes read misses taken between the two calls. These read histories are then matched with producers at the next global synchronization (barrier) to identify targets for updates. The following pseudo-code illustrates a use of this technique:

    int prod = -1;
    for (i = -1; i < ∞; i++) {
        if (I am a producer) {
            begin_record();
            produce(i);
            History *hist = end_record();
            if (prod >= 0)
                send hist->get_data() to prod;
            prod = produce_barrier(hist);
        } else {
            begin_read_record();
            consume(i-1);
            ReadHistory *read_hist = end_read_record();
            consume_barrier(read_hist);
        }
    }

The barrier routines append the histories to barrier arrival messages, and the barrier master matches producers to consumers by comparing SegmentExtents. During the next iteration, newly created data is pushed to the pid returned by produce_barrier() while waiting for the barrier to complete.

Recording and playing back data transfers was first used by Mukherjee et al. [14] in the context of a sequentially consistent DSM. Our work differs in two ways. First, our recording mechanisms will be part of the synchronization type definitions. The playbacks will be initiated by automatic heuristics, making them more reliable and easier to apply. With the exception of differentiating between producer and consumer barriers, all of the above mechanisms could have been hidden inside the barrier routines.
We pulled much of it outside the barrier routines for explanatory purposes. Second, our technique will be used for prefetching, not to maintain coherence. We will not violate correctness if subsequent iterations access different data.

3.7 Compiler/Runtime Library Interfaces

We will use Sparks to generate interfaces to code created by the SUIF [16] parallelizing compiler, and to the CHAOS [4] runtime library.

Our collaboration with Dr. Tseng's compiler group [16] will use communication analysis to determine when data will be needed by other processors. By combining this information with standard dataflow and dependence analysis, the compiler can initiate asynchronous data updates and overlap communication with computation.
Similar work is being pursued in collaboration with Dr. Saltz's CHAOS [4] group. The general approach is to create mechanisms that let CHAOS assume sole responsibility for consistency in a confined region of shared space. The same mechanism is later used to turn the default consistency management back on. Sparks' history abstraction allows us to develop efficient representations of shared state that can be used when control is transferred.

4 Conclusions

Parallel systems are clearly reaching a point where increasing affordability is making their widespread acceptance possible. However, this transition will not take place unless parallel machines are easy to program, and perform well. Current DSM systems handle the first problem, but do less well with the second.

Our research will bridge the gap between loosely-coupled and tightly-coupled systems by using the Sparks abstractions to reduce and optimize data movement in DSM systems. As large-scale systems increasingly resemble multiprocessor nodes connected by DSM, we expect our techniques to become common not only in clusters of stock workstations, but in the most powerful systems as well.

Bibliography

[1] B.N. Bershad, M.J. Zekauskas, and W.A. Sawdon. The Midway distributed shared memory system. In Proceedings of the '93 CompCon Conference, pages 528-537, February 1993.

[2] A. Black, N. Hutchinson, E. Jul, H. Levy, and L. Carter. Distribution and abstract types in Emerald. IEEE Transactions on Software Engineering, SE-13(1):65-74, January 1987.

[3] J.B. Carter, J.K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152-164, October 1991.

[4] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, 22:462-479, September 1994.

[5] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.

[6] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL: High-performance all-software distributed shared memory. To appear in The Proceedings of the 15th ACM Symposium on Operating Systems Principles.

[7] P. Keleher. Distributed Shared Memory Using Lazy Release Consistency. PhD thesis, Rice University, 1994.

[8] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13-21, May 1992.
[9] P. Keleher, S. Dwarkadas, A. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115-131, January 1994.

[10] Pete Keleher. The relative importance of concurrent writers and weak consistency models. Technical Report CS-TR-3543, University of Maryland, October 1995.

[11] Povl T. Koch, Robert J. Fowler, and Eric Jul. Message-driven relaxed consistency in a software distributed shared memory. In Proceedings of the First USENIX Symposium on Operating System Design and Implementation, pages 75-86, November 1994.

[12] J. Kuskin and D. Ofelt et al. The Stanford FLASH multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, April 1994.

[13] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.

[14] Shubhendu S. Mukherjee, Shamik D. Sharma, Mark D. Hill, James R. Larus, Anne Rogers, and Joel Saltz. Efficient support for irregular applications on distributed-memory machines. In Proceedings of the 1995 Conference on the Principles and Practice of Parallel Programming, July 1995.

[15] Steven K. Reinhardt, James R. Larus, and David A. Wood. Tempest and Typhoon: User-level shared memory. In Proceedings of the 21st Annual International Symposium on Computer Architecture, April 1994.

[16] S. Tjiang, M. E. Wolf, M. Lam, K. Pieper, and J. Hennessy. Integrating scalar optimization and parallelization. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, California, August 1991. Springer-Verlag.

[17] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-37, June 1995.
