Introduction
The idea of implementing a shared address space in software across a network of computers using the virtual memory mechanism, known as the shared virtual memory (SVM) approach, was proposed more than a decade ago [32]. An SVM protocol provides a coherent shared virtual address space at page granularity, using the local page table for access control and message-passing for inter-node communication.
Several factors limit the performance of a shared virtual memory implementation. First, the large granularities of coherence and communication cause unnecessary or artifactual communication, especially when a strict consistency model like sequential consistency [31] is used. The large granularity of coherence (a page) can cause false sharing when multiple processors access different variables co-located on the same page and at least one access is a write. The large granularity of communication upon a page fault (also a page) can cause fragmentation when not all the data in the fetched page are needed by the faulting processor. These effects occur in hardware-coherent shared memory as well, but they are less significant because the granularities are much smaller (a cache block). The second factor is that the cost of each communication operation is much larger in SVM since it is performed through software messages. Third, synchronization is slow because it is also implemented through explicit messages. Moreover, page faults that occur inside critical sections dilate the critical sections and increase serialization, especially in the presence of contention. Finally, protocol processing incurs large overheads because it is usually performed in software on the main processor, so the main computation must be interrupted to service remote protocol requests.
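To make the definition of false sharing concrete, the following is a minimal sketch that flags pages on which two processors touch disjoint addresses and at least one access is a write. The function name, the 4096-byte page size, and the access-trace format are assumptions made for illustration; they are not taken from any SVM system.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes


def falsely_shared_pages(accesses, page_size=PAGE_SIZE):
    """accesses: iterable of (processor, address, is_write) tuples.

    Returns the sorted page numbers on which some pair of processors
    touches disjoint sets of addresses and at least one access to the
    page is a write.
    """
    per_page = defaultdict(lambda: defaultdict(set))  # page -> proc -> addresses
    written = set()                                   # pages with at least one write
    for proc, addr, is_write in accesses:
        page = addr // page_size
        per_page[page][proc].add(addr)
        if is_write:
            written.add(page)

    result = []
    for page, by_proc in per_page.items():
        procs = list(by_proc)
        disjoint = any(
            by_proc[p].isdisjoint(by_proc[q])
            for i, p in enumerate(procs)
            for q in procs[i + 1:]
        )
        if page in written and disjoint:
            result.append(page)
    return sorted(result)
```

With a smaller coherence unit (for example `page_size=64`, roughly a cache block), the same trace places the two variables in different units more often, which is why the effect is less significant in hardware-coherent systems.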
Breakthrough protocols from Rice University in the early 1990s [9, 27] used the recently introduced release consistency model [16] to breathe new life into the SVM approach. Since then, SVM research has witnessed a very active and fruitful decade (see Figure 1), with many research groups building on one another's ideas to push performance higher [4, 21, 39, 38, 47, 2, 19, 30]. Of the three layers that affect end performance (application, protocol/model, and communication architecture), most of the efforts so far have been in the lower two: relaxed consistency models and protocol implementations to reduce communication frequency and traffic [4, 21, 47, 2], and additional hardware support in the communication architecture to reduce communication costs [39, 11, 35, 19, 18, 7, 30, 29].
With the relative maturity of protocols, in the last couple of years SVM research has moved to greater emphasis on the application layer and the synergies available across layers.
New areas are being emphasized like application-driven performance evaluation, application restructuring for SVM and performance portability across platforms, protocol enhancements driven directly by bottlenecks encountered in real applications, protocol interactions and synergies with architectural support, comparison with alternative, fine-grained software shared memory approaches, and software tools to improve programmability. We believe that these less researched areas now have the most promise for driving future advances: in reducing the still considerable performance gap between hardware and software shared memory, in improving the protocols and system support further, and in understanding programmability and scalability. This paper is a survey of the first 12 years of research in SVM, placing the multi-track flow of ideas and results obtained so far in a comprehensive framework.
The contributions indicated in Figure 1 are classified in four categories, each belonging primarily to one layer: relaxed consistency models, protocol laziness, architectural support, and applications and application-driven research. A section of the paper is devoted to each category. The last section discusses other important emerging issues related to SVM: the alternative of fine-grained software coherence, hybrid protocols that implement software shared memory across multiple hardware-coherent multiprocessors, and scalability. The paper summarizes comparative performance results from the literature, discusses their limitations, places existing protocols in a framework based on laziness, and identifies the lessons learned so far and some key outstanding questions.
Figure 1: Research in SVM. The number in brackets next to an arc is a reference to the paper that created that arc. The figure treats LRC and ERC as different protocols implementing the RC consistency model, though they are in fact slightly different consistency models. MW is multiple writer, SW is single writer, and H means home-based. SC, RC, EC and ScC are the sequential, release, entry and scope consistency models.
Relaxed Consistency Models
Unlike natural implementations of sequential consistency (SC), relaxed memory consistency models allow the propagation and application of coherence operations (e.g. invalidations) to be postponed to synchronization points, greatly reducing the impact of false sharing and the frequency of coherence operations. The intuition behind these models is that a user cannot distinguish the behavior of a system running under such relaxation from one that uses a sequentially consistent model, as long as programs are "properly labeled" [16] , which means that they are free of data races and all synchronization is identified as such.
Weak consistency [12] was the first model based on this idea. However, release consistency (RC) [16] , which separates synchronization operations into acquires and releases, was the first to inspire a major breakthrough in the performance of SVM [9] . RC guarantees that coherence operations due to writes are performed before the releases that follow the writes. Lazy release consistency or LRC [27] further delays the deadline for coherence operations to be performed to the time of acquires rather than the releases that precede them. It is a slightly lazier model than RC, although this cannot be distinguished by a "properly labeled" program. LRC is discussed further in Section 3.
Relaxing the consistency model beyond LRC is much more difficult due to both protocol and especially programming complexity. However there is some potential for performance improvement, especially in applications with a lot of lock synchronization.
Entry Consistency [4] proposed the idea of binding data to synchronization variables, and making only the data bound to that variable consistent at a synchronization event. That is, the consistency model governs not only when data are made coherent but also which data should be made coherent at a synchronization point. However, the programmer has to provide the binding, which is a major burden. The granularity in entry consistency is that of user-defined objects or regions, not transparent pages.
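The binding idea can be sketched as follows. This is a toy, single-process model written for illustration only: the class `EntryConsistentStore`, its methods, and the dictionary-based "nodes" are hypothetical and do not reflect the interface of any entry-consistent system. It shows only that an acquire makes consistent the data bound to that synchronization variable and nothing else.

```python
class EntryConsistentStore:
    """Toy model: each lock guards an explicitly declared set of variables."""

    def __init__(self, nodes, bindings):
        # bindings: lock name -> set of variable names (supplied by the programmer)
        self.bindings = bindings
        self.local = {n: {} for n in nodes}              # each node's private view
        self.released = {lock: {} for lock in bindings}  # values published at releases

    def acquire(self, node, lock):
        # Only the data bound to this lock is made consistent at the acquire.
        self.local[node].update(self.released[lock])

    def write(self, node, var, value):
        self.local[node][var] = value

    def read(self, node, var):
        return self.local[node].get(var)

    def release(self, node, lock):
        # Publish only the variables bound to the lock being released.
        for var in self.bindings[lock]:
            if var in self.local[node]:
                self.released[lock][var] = self.local[node][var]
```

In this sketch a write to an unbound variable inside the critical section never becomes visible to the next acquirer, which illustrates why providing a correct binding is a burden on the programmer.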
Scope consistency (ScC) [21] tries to achieve the benefits of entry consistency without the burden of explicit binding. Instead, synchronization variables define scopes through which memory is viewed, and an implicit association of data (pages) with scopes is achieved dynamically when a write access occurs inside a scope. Unfortunately, there are still programs where additional programming effort is needed beyond programming for LRC. ScC implementations are page-grained, unlike EC, so they can obtain the prefetching effect of pages while keeping false sharing low. ScC has been implemented in the Brazos prototype [42] , showing performance improvements.
EC may fit well into object-oriented languages, and ScC's small programming burden may be alleviated by tools that check if a properly labeled program for RC also satisfies ScC. Currently, however, the programming burden keeps these models outside the mainstream, and systems based on RC prevail.
Protocol Laziness
A memory consistency model specifies when coherence operations and data must become visible. However, it can actually be implemented with various degrees of laziness in the propagation and the application of both coherence operations and data modifications to pages. In fact, the history of SVM research so far is largely a history of protocol laziness. This research also led to the distinction between single-writer and multiple writer protocols [24] , as well as between home-based [19, 21, 47] and no-home or distributed [27, 42] data collection.
Greater laziness typically implies less frequent communication and protocol operations, but also greater programming difficulty and protocol state. For example, while release consistency was first defined for hardware-coherent systems, those systems tend to propagate and apply operations as soon as possible to keep hardware simple. It is software protocols that can maintain the state necessary to fully exploit its laziness, and can implement even lazier models like LRC.
SVM protocols based originally on release consistency have been developed along different but related paths (see Figure 1 ). Most published papers use the previous protocol along their path as the base case for comparison. However, most performance evaluations so far have been limited by two factors. First, they have typically not used a comprehensive enough set of real applications that exercise different relevant characteristics. Second, they have made assumptions about important lower-level protocol and implementation mechanisms that can make a great difference to the higher-level tradeoff being investigated. We will point these out as we summarize results from the literature.
This section divides the evolution of protocols for maintaining consistency into three parts: laziness in propagating coherence operations to other processors, laziness in applying the coherence operations at those processors, and laziness in propagating and applying data modifications.
Laziness in Coherence Propagation
The simplest propagation method is eager-propagation release consistency: invalidations for shared pages that were locally modified before a release (but after the previous release) are sent at release time to all the nodes that have copies of these pages. With lazy propagation, as used in protocols implementing LRC [27] , the invalidation corresponding to a write by processor A is conveyed to a processor B only on demand upon B's next acquire operation, if that acquire is causally "after" the release by the writing processor A. Munin [9] and TreadMarks [25] were the first SVM systems with eager and lazy software release consistent protocols, respectively.
Laziness has implications for the amount and lifetime of coherence state that must be maintained. With eager propagation, coherence information is propagated at the release to all processors that might need it, so that the information can be discarded by the releaser afterwards. The only data structure needed is a directory to keep track of sharers, similar to that used in hardware cache coherence. With lazy propagation, coherence information is propagated at an acquire operation and only to that acquirer. This has two implications. First, the coherence information cannot be discarded at a release since there is no way of predicting which processors will issue causally subsequent acquires and when. Second, a releaser has to provide the next acquirer with all the coherence information (write notices) that the releaser has seen in its past but the acquirer has not seen, not just that due to writes by the releaser since its last release; thus, even "old" write notices received from others and already applied locally cannot be easily discarded but must be held "in transit" (until a global synchronization point). Orchestrating the transitive propagation of coherence information requires an LRC protocol to store more complex state for much longer, using vector timestamps to keep track of the causal partial ordering of synchronization events [25] . On the other hand, the main advantage of lazy propagation is a reduction in coherence traffic; for example, if a lock is acquired and released many times in succession by the same node no remote coherence operation is required, even if the pages being written are cached remotely.
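The transitive propagation of write notices can be sketched with vector timestamps as below. This is a simplified illustration under stated assumptions, not the TreadMarks implementation: the class `LrcNode` and its methods are hypothetical, each release closes exactly one interval, and data movement, locks, and garbage collection are omitted.

```python
class LrcNode:
    """Toy node that keeps write notices and a vector timestamp."""

    def __init__(self, node_id, num_nodes):
        self.id = node_id
        self.vc = [0] * num_nodes  # vc[j]: latest interval of node j seen here
        self.notices = {}          # (creator, interval) -> set of pages written
        self.dirty = set()         # pages written in the current interval
        self.invalid = set()       # pages invalidated locally

    def write(self, page):
        self.dirty.add(page)

    def release(self):
        # Close the current interval. Nothing is sent to other nodes here.
        self.vc[self.id] += 1
        self.notices[(self.id, self.vc[self.id])] = set(self.dirty)
        self.dirty.clear()

    def notices_unseen_by(self, acquirer_vc):
        # Everything this node knows about that the acquirer has not seen,
        # including notices originally created by third parties.
        return {key: pages for key, pages in self.notices.items()
                if key[1] > acquirer_vc[key[0]]}

    def acquire_from(self, releaser):
        missing = releaser.notices_unseen_by(self.vc)
        for key, pages in missing.items():
            self.notices[key] = pages  # retained: a later acquirer may need it
            self.invalid |= pages
        self.vc = [max(a, b) for a, b in zip(self.vc, releaser.vc)]
        return missing
```

Note that the acquirer stores the notices it receives rather than discarding them after invalidating, which is the source of the extra, long-lived protocol state discussed above.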
Laziness in Coherence Application
Even if a protocol propagates write notices eagerly at a release, various degrees of laziness can be used for actually applying or performing the coherence operations at the destination and in processing acknowledgments. These are often glossed over when naming and comparing protocols, but they can have significant performance effects and can even affect the consistency model.
We will call eager release consistency or ERC protocols those that use eager propagation and also apply (and acknowledge) write notices eagerly as soon as they are received. Delayed consistency (DC) [13] protocols use eager propagation but lazy application. Incoming invalidations are not applied right away but are queued until the destination node performs its next acquire operation (whether causally related to that release or not, since it doesn't know). Acknowledgments are typically sent back as soon as invalidations are received so that the releaser can proceed, though the releaser may choose to reduce release latency further by queuing the incoming acknowledgments and checking for them only at its next acquire.
While lazy propagation can reduce coherence traffic, lazy application can reduce the number of page misses incurred. For example, a DC protocol avoids the misses which might otherwise have occurred between the time invalidations are received and the time the next acquire operation is performed. Lazy application also allows greater amortization of expensive calls to the operating system to change page protection, by doing more of them together. So-called LRC protocols are even lazier, applying (and even propagating) invalidations only at the next causally related acquire. However, they require timestamps to keep track of the partial orders. If the savings in coherence traffic and page faults are small, the complexity of LRC may even cause it to perform worse than DC or even ERC.
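The difference between eager and lazy application can be sketched as follows. This is a minimal illustration, assuming a hypothetical `DelayedConsistencyNode` class in which page fetches are modeled only by a miss counter; it is not drawn from any DC implementation.

```python
class DelayedConsistencyNode:
    """Toy destination node: eager propagation, lazy application."""

    def __init__(self, valid_pages):
        self.valid = set(valid_pages)
        self.pending = []  # invalidations received but not yet applied
        self.misses = 0

    def receive_invalidation(self, page):
        # Queue the invalidation and acknowledge at once so that the
        # releaser can proceed.
        self.pending.append(page)
        return "ack"

    def read(self, page):
        if page not in self.valid:
            self.misses += 1  # page fault: a fresh copy would be fetched here
            self.valid.add(page)

    def acquire(self):
        # Apply all queued invalidations in one batch, which also allows the
        # page-protection changes to be amortized.
        applied = set(self.pending)
        self.valid -= applied
        self.pending.clear()
        return applied
```

A read that falls between the arrival of the invalidation and the next acquire hits the stale but still valid copy, so the miss an ERC protocol would take at that point is avoided.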
Another important difference between protocols, whether lazy or eager, is whether they are single-writer or multiple-writer. In single-writer protocols, one writer may coexist with readers but not with other writers. The single writable copy of a page migrates along with the ownership from one writer to another. In multiple-writer protocols, multiple writers can coexist and independently modify their copies of a page, and communication and coherence occur only at synchronization points. We will discuss this issue further in Section 3.3.
Table 1 reports the results of a comparison of a particular ERC protocol similar to Munin and the TreadMarks LRC protocol [26]. Both are multiple-writer protocols. However, the comparison is not purely about laziness. This ERC protocol uses a hybrid update-invalidation approach (see Section 3.3.2) while the LRC protocol is invalidation-based. LRC usually outperforms ERC, except for ILINK and FFT, in which it is the update nature of the ERC protocol that makes the difference.
Focusing only on laziness, a single-writer DC protocol, a single-writer LRC protocol (similar to that in [28]), a multiple-writer LRC protocol (similar to that in Section 3.3.2) and a sequentially consistent protocol (all invalidation-based) have been compared on a number of SPLASH-2 applications that cover most of the key sharing patterns [48]. Table 2 shows some of the results as measured on the Wisconsin Typhoon-zero cluster [39]. The lazier application and propagation of the LRC protocol have significant advantages over the DC protocol, especially in complex irregular applications that use substantial lock synchronization. Also, the multiple-writer protocol performs much better than the single-writer protocol for such irregular applications (though sometimes multiple-writer protocols can perform worse for applications whose patterns of page access are mostly single-writer [28]).
The study is limited by the fact that the platform provides hardware rather than VM-based access control for pages. The trend today is toward lazy, multiple-writer protocols. However, the performance/storage/complexity tradeoffs in laziness are not yet clear for emerging platforms and real applications and bear further research (see also Section 4). A multiple-writer DC protocol is implemented in the Cashmere system [30, 29] .
Laziness in Data Propagation
Like coherence information, data modifications can be propagated and applied with varying degrees of laziness. We define laziness of data propagation with respect to when coherence operations are propagated rather than when the corresponding local writes or releases are performed. Unlike coherence information, buffering incoming data modifications and applying them later is not likely to provide much benefit, unless messages are received without involving the main processor. If the processor is involved, it may as well apply the data right away. We therefore treat data propagation and application together.
Managing data propagation is quite simple for single-writer protocols [32, 8, 24]. Multiple-writer schemes propagate data modifications more lazily, but they require a mechanism to keep track of the writes by different processes to a page and to merge them before another process needs to see them according to the consistency model.
A popular way to manage the tracking and propagation of data modifications entirely in software for multiple-writer protocols is to use diffs. At a release point, a writer determines the changes it has made to each page that it has written since its last synchronization operation (i.e. during the current synchronization interval) by computing the differences (diff) between the versions of the page at the beginning and end of the interval. The diffs can then be propagated with varying degrees of laziness.
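The diff mechanism can be sketched in a few lines. This is a minimal illustration in which a page is modeled as a Python list of words; the function names `make_twin`, `compute_diff`, and `apply_diff` are chosen for this sketch and are not the API of any particular system.

```python
def make_twin(page):
    """Copy the page at the start of a synchronization interval."""
    return list(page)


def compute_diff(twin, page):
    """Word-by-word comparison; returns (index, new_value) pairs."""
    return [(i, new) for i, (old, new) in enumerate(zip(twin, page)) if old != new]


def apply_diff(page, diff):
    """Merge another writer's modifications into a local copy in place."""
    for index, value in diff:
        page[index] = value
```

Because each diff records only the words its writer changed, diffs from concurrent writers to different words of the same page can be merged in any order, which is what allows multiple writers to coexist despite false sharing.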
Eager versus Lazy Data Propagation
The most eager form of propagating diffs is to send them eagerly along with the (eagerly propagated) coherence information. This is like an update-based protocol but with less useless traffic since updates are propagated only at releases. Hybrid update/invalidate protocols further reduce useless update traffic [9, 42] .
The laziest form of data propagation is to fetch the diffs when they are actually needed by the next sharer, i.e. on a page fault. The faulting processor obtains the diffs from the relevant writers and merges them into its copy of the page. In fact, even the creation of the diffs can be (lazily) delayed till the first incoming request for them.
While ingenious, this distributed multiple writer scheme has some potential problems. First, when there is indeed write-write sharing, a processor will need to obtain diffs from multiple writer nodes, generating a lot of messages that increase the page fault latency. Second, the need to retain diffs at the writers (like write notices) can generate very large memory consumption, even larger than the application data set size itself [47] , greatly limiting scalability. Periodic garbage collection is needed, which adds overhead. Third, the diffs to be transferred can accumulate rapidly and increase communication traffic. For example, under a single-writer, migratory sharing pattern, each process that acquires a lock fetches the diffs, applies them and generates new diffs. The incoming diffs cannot be discarded after application because they may need to be passed on to the next acquirer, so they accumulate. For certain sharing patterns this problem has been addressed by the adaptive protocols discussed later in this section. Finally, diff-ing requires several memory-intensive operations: a page copy to create a clean version of the page at the beginning of a synchronization interval, word by word comparison to compute the diff, and applying the diff upon a page fault. In addition to their direct costs, these operations substantially pollute the primary cache and hurt application performance further [20] .
Home-Based Protocols: An Intermediate Approach
An intermediate form of laziness in data propagation is to propagate the modifications in two stages. A home node is selected for each shared page and modifications are propagated eagerly (at or before a release) to the home only [19]. This can be done either using diffs [21, 47] or architectural support as discussed later [19, 30]. Unlike the home copy, the non-home copies of a page are kept coherent through invalidations as usual. On a page fault, a non-home processor obtains the whole page from the home rather than obtaining diffs from previous writers. Now, even when all-software data propagation with diffs is used, the diffs created at release time are immediately sent to the home and then discarded. At the home, diffs are applied as soon as they are received and discarded there too, so diff storage needs are very small. Diff propagation to the home can be delayed from the release to the next incoming acquire, but the diffs must be propagated by the time of the next causally related acquire so that the data reach the home by the time the page fault occurs.
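The two-stage propagation can be sketched as below. This is a simplified, single-process illustration: the classes `HomeNode` and `RemoteWriter` are hypothetical, messages are modeled as method calls, and invalidation of non-home copies is omitted.

```python
class HomeNode:
    """Holds the home copy of every page (in-process stand-in for a node)."""

    def __init__(self, num_pages, words_per_page):
        self.pages = [[0] * words_per_page for _ in range(num_pages)]
        self.diffs_applied = 0

    def receive_diff(self, page_no, diff):
        # Apply on arrival; the diff itself is not stored at the home.
        for index, value in diff:
            self.pages[page_no][index] = value
        self.diffs_applied += 1

    def fetch_page(self, page_no):
        # A faulting node always gets the whole page in one round-trip.
        return list(self.pages[page_no])


class RemoteWriter:
    """A non-home node that writes pages and flushes diffs at release."""

    def __init__(self, home):
        self.home = home
        self.copies = {}  # page_no -> local copy
        self.twins = {}   # page_no -> copy taken before the first write

    def write(self, page_no, index, value):
        if page_no not in self.copies:
            self.copies[page_no] = self.home.fetch_page(page_no)
        if page_no not in self.twins:
            self.twins[page_no] = list(self.copies[page_no])
        self.copies[page_no][index] = value

    def release(self):
        for page_no, twin in self.twins.items():
            page = self.copies[page_no]
            diff = [(i, new) for i, (old, new) in enumerate(zip(twin, page))
                    if old != new]
            self.home.receive_diff(page_no, diff)
        self.twins.clear()  # twins and diffs are discarded after the release
```

Neither the writers nor the home retain diffs after a release in this sketch, which reflects the small diff storage requirement noted above.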
The main performance advantages of the home-based multiple-writer scheme are (i) fewer messages, since page faults can always be satisfied with a single round-trip to the home, (ii) a given diff is applied only once (at the home), (iii) writes performed at the home do not generate diffs, and (iv) there are no remote data fetches at the home. These last two features can make a big difference to performance for many applications in which a page is mostly written by a single processor. Potential disadvantages are (i) the whole page is fetched on a fault even if only one word of it has been modified and (ii) performance can be poor for some applications if homes are not assigned properly. Table 3 compares the all-software home-based multiple-writer LRC (HLRC) protocol with a traditional no-home, multiple-writer LRC protocol. The former is found to perform substantially better on this platform, with the performance gap increasing with the number of processors [47]. Similar results are seen in earlier, simulation-based comparisons between a home-based protocol with some hardware support and a no-home LRC for a wider range of applications [19, 20].
An alternative multiple-writer scheme that shares some properties of home-based write collection was used in [26] to implement an ERC protocol (the one in Table 1): on receiving invalidations, those sharers that hold locally modified copies of the page compute and return their diffs to the releasing processor, bringing it up to date. A subsequent page fault fetches the full page only from this last releaser, just as home-based protocols fetch it only from the home.
Table 3: Speedups for traditional no-home LRC and HLRC on a 32-processor Intel Paragon [47].
Adaptive Protocols
Much of the machinery developed to support multiple writers is not well suited to single-writer patterns. Recent papers [2, 24] propose adaptive no-home LRC protocols. If write-write sharing is not detected within a synchronization interval at run time, the protocol switches from multiple-writer to single-writer for that page, transferring the whole page when necessary and eliminating diffs and extra messages. These adaptive protocols are particularly well-suited to migratory sharing patterns, which have a single writer within each interval. Due to its static assignment of homes to pages, a pure home-based protocol does not handle migratory sharing quite so well. Compared to the single round-trip of single-writer protocols, it generates one and a half round-trips of messages for each migration: propagation of the diff from the last writer to the home plus a round-trip with the home to fetch the page. However, home-based protocols usually handle multiple-writer situations more efficiently than adaptive no-home protocols. A recent study [24] shows that an adaptive no-home LRC can outperform an HLRC protocol, but the comparison is not conclusive since few real applications are used, the number of processors is small, and the HLRC implementation is slightly different from that described in [47]. The tradeoff between these approaches needs to be better understood.
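The message counts for a migratory page can be written out as a small calculation. This sketch counts only the one-way data messages named in the text (diff to home, page request, page reply) and assumes that acknowledgments and coherence messages are ignored; the dictionary and function names are illustrative.

```python
# One-way messages needed to move a migratory page to its next writer.
ONE_WAY_MESSAGES = {
    "single-writer": 2,  # page request + page reply from the previous owner
    "home-based": 3,     # diff to the home + page request + page reply
}


def round_trips_per_migration(protocol):
    """Express the cost in round-trips (two one-way messages each)."""
    return ONE_WAY_MESSAGES[protocol] / 2


def extra_messages(migrations):
    """Additional one-way messages a home-based protocol sends in total."""
    per_migration = ONE_WAY_MESSAGES["home-based"] - ONE_WAY_MESSAGES["single-writer"]
    return migrations * per_migration
```

Under these assumptions the home-based protocol sends one extra message per migration, so the penalty grows linearly with the number of times the page migrates.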
Alternative All-Software Propagation
An alternative all-software method to track and merge modifications in multiple-writer protocols is to maintain per-word dirty bits in software [4] . This requires instrumenting memory operations in the program, incurring runtime overhead. It has been shown to perform slightly better than diffs for migratory sharing patterns, but the only available comparison shows that (for a non-adaptive no-home LRC protocol) diffs perform better overall due to the cost of instrumentation [1] .
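Software dirty bits can be sketched as follows. This is a minimal illustration, assuming a hypothetical `DirtyBitPage` class in which the method `store` stands in for an instrumented memory write; a real system would insert the equivalent instructions around stores in the program.

```python
class DirtyBitPage:
    """Toy page with one software dirty bit per word."""

    def __init__(self, words):
        self.data = [0] * words
        self.dirty = [False] * words

    def store(self, index, value):
        # Instrumented store: every write pays for setting the dirty bit.
        self.data[index] = value
        self.dirty[index] = True

    def collect_modifications(self):
        # At a release, gather the dirty words and reset the bits.
        # No twin copy or word-by-word comparison is needed.
        mods = [(i, self.data[i]) for i, d in enumerate(self.dirty) if d]
        self.dirty = [False] * len(self.dirty)
        return mods
```

The sketch shows the tradeoff described above: the cost moves from release time (twin creation and comparison) to every store executed by the program.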
Architectural Support
Simple architectural support can accelerate SVM protocols. Moreover, it can greatly impact high-level design choices for protocols as well as the results of performance comparisons.
Broadcast and Multicast
Broadcast or multicast support may make eager protocols more attractive since information can be multicast to all sharers on a release. Lazy protocols are inherently point-to-point. For example, the Cashmere DC protocol uses the broadcast support in the DEC Memory Channel interconnect to perform directory updates and synchronization [29], as well as to ensure a total ordering on operations. This clever exploitation eliminates the need for acknowledgments, and allows directories and eager coherence approaches to be used efficiently, eliminating time-stamps as well. However, the techniques rely deeply on the specialized support and it is unclear how well they will scale. The Brazos system uses multicast support in a mostly-update protocol, with protocol mechanisms for reducing the drawbacks of updates [42].
Fine-grain Remote Writes
Several papers have suggested hardware support for fine-grain remote operations in the network interface [11, 35, 30, 29]. Recent real implementations include the AURC home-based protocol on the SHRIMP multicomputer, which uses the automatic update hardware mechanism [7] to snoop writes off the memory bus and propagate them to the home of the page if it is remote. This eliminates diffs, but can generate more traffic on the memory bus and network and relies on the ability to map pages write-through in the caches. The fine-grain remote writes can be used not only for application data but also for protocol meta-data. Results from the implementation show that AURC usually outperforms the all-software home-based HLRC [18, 7], but not dramatically. The Cashmere system [29] does not rely on snooping but instruments shared writes to generate explicit remote writes as well, using the remote write support in the Memory Channel [17] (together with a multiple-writer DC protocol and the broadcast support). It achieves fine-grain write propagation to a home without write-through caching, but incurs instrumentation overhead.
General Network Interface Mechanisms
In typical SVM implementations without a dedicated protocol processor, incoming requests interrupt the compute processor and are handled by it. The interrupt overhead is the most significant parameter of the communication architecture in determining the performance of an SVM protocol [6] . Using polling as an alternative to interrupts may or may not improve performance, depending on the interrupt versus polling and instrumentation overhead on the platform. These message handling costs can also affect the performance tradeoffs in protocol laziness, for both propagation and application.
Commodity network interfaces that support basic mechanisms to move data between the host memory and the network without interrupting the main processor can be used to accelerate SVM systems beyond just automatic write propagation support, and to develop protocols that use neither interrupts nor polling. Such a system has been built by programming simple remote deposit and remote fetch mechanisms into the non-intrusive Myrinet network interface, as well as a mechanism to provide mutual exclusion for locks, and by modifying the protocol propagation and application methods to exploit these mechanisms [5]. It has demonstrated substantial performance improvement across a range of applications, and shown that all three simple types of support are important for different applications. Further integration of the mechanisms into the memory system may help more. The appropriate compromise between performance and the commodity nature of the architectural support, as well as the synergy between layers, bear further research.
Latency Tolerance
Finally, since communication costs in SVM systems are large, latency tolerance plays an important role. Relaxed memory consistency models are used by virtually all SVM systems. Recent studies on systems with large interconnect latencies such as ATM indicate that prefetching and multithreading can be quite successful as well [34, 41] . However, when most of the communication cost is in the end-point overhead of message handling and interrupts rather than in the interconnect itself, as in most modern SVM systems, these techniques are less likely to be effective except when the overhead is offloaded to a separate communication assist.
Application-driven Evaluation and Performance Portability
A significant limitation of most of the studies performed in the past to evaluate protocol tradeoffs has been in the applications they use. "Standard" applications [44] that are used for hardware-coherent shared memory have only recently begun to be used, and most programs that have been used have been kernels with mostly very simple behavior. It is very important that performance and tradeoff evaluations in SVM use a wider range of applications, classified according to both inherent sharing patterns and the patterns induced by interaction with system granularities [20]. The gap between hardware cache coherence and software shared memory is still quite large for several types of applications (e.g. [20, 22, 5]), and software shared memory performance has not yet been proven to scale except for a limited class of applications. Some key bottlenecks have been shown to generate cascading effects in SVM. For example, expensive page faults cause dilation of critical sections which, in turn, amplifies lock contention and the waiting time for synchronization. Protocol changes should now mostly be driven by such bottlenecks discovered by running real applications. Protocol optimizations can be directed by compilers [14], users [3], or the runtime system [2, 24, 42, 33].
Another major area enabled by SVM progress is research in the application layer itself for these systems. Structuring applications or data structures more appropriately for page-grained SVM can improve performance dramatically [22]. While the restructurings are often algorithmic (e.g. reducing synchronization, even at the cost of imbalance or communication, and making programs algorithmically single-writer), the good news is that they are mostly performance-portable to hardware-coherent systems as well [22]. Simultaneous research in applications and the systems layers, rather than treating either as fixed, is important to understand and exploit the synergies among the layers [40, 5], truly understand the potential of SVM, and develop guidelines for performance-portable shared memory programming across hardware-coherent systems and clusters, the two major emerging multiprocessor platforms. Tools to understand bottlenecks and check violations of consistency are also important.
Other Software Shared Memory Directions
This section presents recent important directions for software shared memory research: (i) comparing SVM with recently re-popularized fine-grained software shared memory approaches that do not rely so heavily on relaxed consistency models, (ii) using multiprocessor rather than uniprocessor nodes to reduce the frequency of software involvement, and (iii) scalability.
Fine-grain Software DSM
For software coherence at a fine or variable grain on clusters, access control is usually provided by instrumenting memory operations in the code rather than through the virtual memory mechanism, which only deals with pages [39]. An advantage of fine granularity is simplicity, since a sequential consistency model can be used without suffering much false sharing and communication fragmentation. The programming model is exactly the same as in the more mainstream hardware shared memory. In addition, tools for code instrumentation operate on executables rather than on source code, a substantial advantage over traditional SVM when synchronization in the executable is not properly labeled or visible to the SVM system. However, code instrumentation is not always portable and it adds overhead on loads and stores. Moreover, since fine-grained communication may be more frequent, the approach depends critically on low-latency messaging. Recent optimizations that reduce instrumentation overhead have revived the fine-grained approach [38]. Several of the key optimizations so far are specific to RISC or even DEC Alpha architectures. Among protocol-level optimizations, support for multiple coherence granularities on a per-data-structure basis is particularly useful, though it does rely on support from the programmer.
A recent study compared the fine-grained and page-grained approaches using four different consistency protocols on a fairly large and varied set of applications [48]. The platform provided a uniform hardware access control mechanism for both approaches but ran the protocols in software. The results (Table 4) show that for almost all applications the page-grain home-based LRC protocol performs similarly to or better than a fine-grain SC protocol, except in one case where there is an overwhelming amount of synchronization (since the SC protocol does not postpone coherence activity to synchronization points, it can better tolerate finer-grained synchronization). Application restructuring aids SVM further. More research is needed to determine which approach is clearly superior given instrumentation costs and future trends in communication performance [40] (generally, higher bandwidth and higher message-handling costs favor SVM, while lower latency favors a fine-grained approach).
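The instrumentation idea can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the mechanism of any cited system: the names `FineGrainNode`, `BLOCK_SIZE`, and `fetch_block` are hypothetical, and the `fetch_block` callback stands in for whatever coherence protocol would supply the block. Every load and store is preceded by a software check of a per-block state table, which is the overhead the text attributes to instrumentation.

```python
from enum import Enum

BLOCK_SIZE = 64  # bytes per coherence block (hypothetical value)


class State(Enum):
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2


class FineGrainNode:
    """One node's view of shared memory with software access checks."""

    def __init__(self, memory_size, fetch_block):
        if memory_size % BLOCK_SIZE != 0:
            raise ValueError("memory_size must be a multiple of BLOCK_SIZE")
        self.memory = bytearray(memory_size)
        self.state = [State.INVALID] * (memory_size // BLOCK_SIZE)
        # fetch_block(block_index, need_write) -> bytes of length BLOCK_SIZE.
        # Hypothetical stand-in for the coherence protocol.
        self.fetch_block = fetch_block
        self.misses = 0

    def _check(self, addr, need_write):
        """The check that instrumentation inserts before each access."""
        block = addr // BLOCK_SIZE
        st = self.state[block]
        if st is State.READ_WRITE or (st is State.READ_ONLY and not need_write):
            return  # fast path: access permitted, no protocol action
        # Slow path: invoke the protocol to obtain the block or upgrade it.
        self.misses += 1
        data = self.fetch_block(block, need_write)
        if len(data) != BLOCK_SIZE:
            raise ValueError("protocol returned a block of the wrong size")
        start = block * BLOCK_SIZE
        self.memory[start:start + BLOCK_SIZE] = data
        self.state[block] = State.READ_WRITE if need_write else State.READ_ONLY

    def load(self, addr):
        self._check(addr, need_write=False)
        return self.memory[addr]

    def store(self, addr, value):
        self._check(addr, need_write=True)
        self.memory[addr] = value
```

In this sketch the fast path costs one table lookup and a comparison per access, while only the slow path sends messages. Because the checked unit is a small block rather than a page, two processors writing different variables far apart on the same page would not interfere.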
SVM across Multiprocessor Clusters
Another recent development is motivated by the increasing popularity of small-scale hardware-coherent multiprocessors and the ease of constructing "systems of systems" or clusters that use these multiprocessors as their nodes [46, 10, 23, 15, 45, 36, 5, 43, 37]. A software shared memory layer provides a uniform, coherent shared memory programming model rather than a hybrid message-passing/shared-memory interface for this increasingly important platform. Also, because the local coherence and synchronization within a multiprocessor node is performed in hardware, the overall performance is expected to be better than that on clusters of uniprocessors.
Fine-grained software shared memory systems have also been built across SMP nodes [37]. Using multiprocessor nodes has been found to indeed improve performance, with the extent of the improvement depending on the localization of communication and synchronization in the application. The caveat is that cross-node communication per unit time actually increases since the multiprocessor node speeds up computation faster than it reduces cross-node communication, so node-to-network bandwidth may have to be scaled with node size. Also, saturation of the SMP memory bus can sometimes hurt performance. The use of multiprocessor nodes may also change the tradeoffs in laziness and architectural support, especially as nodes become larger.
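The bandwidth caveat can be illustrated with a toy back-of-the-envelope model. The model and all of its parameters are assumptions made here for illustration and are not taken from the studies cited: each of the `k` processors in a node generates the same amount of remote traffic as it would on a uniprocessor node, a fraction `local_fraction` of that traffic is absorbed inside the node by hardware coherence, and the program runs `speedup` times faster than on uniprocessor nodes. The function name is hypothetical.

```python
def per_node_bandwidth_ratio(k, local_fraction, speedup=1.0):
    """Per-node network bandwidth demand with k-processor nodes,
    relative to uniprocessor nodes, under the toy model above.

    k              -- processors per node (assumed parameter)
    local_fraction -- share of traffic satisfied within the node, in [0, 1]
    speedup        -- execution-time ratio T_uniprocessor / T_multiprocessor
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must lie in [0, 1]")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # k processors share one network interface; only the non-local share
    # of their traffic crosses it, compressed into a shorter run time.
    return k * (1.0 - local_fraction) * speedup
```

Under these assumptions, 4-processor nodes that keep half of the traffic local still need twice the per-node bandwidth, and the demand stays flat only when the local fraction reaches 1 - 1/k and the run time is unchanged.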
Scalability
Finally, the scalability of SVM and software shared memory in general is a largely unexplored terrain. Scalability can be achieved by increasing the number of nodes or the number of processors per node. For example, an SVM protocol may be implemented across clusters of distributed-memory, hardware-coherent machines rather than SMPs, with performance hopefully increasing as node size increases relative to the number of nodes. A start in the latter area has been made [45], but with a relatively simple protocol. If this approach is successful, i.e. it competes well with full hardware-coherent large-scale implementations, it may provide a way to extend the attractive programming model from moderate-scale hardware-coherent systems built in industry to much larger systems of systems, perhaps a viable way to construct truly large-scale shared memory multiprocessors. Note that it is not only the scalability of performance that matters, but also the scalability of memory needed by the protocol for its data structures, so eager and home-based protocols may have an advantage as the number of nodes increases [47], and protocols should scale their memory needs with the number of nodes, not processors [36]. Further protocol innovations may be needed, and the area bears further investigation.
Conclusions
There has been a lot of progress in shared virtual memory over the last several years. Protocols and their implementations have been improved steadily, fueled by relaxed memory consistency models, and are now at a stage of relative maturity. However, many applications that perform well on hardware-coherent systems, especially irregular applications with a lot of synchronization, perform much more poorly under SVM. This is because of poor interaction with the protocols and the large granularities, as well as the performance characteristics of commodity communication architectures. To better understand and reduce this performance gap, the next phase of research requires a much more integrated approach across the boundaries of applications, bottleneck-driven protocol improvements, and architectural support in communication architectures, understanding and exploiting the synergies among the application, protocol and communication layers rather than keeping some of them fixed. Work so far in all the layers together has demonstrated substantial performance improvements, with end performance quite close to hardware-coherent systems for small-scale SMP-based systems for several applications [5], but even this has required application restructuring and there is still a long way to go. Research is also needed in guidelines for structuring applications for performance portability across hardware- and software-coherent platforms, in fine-grained versus coarse-grained software shared memory on clusters as communication architectures evolve in functionality and performance characteristics, and in scalability of performance and memory needs. All the research must now be closely driven by a wider range of real applications. Another challenge for SVM is whether programs will indeed have their synchronization properly labeled in a manner that can be detected by the release-consistent SVM system.
Interesting questions remain about the role of software shared memory. Is it best viewed as a cheaper but much lower-performance substitute for hardware-coherent machines (best used for program development or for very coarse-grained low-communication applications), or as a good way to support the shared address space programming model on the increasingly important platform of clusters of SMPs (perhaps with some simple architectural support), or as a truly competitive mechanism for extending the attractive programming model across sizeable hardware-coherent multiprocessor nodes, thus removing the need for very large hardware-coherent systems to be built? The answer will depend on the performance that can be achieved and the amount of program restructuring and effort needed to obtain this performance. Despite the years of work on protocols, there is clearly a lot of exciting research left to do in pushing performance higher and understanding the role and potential of software shared memory as an alternative to hardware cache coherence or to message passing on clusters.
