This paper presents an evaluation of the impact of several architectural parameters on the performance of Virtual Memory (VM) based cache coherence schemes for shared-memory multiprocessors. The VM-based cache coherence schemes use the traditional VM translation hardware on each processor to detect memory access attempts that might leave caches incoherent, and maintain coherence through VM-level system software. The implementation of this class of coherence schemes is flexible and economical: it allows different consistency models, requires no special hardware for multiprocessor cache coherence, and supports arbitrary interconnection networks.
Introduction
The performance gap between processors and memory has widened in the past. As the performance gap becomes wider, bus-based multiprocessors become more sensitive to the available memory bandwidth of the bus. Many interconnects can provide higher aggregate memory bandwidth than memory buses. Unfortunately, known interconnects such as crossbars, rings and routing networks cannot take advantage of the multiprocessor support implemented in many high-performance processors, because snoopy cache coherence protocols require atomic broadcast capabilities to invalidate or update replicated cache blocks. Without atomic broadcast, the current alternatives are to employ directory-based protocols or to use compiler support to insert cache flush and invalidation instructions into application code.
This research was supported in part by NSF grant CCR-9020893, by DARPA and ONR under contracts N00014-91-J-4039, and by Intel Supercomputer Systems Division.
The class of software coherence schemes we investigate can be implemented on very simple multiprocessor architectures based on arbitrary interconnection networks, without requiring hardware directories or compiler support. The virtual memory (VM) translation hardware on each processor is used to detect shared accesses that could lead to memory incoherencies, and then page access fault handlers execute the actions to maintain cache coherence.
This paper reports the results for VM-based protocols that implement sequential consistency and release consistency. We investigated several architectural parameters and their effects on the performance of the VM-based schemes, when compared with the Illinois snoopy-cache protocol [PP84] and a scheme with no support for shared data caching. Our results show that VM-based cache coherence is a practical approach for building shared-memory multiprocessors, and that it performs well under several different architectural settings.
VM-based Cache Coherence
This section describes the basic mechanism required for VM-based cache coherence schemes and the algorithms for implementing two memory consistency models.
Architectural Framework
VM-based cache coherence schemes can be implemented on very simple multiprocessor architectures using arbitrary interconnection networks. No snoopy cache or directory controllers are required in the system; the only requirement is that each processor have a memory management unit (MMU) with support for the following commonly available features: (1) at least three different protection states associated with a page: read-only (READ), read/write (WRITE), and no access (NIL); (2) physical caches that support page-based selective write-through or write-back, and selective invalidation of cached lines from software; and finally, (3) a separate page table for each processor and TLBs that can be reloaded from software.
The VM-based software maintains cache coherence at the virtual memory address space level. The page table entries (PTEs) for a shared page will always have the same virtual-to-physical address translation, while the protection bits can be different to reflect the coherence state of the page. VM software maintains the page tables at run time according to a chosen consistency model.
Algorithms for Sequential and Release Consistency
In all VM-based algorithms we assume private data is written back to memory on cache line replacement. For shared data, we first present each algorithm assuming write-through, and then enumerate the changes required to use write-back. Sequential consistency (SC) [Lam79] requires the execution of a parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. Pseudo-code for the VM-based algorithm of SC is presented in Figure 1. In SC, additional hardware support is needed for interprocessor interrupts over the network. With write-back caches, the memory may not have up-to-date data when another processor requests access to a page; the algorithm must therefore flush the cache entries associated with a page when it undergoes a transition from the WRITE state. These transitions now take longer, but overall traffic along the interconnection may be reduced.
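Figure 1 is not reproduced in this text. As a rough illustration only, the following Python sketch models a single-writer/multiple-reader page protocol driven by protection faults, using the READ, WRITE and NIL states described above. The class and method names are hypothetical, the interprocessor interrupt is modeled as a direct method call, and the sketch should not be read as the algorithm of Figure 1.

```python
from enum import Enum


class Prot(Enum):
    NIL = 0    # no access: any reference faults
    READ = 1   # read-only: writes fault
    WRITE = 2  # read/write: no faults


class SCPageTables:
    """Toy model of per-processor protection bits for shared pages."""

    def __init__(self, n_procs, n_pages):
        # One protection entry per (processor, page); all start at NIL.
        self.prot = [[Prot.NIL] * n_pages for _ in range(n_procs)]
        self.flushes = 0  # flushes forced on transitions out of WRITE

    def _downgrade(self, proc, page, new_state, write_back):
        # Stands in for an interprocessor interrupt serviced by `proc`.
        if self.prot[proc][page] is Prot.WRITE and write_back:
            self.flushes += 1  # dirty lines must reach memory first
        # A real handler moving to NIL would also invalidate cached lines.
        self.prot[proc][page] = new_state

    def read_fault(self, proc, page, write_back=False):
        # A current writer, if any, is demoted to READ.
        for other in range(len(self.prot)):
            if other != proc and self.prot[other][page] is Prot.WRITE:
                self._downgrade(other, page, Prot.READ, write_back)
        self.prot[proc][page] = Prot.READ

    def write_fault(self, proc, page, write_back=False):
        # Every other copy is revoked before the writer proceeds.
        for other in range(len(self.prot)):
            if other != proc and self.prot[other][page] is not Prot.NIL:
                self._downgrade(other, page, Prot.NIL, write_back)
        self.prot[proc][page] = Prot.WRITE
```

With `write_back=True`, the only extra work in this model is the flush counted on each transition out of WRITE, which mirrors the longer transitions noted in the text.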
Release consistency (RC). A detailed discussion of VM-based RC is presented in [Pet93]; here we summarize the key features of the algorithms shown in Figure 2: synchronization variables are not cached and are kept sequentially consistent by ensuring that memory accesses to these variables block until completed at the memory modules; page faults on accesses to data other than synchronization variables are used to keep track of how many processors have access to a shared page, and when it can become incoherent; a counter for each shared page, weak_i, indicates the number of processors that have access to that page; and a globally shared non-cacheable list, the WeakList, is used to keep track of pages that are simultaneously shared and updated.
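The bookkeeping described above can be pictured with a small Python sketch. It covers only the per-page counter and the WeakList, not the actions taken at Acquire and Release, and it rests on one assumption that the text does not spell out: a page is treated as "simultaneously shared and updated" once more than one processor has access and at least one of them has write access. All names are hypothetical.

```python
class SharingTracker:
    """Toy bookkeeping for the per-page counter and the WeakList."""

    def __init__(self):
        self.holders = {}       # page -> processors with any access
        self.writers = {}       # page -> processors with write access
        self.weak_list = set()  # pages assumed shared and updated

    def weak(self, page):
        # Plays the role of the per-page counter weak_i.
        return len(self.holders.get(page, ()))

    def on_fault(self, proc, page, is_write):
        # Page faults are the only point where sharing is observed.
        self.holders.setdefault(page, set()).add(proc)
        if is_write:
            self.writers.setdefault(page, set()).add(proc)
        # Assumed rule: more than one holder and at least one writer.
        if self.weak(page) > 1 and self.writers.get(page):
            self.weak_list.add(page)
```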
To implement VM-based RC with write-back for shared data, it is necessary for the previous owner of a lock to flush shared data updates back to memory before any new Acquire can succeed. In our algorithm we use an additional word for each lock, lastowner_X, to identify on which processor the lock was last held, and a per-processor list, called an UpdateList, which holds the page numbers of all shared pages that have been updated by that processor since it last serviced a flush request.
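A minimal sketch of this write-back extension, in Python, may help fix the roles of the two data structures. It assumes that the last-owner word is updated at Acquire time and that the flush request is a synchronous call; mutual exclusion itself is left out. The names `Processor`, `Lock`, and `acquire` are illustrative and do not come from Figure 2.

```python
class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.update_list = set()  # shared pages written since the last flush request
        self.flushed = []         # log of flushed pages, for illustration only

    def record_shared_write(self, page):
        self.update_list.add(page)

    def service_flush_request(self):
        # Write back every updated shared page, then start a fresh list.
        self.flushed.extend(sorted(self.update_list))
        self.update_list.clear()


class Lock:
    def __init__(self):
        self.last_owner = None  # plays the role of lastowner_X


def acquire(lock, me, procs):
    """Complete an Acquire only after the previous owner has flushed."""
    prev = lock.last_owner
    if prev is not None and prev != me.pid:
        procs[prev].service_flush_request()
    lock.last_owner = me.pid
```

In this model a processor that reacquires a lock it held last triggers no flush, since its own cache already holds its updates.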
Performance Evaluation
We use trace-driven simulations to evaluate the VM-based cache coherence schemes. There are two main goals for this evaluation: to compare the VM-based approach with the Illinois invalidation-based snoopy-cache protocol [PP84] using a bus-based architecture, and to evaluate the impact and implications of different architectural parameters such as crossbar networks, VM page sizes, write-back and write-through caches, memory access latency, and cache sizes on these schemes.
To compare the VM-based methods with a hardware-based cache coherence mechanism, we selected application programs implemented for multiprocessors with hardware cache consistency. The application suite includes a Gaussian elimination program, matrix multiply, and four programs from the SPLASH benchmarks [SWG91]: LOCUS, MP3D, CHOLESKY, and WATER.
The Simulator
We implemented a processor-cycle-level simulator based on a group of light-weight threads, each representing one processor. Each node in the simulated multiprocessor architecture resembles a MIPS R3000, with a 20K direct-mapped data cache and an 8-entry write-buffer. The memory unit consists of interleaved memory modules and takes advantage of page mode. Every machine instruction takes one cycle.
The simulated multiprocessor architecture consists of 4 processors connected to a memory unit through a split-transaction bus or crossbar network. Each bus has separate address and data lines with the following latencies: 1 cycle for arbitration, 2 cycles for address latching, 2 cycles to send 4 bytes of data, 10 cycles at the memory modules for the first access to a page, and 2 cycles for subsequent accesses to the same page. The virtual memory page size is 4K bytes, and the cache line size is 32 bytes.
VM-based vs. Snoop and No
The results are based on the performance comparison of the following cache coherence protocols:
Snoop: The Illinois hardware invalidation-based coherence mechanism for snoopy caches.
NO: Shared data is not cached.
VM-SC: The VM-based scheme implementing sequential consistency.
VM-RC: The VM-based scheme implementing release consistency.
Normalized execution time is used to characterize the performance of the VM-based coherence schemes with respect to the hardware snoopy cache approach. In each graph, the execution time for the snoopy scheme represents 100%, and the execution time of all other schemes is normalized to it. Execution time is measured as the number of cycles required for all processors to simulate the whole trace.
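The normalization is simple enough to state as code. The Python sketch below assumes cycle counts are available per scheme in a dictionary; the function name and the cycle counts in the example are made up for illustration and are not measurements from this study.

```python
def normalized_times(cycles, baseline="Snoop"):
    """Express each scheme's cycle count as a percentage of the baseline's."""
    base = cycles[baseline]
    return {scheme: 100.0 * count / base for scheme, count in cycles.items()}


# Invented cycle counts, used only to show the arithmetic.
example = {"Snoop": 2_000_000, "VM-RC": 2_300_000, "NO": 4_000_000}
print(normalized_times(example))
# {'Snoop': 100.0, 'VM-RC': 115.0, 'NO': 200.0}
```

A value above 100% means the scheme takes longer than Snoop on the same trace, and a value below 100% means it finishes sooner.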
A detailed analysis of the results presented in this section can be found in [PL93] and [Pet93]. For brevity, we summarize the most important points of the comparison among the coherence schemes when implemented on the bus-based architecture, labeled Bus in Figure 4:
1. Although not caching shared variables (NO) is the most economical and simplest method of cache coherence, it performs poorly for all applications.
2. Among the VM-based schemes, VM-RC performs best for all applications. The overhead in VM-RC is lower than in VM-SC because of the relaxed memory consistency restrictions.
3. The performance of VM-RC, compared to Snoop, varies depending on the application. The overhead of VM-based RC is a function of the number of pages simultaneously shared and updated by the processors and the frequency of synchronization events that require coherence actions.
Figure 4: Performance of the coherence schemes for four processors.
Relaxing memory consistency, and hence synchronizing processors only when the application requires it, allows the VM-based RC cache coherence scheme to perform very well.
Crossbar Interconnections
The load on the memory bus can greatly affect the performance of the VM-based RC scheme. Since one of the main advantages of the VM-based schemes is their independence of specific interconnection topologies, here we present the results of simulating a multiprocessor architecture connected with a crossbar.
The results in Figure 4 labeled Xbar show that all applications benefit from architectures built with lower-contention interconnects. The improvements vary depending on the ratio of memory access time to computation, and on the amount of bus contention caused when network memory accesses are frequent. For MUL, which spends close to 85% of its execution waiting for memory accesses on a bus-based architecture, performance is improved by 34% on a crossbar architecture with the VM-RC cache coherence scheme. Multiple paths to the memory modules reduce contention on the network by 70%. For applications such as WATER that spend much less time waiting for memory accesses, the improvement can be as small as 5%.
One would have expected the idle time due to write-buffer stalls in GAUSS to be reduced substantially as well. However, when moving to the crossbar interconnection, the memory interleaving in our simulations changed to be page based rather than cache line based. Row updates in GAUSS are thereby serialized. Write-buffer stalls should be reduced if cache-line-based interleaving is added to the memory modules at each processor.
The VM-based RC schemes perform up to twice as well as NO for all applications, and within 17% of the hardware snoopy cache coherence implementation. For applications that have high cache miss ratios, and therefore require high bandwidth from the interconnection, having multiple paths to the memory modules improves performance significantly. For the rest of this paper, all coherence schemes other than Snoop will be simulated for crossbar interconnections.
Page Size
The figure with the results for simulations with larger page sizes is omitted for reasons of space. Large page sizes increase false sharing prohibitively for the VM-based implementation of SC. On the other hand, for RC the results show that using large pages can reduce the overhead of its VM-based implementation. Pages of 16 Kbytes seem to be a good parameter for the set of applications we are studying, so the remaining results are based on 16K pages.
Write-Back Caches
For all applications other than MUL and WATER, between 50% and 77% of all network memory accesses to shared variables are due to cache write-throughs. This can cause a significant load on the interconnection network, as for GAUSS. Here we show the effect of using write-back rather than write-through on shared updates. Figure 5 shows that write-back for shared data does not improve the performance of the VM-RC algorithms; performance actually degrades. LOCUS, WATER and CHOLESKY are affected most negatively by the change to write-back. The execution overhead of VM-based RC is higher with write-back than with write-through because of the additional time an acquiring processor needs to synchronize with the processor that previously held the lock. In addition, although network contention for shared cache reloads is reduced with write-back, the reduction is minimal. The best result is shown by MP3D, where the overhead due to contention on cache reloads of shared data diminishes from 9% to 4%. Write-back does not reduce the impact of write-buffer stalls either: on a flush request all stores to the write-buffer occur at once, and the number of write-buffer stalls increases for most applications.
Even though the VM-based cache coherence schemes perform better with write-through for stores to shared data, VM-RC with write-back is still 10% to 40% faster than not caching for most applications.
Memory Latency
As the gap between microprocessor and DRAM speeds continues to increase, the performance gain from the VM-based schemes over NO should increase as well. To evaluate the effect of memory latency, we simulate an architecture with lower-bandwidth interconnections and slower memory modules: 2 cycles for arbitration, 4 cycles for address latching, 4 cycles to send 4 bytes of data, 15 cycles setup at memory for the first access to a page, and 4 cycles for subsequent accesses to the same page.
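To get a feel for what these parameters mean for a single 32-byte cache line fill, the following Python sketch adds up the latency components for both the original and the slower configuration. It is a back-of-the-envelope estimate, not the simulator's timing model: it assumes no contention, no overlap between phases, and one memory access per 4-byte word, none of which is stated in the text. The names `Timing` and `line_fill_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    arbitration: int   # cycles for arbitration
    address: int       # cycles for address latching
    per_word: int      # cycles to send 4 bytes of data
    first_access: int  # cycles for the first access to a page
    next_access: int   # cycles for later accesses to the same page


def line_fill_cycles(t, line_bytes=32, word_bytes=4):
    """Uncontended, fully serialized estimate for fetching one cache line."""
    words = line_bytes // word_bytes
    memory = t.first_access + (words - 1) * t.next_access
    transfer = words * t.per_word
    return t.arbitration + t.address + memory + transfer


SHORT = Timing(arbitration=1, address=2, per_word=2, first_access=10, next_access=2)
LONG = Timing(arbitration=2, address=4, per_word=4, first_access=15, next_access=4)

print(line_fill_cycles(SHORT))  # 43
print(line_fill_cycles(LONG))   # 81
```

Under these assumptions an uncontended line fill takes a little under twice as long in the slower configuration; any contention on the interconnect adds to that.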
As one expects, the performance of all applications deteriorates with longer memory latencies. Not caching is particularly costly on architectures with low bandwidth interconnections. Figure 6 shows the relative performance of all applications more precisely. In the graph, all results for the short latency interconnection are normalized to Snoop with sl, while results for the long latency interconnection are normalized to Snoop with ll.
In relative terms, the performance of VM-RC with respect to Snoop improves. This occurs because longer memory latencies have a larger impact on bus-based architectures. On average, network contention under Snoop triples with the simulated lower network bandwidth, while for VM-RC it only slightly more than doubles.
In addition, the contention on the bus network is consistently at least twice, and sometimes three times, that on the crossbar. Interconnections such as crossbars alleviate the contention problem, and the results show that the VM-based coherence schemes perform very well in this case.
Cache Size
Evaluating the effect of larger caches on the applications' relatively small data sets gives some insight into the impact of large second-level caches on the relative performance of the VM-based schemes. Here we present the results of simulating 64K caches on each processor. The performance of the applications with Snoop and the larger caches improves by 10% to 70%, while the improvements with the VM-based schemes are not as high, at 5% to 58%. Figure 7 shows the behavioral pattern of the VM-based schemes when compared to Snoop and NO on the same architecture.
The first thing to notice is that larger caches make the relative performance of the no-caching approach much worse with respect to Snoop. The cache miss ratio with Snoop is reduced anywhere from 23% (for MP3D) up to 93% (for GAUSS) by using the larger cache. The effect of larger caches on the VM-based schemes varies among the applications. For LOCUS, WATER, MP3D and CHOLESKY the relative performance of the VM-based schemes stays fairly constant when compared to Snoop and therefore is much better than NO. For these four applications, the VM-based schemes remain within 33% of the hardware snoopy approach.
Although for GAUSS and MUL the relative performance of the VM-based schemes with respect to Snoop deteriorates significantly, the outcomes differ: for GAUSS the VM-based algorithms actually perform much worse than Snoop (within 130% rather than -10% with the smaller cache), while for MUL the VM-based approaches are still better than Snoop, but by a much smaller margin (6% rather than 35%). Figure 7 shows that GAUSS' performance deteriorates because of the time spent idle due to write-buffer stalls.
Scalability of LRC
Figure 8 shows the performance of the VM-based cache coherence protocols on architectures with four, eight and 16 processors. Each of the processors in the larger systems has the same characteristics as in our original simulations. In the graph, the four-processor execution time for the Snoop scheme represents 100%, and the execution time of all other schemes is normalized to it. The results for the VM-based implementations of RC show that they scale to eight processors in a manner similar to Snoop. With 16 processors the performance of the VM-based RC algorithms scales better than Snoop's. The performance improvement of Snoop with 16 processors is minimal compared to the eight-processor system. The maximum reduction in execution time is 23%, for WATER. For all other applications it is less than 13%. On the other hand, the execution times of all applications except GAUSS improve by over 26% in VM-RC. The VM-based RC algorithms scale better than Snoop because of the much increased memory bandwidth. GAUSS' performance is limited by the write-buffer activity, which does not benefit as much from the higher memory bandwidth.
It is important to note that for applications with very high cache miss ratios, such as MUL, increasing the bandwidth to the memory modules is very important. Figure 8 shows that MUL does not scale at all on the bus-based architecture because of the load on the interconnection.
Conclusions
VM-based software cache coherence schemes are effective and economical approaches for building simple shared-memory multiprocessors. Our results show that the VM-based schemes can perform very well. Several parameters affect the performance of the VM-based schemes. In this work we studied the effect of page size, cache size, latency of memory accesses, and scalability on the relative performance of the VM-based schemes. Large pages are prohibitively expensive for the VM-based implementation of sequential consistency, but they can slightly improve the performance of VM-based RC.
When most of an application's working set fits in the cache, misses occur mainly due to coherence constraints. If, in addition, the cost of a coherence miss is high, as with the VM-based schemes, their relative performance becomes poorer when compared to an optimized hardware approach. On the other hand, when cache misses also occur regularly due to capacity limitations, which are less expensive on architectures with low-contention interconnection networks, the performance gap between the hardware and VM-based approaches closes significantly.
Long memory latencies favor the relative performance of the VM-based approach because it can be implemented on non-bus-based interconnections, thereby significantly reducing the contention on the network. Bus interconnections saturate very quickly when transactions take a long time to complete. Interconnections such as crossbars alleviate the contention problem, and our results show that the VM-based coherence schemes perform very well in this case.
The scalability of the VM-based RC algorithms to a larger number of processors is better than that of snoopy cache protocols because of the much increased memory bandwidth supported by the VM-based schemes.
