We use mean value analysis models to compare representative hardware and software cache coherence schemes for a large-scale shared-memory system. Our goal is to identify the workloads for which either of the schemes is significantly better. Our methodology improves upon previous analytical studies and complements previous simulation studies by developing a common high-level workload model that is used to derive separate sets of low-level workload parameters for the two schemes. This approach allows an equitable comparison of the two schemes for a specific workload.
Introduction
In shared-memory systems that allow shared data to be cached, some mechanism is required to keep the caches coherent.
Software cache coherence is attractive because the overhead of detecting stale data is transferred from runtime to compile time, and the design complexity is transferred from hardware to software. However, software schemes may perform poorly because compile-time analysis may need to be conservative, leading to unnecessary cache misses and main memory updates. In this paper, we use approximate Mean Value Analysis [VLZ88] to compare the performance of a representative software scheme with a directory-based hardware scheme on a large-scale shared-memory system.
In a previous study comparing the performance of hardware and software coherence, Cheong and Veidenbaum used a parallelizing compiler to implement three different software coherence schemes [Che90]. For selected subroutines of seven programs, they show that the hit ratio of their most sophisticated software scheme (version control) is comparable to the best possible hit ratio achievable by any coherence scheme.
Min and Baer [MiB90b] simulated a timestamp-based software scheme and a hardware directory scheme using traces from three programs. They also report comparable hit ratios for the two schemes. However, they assume perfect compile-time analysis of memory dependencies, including correct prediction of all conditional branches, which is optimistic for the software scheme.
Owicki and Agarwal [OwA89] used an analytical model to compare a software scheme [CKM88] against the Dragon hardware snooping protocol [ArB86] for bus-based systems. They conclude that the software scheme generally shows lower processor efficiencies than the hardware scheme and is more sensitive to the amount of sharing in the workload.
The main drawback of their method is that the principal parameters that determine the performance of the two schemes are specified independently of each other, and therefore for a given workload it is difficult to estimate how the schemes would compare.
Furthermore, they assume the same miss ratio (0.4-2.4%) for private and shared data accesses in the hardware scheme, which is an optimistic assumption as shown in studies of sharing behavior of parallel programs [EgK89, WeG89].
From the high-level workload model, we derive two sets of low-level parameters that are used as inputs to queueing network models of the systems with hardware and software coherence. We compare a software coherence scheme similar to one proposed by Cytron et al. [CKM88] to a hardware directory-based $Dir_iB$ protocol [ASH88] for large-scale systems. Our conclusions also hold for the version control and timestamp schemes, as discussed in Sections 5 and 6. The goals of our study are to characterize the workloads for which either the software or the hardware scheme is superior, and to provide intuition for why this is so.
The rest of the paper is organized as follows. In Section 2, we discuss the important issues that can result in performance differences between hardware and software schemes. In Section 3, we describe our common high-level workload model. In Section 4, we first describe the system architecture and cache coherence schemes studied in this paper, and give a brief overview of the Mean Value Analysis models for the systems. We then describe how the low-level workload parameters are derived from the high-level workload model. Section 5 presents the results of our experiments. In Section 6, we discuss the overall results of our study, and comment on some related issues. Section 7 concludes the paper.
Performance Issues for Hardware and Software Coherence
In this section we outline the important issues that affect the performance of software and hardware cache coherence schemes. There are two main performance disadvantages of directory-based hardware schemes. First, substantial invalidation or update traffic may be generated on the interconnection network. Second, memory references to blocks that have been modified by a processor but not updated in main memory have to go through the directory to the cache that contains the block.
The performance of software schemes on the other hand is limited by the need to use compile-time information to predict run-time behavior. The limits of this information may force software schemes to be conservative when (1) predicting whether certain sequences of accesses occur at runtime, (2) using multi-word cache lines, and (3) caching synchronization variables.
To detect stale data accesses, the compiler has to identify sequences where one processor reads or writes a memory location, a different processor writes the location, and the first processor again reads the location. In this case, the compiler has to insert an invalidate before the last reference. To identify when such a sequence can occur, the compiler may need to predict some or all of the following: (a) whether two memory references are to the same location, (b) whether two memory references are executed on different processors, (c) whether a write under control of a conditional will actually be executed, and (d) when a write will be executed in relation to a sequence of reads. If any of these is not precisely known, the compiler has to conservatively introduce invalidation operations, perhaps causing unnecessary cache misses. Note that future advances in compiler technology could permit (a) and (b) above to be predicted accurately, while (c) and (d) involve runtime behavior that cannot be known at compile time.
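The stale-access pattern described above can be illustrated on a fully known access trace. The following is a minimal sketch, not the compiler analysis itself: a compiler must predict such sequences conservatively, whereas this code simply replays a trace. The function name `reads_needing_invalidate`, the trace format, and the one-word cache line are hypothetical choices made for illustration.

```python
def reads_needing_invalidate(trace):
    """Return indices of reads that would see stale cached data.

    trace: list of (processor, op, address) tuples with op in {"R", "W"}.
    Assumes one-word cache lines and that every access leaves a copy
    in the accessing processor's cache (illustrative assumptions).
    """
    valid = {}    # address -> processors holding an up-to-date copy
    stale = {}    # address -> processors holding an outdated copy
    flagged = []
    for i, (proc, op, addr) in enumerate(trace):
        v = valid.setdefault(addr, set())
        s = stale.setdefault(addr, set())
        if op == "R" and proc in s:
            # This processor accessed the location earlier, another
            # processor wrote it since: an invalidate is needed here.
            flagged.append(i)
        s.discard(proc)           # after this access the copy is current
        if op == "W":
            s |= v - {proc}       # all other cached copies become stale
            v.clear()
        v.add(proc)
    return flagged
```

For the trace `[(0, "R", "x"), (1, "W", "x"), (0, "R", "x")]` only the final read is flagged. A real compiler would have to flag it whenever the intervening write might execute, which is the source of the unnecessary invalidations discussed above.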
In our analysis we explicitly model the problems of predicting whether and when a write is executed, and treat them separately from the first two sources of uncertainty in data dependence analysis listed above. In this context, we call a write that executes an actual write, whereas we say there is a potential write in the program when the compiler for the software coherence scheme has to insert an invalidate for reasons other than inaccurate prediction of memory access conflicts or processor allocation. Finally, all the software coherence schemes proposed so far require that synchronization variables be uncacheable, whereas many hardware schemes allow such variables to be cached. In the future, the effects of this difference can be mitigated by software techniques [MeSar] that make locks appear more like ordinary shared data. For this reason, we do not model synchronization directly.
The High-Level Workload Model
Our high-level workload model partitions shared data objects into classes very similar to those defined by Weber and Gupta [WeG89]. We use five classes, namely, passively-shared objects, mostly-read objects, frequently read-written objects, migratory objects, and synchronization objects. Passively-shared objects include read-only data as well as the portions of shared read-write objects that are exclusively accessed by a single processor. The latter type of data occurs, for instance, when different tasks of a Single-Program Multiple Data (SPMD) parallel program work on independent portions of a shared array.
1. Note that this is a generalization of the read-only class defined by Weber and Gupta.
We use the term actively shared to collectively denote all classes of shared data that are not passively shared. Table 3.1 summarizes the high-level workload parameters.
(The column of values gives the ranges used in our experiments.) As discussed earlier, we do not model synchronization objects separately, but expect them to behave like ordinary shared data once contention-reducing techniques have been applied [MeSar]. The parameters for mostly-read, frequently read-written and migratory data are further discussed below. These parameters are designed to capture the sharing behavior of the particular data class, so as to reflect the performance considerations discussed in Section 2.
Mostly-Read Data
Mostly-read objects are those that are written very infrequently, and may be read more than once by multiple processors before a write by some processor. An example is the cost array in a VLSI routing program, which is read often by multiple processors, but written only when an optimal route for a wire is decided. Even though actual writes to an object of this class are rare, there could be uncertainty in whether and when writes do occur, possibly causing a large number of unnecessary invalidations. We make the assumption that a processor always reads a mostly-read data element before writing it, so that a write always finds the data in the cache.
The parameters $f_{W|MR}$, $l_{MR}$, and $n_{MR}$ describe accesses to mostly-read data, and are defined in Table 3.1.
The feasible values of these three parameters are constrained in the following way. Define $ratio_{MR}$ to be the average number of compiler-inserted invalidates that a processor executes on a mostly-read data element in the interval between any two consecutive actual writes to the data element, averaged over the intervals when the processor does execute such invalidates.
From the definition, $ratio_{MR} \geq 1$.
Since a processor reads a data element $l_{MR}$ times between compiler-inserted invalidates, $l_{MR} \times ratio_{MR} \times n_{MR}$ is approximately the total number of reads on a data element between two actual writes to the element. But the latter is exactly the overall ratio of reads to writes at runtime, $(1 - f_{W|MR}) / f_{W|MR}$. Therefore,
$$ratio_{MR} \approx \frac{1 - f_{W|MR}}{f_{W|MR} \, l_{MR} \, n_{MR}} \qquad (3.1)$$
This relationship is significant for two reasons. First, it relates the compile-time and runtime behavior of the program, and therefore the performance of the software and hardware coherence schemes for the given program. Second, it constrains the feasible parameter space to be explored in comparing the two schemes.
Parameter table fragment: $loc_{hw}$, $loc_{sw}$: reduction in miss rates to actively shared data due to spatial locality in the hardware and software scheme, respectively. cons (0.1-1.0): reduction in hit rates to actively shared data due to conservative prediction of memory access conflicts and processor allocation.
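The constraint on the mostly-read parameters can be checked numerically. The sketch below assumes the relationship stated in the text, namely that the reads per processor between invalidates, times the ratio, times the number of reading processors, equals the runtime read-to-write ratio (1 - f_W) / f_W. The function names `ratio` and `is_feasible` are illustrative and do not come from the paper.

```python
def ratio(f_w, l, n):
    """Estimate the ratio of compiler-inserted invalidates per actual write.

    f_w: fraction of references to the class that are writes at runtime
    l:   mean reads by a processor between compiler-inserted invalidates
    n:   mean number of processors reading between consecutive writes
    """
    if not 0.0 < f_w < 1.0:
        raise ValueError("f_w must lie strictly between 0 and 1")
    if l <= 0 or n <= 0:
        raise ValueError("l and n must be positive")
    return (1.0 - f_w) / (f_w * l * n)


def is_feasible(f_w, l, n):
    """A parameter combination is feasible only if the ratio is at least 1."""
    return ratio(f_w, l, n) >= 1.0
```

For example, `is_feasible(0.1, 8, 4)` is false because the estimated ratio is about 0.28, whereas `is_feasible(0.01, 8, 4)` is true with a ratio of about 3.1.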
Frequently Read-Written Data
Frequently read-written objects are typically those that show high contention, such as a counter that keeps track of how many processors are waiting on a global task queue. Such data objects are written frequently, and also read by multiple processors between writes. Weber and Gupta show that this type of data can degrade system performance because they cause multiple invalidates relatively frequently.
Writes to this type of data may also be executed conditionally, but a relatively high fraction of these writes would be executed compared to the mostly-read data. As for mostly-read data, we assume that a processor always reads a frequently read-written data element before writing it.
$f_{W|RW}$, $l_{RW}$ and $n_{RW}$ are defined in the same fashion as the corresponding parameters for mostly-read data (Table 3.1). By definition, the fraction of writes to this class, $f_{W|RW}$, is expected to be larger than $f_{W|MR}$. Also, $n_{RW}$ is expected to be small. Similar to $ratio_{MR}$, we can define $ratio_{RW}$ and estimate it as
$$ratio_{RW} \approx \frac{1 - f_{W|RW}}{f_{W|RW} \, l_{RW} \, n_{RW}} \qquad (3.2)$$
Migratory Data
Migratory data objects are accessed by only a single processor at any given time. Data protected by locks often exhibit this type of behavior, where the processor that is currently in the critical section associated with the lock may read or write the data multiple times before relinquishing the lock and permitting another processor to access the data. Migratory data resides in at most two caches at any time. Again, we assume that a processor always reads a migratory data element before writing it. For migratory data, $l_{MIG}$ is the average number of accesses to a migratory data element by a single processor before an access by another processor.
Analysis of the Coherence Schemes
The high-level workload model described in the previous section is used to derive low-level parameters that are inputs to MVA models of the systems being compared. Before describing how the low-level parameters are derived, we state our assumptions about the coherence protocols and the hardware organization, and give a brief overview of the Mean Value Analysis models.
System Assumptions and Mean Value Analysis
We assume a system consisting of a collection of processing nodes interconnected by separate request and reply networks, each with the geometry of the omega network, with 2x2 switches.
We do not believe that the specific choice of network topology should significantly influence the qualitative conclusions of the study. Each node consists of a processor and associated cache, and a part of global shared memory.
Messages are pipelined through the network stages. We assume that buffers are associated with the output links of a switch and have unlimited capacity, and that a buffer can simultaneously accept messages from both incoming links. The parameters describing the architecture are given in Table 4.1.
On a write request to a line in shared state, invalidates are either sent from main memory to some average number of processors or are broadcast to all nodes in the system, consistent with a $Dir_iB$ scheme. The requesting processor is not required to block for the invalidates to complete.
As we will see, one situation where software coherence does better than $Dir_iB$ is when a location is read and [...] RFO is a read operation that procures the requested line in modified state in the processor cache to avoid a directory access on a subsequent write. Since the use of RFO could significantly change the performance of $Dir_iB$ relative to software coherence, we model $Dir_iB$ without and with RFO.
For software coherence, we model a scheme similar to the one proposed by Cytron et al. [CKM88]. The compiler inserts an invalidate instruction before each potential access to stale data, causing the data to be retrieved from main memory. Also, if a write to a shared location is followed by a read by a different processor, the compiler inserts a post operation that explicitly writes the line back to main memory. We assume that the processor is blocked for one cycle for each invalidate and post instruction, i.e., we assume that the processor does not have to block for the post to complete. This is consistent with not requiring a processor to block for invalidates in the hardware scheme. Read and write misses are identical in behavior as far as the network and main memory are concerned.
We use similar approximate Mean Value Analysis models of the system for both coherence schemes. The shared hardware resources in the system, i.e., the memories and the interconnection network links, are represented as queueing centers in a closed queueing network. The task executing on each processor (representing a single customer) is assumed to be in "steady state," executing locally for a geometrically distributed number of cycles between operations on the global memory. We assume that a global memory operation is equally likely to be directed to each of the nodes in the system, including the node where the request originates.
The probabilities of various global memory operations per cycle comprise the low-level workload parameters and are defined in Table 4.2. These parameters are derived from the high-level workload model as explained in Section 4.2.
The MVA models used to calculate system performance are similar to models developed by others for the analysis of different types of processor-memory interconnects [VLZ88, WiE90]. The detailed equations of the model are given in [AAH91]. These models can be solved very quickly and have been shown to have high accuracy for studying similar design issues.
The performance metric we use is processor efficiency, defined as the average fraction of time each processor spends executing locally out of its cache. This measure includes the effects of hit rate and network interference in each of the schemes.
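To give a feel for how a mean value analysis yields processor efficiency, the sketch below implements the textbook exact MVA recursion for a closed, single-class queueing network with a think-time (local execution) stage. It is not the approximate model of [AAH91], whose equations are not reproduced here; the function name `mva_efficiency` and all numeric inputs are assumptions for illustration only.

```python
def mva_efficiency(num_customers, think_time, demands):
    """Exact single-class MVA for a closed queueing network.

    num_customers: number of processors (one customer each)
    think_time:    mean cycles of local execution between global operations
    demands:       mean service demand (cycles) at each shared queueing center
    Returns the fraction of time a customer spends executing locally.
    """
    if num_customers < 1:
        raise ValueError("need at least one customer")
    queue = [0.0] * len(demands)      # mean queue lengths with n - 1 customers
    response = 0.0
    for n in range(1, num_customers + 1):
        # An arriving customer sees the queue left by the other n - 1 customers.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        response = sum(residence)
        throughput = n / (think_time + response)
        queue = [throughput * r for r in residence]   # Little's law per center
    return think_time / (think_time + response)
```

With one processor, 8 cycles of local execution and two centers of demand 1 cycle each, the efficiency is 8 / 10 = 0.8. Adding a second processor introduces queueing and lowers it to 8 / 10.2.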
Deriving the Low-Level Workload Parameters
The low-level parameters for each coherence scheme are derived from the high-level workload model by calculating the probability that a reference of each class causes each type of global memory operation. The system parameters listed in Table 4.1 are used in this derivation.
For the shared-data classes, the global memory access probabilities are calculated assuming a one-word cache line size, and assuming accurate analysis of memory access conflicts. Then, to account for the reduction in miss rates due to spatial locality, these global memory operation probabilities are reduced by the factor $loc_{hw}$ or $loc_{sw}$. Also, for the software scheme, the hit ratio of actively shared data is reduced by the factor cons to account for inaccurate prediction of memory access conflicts. The approach used in calculating the contributions of each shared class is described here, and the detailed equations for all the low-level parameters are given in Appendix A.
Mostly-Read Data. This type of data is read multiple times ($l_{MR}$ times on the average) by a processor between compiler-inserted invalidates. The first read in each such sequence will be a miss for the software scheme, since it is preceded by an invalidate. Therefore, one in every $l_{MR}$ reads to mostly-read data causes a miss in the software protocol. In the hardware protocol each write causes one read miss for each of $n_{MR}$ processors, on the average. The probability of a read miss is therefore $n_{MR} \times f_{W|MR}$. Of these read misses, $1 / n_{MR}$ see the data in modified state (contributing to $p_{r,mod}$), while $(n_{MR} - 1) / n_{MR}$ see the data in state shared (contributing to $p_{r,sh}$).
Writes to mostly-read data do not cause misses with the software protocol, because we assume that they follow a read access. However, each write causes a post operation. In the hardware protocol, all writes to mostly-read data contribute to $p_{w,sh}$. Furthermore, we assume that $n_{MR}$ is large enough that broadcast is required for invalidations. This is consistent with Weber and Gupta's findings, which showed that writes to mostly-read data caused an average of 3 to 4 invalidates even for 16 processor systems [WeG89].
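The per-reference contributions for mostly-read data described above can be written out as a small calculation. This is a sketch under stated assumptions: probabilities are expressed per reference to mostly-read data, cache lines are one word, and cons = 1. The normalization of the software read-miss probability by the read fraction (1 - f_w), the function name, and the dictionary keys are illustrative choices, not the equations of Appendix A.

```python
def mostly_read_contributions(f_w, l_mr, n_mr):
    """Global-operation probabilities per reference to mostly-read data.

    f_w:  fraction of mostly-read references that are writes at runtime
    l_mr: mean reads by a processor between compiler-inserted invalidates
    n_mr: mean number of processors reading between consecutive writes
    """
    reads = 1.0 - f_w
    software = {
        "read_miss": reads / l_mr,   # one in every l_mr reads misses
        "post": f_w,                 # each write causes a post operation
    }
    hw_read_miss = n_mr * f_w        # each write causes n_mr read misses
    hardware = {
        "read_miss_modified": hw_read_miss / n_mr,
        "read_miss_shared": hw_read_miss * (n_mr - 1) / n_mr,
        "write_to_shared": f_w,      # invalidations sent by broadcast
    }
    return software, hardware
```

With `f_w = 0.01`, `l_mr = 8` and `n_mr = 4`, the software scheme misses on about 12% of references while the hardware scheme misses on 4%, which matches the qualitative observation that infrequent actual writes favor hardware.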
Frequently Read-Written Data. The contribution of this class to the probability of read and write misses is calculated in the same manner as for mostly-read data (when RFO is not included).
Since this class has a relatively high fraction of actual writes, the assumption that each write finds the data in shared state will be somewhat pessimistic for the hardware scheme because two consecutive writes could be executed by the same processor, with no intervening reads by other processors. This assumption is also somewhat pessimistic for the software scheme, since not all writes would cause a post operation.
Because fewer processors are expected to read between writes for this class ($n_{RW}$ is low), we assume that all writes to data in shared state cause individual invalidates to be sent from main memory. Therefore, the contribution to $p_{ind.inv.}$ is the same as to $p_{w,sh}$. An average of $n_{RW} - 1$ invalidates are required for each such write.
When RFO is included, every read sees the data in modified state, writes do not miss, and no invalidations are required.
Migratory Data. For migratory data, the first access in a sequence of $l_{MIG}$ accesses is always a read by assumption. We assume that this type of data is written at least once for each sequence of accesses by a processor. Hence there is a read miss once per $l_{MIG}$ accesses for both protocols. Therefore, for the hardware protocol, the first read by a processor in a sequence always finds the data in modified state. Writes in the software protocol do not miss since they always follow a read. In the hardware protocol without RFO, the first write of the sequence finds the data in shared state, causing a miss and causing an individual invalidate to be sent to exactly one processor. This miss and the invalidate are avoided, however, when RFO is included.
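The migratory case is simple enough to tabulate directly. The sketch below expresses the events per access to migratory data under the assumptions stated in the text (one read miss per sequence, at least one write per sequence, one-word lines). The function name and dictionary keys are hypothetical, and any post operations of the software scheme are deliberately left out because the text does not quantify them for this class.

```python
def migratory_contributions(l_mig, rfo=False):
    """Global-operation probabilities per access to migratory data.

    l_mig: mean accesses by one processor before another processor accesses
    rfo:   whether the hardware scheme uses the RFO optimization
    """
    per_sequence = 1.0 / l_mig       # events that occur once per sequence
    software = {
        "read_miss": per_sequence,   # first read follows an invalidate
        "write_miss": 0.0,           # writes always follow a read
    }
    hardware = {
        "read_miss_modified": per_sequence,
        # Without RFO the first write finds the line in shared state.
        "write_miss_shared": 0.0 if rfo else per_sequence,
        "individual_invalidate": 0.0 if rfo else per_sequence,
    }
    return software, hardware
```

For `l_mig = 4`, hardware without RFO incurs two misses per sequence against one for software. With RFO the miss counts are equal, so any remaining difference comes from the cost of each miss rather than from the miss ratio.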
Results
We have used our models to perform experiments comparing the hardware and software coherence schemes. The constraints on the high-level workload model parameters discussed in Section 3 (equations 3.1 and 3.2) allow us to explore the feasible workload parameter space completely. The ranges of workload parameter values that we consider reflect the characteristics of the shared data classes, and are given in Table 3.1. The system parameters (except cons) are held fixed throughout our experiments, and the values are given in Table 4.1. Except for $loc_{hw}$ and $loc_{sw}$, we believe that varying the remaining parameters will not affect the conclusions of our study. The value of 1 for $loc_{hw}$ and $loc_{sw}$ could be pessimistic for the respective schemes since they assume that spatial locality is not exploited. We will comment on these assumptions at the end of the section. Unless otherwise indicated, the experiments for hardware do not assume RFO, and cons = 1. In Section 5.4, we study the effect of smaller values of cons. Since the different data classes are independent of each other, their effects in isolation can be combined to draw conclusions about the overall performance of the software and hardware schemes. We discuss the overall performance results in Section 6.
The Mostly-Read Class
In Figures 5.1(a) and (b), we plot the efficiency of the hardware and software coherence schemes as the fraction of shared data references that are to mostly-read data ($f_{MR}$) is varied from 0 to 1, while all other shared data is passively shared. The hardware scheme is sensitive to $f_{W|MR}$, the fraction of writes to mostly-read data at runtime, and $n_{MR}$, the mean number of processors that access a mostly-read data element between consecutive writes to the element. $n_{MR}$ is held constant at 4 in both graphs, but the results are similar if $f_{W|MR}$ is held constant and $n_{MR}$ is varied. The software scheme is sensitive to $l_{MR}$, the mean number of reads by a processor between compiler-inserted invalidates.
Figure: Contours of constant Efficiency(software) / Efficiency(hardware).
[...] most pessimistic case for the software scheme. Figure 5.1(b) shows the results for $l_{MR} = 8$, where the software scheme has become competitive with the hardware scheme.
In Figure 5.1(a) we observe that as $f_{W|MR}$ increases, the efficiency of the hardware scheme decreases, while the effect on the software scheme is insignificant. The hardware scheme, in turn, is independent of $l_{MR}$. Note here that the values $f_{W|MR} = 0.1$ and $f_{W|MR} = 0.05$ are not feasible for Figure 5.1(b) for $n_{MR} = 4$ and $l_{MR} = 8$, since they cause $ratio_{MR}$ to be less than 1. This restricts the region over which software would be superior to hardware.
We next identify the regions in the parameter space over which one of the schemes performs better. For $l_{MR} = 1$ and 8 in Figures 5.2(a) and (b) respectively, we plot the contours of constant ratio of software to hardware efficiency over a range of values of $ratio_{MR}$, with the fraction of shared data references that are to mostly-read data varying from 0 to 1. In these experiments, $ratio_{MR}$ is varied by fixing $n_{MR} = 4$ and varying $f_{W|MR}$. Similar results are obtained when $f_{W|MR}$ is fixed and $n_{MR}$ is varied.
For low values of $l_{MR}$, we observe that the hardware scheme is significantly better (more than 20% better) than the software scheme if more than 20% of the shared data is mostly-read and $ratio_{MR}$ is greater than 3. In this case, the hardware scheme is superior to the software scheme for most of the feasible parameter space. Software coherence is more than 10% better than hardware only for very low $ratio_{MR}$. However, for $l_{MR} > 8$, the software scheme becomes competitive with hardware over most of the feasible parameter space.
The Frequently Read-Written Class
The parameters related to the frequently read-written class of data, $l_{RW}$, $n_{RW}$ and $f_{W|RW}$, are similar to those for mostly-read data, but their values vary over different ranges, thus distinguishing the class.
The contour plots shown in Figure 5.3 give quantitative estimates of the relative performance of software and hardware coherence over the parameter space. As in Figure 5.2, we use $ratio_{RW}$ to reflect the relationship between the behavior of the two schemes. Again, we vary $ratio_{RW}$ by holding $n_{RW} = 2$ and $l_{RW} = 1$ constant and varying $f_{W|RW}$. As for mostly-read data, the results are similar if $n_{RW}$ is varied instead of $f_{W|RW}$. We observe that the hardware scheme is more than 20% better than the software scheme for $ratio_{RW} \geq 3$ and $f_{RW} > 0.5$. However, we expect that in many programs, less than 20% of shared data references would be to this class ($f_{RW} < 0.2$) since it leads to low processor efficiencies for any coherence scheme. Within this range of values, the software scheme is within 20% of hardware coherence in performance. For higher values of $l_{RW}$, the region for which software is comparable to hardware increases.
Since the RFO optimization may improve the performance of the hardware scheme for frequently read-written data, we examine how the relative performance of the two schemes changes with this optimization.
The efficiencies for the cases without and with RFO are shown in Figures 5.4(a) and (b) respectively. Surprisingly, the RFO optimization degrades the performance of the hardware scheme, removing its advantage over the software scheme in regions where it dominates without RFO, for the entire parameter range that we explored.
The reason for this counterintuitive result is as follows. Without RFO, only the reads that follow an actual write incur a miss, requiring a global memory access, and only the first of these requires three traversals of the network.
With RFO, every read incurs a miss for $l_{RW} = 1$ (here we assume that a data element is read by some other processor between successive writes by any processor), and requires three traversals of the network since the line is always held in modified state. When even a small fraction of the potential writes are not executed, the loss in efficiency due to the extra misses is not compensated for by the lack of misses when the writes occur, as shown in the plots for $ratio_{RW} = 1.16$ ($f_{W|RW} = 0.3$). Another point of interest is that, with RFO, the hardware and software schemes both have the same miss ratios for $l_{RW} = 1$, but the software scheme has a lower cost per miss. In general, relative miss ratios do not completely reflect the difference between hardware and software schemes because of differences in network traffic and miss latencies.
The Migratory Class
The only parameter for migratory data is $l_{MIG}$, the average length of a sequence of accesses by a single processor. Figure 5.5 shows the contour plots for the relative efficiency of the software and hardware schemes with varying amounts of migratory data and $l_{MIG}$. The hardware schemes with RFO (solid lines) and without RFO (dashed lines) are shown. All other shared data is assumed to be passively shared. We observe that the hardware scheme consistently performs worse than the software scheme. This is essentially due to the deterministic behavior of this class of data. Without RFO, the difference is more than 20% for a large range of operation. The RFO optimization brings hardware to within 20% of the software scheme over the entire parameter space, but does not make the hardware scheme outperform the software scheme. This is because, even though the use of RFO avoids the miss on the write for hardware, the read miss requires an extra hop to retrieve the data. Hence, the software scheme is always better than the hardware scheme for migratory data.
The Effect of Conservative Analysis of Memory Conflicts
The above experiments assume that conflicting memory accesses can be accurately identified at compile time. To analyze the effect of this assumption, we studied the effect of reducing hit rates to actively-shared data in the software scheme due to conservative analysis of conflicting accesses (cons < 1). Since the main difference between the hardware and software schemes occurs for mostly-read and migratory data accesses, we assume only these two classes of actively shared data in our experiments. Figure 5.6 plots the ratio of the efficiency of the software scheme to that of the hardware scheme with $f_{MR}$ ranging from 0 to 1, and $f_{MIG} = 1 - f_{MR}$, with separate curves for different values of cons. The parameter settings used were those for which software had comparable performance to hardware coherence for cons = 1. We find that with up to about 10% reduction in hits due to conservative analysis (cons $\geq$ 0.9), the software scheme stays within 10% of hardware. For more than 15% reduction in hit rate, the software scheme becomes more than 20% worse than the hardware scheme.
[...] (cons $\geq$ 0.9), the software scheme is competitive with the hardware scheme for most cases. The most important case for which hardware coherence significantly outperforms software coherence is for the mostly-read class of data. With a high fraction of this class of data, if less than half of the potential writes detected at compile time are executed, the hardware scheme can be more than 30% better than the software scheme. The hardware scheme is also significantly better with high fractions of frequently read-written data, when $ratio_{RW}$ is high. However, we do not expect parallel programs to contain such high proportions of this class of data. Otherwise, the software scheme performs within 10% of the hardware scheme for most cases. For migratory data, the software scheme consistently outperforms the hardware scheme by a significant amount.
The RFO optimization for the hardware can substantially reduce this difference, but does not make the hardware scheme perform better than the software scheme.
The chief significance of these results is in showing the effect of various types of sharing behavior on relative hardware and software performance. For data that consists of conditional writes that are performed infrequently at runtime (high values of $ratio_{RW}$ and $ratio_{MR}$), the software scheme performs poorly compared to the hardware scheme. This suggests that if data with many conditional writes occurs frequently in parallel programs, some mechanism to handle these writes is essential for a software scheme to be a viable option. None of the software schemes proposed so far incorporate such a mechanism. Since the result of conditional branches cannot be predicted at compile time, some hardware support appears necessary so that the compiler can optimistically predict branch outcomes, while the hardware takes responsibility for ensuring correctness when a prediction is wrong.
Although we have specifically modeled the scheme described by Cytron et al., we believe our results apply equally to the Fast Selective Invalidation scheme [ChV88] and to the timestamp-based [MiB90a] and version control schemes [Che90]. The Fast Selective Invalidation scheme has been shown to be very similar to the Cytron et al. scheme in terms of compile-time analysis and exploiting temporal locality. The timestamp-based and version control schemes have been shown to perform better than the scheme by Cytron et al., but our assumption of cons = 1 for the Cytron scheme makes it comparable to these more efficient schemes. Furthermore, neither of these schemes can effectively handle potential writes, and hence they suffer as much from such conservative compile-time predictions as the Cytron scheme.
Finally, all our results have assumed $loc_{hw} = loc_{sw} = 1$, i.e., neither scheme exploits spatial locality for actively-shared data. It is not known if software coherence schemes can effectively use multiple word blocks. It is also not known if multiple word blocks are desirable with hardware coherence schemes in large multiprocessors.
If hardware schemes are shown to exploit significantly more spatial locality than the software schemes, our results no longer hold.
Conclusions
We have used analytical MVA models to compare the performance of software and hardware coherence schemes for a wide class of programs.
Previous studies have yielded seemingly conflicting results about whether software schemes can perform comparably to hardware schemes. The conflict arises because the different studies make varying assumptions about the behavior of parallel programs.
We have characterized the workloads for which each of the two approaches is superior. There are two principal obstacles to such a study: (1) the sharing behavior of parallel programs is not well understood, and (2) for a specific workload, the relative performance of hardware and software schemes depends on the amount of runtime information that can be predicted at compile time. Our approach has been to use a high-level workload model in which (1) shared data is classified into independent classes, each of which can be characterized by very few (1-3) parameters and studied in isolation, and (2) the relationship between the compile-time and runtime characteristics is captured in a manner that can be related to the high-level program, independent of the specific coherence scheme. The high-level workload model is used to generate the workload parameters of the MVA model for each of the schemes, thereby allowing an equitable comparison of the schemes.
Quantitative data and intuitive explanations of the results were given in Section 5. The main conclusions of our study (assuming the software and hardware schemes exploit spatial locality equally) are as follows. Software schemes perform significantly less well (i.e., have at least 20% lower processor efficiency) than hardware schemes only if: (1) cons < 0.85, i.e., the hit ratio to actively-shared data is reduced by more than 15% because of conservative estimates of when two memory accesses conflict, or (2) less than half the potential writes are executed, on the average. Several important programs may fall under the category for which software coherence is significantly less efficient than hardware coherence. For example, detecting memory conflicts at compile time for programs that make heavy use of pointers, such as operating systems and Lisp programs, could be difficult, i.e., cons would be low. On the other hand, for well-structured deterministic programs, our results show that software schemes are comparable to, and in some cases better than, hardware schemes. Many scientific programs fall under this class. Our study motivates the need for more work on characterizing parallel program workloads, and the relationship between compile-time and runtime parameters of parallel programs.
Once such a characterization has been made, our model and its results can be used more effectively.
The table entries for the software scheme are given in Table A1, while those for the hardware schemes without and with RFO are given in Tables A2 and A3 respectively.
