We present an analytical model of a cache coherent shared-memory multiprocessor and compare the results obtained with those from an execution-driven simulation of the same system. Our objective is to evaluate the accuracy of analytical models of this type of system, and in particular to identify the principal sources of error in the modelling of the coherency protocol. The analytical model first derives equilibrium cache line state probabilities which are then used to determine the expected long term message traffic generated by each coherency operation. These traffic rates in turn form the inputs to a queueing model of the processing nodes. Performance measurements such as processor and bus utilisations, mean queue lengths and read/write latency then follow. Validation of the model using synthetic workloads that exercise the whole of a portion of distributed memory of known size shows excellent agreement with respect to simulation. We also consider a "real" benchmark, taken from the Stanford SPLASH suite, which has interesting implications for the parameterisation of our model. The models still validate well but we speculate on some sources of discrepancy due to limitations in both the analysis and the simulation, suggesting how these may be overcome.
Introduction
A topic of considerable interest in parallel computing in recent years has been that of shared-memory computer design and, in particular, the analysis of various caching strategies for maintaining a global coherent address space.
The simplest shared-memory machines are traditionally bus-based and benefit from the broadcasting properties of the bus which both serialises non-local accesses to shared memory and enables all memory traffic to be monitored simultaneously by all nodes. It is well known, however, that buses saturate quickly as the number of processors increases. More recent research has been concerned with scalable systems where the bus is replaced by a more general interconnection network. In these systems coherency has to be maintained by complex message passing protocols which replace the relatively straightforward "snooping" schemes used in a shared bus architecture [2]. A read or write request from a processor may now initiate a significant number of coherency messages which must be individually routed to other nodes. The interesting issue now is the extent to which this additional overhead elongates the optimal memory access time for a given protocol and hardware implementation.
At the design stage performance questions such as this have to be answered by appealing to a performance model of the proposed system. From the engineering point of view it is important to be able to predict how a particular architectural design will perform before the system is built. A change to the design of a processing node, for example, may affect the flow of coherency traffic and create an unexpected bottleneck. Equally, for a particular type of workload it is important to be able to explore the relative performance of different protocols or individual protocol parameters. Experiments with other hardware parameters such as cache line size are also important at the design stage.
One approach is to develop a detailed simulation of the proposed design and to drive the simulation either using address traces (synthetic or program-generated) or directly from an executing program [3]. Whilst this may yield high accuracy for particular workload/architecture specifications it is a labour-intensive task, often requiring large and detailed simulation codes to be written, tested and then executed. Development times are often long and detailed simulation runs typically incur long execution times since the response to each read/write by each processor is modelled in software. The inherent complexity of the simulation also makes it prone to logical errors which in turn can lead to unreliable results.
An alternative, but complementary, approach is to develop a more abstract mathematical model of the system and to solve that model using a combination of established analytical results and numerical techniques. A number of such models have been developed for shared-memory computers, with varying levels of detail [1, 5, 10], and numerical predictions have been produced for a range of architectures and coherency protocols. However, validation has generally been with respect to stochastic simulations which make similar model assumptions and it is far from clear how effective they are at predicting accurately the behaviour of a real system running real parallel programs.
In this paper we show how a distributed cache shared memory multiprocessor can be modelled using a general-purpose analytical approach which can be adapted to different architectures and coherency protocols. Apart from the model of the coherency protocol itself, concerned with cache line states and bus and network traffic, the queueing model of the nodes introduces interesting problems in its own right. We use some numerical predictions from the model to validate it against an existing execution-driven simulation of the reference system [9]. The main validation presented here is with respect to a specially constructed benchmark program; this is a parallel C program which can be configured to use memory in a controlled way for experimental purposes. This enables us to construct a well-understood workload in order that direct quantitative comparisons between model behaviour and simulation behaviour can be undertaken. The validation exercise is inspired by a recent paper (reference omitted) which presents a very detailed model of the protocol [4] but with no accompanying validation. This model was developed in conjunction with a team of commercial computer architects and the importance of validation in this context cannot be overstated.
Finally, we consider a "real" benchmark application, the MP3D particle simulation code from the Stanford SPLASH suite, which has interesting implications for the parameterisation of our model. Not surprisingly, validation becomes more problematic but reasonable agreement is still observed and limitations in both modelling approaches are revealed.
The rest of the paper is organised as follows: Section 2 gives an overview of the architecture used in this study, Section 3 describes the analytical model of the system, Section 4 presents a comparative analysis of the results of both models, and Section 5 gives a summary, conclusions and some pointers to future work.
System Architecture
The coherency protocol, which is described below, is similar to that in [12]. The machine consists of K physically identical nodes, each with a processor, a portion of the global memory and a second-level cache which is kept consistent across the machine in accordance with the coherency protocol. The communication network is taken to be contention-free and with low latency so that the communication delay is proportional to the length of the message being sent. Although such a network of any size does not exist as yet, recent work in optical communication technology [8] suggests that networks with similar properties may be feasible in the near future. In any case, the assumption is justified here since it is the model of the protocol which is more of interest than the absolute performance of the chosen architecture. It would, of course, be straightforward to incorporate an established sub-model of, for example, a crossbar network or multi-stage interconnection network but this would significantly cloud the issues whilst allowing nothing new to be learned. The assumed node structure is shown in Figure 1.
The Coherency Protocol
Memories and caches are divided into lines each comprising a fixed number of blocks of data, with associated tag information. It is assumed that there are N such shareable lines in total and each cache is assumed to have a capacity of n lines. The memory within each node is large and for reasons of cost is constructed from DRAM. Caches, on the other hand, use static RAM and special-purpose logic to achieve high-speed associative look-up.
The coherency protocol is invalidation based so that a write to a line may only proceed after all other cached copies of the line have been removed. This is achieved by a singly linked sharing list. The address of a given line uniquely determines the home node of that line; in the absence of any sharers, the line exists solely in the DRAM associated with the home node. A cache may contain a copy of a line from the local DRAM but this copy does not form part of the sharing list of the line.
The home node is fixed, cf. a cache-only memory architecture (COMA) where, in effect, all memory accesses are associative, e.g. [7]. A read miss to the line from a processor in another node will cause a copy of the line to be forwarded to the requesting processor where it is cached in the second-level cache; at the same time a link is made to this node via an additional pointer field in the home copy of the line. Subsequent readers will be added to the sharing list in a similar manner with new sharing list entries added at the head. Figure 2 shows the situation when nodes j and k each have a cached copy of a line whose home is at node i. Note that the processor at the home node of a line may cause a copy of the line to be read into its second-level cache so that two copies may potentially exist at the same node.
Figure 2: Sharing lists (Node i, Node j, Node k)
In order to write to a line all other copies (and the home copy) must be invalidated before the write may proceed. This entails sending an invalidate request to the home node which marks the home copy as being invalid and then forwards the request down the sharing list. With the exception of the writing node all sharing list entries will be invalidated by setting a bit in the associated line in the second-level cache. The home node is locked for the duration of this operation. The last entry on the sharing list will respond to the invalidate request by sending a completion message to the requesting node. In the event of the write being a miss this message will also carry a copy of the line. A final message is sent by the requester to the home node which releases the home lock.
When a write is complete the locally cached copy is marked as being dirty, i.e. inconsistent with the original home copy. So long as the line remains in the dirty state the processor may write to it repeatedly without incurring coherency traffic.
If another processor tries to read a line that has been written by a remote processor, the read request sent to the home node is forwarded to the first entry of the sharing list. This will subsequently supply a (valid) copy of the line to the reader and the home node, and all sharing list links will be updated. At this point both copies of the line are marked as clean.
If a second-level cache line forming part of a sharing list is displaced following a miss on another location which maps to the same line, the cached copy has to be "unhooked" from the list. This is achieved by sending a message to the home node which passes it down the sharing list until it reaches the entry preceding the unhooking node. The identifier of this node is then passed to the unhooker which decouples itself from the list by replying with the identifier of its successor. This is used to update the list pointers in the obvious way. We assume that the pointers associated with the second-level cache lines are stored in the network controller so that pointer maintenance and traversal can be performed without generating internal bus traffic.
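The sharing list manipulations described above (addition at the head, invalidation on a write, and unhooking on displacement) can be illustrated with a small sketch. It is only an illustration under simplifying assumptions and not the protocol implementation: the class name `HomeLine`, its methods, and the use of a Python list (head first) in place of per-line hardware pointers are hypothetical, and locking and message latency are ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HomeLine:
    """Home (DRAM) copy of one line; `sharers` holds remote nodes, head first."""
    home: int
    valid: bool = True                      # is the home memory copy up to date?
    sharers: List[int] = field(default_factory=list)

    def read_miss(self, node: int) -> None:
        # A new remote reader is linked in at the head of the sharing list.
        # A copy cached at the home node itself never appears in the list.
        if node != self.home and node not in self.sharers:
            self.sharers.insert(0, node)

    def write(self, node: int) -> int:
        # Invalidate the home copy and every cached copy except the writer's.
        # Returns the number of sharing list entries that were invalidated.
        invalidated = [s for s in self.sharers if s != node]
        self.sharers = [node] if node != self.home else []
        self.valid = False
        return len(invalidated)

    def unhook(self, node: int) -> int:
        # Displacement: the request walks down the list from the head until it
        # reaches the unhooking node. Returns the number of entries passed.
        hops = self.sharers.index(node)
        self.sharers.remove(node)
        return hops


line = HomeLine(home=0)
line.read_miss(1)
line.read_miss(2)                 # node 2 becomes the new head
order_after_reads = list(line.sharers)
hops = line.unhook(1)             # node 1 sits behind the head
line.read_miss(3)
invalidated = line.write(2)       # node 2 writes; node 3 is invalidated
```

In this sketch the list order directly shows why an unhook costs more for entries far from the head, which is the effect the model later averages over.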
Note that a line may be copied to the cache at the home node. In this situation the line state is maintained as for any other cached copy except that it does not explicitly appear in the sharing list for that line. Status information at the home memory indicates whether a copy of the line is held in the local cache.
The second-level cache line states are therefore as follows:
HXC Home Exclusive Clean: The line is cached at the same node as the home copy, and is the only cached copy.
HXD Home Exclusive Dirty: As above, except that the line has been written and so the copy in memory is out of date.
HS Home Shared: The line is cached at the home node and there is at least one other copy cached at another node. The cached copies are consistent with the memory copy.
CX Client Exclusive: The home of the cached line is on another node. This is the only valid cached copy.
CHD Client Head Dirty: The home of the cached line is on another node. This copy is the first on the sharing list, and the home memory copy is out of date.
CHC Client Head Clean: As above, except that the home memory copy is coherent.
CS Client Shared: The home of the line is on another node, and other copies exist.
INV Invalid: The line contains no usable information.
A fuller set of line states could be defined, for example distinguishing clean and dirty shared states. However, the coherency protocol can be specified satisfactorily with this parsimonious set which we therefore adopt for brevity and efficiency.
Coherency Operations
Read/write requests from the processor may result in certain operations being applied to the sharing lists as listed below. Note that two message types are distinguished: short messages which contain only control information (e.g. for managing updates to a sharing list) and long messages which contain both control information and a line of data. Long messages are typically about an order of magnitude longer than short messages and an important objective of the protocol is to avoid sending long messages whenever possible.
Creation (CR) A read or write miss on a line in state INV not cached by client nodes. If a client, the processor sends 1 short message to the home node and receives 1 long message from it containing a copy of the line. The line transits to state HXC, HXD, CHC or CX depending on whether or not the home copy is on the same node as the requesting processor and whether the request was a read or write.
Addition (AD) A read miss on a line in state INV but cached by client processors. The processor sends 1 short message to the home node, which supplies the data if it can, or else forwards the request to the sharing list head. The requesting processor receives one long message containing the line and the sharing list pointers are updated accordingly. The line transits to state HS or CS depending on the location of the home node.
Reduction (RE) A write hit on a (non-invalid) line. If a client, the processor sends one short message to the home requesting invalidation of the chain. In any case, the home sends a short invalidation message down the sharing list, causing all copies to be invalidated. If the requesting node is not the last entry in the sharing list, it reads and invalidates the line in its cache, and sends it to the requesting node. The line transits to state HXD or CX depending on whether the processor is at the home node. Note that writes to lines which are HXC, HXD and CX require no messages to be sent; only local updates to the memory and/or cache are required.
Deletion-Creation (DC) A read or write miss on an uncached line whose address maps to a line not in the invalid state in the cache. This must first be displaced from the cache which involves a deletion: an "unhook" message is sent to the home node which is forwarded down the sharing list. On average this will traverse half the length of the sharing list before arriving at the unhooking node; a short message is associated with each hop. A new list is created for the line being read or written as in CR above.
Deletion-Addition (DA) A read miss on an already cached line whose address maps to a line not in the invalid state in the cache. This must first be displaced from the cache as above. The node is added to the new sharing list as in AD above.
Deletion-Reduction (DR) A write miss on an already cached line whose address maps to a line not in the invalid state in the cache. This is first displaced and the list associated with the line being written is invalidated. The reduction is similar to RE above. The line transits to state HXD or CX.
Read Hit (RH) A read hit on the second-level cache. The line state does not change although bus traffic is generated as a result.
Invalid-Reduction (IR)
A
The Analytical Model
A processor alternates between think periods and periods when it waits for a memory access to be serviced. Let the think period have a mean of $1/\tau_0$. After a think period, the processor generates a memory request and this request may or may not invoke a transaction to a remote node. During the time a request is processed, the processor is idle. We aim to determine the probability, $\pi$, that a processor is busy doing useful work. We will write $\tau = \pi \tau_0$, which is the net rate at which a processor generates read/write requests to the memory system.
The second-level cache is taken to have hit/miss rates of $\rho_{rh}$ and $\rho_{rm}$ for reads and correspondingly $\rho_{wh}$ and $\rho_{wm}$ for writes. For convenience we write $\rho_r = \rho_{rh} + \rho_{rm}$ and $\rho_w = \rho_{wh} + \rho_{wm}$. These are parameters of the workload since they express in some way the degree of locality in the application.
State Transitions
Ultimately we need to determine the message traffic that is generated by each processor in servicing memory requests; this will include messages required by the cache coherency protocol. Assuming that all the processors behave in the same way, we begin by deriving the equilibrium cache line state probabilities. To do this we assume that the evolution of the state of a given line in a cache (we shall refer to this as the observed cache line) follows a Markov process, independent of the states of other lines. This process is irreducible, aperiodic and has a finite state space. It thus has a steady state.
We define the following:
$P_u$: the probability that a line is uncached remotely so that the home copy is the only copy in the machine, i.e. the probability that no sharing list exists for that line;
$P_{loc}$: the probability that an address maps to the local memory of a given node (a workload parameter, equal to $1/K$ for uniform memory access);
$P_{hv}$: the probability that the home copy of a given line in DRAM is up to date;
$P_{hd}$: the probability that a locally cached copy of a line in DRAM is dirty (i.e. the DRAM copy is out of date and the cached copy is the only valid copy anywhere in the system);
$P_1$: the probability that a sharing list has exactly one element;
$P_{second}$: the probability that a sharing list element is second in a list of length greater than 1.
For convenience we will also write the complementary probabilities $P_c = 1 - P_u$, $P_{rem} = 1 - P_{loc}$ and $P_{hi} = 1 - P_{hv}$.
Using $s$ to denote any of the states HXC, ..., INV, the generators of the above Markov process are given by the transition rates below; note that we have divided each by the factor $\tau$, which is the rate at which a processor issues memory requests. The symbol $\sigma$ denotes the mean length of a sharing list at equilibrium and results in an approximate rate.

... is currently uncached ($P_u$) and located in the local memory of the requesting processor ($P_{loc}$). The factor $1/n$ is the probability that the read request maps to the observed cache line. Note that remote operations can also induce state transitions locally. For example, the transition HS $\to$ HXC can occur if a remote processor performs a miss ($\rho_{rm} + \rho_{wm}$) on a cache line which currently holds a copy of the observed line. The transition occurs when the processor is the only other one holding a copy in the machine ($P_1$). The other term ($\rho_{rm} P_u P_{loc} / n$) covers the general case of a transition into state HXC of a read miss on a locally held line. Finally, note that the transition $s \to$ INV corresponds to invalidation: any remote write operation to a line cached locally will cause the line to be invalidated.
The factor $(K - 1)$ here is the number of remote processors which can issue such a write.
The balance equations can be derived from the above transition rates in the usual way (see, for example, [6]) and are solved iteratively. We will write $q_j$, HXC $\le j \le$ INV, to denote the equilibrium probability that a cache line is in state $j$.
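As an illustration of what solving balance equations iteratively can look like, the following sketch computes the equilibrium distribution of a small continuous-time Markov chain by power iteration on its uniformised (discrete-time) version. This is one common iterative scheme and is an assumption here, since the text does not say which method was used; the function name `steady_state` and the two-state rate matrix are hypothetical and are not the cache line model itself.

```python
def steady_state(rates, tol=1e-13, max_iter=200_000):
    """Equilibrium distribution of a CTMC.

    rates[i][j] is the transition rate from state i to state j (i != j);
    diagonal entries must be zero.
    """
    size = len(rates)
    exit_rate = [sum(row) for row in rates]
    # Uniformisation constant chosen strictly above the largest exit rate,
    # which keeps the embedded discrete-time chain aperiodic.
    scale = 1.05 * max(exit_rate)
    p = [1.0 / size] * size
    for _ in range(max_iter):
        new = [
            p[j] * (1.0 - exit_rate[j] / scale)
            + sum(p[i] * rates[i][j] for i in range(size)) / scale
            for j in range(size)
        ]
        converged = max(abs(a - b) for a, b in zip(new, p)) < tol
        p = new
        if converged:
            break
    return p


# Toy chain: state 0 -> 1 at rate 1.0 and state 1 -> 0 at rate 3.0.
# Balance gives p0 * 1.0 = p1 * 3.0, so p0 = 0.75 and p1 = 0.25.
toy = steady_state([[0.0, 1.0], [3.0, 0.0]])
```

The same routine applies unchanged to a larger rate matrix, such as one with a row per cache line state, once the rates are filled in.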
Sharing
The quantities $\sigma$ (the mean length of a sharing list), $P_1$ and $P_{second}$ above require the distribution of the number of sharers of a memory line to be known. This is produced from a separate Markov model of line sharing, taken from the point of view of the memory.
We again assume that the evolution of the number of sharers of a memory line follows a Markov process, independent of the states of other memory lines. The model can be solved using standard techniques since the transition rates are expressed solely in terms of known model parameters.
The Markov process state transition diagram for the number of sharers is shown in Figure 3. Note that the state 0 covers both the case where there is no cached copy of the line and the case where the only cached copy is at the home node. This state can be entered from state 1 as the result of a displacement at the (only) remote node with a copy of the line, and from any other state as the result of a write to the line from the home node. The former occurs with probability $P_{miss}/n$ and the latter with probability $\rho_w P_{loc}/(N/K)$. The state 1 can be reached from state 0 as the result of a remote read to the line (probability $(K - 1)\rho_r P_{rem}/(N/K)$), from state 2 as the result of a displacement (probability $2P_{miss}/n$), and also from any other state as a result of a write to the line from any non-home node (probability $(K - 1)\rho_w P_{rem}/(N(K - 1)/K)$). The transition probabilities for the general case are shown in the diagram. The model is solved to obtain the mean length of the sharing list, $\sigma$, and the equilibrium probability $P_i$ that the sharing list is of length $i$, $0 \le i \le K - 1$. We can now express the probabilities defined earlier:
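As a rough illustration of the sharing model in this section, the sketch below builds a rate matrix for the number of sharers from the transitions named in the text and solves it numerically. Several things are assumptions rather than part of the original model: the upward rate from a general state $i$ to $i + 1$ is taken to be $(K - 1 - i)$ times the per-node read term (the general case is only given in the figure), a remote write is taken to lead to state 1 from state 0 as well as from states 2 and above, and the workload values passed in (`rho_r`, `rho_w`, `p_miss`) are hypothetical. The function names are also hypothetical.

```python
def sharing_rates(K, N, n, rho_r, rho_w, p_loc, p_miss):
    """Rate matrix over states 0..K-1 (number of remote sharers of a line)."""
    p_rem = 1.0 - p_loc
    per_node = N / K
    home_write = rho_w * p_loc / per_node
    remote_write = (K - 1) * rho_w * p_rem / (N * (K - 1) / K)
    rates = [[0.0] * K for _ in range(K)]
    for i in range(K):
        if i >= 1:
            rates[i][i - 1] += i * p_miss / n    # displacement at one sharer
            rates[i][0] += home_write            # write from the home node
        if i != 1:
            rates[i][1] += remote_write          # write from a non-home node
        if i + 1 < K:
            # ASSUMED general case: a read by one of the remaining non-sharers.
            rates[i][i + 1] += (K - 1 - i) * rho_r * p_rem / per_node
    return rates


def steady_state(rates, tol=1e-13, max_iter=200_000):
    """Power iteration on the uniformised chain (diagonal entries are zero)."""
    size = len(rates)
    exit_rate = [sum(row) for row in rates]
    scale = 1.05 * max(exit_rate)
    p = [1.0 / size] * size
    for _ in range(max_iter):
        new = [
            p[j] * (1.0 - exit_rate[j] / scale)
            + sum(p[i] * rates[i][j] for i in range(size)) / scale
            for j in range(size)
        ]
        converged = max(abs(a - b) for a, b in zip(new, p)) < tol
        p = new
        if converged:
            break
    return p


# Machine sizes follow the first benchmark run (4 nodes, 2048 lines,
# 256-line caches); the workload values are made up for illustration.
dist = steady_state(sharing_rates(4, 2048, 256, 0.7, 0.3, 0.25, 0.5))
mean_sharers = sum(i * p for i, p in enumerate(dist))

# With no reads or writes at all, displacements eventually empty every list.
idle = steady_state(sharing_rates(4, 2048, 256, 0.0, 0.0, 0.25, 0.5))
```

The second call is a sanity check on the construction: when nothing ever joins a list, all probability mass should drain into state 0.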
List Operation Probabilities
We next calculate the probability that a processor emerging from a think state invokes a given list operation (the probability distribution of the state at such instants is the same since the memory access stream is assumed Poisson). These are summarised in ...

In order to capture this variability in the model we define a number of bus transaction classes and specify the expected number of each class for each operation/state pair in the form of a table $T$. We also define similar tables $S$ and $L$ representing the number of short and long messages for a given operation/state pair.
For reasons which will become apparent, we divide the classes into four groups. Transactions in the first group are initiated by the processor and require just a cache bus transaction; those in the second require just the memory bus; the third and fourth groups contain those that require both buses, distinguished by which buses they have to queue for. Only transactions initiated from the memory bus can claim both buses; if a cache bus transaction requires the memory bus as well, the cache bus is released prior to queueing for and subsequently claiming the memory bus.
We label the classes $C_1, C_2, \ldots$ and specify the service times for each bus of class $i$ in clock cycles in Table 2. Note that each class belongs to exactly one group. We write $h_{b,i}$ for the holding time (in seconds) of bus $b$ for a class $i$ transaction, $b \in \{c, m\}$. Thus $h_{b,i} = y_{b,i}\, t_{clock}$ where $t_{clock}$ is the clock cycle time and $y_{b,i}$ is an integer number of cycles. These times correspond to the constants used in the execution-driven simulation to determine the various bus delays.
The entries of $T$ are specified by Table 3. The first two columns specify the operation and state. The fourth column details the bus transaction classes that are invoked in order to complete the given operation in the given state, together with the number of short and long messages sent. Since each operation/state pair may produce a number of transaction sequences depending on whether a particular line is local/remote, dirty in the home cache or valid in the home memory, the corresponding probability is listed separately with each row.
Note that reductions are a special case since the reducing processor sends and receives as many short messages as there are members in the associated sharing list. We estimate this by using the mean length of a sharing list, $\sigma$.
Note also that the various deletion operations (DC, DA, DR) comprise an initial "unhook" phase followed by either a separate creation (CR), addition (AD) or invalidation (IR) phase. After the unhook the cache line is essentially left in a temporary invalid (INV) state.
By substituting the class descriptions from Table 2 in place of the $C_i$ in Table 3, a descriptive breakdown of each operation/state pair is produced. For example, the bus and network traffic involved in performing operation AD in state INV depends on the status and location of the new line which is to be read. There are four cases, each occurring with an associated probability. For example, if the line being read is on the same node and is valid with respect to ... Table 2.
If, however, the new line's home node is elsewhere and if the home copy is invalid, due to a client write operation, then more work must be done. A short message is sent to the home node of the new line; when this is received the home node forwards a short request message to the current sharing list head which has an up-to-date copy of the line. This node claims both buses, fetches a copy of the line from its cache and returns it as a long message, targeted to the initiator of the AD operation. When this long message is received both buses at the initiator are claimed and the line is transferred to the second-level cache. This in turn restarts the processor. In this latter case two short messages and one long message are sent; their transmission does not affect the bus queueing times but does add a delay to the overall read/write response time.
The (sum of the) coefficient(s) of $C_i$ for operation $p$ in state $s$, together with the associated probability, determine $T_{p,s,i}$. If the coefficient is $F$ and the associated probability is $r$ then $T_{p,s,i} = rF$. Similarly the coefficients of $S$ and $L$ (i.e. the number of short and long messages respectively) in Table 3 determine $S_{p,s}$ and $L_{p,s}$ respectively.
Modelling the Nodes
The node architecture is modelled as a queueing network with a server representing each bus. The bus delays depend on the type of transaction and so the transaction classes in Table 2 become service classes in an M/G/1 queueing model. Pointer traversal and pointer maintenance are handled by the network controller; this is pipelined and the associated delays are therefore assumed to be subsumed by the message transmission times.
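To make the multi-class M/G/1 treatment of a single bus concrete, the sketch below mixes per-class arrival rates and constant holding times into the first two moments of the service time and then applies the standard Pollaczek-Khinchine formula for the mean queueing time. It deliberately leaves out the simultaneous possession of both buses discussed in this section, and the function name `mg1_bus` and the class rates and holding times used are hypothetical.

```python
def mg1_bus(arrival_rates, holding_times):
    """Utilisation and mean queueing time of one bus as an M/G/1 queue.

    arrival_rates[i] and holding_times[i] describe transaction class i.
    """
    total_rate = sum(arrival_rates)
    # Moments of the service time of a randomly chosen arriving transaction.
    m1 = sum(a * h for a, h in zip(arrival_rates, holding_times)) / total_rate
    m2 = sum(a * h * h for a, h in zip(arrival_rates, holding_times)) / total_rate
    utilisation = total_rate * m1
    if utilisation >= 1.0:
        raise ValueError("bus is saturated; the M/G/1 queue has no steady state")
    # Pollaczek-Khinchine mean waiting time (time in queue before service).
    mean_wait = total_rate * m2 / (2.0 * (1.0 - utilisation))
    return utilisation, mean_wait


# Two made-up classes: 0.01 and 0.02 arrivals per cycle,
# holding the bus for 4 and 10 cycles respectively.
utilisation, mean_wait = mg1_bus([0.01, 0.02], [4.0, 10.0])
```

For these numbers the mean service time is 8 cycles and its second moment is 72, giving a utilisation of 0.24 and a mean wait of 2.16/1.52 cycles.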
We need to distinguish the class groupings in Table 2 and we write $G_i$ to denote the set of classes associated with Group $i$ in the table. The sets of transactions that require, respectively, the cache bus and the memory bus are defined to be:

The average arrival rate of transactions of class $i$ is given by:

The queueing network is complicated by the fact that internal requests from the processor and external requests from the network may require either one bus, or both buses, to complete a transaction. Group 3 transactions hold both buses at the same time, leading to a form of simultaneous resource possession in the queueing network and hence blocking-before-service. We develop an approximate solution to this problem by augmenting the service time at the memory bus with the waiting time at the cache bus, for transactions in group 3.

The $n$th moment of the service time at bus $b \in \{c, m\}$ is given by:

The mean queueing time at the memory bus is complicated by the fact that bus transactions in group 3 require both buses to be held simultaneously. The M/G/1 model requires the second moment of service time and so we have to calculate the second moment of the waiting time at the cache bus. Since the service time at the cache bus is constant for the transaction class in question, the required second moment is given by the second moment of the queueing time at the cache bus, $Q_{2c}$. To simplify the notation that follows, the memory bus holding times listed for group 3 classes in Table 2 include their cache bus holding times as well; a sum of two constants. For consistency, the same applies to group 4. With the memory response time $R$ known, the processor utilisation, or "system power", is then easily found:
Note, however, that in calculating the arrival rates for each bus transaction class we assumed the existence of $\pi$, the probability that a processor is busy, by virtue of the factor $\tau = \pi \tau_0$ in the defining summation. We have thus introduced a new fixed-point problem for determining $\pi$ and again appeal to an iterative solution method. Essentially, this is how we are solving a closed system in which no more than K nodes can be waiting for a memory access, the rest "thinking". We note in passing that some care has to be taken in updating the approximation to $\pi$ in order to ensure convergence. The execution time for the whole model is around 5 seconds using Mathematica 2.2 on a PowerMacintosh 7100/66 computer. This can be compared with the execution times for the simulator which, for realistic application benchmarks, can run into many hours, or even days. The benefits of the analytical model in this respect are self-evident.
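The fixed-point problem described above can be sketched as a damped iteration. The sketch assumes the usual relation for a processor that alternates thinking and waiting, namely that the busy probability equals the mean think time divided by the sum of the mean think time and the memory response time; the exact expression used in the model is not reproduced here. The response-time function `toy_response_time` is a made-up stand-in for the full queueing model, and all names and numerical values are hypothetical.

```python
def solve_busy_probability(think_rate, response_time, damping=0.5,
                           tol=1e-12, max_iter=10_000):
    """Damped fixed-point iteration for the probability a processor is busy.

    think_rate is the reciprocal of the mean think period; response_time maps
    the net request rate of one processor to the mean memory response time.
    """
    busy = 1.0                                    # optimistic starting guess
    for _ in range(max_iter):
        request_rate = busy * think_rate          # net rate of memory requests
        target = 1.0 / (1.0 + think_rate * response_time(request_rate))
        # Move only part of the way towards the new estimate; undamped
        # updates can oscillate when the response time is load-sensitive.
        updated = (1.0 - damping) * busy + damping * target
        if abs(updated - busy) < tol:
            return updated
        busy = updated
    raise RuntimeError("fixed-point iteration did not converge")


def toy_response_time(request_rate, nodes=4, service=2.0):
    # Hypothetical stand-in: one shared server loaded by all nodes.
    load = nodes * request_rate * service
    return service / (1.0 - load)


busy = solve_busy_probability(0.1, toy_response_time)
```

For this toy response function the fixed point can be found in closed form as the smaller root of $0.8\pi^2 - 2\pi + 1 = 0$, which gives a convenient check on the iteration.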
Comparative Analysis
In this section we undertake validation of the above analytical model against results obtained from an existing low-level event-driven simulator of the same system [9]. The simulator is capable of running real parallel programs, and collects statistics such as cache hit rates, bus utilisation, processor utilisation, etc. The simulator models the same target architecture, and has been calibrated identically, using the constants defined in Table 2, and the coherency protocol used adheres to the actions of Table 3.
Validation obviously requires the analytical model and the simulator to use identical workloads. We have adopted two synthetic programs for this purpose. The first, synth, is a synthetic memory reference generator program in which each processor repeatedly performs local operations for a random period, before making a memory reference to a location chosen randomly from a fixed-size region according to a uniform probability distribution. Thus we take $P_{loc} = 1/K$ in the analytical model. The program is unrealistic in that no locality is modelled, but it does closely reflect the workload parameterisation of the analytical model. Moreover, the size of the program's memory is clearly the size of the given fixed region. It is therefore reasonable to expect a close correlation between the two sets of results. The second program, synth-l, is a modified version of the first, in which it is more likely that a processor references memory for which it is the home node. Specifically we chose $P_{loc} = 3/K$. In this way, we use $P_{loc}$ to reflect the locality which exists in real programs: programmers structure programs so that most references are to locally allocated data.
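The way the two synthetic workloads select addresses can be sketched as follows. This is not the benchmark itself (which is a parallel C program); it is a hypothetical illustration in which memory is laid out in contiguous blocks of `lines_per_node` lines per home node, a reference goes to the local node with probability `p_loc`, and otherwise to a remote node chosen uniformly. The function names and the address layout are assumptions.

```python
import random


def pick_home(node, K, p_loc, rng):
    """Choose the home node of the next reference made by `node`."""
    if rng.random() < p_loc:
        return node
    other = rng.randrange(K - 1)            # uniform over the K - 1 remote nodes
    return other if other < node else other + 1


def synth_reference(node, K, lines_per_node, p_loc, rng):
    """Return the global line number of the next reference."""
    home = pick_home(node, K, p_loc, rng)
    offset = rng.randrange(lines_per_node)  # uniform within the home's memory
    return home * lines_per_node + offset


def home_of(line, lines_per_node):
    # Assumed layout: the line number alone determines the home node.
    return line // lines_per_node


rng = random.Random(1)
K, lines_per_node, node = 4, 512, 2
# synth uses p_loc = 1/K (uniform); synth-l uses p_loc = 3/K (0.75 for K = 4).
refs = [synth_reference(node, K, lines_per_node, 3 / K, rng)
        for _ in range(20_000)]
local_fraction = sum(home_of(r, lines_per_node) == node for r in refs) / len(refs)
```

Setting `p_loc` to `1 / K` in the same call makes every line equally likely, which corresponds to the uniform case.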
Methodology
The comparative analysis is done as follows: the simulator is first used to execute a given benchmark program. This produces a set of workload parameters for the benchmark which are used to parameterise the analytical model. These consist of the mean think period $1/\tau_0$, the various hit rates for reads and writes, $\rho_{rh}$, $\rho_{rm}$, $\rho_{wh}$, $\rho_{wm}$, and the memory usage N, together with the machine configuration parameters K and n.
We first compare the equilibrium line state probabilities predicted and then consider the key performance metrics, namely the bus and processor utilisations. From these mean bus queueing times and memory latency follow immediately: see the formulae at the end of Section 3.2. We remark that the execution time of the analytical model is of the order of five seconds using Mathematica on a Macintosh Power PC. By way of contrast the simplest simulation runs here take of the order of three hours: this increases dramatically as the problem size is scaled.
Simulation Parameters
The simulation parameters for synth and synth-l are shown in Table 4. For example, the first benchmark run assumes a 4 node system with a total of 2048 lines of memory. The caches at each node are 256 lines in size, and therefore cache replacements will occur. Although the cache and memory sizes used here are small, their ratio largely determines the output of simulations. More realistic cache sizes require longer simulation times to ensure that results are not dominated by startup effects.

Table 6: Miss rates and think times for synth-l
Simulated Miss Rates and Think Times
The miss rates and think times determined by the simulator characterise the workload, albeit in a somewhat simplistic manner. These results, together with the simulation parameters (number of nodes, cache and memory size), form the parameters to the analytical model. The results are shown in Table 5 for synth and Table 6 for synth-l.
Equilibrium Cache Line State Probabilities
These are the results of the first phase of the analytical model, which in turn form the input to the queueing model of the nodes (phase 2). It is vital, therefore, that close agreement is found between the corresponding results from the simulation and the analytical model. For each run of the simulator, the time spent in each cache line state is recorded, allowing state probabilities to be determined. Results are summarised in Tables 7 and 8. Two lines are shown for each run: the upper line gives the simulation results, whereas the lower comes from the analytical model. Since the simulator is execution-driven, and hence relates to specific programs, confidence bands were not appropriate in its output analysis.
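The conversion from recorded residence times to state probabilities, and the comparison of the two rows of each table, can be sketched as follows. The function names, the state labels and the numbers are illustrative assumptions only; they are not identifiers or values taken from the simulator.

```python
def state_probabilities(time_in_state):
    """Normalise accumulated time per line state into probabilities."""
    total = sum(time_in_state.values())
    if total <= 0:
        raise ValueError("total recorded time must be positive")
    return {state: t / total for state, t in time_in_state.items()}


def largest_discrepancy(simulated, predicted):
    """Largest absolute difference between two sets of state probabilities.

    A state missing from one set is treated as having probability zero.
    """
    states = set(simulated) | set(predicted)
    return max(abs(simulated.get(s, 0.0) - predicted.get(s, 0.0))
               for s in states)


# Illustrative residence times (hypothetical labels, not results from the paper).
recorded = {"invalid": 50.0, "shared": 150.0, "dirty": 300.0}
simulated = state_probabilities(recorded)
```

Here simulated maps each state to its share of the total recorded time, so the values sum to one and can be compared directly with the model's equilibrium probabilities.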
Note that the simulator does not differentiate between certain pairs of states, and a single probability for such pairs is produced. For example, the states CHC and CHD are not explicitly maintained, and form part of CS. These states were introduced in the model in order to estimate P_hv.
Table 9: Bus Utilisations for synth
It can be seen that the results from the analytical model and the simulator show very strong agreement, especially in the case of uniform memory access. The largest discrepancies occur for rare line states which are less significant and for which the simulation-based probabilities are less reliable anyway. For non-uniform memory access, agreement is slightly less good when there are 4 nodes. Even here, the significant discrepancies are restricted to the invalid state (e.g. 3% vs. 5%) and the home-shared state (e.g. 6% vs. 11%) which occur with small probabilities.
These results indicate that the cache line state transitions defined for the analytical model in terms of the workload parameters are accurate.
Bus Utilisation
A key performance metric for architectures of this type is the utilisation of the system buses. If utilisation is too high, requests will spend a relatively large period of time queueing for the buses, thus reducing overall system power. Alternatively, low utilisation implies that the cost of designing and implementing the decoupled bus arrangement may not be offset by an adequately large improvement in system performance. Results for the analytical model and simulator are shown in Table 9. Again, simulation results are shown in the top line of each pair.
1. The constants defined for the various operations in Table 2 agree with those of the simulator.
2. The assumptions and simplifications adopted in the model are valid.
Processor Utilisation
Our final results concern processor utilisation, or system power. This is the proportion of the execution time of the program during which processors are active, and is the most important metric. The graph of Figure 4 shows the variation of processor utilisation as a function of the number of nodes, for the three memory sizes.
As expected, the analytical model's predictions are very close to those of the simulator. This is unsurprising in view of the good agreement already obtained for bus utilisations and line state probabilities, together with the fact that service time parameters and frequencies of transaction classes for each operation/line state pair are specified identically.
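To make the link between processor utilisation and memory latency concrete, the sketch below assumes a simple alternating cycle in which a processor computes for a mean think period and then stalls for a mean latency on each memory reference. This cycle, and the function and parameter names, are assumptions made for illustration; the paper's own formulae are those at the end of Section 3.2 and may differ in detail.

```python
def processor_utilisation(mean_think, mean_latency):
    """Fraction of time the processor is active in a think/stall cycle."""
    if mean_think <= 0 or mean_latency < 0:
        raise ValueError("think time must be positive, latency non-negative")
    return mean_think / (mean_think + mean_latency)


def latency_from_utilisation(mean_think, utilisation):
    """Invert the relation above to recover the mean latency per reference."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must lie in (0, 1]")
    return mean_think * (1.0 - utilisation) / utilisation
```

Under this assumption a small drop in utilisation corresponds to a modest increase in latency: with a think period of 10 time units, a latency of 2.5 units gives a utilisation of 0.8.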
Processor utilisation decreases as the memory size is increased, since it becomes less likely that the caches will hold the required data. In our terminology, memory size is the size of a program rather than the number of lines provided in the architecture. Hence, increased memory size means a larger program. Utilisation also decreases as the number of processors increases, as expected, since the wider distribution of memory causes increased overheads. However, the decrease is quite slight and so elongation of memory latency (which follows directly from the definition of processor utilisation) is not excessive. Hence we can be optimistic about the scalability prospects of the architecture and coherency protocol, at least up to 16 processing nodes.
Table 11: Comparison of equilibrium line state probabilities for MP3D
quantity 3/K used to parameterise the benchmark. This removes the distortion in the memory reference characteristics which is caused by the simulator modelling an on-chip first-level cache. This enables the request to be satisfied without stalling the processor. By using the same principle for synth an improved correlation between the two models is similarly observed.
A Real Benchmark
Finally, we ran the simulator on the MP3D benchmark taken from the Stanford SPLASH suite [11]. Here we anticipated a number of problems. First and foremost, we had no way of estimating the "size" of the program. The simulator accumulates the total amount of memory used, but although this is the required value for the above synthetic programs, a real application runs in phases, each using its "working set" of memory. The required memory size N is this working set size in each phase. It will be much smaller than the total accumulated memory and vary from phase to phase. In fact, a separate instance of the model should be run for each of the phases that can be identified.
However, in the absence of further information, we decided to select a value for N that provided good agreement on the equilibrium probability that a cache line was invalid. The accuracy of the model could then be judged by the closeness of the agreement on the other performance measures, i.e. the remaining line state probabilities and bus and processor utilisations.
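The selection of N described above can be sketched as a one-dimensional search. The example assumes that the model's predicted invalid-state probability decreases monotonically as N grows, which the paper does not state, and it uses a placeholder function in place of the analytical model; all names are hypothetical.

```python
def calibrate_memory_size(model_p_invalid, target, lo, hi):
    """Return the integer N in [lo, hi] whose predicted invalid-state
    probability is closest to `target`.

    Assumes model_p_invalid(N) is monotonically decreasing in N.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if model_p_invalid(mid) > target:
            lo = mid  # prediction too high, so a larger N is needed
        else:
            hi = mid
    # Pick whichever of the two remaining candidates fits best.
    return min((lo, hi), key=lambda n: abs(model_p_invalid(n) - target))


def placeholder_model(n_lines):
    """Stand-in for the analytical model (NOT the paper's equations)."""
    return 1.0 / (1.0 + n_lines / 256.0)
```

In practice placeholder_model would be replaced by a call that solves the line state equations for a given N and returns the invalid-state probability; the remaining state probabilities and utilisations then serve as the independent check of accuracy.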
Reasonably good agreement was obtained for the equilibrium line state probabilities, as shown in Table 11. However, the bus utilisations were over-estimated by around 40% in each case. This is almost certainly due to the fact that the MP3D benchmark spends a large proportion of its execution time idle due to barrier synchronisations. A more sophisticated workload model would be required to capture this behaviour, and hence limit the discrepancy between the models. This and other ideas for future work are discussed in Section 5.1.
Summary and Conclusions
We have developed a novel analytical model for a distributed cache coherency protocol running on a realistic shared memory multiprocessor. We have also evaluated the accuracy of the model against an execution-driven simulation. The target architecture is quite sophisticated, for example having decoupled buses to reduce contention, and an objective of the coherency protocol is to reduce load on the cache bus. An immediate goal of our research is to measure, and account for, any major inaccuracies in the analytical model under various real workloads, particularly with reference to the model of the coherency protocol and internal bus contention.
Our experiments with a synthetic reference generator program have shown a remarkably close match between the mathematical model and simulation in almost every case. Predictions for processor utilisation and state transition rates correspond closely to those found by a detailed simulation. Although the random reference generator models closely the assumptions made in the analytical model, the comparison is still important since a number of approximations are used in the analysis. These include the assumptions that line state transitions define a Markov process, implying that state holding times are negative exponential random variables, and that arrival processes are Poisson, as well as the approximate model of simultaneous resource possession. The validation has demonstrated that the analytical model is robust in terms of accuracy.
Both internal and external communication traffic is represented in considerable detail in the analytical model and this has proven to be one of the major successes. The protocol description, which is essentially an abstraction of the simulator code itself, is broken down into a large table comprising counts of the number of instances of various bus transaction classes, together with the number, and type, of messages which need to be sent, for each type of coherency operation. Because of the close match between the line state probabilities from both models, the numbers of such transactions and messages actually observed in the simulation are predicted accurately by the model. Some of the discrepancies are exaggerated by the way the simulator handles network traffic. For example, coherency operations which have to queue at the controller in the simulation are assumed to be pipelined in the analytical model, the delay being absorbed into the message transmission time. The mean queue length here is small, however, and this limits the effect. We predict that the differences will become greater in programs which exhibit high degrees of read contention since this will inherently increase the mean queue length at the network buffer. An extension to the queueing model may be required here, although it could be argued that from the architecture point of view the simulator's treatment of this traffic is far from optimal.
Future Work
The obvious next step is to complete the validation for this model by running it against more realistic parallel programs. The simulator is capable of running programs from the Stanford SPLASH suite [11] and a validation based on selected programs from the suite is currently in progress. Author's note: space permitting, these results could be presented in the final version of this paper.
The analytical model described is deliberately simplistic in nature. The workload model assumes uniform usage of the cache lines, even though for some applications cache usage is heavily "regionalised". Numerous extensions to the workload model are possible in this respect; a regionalised cache model was studied in (reference omitted), for example, and the same could be done here. Also, a more sophisticated model of the memory addressing pattern will almost certainly be required to model some of the more esoteric benchmark programs. The model does currently allow locality of reference to be captured in a single probability. It would also be interesting to examine the effect on performance of a significant proportion of read-only data. This data would have read probability of unity and its own hit/miss ratio. As a result, sharing lists would be much longer since they would only reduce on a displacement.
The queueing model of the node and the representation of the network traffic are, however, very robust. What has proven particularly tedious is the production of the protocol specification table and bus transaction class descriptions. A protocol specification and modelling tool which could automatically produce this information, and indeed even the coherency protocol simulation itself, would be of considerable benefit. It is conceivable that formal correctness proofs of a new protocol may also be feasible using such a tool as the starting point.
Finally, the modelling technique proposed here must be demonstrably useful in other protocols. Of particular interest to the authors are various schemes for reducing contention at the cache controller. Predictive performance models of new protocol proposals such as these promise to yield considerable savings in simulation development and execution time during the early stages of a new design.
