Analytic model of a cache-only memory architecture by López Barrio, Carlos Alberto & Hermenegildo, Manuel V.
Analytic Model of a 
Cache Only Memory Architecture* 
Carlos Carreras (1), Carlos A. López (1) and Manuel Hermenegildo (2)
(1) Departamento de Ingeniería Electrónica, Universidad Politécnica de Madrid,
Ciudad Universitaria s/n, 28040 Madrid, Spain
(2) Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid,
28660 Boadilla del Monte, Madrid, Spain
Abstract. An approximate analytic model of a shared memory multiprocessor
with a Cache Only Memory Architecture (COMA), the bus-based Data Diffusion
Machine (DDM), is presented and validated. It describes the timing and
interference in the system as a function of the hardware, the protocols, the
topology and the workload. Model results have been compared to results from
an independent simulator. The comparison shows good model accuracy,
especially for non-saturated systems, where the errors in response times and
device utilizations are independent of the number of processors and remain
below 10% in 90% of the simulations. Therefore, the model can be used as an
average performance prediction tool that avoids expensive simulations in the
design of systems with many processors.
1 Introduction 
Performance prediction is a key issue in the first stages of the design of a shared 
memory multiprocessor. Approximate analytic models have been proposed to
obtain average performance estimates fast [9, 14, 12]. They allow exploring the 
parameter design space for different workloads. Analytic models provide guid-
ance to adjust the system parameters so that detailed simulations are only re-
quired in a verification phase. The analytic model becomes a compromise be-
tween a detailed description that captures all the relevant features of the system 
and the simplifying approximations that allow fast computation of performance 
estimates. This paper presents an analytic model of a specific shared memory 
multiprocessor, the bus-based Data Diffusion Machine (DDM).
The DDM [13, 4, 5] belongs to a new class of architectures called Cache Only 
Memory Architectures (COMA). These are distributed shared memory systems 
with directory-based coherence where memory modules behave like large second 
level caches. COMA constitute a new approach to design large-scale multipro-
cessors. The bus-based DDM is a COMA with a coherence protocol supported 
by a scalable hierarchy of directories and buses. Other examples of COMA are 
the link-based DDM from Bristol University [10] and the commercial ring-based
KSR1 from Kendall Square Research [1]. 
The analytic model developed here is based on principles similar to those
found in [9]. The response time of a closed system, the DDM, is obtained from
the response times of the open models of its subsystems: the processors, the 
memories and the interconnection network. At the subsystem level, probabilis-
tic models are used. Contention is modeled using simplified queueing systems 
previously validated. The error introduced by these approximations is reduced 
with parameter adjustment from system simulations. 
The model equations are solved through iteration. Convergence occurs quickly,
usually after a few iterations. The solution corresponds to the steady-state
situation where response times and execution rates are balanced. The outputs of
the model describe in detail the overall system's performance as well as internal
traffic rates and delays.
Validation is performed against simulation results available from an independent
simulator developed at the Swedish Institute of Computer Science (SICS).
These results correspond to real applications from the SPLASH suite of benchmarks
[11] executed over different topologies. The comparison shows that the
error introduced by the approximate model is acceptable, especially if the system
is not saturated, which is the desirable condition. In this case, errors are inside
the expected bounds established in [7] (30% for response times and 10% for
utilizations) in more than 90% of the simulations.
The model is extended with approximate scalable models of the main appli-
cation-dependent parameters as a function of the topology and the number of 
processors. Uniform memory references are considered. Even though this is the 
worst case for COMA, modeled parameters show qualitative changes similar to 
those observed in parameters obtained from simulations. 
The paper is structured as follows. The next section introduces the main 
features of the DDM. Section 3 presents the modeling methodology. Section 4 
develops the processor, memory and network models. The algorithm to solve the 
model equations is presented in Sect. 5. Section 6 describes the simulation envi-
ronment and the benchmarks used to validate the model. In Sect. 7, the model 
is validated against existing simulator results. In Sect. 8, approximate models of 
the application-dependent parameters are derived and evaluated. Finally, con-
clusions are summarized in Sect. 9. 
2 The DDM Multiprocessor 
The DDM has been introduced as a COMA. The main contribution of COMA
is the definition of protocols that allow data to replicate and migrate between
memory modules dynamically at run time, as opposed to systems where data
is statically distributed between the memories. Memories behave like large second
level caches, changing their contents according to the needs of their local
processors. This results in an overall traffic reduction if the allocation of tasks to
processors takes advantage of the locality of memory references. Therefore,
COMA provide a new strategy to design large-scale multiprocessors. The cost
of the COMA strategy is the search procedure required to locate data in the
system. A hierarchical COMA performs a hierarchical search, which is fast for
requests to neighbour nodes in the hierarchy, but penalizes requests to distant
nodes with longer search times. The bus-based DDM is a hierarchical COMA
implemented through a hierarchy of buses and directories.
Fig. 1. A basic bus-based DDM (P: Processor/Cache, M: Attraction Memory, D: Directory)
A block diagram of a basic bus-based DDM is represented in Fig. 1. It has
three main subsystems: the processor nodes, which include a first level of caches;
the attraction memories, which are memories that support the DDM protocol;
and the hierarchy of buses and directories. Directories don't store data. They
only store state information used to locate data in the memories below them.
Attraction memories store data and the state information associated with them. In
this situation, read or write requests that can't be serviced by the local attraction
memory are forwarded up the hierarchy. A request reaching a DDM bus
is processed in two pipelined phases: bus transaction and state lookup in the
devices connected to the bus. The state lookup allows locating the requested
datum. A transaction goes up the hierarchy until the search is successful, and
then goes down to the attraction memories below where data is actually stored.
The DDM protocol allows the reply to a search to follow the path back to
the requesting memory module, while keeping consistency between the contents
of memories and directories. It is a write-invalidate protocol.
A prototype of a clustered bus-based DDM is near completion at the Swedish
Institute of Computer Science (SICS). In a clustered DDM, memories provide
local service to a cluster of processors through a fast bus. A local coherence pro-
tocol is required for local consistency in the cluster. The implementation details 
in the prototype are based on the M-bus and the 88000 series from Motorola. In 
this case, the local protocol is of type write-once. It has been adapted to work 
together with the DDM protocol so that all data in caches and memories are 
kept consistent. 
3 Modeling Methodology 
The analytic model of the DDM is obtained from the open models of its subsystems.
This approximation allows modularity in the model and has been used
to evaluate different types of processor and memory nodes. However, only the
types used in the SICS prototype, from which simulator results are available,
are presented here.
Three types of subsystems are considered in the model: the processor nodes,
the memory nodes and the network of directories. A memory node model supports
local and remote operation. It is assumed that all subsystems of the same
type have the same probabilistic behaviour. Each subsystem model expresses its
output transaction rates and response times as a function of its input transaction
rates and internal parameters and queues. The average input rate to the
system is Favg instructions per processor cycle and its average response time
is Tavg processor cycles per instruction. All subsystem input rates are obtained
from Favg, while Tavg depends on all subsystem response times and queue waiting
times. Queues are solved locally for each subsystem using simple queueing
models. Therefore, this approach results in three sets of equations: transaction
rates, queue waiting times and response times. The steady-state condition states
that Favg = 1/Tavg, so these sets are related in a cyclic fashion and the general
equation of the model can be expressed as $T_{avg} = \mathcal{F}\{\mathcal{G}[\mathcal{H}(T_{avg})]\}$.
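The cyclic dependency can be illustrated with a minimal Python sketch. The three functions are toy stand-ins, not the DDM equations: a single shared bus with M/D/1 waiting is assumed, every numeric parameter is hypothetical, and the assignment of H to transaction rates, G to queue waiting times and F to response times is an assumed reading of the three equation sets.

```python
# Toy illustration of Tavg = F{G[H(Tavg)]}. All parameters are hypothetical.
N_PROC = 4            # processors sharing one bus (assumed)
MISSES = 0.05         # bus transactions per instruction (assumed)
SERVICE = 4.0         # bus service time in processor cycles (assumed)
T_MIN = 1.5           # cycles per instruction without misses (assumed)


def rates(t_avg):
    """H: response time -> total transaction rate on the bus."""
    f_avg = 1.0 / t_avg                 # steady state: Favg = 1/Tavg
    return N_PROC * MISSES * f_avg


def waiting(lam):
    """G: transaction rate -> mean M/D/1 waiting time in the bus queue."""
    rho = lam * SERVICE
    if rho >= 1.0:
        raise ValueError("bus utilization must stay below 1")
    return rho * SERVICE / (2.0 * (1.0 - rho))


def response(w):
    """F: waiting time -> average response time per instruction."""
    return T_MIN + MISSES * (w + SERVICE)


def composite(t_avg):
    """One pass around the cycle of the three equation sets."""
    return response(waiting(rates(t_avg)))


if __name__ == "__main__":
    for t in (2.0, 3.0):
        print(t, composite(t))
```

A slower processor (larger Tavg) issues fewer transactions, waits less and therefore maps to a smaller output value, which is the monotonic behaviour that the iterative solution relies on.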
Subsystem models have been developed under a common methodology which 
can be summarized in the following steps: 
1. Classify the input transactions according to the requested operations and 
obtain their rates. 
2. Define the internal parameters describing hardware and application vari-
ables. 
3. Define all possible internal operations in the subsystem according to differences
in operation response times or generated output transactions.
4. Obtain the probability of each operation. 
5. Determine the transaction rates and utilization times involved in internal 
queues and obtain the expressions of the queue waiting times. 
6. Obtain the subsystem's average response times and the output transaction 
rates per class. 
Input transaction class rates are obtained from the output rates of other connected
subsystems (Step 1). Favg is an input class to the processor nodes, with
Tavg being their average response time. The expressions of operation response times
or generated output transactions (Step 3) are determined from the input rates, 
the internal parameters, the subsystem's architecture, the supported protocols 
and other subsystems' response times. The probability of each operation (Step 
4) is obtained assuming that the internal parameters are independent. In gen-
eral, rates and times in the queues (Step 5) are not equal to operation rates and 
times and must be determined. The queues are solved using simplified queueing 
models previously validated against results from a queue simulator. The sub-
system's average response times (Step 6) are computed as weighted averages of 
the internal operation response times involved. Finally, the subsystem's output 
class transaction rates are obtained as weighted additions of the transactions 
generated by internal operations. In both cases, the weights are obtained from 
the probabilities of the operations. 
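Step 6 can be sketched in a few lines of Python. The operation list, its probabilities and the input rate below are hypothetical values chosen only to show the weighting; the two response times are borrowed from the local and level-1 remote read latencies of Table 1.

```python
def subsystem_outputs(operations, input_rate):
    """Weighted averages over the internal operations of one subsystem.

    operations: list of dicts with keys
        'prob'    probability of the operation,
        'time'    response time of the operation,
        'outputs' {output class: transactions generated per operation}.
    Returns (average response time, {output class: transaction rate}).
    """
    total = sum(op["prob"] for op in operations)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("operation probabilities must add up to 1")

    avg_time = sum(op["prob"] * op["time"] for op in operations)

    out_rates = {}
    for op in operations:
        for cls, count in op["outputs"].items():
            rate = input_rate * op["prob"] * count
            out_rates[cls] = out_rates.get(cls, 0.0) + rate
    return avg_time, out_rates


# Hypothetical example: a read is served locally or forwarded to the network.
example_ops = [
    {"prob": 0.9, "time": 7.0, "outputs": {}},
    {"prob": 0.1, "time": 38.0, "outputs": {"read_request": 1}},
]
example_time, example_rates = subsystem_outputs(example_ops, input_rate=0.02)
```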
4 Subsystem Models 
4.1 Processor Node Model
The model of a processor node represents a single-threaded processor with separate
instruction and data caches. The processor remains idle on cache misses.
The data cache supports a write-back local coherence protocol. Contention in
local memory accesses is included in the memory model, so the processor node
model has no internal queues.
Input transaction classes are Favg from the processor's own thread, and data
transactions, write-acknowledgements, invalidation requests and update requests
from the local memory. The internal parameters include the cache line size, the
probability that an instruction is a memory reference ($p_{mem}$), the probability
that a memory reference is a write ($p_{write}$), the instruction and data cache miss
ratios ($m_I$, $m_D$), the probability that a data cache miss requires cache replacement,
and the minimum average response time for the application obtained considering
no cache misses. Operations are defined according to the previous input
classes, probabilities and miss ratios. Output transaction classes are instruction
reads, data reads without replacement, data reads with replacement, data writes
without replacement, data writes with replacement, and local memory updates.
The resulting model expresses the output transaction rates in terms of Favg and
provides an equation of Tavg in terms of the memory response times for the
different output transaction classes.
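One plausible form of these expressions is sketched below in Python. It is an assumption, not the paper's equations: the parameters are treated as independent (as Step 4 of the methodology states), the replacement probability is called `p_repl` here for illustration, the local memory update class is left out because its rate is not specified in the text, and the processor is assumed to stall for the full memory response time on each miss.

```python
def processor_output_rates(f_avg, p_mem, p_write, m_i, m_d, p_repl):
    """Output transaction rates of one processor node (assumed form)."""
    data_misses = f_avg * p_mem * m_d          # data cache misses per cycle
    reads = data_misses * (1.0 - p_write)
    writes = data_misses * p_write
    return {
        "instruction_read": f_avg * m_i,
        "data_read_no_repl": reads * (1.0 - p_repl),
        "data_read_repl": reads * p_repl,
        "data_write_no_repl": writes * (1.0 - p_repl),
        "data_write_repl": writes * p_repl,
    }


def processor_t_avg(t_min, response_times, p_mem, p_write, m_i, m_d, p_repl):
    """Tavg = minimum time plus stall time per instruction (assumed form).

    response_times maps each output class to its memory response time.
    """
    # With f_avg = 1 the rates become transactions per instruction.
    per_instruction = processor_output_rates(
        1.0, p_mem, p_write, m_i, m_d, p_repl)
    stall = sum(per_instruction[c] * response_times[c]
                for c in per_instruction)
    return t_min + stall


# Hypothetical parameter values, for illustration only.
demo_rates = processor_output_rates(
    f_avg=0.5, p_mem=0.3, p_write=0.25, m_i=0.01, m_d=0.05, p_repl=0.1)
```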
4.2 Cluster Memory Node Model
This model includes an attraction memory and the bus that connects it to the
processors of the cluster. The attraction memory is assumed to be large enough
to satisfy all private requests from local processors, so that only shared data
travel to and from the remote memories through the network of buses. It is
also assumed that remote memory references are uniformly distributed among
the attraction memories. The cluster memory node is modeled according to the
SICS implementation around the M-bus. A block diagram is shown in Fig. 2.
Accesses to the data memory are part of the read and write M-bus transactions,
while the state memory sits on the M-bus like another requesting device while
interfacing to the network of directories. The interrupt-retry mechanism of the
write-once M-bus protocol allows the devices on the bus to interrupt the current
transaction, which is retried later. This is used by the state memory to maintain
remote consistency and by the local caches for internal cluster coherency.
Input transaction classes from below are the processors' output classes. Input
classes from the network are read requests, invalidate requests and replace
requests from remote memories, and replies to previous read and invalidate requests
from the attraction memory itself (data transactions from remote memories
and invalidate-acknowledgements from upper level directories). Replacement
transactions relocate a replaced memory line in a different attraction memory.
[Block diagram: DIRECTORY NETWORK NODE above; CLUSTER MEMORY NODE containing STATE MEMORY and DATA MEMORY on the M-bus; PROCESSOR NODEs attached to the M-bus below]
Fig. 2. Organization of a cluster memory node 
Internal parameters include the number of processors per cluster ($N_0$), the memory
line size, independent memory miss ratios for reads and writes ($m_{r,0}$, $m_{e,0}$),
the probability that a memory miss requires replacement in the memory, the
probabilities that local or remote transactions in the M-bus cause a memory update,
the probability that a request requires arbitration to access the M-bus, and
detailed hardware parameters that specify the internal timing of M-bus accesses.
Again, operations are defined according to these probabilities and miss ratios.
Output transaction classes to the processors have been listed in the previous
section, while output classes to the network are read requests, invalidate requests
and replacement transactions, and replies to remote read requests previously
received (data transactions to remote memories).
While the dual-port state memory can be accessed without interference from
above and below, there is contention to access the M-bus. In this situation,
the M-bus is modeled as a queueing system with multiple classes of independent
Poisson arrivals and deterministic service times. The round-robin M-bus scheduling
policy selects a new device when the currently serviced device has no more
pending requests. Arbitration only occurs when a new device is selected and the
arbitration time is assigned to its first pending request. This policy has no exact
queueing model, so approximate models have been used. In particular, a model
that approximates such a policy by an ordered sequence of ($N_0 + 1$) fixed priority
schemes, one per device on the M-bus, has been developed. This model, called
RR here, expresses the average waiting time in an M-bus queue, $W_q$, as
$$W_q = \sum_{k=1}^{N_0+1} \frac{\lambda_k}{\lambda} \sum_{m=1}^{N_0+1} \frac{\rho_m}{\rho}\, W_{km} \tag{1}$$
where $\lambda_k$ and $\lambda$ are the total arrival rates to queue k and to the M-bus respectively,
$\rho_m$ and $\rho$ are the M-bus utilization by requests from queue m and the
total M-bus utilization, and $W_{km}$ is the waiting time in queue k when queue m
is serviced, that is, when queue m has maximum priority. Assuming that queue
(m - 1) follows queue m and queue ($N_0 + 1$) follows queue 1 in the ordering of
queues, the value of $W_{km}$ is obtained from the adapted fixed priority equations
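$$W_{km} = \frac{W_0}{\left(1 - \sum_{i=k}^{m} \rho_i\right)\left(1 - \sum_{i=k+1}^{m} \rho_i\right)} \qquad (1 \le k \le m \le N_0 + 1)$$
$$W_{km} = \frac{W_0}{\left(1 - \rho + \sum_{i=m+1}^{k-1} \rho_i\right)\left(1 - \rho + \sum_{i=m+1}^{k} \rho_i\right)} \qquad (1 \le m < k \le N_0 + 1) \tag{2}$$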
where $W_0$ is the mean residual life of a service time in the M-bus, which can
be computed from arrival rates and service times [6]. Besides the RR model, an
approximate model based on FCFS scheduling has also been used. Simulations
for different classes, queues and M-bus utilizations show that the RR and FCFS
results are always inside the 99% confidence interval around the mean simulated
values, that is, the interval containing 99% of simulated waiting times.
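A minimal Python sketch of the RR approximation follows. It assumes the standard non-preemptive fixed priority waiting time, applied to the cyclic ordering described above, and computes the mean residual life for deterministic services as half the sum of arrival rate times squared service time. The function name, the argument layout and the 0-based indexing are choices made for this illustration.

```python
def rr_waiting_time(arrival_rates, service_times):
    """Average waiting time in a bus queue under the RR approximation.

    arrival_rates[k] and service_times[k] describe queue k (0-based).
    Priority order when queue m is being serviced: m, m-1, ..., 0,
    then n-1, n-2, ..., m+1 (queue m-1 follows queue m cyclically).
    """
    n = len(arrival_rates)
    rho = [lam * s for lam, s in zip(arrival_rates, service_times)]
    lam_total = sum(arrival_rates)
    rho_total = sum(rho)
    if rho_total == 0.0:
        return 0.0
    if rho_total >= 1.0:
        raise ValueError("bus utilization must stay below 1")

    # Mean residual life of a deterministic service time.
    w0 = 0.5 * sum(lam * s * s
                   for lam, s in zip(arrival_rates, service_times))

    def w_km(k, m):
        """Waiting time in queue k when queue m has maximum priority."""
        if k <= m:
            # Queues k+1..m are ahead of queue k.
            higher = sum(rho[k + 1:m + 1])
        else:
            # Every queue except m+1..k is ahead of queue k.
            higher = rho_total - sum(rho[m + 1:k + 1])
        return w0 / ((1.0 - higher) * (1.0 - higher - rho[k]))

    return sum(
        arrival_rates[k] / lam_total
        * sum(rho[m] / rho_total * w_km(k, m) for m in range(n))
        for k in range(n)
    )
```

With a single queue the expression reduces to the M/D/1 waiting time, which gives a quick sanity check of the implementation.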
4.3 Directory Network Model
This model is based on the directory node which includes a directory that stores 
states and a DDM bus. The internal structure of a directory node is similar to 
that in Fig. 2 but without a data memory. The level-1 directories interface with 
the memory nodes. Otherwise, directories connect to other directories above 
and below as in Fig. 1. Directory accesses and arbitration are pipelined with the 
transactions in the DDM bus. 
Input and output classes to and from the network were described in the
previous section. Internal parameters include the number of levels in the hierarchy
(L), the branching factor at each level ($N_i$), the DDM bus cycle time at
each level, the number of additional bus cycles required by transactions that
carry data, independent directory miss ratios for read and invalidate requests
($m_{r,i}$, $m_{e,i}$), and the average number of memories with a copy of a datum at the
time the datum is written (K). With respect to replace requests from a memory,
it is assumed that they are always serviced by another memory connected to the
same level-1 DDM bus, so they are not transferred up the network. It is considered
that the branching factor $N_i$ is constant at each level. Therefore, only
balanced topologies are modeled. Operations in the network are defined from
the input classes, the number of levels and the read and invalidate miss ratios
at each level.
Again, interference in each bus is determined considering Poisson arrivals
and deterministic service times. The round-robin DDM bus scheduling policy
selects a new device after each transaction, so it is different from the M-bus
policy. However, simulation results show that the RR and FCFS approximations
can also be used since they are always inside the 99% confidence interval. The
traffic in each queue is computed from the protocol definition, the directory miss
ratios, the branching factors, and, in the case of child invalidate transactions
descending from an invalidate request going up the hierarchy, from the
average number of copies between writes, K.
5 The Algorithm 
The algorithm to solve the model for steady-state conditions assumes that an
interval [a, b] is known to contain the model's solution for Tavg. In this case,
iterative binary division of the interval is a fast method to find such a solution
if convergence conditions are met. Binary division implies introducing the midpoint
value of the interval, x, in the model equations. If $\mathcal{F}\{\mathcal{G}[\mathcal{H}(x)]\} < x$, the
subinterval [a, x] includes the solution and is selected for the next iteration
(the composed function decreases with x, so a result below x places the solution
below x). Otherwise, the subinterval [x, b] is used. The iterative process is
repeated until the interval is small enough to ensure that the error is less than
$\epsilon$. The algorithm is fast since only a few tens of iterations are required to get
an approximate solution with $\epsilon$ = 0.01 processor cycles. The DDM model meets the
convergence conditions [9] since increasing response times cause decreasing
execution rates, increasing transaction rates cause increasing queue waiting times
and $\mathcal{F}$, $\mathcal{G}$ and $\mathcal{H}$ are positive continuous functions, at least in a domain
around the solution.
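The binary division can be sketched in Python as follows. The decreasing function `toy_model` is a hypothetical stand-in for the composed model equations, and its constants are arbitrary. Because the composed function decreases with x (slower processors generate less traffic and wait less), a midpoint whose image falls below it lies above the fixed point, and a midpoint whose image lies above it lies below the fixed point.

```python
def solve_t_avg(model, a, b, eps=0.01):
    """Bisection for the fixed point model(x) = x of a decreasing model.

    [a, b] must contain the solution. Returns (estimate, iterations).
    """
    iterations = 0
    while b - a > eps:
        x = 0.5 * (a + b)
        if model(x) < x:
            b = x          # x is above the solution: keep [a, x]
        else:
            a = x          # x is at or below the solution: keep [x, b]
        iterations += 1
    return 0.5 * (a + b), iterations


def toy_model(t_avg):
    """Hypothetical decreasing stand-in for the composed model equations."""
    return 1.5 + 2.0 / t_avg


# Lower bound chosen by hand; upper bound derived from it as b = model(a).
lower = 1.5
upper = toy_model(lower)
t_star, n_iter = solve_t_avg(toy_model, lower, upper)
```

For this toy function the exact fixed point is the positive root of x^2 - 1.5x - 2 = 0, so the estimate can be checked directly.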
Computing $\mathcal{F}\{\mathcal{G}[\mathcal{H}(x)]\}$ in the iterative process is easy as long as the three
sets of model equations maintain an ordered dependency. In the DDM, this
ordering is broken by the interrupt-retry mechanism of the M-bus protocol: requests
in the M-bus which are forwarded to remote memories are repeatedly
issued and interrupted in the M-bus until the reply arrives from the network.
Therefore, there is an internal loop of dependencies between transaction rates in
the M-bus and network response times. This new internal cyclic set of equations
is again solved through iteration. Therefore, the actual algorithm for the DDM
model consists of two nested iteration loops.
Convergence requires that the value a in [a, b] is a feasible value of Tavg.
Otherwise, bus utilizations may go above 1 and the method does not converge.
In this situation, a is determined as the minimum Tavg (maximum Favg) that
keeps all bus utilizations below 1 in the model, and b is the (maximum) Tavg
derived from a, $b = \mathcal{F}\{\mathcal{G}[\mathcal{H}(a)]\}$. The value of a is obtained again by binary
division of any interval [$T_1$, $T_2$] where $T_1$ forces at least one bus utilization to be
greater than 1 and $T_2$ causes all bus utilizations to be less than 1.
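The search for the lower bound a can be sketched in the same way. The utilization function below is a hypothetical single-bus stand-in in which the bus is busy a fixed number of cycles per instruction; in the model, the largest utilization over all buses would take its place.

```python
def find_feasible_lower_bound(max_utilization, t_saturated, t_feasible,
                              eps=0.01):
    """Smallest Tavg (within eps) that keeps every bus utilization below 1.

    max_utilization(t) returns the largest bus utilization for Tavg = t.
    t_saturated must give a utilization of at least 1 and t_feasible a
    utilization below 1.
    """
    if max_utilization(t_saturated) < 1.0:
        raise ValueError("t_saturated does not saturate any bus")
    if max_utilization(t_feasible) >= 1.0:
        raise ValueError("t_feasible saturates a bus")

    lo, hi = t_saturated, t_feasible
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if max_utilization(mid) < 1.0:
            hi = mid       # still feasible: try a faster processor
        else:
            lo = mid       # saturated: back off
    return hi              # hi is always a feasible value


BUS_CYCLES_PER_INSTRUCTION = 0.8   # hypothetical bus demand


def toy_max_utilization(t_avg):
    """Hypothetical utilization: bus busy cycles divided by Tavg."""
    return BUS_CYCLES_PER_INSTRUCTION / t_avg


a_bound = find_feasible_lower_bound(toy_max_utilization, 0.1, 10.0)
```

The returned value is the feasible end of the final interval, so it can be used directly as a, with b obtained by evaluating the model at a.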
6 Simulation Environment and Benchmarks 
The results from a simulator of the DDM developed at SICS [3] have been used
to verify the analytic model. It is an execution-driven simulator [8] that provides
system and device performance statistics as well as counts of internal events.
Systems with one or two DDM levels and clusters with up to four processors
running at 20 MHz have been simulated for each benchmark. Instruction and
data caches of 16 Kbytes and cluster memories of 32 Mbytes are used. The
memory line size is set equal to the cache line size, which is four words. The DDM
bus cycle times are set to four processor cycles in all levels of the hierarchy, and
all transactions in DDM buses are processed in one bus cycle. The simulator also
includes secondary memory from which the benchmarks are initially loaded. It
is accessed through an M-bus like an attraction memory but from the upper level
DDM bus controller (root). Accesses to this unit mostly occur during startup
with rates that are negligible when the whole simulation time is considered.
Three benchmarks from the SPLASH suite representing real scientific parallel
applications, WATER, MP3D and CHOLESKY, have been simulated. They are
described in detail in [11]. In addition, simulation results for a matrix multiplication
program, MATRIX, are available. WATER computes forces and potentials
in a dynamic system of liquid water molecules. Input systems with 192 and 384
molecules and working sets of 320 and 640 Kbytes have been used. MP3D simulates
very low density fluid flow around an object using a discrete statistical
model of the medium. Most available results correspond to a 14 x 24 x 7 cell
space with 75,000 molecules for a working space of 4 Mbytes. Other simulations
consider 40,000 and 80,000 molecules. CHOLESKY is a sparse matrix factorization
parallel program. Results for matrices bcsstk14 and bcsstk15 (see [11]),
which occupy 420 and 800 Kbytes unfactored and 1.4 and 7.7 Mbytes factored
respectively, are available. In the following section, they are treated like individual
benchmarks named CHOLESKY14 and CHOLESKY15. Finally, MATRIX is
a program that multiplies two matrices using a blocking algorithm. The matrix
size is set to (500 x 500) elements for a working space of 3 Mbytes.
7 Model Validation 
Parameters of the analytic model which depend on the hardware or the topology 
are obtained from the simulator inputs. The resulting latencies in cycles for basic 
memory read and write operations (local and remote without contention and 
arbitration) are summarized in Table 1. 
Application-dependent parameters of the model are extracted from the internal
counts of the simulator. Then, the performance results from the model
and the simulator are compared. Simulated performance is available in terms of
average instruction execution time per processor and bus utilizations. The iterative
nature of the algorithm, which starts from an arbitrary interval, ensures
that the model results are actually obtained from the model and not directly
from the input parameters.
Table 1. Latencies for memory references (processor cycles)

Reference      Local      Remote       Remote
(cache miss)   (M-bus)    (level 1)    (level i)
read           7          38           (38 + 16i)
write          10         32           (32 + 8i)
Analytic model and simulator have been developed independently following 
different approaches. In fact, a few input parameters of the model must be ob-
tained through approximate computations. Besides, a secondary memory module 
had to be added to the model. However, results are quite acceptable. Table 2 
contains, for each application and configuration for which simulator results are 
available, the errors obtained from the model. In some cases, different application 
input parameters have been used to simulate the same system.
The notation for each configuration has the form [$N_0 \times N_1 \times N_2 \times \cdots$], where
$N_0$ is the number of processors per cluster and $N_i$ is the branching factor at level
i in the hierarchy. The minimum system with only one processor is [1 x 1]. Tavg
describes the system's response time, $U_{cluster}$ represents the M-bus utilization,
and $U_{bus,1}$ and $U_{bus,2}$ refer to the utilizations of level-1 and level-2 DDM buses.
Errors are differences between simulated and modeled values expressed as percentages
of the simulated values. The numbers in parentheses relate to absolute
bus utilizations below 0.01. They are not considered in the comparison since they
represent negligible absolute errors. This is also the case for bus utilizations in
the secondary memory module, which are not included in the table.
If total numbers from Table 2 are analyzed, the errors in system response
times (Tavg) for most configurations (97%) are well within the 30% margin defined
in [7] as acceptable. Errors in device utilizations are acceptable if they
are within a 10% margin around the simulated values. This is verified by most
$U_{cluster}$ and $U_{bus,2}$ results (89% and 93% respectively), but only by some $U_{bus,1}$
values (35%). The low percentage obtained for $U_{bus,1}$ is due to the saturation of
the M-bus ($U_{cluster}$ > 0.8) for MP3D and CHOLESKY15. Small errors in traffic
rates in the M-bus cause significant errors in M-bus waiting times and, therefore,
in traffic rates reaching the level-1 DDM bus. Once it is concluded that
the model degrades in saturation, it can be observed that errors in $U_{bus,1}$ for
the non-saturating applications are below or around 10% in most cases (93%).
Finally, it can be observed that errors don't show any clear dependency on the
number of processors or levels in the system.
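The totals for Tavg, the M-bus and the level-2 DDM bus can be recomputed from Table 2 with a short Python sketch. The three lists are transcribed from the corresponding table columns in row order; the helper name `share_within` is chosen here for illustration.

```python
# Percentage errors transcribed from Table 2, in row order.
T_AVG = [
    -0.23, 3.26, 2.05, -0.32, 2.76, 4.55,                          # WATER
    -0.02, 5.23, 21.89, 23.17, 27.75, 1.81, 2.87, 9.26, 4.11,      # MP3D
    0.44, 6.02, 5.39, 2.70, -1.24, 2.11, -1.88, -1.01, 3.40,       # CHOL-14
    0.05, 19.36, 33.57, 23.31, -0.94, 8.67, 9.40,                  # CHOL-15
    -0.21, 4.13, 2.92, 1.51,                                       # MATRIX
]
U_CLUSTER = [
    0.92, 0.28, 3.13, 2.06, 0.18, 0.27,
    -0.66, -14.20, -5.23, -1.15, 6.64, 2.83, 1.85, -3.40, -7.17,
    -0.82, 13.33, 9.99, 8.44, -7.06, -5.38, 4.74, 7.47, 2.07,
    -0.85, -11.20, -1.73, 3.24, 0.07, 7.20, 5.24,
    -10.40, -2.58, -4.65, -2.53,
]
U_BUS_2 = [
    5.39, 0.70, -2.03, -22.03, -1.83, -3.24, -8.02, -9.82,
    5.64, 2.26, 0.79, -7.08, -6.50, -4.84,
]


def share_within(errors, margin):
    """Percentage of entries whose absolute error is within the margin."""
    inside = sum(1 for e in errors if abs(e) <= margin)
    return 100.0 * inside / len(errors)


if __name__ == "__main__":
    print(round(share_within(T_AVG, 30.0)))       # response times, 30% margin
    print(round(share_within(U_CLUSTER, 10.0)))   # M-bus, 10% margin
    print(round(share_within(U_BUS_2, 10.0)))     # level-2 bus, 10% margin
```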
8 Application-dependent Parameters 
So far, the application-dependent parameters of the model have been obtained
from simulation. In order to model large systems, models of these parameters as
a function of the topology are required to scale the values obtained from small
system simulations.
Table 2. Difference between analytic and simulated values (%)

Applic.   Config.        Tavg     Ucluster   Ubus,1      Ubus,2
WATER     [1 x 1]        -0.23    0.92       (23.44)
WATER     [4 x 1]        3.26     0.28       (25.53)
WATER     [4 x 8]        2.05     3.13       -1.37
WATER     [4 x 8 x 2]    -0.32    2.06       -8.37       5.39
WATER     [2 x 8 x 2]    2.76     0.18       -10.36      0.70
WATER     [1 x 8 x 2]    4.55     0.27       -11.63      -2.03
MP3D      [1 x 1]        -0.02    -0.66      (0.00)
MP3D      [4 x 1]        5.23     -14.20     (-25.00)
MP3D      [4 x 2]        21.89    -5.23      -18.94
MP3D      [4 x 4]        23.17    -1.15      -20.01
MP3D      [4 x 4 x 2]    27.75    6.64       -29.72      -22.03
MP3D      [2 x 8 x 2]    1.81     2.83       -10.08      -1.83
MP3D      [2 x 8 x 2]    2.87     1.85       -10.69      -3.24
MP3D      [2 x 8 x 2]    9.26     -3.40      -15.45      -8.02
MP3D      [1 x 8 x 2]    4.11     -7.17      -17.38      -9.82
CHOL-14   [1 x 1]        0.44     -0.82      (40.48)
CHOL-14   [2 x 4]        6.02     13.33      -5.39
CHOL-14   [2 x 8]        5.39     9.99       -4.66
CHOL-14   [2 x 8]        2.70     8.44       -4.34
CHOL-14   [1 x 32]       -1.24    -7.06      -2.47
CHOL-14   [1 x 32]       2.11     -5.38      -2.60
CHOL-14   [2 x 8 x 2]    -1.88    4.74       -8.98       5.64
CHOL-14   [2 x 8 x 2]    -1.01    7.47       -8.60       2.26
CHOL-14   [1 x 8 x 2]    3.40     2.07       -13.69      0.79
CHOL-15   [1 x 1]        0.05     -0.85      (20.00)
CHOL-15   [4 x 1]        19.36    -11.20     (23.50)
CHOL-15   [4 x 4]        33.57    -1.73      -25.13
CHOL-15   [4 x 8]        23.31    3.24       -18.85
CHOL-15   [2 x 8 x 2]    -0.94    0.07       -24.84      -7.08
CHOL-15   [2 x 8 x 2]    8.67     7.20       -16.39      -6.50
CHOL-15   [2 x 8 x 2]    9.40     5.24       -17.41      -4.84
MATRIX    [1 x 1]        -0.21    -10.40     (-100.00)
MATRIX    [4 x 1]        4.13     -2.58      (12.50)
MATRIX    [4 x 4]        2.92     -4.65      -2.64
MATRIX    [4 x 8]        1.51     -2.53      -1.90
The three main application parameters that directly depend on the topology
are the memory miss ratios ($m_{r,0}$, $m_{e,0}$), the directory miss ratios ($m_{r,i}$, $m_{e,i}$),
and the average number of copies of a shared datum between invalidates (K).
Their models are based on parameters which are assumed to remain constant for
all topologies, including two new parameters: the probability that a request to
the local memory references a shared datum ($p_{shared}$) and the probability that
a write request to the local memory references a shared datum ($p_{write,shared}$).
The model of K is developed first. Startup effects are not considered and
read-only shared data is assumed to be loaded in the local memories. In this
situation, when a shared datum is written, (K - 1) copies in memories are
invalidated and one is updated. Therefore, (K - 1) can be expressed as the rate
ratio between read requests and invalidate requests issued to the network. If
$m_{r,shared}$ and $m_{e,shared}$ are the local memory read and write miss ratios defined
when only shared data references are considered:
$$K = 1 + \frac{(m_I + p_{mem} m_D)\, p_{shared}\, m_{r,shared}}{p_{mem}\, p_{write}\, m_D\, p_{write,shared}\, m_{e,shared}} \tag{3}$$
It is known that at least (K - 1) processors read the shared datum between
writes and, if N is the total number of cluster memories, at least [$N_0(N - K)$]
processors don't read it. This proportion can be extended to approximate
the behaviour of the remaining processors for a total of P processors. In this
situation, $m_{r,shared}$ can be approximated by the probability that a read request
reaching the local memory is the first one in the cluster requesting the shared
datum since the last time it was written, multiplied by the probability that
this cluster did not issue the last write. With respect to $m_{e,shared}$, it can be
approximated by the ratio between the number of processors outside one cluster
and the total number of processors in the system except for the one requesting
the write:
$$m_{r,shared} = \frac{N_0 (N - K) + (K - 1)}{P - 1} \left(1 - \frac{1}{N}\right) \qquad m_{e,shared} = \frac{P - N_0}{P - 1} \tag{4}$$
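Substituting (4) in (3) and solving for K:
$$K = \frac{1 + A (P - 1)}{1 + A (N_0 - 1)}, \qquad A = \frac{(m_I + p_{mem} m_D)\, p_{shared} \left(1 - \frac{1}{N}\right)}{p_{mem}\, p_{write}\, m_D\, p_{write,shared}\, (P - N_0)} \tag{5}$$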
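The intermediate algebra can be sketched as follows, assuming that every cluster holds $N_0$ processors so that $P = N_0 N$, and writing A for the K-independent factor that multiplies the bracket once the two miss ratios are substituted. This grouping is a reading of the result, not a quotation of the original derivation.

```latex
\begin{aligned}
K - 1 &= A\,\bigl[\,N_0 (N - K) + (K - 1)\,\bigr] \\
      &= A\,\bigl[\,(P - 1) - K (N_0 - 1)\,\bigr] \qquad (P = N_0 N) \\
K\,\bigl[\,1 + A (N_0 - 1)\,\bigr] &= 1 + A (P - 1) \\
K &= \frac{1 + A (P - 1)}{1 + A (N_0 - 1)}
\end{aligned}
```

The factor $(P - 1)$ appears in the denominators of both miss ratios and cancels in their quotient, which is why it does not show up in A.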
The models of memory miss ratios depend on K through $m_{r,shared}$ and
$m_{e,shared}$:
$$m_{r,0} = p_{shared}\, m_{r,shared} \qquad m_{e,0} = p_{write,shared}\, m_{e,shared} \tag{6}$$
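The chain from the application parameters to K and the memory miss ratios can be sketched in Python. The sketch assumes $P = N_0 N$, takes the read request rate per instruction as $(m_I + p_{mem} m_D) p_{shared}$ and the write request rate as $p_{mem} p_{write} m_D p_{write,shared}$, and uses hypothetical parameter values; the function and key names are chosen for this illustration.

```python
def copies_and_miss_ratios(n0, n, m_i, m_d, p_mem, p_write,
                           p_shared, p_write_shared):
    """K and the memory miss ratios for n clusters of n0 processors."""
    if n < 2:
        raise ValueError("at least two cluster memories are required")
    p = n0 * n                                   # total processors (assumed)

    # Ratio of shared read requests to shared write requests.
    c = ((m_i + p_mem * m_d) * p_shared
         / (p_mem * p_write * m_d * p_write_shared))
    a = c * (1.0 - 1.0 / n) / (p - n0)           # K-independent factor A

    k = (1.0 + a * (p - 1)) / (1.0 + a * (n0 - 1))
    m_r_shared = (n0 * (n - k) + (k - 1.0)) / (p - 1) * (1.0 - 1.0 / n)
    m_e_shared = (p - n0) / (p - 1)
    return {
        "K": k,
        "m_r_shared": m_r_shared,
        "m_e_shared": m_e_shared,
        "m_r0": p_shared * m_r_shared,
        "m_e0": p_write_shared * m_e_shared,
    }


# Hypothetical parameter values, for illustration only.
demo = copies_and_miss_ratios(n0=2, n=8, m_i=0.01, m_d=0.05, p_mem=0.3,
                              p_write=0.3, p_shared=0.2, p_write_shared=0.4)
```

A useful self-check is that the returned K satisfies the rate-ratio definition again when the returned miss ratios are substituted back.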
Directory miss ratios are modeled using a combinatorial approximation based
on the uniform distribution of memory references. It is considered that the average
number of copies available for a read request reaching the network is K/2,
while the average number of copies to erase by an invalidate request is (K - 1).
Considering a generic directory at level i, $m_{r,i}$ expresses the probability that
none of the K/2 copies are in the $N_i N_{i-1} \cdots N_1$ cluster memories below it. Similarly,
$m_{e,i}$ is the probability that not all (K - 1) copies are in the memories
below:
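$$m_{r,i} = \frac{\binom{N - N_i N_{i-1} \cdots N_1}{\mathrm{int}(K/2)}}{\binom{N - 1}{\mathrm{int}(K/2)}}, \qquad m_{e,i} = 1 - \frac{\binom{N_i N_{i-1} \cdots N_1 - 1}{\mathrm{int}(K - 1)}}{\binom{N - 1}{\mathrm{int}(K - 1)}} \qquad (7)$$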
Since $K$ is not an integer, the previous combinatorial expressions round 
$K/2$ and $(K - 1)$ to their closest integer values. It should be mentioned that 
this can lead to non-convergence when solving the DDM model if the error 
bounds that stop the iterative process are very tight. 
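The combinatorial expressions in (7) are easy to evaluate numerically. The sketch below is one possible reading of them, not the authors' implementation: the function names are hypothetical, ties in the rounding are resolved upwards (the text does not state a tie rule), and a configuration written as [N_0 x N_1 x N_2] in Table 3 is assumed to have N = N_1 N_2 cluster memories with N_1 of them below each level-1 directory.

```python
from math import comb, prod


def closest_int(x: float) -> int:
    """Closest integer to a non-negative x; ties go up (assumed rule)."""
    return int(x + 0.5)


def directory_miss_ratios(K: float, N: int, fanouts_below):
    """Equation (7) for a directory with N_i ... N_1 = prod(fanouts_below)
    cluster memories below it, out of N cluster memories in total."""
    below = prod(fanouts_below)
    k_read = closest_int(K / 2)    # copies available to a read request
    k_erase = closest_int(K - 1)   # copies erased by an invalidate request
    # Read miss: none of the k_read copies lies below this directory.
    m_r = comb(N - below, k_read) / comb(N - 1, k_read)
    # Invalidate miss: not all of the k_erase copies lie below this directory.
    m_e = 1.0 - comb(below - 1, k_erase) / comb(N - 1, k_erase)
    return m_r, m_e


# Level-1 directory of a [4 x 8 x 2] system (N = 16, N_1 = 8) with K = 2.49.
print(directory_miss_ratios(2.49, 16, [8]))
# Level-1 directory of a [4 x 4 x 2] system (N = 8, N_1 = 4) with K = 2.07.
print(directory_miss_ratios(2.07, 8, [4]))
```

Under these assumptions the two calls give 8/15 and 4/7 for both ratios, which agrees with the analytic values 0.53 and 0.57 listed for those configurations in Table 3. The rounding also makes the ratios discontinuous in $K$: in the first configuration the invalidate miss ratio jumps from 8/15 to 0.8 as $K$ crosses 2.5, which is one plausible way for a tightly bounded iteration to oscillate instead of converging.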
It is interesting to compare the results of these models with the values 
obtained from the simulator, even though the simulations do not meet the model 
assumptions. The results are in Table 3, where superscript (s) identifies 
simulated values and superscript (a) refers to results from the analytic models. 
They show that, in most cases, the analytic values for a given benchmark 
describe the simulated behaviour, at least qualitatively. The values of 
$m_{r,i}$ and $m_{e,i}$ are not as significant since the models are intended for 
larger configurations. 
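The comparison can also be made quantitative for individual parameters. As a small sketch, the following Python fragment computes the relative error of the analytic $K$ with respect to the simulated one for the four WATER rows of Table 3. The helper name is hypothetical, and taking the simulated value as the reference is a choice made here, not one stated in the text.

```python
# (analytic, simulated) values of K for the four WATER rows of Table 3.
water_k = [(2.17, 2.13), (2.49, 2.22), (2.36, 2.19), (2.30, 2.14)]


def relative_errors(pairs):
    """Relative error of each analytic value with respect to the simulated one."""
    return [abs(a - s) / s for a, s in pairs]


errors = relative_errors(water_k)
mean_error = sum(errors) / len(errors)
print([f"{e:.1%}" for e in errors], f"mean {mean_error:.1%}")
```

The same helper can be applied to any other pair of (a) and (s) columns of the table.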
Table 3. Comparison between analytic and simulated parameters 
(a): analytic model; (s): simulator. A dash marks entries with no value reported (two-level configurations). 

                            K        m_{r,0}     m_{e,0}     m_{r,1}     m_{e,1}
Applic.  Config.        (a)   (s)   (a)   (s)   (a)   (s)   (a)   (s)   (a)   (s)
WATER    [4 x 8]       2.17  2.13  0.25  0.27  0.72  0.69     -     -     -     -
WATER    [4 x 8 x 2]   2.49  2.22  0.30  0.31  0.76  0.73  0.53  0.36  0.53  0.49
WATER    [2 x 8 x 2]   2.36  2.19  0.32  0.35  0.77  0.79  0.53  0.35  0.53  0.46
WATER    [1 x 8 x 2]   2.30  2.14  0.34  0.43  0.80  0.95  0.53  0.34  0.53  0.43
MP3D     [4 x 2]       1.43  2.01  0.07  0.16  0.34  0.35     -     -     -     -
MP3D     [4 x 4]       1.80  2.00  0.18  0.24  0.48  0.52     -     -     -     -
MP3D     [4 x 4 x 2]   2.07  2.04  0.27  0.28  0.54  0.58  0.57  0.55  0.57  0.58
MP3D     [2 x 8 x 2]   2.24  2.05  0.34  0.31  0.58  0.62  0.53  0.50  0.53  0.54
MP3D     [2 x 8 x 2]   2.25  2.06  0.34  0.32  0.58  0.61  0.53  0.48  0.53  0.54
MP3D     [2 x 8 x 2]   2.26  2.06  0.34  0.32  0.58  0.63  0.53  0.49  0.53  0.53
MP3D     [1 x 8 x 2]   2.24  2.03  0.35  0.31  0.60  0.63  0.53  0.51  0.53  0.54
CHOL-14  [2 x 4]       2.00  2.04  0.28  0.31  0.77  0.81     -     -     -     -
CHOL-14  [2 x 8]       2.17  2.07  0.42  0.44  0.84  0.93     -     -     -     -
CHOL-14  [2 x 8]       2.23  1.93  0.42  0.47  0.84  0.99     -     -     -     -
CHOL-14  [1 x 32]      2.42  1.91  0.56  0.56  0.90  0.99     -     -     -     -
CHOL-14  [1 x 32]      2.42  2.02  0.56  0.55  0.90  0.99     -     -     -     -
CHOL-14  [2 x 8 x 2]   2.31  1.94  0.50  0.53  0.87  0.99  0.53  0.56  0.53  0.62
CHOL-14  [2 x 8 x 2]   2.31  2.05  0.50  0.52  0.87  0.99  0.53  0.48  0.53  0.53
CHOL-14  [1 x 8 x 2]   2.35  1.89  0.53  0.51  0.90  0.99  0.53  0.56  0.53  0.60
CHOL-15  [4 x 4]       1.85  2.05  0.09  0.10  0.72  0.65     -     -     -     -
CHOL-15  [4 x 8]       2.10  2.07  0.13  0.12  0.81  0.79     -     -     -     -
CHOL-15  [2 x 8 x 2]   2.91  2.24  0.16  0.13  0.87  0.99  0.53  0.47  0.53  0.49
CHOL-15  [2 x 8 x 2]   2.28  2.03  0.17  0.14  0.87  0.85  0.53  0.52  0.53  0.53
CHOL-15  [2 x 8 x 2]   2.28  2.05  0.17  0.12  0.87  0.78  0.53  0.45  0.53  0.46
MATRIX   [4 x 4]       2.85  2.00  0.01  0.01  0.32  0.12     -     -     -     -
MATRIX   [4 x 8]       2.94  2.04  0.02  0.03  0.36  0.54     -     -     -     -

9 Conclusions 
Cache Only Memory Architectures have appeared as a new alternative to build 
large-scale shared memory multiprocessors. They have a memory coherence 
protocol that allows migration and replication of data. In this paper, an 
approximate analytic model of a COMA, the bus-based Data Diffusion Machine, has 
been presented. The DDM protocol is supported by a hierarchical network of 
directories. The purpose of this model is to predict the average behaviour of 
systems with many processors. 
The DDM model is based on the open models of its subsystems. A methodology 
to develop the subsystem models has been presented and applied to obtain 
the processor, memory and directory network models. Computation time to solve 
the complete DDM model is low (less than 1 second in all modeled 
configurations). The model results have been compared to results from an 
independent simulator for some real benchmarks from the SPLASH suite. The 
comparison shows that the model is especially accurate if the systems are not 
saturated. In this case, the error in response times and device utilizations is 
below 10% in around 90% of the simulated systems. 
The application-dependent parameters of the model must be described in 
terms of the system's topology if machines with many processors are to be 
evaluated. In this respect, approximate models of the parameter $K$ (copies of 
a shared datum between writes) and the read and write miss ratios of memories 
and directories have been presented. They describe the same qualitative 
behaviour of the parameters observed in the simulations. 
The model of the DDM has been used to evaluate the behaviour of COMA 
in [2]. Future plans include the adaptation of the application-dependent models 
to describe specific real applications. In particular, locality in the distribution of 
memory references is the next aspect to be included since it is directly related 
to the possible advantages of COMA. 
References 
1. H. Burkhardt, S. Frank, B. Knobe, and J. Rothnie. Overview of the KSR1 
Computer System. Technical Report KSR-TR-9202001, Kendall Square Research, 1992. 
2. C. Carreras. Modelo Analítico de la Máquina de Difusión de Datos y Efecto de la 
Inclusión de Procesadores Multicontexto. PhD thesis, E.T.S.I. Telecomunicación, 
Universidad Politécnica de Madrid, Sep 1993. 
3. E. Hagersten, P. Anderson, A. Landin, and S. Haridi. A Performance Study of the 
DDM - A Cache Only Memory Architecture. Technical Report R91:17, Swedish 
Institute of Computer Science, Nov 1991. 
4. E. Hagersten, S. Haridi, and D. H. D. Warren. The Cache-Coherence Protocol of 
the Data Diffusion Machine. In M. Dubois and S. S. Thakkar, editors, Cache and 
Interconnect Architectures in Multiprocessors. Kluwer Academic Publisher, 1990. 
5. E. Hagersten, A. Landin, and S. Haridi. DDM - A Cache-Only Memory 
Architecture. IEEE Computer, 25(9):44-54, 1991. 
6. L. Kleinrock. Queueing Systems (Volumes 1 and 2). John Wiley and Sons, 1975. 
7. E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik. Quantitative 
System Performance - Computer System Analysis Using Queueing Network Models. 
Prentice-Hall, 1984. 
8. M. Lofgren. A Simulator in C++ for a Parallel Architecture. PhD thesis, Swedish 
Institute of Computer Science, Nov 1990. 
9. A. Norton and G. F. Pfister. A Methodology for Predicting Multiprocessor 
Performance. In Proceedings 15th Annual International Symposium on Parallel 
Processing, pages 772-781, 1985. 
10. S. Raina and D. H. D. Warren. Traffic Patterns in a Scalable Multiprocessor 
through Transputer Emulation. In Proceedings International Hawaii Conference 
on System Science, 1991. 
11. J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel 
Applications for Shared-Memory. Technical Report CSL-TR-92-526, Computer Systems 
Laboratory, Stanford University, Jun 1992. 
12. M. K. Vernon, R. Jog, and G. S. Sohi. Performance Analysis of Hierarchical 
Cache-Consistent Multiprocessors. In Proceedings International Seminar on 
Performance of Distributed and Parallel Systems, pages 111-126. North-Holland, Dec 1988. 
13. D. H. D. Warren and S. Haridi. Data Diffusion Machine - A Scalable Shared 
Virtual Memory Multiprocessor. In International Conference on Fifth Generation 
Computer Systems. ICOT, 1988. 
14. A. W. Wilson. Hierarchical Cache/Bus Architecture for Shared Memory 
Multiprocessors. In Proceedings 14th Annual Symposium on Computer Architecture, pages 
244-252, 1987. 
