We develop an analytical model of a multiprocessor with private caches and shared memory and obtain both the instantaneous state probabilities and the steady-state probabilities of the system, so that both transient behaviour and equilibrium can be studied and analyzed. We show that the results can be applied to determine the output parameters for both blocking and non-blocking caches.
Introduction
Shared memory multiprocessors are widely used as platforms for technical and commercial computing [2]. Performance evaluation is a key technology for design in computer architecture. The continuous growth in the complexity of systems is making this task increasingly difficult [7]. In general, the problem of developing effective performance evaluation techniques can be stated as finding the best trade-off between accuracy and speed.
The most common approach to estimating the performance of a superscalar multiprocessor is to build a software model and simulate the execution of a set of benchmarks. Since processors are synchronous machines, however, simulators usually work at cycle level, and this leads to an enormous slowdown [9]. A simulation might take hours or even days.
For memory structures, relatively accurate analytical models have been developed [3, 7, 9, 10] through extensive use of various queuing systems. An open queuing system with Poisson arrivals and exponential service times is considered a good description of memory hierarchies [7]. Our focus is on the impact of cache-coherence protocols on overall system performance. The most commonly used technique for this purpose is Mean Value Analysis (MVA) [3, 5, 7, 8, 9]. It allows the total number of customers to be fixed (closed queuing system), which seems to be a more adequate representation of self-blocking requestors [5]. Calculations of output parameters such as residency times, waiting times and utilization are shown in [3, 8, 9]. MVA is based on the forced-flow assumption, which means that in equilibrium the output rate equals the input rate. Instantaneously, however, the input rate can differ from the output rate, so the instantaneous probabilities can differ from the equilibrium ones [7]. MVA offers no possibility to study transient effects. Moreover, the assumption of exponential service times is not realistic: in fact, all bus access times and memory access times are constants. It will be seen later in this paper that the state probabilities depend on the server's time density function.
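For comparison, the MVA recursion mentioned above can be sketched for the simplest closed system: N processors that "think" for a mean time and then queue at a single FCFS bus with exponential service. This is the textbook exact-MVA recursion, not a procedure taken from the cited works; the function name and the parameters `think_time` and `service_time` are illustrative assumptions.

```python
def mva_single_bus(n_procs, think_time, service_time):
    """Exact MVA for a closed system: n_procs customers, one delay
    (think) station and one FCFS exponential server (the bus).

    Returns a list of (n, residence_time, throughput, mean_queue_length)
    for populations n = 1 .. n_procs.
    """
    queue_len = 0.0  # mean number at the bus with n - 1 customers
    rows = []
    for n in range(1, n_procs + 1):
        # Arrival theorem: an arriving customer sees the queue of the
        # same system with one customer fewer.
        residence = service_time * (1.0 + queue_len)
        # Little's law over the whole cycle (think + bus); in equilibrium
        # the output rate equals the input rate.
        throughput = n / (think_time + residence)
        # Little's law at the bus.
        queue_len = throughput * residence
        rows.append((n, residence, throughput, queue_len))
    return rows


# Hypothetical numbers: mean think time 9, mean bus service time 1.
for n, r, x, q in mva_single_bus(4, 9.0, 1.0):
    print(f"n={n}  residence={r:.4f}  throughput={x:.4f}  queue={q:.4f}")
```

The recursion only yields equilibrium mean values, which is why it cannot describe transients or non-exponential FCFS service.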
We use the technique of Markov processes to describe the behaviour of the multiprocessor implementing cache-coherence protocols.
Definition and Analysis of the Model
A multiprocessor consists of several processors connected to a shared main memory by a common complete transaction bus. Each processor has a private cache. When a processor issues a request to its cache, the cache controller examines the state of the cache and takes suitable action, which may include generating a bus transaction to access main memory. Coherence is maintained by having all cache controllers "snoop" on the bus and monitor the transactions. Snoopy cache-coherence protocols fall into two major categories: Invalidate and Update [2, 3, 10]. Invalidating protocols are studied here, but the concepts can be applied with some modifications to updating protocols too. A transaction may or may not include the memory block and the shared bus. A typical transaction that does not include the memory block is Invalidate Cache Copy, which occurs when a processor requests a write to its cache; all other processors simply change the status bit(s) of their own copies to Invalid. If the memory block is uncached or not clean it can be uploaded from the main memory, but in today's multiprocessors it is rather uploaded from another cache designated as Owner (O) (cache-to-cache transfer). Memory-to-cache transfer occurs when the only clean copy is in the main memory. A cache block is written back (WB) to the main memory (the bus is used) when a dirty copy is evicted [6]. The bus and the main memory are also used when synchronization procedures are executed [2]. Evidently, the bus can be considered the bottleneck of the system.
In terms of the queuing theory processors can be viewed as customers (clients) and the bus can be viewed as a server.
Inter-arrival times are exponentially distributed with parameter λ. This assumption is adequate for most applications [7]. Requests are served on a First Come First Served (FCFS) basis. Immediately after issuing a request for a cache-to-cache transfer or a synchronization procedure, the customer blocks itself. The service time for a blocking request has density function f 1 (x). When service is completed, the processor (customer) either resumes processing with probability p, or resumes processing and generates a new request with probability q (p+q=1). This new request corresponds to a WB transaction and has a different density function f 2 (x). It does not block the customer, but the server is held until the WB transaction completes and therefore adds to the queue. Details on how to obtain the input parameters are given in [2, 3, 8, 9]. The system can be in one of the following states:
1) N: all N customers are doing internal processing;
2) j,1: j customers are doing internal processing (the remaining N-j are blocked) and all requests are of type 1 (0≤j≤N-1);
3) j,2: j customers are doing internal processing, the server is serving a request of type 2, and N-j customers are waiting in the queue for service of type 1 (0≤j≤N).
The transitions between these states are illustrated in Fig. 1.
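The state space described above can be made concrete with a short sketch. It assumes, purely for illustration, that both service times are exponential with hypothetical rates `mu1` and `mu2` (the model itself allows general densities f 1 and f 2), and it derives the transitions from the verbal description rather than from Fig. 1; all names are illustrative.

```python
def build_generator(n, lam, mu1, mu2, q):
    """Build the state list and generator matrix for the
    exponential-service special case (an assumption of this sketch).

    States: "N", (j, 1) for 0 <= j <= n-1, (j, 2) for 0 <= j <= n.
    """
    p = 1.0 - q
    states = ["N"] + [(j, 1) for j in range(n)] + [(j, 2) for j in range(n + 1)]
    index = {s: k for k, s in enumerate(states)}
    size = len(states)
    Q = [[0.0] * size for _ in range(size)]

    def add(src, dst, rate):
        # Off-diagonal entry plus matching diagonal update.
        if rate > 0.0:
            Q[index[src]][index[dst]] += rate
            Q[index[src]][index[src]] -= rate

    # All customers processing: any of them may issue a blocking request.
    add("N", (n - 1, 1), n * lam)

    for j in range(n):  # server busy with a type-1 (blocking) request
        if j > 0:
            add((j, 1), (j - 1, 1), j * lam)  # another customer blocks
        # Completion without write-back: customer resumes processing.
        add((j, 1), "N" if j + 1 == n else (j + 1, 1), mu1 * p)
        # Completion followed by a write-back that holds the server.
        add((j, 1), (j + 1, 2), mu1 * q)

    for j in range(n + 1):  # server busy with a type-2 (WB) request
        if j > 0:
            add((j, 2), (j - 1, 2), j * lam)  # new request joins the queue
        # WB done: serve the next waiting type-1 request, if any.
        add((j, 2), "N" if j == n else (j, 1), mu2)

    return states, Q


states, Q = build_generator(4, 0.01, 0.1, 0.05, 0.3)  # hypothetical values
print(len(states), "states")
```

For N customers this gives 2N+2 states, that is, 10 states for N=4.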
Throughout this paper we use the following notations:
P N (t): Probability[all N customers are doing internal processing at time t].
P j,i (t,x): Probability[j customers are doing internal processing, N-j are in the queue and/or in the server, and the server is busy doing service of type i at time t and the elapsed service time lies between x and x+dx].
P j,i (x): Probability[in the equilibrium state j customers are doing internal processing, N-j are in the queue and/or in the server, the server is busy doing service of type i and the elapsed service time lies between x and x+dx].
P j,i (t): Probability[j customers are doing internal processing, N-j are in the queue or in the server, the server is busy doing service of type i at time t].
P N , P j,i: steady-state probabilities.
Viewing the nature of the system, we obtain the following set of integro-differential equations
for i=1,2 having the following boundary and initial conditions 
for i=1,2
Then from (10) and (11) we have, after some transformations,
Hence the solutions of (9-11) are
By integrating (12, 14, and 15) we obtain the LT of the instantaneous probabilities
Taking LT of (6-7) and using (8 and 12-15) we get after some transformations the following system of linear equations
The coefficients u j,i (s,0) can now be determined from the above equations. We could apply the final-value theorem to (16-19) to obtain the steady-state probabilities, but this would require the use of L'Hôpital's rule and seems difficult and impractical [11]. Instead we set up the following differential equations
for i=1,2. Equations (24-28) are to be solved under the following boundary conditions and normalizing condition
Analytical Model For a Multiprocessor With Private Caches And Shared Memory
The solutions of (2.29-2.32) are
For u j,i (0) and P 0,1 (0) we have
The coefficients u j,i (0) can be determined from (32) and (38-42).
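To illustrate what solving such a linear system looks like, consider the special case of exponential service times, in which the equilibrium equations reduce to the global balance equations πQ = 0 together with the normalizing condition Σπ = 1. The following minimal sketch solves them by Gaussian elimination. It is a stand-in under that assumption, not the system (38-42) itself, and the function name and sample rates are hypothetical.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a generator matrix Q
    (list of rows) using Gaussian elimination with partial pivoting."""
    size = len(Q)
    # Transpose so that each row is one balance equation in pi.
    A = [[Q[j][i] for j in range(size)] for i in range(size)]
    b = [0.0] * size
    # One balance equation is redundant; replace it by normalization.
    A[-1] = [1.0] * size
    b[-1] = 1.0

    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-14:
            raise ValueError("singular system (reducible chain?)")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = A[row][col] / A[col][col]
            b[row] -= factor * b[col]
            for k in range(col, size):
                A[row][k] -= factor * A[col][k]

    # Back substitution.
    pi = [0.0] * size
    for row in range(size - 1, -1, -1):
        rest = sum(A[row][k] * pi[k] for k in range(row + 1, size))
        pi[row] = (b[row] - rest) / A[row][row]
    return pi


# Two-state illustration (N = 1, q = 0): idle -> busy at rate lam,
# busy -> idle at rate mu. Both rates are hypothetical.
lam, mu = 0.1, 1.0
pi = steady_state([[-lam, lam], [mu, -mu]])
print(pi)  # equilibrium probabilities of (idle, busy)
```

For this two-state chain the result can be checked by hand against π = (μ/(λ+μ), λ/(λ+μ)).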
Examples
In order to obtain the transient state probabilities, we first determine P N (s) and P j,i (s) from (16-19) and (20-24) and then apply the inverse Laplace transform to them. We used the packages of Maple 8 on a standard PC platform under Windows XP for these computations [12]. Results were produced and printed in less than a second. For N=4 the instantaneous probabilities are listed in Appendix A.
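As a numerical cross-check of such transient results, the following sketch computes instantaneous probabilities by uniformization instead of symbolic Laplace inversion. This is a different technique from the Maple computation described above, and it applies only when all service times are exponential, so that the model is an ordinary continuous-time Markov chain with generator Q. The function name and the two-state example are illustrative assumptions.

```python
import math


def transient_probabilities(Q, p0, t, tol=1e-12):
    """Return the state probability vector at time t for a CTMC with
    generator Q (list of rows) and initial distribution p0.

    Uses uniformization; intended for moderate values of rate * t,
    since exp(-rate * t) underflows when that product is very large.
    """
    size = len(Q)
    rate = max(-Q[i][i] for i in range(size))
    if rate == 0.0 or t == 0.0:
        return list(p0)
    # Transition matrix of the uniformized discrete-time chain.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(size)]
         for i in range(size)]

    weight = math.exp(-rate * t)  # Poisson probability of 0 jumps
    vec = list(p0)
    result = [weight * v for v in vec]
    acc = weight
    k = 0
    while 1.0 - acc > tol and k < 100000:
        k += 1
        vec = [sum(vec[i] * P[i][j] for i in range(size)) for j in range(size)]
        weight *= rate * t / k  # Poisson probability of k jumps
        acc += weight
        result = [r + weight * v for r, v in zip(result, vec)]
    return result


# Two-state illustration (N = 1, q = 0) with hypothetical rates.
lam, mu = 0.1, 1.0
Q = [[-lam, lam], [mu, -mu]]
for t in (0.0, 1.0, 5.0, 20.0):
    print(t, transient_probabilities(Q, [1.0, 0.0], t))
```

For this two-state chain the first component equals μ/(λ+μ) + λ/(λ+μ)·exp(-(λ+μ)t), a constant plus a decaying exponential, which is the same functional form as the expressions listed in Appendix A.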
Various performance characteristics can be computed using the steady-state probabilities. For example, the average number of blocked customers (ANBC) in the case of blocking caches will be given by
In the case of non-blocking caches ANBC will be
where k is the ratio of average memory stall time [2] . k depends strongly on the application.
(1-k) actually refers to the fraction of time the processor is consuming data while a cache-to-cache or memory-to-cache transfer is in progress.
In Appendix B we list the ANBC for two popular service time distributions, exponential and Erlang [1], for blocking and fully non-blocking caches (k=0). The time to solve (33-42) and calculate the ANBC was negligible.
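A minimal sketch of how a blocked-customer average follows from steady-state probabilities, assuming that for blocking caches the ANBC is the expected number of customers that are not doing internal processing, that is, N-j weighted over the states (j,1) and (j,2). This reading comes from the state definitions; the function name, argument layout and sample numbers are hypothetical, and the non-blocking case with the factor k is not covered.

```python
def average_blocked(n, p_all, p1, p2):
    """Expected number of customers not doing internal processing.

    p_all : probability of state N (all customers processing)
    p1[j] : probability of state (j, 1), j = 0 .. n-1
    p2[j] : probability of state (j, 2), j = 0 .. n
    """
    if len(p1) != n or len(p2) != n + 1:
        raise ValueError("p1 needs n entries and p2 needs n + 1 entries")
    total = p_all + sum(p1) + sum(p2)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    blocked_type1 = sum((n - j) * p1[j] for j in range(n))
    blocked_type2 = sum((n - j) * p2[j] for j in range(n + 1))
    return blocked_type1 + blocked_type2


# Made-up distribution for N = 2, for illustration only.
anbc = average_blocked(2, 0.5, [0.1, 0.2], [0.05, 0.1, 0.05])
print(anbc)
```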
Concluding Remarks
This work presented a model for a shared-bus, shared-memory multiprocessor with private caches that captures the whole spectrum of Invalidate-type cache-coherence protocols. Although we started with a fairly sophisticated set of integro-differential equations, the output of the model is a small set of linear equations from which the state probabilities can be determined.
The approach eliminates the main drawbacks of the most commonly used technique, MVA: its inability to deal with transients and its constraint on the service time distribution. The model gives insights into the transient behaviour of the system. Moreover, the assumption of exponentially distributed service times can be dropped; any continuous distribution can be used.
The ease of obtaining performance measures in negligible time makes it feasible to incorporate the model into a multiprocessor design tool.

APPENDIX A

… *exp(-0.8072343638e-1*t)
 +0.3775725072e-2*exp(-0.1510201407e-1*t)
 -0.8400171763e-2*exp(-0.1398702636e-1*t)
 +0.4974831098e-2*exp(-0.1256085210e-1*t)
 +0.5143708805e-3*exp(-0.1067946234e-1*t)
 -0.9181578239e-3*exp(-0.8161235321e-2*t),

P 01 (t) = 0.9242283829e-5*exp(-0.1248619627*t)
 -0.3513395647e-4*exp(-0.1089825679*t)
 +0.4257071327e-4*exp(-0.9494144284e-1*t)
 -0.1688587212e-4*exp(-0.8072343638e-1*t)
 -0.9200675435e-3*exp(-0.1510201407e-1*t)
 +0.2810623081e-2*exp(-0.1398702636e-1*t)
 -0.3183584754e-2*exp(-0.1256085210e-1*t)
 +0.1611954912e-2*exp(-0.1067946234e-1*t)
 -0.3239077071e-3*exp(-0.8161235321e-2*t)
 +0.5218152790e-5,

P 42 (t) = 0.2709223908e-1
 +0.9859983387e-3*exp(-0.1248619627*t)
 +0.1060558367e-2*exp(-0.1089825679*t)
 +0.1145474465e-2*exp(-0.9494144284e-1*t)
 +0.1259769099e-2*exp(-0.8072343638e-1*t)
 -0.2412466943e-1*exp(-0.1510201407e-1*t)
 -0.1705507775e-2*exp(-0.1398702636e-1*t)
 -0.2095511260e-2*exp(-0.1256085210e-1*t)
 -0.2029637013e-2*exp(-0.1067946234e-1*t)
 -0.1588776483e-2*exp(-0.8161235321e-2*t),

P 32 (t) = -0.2421204825e-3*exp(-0.1248619627*t)
 -0.7509940526e-4*exp(-0.1089825679*t)
 +0.9576676158e-4*exp(-0.9494144284e-1*t)
 +0.3013803504e-3*exp(-0.8072343638e-1*t)
 +0.7126069503e-1*exp(-0.1510201407e-1*t)
 -0.6135152996e-1*exp(-0.1398702636e-1*t)
 -0.8971987351e-2*exp(-0.1256085210e-1*t)
 -0.5950006752e-2*exp(-0.1067946234e-1*t)
 -0.4494935895e-2*exp(-0.8161235321e-2*t)
 +0.9428952497e-2,

P 22 (t) = 0.2421271696e-2
 +0.2626333487e-4*exp(-0.1248619627*t)
 -0.3154175021e-4*exp(-0.1089825679*t)
 -0.2945613244e-4*exp(-0.9494144284e-1*t)
 +0.3412946115e-4*exp(-0.8072343638e-1*t)
 -0.8108903801e-1*exp(-0.1510201407e-1*t)
 +0.1349032466*exp(-0.1398702636e-1*t)
 -0.4071010637e-1*exp(-0.1256085210e-1*t)
 -0.9622074403e-2*exp(-0.1067946234e-1*t)
 -0.5904604182e-2*exp(-0.8161235321e-2*t),

P 12 (t) = -0.1765308077e-5*exp(-0.1248619627*t)
 +0.4800731626e-5*exp(-0.1089825679*t)
 -0.4448905932e-5*exp(-0.9494144284e-1*t)
 +0.1599282603e-5*exp(-0.8072343638e-1*t)
 +0.4177917201e-1*exp(-0.1510201407e-1*t)
 -0.9973555226e-1*exp(-0.1398702636e-1*t)
 +0.7256040480e-1*exp(-0.1256085210e-1*t)
 -0.9747995399e-2*exp(-0.1067946234e-1*t)
 -0.5300997812e-2*exp(-0.8161235321e-2*t)
 +0.4449749927e-3,

P 02 (t) = -0.4618227199e-6*exp(-0.1248619627*t)
 -0.2203030325e-6*exp(-0.1089825679*t)
 +0.4881890483e-7*exp(-0.9494144284e-1*t)
 -0.1699392257e-7*exp(-0.8072343638e-1*t)
 -0.8188760719e-2*exp(-0.1510201407e-1*t)
 +0.2501502200e-1*exp(-0.1398702636e-1*t)
 -0.2833447693e-1*exp(-0.1256085210e-1*t)
 +0.1434663085e-1*exp(-0.1067946234e-1*t)
 -0.2882912592e-2*exp(-0.8161235321e-2*t)
 +0.4449749927e-4.
Bibliography
In the above expressions, e-i denotes 10^(-i) for i = 1, ..., 7.
APPENDIX B
