Markov analysis of multiple-disk prefetching strategies for external merging  by Pai, Vinay Sadananda et al.
Theoretical Computer Science 128 (1994) 211-239 
Elsevier 
Markov analysis of multiple-disk 
prefetching strategies for external 
merging 
Vinay Sadananda Pai* 
Schlumberger Dowell, P.O. Box 2710, Tulsa, OK 74101, USA
Alejandro A. Schäffer**
Department of Computer Science, Rice University, P.O. Box 1892, Houston, TX 77251, USA
Peter J. Varman***
Department of Electrical and Computer Engineering, Rice University, P.O. Box 1892, Houston, TX
77251, USA
Abstract 
Pai, V.S., A.A. Schäffer and P.J. Varman, Markov analysis of multiple-disk prefetching strategies for
external merging, Theoretical Computer Science 128 (1994) 211-239.
Multiple-disk organizations can be used to improve the I/O performance of problems like external 
merging. Concurrency can be introduced by overlapping I/O requests at different disks and by 
prefetching additional blocks on each I/O operation. To support this prefetching, a memory cache is 
required. 
Markov models for two prefetching strategies are developed and analyzed. Closed-form expressions
for the average parallelism obtainable for a given cache size and number of disks are derived
for both prefetching strategies. These analytic results are confirmed by simulation. 
1. Introduction 
Advances in processor architecture and integration technology have resulted in 
steady increases in processor speeds over the past several years. The performance of 
Correspondence to: P.J. Varman, Department of Electrical and Computer Engineering, Rice University, 
P.O. Box 1892, Houston, TX 77251, USA. Email: varman@rice.edu.
*Partially supported by an NSF Graduate Research Fellowship while at the ECE Department, Rice 
University. 
**Partially supported by NSF Research Initiation Award CCR 9010534. 
***Partially supported by NSF and DARPA Grant CCR 9006300. 
0304-3975/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved
SSDI 0304-3975(93)E0179-8 
I/O subsystems, in contrast, has generally not kept pace with these improvements in 
processor performance. The data rates possible from single disks are limited by 
physical considerations such as the speed of disk rotation and the rate of head 
movement, and are unlikely to increase dramatically. As a consequence, there have 
been a number of recent proposals for the use of multiple disks to form high-performance
I/O subsystems [9,16,18]. The performance of different multiple-disk
systems and their associated management strategies has been evaluated in
[5,6,10,12,17,19], for example. A number of analytic studies of I/O performance for
specific computational problems have been undertaken previously in
[1,2,4,8,11,14,20-23].
This paper suggests an analytic framework for the study of I/O parallelism by 
undertaking a specific case study of prefetching as a means for improving I/O 
performance in a multiple-disk environment. In particular, we study the trade-off 
between the average disk parallelism and the cache size for a single-pass merge of 
D sorted runs using D concurrent disk units storing the input. The system model 
consists of an infinitely fast CPU, a RAM-based disk cache with a capacity of 
C blocks, and D disks containing one sorted run each. On a single I/O operation, at 
most D blocks, one from each disk, can be read and transferred to the cache 
concurrently. 
The D-way merge algorithm operates as follows. A block from each run is brought 
into memory, and the records from each block are extracted and merged together in 
sorted order. When one of the input blocks is depleted, an I/O request is made for the 
next block from the run whose block was depleted. In a multiple disk system, the 
request for this next block can be overlapped with prefetching a block from each of the 
other disks. These prefetched blocks are held in the cache until they are required by 
the merge. If the cache is large enough to hold all prefetched blocks, then the total 
number of I/O accesses will be reduced by a factor of D over the case of a single disk, 
and a speedup of D in the I/O time will be expected.¹
We use the block random depletion model [11] in which the probability that the
next block depleted comes from any particular run is uniformly 1/D. A worst-case
scenario of block depletion, in which all the blocks from run 1 are depleted, then all 
the blocks from run 2 are depleted, and so on, is not interesting from a prefetching 
point of view. Once the cache gets filled in the worst-case scenario, only the depleted 
block can be replaced, and there is no opportunity for parallelism. Therefore, we 
sought a random model of block depletion that would be useful to study average-case 
behavior, and we chose a model proposed previously in a related paper. 
Ideally, we would like the prefetching strategy to read D blocks on each input 
operation; however, in practice, the cache can become filled, and it may not have space 
to store D additional blocks. We would like to know the expected number of blocks 
fetched on an input operation, which we can think of as the average parallelism 
obtained by using a machine with D disks instead of 1 disk. In particular, we analyze 
¹ The seek-time reduction that arises when the runs are spread over several disks is not considered.
the relationship between the size of the disk cache and the average parallelism for two 
different prefetching strategies that can be easily implemented with a standard 
multi-way merge algorithm. 
The motivation for the study is to provide a quantitative understanding of some of 
the trade-offs involved when resources like main memory are allocated to tasks in 
a typical DBMS system. We therefore concentrate on analyzing simple enhancements 
of the standard multiway merge algorithm that is commonly employed in such 
systems. The speeds of the source of the input data and the destination of the sorted 
output are assumed to be fast enough that merging is the bottleneck. Such situations 
arise often in practice, as when the input data is available on multiple disks, and the 
output is fed to a further processing stage. The larger questions of whether algorithms 
other than the multiway merge may be more suitable for sorting or merging data are 
not addressed here. With the same perspective, we assume contiguous placement of 
the sorted runs on the disks, as is done in several commercial DBMS systems such as
IBM's DB2 and SQL/DS. In contiguous placement, a run is placed on a single
disk, and occupies consecutive blocks on that disk. In contrast, in block-interleaved 
placement, consecutive blocks of a run are interleaved across the D disks. Issues 
related to block interleaving, including the degree of hardware support provided, the 
possibility of partial striping and the granularity are still the subject of active research, 
and beyond the scope of this paper. 
Recent investigations of different methods for multiple-disk sorting include the 
works of Vitter and Shriver [23], and Nodine and Vitter [22,14], who have presented 
new algorithms for external sorting. Vitter and Shriver [23] have shown that a randomized
version of distribution sorting can achieve an asymptotically optimal number
of I/Os. Vitter and Nodine have discovered an ingenious deterministic, mergesort-like
algorithm called greed sort. This algorithm achieves optimal asymptotic performance
even for small cache sizes, but requires a larger constant in the number of I/Os
required in any pass. More recently Nodine and Vitter [14] have discovered a deterministic
version of distribution sort called balance sort, which reduces the constants
associated with greed sort. While quite promising, these algorithms need to be
evaluated to determine the ranges over which they are superior to mergesort, especially
in real systems. For other works on the I/O complexity of sorting and merging,
the reader is referred to the works of Kwan and Baer [11] and Salzberg [20] for
single-disk systems, and to [15,24] for multi-disk systems.
The two prefetching strategies are described in detail in Section 2. The first is 
a randomized (and greedy) algorithm that attempts to read one block each from as 
many disks as is possible, subject to the availability of free cache space. The second is 
a deterministic (and conservative) algorithm that reads concurrently only if all disks 
can be used; otherwise it does not perform any prefetching. In our analysis, we derive 
a closed-form expression for the average number of blocks read on an I/O operation 
as a function of the cache size and the number of runs. Surprisingly, the conservative 
strategy, which sometimes leaves available cache slots empty, attains slightly more 
parallelism than the greedy strategy for most reasonable choices of the number of 
disks and cache size. This theoretical result is confirmed by our simulations using 
nominal-sized data. 
The rest of this paper is organized as follows. In Section 2 the two prefetching 
strategies are described in detail, and Markov models are developed for their analysis. 
Section 3 contains some mathematical definitions and combinatorial results used in 
the analysis. The first prefetching strategy is analyzed in Section 4 and the second in 
Section 5. In Section 6 we present simulation results which validate the analysis, and 
we conclude with Section 7. 
2. System models 
In this section we describe two prefetching strategies that can be used to improve 
the I/O performance in implementing the multiway merge. For both these strategies 
we develop Markov models which are analyzed in later sections. 
During the merge, whenever a block is depleted from run i, the cache is checked to 
see if the next block from run i is in the cache. If no block from run i is in the cache, 
a fetch operation is required. The block needed will be referred to as the demand-fetch 
block. To exploit the parallelism possible with the D disks, blocks from other disks can 
be prefetched along with the demand-fetch block, provided sufficient space is available 
in the cache to accommodate them. 
The two prefetching strategies we analyze will be referred to as the randomized 
prefetch model and the deterministic prefetch model, respectively. Both models have 
identical behavior when sufficient cache memory is available to fetch a block from 
each disk. In this case, D - 1 blocks are prefetched along with the demand-fetch block. 
The models differ when the number of free cache blocks, F, is less than D- 1. In the 
randomized model, F of the D - 1 disks are randomly chosen with equal probability; 
the demand-fetch block is fetched along with one block from each of these F disks. In 
the deterministic model, no prefetching is performed; only the demand-fetch block is 
fetched. The deterministic model fetches either D blocks or only 1 block, whereas the 
randomized model fetches between 1 and D blocks, depending on the amount of cache 
space available. 
An I/O operation that fetches at most D blocks, one from each disk, will be charged 
unit cost. The D disks can independently fetch blocks from different disk locations, but 
they will be assumed to initiate and complete I/O operations as a group. This 
assumption relaxes any temporal transients without invalidating the cache behavior 
being studied. If the full parallelism of the D disks can be used on every I/O operation, 
the cost of the merge operation will be reduced by a factor of D over the single-disk 
case. Ideally, a speedup of D can be obtained with a cache of unbounded size. As the 
cache size is reduced, the actual speedup will be lower, depending on the average I/O 
parallelism of each I/O operation. 
The two prefetching models are described by the pseudocode in Fig. 1. Initially, the 
cache is loaded with one block from each run. At the start of any iteration of the loop, 
/* C is the cache size in blocks. D is the number of runs. */
Initial state:
    Add the first block from each run to the cache.
    for (i = 1, ..., D) a[i] = 1;
    /* a[i] represents the number of blocks from run i that reside in the cache. */
    num_free_cache = C - D;
do forever {
    /* RECORD STATE HERE */
    Randomly choose a run, j, from which to deplete a block.
    a[j] = a[j] - 1;
    if (a[j] == 0) {  /* A run has emptied */
        if (num_free_cache >= (D - 1)) {
            Fetch D blocks, one from each run;  /* f-transition */
            for (i = 1, ..., D) a[i] = a[i] + 1;
            num_free_cache = num_free_cache - (D - 1);
            /* Only (D - 1) new prefetches have been added. */
            /* The demand-fetch replaces the cache block depleted from run j. */
        }
        else {  /* Fewer than (D - 1) free cache blocks */
            switch (prefetching model) {
            case DETERMINISTIC:
                Fetch 1 block from run j;  /* r-transition */
                a[j] = a[j] + 1;
                /* The demand-fetch replaces the cache block depleted from run j. */
                break;
            case RANDOM:
                Let F = num_free_cache;
                Randomly choose F runs from the D - 1 runs i = 1, ..., D, i != j.
                Let these be i_1, i_2, ..., i_F.
                Fetch 1 block from run j and 1 block from each of these F runs;  /* p-transition */
                a[j] = a[j] + 1;
                for (k = 1, ..., F) a[i_k] = a[i_k] + 1;
                num_free_cache = 0;  /* A p-transition always fills the cache. */
                break;
            }
        }
    }
    else  /* Next block of run j available in cache */
        num_free_cache = num_free_cache + 1;  /* d-transition */
}
Fig. 1. System model: random and deterministic prefetching models. 
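The loop of Fig. 1 can be turned into a small Monte Carlo experiment. The following Python sketch is only an illustration under the block random depletion model; the function name `simulate`, the model labels, and the `steps` and `seed` parameters are assumptions introduced here, not part of the original study.

```python
import random

def simulate(D, C, model, steps=200_000, seed=0):
    """Estimate the average number of blocks fetched per I/O operation.

    D: number of runs/disks, C: cache size in blocks (C >= D),
    model: "deterministic" or "random" (hypothetical labels).
    """
    rng = random.Random(seed)
    a = [1] * D              # a[i]: blocks of run i currently in the cache
    free = C - D             # number of free cache blocks
    io_ops = blocks = 0
    for _ in range(steps):
        j = rng.randrange(D)         # each run is depleted with probability 1/D
        a[j] -= 1
        if a[j] > 0:                 # d-transition: next block already cached
            free += 1
            continue
        io_ops += 1
        if free >= D - 1:            # f-transition: one block from every run
            for i in range(D):
                a[i] += 1
            free -= D - 1
            blocks += D
        elif model == "deterministic":   # r-transition: demand fetch only
            a[j] += 1
            blocks += 1
        else:                        # p-transition: fill the remaining free slots
            others = [i for i in range(D) if i != j]
            for i in rng.sample(others, free):
                a[i] += 1
            a[j] += 1
            blocks += free + 1
            free = 0
    return blocks / io_ops
```

For example, `simulate(5, 20, "random")` and `simulate(5, 20, "deterministic")` give estimates of the average parallelism of the two strategies for the same D and C; the estimates vary with `steps` and `seed`.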
[Figure 2: cache diagrams distinguishing prefetched blocks from blocks being merged. Panels: (a) cache state [6,1,4,2,5]; (b) cache state [5,1,4,2,5]; (c) cache state [7,1,5,3,6]; (d) cache state [6,1,4,2,5].]
Fig. 2. (a) A sample cache state for a system with D= 5 disks. (b) Cache state after a d-transition. 
(c) Cache state after an f-transition. (d) Cache state after an r-transition. 
at least one block from each run will be present in the cache. A random model of block
depletion is assumed [11]. The leading block from each run is a candidate for being
depleted next. One of the D leading blocks is chosen with equal probability 1/D, and
depleted. If that run still has blocks present in the cache, no I/O operation is required 
(we call this a d-transition). However, if that run has no more cached blocks, an I/O 
operation will be required to retrieve the next block from that run before the merge 
can continue. If there is enough cache space to read a block from each disk, D blocks 
are fetched (called an f-transition); if not, the action taken is different for the two 
prefetching models. In the deterministic model only one block from the depleted run is 
fetched (called an r-transition), while in the randomized model enough blocks to fill up 
the cache are fetched (called a p-transition). 
The dynamic behavior of both systems can be modeled mathematically by Markov 
chains. Each state in the model will represent the number of cache blocks allocated to 
each run at the start of each iteration of the loop in Fig. 1. Figure 2(a) shows a sample 
cache state for D= 5 disks. Each execution of the loop of the program causes 
a transition to a different state. 
To describe the states of the system and the transitions, we use the following 
notation. 
Definition 2.1. Let C denote the size of the cache in blocks, and D the number of
runs. Let $s = [s_1, \ldots, s_i, \ldots, s_D]$ be an integer vector with D components. Define
$\delta(s) = C - \sum_{i=1}^{D} s_i$ to be the number of free cache blocks in s. Define $\gamma(s)$ to be the
number of components of s that are 1.
Every cache state will be a D-component integer vector. From the program one can 
see that any vector that represents a cache state must satisfy Definition 2.2 below. 
Every component must be at least 1, at least one component must be equal to 1, and 
the number of free cache blocks must be between 0 and C-D. However, not all 
vectors satisfying these three conditions are necessarily reachable from the initial 
state. The set of reachable states differs for the two prefetching models; the reachable 
states are characterized precisely in Sections 4 and 5. 
Definition 2.2. A cache state is an integer vector $s = [s_1, \ldots, s_j, \ldots, s_D]$ such that every
component $s_j \ge 1$ for $1 \le j \le D$, $\gamma(s) \ge 1$, and $0 \le \delta(s) \le C - D$.
The transitions from state $s = [s_1, \ldots, s_j, \ldots, s_D]$ to state t can be characterized as
follows:
(1) Depletion (d-transition): A block is depleted from run j, and all runs still have
at least one cached block. This transition is enabled only if $s_j > 1$. Following the
transition, $t = [s_1, \ldots, s_j - 1, \ldots, s_D]$ (Fig. 2(b)).
(2) Fetch (f-transition): A depletion from run j occurs, and run j no longer has any
cached blocks. A fetch is required, and all D blocks are cached. This transition is
enabled only if $s_j = 1$ and $\delta(s) \ge D - 1$. Following the transition,
$t = [s_1 + 1, \ldots, s_{j-1} + 1, 1, s_{j+1} + 1, \ldots, s_D + 1]$ (Fig. 2(c)).
Fig. 3. The Markov chain for the randomized prefetching model with D = 3 disks and C= 7 cache blocks. 
[Figure 4: state-transition diagram. Legend: f-transition, r-transition, d-transition, invalid state.]
Fig. 4. The Markov chain for the deterministic prefetching model with D = 3 disks and C = 7 cache blocks.
(3) Replenish (r-transition): A depletion from run j occurs, and run j no longer has
any cached blocks. However, the cache does not contain enough free blocks for a fetch
of D blocks. Instead, only the next block from run j is fetched. This transition is enabled
only if $s_j = 1$ and $\delta(s) < D - 1$. Following the transition, $t = s$ (Fig. 2(d)).
(4) Partial fetch (p-transition): A depletion from run j occurs, and run j no longer
has any cached blocks. However, the cache has only $F < D - 1$ free cache blocks. The
next block from run j and one block from each of F runs chosen randomly from the
other $D - 1$ runs are fetched. This transition is enabled only if $s_j = 1$ and $\delta(s) < D - 1$.
A p-transition always fills the cache, since exactly $F = \delta(s)$ blocks are fetched in
addition to the block from run j.
Every state has exactly D possible depletions that initiate a transition to another
state. Each depletion is equiprobable with probability 1/D. In most cases, the depletion
determines which transition is made, and the transition also has probability
1/D. The only exception is when a p-transition occurs in the randomized model.
Figure 3 shows the Markov chain for the case of 3 disks and 7 cache blocks for the 
randomized model, while Fig. 4 shows the Markov chain for the deterministic model. 
In both figures the three applicable types of transitions are shown. The three states 
marked as invalid in Fig. 4 cannot be reached from the initial state. Also, states have 
been labeled as being either interior or edge states. These classifications are useful in 
Section 5. 
Most states and transitions look the same in both Markov chains. For example, in
both Markov chains, the state [1,2,2] has three outgoing transitions, each with
probability 1/3. If a block is depleted from the second or third run, a d-transition is
made to [1,1,2] or [1,2,1], respectively. If a block is depleted from the first run, an
I/O operation is required, since the first run has no cached blocks. Since [1,2,2] has
2 free cache blocks, an f-transition will be made, leading to state [1,3,3]. The
difference between the chains can be seen by considering the state [3,2,1]. In the
randomized model [3,2,1] has four outgoing transitions, while in the deterministic
model it has three. If the leading block from the third run is depleted, the models differ.
In the randomized model, two p-transitions are possible, to [4,2,1] or to [3,3,1]. The
choice of p-transition depends on a random decision on whether to use the last free
block in the cache for a block from the first run or the second. Each occurs with equal
probability (1/2), so the probability of the two p-transitions is (1/3)(1/2) = 1/6 for each.
In the deterministic model, the depletion of the lone block from the third run forces an
r-transition back to [3,2,1]; this r-transition has probability 1/3.
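The worked example for D = 3 and C = 7 can be reproduced mechanically. The sketch below enumerates the one-step transitions out of a state with exact rational probabilities; the function name `transitions` and the model labels are hypothetical names chosen for this illustration.

```python
from fractions import Fraction
from itertools import combinations

def transitions(s, C, model):
    """Map each successor of state s (a tuple) to its one-step probability."""
    D = len(s)
    free = C - sum(s)                      # free cache blocks in s
    out = {}

    def add(t, p):
        out[t] = out.get(t, 0) + p

    for j in range(D):                     # run j is depleted with probability 1/D
        p = Fraction(1, D)
        if s[j] > 1:                       # d-transition
            add(s[:j] + (s[j] - 1,) + s[j + 1:], p)
        elif free >= D - 1:                # f-transition: every other run gains a block
            add(tuple(x + (i != j) for i, x in enumerate(s)), p)
        elif model == "deterministic":     # r-transition: state unchanged
            add(s, p)
        else:                              # p-transitions: F runs chosen uniformly
            others = [i for i in range(D) if i != j]
            subsets = list(combinations(others, free))
            for chosen in subsets:
                t = tuple(x + (i in chosen) for i, x in enumerate(s))
                add(t, p / len(subsets))
    return out

# State [3,2,1] with C = 7: four successors when randomized, three when deterministic.
print(transitions((3, 2, 1), 7, "random"))
print(transitions((3, 2, 1), 7, "deterministic"))
```

Calling `transitions((1, 2, 2), 7, model)` with either model label yields the same three successors, each with probability 1/3, in line with the discussion above.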
3. Mathematical preliminaries 
In this section, some known combinatorial results about binomial coefficients and 
integer partitions are summarized. These results are used, in turn, to derive related 
mathematical formulas that will be applied directly in the analysis of Sections 4 and 5. 
(The reader may prefer to skim through this section and return to it when the results 
are used in subsequent sections.) 
Lemma 3.1 (Graham et al. [7, p. 169]).
integers $p, q \ge 0$, $n \ge q \ge 0$.
Lemma 3.2 (Graham et al. [7, p. 169]).
,i, (m~k)(,,~k)=(~~~) integers m2n,pTq’ 
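Lemma 3.3 (Graham et al. [7, p. 160]).
$\sum_{k=n}^{m} \binom{k}{n} = \binom{m+1}{n+1}$, integers $m \ge n \ge 0$.
Lemma 3.4 (Graham et al. [7, pp. 173-175]).
$\sum_{j=0}^{k} \binom{k}{j} \Big/ \binom{D-1}{j} = \frac{D}{D-k}$, integers $D - 1 \ge k \ge 0$.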
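As a sanity check on the last identity, the following sketch evaluates the sum of ratios exactly for small parameters. It assumes Lemma 3.4 is read as $\sum_{j=0}^{k} \binom{k}{j} / \binom{D-1}{j} = D/(D-k)$ for $D - 1 \ge k \ge 0$, which is the form needed in Section 4; the function name `ratio_sum` is introduced here for illustration only.

```python
from fractions import Fraction
from math import comb

def ratio_sum(D, k):
    """Exact value of sum_{j=0}^{k} C(k, j) / C(D-1, j), for 0 <= k <= D-1."""
    return sum(Fraction(comb(k, j), comb(D - 1, j)) for j in range(k + 1))

# Compare against D / (D - k) over a small range of parameters.
for D in range(2, 12):
    for k in range(D):
        assert ratio_sum(D, k) == Fraction(D, D - k)
```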
Definition 3.5 (Andrews [3, p. 1]). A partition of a positive integer n is a finite
nonincreasing sequence of positive integers $\lambda_1, \lambda_2, \ldots, \lambda_r$ such that $\sum_{i=1}^{r} \lambda_i = n$. The
$\lambda_i$ are called parts of the partition.
Definition 3.6 (Andrews [3, p. 54]). A composition is a partition in which the order of
the summands is considered. A composition with m parts will be denoted by
$[c_1, c_2, \ldots, c_m]$, in which the $c_i$ are the parts of the composition; note that $c_i \ge 1$ for
$1 \le i \le m$.
The following results on compositions can be found in the text by Andrews [3]. 
Lemma 3.7 (Andrews [3, pp. 54 and 63]). The number of compositions of n with
m parts, denoted by c(m,n), and the number of compositions of n with m parts in which
each part is greater than or equal to i, denoted by $c_i(m, n)$, are given by
$c(m, n) = \binom{n-1}{m-1}$,
$c_i(m, n) = \binom{n - (i-1)m - 1}{m-1}$.
The following result follows from Lemma 3.7. 
Lemma 3.8. The number of compositions of n with m parts of which exactly k parts are
equal to 1 is given by
$\begin{cases} 1 & \text{if } k = m = n, \\ \binom{m}{k} \binom{n-m-1}{m-k-1} & \text{otherwise.} \end{cases}$
Proof. For n = m, the number of compositions satisfying the lemma is 1 if k = m and
0 otherwise. For $n \ne m$, the number can be found as follows. The k parts that are 1 can
be chosen in $\binom{m}{k}$ ways. The other m-k parts must not equal 1, and they must sum to
exactly n-k. Lemma 3.7 gives the number of such possible combinations. Consequently,
the desired number is
$\binom{m}{k}\, c_2(m-k, n-k) = \binom{m}{k} \binom{(n-k) - (m-k) - 1}{(m-k) - 1}$. □
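The count in Lemma 3.8 can be compared against brute-force enumeration for small n and m. In this sketch the helper names are hypothetical, and a binomial coefficient with a negative argument is taken to be 0, which is an assumed convention that makes the "otherwise" branch vanish where no such composition exists.

```python
from itertools import product
from math import comb

def compositions(n, m):
    """All ordered m-tuples of positive integers that sum to n."""
    return [c for c in product(range(1, n + 1), repeat=m) if sum(c) == n]

def count_with_k_ones(n, m, k):
    """Brute-force count of compositions of n with m parts, exactly k equal to 1."""
    return sum(1 for c in compositions(n, m) if c.count(1) == k)

def binom(a, b):
    """Binomial coefficient, taken to be 0 if either argument is negative (assumption)."""
    return comb(a, b) if a >= 0 and b >= 0 else 0

def lemma_3_8(n, m, k):
    """Closed form of Lemma 3.8 as stated above."""
    if k == m == n:
        return 1
    return comb(m, k) * binom(n - m - 1, m - k - 1)

for n in range(1, 8):
    for m in range(1, 5):
        for k in range(m + 1):
            assert count_with_k_ones(n, m, k) == lemma_3_8(n, m, k)
```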
Definition 3.9. Let s be a composition of n. Denote by $\gamma(s)$ the number of parts of
s that are 1.
Definition 3.10. Let $\Gamma_m(n)$ be the set of all compositions of n with m parts, in which at
least one part is a 1. That is, $\Gamma_m(n) = \{ s = [s_1, s_2, \ldots, s_m] \mid \sum_{k=1}^{m} s_k = n,\ s_k \ge 1,\ \gamma(s) \ge 1 \}$.
Lemma 3.11. $|\Gamma_m(n)| = \binom{n-1}{m-1} - \binom{n-m-1}{m-1}$.
Proof. The number of compositions in $\Gamma_m(n)$ is the difference between the number of
compositions of n with m parts and the number of compositions of n with m parts with
every part $\ge 2$. By Lemma 3.7 the first quantity is $c(m, n) = \binom{n-1}{m-1}$, and the second is
$c_2(m, n) = \binom{n-m-1}{m-1}$. □
Lemma 3.12. $\sum_{s \in \Gamma_m(j)} \gamma(s) = m \binom{j-2}{m-2}$.
Proof. The total number of parts that are 1 must be counted in all compositions in the
set $\Gamma_m(j)$. Let $S_i$ be the subset of compositions in $\Gamma_m(j)$ in which part i equals 1,
$1 \le i \le m$. The number of ones contributed by part i over all compositions in $\Gamma_m(j)$
is clearly $|S_i|$. Hence, the desired result can be computed by summing $|S_i|$ for all i,
$1 \le i \le m$.
Every composition in $S_i$ has part i equal to 1, and the remaining m-1 parts must
sum to j-1. Therefore, the number of such compositions is c(m-1, j-1). Substituting
for c(m-1, j-1) from Lemma 3.7, the resulting expression is
$\sum_{s \in \Gamma_m(j)} \gamma(s) = \sum_{i=1}^{m} |S_i| = \sum_{i=1}^{m} \binom{j-2}{m-2} = m \binom{j-2}{m-2}$. □
Definition 3.13. Let $\Psi_m(n) = \bigcup_{j=m}^{n} \Gamma_m(j)$.
Lemma 3.14. $|\Psi_m(n)| = \binom{n}{m} - \binom{n-m}{m}$.
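Proof. Since $\Gamma_m(i)$ and $\Gamma_m(j)$ are disjoint for $i \ne j$, $|\Psi_m(n)|$ can be found by summing
$|\Gamma_m(j)|$ for all j between m and n:
$|\Psi_m(n)| = \sum_{j=m}^{n} |\Gamma_m(j)| = \sum_{j=m}^{n} \binom{j-1}{m-1} - \sum_{j=m}^{n} \binom{j-m-1}{m-1}$.
The terms in the second sum are 0 for $m \le j \le 2m - 1$, and, hence
$|\Psi_m(n)| = \sum_{j=m}^{n} \binom{j-1}{m-1} - \sum_{j=2m}^{n} \binom{j-m-1}{m-1}$. (3)
Lemma 3.3 can be used to evaluate both the sums of equation (3), thereby proving the
lemma. □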
Lemma 3.15. $\sum_{s \in \Psi_m(n)} \gamma(s) = m \binom{n-1}{m-1}$.
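Both counting results for $\Psi_m(n)$ can be checked by enumeration. The sketch below assumes the readings $|\Psi_m(n)| = \binom{n}{m} - \binom{n-m}{m}$ (Lemma 3.14) and $\sum_{s \in \Psi_m(n)} \gamma(s) = m \binom{n-1}{m-1}$ (Lemma 3.15); the function name `psi` is a hypothetical helper.

```python
from itertools import product
from math import comb

def psi(m, n):
    """Compositions with m parts, sum between m and n, and at least one part equal to 1."""
    # No part can exceed n - m + 1, because the other m - 1 parts are at least 1.
    return [s for s in product(range(1, n - m + 2), repeat=m)
            if sum(s) <= n and 1 in s]

for m in range(1, 5):
    for n in range(m, 9):
        members = psi(m, n)
        ones = sum(s.count(1) for s in members)      # sum of gamma(s) over the set
        assert len(members) == comb(n, m) - comb(n - m, m)
        assert ones == m * comb(n - 1, m - 1)
```

For m = 3 and n = 7 the set has 31 members containing 45 ones in total.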
Definition 3.16. A Markov chain is irreducible if there is a path of transitions from any 
state to any other state. A Markov chain is recurrent nonnull if for every state, the 
mean time to return to that state is finite. A Markov chain is aperiodic if for each state 
there exists a number k such that, for all j> k, returning to the state can be ac- 
complished in exactly j transitions. 
Theorem 3.17 (Molloy [13, p. 128]) (Ergodic Theorem). A discrete-time Markov chain
is said to be ergodic if it is irreducible, recurrent nonnull, and aperiodic. If a Markov
chain is ergodic, there exists a unique limiting distribution for the probability of being in
a state k, denoted as $\pi(k)$, independent of the initial state. These probabilities are called
the steady-state or equilibrium probabilities.
To find the steady-state probabilities we can solve the steady-state equation
$\pi M = \pi$, in which $\pi$ is the vector of steady-state probabilities and M is the single-step
transition probability matrix.
4. Analysis of the randomized prefetching model 
In this section we analyze the randomized prefetching model, with the aim of 
deriving an expression for the average parallelism of an I/O operation in the steady 
state. Towards this end, the state space is precisely characterized in Lemma 4.1, and 
the Markov chain is shown to be ergodic in Lemma 4.3. The steady-state probabilities 
are then derived in Theorem 4.4. Finally, a closed-form expression for the average I/O 
parallelism is derived in Theorem 4.6. 
In the randomized prefetching model, additional blocks are always prefetched into 
the cache if possible. When there are insufficient free cache blocks to fetch from all 
D runs, the runs from which to prefetch are chosen at random from the D - 1 runs not 
containing the demand-fetch block. 
Lemma 4.1. Every integer vector $[s_1, s_2, \ldots, s_D]$ satisfying the conditions of Definition
2.2 represents a cache state that can be reached from the initial state $[1, 1, \ldots, 1]$.
Proof. Let s be a vector satisfying the conditions of the lemma. Without loss of
generality assume that $s_1 = 1$. Choose $s'$ to be another integer vector with the
properties that
(1) $s'_1 = 1$,
(2) $\delta(s') = 0$; i.e., the cache is full,
(3) $s'_i \ge s_i$, $1 \le i \le D$.
Since $s'$ majorizes s along every component (property (3)), s can be reached from $s'$
by exactly $\delta(s)$ d-transitions. The lemma can be proven by showing that every state $s'$
satisfying the three properties is reachable from the initial state.
Our first goal is to reach some state with a full cache. We take $\lfloor (C - D)/(D - 1) \rfloor$
f-transitions (always depleting the first component) to reach a state t in which
$0 \le \delta(t) < D - 1$. If $\delta(t) > 0$, any p-transition that depletes the first component is taken
to reach a state with the first component equal to 1 and no free blocks.
Now, suppose that we want to reach a state $q'$ with $q'_1 = 1$ and $\delta(q') = 0$ from state
q with $q_1 = 1$ and $\delta(q) = 0$. Define the distance between q and $q'$ as follows:
$\mathrm{dist}(q, q') = \sum_{i=1}^{D} |q_i - q'_i|$.
Since $q \ne q'$ and $\delta(q) = \delta(q') = 0$, there exist parts i and j such that $q_i > q'_i$ and $q_j < q'_j$.
Take a d-transition that depletes part i and a p-transition that increments part j to
reach state $q''$ such that $\mathrm{dist}(q'', q') = \mathrm{dist}(q, q') - 2$. Repeat this process by replacing
q with $q''$ and choosing possibly different values of i and j; after $\frac{1}{2}\mathrm{dist}(q, q')$ repetitions,
state $q'$ will be reached. □
The previous lemma can be used to count the number of states in the Markov chain. 
Lemma 4.2. The number of states in the Markov chain is $\binom{C}{D} - \binom{C-D}{D}$.
Proof. By Lemma 4.1 and Definitions 3.13 and 3.10, the set of states is in one-to-one
correspondence with the set $\Psi_D(C)$. Therefore, the number of states is given by
Lemma 3.14. □
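The state count can also be obtained by exploring the randomized chain directly from the initial state. The sketch below is a plain graph search over successor states, assuming the count $\binom{C}{D} - \binom{C-D}{D}$ of Lemma 4.2; `successors` and `reachable` are hypothetical helper names.

```python
from itertools import combinations
from math import comb

def successors(s, C):
    """States reachable in one step from s in the randomized model."""
    D = len(s)
    free = C - sum(s)
    for j in range(D):
        if s[j] > 1:                       # d-transition
            yield s[:j] + (s[j] - 1,) + s[j + 1:]
        else:
            # Prefetch min(free, D-1) blocks: an f-transition when free >= D-1,
            # otherwise a p-transition that fills the cache.
            k = min(free, D - 1)
            others = [i for i in range(D) if i != j]
            for chosen in combinations(others, k):
                yield tuple(x + (i in chosen) for i, x in enumerate(s))

def reachable(D, C):
    """Set of states reachable from the initial state [1, 1, ..., 1]."""
    start = (1,) * D
    seen = {start}
    stack = [start]
    while stack:
        s = stack.pop()
        for t in successors(s, C):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

# D = 3 disks and C = 7 cache blocks, as in Fig. 3.
print(len(reachable(3, 7)), comb(7, 3) - comb(4, 3))
```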
Lemma 4.3. The Markov chain for the randomized prefetching model is ergodic. 
Proof. Let s and t be two arbitrary states. To make a path from s to t, we first go from
s to the initial state $[1, 1, \ldots, 1]$ by a sequence of d-transitions. Then we add the path
from the initial state to t that exists by Lemma 4.1. Therefore, the Markov chain is
irreducible.
Each state s has a finite nonzero probability of returning to itself, since there is
a finite-length, nonempty path from s to itself, and all state transitions have finite,
nonzero probabilities. Therefore, the Markov chain is recurrent nonnull.
To show that the Markov chain is aperiodic, note that there is a path from any state
s to itself that passes through a state with no free blocks. In a state with no free blocks,
any number of p-transitions can be taken without changing state. As a result there is
a constant k such that for any j > k, there is a path of exactly j transitions from s to itself. □
Since the Markov chain is ergodic, Theorem 3.17 asserts that each state has 
a unique steady-state probability. Surprisingly, every state has the same probability. 
Theorem 4.4. Every state has steady-state probability $1 / \left[ \binom{C}{D} - \binom{C-D}{D} \right]$.
Proof. Consider the steady-state equation $\pi M = \pi$ mentioned above. To show that the
equation has a solution in which all components of $\pi$ are equal, it suffices to show that
every column in M sums to 1. In other words, for every state s the sum of the
probabilities on transitions coming into s is 1. The proof contains three cases
depending on the structure of the state s.
Case 1: $\delta(s) > 0$ and $2 \le \gamma(s) \le D$.
Since a p-transition always fills the cache, and since s has a nonzero number of free
cache blocks, s cannot be reached by a p-transition. Also, s cannot be reached by an
f-transition, because an f-transition always leads to a state t with $\gamma(t) = 1$. Thus, s can
be reached by only d-transitions.
Since $\gamma(s) > 1$, for each component i, $1 \le i \le D$, there is a d-transition from
$[s_1, s_2, \ldots, s_{i-1}, s_i + 1, s_{i+1}, \ldots, s_D]$ to s. Each such transition has probability 1/D; therefore,
the total probability on transitions into s is D(1/D) = 1.
Case 2: $\delta(s) > 0$ and $\gamma(s) = 1$.
Without loss of generality assume that $s_1 = 1$. As in Case 1, s cannot be reached
by a p-transition. However, s can be reached by an f-transition from
$[1, s_2 - 1, s_3 - 1, \ldots, s_D - 1]$; this f-transition has probability 1/D. Also, for each i,
$2 \le i \le D$, there is a d-transition with probability 1/D from
$[s_1, s_2, \ldots, s_{i-1}, s_i + 1, s_{i+1}, \ldots, s_D]$ to s. Note that s cannot be reached by a d-transition
that depletes the first component, since
the previous state would have no parts equal to 1. Thus, s can be reached by D-1
d-transitions and 1 f-transition, each with probability 1/D, for a total incoming
probability of 1/D + (D-1)(1/D) = 1.
Case 3: $\delta(s) = 0$ and $\gamma(s) = D - k$.
Since s has no free blocks, s can only be reached by an f-transition or a p-transition.
Therefore, if there is a transition from $s'$ to s, s majorizes $s'$ componentwise. In
particular, if $s_i = 1$, then $s'_i = 1$, for any component i. Furthermore, if $\delta(s') = j$, then $s'$
differs from s in exactly j components, and for each such component, $s'_i = s_i - 1$; we say
that such an $s'$ is component compatible to s. From any $s'$ that is component
compatible to s, s can be reached by either a p-transition, if $\delta(s') < D - 1$, or an
f-transition, if $\delta(s') = D - 1$. Since s has k parts not equal to 1, there are exactly $\binom{k}{j}$
states with j free blocks that are component compatible to s.
Suppose $s'$ is component compatible to s and $\delta(s') = j$. The probability that s is
reached from $s'$ by a p-transition or an f-transition depends on the occurrence of two
events. First, the component of $s'$ whose depletion caused the transition must be one of
the D-k parts of s which is 1. Secondly, the j blocks that are prefetched must add to
those parts i for which $s'_i = s_i - 1$. Since the next depletion is chosen with equal
probability from the D parts, the probability of the first event is (D-k)/D. The
probability of the second event given the choice of the depleted component is $1 / \binom{D-1}{j}$,
because the j parts are chosen at random from the D-1 other parts.
The total probability on transitions into s is the sum of the product of the number of 
component compatible states with j free blocks and the probability of the transition 
Markou analysis of multiple-disk prefetching strategies 225 
into s: 
By Lemma 3.4, this quantity is equal to 1. q 
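The Case 3 sum can be checked numerically for small parameters. The following Python sketch is not part of the original analysis: it assumes the incoming probability is the sum over $j$ of (number of component compatible states) × (probability the depleted part is a 1) × (probability the prefetch hits the right $j$ parts), exactly as described in the text, and the function name `incoming_probability` is ours.

```python
from fractions import Fraction
from math import comb


def incoming_probability(D: int, k: int) -> Fraction:
    """Total probability flowing into a full-cache state with k parts != 1."""
    total = Fraction(0)
    for j in range(k + 1):                           # j = free blocks of the predecessor
        compatible = comb(k, j)                      # component compatible predecessors
        deplete_a_one = Fraction(D - k, D)           # depleted part is one of the D-k ones
        right_targets = Fraction(1, comb(D - 1, j))  # prefetch lands on the j right parts
        total += compatible * deplete_a_one * right_targets
    return total


if __name__ == "__main__":
    for D in range(2, 8):
        for k in range(D):                           # gamma(s) = D - k is at least 1
            print(D, k, incoming_probability(D, k))
```

Exact rational arithmetic is used so that the comparison with 1 is not blurred by rounding.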
The aim of this analysis is to determine the relation between the size of the cache 
and the average number of blocks fetched on each I/O operation. Since an I/O 
operation is performed on either an f-transition or a p-transition, these transitions are 
referred to as I/O transitions. 
Lemma 4.5. Let $\pi(t)$ denote the steady-state probability of a state $t$ of the Markov
chain, and let $n(t)$ be the number of blocks fetched on an I/O operation from $t$. The
average number of blocks fetched on an I/O operation in the steady state is
$$\frac{\sum_{t} \gamma(t)\,\pi(t)\,n(t)}{\sum_{s} \gamma(s)\,\pi(s)},$$
where both sums are taken over all states in the Markov chain.
Proof. An I/O transition from a state $t$ occurs if and only if a component equal to 1 is
depleted. Since $t$ has $\gamma(t)$ parts that are 1 and each of the $D$ parts is depleted with
probability $1/D$, the probability of an I/O transition from $t$ is the product of $\gamma(t)/D$ and
the probability of being in state $t$ (i.e. $\pi(t)$). Therefore, the conditional probability that
an I/O transition is taken from $t$, given that some I/O transition is taken, is
$\gamma(t)\pi(t) / \sum_{s} \gamma(s)\pi(s)$. Since the number of blocks fetched on an I/O transition from $t$ is
$n(t)$, the lemma follows. □
Theorem 4.6. For the randomized prefetching model, the average number of blocks
fetched on an I/O operation in the steady state is
$$\frac{\binom{C}{D} - \binom{C-D}{D}}{\binom{C-1}{D-1}}.$$
Proof. By Theorem 4.4, $\pi(t)$ is the same for every state $t$ in the Markov chain. The
number of blocks $n(t)$ fetched on an I/O transition from $t$ is $\min(D, \delta(t) + 1)$, since
$D$ blocks are fetched if $\delta(t) \ge D - 1$ and $1 + \delta(t)$ are fetched otherwise.
Thus, the average number of blocks read on any I/O operation is
$$\frac{\sum_{t} \gamma(t)\,\min(D, \delta(t) + 1)}{\sum_{s} \gamma(s)}, \tag{4}$$
in which both sums are taken over all states in the Markov chain.
A closed form in terms of binomial coefficients will be derived for (4); first, a closed
form will be derived for the denominator. By Lemma 4.1 and Definitions 3.13 and 3.9,
the denominator is the total number of parts equal to 1 in the set $\mathcal{S}_D(C)$. By Lemma
3.15, this quantity is
$$D\binom{C-1}{D-1}. \tag{5}$$
To find a closed form for the numerator, the number of parts equal to 1 must be
summed over all states with $C - j$ free blocks. By Lemma 3.12, this quantity is $D\binom{j-2}{D-2}$.
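The number of blocks fetched is $D$ if $j \le C - D$ and $(C - j + 1)$ otherwise. In the first
case, write $D$ as $(C - j + 1) - (C - j - D + 1)$. The numerator is therefore
$$D\left[\sum_{j=D}^{C} (C-j+1)\binom{j-2}{D-2} \;-\; \sum_{j=D}^{C-D} (C-j-D+1)\binom{j-2}{D-2}\right]. \tag{6}$$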
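Before Lemma 3.1 can be used to simplify (6), several transformations must be made.
The lower bound of both summation indices can be extended to $j = 0$, since the
binomial coefficient is 0 when $j < D$. Additionally, the substitution $a = \binom{a}{1}$ will be made,
resulting in
$$D\left[\sum_{j=0}^{C} \binom{C-j+1}{1}\binom{j-2}{D-2} \;-\; \sum_{j=0}^{C-D} \binom{C-j-D+1}{1}\binom{j-2}{D-2}\right]. \tag{7}$$
Change the index of summation by making the substitution $i = j - 2$. The upper bound
on index $i$ can be extended by 1 without changing the total sum. The result is
$$D\left[\sum_{i=0}^{C-1} \binom{C-1-i}{1}\binom{i}{D-2} \;-\; \sum_{i=0}^{C-D-1} \binom{C-D-1-i}{1}\binom{i}{D-2}\right] = D\left[\binom{C}{D} - \binom{C-D}{D}\right], \tag{8}$$
where the equality follows from Lemma 3.1.
By combining the denominator, (5), and the numerator, (8), the theorem is
proved. □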
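The closed form can be cross-checked against a direct evaluation of (4) for small $D$ and $C$. The sketch below is illustrative only: it assumes that the states are the integer vectors with $D$ parts, each at least 1, and total at most $C$, weighted uniformly as in Theorem 4.4, and it takes the closed form to be $[\binom{C}{D} - \binom{C-D}{D}]/\binom{C-1}{D-1}$, which is our reading of the derivation above. All function names are ours.

```python
from fractions import Fraction
from itertools import product
from math import comb


def states(D: int, C: int):
    """All vectors with D parts, each >= 1, and total at most C."""
    for s in product(range(1, C - D + 2), repeat=D):
        if sum(s) <= C:
            yield s


def average_fetch_bruteforce(D: int, C: int) -> Fraction:
    """Ratio (4): sum of gamma(t)*min(D, delta(t)+1) over sum of gamma(s)."""
    num = den = 0
    for s in states(D, C):
        ones = s.count(1)              # gamma(s); states without a 1 contribute 0
        free = C - sum(s)              # delta(s)
        num += ones * min(D, free + 1)
        den += ones
    return Fraction(num, den)


def average_fetch_closed_form(D: int, C: int) -> Fraction:
    """Assumed closed form for the randomized model (requires C >= D)."""
    return Fraction(comb(C, D) - comb(C - D, D), comb(C - 1, D - 1))


if __name__ == "__main__":
    for D, C in [(2, 3), (2, 4), (3, 6), (4, 9)]:
        print(D, C, average_fetch_bruteforce(D, C), average_fetch_closed_form(D, C))
```

The enumeration grows as $(C-D+1)^D$, so it is only practical for small parameters.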
5. Analysis of the deterministic model 
In this section the Markov chain for the deterministic prefetching model is ana- 
lyzed. Once again, an expression will be derived for the average parallelism of an I/O 
operation in the steady state. The organization of the section parallels that of Section 
4. The state space of the Markov chain is characterized in Theorem 5.4, and shown to 
be ergodic in Lemma 5.7. The steady-state probabilities of the states in the Markov 
chain are then derived in Theorem 5.8. Finally, a closed-form expression for the 
average I/O parallelism is derived in Theorem 5.11. 
Recall that in the deterministic prefetching model a block is fetched from all D runs 
if there is sufficient cache space. However, when there are insufficient free cache blocks 
to fetch from all D runs, only one block - the demand-fetch block - is fetched. 
The state spaces of the Markov chains for this model and the randomized model 
differ. In particular, Theorem 5.4 will show that not every state satisfying Definition 
2.2 is reachable from the initial state. 
Lemma 5.1. Any state $s$ with $\gamma(s) \ge \delta(s) + 2$ is not reachable from any other state.
Proof. Since $\gamma(s) \ge 2$, $s$ cannot be reached by an f-transition, since an f-transition into
$s$ requires $\gamma(s) = 1$. Hence, $s$ must be reached by a d-transition from some state, say $t$.
Then, $\delta(t) = \delta(s) - 1$ and either $\gamma(t) = \gamma(s) - 1$ or $\gamma(t) = \gamma(s)$. In either case, $\gamma(t) \ge \delta(t) + 2$.
The lemma is proved by induction on $\delta(s)$. The lemma holds for a state $s$ with
$\delta(s) = 0$ and $\gamma(s) \ge 2$, since there are no d-transitions into a state with zero free cache
blocks.
Now assume inductively that the lemma holds for all states $w$ where $\delta(w) < \delta(s)$. As
shown above, any state $t$ with a transition to $s$ satisfies $\gamma(t) \ge \delta(t) + 2$, and $\delta(t) < \delta(s)$.
Hence, $t$ satisfies the induction hypothesis, and so $t$ is unreachable. Hence, $s$ is
unreachable. Therefore, the lemma holds for all $w$ with $\delta(w) \le \delta(s)$. □
Figure 4 shows the Markov chain for the deterministic model. State $[1, 1, 5]$ is
marked as invalid, i.e. it is not reachable from the initial state. This conforms to
Lemma 5.1, since this state has zero free blocks and two components which are 1.
Lemma 5.2. Let $s = [1, s_2, s_3, \ldots, s_{j-1}, 1, 1, \ldots, 1]$ be a state such that $\delta(s) \ge (D-1) + n_j$,
$n_j \ge 1$, and let $t$ be the state $[1, s_2, s_3, \ldots, s_{j-1}, 1 + n_j, 1, \ldots, 1]$. Then, there is a sequence of
transitions that takes the system from $s$ to $t$.
Proof. We first describe a sequence of transitions $\Delta_j$ that leads from $s$ to a state in
which component $s_j$ is incremented by 1. First, $s_1$ is depleted, resulting in an
f-transition since $\delta(s) \ge (D-1) + n_j$. From this state, every component, except 1 and $j$,
is successively depleted once each by a d-transition, leading to a state
$s' = [1, s_2, s_3, \ldots, s_{j-1}, 2, 1, \ldots, 1]$; notice that $\delta(s') \ge (D-1) + (n_j - 1)$, and hence, the
sequence $\Delta_j$ can be applied once more to $s'$. After a total of $n_j$ such applications the
state $t$ is reached. □
Lemma 5.3. Every integer vector $s = [s_1, s_2, \ldots, s_D]$ that satisfies Definition 2.2 and
satisfies $\gamma(s) \le \delta(s) + 1$ is reachable from state $c = [1, 1, \ldots, 1]$.
Proof. Without loss of generality we rearrange the components of $s$ in nondecreasing
order. That is, $s = [1, \ldots, 1, s_{k+1}, s_{k+2}, \ldots, s_D]$ such that $s_i \le s_{i+1}$ for $1 \le i < D$. As a result
of this reordering, $s_i = 1$ for $1 \le i \le k$, $k = \gamma(s)$, and $s_i \ge 2$ for $k+1 \le i \le D$.
Since $k = \gamma(s) \le \delta(s) + 1$, we have $\delta(s) \ge k - 1$. We also have the following bound
on $\delta(c)$:
$$\begin{aligned}
\delta(c) &= \delta(s) + \sum_{i=k+1}^{D} (s_i - 1) \\
&= \delta(s) + \sum_{i=k+1}^{D} (s_i - 2) + (D - k) \\
&\ge k - 1 + \sum_{i=k+1}^{D} (s_i - 2) + (D - k) \\
&= (D - 1) + \sum_{i=k+1}^{D} (s_i - 2).
\end{aligned} \tag{9}$$
Inequality (9) ensures that we can apply the sequence of transitions of Lemma 5.2 to
each of the components $c_{k+1}$ through $c_D$ of $c$ in turn, to get from $c$ to a state
$s' = [1, \ldots, 1, s_{k+1} - 1, s_{k+2} - 1, \ldots, s_D - 1]$. Also, $\delta(s') \ge D - 1$. To reach $s$ from $s'$, an
f-transition is made by depleting $s'_1$, and then components 2 through $k$ are depleted.
Since $s$ was an arbitrary state satisfying the conditions of the lemma, the proof is
complete. □
Combining Lemmas 5.1 and 5.3, we have the following theorem. 
Theorem 5.4. In the Markov chain for the deterministic model, the set of all states
reachable from the initial state is precisely the set of states described in Lemma 5.3.
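The characterization of the state space can be explored by brute force for small $D$ and $C$. The sketch below is a hedged illustration, not the authors' procedure. It assumes the transition rules as they are used in the proofs above: depleting a part larger than 1 is a d-transition; depleting a part equal to 1 is an f-transition (every other part grows by 1) when at least $D-1$ blocks are free, and otherwise a demand fetch that leaves the state unchanged. It also assumes that Definition 2.2 requires at least one part equal to 1 and that $C \ge 2D-1$. All function names are ours.

```python
from collections import deque
from itertools import product


def successors(s, C):
    """States reached by depleting each of the D parts of s in turn."""
    D = len(s)
    free = C - sum(s)
    for i in range(D):
        if s[i] > 1:                      # d-transition: no I/O needed
            yield s[:i] + (s[i] - 1,) + s[i + 1:]
        elif free >= D - 1:               # f-transition: one block from every run
            yield tuple(1 if j == i else s[j] + 1 for j in range(D))
        else:                             # demand fetch only: state unchanged
            yield s


def reachable(D, C):
    """Breadth-first search from the initial state [1, 1, ..., 1]."""
    start = (1,) * D
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in successors(s, C):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def predicted(D, C):
    """States with 1 <= gamma(s) <= delta(s) + 1 (assumed reading of Lemma 5.3)."""
    return {s for s in product(range(1, C - D + 2), repeat=D)
            if sum(s) <= C and 1 <= s.count(1) <= C - sum(s) + 1}


if __name__ == "__main__":
    for D, C in [(2, 3), (3, 5), (3, 6)]:
        print(D, C, reachable(D, C) == predicted(D, C), len(reachable(D, C)))
```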
To derive the steady-state probabilities and the average I/O parallelism, it is useful 
to classify the states of the Markov chain as follows. 
Definition 5.5. A state $s$ in the Markov chain is an interior state if the number of free
cache blocks in $s$ is at least $D - 1$, i.e. $\delta(s) \ge D - 1$. A state that is not an interior
state is an edge state.
Definition 5.6. Let $I$ denote the set of all interior states and $E$ denote the set of all
edge states in the Markov chain. Let $E_{f,k}$, $0 \le f \le D - 2$, $1 \le k \le f + 1$, denote the subset
of edge states that have $f$ free cache blocks and have $k$ parts which are 1.
Before we derive the steady-state probabilities of the states in the Markov chain, we 
establish that they exist and are unique. This can be done by showing that the Markov 
chain is ergodic. 
Lemma 5.7. Each valid state $s$ has a unique steady-state probability.
Proof. Theorem 5.4 defines the set of states in the Markov chain for the deterministic
model. The proof of ergodicity closely parallels the proof of Lemma 4.3.
Let $s$ and $t$ be two arbitrary states. To find a path from $s$ to $t$, use d-transitions to get
from $s$ to the initial state $[1, 1, \ldots, 1]$, and then use a path from the initial state to $t$,
which exists by Lemma 5.3. Therefore, the Markov chain is irreducible.
Each state $s$ has a finite nonzero probability of returning to itself, since there is
a finite-length, nonempty path from $s$ to itself, and all state transitions have finite,
nonzero probabilities. Therefore, the Markov chain is recurrent nonnull.
To show that the Markov chain is aperiodic, note that there is a path from a state
$s$ to itself that passes through an edge state. In an edge state, any number of
r-transitions can be made without changing state. As a result, the Markov chain is
aperiodic.
Hence, the Markov chain is ergodic. By Theorem 3.17, every state has a unique
steady-state probability. □
We derive below the relative steady-state probabilities of the states, rather than the
absolute probabilities. Each relative probability must be scaled by a normalization
constant, $\alpha$, to obtain the actual steady-state probability of that state. $\alpha$ can be chosen
uniquely so that the sum of the probabilities of all states is 1, but we do not need to
compute it explicitly.
Theorem 5.8. The steady-state probability of state $s$ (denoted by $\pi(s)$) is given by
$$\pi(s) = \begin{cases} \alpha(D-1) & \text{if } s \in I, \\[4pt] \alpha\,\dfrac{(f+1)!\,(D-k-1)!}{(f-k+1)!\,(D-2)!} & \text{if } s \in E_{f,k}, \end{cases}$$
in which $\alpha$ is the normalization constant.
Proof. The proof will consist in verifying that the probabilities stated in the theorem
satisfy the equation $\pi M = \pi$, in which $\pi$ is the vector of steady-state probabilities in the
theorem and $M$ is the single-step transition probability matrix given by the transition
rules defined earlier. In particular, the probabilities of all incoming transitions to
$s$ weighted by the steady-state probability of the originating state will be summed and
will be shown to equal $\pi(s)$.
The proof consists of four cases depending on the structure of state $s$. Figure
5 shows the transitions for the different types of states. Each transition shown has $1/D$
probability of being taken, except those transitions labeled as having $n$ occurrences,
which have $n/D$ probability.
Case 1: $s \in I$. We show that $\pi(s) = \alpha(D-1)$.
Figure 5(a) shows the incoming transitions to an interior state $s$ with $\gamma(s) = 1$. State
$s$ can be reached by $D - 1$ d-transitions or an f-transition. The f-transition must
be from an interior state, say $t$, and $\pi(t) = \alpha(D-1)$. The d-transition must be from
a state $u$, where either $u \in I$ or $u$ is an edge state with $\delta(u) = D - 2$. In both cases,
$\pi(u) = \alpha(D-1)$. Summing the weighted probabilities on the input edges we get
$\alpha(1/D)(D)(D-1) = \alpha(D-1)$.
Fig. 5. (a) An interior state with exactly one part equal to 1. (b) An interior state with $i$ parts equal to 1.
(c) An edge state with no free cache blocks. Exactly one part can be equal to 1. (d) A state in $E_{f,1}$ with $f \ge 1$
free cache blocks and 1 part which is 1. (e) A state in $E_{f,k}$ with $f \ge 1$ free cache blocks and $k \ge 2$ parts which
are 1. (f) A state in $E_{f,f+1}$ with $f \ge 1$ free cache blocks and $f + 1$ parts which are 1.
Figure 5(b) shows an interior state $s$ with $2 \le \gamma(s) \le D$. Such a state cannot be
reached by any f-transitions. All incoming transitions are d-transitions, each having
probability $1/D$. Any of these d-transitions originates from a state (say $u$) where either
$u \in I$ or $u$ is an edge state with $\delta(u) = D - 2$. In either case, $\pi(u) = \alpha(D-1)$, and hence,
$\pi(s) = \alpha(D-1)$.
Case 2: $s \in E_{f,1}$. We show that $\pi(s) = \alpha(f+1)$.
Figure 5(c) shows the incoming transitions to a state $s \in E_{0,1}$. The sum of the
incoming probabilities is $\alpha(f+1)(1/D) + \alpha(D-1)(1/D) = \alpha$, where the first term is due
to the r-transition, and the second is due to the f-transition into $s$. This satisfies the
theorem since $f = 0$.
For $f \ge 1$, Fig. 5(d) shows all incoming transitions. Any d-transition into $s$ must be
from a state $u \in E_{f-1,1}$, with $\pi(u) = \alpha f$. There is one f-transition into $s$ from an interior
state, and one r-transition. Summing the weighted probabilities of the incoming
transitions, we get
$$\alpha f\,\frac{D-1}{D} + \alpha(D-1)\,\frac{1}{D} + \alpha(f+1)\,\frac{1}{D} = \alpha(f+1).$$
Case 3: $s \in E_{f,k}$, $f \ge 1$ and $2 \le k \le f$. We show that
$$\pi(s) = \alpha\,\frac{(f+1)!\,(D-k-1)!}{(f-k+1)!\,(D-2)!}.$$
Figure 5(e) shows the case of an edge state with $2 \le k \le f$. The state $s$, with $f \ge 2$ and
$2 \le k \le f$, has a total of $D$ incoming d-transitions; $k$ from states in $E_{f-1,k-1}$, and the
remaining $D - k$ from states in $E_{f-1,k}$. Since $\gamma(s) \ge 2$, $s$ cannot be reached by an
f-transition. Since $s$ has $k$ parts equal to 1, there are $k$ incoming r-transitions.
Denoting by $P_{f,k}$ the steady-state probability of a state in $E_{f,k}$, the weighted sum of
the probabilities of transitions into $s$ is $[P_{f-1,k-1}](k/D) + [P_{f-1,k}]((D-k)/D) + [P_{f,k}](k/D)$.
After substituting for each $P_{f,k}$ and simplifying, the sum reduces to
$\alpha(f+1)!\,(D-k-1)!/((f-k+1)!\,(D-2)!)$, as claimed.
Case 4: $s \in E_{f,f+1}$. We show that
$$\pi(s) = \alpha\,\frac{(f+1)!\,(D-f-2)!}{(D-2)!}.$$
Figure 5(f) shows an edge state $s$ with $k = f + 1$. The incoming transitions are very
similar to Case 3 with $k = f + 1$, Fig. 5(e), except that there are no transitions into $s$ from
states in $E_{f-1,f+1}$, since all states in $E_{f-1,f+1}$ are unreachable by Lemma 5.1.
Thus $s$ has $f + 1$ incoming d-transitions from states in $E_{f-1,f}$, and $f + 1$ r-transitions.
The weighted sum of all incoming probabilities is therefore
$[P_{f-1,f}]\frac{1}{D}(f+1) + [P_{f,f+1}]\frac{1}{D}(f+1)$. After simplifying and rearranging terms, the sum reduces to
$\alpha(f+1)!\,(D-f-2)!/(D-2)!$. □
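The balance equations of the proof can also be verified mechanically for small chains. The sketch below is illustrative and rests on stated assumptions: the relative probabilities are $D-1$ for interior states and $(f+1)!\,(D-k-1)!/((f-k+1)!\,(D-2)!)$ for states in $E_{f,k}$ (taking $\alpha = 1$), every depletion has probability $1/D$, and the transition rules are the ones used in the proofs of this section (d-transition, f-transition when at least $D-1$ blocks are free, otherwise a state-preserving r-transition). It requires $D \ge 2$ and $C \ge 2D-1$; the function names are ours.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def successors(s, C):
    """Next state for the depletion of each of the D parts of s."""
    D, free = len(s), C - sum(s)
    for i in range(D):
        if s[i] > 1:                       # d-transition
            yield s[:i] + (s[i] - 1,) + s[i + 1:]
        elif free >= D - 1:                # f-transition
            yield tuple(1 if j == i else s[j] + 1 for j in range(D))
        else:                              # r-transition (self-loop)
            yield s


def relative_probability(s, C):
    """Relative steady-state probability of s with alpha = 1."""
    D, f, k = len(s), C - sum(s), s.count(1)
    if f >= D - 1:                         # interior state
        return Fraction(D - 1)
    return Fraction(factorial(f + 1) * factorial(D - k - 1),
                    factorial(f - k + 1) * factorial(D - 2))


def check_balance(D, C):
    """True if the weighted inflow into every valid state equals its probability."""
    valid = [s for s in product(range(1, C - D + 2), repeat=D)
             if sum(s) <= C and 1 <= s.count(1) <= C - sum(s) + 1]
    inflow = {s: Fraction(0) for s in valid}
    for u in valid:
        for t in successors(u, C):         # each depletion has probability 1/D
            inflow[t] += relative_probability(u, C) / D
    return all(inflow[s] == relative_probability(s, C) for s in valid)


if __name__ == "__main__":
    print(check_balance(3, 5), check_balance(4, 8))
```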
We now determine the average I/O parallelism for the deterministic model, by
evaluating the formula in Lemma 4.5. Recall that $\pi(s)$ denotes the probability of state
$s$. We first compute two sums which are then used in Theorem 5.11 to determine the
average I/O parallelism.
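Lemma 5.9.
$$\sum_{t \in I} \pi(t)\,\gamma(t) = \alpha D(D-1)\binom{C-D}{D-1},$$
in which $\alpha$ is the normalization constant.
Proof. For an interior state $t$, $\pi(t) = \alpha(D-1)$, by Theorem 5.8. By Theorem 5.4 there is
a one-to-one correspondence between $I$ and $\mathcal{S}_D(C-D+1)$. Hence, by Lemma 3.15,
$$\sum_{t \in I} \pi(t)\,\gamma(t) = \alpha(D-1) \sum_{t \in \mathcal{S}_D(C-D+1)} \gamma(t) = \alpha(D-1)\,D\binom{C-D}{D-1}. \;\square$$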
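Lemma 5.10. If $C \ge 2D - 1$, so that f-transitions make sense,
$$\sum_{s \in E} \pi(s)\,\gamma(s) = \alpha D(D-1)\binom{C-D}{D-1}\Big[(1-D) + (C-D+1)\big(H(C-D) - H(C-2D+1)\big)\Big],$$
in which $\alpha$ is the normalization constant and $H(n) = \sum_{i=1}^{n} (1/i)$ is the $n$th harmonic
number.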
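Proof. By Theorem 5.8, if $s \in E_{f,k}$ then $\pi(s) = \alpha(f+1)!\,(D-k-1)!/((f-k+1)!\,(D-2)!)$.
By definition, $\gamma(s) = k$. The number of states in $E_{f,k}$ is given by Lemma 3.8, with
$n = C - f$ and $m = D$. Also, by Theorem 5.4, the range of values of $k$ for a given $f$ satisfies
$1 \le k \le f + 1$, and the range of $f$ is $0 \le f \le D - 2$:
$$\sum_{s \in E} \pi(s)\,\gamma(s) = \sum_{f=0}^{D-2} \sum_{k=1}^{f+1} k\,\alpha\,\frac{(f+1)!\,(D-k-1)!}{(f-k+1)!\,(D-2)!}\binom{D}{k}\binom{C-f-D-1}{D-k-1}. \tag{10}$$
Note that $C - f - D - 1 \ge 0$ since $C \ge 2D - 1$. Equation (10) can be simplified by
rearranging and combining terms, resulting in
$$\alpha D(D-1) \sum_{f=0}^{D-2} \left[(f+1) \sum_{k=1}^{f+1} \frac{1}{D-k}\binom{f}{k-1}\binom{C-f-D-1}{D-k-1}\right]. \tag{11}$$
The $(D-k)$ term in (11) can be combined with $\binom{C-f-D-1}{D-k-1}$ by noting that
$[1/(y+1)]\binom{x}{y} = [1/(x+1)]\binom{x+1}{y+1}$, resulting in
$$\alpha D(D-1) \sum_{f=0}^{D-2} \left[\frac{f+1}{C-f-D} \sum_{k=1}^{f+1} \binom{f}{k-1}\binom{C-f-D}{D-k}\right]. \tag{12}$$
The inner sum in (12) can be simplified by using Lemma 3.2, with $m = -1$, $p = f$,
$q = C - f - D$, $n = D$. The resulting quantity is
$$\alpha D(D-1) \sum_{f=0}^{D-2} \left[\frac{f+1}{C-f-D} \left[\binom{C-D}{D-1} - \sum_{k=f+2}^{D} \binom{f}{k-1}\binom{C-f-D}{D-k}\right]\right]. \tag{13}$$
The second sum in (13) has lower index starting at $k = f + 2$, so that $f + 1 \le k - 1 \le D - 1$
and consequently $k - 1 > f$. Hence, for all choices of $f$, the first factor
inside the sum, $\binom{f}{k-1}$, is 0. The second sum is therefore 0 and the resulting expression is
$$\alpha D(D-1)\binom{C-D}{D-1} \sum_{f=0}^{D-2} \frac{f+1}{C-f-D}. \tag{14}$$
The goal is to express the terms in (14) in terms of $H(n) = \sum_{i=1}^{n} (1/i)$ because there are
excellent asymptotic approximations to the harmonic numbers [7].
To simplify (14), the sum will be expanded and rewritten by subtracting 1 and
adding 1 to each term; for clarity, let $x = C - D$. The resulting expression is
$$\sum_{f=0}^{D-2} \frac{f+1}{x-f} = \sum_{f=0}^{D-2} \left[\frac{x+1}{x-f} - 1\right] = (1-D) + (x+1)\big[H(x) - H(x-D+1)\big]. \tag{15}$$
Substituting $x = C - D$ into (15) results in
$$\sum_{f=0}^{D-2} \frac{f+1}{C-D-f} = (1-D) + (C-D+1)\big[H(C-D) - H(C-2D+1)\big]. \tag{16}$$
To complete the proof substitute (16) into (14). □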
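The harmonic-number step is easy to confirm with exact arithmetic. The following sketch is an illustration only: it checks the identity $\sum_{f=0}^{D-2} (f+1)/(x-f) = (1-D) + (x+1)[H(x) - H(x-D+1)]$ for small values, assuming $x \ge D-1$ so that no denominator vanishes. The helper names `harmonic`, `direct_sum` and `harmonic_form` are ours.

```python
from fractions import Fraction


def harmonic(n: int) -> Fraction:
    """n-th harmonic number H(n), with H(0) = 0."""
    return sum((Fraction(1, i) for i in range(1, n + 1)), Fraction(0))


def direct_sum(D: int, x: int) -> Fraction:
    """Left-hand side: sum of (f+1)/(x-f) for f = 0, ..., D-2."""
    return sum((Fraction(f + 1, x - f) for f in range(D - 1)), Fraction(0))


def harmonic_form(D: int, x: int) -> Fraction:
    """Right-hand side: (1-D) + (x+1)[H(x) - H(x-D+1)]."""
    return (1 - D) + (x + 1) * (harmonic(x) - harmonic(x - D + 1))


if __name__ == "__main__":
    for D, x in [(2, 1), (3, 2), (5, 12)]:
        print(D, x, direct_sum(D, x), harmonic_form(D, x))
```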
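Theorem 5.11. For the deterministic prefetching model, the average number of blocks
fetched on an I/O operation in the steady state is
$$\frac{1 + (C-D+1)\big[H(C-D) - H(C-2D+1)\big]}{2 - D + (C-D+1)\big[H(C-D) - H(C-2D+1)\big]},$$
in which $H(n) = \sum_{i=1}^{n} (1/i)$ and we assume $C \ge 2D - 1$. If $C < 2D - 1$, no prefetching is
done and the average parallelism is 1.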
Proof. The formula derived in Lemma 4.5 is directly applied. Note that $n(t) = D$ if $t \in I$,
and $n(t) = 1$ if $t \in E$. The sum over all the states can be divided into a sum over the
interior states and a sum over the edge states, resulting in the following expression for
the numerator:
$$D \sum_{t \in I} \gamma(t)\,\pi(t) + \sum_{t \in E} \gamma(t)\,\pi(t).$$
The expression for the denominator is
$$\sum_{s \in I} \gamma(s)\,\pi(s) + \sum_{s \in E} \gamma(s)\,\pi(s).$$
The individual sums over the interior and edge states are derived in Lemmas 5.9 and
5.10, respectively. By substituting for the individual sums and simplifying, the theorem
is proven. □
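As a sanity check on the final substitution, the closed form can be compared with a direct evaluation of the Lemma 4.5 ratio over all reachable states. The sketch below is hedged: it uses the form $1 + (D-1)/(2 - D + (C-D+1)[H(C-D) - H(C-2D+1)])$ quoted in Section 6, the relative probabilities of Theorem 5.8 with $\alpha = 1$, and assumes that the reachable states are the vectors with $1 \le \gamma(s) \le \delta(s)+1$. It requires $D \ge 2$; the function names are ours.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def harmonic(n: int) -> Fraction:
    """n-th harmonic number H(n), with H(0) = 0."""
    return sum((Fraction(1, i) for i in range(1, n + 1)), Fraction(0))


def closed_form(D: int, C: int) -> Fraction:
    """Average number of blocks per I/O for the deterministic model."""
    if C < 2 * D - 1:
        return Fraction(1)                      # no prefetching is possible
    x = (C - D + 1) * (harmonic(C - D) - harmonic(C - 2 * D + 1))
    return 1 + Fraction(D - 1) / (2 - D + x)


def relative_probability(s, C) -> Fraction:
    """Relative steady-state probability (Theorem 5.8 with alpha = 1)."""
    D, f, k = len(s), C - sum(s), s.count(1)
    if f >= D - 1:                              # interior state
        return Fraction(D - 1)
    return Fraction(factorial(f + 1) * factorial(D - k - 1),
                    factorial(f - k + 1) * factorial(D - 2))


def lemma_4_5_average(D: int, C: int) -> Fraction:
    """Ratio of Lemma 4.5 summed over the reachable states (C >= 2D - 1)."""
    num = den = Fraction(0)
    for s in product(range(1, C - D + 2), repeat=D):
        f, k = C - sum(s), s.count(1)
        if f < 0 or k < 1 or k > f + 1:         # not a reachable state
            continue
        weight = k * relative_probability(s, C)
        blocks = D if f >= D - 1 else 1         # full prefetch or demand fetch
        num += weight * blocks
        den += weight
    return num / den


if __name__ == "__main__":
    for D, C in [(2, 3), (3, 5), (3, 6), (4, 9)]:
        print(D, C, closed_form(D, C), lemma_4_5_average(D, C))
```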
6. Simulation and comparison of the models 
In this section we simplify the formulae for the average parallelism derived previously
to obtain the asymptotic performance (for large $D$) of the two strategies. We
also plot a comparison for some small values of $D$, and compare their performance
using simulation.
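Random strategy: Rearranging the expression in Theorem 4.6, the average parallelism
is given by
$$\frac{\binom{C}{D}}{\binom{C-1}{D-1}}\left[1 - \frac{\binom{C-D}{D}}{\binom{C}{D}}\right].$$
Note that
$$\frac{\binom{C}{D}}{\binom{C-1}{D-1}} = \frac{C}{D}.$$
Case 1: $C = \alpha D$, for some fixed constant $\alpha > 2$. Then
$$\frac{\binom{C-D}{D}}{\binom{C}{D}} = \prod_{i=0}^{D-1} \frac{C-D-i}{C-i} \le \left(\frac{\alpha-1}{\alpha}\right)^{D}.$$
Hence, the average parallelism is at least
$$\alpha\left(1 - \left(\frac{\alpha-1}{\alpha}\right)^{D}\right) \to \alpha \quad \text{as } D \to \infty.$$
Case 2: $C = \alpha D^2$ for some fixed constant $\alpha$. Then
$$\frac{\binom{C-D}{D}}{\binom{C}{D}} \approx \left(1 - \frac{1}{\alpha D}\right)^{D}.$$
Noting that $(1 - 1/(\alpha D))^{D} \to e^{-1/\alpha}$ as $D \to \infty$, the average parallelism in the limit is
given by
$$\alpha\left(1 - e^{-1/\alpha}\right) D.$$
Case 3: $C = \alpha D^{1+\beta}$, for some fixed constant $0 < \beta < 1$. Then, a similar derivation
can be used to show that the average parallelism for large $D$ is approximately $\alpha D^{\beta}$.
Case 4: $C = D^{2+\beta}$, for some fixed constant $\beta > 0$. It is easy to show that the average
parallelism is $D - o(D)$.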
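The Case 2 limit can be compared with the exact value for moderate $D$. The sketch below is illustrative: it assumes the exact average parallelism of the randomized model is $[\binom{C}{D} - \binom{C-D}{D}]/\binom{C-1}{D-1}$ (our reading of Theorem 4.6) and compares it with $\alpha(1 - e^{-1/\alpha})D$ for $C = \alpha D^2$. Function names are ours.

```python
from fractions import Fraction
from math import comb, exp


def randomized_average(D: int, C: int) -> float:
    """Exact average parallelism of the randomized model (assumed closed form)."""
    return float(Fraction(comb(C, D) - comb(C - D, D), comb(C - 1, D - 1)))


def case2_limit(D: int, alpha: float) -> float:
    """Limiting value alpha * (1 - exp(-1/alpha)) * D for C = alpha * D**2."""
    return alpha * (1.0 - exp(-1.0 / alpha)) * D


if __name__ == "__main__":
    for alpha in (1, 2):
        for D in (10, 50, 100):
            C = alpha * D * D
            exact = randomized_average(D, C)
            limit = case2_limit(D, alpha)
            print(f"alpha={alpha} D={D} exact={exact:.3f} limit={limit:.3f}")
```

Because each factor $(C-D-i)/(C-i)$ is at most $1 - D/C$, the exact value lies above the limiting expression for every finite $D$, and the gap shrinks as $D$ grows.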
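Deterministic strategy: Rewriting the expression in Theorem 5.11, the average parallelism is
$$1 + \frac{D-1}{2 - D + (C-D+1)\big[H(C-D) - H(C-2D+1)\big]}.$$
In the following we approximate the term $[H(C-D) - H(C-2D+1)]$ using the
fact that $H(n) = \ln n + c + O(1/n)$ for a constant $c$ that cancels in the difference, and
$\ln(1+z) = z - \frac{1}{2}z^2 + O(z^3)$ [7].
Case 1: $C = \alpha D$, for some fixed constant $\alpha > 2$. Then
$$H(C-D) - H(C-2D+1) \approx \ln\left(1 + \frac{D-1}{\alpha D - 2D + 1}\right).$$
Now
$$\ln\left(1 + \frac{D-1}{\alpha D - 2D + 1}\right) \approx \frac{D-1}{\alpha D - 2D + 1} - \frac{D^2 - 2D + 1}{2\left((\alpha-2)^2 D^2 + 2(\alpha-2)D + 1\right)}.$$
Substituting this approximation into the expression above, the average parallelism for large $D$ is
approximately $1 + 2\alpha$. This approximation is
very good for $\alpha > 7$ or 8. For smaller $\alpha$, one can use more terms in the expansion for
$\ln(1+z)$ to get a more precise bound.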
[Plot omitted. Horizontal axis: cache size (blocks), 0 to 1000; simulated and predicted curves.]
Fig. 6. Simulated and predicted average parallelism as a function of the cache size C for the randomized
model. For D = 5, 12 500 blocks of data are used, and for D = 10, 25 000 blocks of data are used.
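Case 2: $C = \alpha D^2$, for some fixed constant $\alpha$. Then
$$\ln\left(1 + \frac{D-1}{\alpha D^2 - 2D + 1}\right) \approx \frac{D-1}{\alpha D^2 - 2D + 1} - \frac{D^2 - 2D + 1}{2(\alpha D^2 - 2D + 1)^2}.$$
Using the same method as above to approximate the harmonic difference and
simplifying, we get
$$(C-D+1)\big[H(C-D) - H(C-2D+1)\big] \approx D - 1 + \frac{1}{2\alpha}.$$
Thus, the average parallelism for large $D$ is approximately
$$1 + \frac{D-1}{1 + 1/(2\alpha)} \approx \frac{D}{1 + 1/(2\alpha)}.$$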
[Plot omitted. Title: Predicted and Simulated Average I/O Parallelism (Deterministic Model). Horizontal axis: cache size (blocks), 0 to 1000. Curves: simulated and predicted, for D = 5 and D = 10.]
Fig. 7. Simulated and predicted average parallelism as a function of the cache size C for the deterministic
model. For D = 5, 12 500 blocks of data are used, and for D = 10, 25 000 blocks of data are used.
[Plot omitted. Horizontal axis: number of disks.]
Fig. 8. Predicted average parallelism as a function of D for the randomized model.
Case 3: $C = \alpha D^{1+\beta}$ for some fixed constant $0 < \beta < 1$. Working as before, one can
show that the average parallelism for large $D$ is approximately $2\alpha D^{\beta}$.
Case 4: $C = D^{2+\beta}$ for some fixed constant $\beta > 0$. The average parallelism can be
shown to be $D - o(D)$.
Comparing the expressions for the randomized and deterministic cases, the asymptotic
parallelism of the deterministic strategy for large $D$ is generally twice that of the randomized
strategy for moderate cache sizes, and is still noticeably better when $C = O(D^2)$.
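The two closed forms can be evaluated side by side for concrete parameters. The sketch below is only an illustration under stated assumptions: the randomized value is taken as $[\binom{C}{D} - \binom{C-D}{D}]/\binom{C-1}{D-1}$ and the deterministic value as $1 + (D-1)/(2 - D + (C-D+1)[H(C-D) - H(C-2D+1)])$, with the harmonic difference summed directly in floating point. Function names and the sample parameters are ours.

```python
from math import comb


def randomized_average(D: int, C: int) -> float:
    """Average parallelism of the randomized model (assumed closed form)."""
    return (comb(C, D) - comb(C - D, D)) / comb(C - 1, D - 1)


def deterministic_average(D: int, C: int) -> float:
    """Average parallelism of the deterministic model (form used in Section 6)."""
    if C < 2 * D - 1:
        return 1.0                                   # no prefetching is done
    # H(C-D) - H(C-2D+1) is the sum of 1/i for i = C-2D+2, ..., C-D.
    diff = sum(1.0 / i for i in range(C - 2 * D + 2, C - D + 1))
    return 1.0 + (D - 1) / (2 - D + (C - D + 1) * diff)


if __name__ == "__main__":
    for D, C in [(3, 5), (20, 160), (50, 2500), (10, 1000)]:
        r, d = randomized_average(D, C), deterministic_average(D, C)
        print(f"D={D:3d} C={C:5d} randomized={r:7.3f} deterministic={d:7.3f}")
```

For very small caches the ordering can differ from the asymptotic picture (for example D = 3, C = 5 in the list above), which is consistent with the comparison being stated for large $D$.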
A simulator was developed to execute the program model described in Fig. 1. For 
a given value of D, the average parallelism was evaluated for various values of C. Each 
experiment involved 30 trials, and the results of these trials were averaged to produce 
a plot of the average parallelism. 
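The simulator itself is not reproduced here. As a rough stand-in, the sketch below simulates the Markov chains directly rather than the merge program of Fig. 1, so it is an assumption-laden illustration and not the authors' simulator. It assumes that each step depletes a uniformly random run, that a full prefetch occurs when at least $D-1$ blocks are free, and that otherwise the randomized model prefetches into randomly chosen other runs while the deterministic model fetches only the demand block. The function name, seed handling and step counts are ours.

```python
import random


def simulate(D: int, C: int, steps: int, randomized: bool, seed: int = 0) -> float:
    """Average number of blocks fetched per I/O operation over a random walk."""
    rng = random.Random(seed)
    s = [1] * D                       # blocks of each run held in the cache
    free = C - D                      # free cache blocks
    ios = blocks = 0
    for _ in range(steps):
        i = rng.randrange(D)          # run whose leading block is depleted
        if s[i] > 1:                  # d-transition: no I/O is needed
            s[i] -= 1
            free += 1
            continue
        ios += 1                      # run i is empty, so an I/O must occur
        if free >= D - 1:             # f-transition: one block from every run
            for j in range(D):
                if j != i:
                    s[j] += 1
            free -= D - 1
            blocks += D
        elif randomized:              # p-transition: fill all free blocks
            others = [j for j in range(D) if j != i]
            for j in rng.sample(others, free):
                s[j] += 1
            blocks += free + 1
            free = 0
        else:                         # demand fetch only, state unchanged
            blocks += 1
    return blocks / ios


if __name__ == "__main__":
    for randomized in (False, True):
        print(randomized, simulate(3, 6, 400_000, randomized, seed=1))
```

For D = 3 and C = 6 the estimates can be compared with the closed-form values 13/7 (deterministic) and 19/10 (randomized) obtained from the formulas above.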
Figures 6 and 7 show the simulation results with 12 500 blocks of data for D = 5 and 
25 000 blocks of data for D = 10. The steady-state analysis is a good predictor of the 
average I/O parallelism even with nominal-sized data. Figure 8 shows the average
parallelism predicted by Theorem 4.6 as a function of both C and D. It is interesting to
note that for small cache sizes the average I/O parallelism drops as the number of 
disks increases past a limit. As the cache size increases, the parallelism approaches the 
number of disks. The deterministic model shows a similar trend as well. 
7. Conclusions 
Choosing a prefetching strategy that maximizes I/O performance is, in general, 
a difficult task [lo]. In this study, we investigated the performance of two prefetching 
strategies that can be used with external mergesort. To compare the cache require- 
ments of the two strategies, a Markov model was developed for each strategy. 
Closed-form expressions were derived for the average I/O parallelism in each model, 
and the results were confirmed by simulation. For block-random data high disk 
concurrency can be obtained with a reasonable amount of disk cache using either of 
the two strategies we studied. 
From a practical viewpoint, the two strategies have very similar performance. 
Intuitively, this is because for a large enough cache, both strategies are usually in 
a state where it is possible to prefetch from all disks. From a theoretical viewpoint, it is 
surprising that a greedy strategy which maximizes the concurrency on every access 
actually performs worse than the deterministic strategy. Intuitively, this happens 
because the deterministic strategy is better able to stay away from the boundary states 
where the cache is full by leaving some cache slots empty sometimes. As a result, the 
deterministic strategy can prefetch from all D disks more often than the greedy 
strategy, and those extra full prefetches more than make up for the partial prefetches 
that the greedy strategy does near the boundary. 
Acknowledgment 
We thank the referees for their insightful comments. 
References 
[1] A. Aggarwal and J.S. Vitter, The input/output complexity of sorting and related problems, Comm.
ACM 31(9) (1988) 1116-1127.
[2] R. Agrawal, S. Dar and H.V. Jagadish, Direct efficient transitive closure algorithms: design and
performance evaluation, ACM Trans. Database Systems 15(3) (1990) 427-458.
[3] G.E. Andrews, The Theory of Partitions (Addison-Wesley, Reading, MA, 1976).
[4] T. Cormen, Fast permuting on disk arrays, J. Parallel Distributed Computing 17 (1993) 41-57.
[5] G.A. Gibson, Performance and reliability in redundant arrays of inexpensive disks, in: Proc. Internat.
Conf. on Management and Performance Evaluation of Computer Systems (CMG '89) (1989) 381-391.
[6] G.A. Gibson, L. Hellerstein, R.M. Karp, R.H. Katz and D.A. Patterson, Coding techniques for large
disk arrays, in: Proc. 3rd Internat. Conf. on Architectural Support for Programming Languages and
Operating Systems (ASPLOS III) (1989) 123-132.
[7] R.L. Graham, D.E. Knuth and O. Patashnik, Concrete Mathematics (Addison-Wesley, Reading, MA,
1989).
[8] J.-W. Hong and H.T. Kung, I/O complexity: the red-blue pebble game, in: Proc. 13th ACM Symp. on
Theory of Computing (1981) 326-333.
[9] M.Y. Kim, Synchronized disk interleaving, IEEE Trans. Comput. C-35(11) (1986) 978-988.
[10] D.F. Kotz and C.S. Ellis, Prefetching in file systems for MIMD multiprocessors, IEEE Trans. Parallel
Distributed Computing 1(2) (1990) 218-230.
[11] S.C. Kwan and J.L. Baer, The I/O performance of multiway mergesort and tag sort, IEEE Trans.
Comput. 34(4) (1985) 383-387.
[12] M. Livny, S. Khoshafian and H. Boral, Multi-disk management algorithms, in: Proc. ACM Sigmetrics
Conf. on Measurement and Modeling of Computer Systems (1987) 69-77.
[13] M.K. Molloy, Fundamentals of Performance Modeling (Macmillan, New York, NY, 1989).
[14] M.H. Nodine and J.S. Vitter, Optimal deterministic sorting in large-scale parallel memories, in: Proc.
1993 ACM Symp. on Parallel Algorithms and Architectures (1993).
[15] V.S. Pai and P.J. Varman, Prefetching with multiple disks for external mergesort: simulation and
analysis, in: 8th Internat. Conf. of Database Engineering (1992) 273-282.
[16] D.A. Patterson, G. Gibson and R.H. Katz, A case for redundant arrays of inexpensive disks (RAID),
in: Proc. ACM SIGMOD Internat. Conf. on Management of Data (1988) 109-116.
[17] A.L.N. Reddy and P. Banerjee, An evaluation of multiple-disk I/O systems, IEEE Trans. Comput.
38(12) (1989) 1680-1690.
[18] A.L.N. Reddy and P. Banerjee, Design, analysis, and simulation of I/O architectures for hypercube
multiprocessors, IEEE Trans. Parallel Distributed Computing 1(2) (1990) 140-151.
[19] K. Salem and H. Garcia-Molina, Disk striping, in: Proc. 2nd IEEE Internat. Conf. on Data Engineering
(1986) 336-342.
[20] B. Salzberg, Merging sorted runs using large main memory, Acta Informatica 27(3) (1989) 195-215.
[21] J.D. Ullman and M. Yannakakis, The input/output complexity of transitive closure, in: Proc. ACM
SIGMOD Internat. Conf. on Management of Data (1990) 44-53.
[22] J.S. Vitter and M.H. Nodine, Large-scale sorting in uniform memory hierarchies, J. Parallel Distributed
Computing 17 (1993) 107-114.
[23] J.S. Vitter and E.A.M. Shriver, Optimal disk I/O with parallel block transfer, in: Proc. 22nd ACM
Symp. on Theory of Computing (1990) 159-169.
[24] L.Q. Zheng and P.-A. Larson, Speeding up external mergesort, Tech. Report CS-92-40, Dept. of
Computer Science, Univ. of Waterloo, August 1992.
