Multiple-disk organizations can be used to improve the I/O performance of problems like external merging. Concurrency can be introduced by overlapping I/O requests at different disks and by prefetching additional blocks on each I/O operation. To support this prefetching, a memory cache is required.
Introduction
Advances in processor architecture and integration technology have resulted in steady increases in processor speeds over the past several years. The performance of I/O subsystems, in contrast, has generally not kept pace with these improvements in processor performance. The data rates possible from single disks are limited by physical considerations such as the speed of disk rotation and the rate of head movement, and are unlikely to increase dramatically. As a consequence, there have been a number of recent proposals for the use of multiple disks to form high-performance I/O subsystems [9, 16, 18]. The performance of different multiple-disk systems and of the associated management strategies has been studied in [19, 12, 17, 5, 6, 10], for example. A number of analytic studies of I/O performance for specific computational problems have been undertaken previously in [11, 20, 8, 1, 21, 2, 23, 22, 14, 4]. This paper suggests an analytic framework for the study of I/O parallelism by undertaking a specific case study of prefetching as a means for improving I/O performance in a multiple-disk environment. In particular, we study the tradeoff between the average disk parallelism and the cache size for a single-pass merge of D sorted runs using D concurrent disk units storing the input. The system model consists of an infinitely fast CPU, a RAM-based disk cache with a capacity of C blocks, and D disks containing one sorted run each. On a single I/O operation, at most D blocks, one from each disk, can be read and transferred to the cache concurrently.
The D-way merge algorithm operates as follows. A block from each run is brought into memory, and the records from each block are extracted and merged together in sorted order. When one of the input blocks is depleted, an I/O request is made for the next block from the run whose block was depleted. In a multiple-disk system, the request for this next block can be overlapped with prefetching a block from each of the other disks. These prefetched blocks are held in the cache until they are required by the merge. If the cache is large enough to hold all prefetched blocks, then the total number of I/O accesses will be reduced by a factor of D over the case of a single disk, and a speedup of D in the I/O time will be expected. (The seek-time reduction that arises when the runs are spread over several disks is not considered.)
We use the block random depletion model [11], in which the probability that the next block depleted comes from any particular run is uniformly 1/D. A worst-case scenario of block depletion, in which all the blocks from run 1 are depleted, then all the blocks from run 2 are depleted, and so on, is not interesting from a prefetching point of view. Once the cache gets filled in the worst-case scenario, only the depleted block can be replaced, and there is no opportunity for parallelism.
Therefore, we sought a random model of block depletion that would be useful to study average-case behavior, and we chose a model proposed previously in a related paper.
Ideally, we would like the prefetching strategy to read D blocks on each input operation; however, in practice, the cache can become filled, and it may not have space to store D additional blocks. We would like to know the expected number of blocks fetched on an input operation, which we can think of as the average parallelism obtained by using a machine with D disks instead of 1 disk. In particular, we analyze the relationship between the size of the disk cache and the average parallelism for two different prefetching strategies that can be easily implemented with a standard multi-way merge algorithm.
The motivation for the study is to provide a quantitative understanding of some of the tradeoffs involved when resources like main memory are allocated to tasks in a typical DBMS. We therefore concentrate on analyzing simple enhancements of the standard multiway merge algorithm that is commonly employed in such systems. The speeds of the source of the input data and the destination of the sorted output are assumed to be fast enough that merging is the bottleneck. Such situations arise often in practice, as when the input data is available on multiple disks and the output is fed to a further processing stage. The larger questions of whether algorithms other than the multiway merge may be more suitable for sorting or merging data are not addressed here. With the same perspective, we assume contiguous placement of the sorted runs on the disks, as is done in several commercial DBMSs such as IBM's DB2 and SQL/DS. In contiguous placement, a run is placed on a single disk and occupies consecutive blocks on that disk. In contrast, in block-interleaved placement, consecutive blocks of a run are interleaved across the D disks. Issues related to block interleaving, including the degree of hardware support provided, the possibility of partial striping, and the granularity, are still the subject of active research and are beyond the scope of this paper.
Recent investigations of different methods for multiple-disk sorting include the works of Vitter and Shriver [23] and Nodine and Vitter [22, 14], who have presented new algorithms for external sorting. Vitter and Shriver [23] have shown that a randomized version of distribution sorting can achieve an asymptotically optimal number of I/Os. Vitter and Nodine have discovered an ingenious deterministic, mergesort-like algorithm called Greed Sort. This algorithm achieves optimal asymptotic performance even for small cache sizes, but requires a larger constant in the number of I/Os required in any pass. More recently, Nodine and Vitter [14] have discovered a deterministic version of distribution sort called Balance Sort, which reduces the constants associated with Greed Sort. While quite promising, these algorithms need to be evaluated to determine the ranges over which they are superior to mergesort, especially in real systems. For other works on the I/O complexity of sorting and merging, the reader is referred to the works of Kwan and Baer [11] and Salzberg [20] for single-disk systems, and to [15, 24] for multi-disk systems.
The two prefetching strategies are described in detail in Section 2. The first is a randomized (and greedy) algorithm that attempts to read one block each from as many disks as possible, subject to the availability of free cache space. The second is a deterministic (and conservative) algorithm that reads concurrently only if all disks can be used; otherwise it does not perform any prefetching. In our analysis, we derive a closed-form expression for the average number of blocks read on an I/O operation as a function of the cache size and the number of runs. Surprisingly, the conservative strategy, which sometimes leaves available cache slots empty, attains slightly more parallelism than the greedy strategy for most reasonable choices of the number of disks and cache size. This theoretical result is confirmed by our simulations using nominal-sized data.
The rest of this paper is organized as follows. In Section 2 the two prefetching strategies are described in detail, and Markov models are developed for their analysis. Section 3 contains some mathematical definitions and combinatorial results used in the analysis. The first prefetching strategy is analyzed in Section 4 and the second in Section 5. In Section 6 we present simulation results which validate the analysis, and we conclude with Section 7.
System Models
In this section we describe two prefetching strategies that can be used to improve the I/O performance in implementing the multiway merge. For both these strategies we develop Markov models which are analyzed in later sections.
During the merge, whenever a block is depleted from run i, the cache is checked to see if the next block from run i is in the cache. If no block from run i is in the cache, a fetch operation is required.
The block needed will be referred to as the demand-fetch block. To exploit the parallelism possible with the D disks, blocks from other disks can be prefetched along with the demand-fetch block, provided sufficient space is available in the cache to accommodate them.
The two prefetching strategies we analyze will be referred to as the randomized prefetch model and the deterministic prefetch model, respectively. Both models have identical behavior when sufficient cache memory is available to fetch a block from each disk. In this case, D - 1 blocks are prefetched along with the demand-fetch block. The models differ when the number of free cache blocks, F, is less than D - 1. In the randomized model, F of the D - 1 disks are randomly chosen with equal probability; the demand-fetch block is fetched along with one block from each of these F disks. In the deterministic model, no prefetching is performed; only the demand-fetch block is fetched. The deterministic model fetches either D blocks or only 1 block, whereas the randomized model fetches between 1 and D blocks, depending on the amount of cache space available.
An I/O operation that fetches at most D blocks, one from each disk, will be charged unit cost. The D disks can independently fetch blocks from different disk locations, but they will be assumed to initiate and complete I/O operations as a group. This assumption relaxes any temporal transients without invalidating the cache behavior being studied. If the full parallelism of the D disks can be used on every I/O operation, the cost of the merge operation will be reduced by a factor of D over the single-disk case. Ideally, a speedup of D can be obtained with a cache of unbounded size. As the cache size is reduced, the actual speedup will be lower, depending on the average I/O parallelism of each I/O operation.
The two prefetching models are described by the pseudocode in Figure 2.1. Initially, the cache is loaded with one block from each run. At the start of any iteration of the loop, at least one block from each run will be present in the cache. A random model of block depletion is assumed [11]. The leading block from each run is a candidate for being depleted next. One of the D leading blocks is chosen with equal probability 1/D, and depleted. If that run still has blocks present in the cache, no I/O operation is required (we call this a d-transition). However, if that run has no more cached blocks, an I/O operation will be required to retrieve the next block from that run before the merge can continue. If there is enough cache space to read a block from each disk, D blocks are fetched (called an f-transition); if not, the action taken is different for the two prefetching models. In the deterministic model only one block from the depleted run is fetched (called an r-transition), while in the randomized model enough blocks to fill up the cache are fetched (called a p-transition).
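The loop described above can be sketched as a small Monte Carlo simulation. The following Python sketch is an illustration under stated assumptions rather than the pseudocode of Figure 2.1: the function name `simulate`, the strategy labels, and the `depletions` and `seed` parameters are hypothetical, runs are assumed never to end, and F is taken to be the number of free cache blocks just before the depletion.

```python
import random


def simulate(D, C, strategy, depletions=100_000, seed=0):
    """Estimate the average number of blocks fetched per I/O operation.

    Assumptions of this sketch (not from the paper): runs are endless,
    and `strategy` is either "randomized" or "deterministic".
    """
    if C < D:
        raise ValueError("the cache must hold at least one block per run")
    if strategy not in ("randomized", "deterministic"):
        raise ValueError("unknown strategy")
    rng = random.Random(seed)
    cached = [1] * D                  # cache blocks currently held for each run
    io_ops = 0
    blocks = 0
    for _ in range(depletions):
        i = rng.randrange(D)          # depleted run, chosen with probability 1/D
        free = C - sum(cached)        # F: free cache blocks before the depletion
        cached[i] -= 1
        if cached[i] > 0:
            continue                  # d-transition: next block already cached
        cached[i] = 1                 # demand fetch reuses the slot just freed
        others = [r for r in range(D) if r != i]
        if free >= D - 1:
            chosen = others           # f-transition: one block from every disk
        elif strategy == "randomized":
            chosen = rng.sample(others, free)   # p-transition: fill the cache
        else:
            chosen = []               # r-transition: demand-fetch block only
        for r in chosen:
            cached[r] += 1
        io_ops += 1
        blocks += 1 + len(chosen)
    return blocks / io_ops
```

With C = D no prefetching is ever possible and the estimate is exactly 1; with a cache far larger than the number of depletions simulated, every I/O operation fetches D blocks.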
The dynamic behavior of both systems can be modeled mathematically by Markov chains. Each state in the model will represent the number of cache blocks allocated to each run at the start of each iteration of the loop. Each execution of the loop causes a state transition.
To describe the states of the system and the transitions, we use the following notation.
Definition 2.
Every state has exactly D possible depletions that initiate a transition to another state. Each depletion is equiprobable with probability 1/D. In most cases, the depletion determines which transition is made, and the transition also has probability 1/D. The only exception is when a p-transition occurs in the randomized model. Most states and transitions look the same in both Markov chains. For example, in both Markov chains, the state [1, 2, 2] has three outgoing transitions, each with probability 1/3. If a block is depleted from the second or third run, a d-transition is made to [1, 1, 2] or [1, 2, 1], respectively. If a block is depleted from the first run, an I/O operation is required, since the first run has no cached blocks. Since [1, 2, 2] has 2 free cache blocks, an f-transition will be made, leading to state [1, 3, 3].
The difference between the chains can be seen by considering the state [3, 2, 1]. In the randomized model [3, 2, 1] has four outgoing transitions, while in the deterministic model it has three. If the leading block from the third run is depleted, the models differ. In the randomized model, two p-transitions are possible, to [4, 2, 1] or to [3, 3, 1]. The choice of p-transition depends on a random decision on whether to use the last free block in the cache for a block from the first run or the second. Each occurs with equal probability (1/2), so the probability of each of the two p-transitions is (1/3)(1/2) = 1/6. In the deterministic model, the depletion of the lone block from the third run forces an r-transition back to [3, 2, 1]; this r-transition has probability 1/3.
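The worked example above can be reproduced mechanically. The Python sketch below enumerates the outgoing transitions of a state under either model; the function name `transitions` and the strategy labels are hypothetical, and a cache size of C = 7 is assumed because it is the value consistent with [1, 2, 2] having 2 free blocks and [3, 2, 1] having one.

```python
from fractions import Fraction
from itertools import combinations


def transitions(state, C, strategy):
    """Return {next_state: probability} for one iteration of the merge loop.

    `state` is a tuple of per-run cache block counts; `strategy` is
    "randomized" or "deterministic" (labels chosen for this sketch).
    """
    D = len(state)
    free = C - sum(state)             # F: free cache blocks before the depletion
    out = {}

    def add(nxt, p):
        out[nxt] = out.get(nxt, Fraction(0)) + p

    for i in range(D):                # each run is depleted with probability 1/D
        p = Fraction(1, D)
        if state[i] > 1:              # d-transition: no I/O needed
            add(state[:i] + (state[i] - 1,) + state[i + 1:], p)
            continue
        others = [r for r in range(D) if r != i]
        if free >= D - 1:             # f-transition: prefetch from all other disks
            subsets = [tuple(others)]
        elif strategy == "randomized":  # p-transition: any F of the other disks
            subsets = list(combinations(others, free))
        else:                         # r-transition: demand-fetch block only
            subsets = [()]
        for chosen in subsets:
            nxt = tuple(n + (r in chosen) for r, n in enumerate(state))
            add(nxt, p / len(subsets))
    return out


for model in ("randomized", "deterministic"):
    print(model, transitions((3, 2, 1), 7, model))
```

Exact rational arithmetic is used so that probabilities such as 1/6 appear as fractions rather than rounded floating-point values.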
Mathematical Preliminaries
In this section, some known combinatorial results about binomial coefficients and integer partitions are summarized. These results are used, in turn, to derive related mathematical formulas that will be applied directly in the analysis of Sections 4 and 5. (The reader may prefer to skim through this section and return to it when the results are used in subsequent sections.) Lemma 3.
A Markov chain is ergodic if it is irreducible, recurrent nonnull, and aperiodic. If a Markov chain is ergodic, there exists a unique limiting distribution for the probability of being in a state k, denoted as π(k), independent of the initial state. These probabilities are called the steady-state or equilibrium probabilities.
To find the steady-state probabilities we can solve the steady-state equation πM = π, in which π is the vector of steady-state probabilities and M is the single-step transition probability matrix.
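For a small chain the steady-state equation can also be solved numerically by repeatedly applying the transition matrix to an initial distribution. The sketch below is only an illustration: the function name `steady_state` is hypothetical, the vector is treated as a row vector, and the three-state matrix is a small instance worked out here from the transition rules of Section 2 with D = 2 and C = 3 (states [1, 1], [1, 2], [2, 1]).

```python
def steady_state(M, start, iterations=200):
    """Approximate the steady-state vector by power iteration.

    M[i][j] is the probability of moving from state i to state j, and
    `start` is any initial probability vector (an assumption of this sketch).
    """
    n = len(M)
    pi = list(start)
    for _ in range(iterations):
        # One step of the chain: the new pi is the old pi times M.
        pi = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
    return pi


# States in order: [1, 1], [1, 2], [2, 1] for D = 2 disks and C = 3 blocks.
M = [
    [0.0, 0.5, 0.5],   # [1, 1]: either depletion triggers a full fetch
    [0.5, 0.5, 0.0],   # [1, 2]: run 1 refetches in place, run 2 is a d-transition
    [0.5, 0.0, 0.5],   # [2, 1]: symmetric to [1, 2]
]
print(steady_state(M, [1.0, 0.0, 0.0]))
```

Power iteration converges here because the chain is ergodic; for larger chains a direct linear solve would be the more usual choice.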
Analysis of the Randomized Prefetching Model
In this section we analyze the randomized prefetching model, with the aim of deriving an expression for the average parallelism of an I/O operation in the steady state. Towards this end, the state space is precisely characterized in Lemma 4.1, and the Markov chain is shown to be ergodic in Lemma 4.3.
Proof: By Lemma 4.1 and Definitions 3.13 and 3.10, the set of states is in one-to-one correspondence with the set D (C). Therefore, the number of states is given by Lemma 3.14. Each state s has a finite non-zero probability of returning to itself, since there is a finite-length, non-empty path from s to itself, and all state transitions have finite, non-zero probabilities.
Therefore, the Markov chain is recurrent nonnull.
To show that the Markov chain is aperiodic, note that there is a path from any state s to itself that passes through a state with no free blocks. In a state with no free blocks, any number of p-transitions can be taken without changing state. As a result, there is a constant k such that for any j > k, there is a path of length j from s to itself.
Since the Markov chain is ergodic, Theorem 3.17 asserts that each state has a unique steady-state probability. Surprisingly, every state has the same probability.
by a p-transition or an f-transition depends on the occurrence of two events. First, the component of s' whose depletion caused the transition must be one of the D - k parts of s which is 1. Secondly, the j blocks that are prefetched must add to those parts i for which s'_i = s_i - 1. Since the next depletion is chosen with equal probability from the D parts, the probability of the first event is (D - k)/D. The probability of the second event given the choice of the depleted component is
Proof: An I/O transition from a state t occurs if and only if a component equal to 1 is depleted. Since t has μ(t) parts that are 1 and each of the D parts is depleted with probability 1/D, the probability of an I/O transition from t is the product of μ(t)/D and the probability of being in state t (i.e. π(t)). Therefore, the conditional probability that an I/O transition is taken from t, given that some I/O transition is taken, is $\pi(t)\,\mu(t) / \sum_s \pi(s)\,\mu(s)$. Since the number of blocks fetched on an I/O transition from t is n(t), the lemma follows.
A closed form in terms of binomial coefficients will be derived for (4); first, a closed form will be derived for the denominator. By Lemma 4.1 and Definitions 3.13 and 3.9, the denominator is the total number of parts equal to 1 in the set D (C). By Lemma 3.15, this quantity is:
To find a closed form for the numerator, the number of parts equal to 1 must be summed over all states with C - j free blocks. By Lemma 3.
Finally, Lemma 3.1 can be applied to simplify each sum in (7), resulting in:
By combining the denominator, (5), and the numerator, (8), the theorem is proved.
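The two ingredients of this analysis, equal steady-state probabilities and the weighted average over I/O transitions, can be checked by brute force for small D and C. The Python sketch below is an illustration under our reading of the randomized model in Section 2; the function names are hypothetical, and the state set is found by searching from the initial state rather than through the characterization of Lemma 4.1. If every state receives a total incoming probability of exactly 1, the uniform distribution satisfies the steady-state equation, and the average parallelism reduces to the sum of (parts equal to 1) times n(t) over all states, divided by the total number of parts equal to 1.

```python
from fractions import Fraction
from itertools import combinations


def randomized_transitions(state, C):
    """{next_state: probability} for the randomized model (as read from Section 2)."""
    D = len(state)
    free = C - sum(state)
    out = {}
    for i in range(D):
        if state[i] > 1:                      # d-transition
            targets = [state[:i] + (state[i] - 1,) + state[i + 1:]]
        else:                                 # f- or p-transition
            others = [r for r in range(D) if r != i]
            k = min(free, D - 1)              # number of prefetched blocks
            targets = [tuple(n + (r in chosen) for r, n in enumerate(state))
                       for chosen in combinations(others, k)]
        for t in targets:
            out[t] = out.get(t, Fraction(0)) + Fraction(1, D * len(targets))
    return out


def reachable_states(D, C):
    """All states reachable from the initial state [1, ..., 1]."""
    start = (1,) * D
    seen, stack = {start}, [start]
    while stack:
        for t in randomized_transitions(stack.pop(), C):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def inflow(D, C):
    """Total incoming transition probability of every reachable state."""
    total = {s: Fraction(0) for s in reachable_states(D, C)}
    for s in total:
        for t, p in randomized_transitions(s, C).items():
            total[t] += p
    return total


def average_parallelism(D, C):
    """Average blocks per I/O, assuming equal steady-state probabilities."""
    num = den = 0
    for t in reachable_states(D, C):
        ones = sum(1 for n in t if n == 1)    # parts equal to 1
        fetched = 1 + min(C - sum(t), D - 1)  # n(t): blocks fetched on an I/O
        num += ones * fetched
        den += ones
    return Fraction(num, den)


print(all(v == 1 for v in inflow(3, 7).values()), average_parallelism(3, 7))
```

The enumeration grows quickly with D and C, so this check is only practical for small instances; the closed form is what makes larger parameters tractable.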
Analysis of the Deterministic Model
In this section the Markov chain for the deterministic prefetching model is analyzed. Once again, an expression will be derived for the average parallelism of an I/O operation in the steady state.
Now assume inductively that the lemma holds for all states w where (w) < (s). As shown above, any state t with a transition to s satisfies (t) (t) + 2, and (t) < (s). Hence, t satisfies the induction hypothesis, and so t is unreachable. Hence, s is unreachable. Therefore, the lemma holds for all w, (w) (s). Since k = (s) (s) + 1 we have (s) k - 1. We also have the following bound on (c).
Inequality (9)
Definition 5.6 Let I denote the set of all interior states and E denote the set of all edge states in the Markov chain. Let $E_{f,k}$, $0 \le f \le D - 2$, $1 \le k \le f + 1$, denote the subset of edge states that have f free cache blocks and k parts equal to 1.
Before we derive the steady-state probabilities of the states in the Markov chain, we establish that they exist and are unique. This can be done by showing that the Markov chain is ergodic.
Lemma 5.7 Each valid state s has a unique steady-state probability.
Proof: Theorem 5.4 defines the set of states in the Markov chain for the deterministic model.
The proof of ergodicity closely parallels the proof of Lemma 4.3.
Let s and t be two arbitrary states. To find a path from s to t, use d-transitions to get from s to the initial state [1, 1, ..., 1], and then use a path from the initial state to t, which exists by Lemma 5.3. Therefore, the Markov chain is irreducible.
Each state s has a finite non-zero probability of returning to itself, since there is a finite-length, non-empty path from s to itself, and all state transitions have finite, non-zero probabilities.
To show that the Markov chain is aperiodic, note that there is a path from a state s to itself that passes through an edge state. In an edge state, any number of r-transitions can be made without changing state. As a result, the Markov chain is aperiodic.
Hence, the Markov chain is ergodic. By Theorem 3.17, every state has a unique steady-state probability.
We derive below the relative steady-state probabilities of the states, rather than the absolute probabilities. Each relative probability must be scaled by a normalization constant to obtain the actual steady-state probability of that state. The constant can be chosen uniquely so that the probabilities of all states sum to 1, but we do not need to compute it explicitly.
Theorem 5.8 The steady-state probability of state s, denoted by π(s), is given by:
in which is the normalization constant.
Proof: The proof will consist of verifying that the probabilities stated in the theorem satisfy the equation πM = π, in which π is the vector of steady-state probabilities in the theorem and M is the single-step transition probability matrix given by the transition rules defined earlier. In particular, the probabilities of all incoming transitions to s, weighted by the steady-state probability of the originating state, will be summed and shown to equal π(s).
The proof consists of four cases depending on the structure of state s.
The goal is to express the terms in (14) in terms of the harmonic numbers $H(n) = \sum_{i=1}^{n} 1/i$, because there are excellent asymptotic approximations to the harmonic numbers [7].
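To illustrate why harmonic numbers are convenient, the sketch below compares the exact sum with the standard expansion H(n) ≈ ln n + γ + 1/(2n), where γ is the Euler-Mascheroni constant. This expansion is a well-known textbook result used here as an assumption; it is not quoted from [7], and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def harmonic(n):
    """Exact definition: H(n) = 1 + 1/2 + ... + 1/n, in floating point."""
    return sum(1.0 / i for i in range(1, n + 1))


def harmonic_approx(n):
    """Leading terms of the standard asymptotic expansion of H(n)."""
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)


for n in (10, 100, 1000):
    print(n, harmonic(n), harmonic_approx(n))
```

The neglected terms are of order 1/n^2, so the approximation is already accurate for moderate n.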
To simplify (14), the sum will be expanded and rewritten by subtracting
Proof: The formula derived in Lemma 4.5 is directly applied. Note that n(t) = D if t ∈ I, and n(t) = 1 if t ∈ E. The sum over all the states can be divided into a sum over the interior states and a sum over the edge states, resulting in the following expression for the numerator:
The expression for the denominator is:
The individual sums over the interior and edge states are derived in Lemmas 5.9 and 5.10 respectively. By substituting for the individual sums and simplifying, the theorem is proven.
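For small D and C the average parallelism of the deterministic model can also be estimated numerically, which gives an independent check on a closed form. The Python sketch below is an illustration under our reading of the deterministic model in Section 2: the function names are hypothetical, the steady state is approximated by power iteration over the states reachable from [1, ..., 1], and each state is weighted by its number of parts equal to 1, as in the formula of Lemma 4.5.

```python
def deterministic_transitions(state, C):
    """{next_state: probability} for the deterministic model (as read from Section 2)."""
    D = len(state)
    free = C - sum(state)
    out = {}
    for i in range(D):
        if state[i] > 1:              # d-transition
            nxt = state[:i] + (state[i] - 1,) + state[i + 1:]
        elif free >= D - 1:           # f-transition: one block from every disk
            nxt = tuple(n if r == i else n + 1 for r, n in enumerate(state))
        else:                         # r-transition: state is unchanged
            nxt = state
        out[nxt] = out.get(nxt, 0.0) + 1.0 / D
    return out


def deterministic_parallelism(D, C, iterations=5000):
    """Approximate average number of blocks fetched per I/O operation."""
    start = (1,) * D
    moves, stack = {}, [start]
    while stack:                      # collect the reachable states
        s = stack.pop()
        if s not in moves:
            moves[s] = deterministic_transitions(s, C)
            stack.extend(moves[s])
    pi = {s: 0.0 for s in moves}
    pi[start] = 1.0
    for _ in range(iterations):       # power iteration towards the steady state
        nxt = {s: 0.0 for s in moves}
        for s, p in pi.items():
            for t, q in moves[s].items():
                nxt[t] += p * q
        pi = nxt
    num = den = 0.0
    for s, p in pi.items():
        ones = sum(1 for n in s if n == 1)         # parts equal to 1
        fetched = D if C - sum(s) >= D - 1 else 1  # n(t): D or 1 blocks
        num += p * ones * fetched
        den += p * ones
    return num / den


print(deterministic_parallelism(3, 7))
```

The number of iterations is a fixed, generous guess rather than a convergence test, which is adequate only for the small chains this sketch is meant for.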
Simulation and Comparison of the Models
In this section we simplify the formulae for the average parallelism derived previously to obtain the asymptotic performance (for large D) of the two strategies. We also plot a comparison for some small values of D, and compare their performance using simulation.
Random Strategy
Rearranging the expression in Theorem 4.6, the average parallelism is given by:
Comparing the expressions for the randomized and deterministic cases, the asymptotic parallelism of the deterministic strategy for large D is generally twice as large as that of the randomized strategy for moderate cache sizes, and is still noticeably better when C = O(D^2).
A simulator was developed to execute the program model.
It is interesting to note that for small cache sizes the average I/O parallelism drops as the number of disks increases past a limit. As the cache size increases, the parallelism approaches the number of disks. The deterministic model shows a similar trend as well.
Conclusions
Choosing a prefetching strategy that maximizes I/O performance is, in general, a difficult task [10]. In this study, we investigated the performance of two prefetching strategies that can be used with external mergesort. To compare the cache requirements of the two strategies, a Markov model was developed for each strategy. Closed-form expressions were derived for the average I/O parallelism in each model, and the results were confirmed by simulation. For block-random data, high disk concurrency can be obtained with a reasonable amount of disk cache using either of the two strategies we studied.
From a practical viewpoint, the two strategies have very similar performance. Intuitively, this is because for a large enough cache, both strategies are usually in a state where it is possible to prefetch from all disks. From a theoretical viewpoint, it is surprising that a greedy strategy which maximizes the concurrency on every access actually performs worse than the deterministic strategy. Intuitively, this happens because the deterministic strategy is better able to stay away from the boundary states where the cache is full by leaving some cache slots empty sometimes. As a result, the deterministic strategy can prefetch from all D disks more often than the greedy strategy, and those extra full prefetches more than make up for the partial prefetches that the greedy strategy does near the boundary. 
