Abstract. We present several efficient algorithms for sorting on the uniform memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its parallelization P-UMH. We give optimal and nearly optimal algorithms for a wide range of bandwidth degradations, including a parsimonious algorithm for constant bandwidth. We also develop optimal sorting algorithms for all bandwidths for other versions of UMH and P-UMH, including natural restrictions we introduce called RUMH and P-RUMH, which more closely correspond to current programming languages.
1 Introduction
In many large-scale computer systems, memory progresses from very small but very fast registers to successively larger but slower components, such as several layers of cache, primary memory, disks, and archival storage. In order to achieve optimal performance on such a computer, it is often necessary for the algorithm designer to take into account the physical characteristics of the memory hierarchy. Unfortunately, there are too many possible variables to consider (e.g., the block size of each level, the number of blocks at each level, the bandwidth between one level and the next) to allow the design of general algorithms; hence some degree of abstraction of the memory hierarchy is required.
Several interesting and elegant hierarchical memory models have been proposed recently to model the many levels of memory typically found in large-scale computer systems. The HMM model of Aggarwal, Alpern, Chandra, and Snir [AAC] allows access to an individual location $x$ in time $f(x)$. The BT model of Aggarwal, Chandra, and Snir [ACSa] represents a notion of block transfer applied to HMM; in the BT model, access to the $t+1$ records at locations $x-t, x-t+1, \ldots, x$ takes time $f(x) + t$. Typical access cost functions are $f(x) = \log x$ and $f(x) = x^{\alpha}$, for some $\alpha > 0$.
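The two cost rules can be compared with a small sketch. The function names are hypothetical, and base-2 logarithms are an assumption (the text writes only "log"); the sketch illustrates the cost definitions and is not code from any of the cited papers.

```python
import math


def log_cost(x):
    """Access cost f(x) = log x (base 2 assumed here)."""
    return math.log2(x)


def power_cost(alpha):
    """Return the access cost function f(x) = x**alpha for some alpha > 0."""
    return lambda x: x ** alpha


def hmm_access_cost(x, f):
    """HMM: touching the single location x costs f(x)."""
    return f(x)


def bt_access_cost(x, t, f):
    """BT: touching the t+1 locations x-t, ..., x costs f(x) + t."""
    if t < 0 or t >= x:
        raise ValueError("need 0 <= t < x so that all locations are >= 1")
    return f(x) + t


def hmm_scan_cost(x, t, f):
    """Cost of touching the same t+1 locations one at a time under HMM."""
    return sum(hmm_access_cost(i, f) for i in range(x - t, x + 1))
```

For a 16-record run ending at location 1024 with $f(x) = \log x$, the BT charge is $10 + 15 = 25$, while sixteen separate HMM accesses each cost close to 10, which is why block transfer changes the optimal sorting bounds.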
A model similar to the BT model that allows pipelined access to memory in $O(\log n)$ time was developed independently by Luccio and Pagli [LuP]. Optimal sorting algorithms for each of these models have been developed [AAC, ACSa, LuP].
In this paper we concentrate on a newer hierarchical memory model introduced by Alpern, Carter, and Feig [ACF, ACSb], called the uniform memory hierarchy (UMH), which offers an alternative model of blocked multilevel memories. In the UMH$_{\alpha,\rho,b(\ell)}$ model, for integer constants $\alpha, \rho \ge 2$, the $\ell$th memory level (as illustrated in Figure 1) consists of $\alpha\rho^{\ell}$ blocks, each of size $\rho^{\ell}$; it is connected via buses to levels $\ell-1$ and $\ell+1$. Each individual block on level $\ell$ can be randomly accessed as a unit and transferred to or from level $\ell+1$ at a bandwidth of $b(\ell)$; that is, each block transfer takes time $\rho^{\ell}/b(\ell)$. The CPU resides at level 0.
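The level geometry can be tabulated with a short sketch. It assumes the usual UMH parameterization, in which level $\ell$ holds $\alpha\rho^{\ell}$ blocks of $\rho^{\ell}$ records each; the function names and the use of exact fractions are illustrative choices, not part of the model.

```python
from fractions import Fraction


def blocks_at_level(level, alpha, rho):
    """Number of blocks at the given level: alpha * rho**level."""
    return alpha * rho ** level


def block_size(level, rho):
    """Number of records in one block at the given level: rho**level."""
    return rho ** level


def level_capacity(level, alpha, rho):
    """Total records held at the level: alpha * rho**(2 * level)."""
    return blocks_at_level(level, alpha, rho) * block_size(level, rho)


def constant_bandwidth(level):
    """b(l) = 1."""
    return Fraction(1)


def inverse_linear_bandwidth(level):
    """b(l) = 1 / (l + 1)."""
    return Fraction(1, level + 1)


def block_transfer_time(level, rho, bandwidth):
    """Time to move one block between `level` and `level + 1`."""
    return Fraction(block_size(level, rho)) / bandwidth(level)
```

With $\alpha = \rho = 2$, level 3 holds 128 records in blocks of 8; moving one such block to level 4 takes 8 time units under $b(\ell) = 1$ and 32 under $b(\ell) = 1/(\ell+1)$.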
A model for parallel hierarchies was introduced by Vitter and Shriver, in which $P$ hierarchies are connected at their base level via an interconnection network (as shown in Figure 2). Communication between the $P$ hierarchies takes place at the base memory level (call it level 0), which consists of location 1 from each of the $P$ hierarchies. The $P$ base memory level locations are interconnected via a network such as the hypercube or cube-connected cycles so that the $P$ records in the base memory level can be sorted in $O(\log P)$ time (perhaps via a randomized algorithm [ReV]). Vitter and Shriver introduced optimal randomized sorting algorithms for P-HMM and P-BT [ViSa]. The algorithms were based on their randomized two-level partitioning technique applied to the optimal single-hierarchy algorithms for HMM and BT developed in [AAC, ACSa].
[Figure caption fragment: level $\ell$ is connected via buses to levels $\ell-1$ and $\ell+1$. Each level-$\ell$ block can be randomly accessed and transferred to level $\ell+1$ at a bandwidth of $b(\ell)$, that is, in $\rho^{\ell}/b(\ell)$ time.]
We can consider parallel UMH hierarchies analogous to P-HMM and P-BT, and we call the resulting model P-UMH. This is fundamentally different from the parallel type of UMH called UPHM mentioned in [ACF]. The initial input of $N$ elements resides at level $s = \lceil$ [...] constant factor must be 1. If the UMH program runs in time $O(T(N))$, it is said to be efficient. A UMH program whose running time is within a constant factor of the best possible for that problem in the UMH model is said to be optimal.
In Section 2 we give optimal and near-optimal sorting algorithms for UMH and P-UMH for a wide range of bandwidth rates $b(\ell)$, and we present a parsimonious schedule for merge sort for the case $b(\ell) = 1$. In Section 3 we also introduce a natural and easy-to-program restriction of UMH, called random-access UMH or RUMH, for which we have optimal upper and lower bounds for all bandwidths and amounts of parallelism, and a sequential model of UMH called SUMH, for which we do the same.
Optimal sorting in $O(N \log N)$ time in UMH is possible only when the bandwidth $b(\ell)$ at level $\ell$ is $\Omega(1/\ell)$, or else the time required just to access the $N$ records will grow faster than $N \log N$. Many buses may be active simultaneously in the UMH model, so conceivably it is possible to sort in $O(N \log N)$ time even with small bandwidth $b(\ell) = 1/(\ell+1)$. Recently other authors announced an efficient UMH sorting algorithm for the case $b(\ell) = 1/(\ell+1)$, based on the optimal two-level distribution sort algorithm of [AgV], but their UMH$_{1/(\ell+1)}$ algorithm turned out to be inefficient, with a running time of order $N \log^{c} N$, for $c$ [...] 3. Whether or not an $O(N \log N)$-time UMH$_{1/(\ell+1)}$ algorithm exists is still open.
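The bandwidth threshold can be checked with a short calculation. It is a sketch under two assumptions that the text implies but does not spell out here: the input sits at a level $s = \Theta(\log N)$, and every record must cross the bus just below that level, in blocks of $\rho^{s-1}$ records, on its way to the CPU. The symbol $T_{\text{access}}$ is introduced only for this illustration.

```latex
\[
  T_{\text{access}} \;\ge\; \frac{N}{\rho^{\,s-1}} \cdot \frac{\rho^{\,s-1}}{b(s-1)}
  \;=\; \frac{N}{b(s-1)} .
\]
\[
  b(\ell) = \frac{1}{\ell+1}
  \;\Longrightarrow\;
  T_{\text{access}} \;\ge\; N s \;=\; \Theta(N \log N),
  \qquad
  b(\ell) = o(1/\ell)
  \;\Longrightarrow\;
  T_{\text{access}} \;=\; \omega(N \log N).
\]
```

So $b(\ell) = 1/(\ell+1)$ is the smallest bandwidth, up to constant factors, for which merely reading the input does not already exceed the $N \log N$ budget.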
In this section we give a near-optimal sorting algorithm for the small bandwidth case $b(\ell) = 1/(\ell+1)$, and optimal sorting algorithms for several other bandwidths. For the special case of constant bandwidth, we present a parsimonious algorithm. Since optimal sorting seems to require nonoblivious UMH programs, the oblivious UMH model of [ACF] must be modified in a reasonable way. In Theorem 1, we assume that the $\ell$th level of the hierarchy can initiate a transfer from the $(\ell+1)$st level without involving the CPU when one of its blocks becomes empty. In the remaining theorems, we assume that the CPU can originate the transfer of a block at level $\ell$ given the address of the block, with suitable delay.
The fastest oblivious algorithm we have found for sorting in UMH$_{1/(\ell+1)}$ is based on a simple schedule of Batcher's bitonic sort [Akl], where each of the $\log^2 N$ parallel time steps is implemented in $O(N \log N)$ time, for an overall running time of $O(N \log^3 N)$.
2.1 Parsimonious sorting in UMH$_1$
Theorem 1 A variant of merge sort can be scheduled in UMH$_1$ parsimoniously, assuming [...] 2 and [...] 6.
Proof: The basic idea is to schedule a systolic binary merge sort in such a way that the CPU is always kept busy, after a small initial delay and with a small final delay for propagating the results back. After the initial delay, the CPU (level 0) reads one element from each of the two lists. At every time step after the initial delay, the CPU writes the smaller element to the output list and then reads the next element from the list that had the smaller element at the previous step. We use a double-buffering scheme so that level $\ell$, for $\ell \ge 1$, contains room for two blocks from each of the two lists being merged. It also has two blocks for the output list. When level $\ell-1$ requests a subblock from one of the lists, and this request causes level $\ell$'s buffer to be emptied, then level $\ell$ requests the next block from level $\ell+1$. In this way, level $\ell$ always has at least one subblock for level $\ell-1$ available on demand. The output blocks fill up at a known rate, so they can be scheduled in advance, again using double-buffering to keep an empty subblock available for writing from level $\ell-1$. At the end of each list, we immediately begin to send a new list down for the next merge. The CPU can keep track of how many elements have been read from each list, so that when one list is finished it knows to copy the rest of the other list to the output. The number of wasted CPU cycles is only $O(\log N) = o(N \log N)$, so the schedule is parsimonious.
Proof: The lower bound for the first case $b(\ell) = 1$ follows from the conventional $\Omega(N \log N)$ serial bound for sorting on a RAM. With $P$ processors, the P-UMH sorting time can be at most $P$ times faster, giving an $\Omega((N/P) \log N)$ lower bound. The other simulations presented in this paper are a bit trickier, since they require that effective use be made of blocking in the UMH simulation, and therefore that the algorithm meet certain constraints.
A sufficient constraint is that operations in the algorithm process all the elements at any level in the hierarchy consecutively. It may be convenient for the algorithm to describe the elements as comprising groups that may be unrelated to the block size for that level, but as long as all the elements are accessed consecutively, the intermediate levels of the hierarchy can act as buffers to allow reblocking to occur as needed without losing efficiency, as long as [...] 3. The algorithms given in [ViSb] all meet this constraint.
For the second case $b(\ell) = 1/(\ell+1)$, the upper bound is related to the P-HMM approach for $f(x) = \log x$ [ViSb]. The P-HMM algorithm needs to be modified to reblock the buckets prior to sorting them recursively, by substituting Step 8 of the P-BT algorithm of [ViSb] into the P-HMM algorithm. The cost of accessing an element at location $x$ in the HMM model is $\log x$; the amortized cost of accessing the same element in the UMH model, when an entire block is brought to the base memory level, is on the order of $\log(x/\alpha)/\log\rho$, which is within a constant factor of the HMM cost.
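The comparison of the two access costs can be checked numerically with a rough sketch. It rests on assumptions that go beyond the text: locations are numbered level by level starting at level 0, level $\ell$ holds $\alpha\rho^{2\ell}$ records, all buses on the path to level 0 are pipelined so the slowest one dominates, and logarithms are base 2. The function names are hypothetical.

```python
import math


def level_of_location(x, alpha, rho):
    """Smallest level l such that levels 0..l together hold at least x records.

    Assumes locations 1, 2, ... are numbered level by level and that
    level l holds alpha * rho**(2 * l) records.
    """
    level = 0
    held = alpha  # level 0 holds alpha * rho**0 records
    while held < x:
        level += 1
        held += alpha * rho ** (2 * level)
    return level


def amortized_umh_cost(x, alpha, rho):
    """Per-record cost of streaming a block from x's level down to level 0.

    With b(l) = 1/(l + 1), the bus between levels l - 1 and l needs l time
    units per record; with pipelined buses this slowest bus dominates.
    Level-0 data is charged one unit.
    """
    return max(level_of_location(x, alpha, rho), 1)
```

For $\alpha = \rho = 2$, location 1000 falls on level 5, so the amortized charge is 5 against an HMM charge of about 10; the ratio stays bounded as $x$ grows, which is what the simulation argument needs.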
The upper bound for the third case $b(\ell) = \rho^{-c\ell}$ makes use of an algorithm based on deterministic two-way merge sort. This algorithm gives rise to a recurrence relation whose solution gives the stated bound.
□
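The CPU's part in a two-way merge sort, as described in the proof of Theorem 1, is an ordinary binary merge that emits one element per step. The sketch below models only that CPU loop and counts its steps; it does not model the buffering between levels or the bus schedule, and the function names are hypothetical.

```python
def cpu_merge(left, right):
    """Merge two sorted lists, emitting one element per CPU step.

    Each step writes the smaller of the two current elements and then
    advances in the list it came from; when one list is exhausted the
    rest of the other is copied, one element per step.
    """
    merged = []
    cpu_steps = 0
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
        cpu_steps += 1
    for tail in (left[i:], right[j:]):
        merged.extend(tail)
        cpu_steps += len(tail)
    return merged, cpu_steps


def bottom_up_merge_sort(items):
    """Sort by repeated pairwise merging; return (sorted list, CPU steps)."""
    runs = [[x] for x in items]
    total_steps = 0
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs) - 1, 2):
            merged, steps = cpu_merge(runs[k], runs[k + 1])
            next_runs.append(merged)
            total_steps += steps
        if len(runs) % 2 == 1:
            next_runs.append(runs[-1])  # odd run out is carried to the next pass
        runs = next_runs
    return (runs[0] if runs else []), total_steps
```

For $N$ a power of two, every pass emits all $N$ elements, so the CPU performs exactly $N \log_2 N$ steps (24 for $N = 8$). A parsimonious schedule is one in which the hierarchy keeps the CPU from idling for more than a lower-order number of additional steps.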
The algorithms are optimal, except for the middle $b(\ell) = 1/(\ell+1)$ case, which is off from the best known lower bound of $\Omega((N/P) \log N)$ by a $\log(\log N/\log P)$ factor.
The UMH model can be difficult to program because many buses can be active simultaneously. An earlier version of [ACF] introduced a sequential UMH model, appropriately called SUMH, that allowed at most one bus to be active at a time. However, the SUMH restriction can be regarded as too severe, since it forfeits much of the power of the UMH model. We introduce the following more natural and less severe restriction that fits in nicely with feasible and easy-to-use programming languages: We require that the UMH program correspond exactly to a RAM program in which the RAM instruction set is augmented with a block move command that can move $t$ contiguous memory elements in time $t$, for arbitrary $t$. Each such block transfer can be implemented in UMH by a coordinated series of transfers in which several buses are simultaneously active but cooperating on that single transfer. We call this natural variant of UMH the random-access UMH model, or simply RUMH. The parallel versions of RUMH and SUMH are called P-RUMH and P-SUMH, respectively.
Theorems 3 and 4 give matching upper and lower bounds for sorting in the RUMH and SUMH models and their parallelizations. The structures of the formulas in Theorems 3 and 4 suggest several different relationships between the RUMH and SUMH models on the one hand and the HMM, BT, and two-level models on the other hand (cf. Theorems 5 and 6 in [ViSa]); accordingly the upper and lower bounds combine in an interesting way several techniques from [AAC, ACSa, AgV, ViSa].
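The block move instruction that defines RUMH can be pictured with a minimal cost-counting sketch. The class and method names are hypothetical, and only the block move charge stated in the text (time $t$ for $t$ contiguous elements) is modeled; the cost of ordinary RAM instructions and the dependence on bandwidth are deliberately left out.

```python
class BlockMoveRAM:
    """A flat memory with a block move instruction that is charged time t."""

    def __init__(self, size):
        self.memory = [None] * size
        self.time = 0  # accumulated block-move time only

    def block_move(self, src, dst, t):
        """Copy t contiguous cells starting at src to dst, charging time t."""
        limit = len(self.memory)
        if t < 0 or src < 0 or dst < 0 or src + t > limit or dst + t > limit:
            raise IndexError("block move outside memory")
        # The right-hand slice is copied first, so overlapping ranges are safe.
        self.memory[dst:dst + t] = self.memory[src:src + t]
        self.time += t
```

A program written against such an interface never names individual buses, which is the sense in which RUMH matches ordinary programming languages more closely than the general UMH model.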
Theorem 3 The running times mentioned in Theorem 2 are matching upper and lower bounds for sorting in P-RUMH. The algorithms for nonconstant $P$ for the first two bandwidth cases are randomized.
Proof : The upper bounds all follow directly from the proof of Theorem 2, since all the algorithms given there are P-RUMH algorithms.
The lower bounds for $b(\ell) = 1$ and $b(\ell) = \rho^{-c\ell}$ are the same as, and follow directly from, those for P-UMH. When $b(\ell) = 1/(\ell+1)$, we can prove a tight lower bound by simulating RUMH$_{1/(\ell+1)}$ by HMM with access cost function $f(x) = \log x$. Specifically, any block transfer of $\rho^{\ell-1}$ elements from level $\ell$ to level $\ell'$, where $\ell$ [...] 0, will take an amount of time that is [...], so the simulation by HMM is bounded by a constant times the RUMH$_{1/(\ell+1)}$ running time. Hence, the lower bound for P-HMM for $f(x) = \log x$ given in [ViSb] also holds for P-RUMH$_{1/(\ell+1)}$.
Proof: We prove the lower bounds using an approach similar to that of [ViSb]. Let us define the "sequential time" of a P-SUMH algorithm to be the sum of its time costs for each of the $P$ hierarchies. The sequential time can be at most $P$ times the P-SUMH running time. We superimpose on the P-SUMH model a sequence of one-disk, two-level memories of the type studied in [AgV, ViSb], in the following way: For $1 \le \ell \le \frac{1}{2} \log(N/P)$, the $\ell$th two-level memory has one disk, internal memory size $M_\ell = P$ [...] We get the desired lower bound on the P-SUMH time by substituting the values of $M_\ell$, $B_\ell$, and $C_\ell$ for the three cases into the above summation, and then dividing by $P$. The $b(\ell) = \rho^{-c\ell}$ case in addition requires the use of the conventional $\Omega(N \log N)$ serial bound for sorting.
The upper bounds for the first two cases $b(\ell) = 1$ and $b(\ell) = 1/(\ell+1)$ are achieved by simulating the optimal P-HMM algorithm of [ViSb], for access cost functions $f(x) = \log x$ and $f(x) = \log^2 x$, respectively. Since a UMH in each case can simulate an HMM with the appropriate cost function in a running time that is at most a constant times the HMM time, the P-HMM bound holds for the P-UMH simulation. The upper bound for the $b(\ell) = \rho^{-c\ell}$ case is achieved by the same deterministic merge sort as in the previous theorems.
4 Conclusions
We have given optimal or near-optimal sorting algorithms for UMH and for its parallelization, which we have introduced and called P-UMH. We have derived tight matching upper and lower bounds for sorting in the restricted models RUMH and SUMH and their parallelizations. Some of the algorithms are randomized. The RUMH model is particularly useful because it is easy to visualize and it matches well with current programming languages and compilers. An interesting open problem is whether it is possible to sort in $O(N \log N)$ time in the UMH$_{1/(\ell+1)}$ model. The related FFT computation can be done in UMH$_{1/(\ell+1)}$ in $O(N \log N)$ time. Another open problem is whether a parsimonious oblivious algorithm can be found to replace our nonoblivious one in UMH$_1$, or whether deterministic algorithms can be found to replace the randomized ones.
