Sorting on Clusters of SMPs by Helman, David R. & JaJa, Joseph
Sorting on Clusters of SMPsDavid R. Helman Joseph JaJaInstitute for Advanced Computer Studies &Department of Electrical Engineering,University of Maryland, College Park, MD 20742fhelman, joseph g@umiacs.umd.eduAugust 28, 1997AbstractClusters of symmetric multiprocessors (SMPs) have emerged as the primary candidates for largescale multiprocessor systems. In this paper, we introduce an ecient sorting algorithm for clustersof SMPs. This algorithm relies on a novel scheme for stably sorting on a single SMP coupled withbalanced regular communication on the cluster. Our SMP algorithm seems to be asymptoticallyfaster than any of the published algorithms we are aware of. The algorithms were implemented inC using Posix Threads and the SIMPLE library of communication primitives and run on a clusterof DEC AlphaServer 2100A systems. Our experimental results verify the scalability and eciency ofour proposed solution and illustrate the importance of considering both memory hierarchy and theoverhead of shifting to multiple nodes.Keywords: Parallel Algorithms, Generalized Sorting, Integer Sorting, Sorting by Regular Sam-pling, Parallel Performance.
Supported in part by NSF grant No. CCR-9627210 and NSF HPCC/GCAG grant No. BIR-9318183.1
1 IntroductionClusters of symmetric multiprocessors (SMPs) have emerged as the primary candidates for large scalemultiprocessor systems. In spite of this trend, relatively little work has been done to develop techniquesfor designing algorithms which make eective use of the resources available on such platforms. Thistask is made dicult by the contrasting requirements of the platform. On the one hand, each SMPmust be viewed on its own as a hierarchical shared memory machine. Good performance requiresboth good load distribution and the minimization of main memory access. On the other hand, fromthe perspective of the cluster, each node is in eect a superproccessor, and therefore the cluster ofSMPs is a collection of powerful processors connected by a communication network. Maximizing theperformance of such a distributed memory machine requires both ecient load balancing and regularbalanced communication.In this paper, we examine the problem of sorting on SMP clusters. Sorting is arguably the moststudied problem in computer science, due in large part to its pervasiveness. But it is also intrinsicallyinteresting because of its demanding requirements for irregular memory access and interprocessor com-munication. From the perspective of the cluster, any algorithm which performs well for distributedmemory machines would be a reasonable candidate. In particular, we have identied two such algo-rithms in our previous work [11, 10, 12]. The rst is a variation on sample sort and the other is avariation on the approach of sorting by regular sampling. On the other hand, from the perspectiveof the individual SMP, there are fewer choices for sorting on hierarchical shared memory machines.These include distribution sort [4], greed sort [13], balance sort [14], sharesort [3], simple randomizedmerge sort [8], radix sort, and the sorting algorithm of Varman et al. [16, 15]. Unfortunately, noneof these algorithms by themselves is sucient to achieve ecient performance on an SMP cluster.Ecient algorithms for distributed memory machines tend to conne interprocessor communicationto a minimum number of regular balanced exchanges. By contrast, algorithms for shared memorymachines typically capitalize on the fact that accessing data associated with another processor is nomore expensive than accessing its own data in the shared memory.We introduce a sorting algorithm for clusters of SMPs which is a hybrid of our modied algorithmsfor parallel sorting by random sampling and parallel sorting by deterministic sampling. Our algorithmwas implemented in C using Posix Threads and runs on a cluster of DEC AlphaServer 2100A systems.We ran our code using a variety of benchmarks that we have identied to examine the dependence ofour algorithm on the input distribution. Our experimental results verify the scalability and eciencyof our proposed solution and illustrate the importance of considering both memory hierarchy and theoverhead of shifting to multiple nodes.The organization of the paper is as follows. Section 2 presents our computational model foranalyzing algorithms for SMP clusters. Section 3 describes in detail our sorting algorithm for this2
platform. Finally, Section 4 describes the experimental performance of our sorting algorithm on anSMP cluster.2 Computational ModelFor our purposes, the cost of an algorithm needs to include a measure that reects the number andtype of memory accesses. A memory access can either be to a local cache, a local main memory, or aremote main memory. It is instructive to start with a brief overview of a number of models that havebeen proposed to capture the performance of multilevel hierarchical memories.Many of the models in the literature are specically limited to two-level memories. Aggarwaland Vitter [4] rst proposed a simple model for main memory and disks which recognized the im-portance of spatial locality. In their uniprocessor model, D blocks of B contiguous records can betransferred between primary and secondary memory in a single I/O. Vitter and Shriver [17] proposedthe more realistic D-disk model, in which secondary storage is partitioned into D physically distinctdisk drives. In this model, D blocks can be transfered in a single I/O but only if no two blocksare from the same disk. Finally, Vitter and Shriver [17] extended the D-disk model to the paralleldomain. In the so-called parallel disk model, the processors are directly connected to one anotherby some interconnection network which can be modeled by any of the standard models (e.g. PRAM,xed-connection network). For all these models, the cost of accessing data on disk was substantiallyhigher than internal computation, and, hence, the sole measure of performance used is the number ofparallel I/Os.Alternatively, there are a number of models which allow for any arbitrary number of memorylevels. Focusing on the fact that access to dierent levels of memory are achieved at diering costs,Aggarwal et al. [1] introduced the Hierarchical Memory Model (HMM), in which access to locationx requires time f(x), where f(x) is any monotonic nondecreasing function. Taking note of the factthat the latency of memory access makes it economical to fetch a block of data, Aggarwal, Chandra,and Snir [2] extended this model to the Hierarchical Memory with Block Transfer Model (BT). Inthis model, accessing t consecutive locations beginning with location x requires time f(x)+ t. Finally.Nodine and Vitter [14] generalized this model to the parallel domain in exactly the same fashion asthe D-disk model. The result is the Parallel Hierarchical Model with Block Transfer (P-BT), in whichthe individual memory hierarchies are linked at the CPUs by some appropriate network topology.These models all assume that while the buses which connect the various levels of memory might besimultaneously active, this only occurs in order to cooperate on a single transfer. Partly in responseto this limitation, Alpern et al. [5] proposed the Uniform Memory Hierarchy Model (UMH). Inthis model, the lth memory level consists of l blocks, each of size l, and a block of data can betransfered between level l+1 and level l in time l=b(l), where b(l) is the bandwidth. This model has3
been extended to the parallel domain in two ways as described in [5] and [13]. However, the UMHmodel is unnecessarily complicated for use with clusters of SMPs and requires signicant renementsto capture the corresponding hybrid architecture.In our complexity model, each SMP is viewed as a two-level hierarchy for which good performancerequires both good load distribution and the minimization of secondary memory access. The clusteris viewed as a collection of powerful processors connected by a communication network. Maximizingperformance on the cluster requires both ecient load balancing and regular balanced communication.Hence, our performance model combines two separate but complimentary models.Note that our approach is distinct from two methodologies that are currently being promotedfor programming clusters of SMPs. The rst approach is to view the platform as a distributedshared memory system, using a software layer to simulate coherent shared memory between nodes bytransparently using messages to move around specic data or referenced memory pages. The secondapproach, based on message passing primitives, enforces a shared-nothing paradigm between tasks,and all communication and coordination between tasks are performed through the exchange of explicitmessages, even tasks on a node with physically shared memory. Both of these methodologies acceptineciencies in order to simplify programmability and portability. However, these current approacheslead to signicant ineciencies that will make them unacceptable for a wide range of problems.In our SMP model, we recognize that ecient algorithm design requires the ecient decompositionof the problem amongst the available processors, and so, unlike some other models for hierarchicalmemory mentioned in our introduction, we include the cost of computation in our complexity. But atthe same time, our cost model encourages the exploitation of temporal and spatial locality. Specically,memory at each SMP is seen as consisting of two levels: cache and main memory. A block of wcontiguous words can be read from or written to main memory in  + wr  time, where  is the latencyof the bus, r is the number of processors competing for access to the bus, and  is the bandwidth. Bycontrast, the transfer of w noncontiguous words would require w  + r time.We capture performance on an SMP cluster in exactly the same way as described in [11, 10, 12]. Weview a parallel algorithm as a sequence of local SMP computations interleaved with communicationsteps, where we allow computation and communication to overlap. Assuming no congestion, thetransfer of a block consisting of w contiguous words between two nodes takes  + w  time, where is the latency of the network and  is the bandwidth per node. We assume that the bisectionbandwidth is suciently high to support block permutation routing amongst the N nodes at the rateof 1=. In particular, for any subset of r nodes, a block permutation amongst the r nodes takes + w  time, where w is the size of the largest block. Using this cost model, we can evaluate thecommunication time Tcomm(n;N) of an algorithm as a function of the input size n, the number ofnodes N , and the parameters  and . The overall complexity of the cluster T(n;N) is given by the4
sum of TSMP and Tcomm(n;N).Note that our model for clusters of SMPs is similar to the parallel D-disk model for the case ofa single disk. It is also similar to the P-BT model, if f(x) is chosen to be a step function equal tozero for all those locations in cache and equal to some positive constant for all other locations in mainmemory. However, there are certain dierences. First, the parallel version of these models considersthe processors to be joined by some network topology at the processor level. By contrast, we view theprocessors at a node to be joined on the level of shared main memory and only the nodes themselves tobe joined by an interconnection network, and, moreover, we explicitly include the computation cost.3 Sorting AlgorithmsFor simplicity of presentation, we rst consider an algorithm for sorting on a single SMP, and thendescribe an algorithm for sorting on clusters of SMPs.3.1 Sorting on a Single SMPSeveral sorting algorithms that have been proposed in the literature for hierarchical memory modelscan be considered for possible implementation on an SMP. For example, distribution sort [4] and itsmore rened version balance sort [14] partition the input elements into buckets using a set of approx-imately evenly spaced splitters and then sort the contents of each bucket recursively. While they areecient in memory access, they are not optimal in their computational costs since at each level ofrecursion they sort the elements in blocks corresponding to the size of the cache. Other sorting algo-rithms such as greed sort [13] and simple randomized merge sort [8] are bottom-up sorting algorithms.These are sophisticated versions of merge sort, designed to make optimal use of multiple indepen-dent disks. However, straightforward merging is inherently sequential, so without modications thesealgorithms would not be expected to perform eciently with multiple processors. Finally, there issharesort [3], which is a hybrid of the rst two approaches. Briey, the input of n elements is evenlydivided into n subsets, where  is some constant between 0 and 1, and the resulting subsets arerecursively sorted. A set of precisely evenly-spaced splitters are then computed to divide the sortedlists, after which the k sorted subsets which lie between each pair of splitters are merged to producethe sorted output. The actual algorithm is considerably more complex, since the authors designed itfor very general hierarchical memory models.Since an SMP is considerably simpler than the intended target of these algorithms, there are atleast three other algorithms which merit consideration. The rst is an adaptation of radix sort, inwhich each processor starts by computing a histogram for its portion of the input elements. The resultsof all p histograms are then shared so that each processor can compute its oset values in a globalcontext, which are then used to rearrange the data in parallel. However, depending on the input,5
this data rearrangement could cause very poor cache performance, which would make this algorithmuncompetitive with other possible choices.Another approach proposed by Varman et al. [16, 15]. starts by having each processor sorts npelements from the input set using C-way merge sort, where C is the size of the cache. Next, a subset ofp evenly spaced samples are drawn from each of the p sorted sequences, after which a set of p splittersis identied which partition the subset of samples evenly The size of the sample set is then doubledand the process repeated until the set of samples includes all n input values.Yet another algorithm is an adaptation of our sorting by regular sampling [11, 12], which weoriginally developed for distributed memory machines. As modied for an SMP, this algorithm issimilar to the parallel sorting by regular sampling (PSRS) algorithm, except that our algorithm canbe easily implemented as a stable sort.The pseudocode for our algorithm is as follows, where C is the size of the cache and L is the cacheline size: (1) Each processor Pi (1  i  p) sorts the subsequence of the n input elements with indices (i 1)np + 1 through  inp  as follows:{ (A) Sort each block of m input elements using an appropriate sequential algorithm, wherem  C. For integers we use the radix sort algorithm, whereas for oating point numberswe use the merge sort algorithm.{ (B) For j = 1 up to  log(n=pm)log(z) , merge the sorted blocks of size mz(j 1) using z-waymerge, where z  CL . (2) Each processor Pi selects each k npsth element as a sample, for (1  k  s) and a givenvalue of s p  s  np2. (3) Processor Pp merges the p sorted subsequences of samples and then selects the (ks)th sampleas Splitter[k], for (1  k  p   1). By default, the pth splitter is the largest value allowed bythe data type used. Additionally, binary search is used to compute for the set of samples withindices 1 through (ks) the number of samples Est[k] which share the same value as Splitter[k]. Step (4): Each processor Pi uses binary search to dene an index b(i;j) for each of the p sortedinput sequences created in Step (1). If we dene T(i;j) as a subsequence containing the rst b(i;j)elements in the jth sorted input sequence, then the set of p subsequences fT(i;1); T(i;2); :::; T(i;p)gwill contain all those values in the input set which are strictly less then Splitter[i] and at mostEst[i] nps elements with the same value as Splitter[i]. Step (5): Each processor Pi merges those subsequences of the sorted input sequences which liebetween indices b((i 1);j) and b(i;j) using p-way merge.6
It is straightforward to see how this algorithm can be implemented as a stable integer sort. Step(1A) sorts the input elements block by block. As long as a stable sequential sort is used and therelative order of the blocks is preserved, stability is maintained. The z-way merge in Step (1B) canbe done using a tree of losers. If ties are resolved by choosing the element from that block whichappears rst in the input, then stability will be preserved. Steps (2) and (3) do not directly aectstability. Step (4) partitions the p sorted sequences generated by Step (1B). If the Est[i] npselements with the same value as Splitter[i] are chosen in a greedy fashion beginning with the rstsequence and proceeding in order, stability will be preserved. Finally, stability can be preserved inthe merging of Step (5) in exactly the same fashion as Step (1B). Hence, our algorithm results ina stable integer sort.Before establishing the complexity of this algorithm, we need the results of the following theoremdeveloped in [11, 12]:Theorem 1: At the completion of the partitioning in Step (4), no more than np + ns   p elementswill be associated with any splitter, for n  p2 and p  s  np2.With the results of Theorem 1, the analysis of our algorithm for sorting by regular samplingis as follows. The cost of sequentially sorting np elements in blocks of size m in Step (1A) de-pends on the data type - sorting integers using radix sort requires O np + + n time, whereassorting oating point numbers using merge sort requires O np logm+ + n time. Step (1B) in-volves  log(n=pm)log(z)  rounds of z-way merge beginning with  npm blocks of size m, which requires onlyO np log  npm+ npm + n log(n=pm)log(z)  time because of spatial locality. The reading of s noncontiguoussamples in Step (2) requires O  s+ s  + p time. Step (3) involves a p-way merge of blocks of sizes followed by p binary searches on segments of size s. Since only one processor is active in Step (3),it requires O  sp log(p) + p log(s) + ps  time. Step (4) involves p binary searches on segments of sizenp . Since these reads are noncontiguous, they require O p log np+ p log np   + p  time, Finally,Step (5), involves a p-way merge of p blocks of total maximum size np + ns   p, requiring at mostO np + ns log p+ p + np + ns p time. Hence, the overall complexity of our sorting algorithm isgiven (for oating point numbers) byT (n; p) = O0@np logn +  npm + np2 + p lognp  + log  npmlog(z) n1A (1)for n  p2 log np, p  s  np2, m  C, and z  CL . Note that since  for an SMP data bus is smalland its coecient in this expression is far less than n, the complexity can be approximated by:T (n; p)  O0@np logn + log  npmlog (z) n1A (2)7
The analysis suggests that the parameters m and z should be as large as possible, subject to thestated constraints. Indeed, in such a case we believe our algorithm to be asymptotically faster thanthe performance achieved by straightforward implementations of the other algorithms on our model.However, our experimental investigation shows that the optimal choices for m and z are actuallyslightly smaller than suggested by our analysis, since they depend on a variety of factors besides thosecaptured in our model.3.2 Sorting On a Cluster of SMPsWe have already developed both a deterministic [11, 12] and a randomized [11, 10] sample sort algo-rithm which we have shown to be very ecient for message passing platforms. Either would be anappropriate choice for a cluster of SMPs, but we chose the randomized sample sort because it provedto be slightly faster in its implementation. We repeat the pseudocode here for convenience, where wereplace each sequential step where appropriate by a multithreaded SMP implementation. Note thatthe communication primitives mentioned are described in detail in [6]. Step (1): Using p threads, each node Ni (1  i  N) randomly assigns each of its nN elementsto one of N buckets. With high probability, no bucket will receive more than c1 nN2 elements,where c1 is a constant to be dened later. Step (2): Each node Ni routes the contents of bucket j to node Nj , for (1  i; j  N).Since with high probability no bucket will receive more than c1 nN2 elements, this is equivalent toperforming a transpose operation with block size c1 nN2 . Step (3): Using p threads, each node Ni sorts at most  1 nN  c1 nN  values received in Step(2) with the appropriate version of our SMP sorting algorithm, depending on the data type. Step (4): From its sorted list of   nN  c1 nN  elements, node N1 selects each j nN2th elementas Splitter[j], for (1  j  N   1). By default, Splitter[N ] is the largest value allowed by thedata type used. Additionally, for each Splitter[j], binary search is used to determine the valuesFracL[j] and FracR[j], which are respectively the fractions of the total number of elements atnode N1 with the same value as Splitter[j   1] and Splitter[j] which also lie between index(j   1) nN2 + 1 and index j nN2, inclusively. Step (5): Node N1 broadcasts the Splitter, FracL, and FracR arrays to the other N 1 nodes. Step (6): Each node Ni uses binary search on its sorted local array to dene for each of the Nsplitters a subsequence Sj . The subsequence associated with Splitter[j] contains all those valueswhich are greater than Splitter[j   1] and less than Splitter[j], as well as FracL[j] and FracR[j]of the total number of elements in the local array with the same value as Splitter[j   1] andSplitter[j], respectively. 8
 Step (7): Each node Ni routes the subsequence associated with Splitter[j] to processor Nj , for(1  i; j  N). Since with high probability no sequence will contain more than c2 nN2 elements,where c2 is a constant to be dened later, this is equivalent to performing a transpose operationwith block size c2 nN2 . Step (8): Using p threads, each node Ni merges the N sorted subsequences received in Step(7) to produce the ith column of the sorted array. Note that, with high probability, no node hasreceived more than 2 nN elements, where 2 is a constant to be dened later.Recalling that we established in [10] that with high probability c1  2, 2  2:62, and c2  5:42,the analysis of our sample sort algorithm is as follows. The randomization in Step (1) is easily donewith multiple threads by having each of the p processors at a node simultaneously assign nNp elements,requiring at most O  nNp +N+ nN time. The sorting at each node in Step (3) can be done usingp processors with our SMP sorting routine in approximately O nNp log   nNm+ log  nNpmlog(z) nN time forintegers and O nNp log   nN + log  nNpmlog(z) nN time for doubles. Steps (4) and (6)are both done usinga single thread and both require O  p log   nN + p log   nN  (+ 1=) time. Step (8) requires that wemerge N sorted sequences. This can be easily done using multiple threads by following the samescheme used in Steps(2) through (5) of our algorithm for sorting on a single SMP, requiring at mostO  nNp logN +N+ nN time. Finally, Steps (2), (5), and (7) call the communication primitivestranspose, bcast, and transpose, respectively. The analysis of these primitives in [6] shows that withhigh probability these three steps require Tcomm(n; p)   + 2n(N 1)N2 , Tcomm(n; p)   + 2(N 1) ,and Tcomm(n; p)   + 5:24n(N 1)N2 , respectively. Hence, taking note of the modest size of  and itssmall coecients in all these steps, with high probability the overall complexity of our sample sortalgorithm is given (for oating point numbers) byT (n; p)  Tcomp(n;N; p) + Tcomm(n;N; p)= O0@ nNp logn+ log  nNpmlog (z) nN +  + nN1A (3)for N2 < n3 lnn .Note that the analysis indicates that the bus bandwidth is a more serious limiting factor than thecluster interconnect bandwidth, especially in view of the fact that the bus bandwidth is not scalablelike the cluster interconnect bandwidth.4 Performance EvaluationOur algorithms were implemented using Posix Threads [9] and run on a DEC Alpha Cluster. Our DECAlpha cluster consists of 10 AlphaServer 2100A systems, each of which holds 4 Alpha 21064A proces-9
sors running each at 275 MHz. Each Alpha 21064A processor has a 16KB primary data cache and a4MB secondary data cache. The AlphaServers are connected using the Digital Gigaswitch/ATM andOC-3c adapter cards, which have a peak bandwidth rating of 155.52 Mbps. Internode communicationis eected by calls to the SIMPLE collective communication primitives [7].4.1 Sorting BenchmarksWe tested our code on a variety of benchmarks, each of which had both a 32-bit integer version anda 64-bit double precision oating point number (double) version. Our nine sorting benchmarks aredened as follows, in which n and p are assumed for simplicity but without loss of generality to bepowers of two and MAXD, the maximum value allowed for doubles, is approximately 1:8 10308.1. Uniform [U], a uniformly distributed random input, obtained by calling the C library randomnumber generator random(). This function, which returns integers in the range 0 to  231   1, isseeded by each processor Pi with the value (21+1001i). For the double data type, we \normalize"the integer benchmark values by rst subtracting the value 230 and then scaling the result by 2 30 MAXD.2. Gaussian [G], a Gaussian distributed random input, approximated by adding four calls torandom() and then dividing the result by four. For the double data type, we normalize theinteger benchmark values in the manner described for [U].3. Zero [Z], a zero entropy input, created by setting every value to a constant such as zero.4. Bucket Sorted [B], an input that is sorted into p buckets, obtained by setting the rst np2elements at each processor to be random numbers between 0 and (231p   1), the second np2elements at each processor to be random numbers between 231p and (232p   1), and so forth. Forthe double data type, we normalize the integer benchmark values in the manner described for[U].5. g-Group [g-G], an input created by rst dividing the processors into groups of consecutive pro-cessors of size g, where g can be any integer which partitions p evenly. If we index these groups inconsecutive order from 1 up to pg , then for group j we set the rst npg elements to be random num-bers between    (j   1) g + p2   1 mod p+ 1 231p and    (j   1) g + p2 mod p+ 1 231p   1,the second npg elements at each processor to be random numbers between   (j   1) g + p2 mod p+ 1 231p and    (j   1) g + p2 + 1 mod p+ 1 231p   1, and so forth.For the double data type, we normalize the integer benchmark values in the manner describedfor [U].6. Staggered [S], created as follows: if the processor index i is less than or equal to p2 , thenwe set all np elements at that processor to be random numbers between (2i  1) 231p  and10
(2i) 231p   1. Otherwise, we set all np elements to be random numbers between (2i  p  2) 231p and (2i  p  1) 231p   1. For the double data type, we normalize the integer benchmark valuesin the manner described for [U].7. Worst-Load Regular [WR] - an input consisting of values between 0 and (231 1) designed toinduce the worst possible load balance at the completion of our regular sorting. Specically, at thecompletion of sorting, the odd-indexed processors will hold (np+ns p) elements, whereas the even-indexed processors will hold (np  ns+p) elements. The benchmark is dened as follows. At proces-sor P1, for odd values of j between 1 and (p 2), the elements with indices (j   1) np2 + 1 throughj np2   1 are set to random values between (j   1)231p + 1 and j 231p   1, the elements withindices j np2 through j np2 + nsp   1 are set to j 231p , the elements with indices j np2 + nspthrough (j + 1) np2   1 are set to random values between j 231p + 1 and (j + 1)231p   1, andthe element with index (j + 1)231p  is set to (j + 1)231p . At processor P1, for j equal to (p 1),the elements with indices (j   1) np2 + 1 through j np2   1 are set to random values between(j   1)231p + 1 and j 231p   1, the elements with indices j np2 through j np2 + nsp   1 areset to j 231p , and the elements with indices j np2 + nsp through (j + 1) np2 are set to randomvalues between j 231p + 1 and (j + 1)231p   1. At processor Pi (i > 1), for odd values of jbetween 1 and p, the elements with indices (j   1) np2 + 1 through j np2 + nsp   1 are set torandom values between (j   1)231p + 1 and j 231p   1, the elements with index j np2 + nspis set to j 231p + i, and the elements with indices j np2 + nsp + 1 through (j + 1) np2 are setto random values between j 231p + 1 + i and (j + 1)231p   1. For the double data type, wenormalize the integer benchmark values in the manner described for [U].8. Deterministic Duplicates [DD], an input of duplicates in which we set all np elements at eachof the rst p2 processors to be logn, all np elements at each of the next p4 processors to be log  n2 ,and so forth. At processor Pp, we set the rst n2p elements to be log np, the next n4p elementsto be log  n2p, and so forth.9. Randomized Duplicates [RD], an input of duplicates in which each processor lls an arrayT with some constant number range (range is 32 for our work) of random values between 0 and(range 1) whose sum is S. The rst T [1]S  np values of the input are then set to a random valuebetween 0 and (range  1), the next T [2]S  np values of the input are then set to another randomvalue between 0 and (range  1), and so forth.See [10] for a detailed justication of these benchmarks.4.2 Experimental Results for a Single SMPWe begin by experimentally determining the optimal values of the parameters m and z, where m isthe block size in Step (1A) and z is the z-way merge used in Step (1B). Table I displays the times11
required to sort 4M doubles using four threads as a function of m and z. Clearly, performance suersdramatically when the block size reaches 4MB (512K eight byte double precision numbers), which isthe limit of the secondary cache. This is expected, since sorting a block in Step (1A) now requiresthat data be repeatedly swapped to main memory. It is more dicult to explain why the optimalvalues for m and z were 32K and 32, respectively, since this block size exceeds the 16KB primarycache but falls well short of the secondary cache size. Part of the explanation might lie in the fact thatthe overall complexity of Onp logn+ log  npm log(z) n obscures the fact that the complexity of the blocksort in Step (1a) is O np logm+ + n, which is an increasing function of m. The value of z = 32 isreasonable when we note than the actual cache line size is 32 bytes. Our tree of losers is implementedwith z 12-byte records, so the entire process could easily take place in the 16KB primary cache.For the remaining discussion, the times reported are for the experimentally optimal values of mand z and for the number of samples s = 8. The relative performance of the algorithm on dierentbenchmarks seems to conrm that this was a reasonable choice for s.Table I also provides an interesting illustration in the importance of minimizing secondary memoryaccess. If we consider the data for a block size of 2K, the execution time drops as we move from z = 2to z = 4 to z = 8. This is reasonable since we require 9 rounds of 2-way merge, 5 rounds of 4-waymerge, and only 3 rounds of 8-way merge, and each round of z-way merge is obviously another roundwhere all the input elements must be brought in from main memory. Moving from z = 8 to z = 16has little eect on the execution time since it does nothing to reduce the memory requirements, butmoving to z = 32 saves a round of memory access and, hence, the execution time is further reduced.However, the most dramatic illustration of the importance of minimizing secondary memory accesscan be found by comparing the optimal sorting time of 3:41 seconds for m = 32K and z = 32 withthe time of 9:86 seconds required to sort using only binary merge sort. Reducing memory access by acombination of block sorting and z-way merging improved performance nearly threefold.Block Denomination of z-Way MergeSize 2 4 8 16 32 64 128 256 5122K 7.24 5.50 4.37 4.24 3.83 3.80 3.98 4.08 3.644K 6.57 5.13 4.45 3.91 3.68 3.91 3.86 3.538K 6.25 4.89 4.24 3.76 3.77 3.92 3.4316K 5.73 4.53 3.99 3.92 3.86 3.5132K 5.41 4.65 3.97 3.99 3.4164K 5.14 4.29 4.29 3.78128K 5.20 4.81 4.41256K 5.98 5.56512K 7.681M 9.86 - (No z-way merge is necessary for this block size)Table I: Time (in seconds) required to sort 4M doubles using four threads as a function of M and z.12
Tables II and III display the performance of our sorting algorithm as a function of input distribu-tion for a variety of input sizes. Notice that the he performance is essentially similar for benchmarks[U], [G], and [WR] and for benchmarks [Z], [DD], and [RD]. The reason why the benchmarks in thesecond group ran signicantly faster than the benchmarks in the rst group is that the second groupof benchmarks all contained values restricted to a very small range (f0g,f0; 1; 2g, and f0; 1; :::; 32g)which results in signicant savings in time required for sorting and merging. Because there was littledierence in the performance of the benchmarks in the rst group, the remainder of this section willonly discuss the performance of our sorting algorithm on the single benchmark [U].Input BenchmarkSize [U] [G] [Z] [WR] [DD] [RD]1M 0.428 0.430 0.319 0.405 0.274 0.3112M 0.973 0.930 0.586 0.922 0.536 0.5974M 1.96 1.97 1.25 1.86 1.23 1.308M 4.06 4.05 2.66 3.81 2.68 2.74Table II: Sorting integers (in seconds) using 4 threads on a DEC AlphaServer 21000A.Input BenchmarkSize [U] [G] [Z] [WR] [DD] [RD]512K 0.397 0.394 0.320 0.421 0.337 0.3481M 0.868 0.856 0.741 0.844 0.724 0.7102M 1.64 1.72 1.39 1.73 1.40 1.514M 3.50 3.47 3.00 3.52 3.01 2.98Table III: Sorting doubles (in seconds) using 4 threads on a DEC AlphaServer 21000A.The results in Figure 1 examine the scalability of our sorting algorithm as a function of thenumber of threads. Results are shown for sorting both integers and doubles. Bearing in mind thatthese graphs are log-log plots, they show that for a xed input size n the execution time nearlyhalves when the number of threads p are doubled. The graphs in Figure 2 examine the scalabilityof our sorting algorithm as a function of problem size, for diering numbers of threads. They showthat for a xed number of processors there is an almost linear dependence between the executiontime and the total number of elements n. This might at rst appear to exceed our prediction of aO (n logn) relationship between n and the computational complexity. However, this appearance of alinear relationship is still quite reasonable when we consider that for the range of values shown log ndiers by only a factor of 1:2.4.3 Experimental Results for a Cluster of SMPsFor each experiment, the input is evenly distributed amongst the nodes. The output consists of theelements in non-descending order arranged amongst the nodes so that the elements at each node are13
Figure 1: Scalability of sorting integers and doubles with respect to the number of threads.
Figure 2: Scalability of sorting integers and doubles with respect to the problem size, for diering numbersof threads. 14
in sorted order and no element at node Ni is greater than any element at processor Nj, for all i < j.Note that in all cases the results shown for a single node were obtained using the sorting algorithm fora single SMP.Tables IV and V display the performance of our sorting algorithm as a function of input dis-tribution for a variety of input sizes. In each case, the performance is essentially independent of theinput distribution. Because of this independence, the remainder of this section will only discuss theperformance of our sorting algorithm on the single benchmark [U].Input BenchmarkSize [U] [G] [2-G] [B] [S] [Z] [WR] [DD] [RD]8M 2.28 2.31 2.23 2.21 2.36 2.11 2.27 2.16 2.1316M 4.36 4.31 4.42 4.41 4.33 4.17 4.31 4.09 4.0832M 9.22 9.53 9.26 9.17 9.31 8.93 9.36 8.99 8.9864M 16.49 17.01 16.86 17.21 17.09 16.07 16.79 16.26 16.03Table IV: Total execution time (in seconds) required to sort a variety of integer benchmarks on an 8 nodecluster using 4 threads.Input BenchmarkSize [U] [G] [2-G] [B] [S] [Z] [WR] [DD] [RD]4M 2.76 2.79 2.72 2.74 2.73 2.61 2.81 2.64 2.608M 4.89 4.86 4.80 4.76 4.85 4.57 4.80 4.61 4.5416M 9.36 9.54 9.30 9.19 9.28 8.94 9.25 9.01 8.9032M 18.71 19.31 18.68 18.27 18.54 18.16 18.98 18.23 18.31Table V: Total execution time (in seconds) required to sort a variety of double benchmarks on an 8 nodecluster using 4 threads.The results in Figure 3 examine the scalability of our sorting algorithm as a function of thenumber of nodes, for a variety of threads. Results are shown for sorting both integers and doubles.To understand these results, consider the step by step breakdown of the execution times shown inTable VI for sorting 8M integers with both 1 and 4 threads. Moving from one node to two introducesthe overhead of Steps 1-2 and 4-8, which together account for approximately 35% and 50% of thetotal execution time on 1 and 4 threads, respectively. This consumes the majority of the time wecould hope to save by sharing the work of sorting amongst two nodes. The eect is more pronouncedfor multiple threads because as our model predicts internode communication is independent of thenumber of threads. The eect of this overhead would be even more pronounced were it not for the factthat the time required for Step 3 for both 1 and 4 nodes is considerably higher than we would expectfrom sorting 4M integers on a single node. But moving between 1 and 4 nodes and 1 and 8 nodes,the time required for Step (3) scales inversely with the number of nodes, which is the expectationof our model. The failure of communication in Steps 2 and 7 to scale inversely with the number of15
nodes might at rst appear surprising. However, this performance is actually quite reasonable if werecall that for 2, 4, and 8 nodes, each node has to send approximately 2M, 1.5M, and 0.875M integersacross the network, respectively. The clear implication of the results shown here in Figure 3 andTable VI is that an algorithm must be both ecient and scalable to justify the use of multiple nodes.
Figure 3: Scalability of sorting integers and doubles with respect to the number of nodes, for dieringnumbers of threads. One Thread Four ThreadsStep(s) 1 2 4 8 1 2 4 81 0.00 0.87 0.41 0.22 0.00 0.56 0.26 0.112 0.00 1.28 0.92 0.56 0.00 1.11 0.88 0.583 11.19 7.17 2.39 1.17 3.99 3.11 0.87 0.734-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.007 0.00 1.24 0.84 0.49 0.00 1.08 1.39 0.628 0.00 0.49 0.27 0.32 0.00 0.23 0.13 0.25Total 11.19 11.05 4.83 2.76 3.99 5.09 3.53 2.29Table VI: Time required for each step of sorting 8M integers with respect to the number of nodes using1 and 4 threads.The graphs in Figure 4 examine the scalability of our sorting algorithm as a function of problemsize, for diering numbers of nodes and for 1 and 4 threads. For one thread, they show that for axed number of nodes there is an almost linear dependence between the execution time and the total16
number of elements n. This agrees with our prediction of a O (n logn) relationship between n and thecomputational complexity since for the range of values shown logn diers by only a factor of 1:2.The results for 4 threads in Figure 4 are seemingly more problematic. To understand theseresults, consider the step by step breakdown of the execution times shown in Table VII for sortingvarying numbers of integers using 8 nodes and 4 threads. Clearly, the communication costs of Steps(2) and (7) dominate the overall execution time. Strangely, increasing the problem size from 1M to2M actually decreases the communication costs, after which the communication costs scale relativelylinearly with the problem size n as expected. This immediately suggests that there is some sort ofchange in the communication protocol with packets larger than 64K bytes. The exact same behavioris seen in Table VIII, which shows the the step by step breakdown of the execution times forvarying numbers of nodes for 1M integers and 4 threads. Increasing the number of nodes from 4 to8 while holding the problem size constant decreases the packet size from 256K to 64K and creates anunexpected spike in the time required for communication. Finally, if we vary the number of threadswhile using 8 nodes to sort 1M integers, the exact same abnormality is encountered. At rst, thiswould seem unexpected, since according to our model communication costs should not depend onthe the number of threads. However, in our implementation of a transpose, each node exchanges proughly equal size packets, one for each thread. Thus, it would seem that increasing the packet sizeabove 16384 bytes causes a change in the communication protocol. Once the protocol is switched, therelative costs of communication decline and the execution time scales with problem size as our modelanticipates. Problem SizeStep(s) 1M 2M 4M 8M1 0.01 0.03 0.06 0.112 0.47 0.26 0.38 0.583 0.07 0.12 0.36 0.734-6 0.00 0.00 0.00 0.007 0.51 0.35 0.45 0.628 0.18 0.16 0.12 0.25Total 1.27 0.96 1.37 2.29Table VII: Time required for each step of sorting integers on the DEC Alpha Cluster using 8 nodes and4 threads.
17
Number of NodesStep(s) 1 2 4 81 0.00 0.06 0.03 0.012 0.00 0.14 0.13 0.473 0.43 0.30 0.14 0.074-6 0.00 0.00 0.00 0.007 0.00 0.15 0.13 0.518 0.00 0.03 0.05 0.18Total 0.48 0.69 0.49 1.27Table VIII: Time required for each step of sorting 1M integers on the DEC Alpha Cluster using 4 threads.
18
Figure 4: Scalability of sorting integers and doubles with respect to the problem size, for diering numbersof nodes and threads. 19
References[1] A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A Model for Heirarchical Memory. InProceedings of the 19th Annual ACM Symposium of Theory of Computing, pages 305{314, May1987.[2] A. Aggarwal, A. Chandra, and M. Snir. Heirarchical Memory with Block Transfer. In Proceed-ings of the 28th Annual IEEE Symposium on Foundations of Computer Science, pages 204{216,October 1987.[3] A. Aggarwal and G. Plaxton. Optimal Parallel Sorting in Multi-Level Storage. In Proceedings ofthe Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 659{668, 1994.[4] A. Aggarwal and J. Vitter. The Input/Output Complexity of Sorting and Related Problems.Communications of the ACM, 31:1116{1127, 1988.[5] B. Alpern, L. Carter, E. Feig, and T. Selker. The Uniform Memory Hierarchy Model of Compu-atation. Algorithmica, 12:72{109, 1994.[6] D.A. Bader and J. JaJa. Practical Parallel Algorithms for Dynamic Data Redistribution, MedianFinding, and Selection. Technical Report CS-TR-3494 and UMIACS-TR-95-74, UMIACS andElectrical Engineering, University of Maryland, College Park, MD, July 1995. Presented at the10th International Parallel Processing Symposium, pages 292-301, Honolulu, HI, April 15-19,1996.[7] D.A. Bader and J. JaJa. SIMPLE: A Methodology for Programming High Performance Algo-rithms on Clusters of Symmetric Multiprocessors. CS-TR-3798 and UMIACS-TR-97-48 TechnicalReport, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, May1997.[8] R. Barve, E. Grove, and J. Vitter. Simple Randomized Mergesort on Parallel Disks. In Proceedingsof the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 109{118,Padua, Italy, June 1996.[9] D.E. Culler, A. Dusseau, S.C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, andK. Yelick. Parallel Programming in Split-C. In Proceedings of Supercomputing '93, pages 262{273, Portland, OR, November 1993.[10] D.R. Helman, D.A. Bader, and J. JaJa. A Randomized Parallel Sorting AlgorithmWith an Exper-imental Study. Technical Report CS-TR-3669 and UMIACS-TR-96-53, UMIACS and ElectricalEngineering, University of Maryland, College Park, MD, August 1996. Submitted to jpdc.20
[11] D.R. Helman, D.A. Bader, and J. JaJa. Parallel Algorithms for Personalized Communicationand Sorting With an Experimental Study. In Proceedings of the Eighth Annual ACM Symposiumon Parallel Algorithms and Architectures, pages 211{220, Padua, Italy, June 1996.[12] D.R. Helman, J. JaJa, and D.A. Bader. A New Deterministic Parallel Sorting Algorithm Withan Experimental Evaluation. Technical Report CS-TR-3670 and UMIACS-TR-96-54, UMIACSand Electrical Engineering, University of Maryland, College Park, MD, August 1996. Submittedto jea.[13] M. Nodine and J. Vitter. Large-Scale Sorting in Parallel Memories. In Proceedings of the ThirdAnnual ACM Symposium on Parallel Algorithms and Architectures, pages 29{39, Newport, RI,June 1991.[14] M. Nodine and J. Vitter. Deterministic Distribution Sort in Shared and Distributed MemoryMultiprocessors. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithmsand Architectures, pages 120{129, Velen, Germany, June 1993.[15] P. Varman, B. Iyer, D. Haderle, and S. Dunn. Parallel Merging: Algorithm and ImplementationResults. Parallel Computing, 15:165{177, 1990.[16] P. Varman, B. Iyer, and S. scheuer. A Multiprocessor Algorithm for Merging Multiple SortedLists. In Proceedings of the 1990 International Conference on Parallel Processing, pages 22{26.[17] J. Vitter and E. Shriver. Algorithms for Parallel Memory I: Two-Level Memories. Algorithmica,12:110{147, 1994.
21
