Several recent papers have proposed or analyzed optimal algorithms to route all-to-all personalized communication (AAPC) over communication networks such as meshes, hypercubes and omega switches. However, the constant factors of these algorithms are often an obscure function of system parameters such as link speed, processor clock rate, and memory access time. In this paper we investigate these architectural factors, showing the impact of the communication style, the network routing table, and most importantly, the local memory system, on AAPC performance and permutation routing on the Cray T3D.
Introduction
With the advent of new parallel machines, specialized algorithms have been developed for all-to-all personalized communication (AAPC) on all common supercomputer interconnects.
There is a simple upper bound for AAPC performance since the algorithm is bisection limited, and optimal algorithms are easy to find for most network architectures. Most practical methods use a priori knowledge about the communication pattern, and attempt to minimize congestion and contention in the network. For further details, we refer the reader to a good description of the history of AAPC and a survey of algorithms [2]. An efficient AAPC implementation is critical, since it determines the speed of transposes and array redistribution. Furthermore, it is a significant factor in the overall performance of commonly-used algorithms such as sorting. For many modern parallel machines, the fastest sorting algorithms are based on counting algorithms (e.g., radix sorts). Again, we refer to previous surveys of sorting algorithms and implementations [3, 10, 2, 5].

In Section 2 we describe an AAPC implementation on the Cray T3D [1] that achieves performance close to nominal bandwidth. Section 3 shows how the performance of local and remote memory operations affects the end-to-end routing performance. In Section 4 we present performance results for a counting sort based on this AAPC implementation.

1 Where h = f >> i, the number of elements per processor.

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee. SPAA '96, Padua, Italy. © 1996 ACM 0-89791-809-6/96/06 ...$3.50
AAPC on a commercial platform
In all-to-all personalized communication every processor has a block of data to send to every other processor in the system. A simple implementation loops through a set of send and receive calls on every processor, and leaves it to the message-passing system or the network to find an adequate schedule to deliver the messages.
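As a minimal sketch of the communication pattern itself (not of the T3D implementation), AAPC can be simulated in a single address space as a transpose of per-processor block lists. The function name `aapc_exchange` and the list-of-lists representation are assumptions made purely for illustration.

```python
def aapc_exchange(send_blocks):
    """Simulate all-to-all personalized communication in one address space.

    send_blocks[src][dst] is the block that processor `src` holds for `dst`.
    The result recv_blocks[dst][src] is the block that `dst` received
    from `src`, i.e. the block matrix is transposed.
    """
    num_procs = len(send_blocks)
    recv_blocks = [[None] * num_procs for _ in range(num_procs)]
    for src in range(num_procs):
        for dst in range(num_procs):
            # One personalized block per (source, destination) pair.
            recv_blocks[dst][src] = send_blocks[src][dst]
    return recv_blocks
```

On a real machine each inner assignment becomes a network transfer, which is why the order in which the pairs are visited matters for congestion.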
With no synchronization, the flow control of the overloaded routing system quickly skews schedules, resulting in poor performance. To avoid this we used an optimized AAPC routing schedule for all our measurements, although we observed that if there is no synchronization between steps then there is little difference between the performance of a fixed schedule, a skewed schedule, and a randomly-permuted schedule.

On the T3D sending data from memory to the network is done by the processor, while receiving data to memory is handled by separate dedicated hardware (the deposit engine). Therefore each processor can source and sink a transfer at full speed. The T3D uses packets of up to 32 bytes sent along fixed, deterministic routes. We are measuring a balanced AAPC, so each processor sends the same amount of data to every other processor. We instrumented a simple AAPC implementation to record the performance of each individual transfer. With P processors there are P^2 of these transfers. Figure 1 shows a histogram of transfer rates over all 512 x 512 routes of an AAPC without congestion control on a 512-processor T3D.
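To make the notion of a stepwise schedule concrete, the following sketch builds a generic shift schedule in which every step is a permutation of the processors. This is not the optimized schedule used for the measurements; the name `shift_schedule` and the shift rule are assumptions chosen because they are the simplest schedule with P steps and P^2 transfers.

```python
def shift_schedule(num_procs):
    """Return P phases; in phase s, processor p sends to (p + s) % P.

    Each phase is a permutation, so every processor sources exactly one
    transfer and sinks exactly one transfer per phase. A barrier between
    phases (not modelled here) would keep the phases from overlapping.
    """
    return [[(p + s) % num_procs for p in range(num_procs)]
            for s in range(num_procs)]
```

Phase 0 is the local copy from each processor to itself; the remaining P - 1 phases cover all other source-destination pairs exactly once.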
In this figure we can clearly see how congestion affects the throughput (notice also the good fit to a Poisson distribution, due to the short packets and nearly random schedules).

These slow routes are due to a poor interaction between the machine's routing decisions and the linear 8-node ring patterns used by the AAPC. Routing in the x and z dimension of the T3D network is done within a torus of 8 nodes, and some routes must travel exactly 4 hops. These routes can be routed either +4 on the surface or -4 through the back-loop, with bit 1 of a processor's x and z coordinates determining which way to go. The bit patterns for the particular AAPC patterns used result in the same route being chosen every time. The solution is a revision of the AAPC patterns (see [8]).

Despite the modified algorithm, several routes still suffer from congestion. Specifically, routes at certain positions in the z dimension with nearby service nodes are affected (service nodes are only connected in the x and z dimensions, and are always in the "downstream" path of a regular mesh node). This is due to a flaw in the routing table: routing shortcuts should only "cut" a corner on a right turn, but the corner is also being cut on a left turn, resulting in an early corner turn in a service node rather than in a compute node. As a consequence the y link between the service node and the corner node is overloaded. This is shown in Figure 3, with a service node inserted into a 3 x 3 section of mesh. Eliminating the erroneous corner turn from the routing table improves aggregate performance to 27,852 MByte/s (54.4 MByte/s per processor), or 73% of the nominal bisection bandwidth (note that the modification results in a more balanced network, and that no routes get longer).
However, there are still a few congested routes, due to another routing table error. Specifically, ties between surface and back-loop routes crossing service nodes are determined according to absolute distance (including the service nodes) rather than distance using regular compute nodes only.
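The role of tie-breaking for 4-hop routes on an 8-node ring can be illustrated with a small model. This is a sketch under stated assumptions: the text only says that bit 1 of a coordinate decides the direction, so the choice of the source coordinate and the mapping "bit clear means +4" are hypothetical, as are the names `ring_direction` and `link_loads`. Service nodes are not modelled.

```python
RING_SIZE = 8

def ring_direction(src, dst):
    """Return +1, -1, or 0 (no move) for a route on the 8-node ring."""
    forward = (dst - src) % RING_SIZE
    if forward == 0:
        return 0
    if forward < RING_SIZE // 2:
        return 1
    if forward > RING_SIZE // 2:
        return -1
    # Tie: exactly 4 hops either way. Assumed rule: bit 1 of the source
    # coordinate selects the surface (+) or back-loop (-) direction.
    return 1 if (src >> 1) & 1 == 0 else -1

def link_loads(routes):
    """Count how many routes cross each directed link (node, direction)."""
    loads = {}
    for src, dst in routes:
        step = ring_direction(src, dst)
        node = src
        while node != dst:
            loads[(node, step)] = loads.get((node, step), 0) + 1
            node = (node + step) % RING_SIZE
    return loads
```

If a communication pattern only ever pairs up sources whose bit 1 is equal, all of its 4-hop routes take the same direction and the links in the other direction stay idle, which is the kind of imbalance the revised patterns avoid.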
After another routing table modification we achieve close to uniform performance on all routes, as can be seen in Figure 4, with 28,416 MByte/s for a 512-processor T3D (55.5 MByte/s per processor), or 74% of the nominal bisection bandwidth. Table 1 shows the evolution of performance during our optimization of the routing table and communication patterns. For smaller machine sizes we achieve a higher percentage of the bisection bandwidth, and we therefore speculate that some of the remaining 26% is due to flow-control packets.

We can either pack buffers locally at the sender, transfer contiguous blocks, and unpack them at the receiver, or we can use "chaining" to perform the whole operation, similar to the methods used in vector machines for memory to memory transfers.³ A similar problem occurs for regular transposes. The tradeoffs are described in [6], which also introduces the copy-transfer model used to reason about the different internal data transfers involved in a composite communication primitive such as a full permutation AAPC.
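The two alternatives can be contrasted in a small single-address-space sketch, assuming the permutation is given as a list of (source index, destination index) pairs. The function names are hypothetical, and the "transfer" is modelled as a plain list copy rather than a network operation.

```python
def permute_packed(src, index_pairs):
    """Buffer packing: gather, one contiguous transfer, then scatter."""
    dst = [None] * len(src)
    # Gather: indexed reads into a contiguous send buffer.
    send_buffer = [src[s] for s, _ in index_pairs]
    # Transfer: a contiguous block copy stands in for the network.
    recv_buffer = list(send_buffer)
    # Scatter: indexed writes from the contiguous receive buffer.
    for value, (_, d) in zip(recv_buffer, index_pairs):
        dst[d] = value
    return dst

def permute_chained(src, index_pairs):
    """Chaining: each element goes straight from source to destination."""
    dst = [None] * len(src)
    for s, d in index_pairs:
        # Indexed read and indexed (remote) store in a single pass,
        # with no intermediate buffers.
        dst[d] = src[s]
    return dst
```

Both functions compute the same result; they differ in how many times each element passes through local memory, which is the quantity the copy-transfer model accounts for.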
In the copy-transfer model a permutation is defined by its memory access pattern. While the data elements are stored in a distributed array, the permutation itself is specified by a table of index pairs, where each table entry contains a source index and a destination index. Using the direct deposit model [7], synchronization and consistency are guaranteed by the use of hardware barriers, and the data transfers are performed by remote stores using the messaging system. For distributed memory systems the index relation table must be maintained in a certain order, in order to group all transfers for a given source-destination pair. As in the case of regular transposes, the correct tradeoff between packing buffers and chaining multiple gather, transfer and scatter steps together can be determined from the measured machine parameters. For the T3D we obtain two formulas, one for buffer packing and one for chained transfers; for further details see [6]. The derived performance figures show that chained communication is a clear winner for random permutations (ωQω remote copies, where ω indicates an indexed random access to memory). Comparing transposes to permutations we find that the indexed access patterns in permutations are more expensive than the fixed strides found in transposes.⁴

An interesting specialization of general random permutations is the "grouped" permutation, where the source data is pre-sorted per destination processor. Access to the source elements is now contiguous and there is no need for a gather operation. This primitive, written as 1Qω in the copy-transfer model, is a fast building block for counting sorts since the coarse local pre-sort of the data elements can be achieved with an extra store while performing the second counting step of a sorting pass. In Table 2 we compare estimated and measured performance, and calculate the fraction of nominal bisection bandwidth achieved. k is a constant stride and is assumed to be >> 1.

³ One difference between a shared-memory machine such as the Cray C90 and a distributed-memory message-passing machine such as the Cray T3D manifests itself in the fixed overhead to select a new communication partner. In the case of the T3D this overhead is one to two orders of magnitude higher than the time to handle a single store, and therefore a one-pass permutation algorithm ("dealing out the card deck") will not result in good performance.

⁴ In transposes the indices can be computed on the fly during the communication step, while for a true random permutation both the source and destination index must be loaded from local memory.

Using AAPC for a T3D counting sort
To illustrate the application performance gains possible with a fast AAPC, we have developed an efficient counting sort for the T3D. This can be used as the basic step of a radix sort algorithm or can be further refined into a single-pass sample sort. Note that all the performance-critical building blocks of a counting sort are loops with non-cacheable memory operations, and that on systems with modern high performance microprocessors arithmetic operations are cheap relative to local memory and communication operations. This permits an elegant description of the counting sort algorithm in the copy-transfer model in terms of local and remote memory system performance. It is instructive to use a common yardstick (i.e., MByte/s of data moved) to derive a limit on memory system performance for all components of the algorithm.
We can measure these numbers with micro-benchmarks and relate them to architectural specifications. These performance limits can then be used to decide on the optimal way to compose local bucket scans, transposes, and permutations into a counting sort for a particular machine. Table 3 shows the measured and derived costs of the operations required for a counting sort on the T3D. For both k and ω access types the average stride is about half the number of buckets.
Table 3 (columns: Model, Measured): Measured and derived costs of operations required for a T3D counting sort, assuming a uniform distribution of 16-bit keys.
The memory system behavior of the bucket counting loop consists of reading a contiguous stream of indices and accessing the bucket array. Each bucket access is a read-increment-write, resulting in a widely strided store. Since there is no additional destination index needed for the store, its access pattern is strided (k) rather than indexed (ω). The second local step performs the pre-sort necessary for an efficient grouped permutation on a distributed-memory message-passing machine.

... we would expect actual performance to be much closer to the model's prediction.
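The two local steps described above, bucket counting and the pre-sort, can be sketched on a single processor as follows. This is a simplified illustration, not the T3D code: the names `count_buckets` and `presort_by_bucket` are hypothetical, keys carry no attached data, and the pre-sort groups by bucket rather than by destination processor.

```python
def count_buckets(keys, num_buckets):
    """First local step: histogram of keys over the buckets."""
    counts = [0] * num_buckets
    for key in keys:
        # Read-increment-write on the bucket array; the keys themselves
        # are read as a contiguous stream.
        counts[key] += 1
    return counts

def presort_by_bucket(keys, counts):
    """Second local step: place each key at its bucket's next free slot."""
    # Exclusive prefix sum turns counts into starting offsets.
    offsets = []
    running = 0
    for count in counts:
        offsets.append(running)
        running += count
    out = [None] * len(keys)
    for key in keys:
        out[offsets[key]] = key  # the extra store that groups the data
        offsets[key] += 1
    return out
```

After the pre-sort, all elements bound for the same bucket are contiguous, so a subsequent transfer can read its source data without a gather.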
The copy-transfer model assumes that all basic steps operate on the same number of elements. This is not true for the bucket-work and key-work loops of a counting sort, and therefore the model is extended with a new parameter f, denoting the ratio between bucket work and key work. The optimal trade-off between bucket work and key work must be computed separately: an explanation of how to do this is given in [10]. For our counting sort algorithm, we assume a 512-processor T3D, one billion keys (2^30 keys total, 2^21 per processor), one 16-bit radix pass over 65536 buckets, and 48 bits of data attached to each key (since the DEC Alpha architecture can handle 64-bit words at the same speed as 16- or 32-bit words).
Given these parameters, f is 1/32, and the bucket work is negligible.

In practice, the measured performance of a counting sort algorithm on a 512-processor T3D is 3.99 MBytes/s per processor, corresponding to 2042 MBytes/s for the full machine, or 255 million keys per second. For both the modelled and measured performance, the speed of local and remote memory accesses is the limiting factor, although the asymptotic O(n) complexity of counting sort and the congestion-free AAPC routing imply linear scalability across a wide range of numbers of keys and machine sizes.
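The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that one MByte is 10^6 bytes, that each element occupies 8 bytes (16-bit key plus 48 bits of data), and that the bucket-to-key ratio is the number of buckets divided by the number of keys per processor; all variable names are illustrative.

```python
PROCESSORS = 512
TOTAL_KEYS = 2 ** 30
BUCKETS = 2 ** 16
BYTES_PER_ELEMENT = 8          # 16-bit key + 48-bit data (assumed packing)
RATE_PER_PROC_MBYTES = 3.99    # measured figure quoted in the text

keys_per_processor = TOTAL_KEYS // PROCESSORS           # 2**21
bucket_to_key_ratio = BUCKETS / keys_per_processor      # 1/32

machine_rate_mbytes = RATE_PER_PROC_MBYTES * PROCESSORS
keys_per_second = machine_rate_mbytes * 1e6 / BYTES_PER_ELEMENT
sort_seconds = TOTAL_KEYS / keys_per_second

print(keys_per_processor, bucket_to_key_ratio)
print(round(machine_rate_mbytes), round(keys_per_second / 1e6))
print(round(sort_seconds, 1))
```

Under these assumptions the full-machine rate comes to roughly 2043 MByte/s and about 255 million keys per second, so a single pass over 2^30 keys takes a little over 4 seconds.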
A previously reported portable implementation of radix sort written in Split-C achieves a sorting performance of 4 million 32-bit keys in just under 6 seconds on an 8-processor T3D [2], for a sorting performance of approximately 330 kBytes/s per processor. This is equivalent to two 16-bit counting sort passes, each with a memory performance of approximately 660 kBytes/s per processor.
A subsequent implementation of a highly optimized sample sort, which only requires a single counting and routing pass, sorts 4 million 64-bit integers in 0.894 seconds on a 16-processor T3D, for a sorting performance of 2.24 MBytes/s per processor [4].
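The per-processor rates of the two earlier implementations follow directly from the quoted key counts and times. The helper below is a sketch with a hypothetical name, and it assumes "4 million" means 4 x 10^6 keys and that rates count key bytes only.

```python
def per_processor_rate(num_keys, key_bytes, seconds, processors):
    """Bytes sorted per second per processor."""
    return num_keys * key_bytes / seconds / processors

# Split-C radix sort: 4 million 32-bit keys, about 6 s, 8 processors.
split_c_rate = per_processor_rate(4_000_000, 4, 6.0, 8)

# Sample sort: 4 million 64-bit integers, 0.894 s, 16 processors.
sample_sort_rate = per_processor_rate(4_000_000, 8, 0.894, 16)

print(round(split_c_rate / 1e3), "kBytes/s per processor")
print(round(sample_sort_rate / 1e6, 2), "MBytes/s per processor")
```

Since the Split-C run took just under 6 seconds, its true rate is slightly above the value computed here with exactly 6 seconds.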

Conclusions
We have quantified the performance benefits of a global well-synchronized all-to-all personalized communication implementation, using the best-known message-passing mechanism on the Cray T3D. We have also shown the need to consider low-level engineering details, such as the choice of AAPC patterns and the particular routing table used, in order to fully utilize the network of a commercial machine. Our research raised the efficiency of AAPC from 41% to 74% of the nominal network bisection bandwidth.
In the process we realized that AAPC is not limited to a complete exchange of contiguous data blocks, and that there are general permutations of elements involving expensive local memory operations. We used the copy-transfer model, a simple throughput-oriented model of memory system performance, to derive an upper performance bound for permuting and radix sorting on the Cray T3D. Analyzing the performance in terms of the architecture of the memory and communication systems, it is apparent that the bottlenecks lie in the memory system rather than in the communication network or the combinatorial aspects of the algorithm. We implemented a counting sort algorithm that approaches these performance bounds, sorting a billion words (16-bit key plus 48-bit data) in 4 seconds on a 512-processor T3D.
