research

Multiphase complete exchange on a circuit switched hypercube

Abstract

On a distributed memory parallel computer, the complete exchange (all-to-all personalized) communication pattern requires each of n processors to send a different block of data to each of the remaining n - 1 processors. This pattern is at the heart of many important algorithms, most notably the matrix transpose. For a circuit switched hypercube of dimension d(n = 2(sup d)), two algorithms for achieving complete exchange are known. These are (1) the Standard Exchange approach that employs d transmissions of size 2(sup d-1) blocks each and is useful for small block sizes, and (2) the Optimal Circuit Switched algorithm that employs 2(sup d) - 1 transmissions of 1 block each and is best for large block sizes. A unified multiphase algorithm is described that includes these two algorithms as special cases. The complete exchange on a hypercube of dimension d and block size m is achieved by carrying out k partial exchange on subcubes of dimension d(sub i) Sigma(sup k)(sub i=1) d(sub i) = d and effective block size m(sub i) = m2(sup d-di). When k = d and all d(sub i) = 1, this corresponds to algorithm (1) above. For the case of k = 1 and d(sub i) = d, this becomes the circuit switched algorithm (2). Changing the subcube dimensions d, varies the effective block size and permits a compromise between the data permutation and block transmission overhead of (1) and the startup overhead of (2). For a hypercube of dimension d, the number of possible combinations of subcubes is p(d), the number of partitions of the integer d. This is an exponential but very slowly growing function and it is feasible over these partitions to discover the best combination for a given message size. The approach was analyzed for, and implemented on, the Intel iPSC-860 circuit switched hypercube. Measurements show good agreement with predictions and demonstrate that the multiphase approach can substantially improve performance for block sizes in the 0 to 160 byte range. This range, which corresponds to 0 to 40 floating point numbers per processor, is commonly encountered in practical numeric applications. The multiphase technique is applicable to all circuit-switched hypercubes that use the common e-cube routing strategy

    Similar works