3 research outputs found

Optimizing the distribution of large data sets in theory and practice

Author: Becker
Boden
Bolosky
Floyd
Hellwagner
Hutchinson
Kotsopoulos
Kurmann
Kurmann
Paul
Rauch
Rauch
Rauch
Seifert
Stricker
Publication venue: 'Wiley'
Publication date: 01/01/2002

Crossref

Optimizing memory system performance for communication in parallel computers

Author: Blelloch G.
Numrich R.
Schwabe E. J.
Stricker T.
T. Gross
T. Stricker
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Optimizing Memory System Performance for Communication in Parallel Computers

Author: T. Gross
T. Stricker

Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of patterns as part of communication steps, e.g., regular (i.e., blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that the performance is limited by the memory bandwidth rather than the network bandwidth. Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter-node communication and presents the copy-transfer model; this model captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon. In particular, we identify two methods to transfer data between nodes in these two machines. In buffer-packing transfers, a contiguous block of data is transferred across the network. If the data are not stored contiguously, they are copied to (gathering) or from (scattering) buffers in local memory before and after the transfer. Chained transfers perform gathering, transfer, and scattering in one step, reading the data elements with some non-sequential pattern and immediately transferring them on to the destination. Our model and measurements indicate that chaining of the gather, transfer, and scatter operations results in better performance than buffer packing for many important access patterns.
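
The difference between the two transfer methods described in the abstract can be illustrated with a small single-process sketch. This is not the authors' code or their copy-transfer model: the function names, the use of Python lists as stand-ins for node memories, and the simple count of element moves are all assumptions made for illustration, and a real system would move data over a network rather than between lists.

```python
def buffer_packing_transfer(src, src_idx, dst, dst_idx):
    """Hypothetical sketch: gather into a contiguous buffer, move the block, scatter."""
    send_buf = [src[i] for i in src_idx]   # gather: local copy on the sending node
    recv_buf = list(send_buf)              # stand-in for the contiguous network transfer
    for k, j in enumerate(dst_idx):        # scatter: local copy on the receiving node
        dst[j] = recv_buf[k]
    return 3 * len(send_buf)               # element moves: gather + transfer + scatter


def chained_transfer(src, src_idx, dst, dst_idx):
    """Hypothetical sketch: read each element and deliver it to its final place in one step."""
    moves = 0
    for i, j in zip(src_idx, dst_idx):
        dst[j] = src[i]                    # gather, transfer, and scatter fused per element
        moves += 1
    return moves


src = list(range(16))                      # "memory" of the sending node
src_idx = list(range(0, 16, 4))            # strided read pattern: 0, 4, 8, 12
dst_idx = [3, 2, 1, 0]                     # indexed (irregular) write pattern

dst_packed = [None] * 4
dst_chained = [None] * 4
packed_moves = buffer_packing_transfer(src, src_idx, dst_packed, dst_idx)
chained_moves = chained_transfer(src, src_idx, dst_chained, dst_idx)
```

Both functions leave the destination in the same state; the toy move count only shows that buffer packing touches each element in memory more often than chaining does, which is the kind of memory-system cost the abstract says limits performance. It does not reproduce any measurement from the paper.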

CiteSeerX