Abstract
Introduction
The currently most important API definition for parallel programming using the message-passing paradigm is the Message Passing Interface (MPI [10, 11]). MPI offers numerous variants of the basic point-to-point message passing functions MPI_Send and MPI_Recv. Additionally, it contains a wide range of so-called collective operations (CO). They are designed to perform one-to-many (1:N), many-to-one (N:1) or many-to-many (N:N) data distributions. Typical examples of such operations are MPI_Bcast, which is a 1:N multicast, and MPI_Reduce, which performs an N:1 data gathering with simultaneous combination of each process' data vector into one result vector. An example of an N:N operation is MPI_Allreduce, which can be described as MPI_Reduce followed by MPI_Bcast, using the result of the previous reduction.

This work was performed at the Chair for Operating Systems, Technical University of Aachen, http://www.lfbs.rwth-aachen.de
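The semantics of the three kinds of collective operations can be illustrated without any MPI library. The following is a minimal sketch that simulates the processes as entries of a Python list; the function names are chosen for illustration only and are not part of MPI or SCI-MPICH.

```python
from functools import reduce
from operator import add

def bcast(vectors, root=0):
    """1:N: every simulated process ends up with a copy of the root's vector."""
    return [list(vectors[root]) for _ in vectors]

def reduce_to_root(vectors, op=add):
    """N:1: element-wise combination of all vectors into one result vector."""
    return [reduce(op, column) for column in zip(*vectors)]

def allreduce(vectors, op=add):
    """N:N: a reduction followed by a broadcast of the reduction result."""
    result = reduce_to_root(vectors, op)
    return [list(result) for _ in vectors]

# One data vector per simulated process (illustrative values).
data = [[1, 2], [10, 20], [100, 200]]
print(reduce_to_root(data))
print(allreduce(data))
```

The sketch only fixes what each process holds after the operation; how the data travels between the processes is exactly the degree of freedom that the optimizations in this paper exploit.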
A CO for a specified "communicator" (group) of $P$ processes cannot complete for all $P$ processes unless all of them have invoked the CO. This inherent characteristic of all collective operations allows for optimizations which are not possible for point-to-point communication. The active participation of all processes involved in the CO makes it possible to coordinate the data transfers between the processes to make optimal use of the underlying communication system. This mostly concerns the interconnect, but also other communication resources like shared memory on nodes running more than one process.
In this paper, we present optimizations for the COs mentioned above for SCI-MPICH [22]. SCI-MPICH is an MPI implementation for the SCI interconnect [7], which is utilized via the SISCI API [19, 2]. The optimizations consist of data-transfer protocols based on the pipelining principle. Additionally, they make use of concurrent intra- and inter-node data movements and computation, which leads to a speedup of more than 4 compared to the generic tree-based algorithms. An overview of related work in this area is given in Chapter 2. The basic aspects of inter-node communication using SCI are shown in Chapter 3. Chapter 4 explains our optimization approach and derives analytical models from the communication operations. The results of these models are compared with some experimental results in Chapter 5.
Related Work
During the last decade, much work has been performed on collective communication in general, and on implementations of COs in MPI. Mitra et al. [12] give basic analytical models of the most common collective operations, with different algorithms for short and long vectors and consideration of mesh topologies. Their algorithms contain a variant of pipelining for long vectors. For short vectors, they propose an interleaving of communication and computation (for reduce operations). Their implementation of those algorithms for the Intel Paragon achieved considerable performance increases for MPI_Bcast and MPI_Allreduce compared with existing approaches. Despite this early work, when Luecke [9] evaluated the performance of collective operations on SGI and IBM systems four years later, he found that on each system certain COs did not perform as well as a reasonable generic algorithm (while others performed better). Likewise, the popular open-source MPI implementation MPICH has only recently integrated a range of existing generic algorithms for COs to replace the previous ones, and achieved significant performance increases on two different platforms [20]. An optimized algorithm for MPI_Reduce and MPI_Allreduce, implemented by Rabenseifner [16], also delivered increased performance when integrated into an MPI library for the SCI interconnect [5].
These achievements show that even generic algorithms, using message-based point-to-point communication, can increase performance if they are carefully adjusted to the characteristics of an interconnect. However, an even higher performance can be achieved if special capabilities of an interconnect are exploited by means which are not accessible through message-based point-to-point communication.
Fleischmann [3] did so by using direct shared-memory communication on the Convex, a cc-NUMA SMP system, considering the different performance levels of local and remote memory. For clusters with message-based interconnects, it is necessary to perform low-level accesses to the network adapter, as Bhoedjang et al. [1] did for Myrinet and Petrini et al. [15] for Quadrics. Both implementations have limitations, though: for Myrinet, a custom firmware (Myrinet Control Program) is required, which many users hesitate to use. For Quadrics, the applicability of the low-level COs depends on the placement of the processes in the network. It is not known whether Petrini's work is applied in the MPI implementation for Quadrics.
Oral and George [13] evaluated different communication topologies for multicast operations in a two-dimensional SCI torus. They achieve the best results with multiple sequential trees along one dimension of the torus. However, they only consider broadcast operations (no reduction operations), operate on a lower level than MPI, and do not use DMA transfers. The completion latency achieved for a broadcast of 512KB across 8 nodes is about twice as high as that of the approach presented in this paper. The hardware configuration of the test systems is identical to the setup used in this paper, except for slightly faster CPUs.
Sanders [17] theoretically describes a communication algorithm termed "fractional tree" for broadcast and reduction operations. It is a hybrid of a linear pipeline and a binary tree and is thus suited to alleviate the scaling problem of linear pipelining, which will show up in the performance modelling in Chapter 5.3.
SCI Communication
An SCI interconnect [7] between a number of nodes is based on PCI-SCI adapter boards (PSA [8]) which communicate via a switched fabric of point-to-point connections. This way, the fabric can have virtually any topology. Typical topologies are star (using a central switch) and k-ary n-cubes (typically two- or three-dimensional tori).
The PSA boards contain a PCI-to-PCI bridge and translate accesses to certain parts of the node-local PCI address space into accesses to the global SCI address space of all nodes. Likewise, a PSA also translates accesses to the global SCI address space (which come in as packets via the switched fabric) back to the local PCI address space if the local PCI-SCI address mapping indicates this. Furthermore, packets are routed through the PSAs on the fabric on their way from the source to the destination node.
Using a current-generation SCI interconnect, communication between processes on different nodes can be performed in two different ways:
PIO: By mapping remote memory segments into the local process address space, it is possible to write to or read from remote memory in the same way as local memory. The CPU can thus perform arbitrary load and store operations, as the mapping is fully transparent. However, the latency of accesses to remote memory is higher than for local memory. In addition, different consistency semantics apply to remote memory due to additional buffering on the PCI-SCI adapters and the lack of cache coherence for remote memory. This requires explicit memory synchronization, such as flushing local write buffers (flush), reloading local read buffers (load barrier) or waiting for completion of outstanding write operations (store barrier), to ensure certain memory states.

DMA: The PSA has an integrated DMA engine which allows for data transfers with very little CPU activity. Once the description of the desired transfer is loaded into the DMA engine, all data transfers are executed independently of CPU activity. For DMA into remote memory, the remote memory segment does not need to be mapped into the address space, but only needs to be "connected", as the transfers are based on physical (not virtual) addresses. Both the source and the target buffer need to be allocated via the SCI driver, or must have been registered for SCI usage.
The performance characteristics of these two types of data transfer are shown in Figure ??. These and all other benchmarks in this paper were performed on a cluster of 8 identical nodes. Each node has two Pentium III CPUs, 512MB of RAM and a ServerWorks ServerSet III-LE chipset, which offers a 64-bit, 66 MHz PCI bus. Benchmarks for this platform show that the minimal latency of PIO transfers of 1[…] µs (for a […]-byte write) is much lower than for DMA, which starts at about 30[…] µs (for a […]-byte write). Likewise, the bandwidth for PIO reaches 90% of peak bandwidth for block sizes of less than […] bytes. With DMA, a block size of […]K is required for 90% of peak bandwidth. However, the peak bandwidth of PIO transfers (1[…]0 MB/s) is lower than for DMA transfers (2[…]0 MB/s). Additionally, the bandwidth of PIO transfers depends on the implementation of the interface between CPU bus and PCI bus (the "chipset") and also on the memory access performance. The latter leads to a performance decrease for block sizes beyond 128K (50% of the CPU cache size) on our platform. On more recent platforms, a non-decreasing peak PIO bandwidth of 2[…]0 MB/s has been observed. The DMA bandwidth, in contrast, is independent of the mentioned interface, CPU performance and any cache effects.
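The relation between startup latency, peak bandwidth and the block size needed to approach the peak can be sketched with a simple linear cost model (transfer time = startup latency + size / peak bandwidth). This model and the parameter values below are assumptions made for illustration; they are placeholders, not the measured values of this platform.

```python
def effective_bandwidth(block_bytes, startup_s, peak_bytes_per_s):
    """Bandwidth seen for one block under the model t = startup + size / peak."""
    return block_bytes / (startup_s + block_bytes / peak_bytes_per_s)

def block_for_fraction(fraction, startup_s, peak_bytes_per_s):
    """Block size at which `fraction` (0 < fraction < 1) of peak is reached."""
    return fraction / (1.0 - fraction) * startup_s * peak_bytes_per_s

# Placeholder parameters, NOT the measurements reported in the text.
pio = dict(startup_s=2e-6, peak_bytes_per_s=150e6)
dma = dict(startup_s=30e-6, peak_bytes_per_s=250e6)

for name, params in (("PIO", pio), ("DMA", dma)):
    b90 = block_for_fraction(0.9, **params)
    print(f"{name}: 90% of peak from about {b90:.0f} bytes")
```

Under this model, a transfer type with a low startup latency reaches its peak with much smaller blocks, which is the qualitative behaviour described for PIO versus DMA.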
Employing Efficient Pipelining
Pipelining is a well-known technique to reduce the processing time $T$ of $N_t$ tasks which have to pass a number $N_s$ of sequential stages with identical processing latencies $l_s$. Naive sequential processing would lead to

$$T = N_t \cdot N_s \cdot l_s$$

with only one active stage at a time. Pipelined processing results in

$$T = (N_s + N_t - 1) \cdot l_s \qquad (1)$$

for the total processing time, with more than one (up to all) stages being active except for the very first and last processing step. In (1), $N_s$ identifies the stages which fill up the pipeline, while $N_t$ stages are processed in parallel. For pipelining a single task (like broadcasting a given amount of data), it is necessary to split it into sub-tasks which can be processed independently. In this case, the impact of $l_s$ increases, as it does not occur only $1 \cdot N_s$ times, but $N_t \cdot N_s$ times. As $l_s$ also contains a certain amount of overhead (the communication startup latency), this may result in reduced performance.
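The trade-off described above can be made concrete with a small calculation. The sketch below assumes the common textbook form of the pipelined processing time and a per-stage latency made up of a fixed overhead plus a size-dependent part; the function names and the numeric parameters are illustrative assumptions.

```python
def sequential_time(n_tasks, n_stages, stage_latency):
    """Every task passes every stage, one stage active at a time."""
    return n_tasks * n_stages * stage_latency

def pipelined_time(n_tasks, n_stages, stage_latency):
    """Textbook pipeline: fill phase plus one task completing per step."""
    return (n_stages + n_tasks - 1) * stage_latency

def split_task_time(size, n_parts, n_stages, overhead, bandwidth):
    """One task of `size` bytes split into n_parts sub-tasks.

    Each stage pays the fixed overhead once per sub-task (assumption)."""
    stage_latency = overhead + (size / n_parts) / bandwidth
    return pipelined_time(n_parts, n_stages, stage_latency)

# Illustrative parameters: 1 MiB through 7 stages at 200 MB/s.
for overhead in (5e-6, 1e-3):
    for parts in (1, 16, 4096):
        t = split_task_time(1 << 20, parts, 7, overhead, 200e6)
        print(f"overhead={overhead:g}s parts={parts}: {t * 1e3:.2f} ms")
```

With a small per-block overhead, splitting pays off; with a large overhead (as is typical for message-based point-to-point calls), a fine-grained split makes the transfer slower than no split at all.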
Therefore, when COs are based on MPI_Send and MPI_Recv, using pipelining is usually less efficient than using other communication topologies. For maximal performance of collective operations, it is crucial to achieve high concurrency of all stages. With all stages doing the same work independently of each other, this can also be achieved by using other processing schedules, especially tree-oriented topologies. A binary tree has the advantage of reaching maximal communication parallelism in $O(\log_2 N)$ instead of $O(N)$ steps. Additionally, the total number of communication steps is lower than for pipelining. The bandwidth of a 1:N- or N:1-style CO for a vector of size $v$ can be defined as

$$B = \frac{v}{T_{CO}}$$

with $T_{CO}$ being the time difference between the first call and the last exit of the collective function by any process involved.
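The difference in the number of steps needed before all processes are involved, and the bandwidth definition for a CO, can be sketched as follows. The helper names are illustrative; the tree depth is computed for a binary tree in which every additional level doubles the number of reachable processes.

```python
def tree_depth(n_procs):
    """Number of levels needed to reach n_procs processes: ceil(log2(n_procs))."""
    return (n_procs - 1).bit_length()

def pipeline_fill_steps(n_procs):
    """Steps until the last process of a linear pipeline receives its first block."""
    return n_procs - 1

def co_bandwidth(vector_bytes, t_first_call, t_last_exit):
    """Bandwidth of a 1:N or N:1 CO: vector size over the total CO time."""
    return vector_bytes / (t_last_exit - t_first_call)

for p in (2, 3, 4, 5, 8, 64):
    print(p, tree_depth(p), pipeline_fill_steps(p))
```

The tree depth grows by one at 3 and at 5 processes, whereas the fill time of a linear pipeline grows with every added process.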
This shows that the applicability of pipelining for the implementation of COs is limited. However, with the low-latency communication characteristics of SCI, it is possible to define communication protocols in which $l_s$ contains very little overhead. Together with concurrent intra- and inter-node communication, efficient pipelining is possible, as we will show below.
Generic Principle
To pipeline a single CO, the vector on which this operation is to be performed needs to be split into $N_t$ parts. Each of these parts should be transferred with minimal overhead to ensure efficiency of the pipeline. This requirement also applies to the flow control needed to avoid data corruption. With SCI, the most efficient data transfer between $P = 2$ processes on remote nodes is to use a ring buffer of $\mathit{ring}$ bytes of SCI shared memory at the receiving process. The sending process writes data into the ring buffer, $b$ bytes at a time. The receiving process reads the data with the same granularity. The flow control is realized via two additional locations in shared memory, which are updated by the processes according to the position in the ring buffer up to which they have written or read data. This technique has the potential disadvantage that only one transfer can be handled at a time. For COs this is not a problem, as MPI does not allow concurrent COs.
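The flow control via two position counters can be sketched as a single-producer, single-consumer ring buffer. This is a simplified in-process model under the assumption that one counter is written only by the sender and the other only by the receiver; the class and method names are hypothetical, and no SCI shared memory or memory barriers are involved.

```python
class RingChannel:
    """Ring buffer with one 'written' and one 'read' position counter."""

    def __init__(self, n_blocks):
        self.slots = [None] * n_blocks
        self.written = 0   # updated by the sender only
        self.read = 0      # updated by the receiver only

    def write(self, block):
        """Sender side: refuse to overwrite blocks that were not read yet."""
        if self.written - self.read >= len(self.slots):
            return False
        self.slots[self.written % len(self.slots)] = block
        self.written += 1
        return True

    def read_block(self):
        """Receiver side: return the next block, or None if nothing arrived."""
        if self.read >= self.written:
            return None
        block = self.slots[self.read % len(self.slots)]
        self.read += 1
        return block

def transfer(blocks, n_slots):
    """Move all blocks through the ring, interleaving sender and receiver."""
    ring, pending, received = RingChannel(n_slots), list(blocks), []
    while len(received) < len(blocks):
        while pending and ring.write(pending[0]):
            pending.pop(0)
        received.append(ring.read_block())
    return received

print(transfer(list(range(10)), n_slots=3))
```

Because each counter has exactly one writer, no lock is needed; the sender only has to see a sufficiently recent value of the read position before it reuses a slot.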
For a pipelined transfer with $P > 2$ processes, each process $p$ which has to receive data allocates an inbound ring buffer in its local SCI shared memory and tells process $p - 1$ how to access this buffer. Once it has received such information from process $p + 1$ (which is the next stage in the pipeline) for the outbound buffer, it polls the inbound buffer for data to arrive. Once this data arrives, it processes it locally as required (e.g. copying it into the local receive buffer that is provided by the user) and writes another block of data into the outbound buffer. Using PIO transfers, these two operations are serialized. Only when using DMA for outbound data transfers is it possible to overlap local processing of the data and the outbound data transfer. The next chapters will explain how this is realized for the different COs.
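The forwarding scheme can be illustrated with a step-synchronous simulation in which every process forwards, per step, at most one block it has already received. This is a simplified model (equal step duration for all stages, no flow-control stalls), and the function name is chosen for illustration.

```python
def simulate_pipeline(n_blocks, n_procs):
    """Return the number of steps until the last process holds all blocks.

    have[p] counts the blocks that process p has received so far;
    process 0 is the source and owns all blocks from the start."""
    have = [n_blocks] + [0] * (n_procs - 1)
    steps = 0
    while have[-1] < n_blocks:
        # Walk from the end of the pipeline so that each process only sees
        # what its predecessor held at the beginning of this step.
        for p in range(n_procs - 1, 0, -1):
            if have[p] < have[p - 1]:
                have[p] += 1
        steps += 1
    return steps

for procs in (2, 3, 8):
    print(procs, simulate_pipeline(n_blocks=4, n_procs=procs))
```

In this model the step count equals the number of links in the pipeline plus the number of blocks minus one, i.e. the fill phase is paid once and afterwards one block completes per step.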
The following parameters are required for the analytical models presented in the next chapters:

$P$: number of processes which take part in the CO
$v$: size of the vector to be communicated
$b$: size of the data block to be transferred without flow control
$l_{PIO}(b)$: latency of a PIO write operation for $b$ bytes into remote memory
$l_{DMA}(b)$: latency of a DMA write operation for $b$ bytes into remote memory
$l_{cpy}(b)$: latency of a copy operation for $b$ bytes in local memory
$l_{cmb}(b)$: latency of a combine computation for $b$ bytes
$l_{sb}$: latency of a store barrier

Each 1:N or N:1 CO has a root process. For 1:N, it is the process which owns the data for all processes before the CO is performed. Likewise, for N:1, it is the process which receives data from all processes, including itself. For simplicity, we assume that the root process always has rank 0. A simple rank transformation is required to satisfy this condition for arbitrary root processes.
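One common way to realize such a rank transformation is a cyclic shift of the ranks by the rank of the root. The sketch below shows this variant as an assumption; the text does not specify which transformation SCI-MPICH uses, and the function names are illustrative.

```python
def to_virtual_rank(rank, root, size):
    """Map a real rank to a virtual rank so that the root becomes rank 0."""
    return (rank - root) % size

def to_real_rank(virtual_rank, root, size):
    """Inverse mapping from a virtual rank back to the real rank."""
    return (virtual_rank + root) % size

size, root = 4, 2
print([to_virtual_rank(r, root, size) for r in range(size)])
```

The protocol can then be written entirely in terms of virtual ranks, and only the addressing of communication partners needs the inverse mapping.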
Pipelined Broadcast Operation
From the generic principle, we derived the pcast protocol, which is explained in Figure 1. It illustrates an MPI_Bcast for $P = 3$ with process $p_0$ being the root process sending its data to $p_1$ and $p_2$. They have allocated ring buffers of size $\mathit{ring}$, divided into $\mathit{ring}/b$ blocks. Likewise, the vector will be transferred through the pipeline in $N_t = v/b$ pieces. All pipelined protocols presented in this paper use flow control not only within the single pipeline, but also between subsequent calls to distinct COs. This situation can occur because one process at the start of the pipeline may already be done with its transmissions while other processes at the end are still busy.
We can see that $p_0$ only needs to send its data into the current incoming block of the ring buffer of $p_1$ using PIO transfers (action 1). PIO rather than DMA is used for this transfer because the user-supplied send buffer would need to be pinned before it could serve as a DMA source buffer. This functionality is not yet implemented in the standard SCI driver, although experimental work has shown that it can be done with little overhead [18, 23], and the protocol is prepared to make use of this functionality once it is available. $p_1$ in turn has to transfer the data from the forward block (which was the incoming block in the previous […]

For the pcast protocol, $T_{bcast}$ can be calculated according to (1):
Pipelined Reduce Operations
Similar to the pcast protocol, the rpipe protocol was defined and implemented to perform pipelined reduction operations. The data flow of the reduce pipeline is more complex than that of the broadcast pipeline, as the data is not only forwarded, but also modified by each process. This modification (the combine operation) requires that each process $p_i$, with $i$ between 0 and $P - 1$, has not only one, but two active blocks in its local ring buffer:
combine block: The data contained in the combine block is currently combined with the matching block of data of the local vector, stored in the send buffer.

forward block: The block that is the combine block in step $s$ of the pipeline becomes the forward block in step $s + 1$ and is forwarded to process $p_{i+1}$.

Figure 2 shows the data flow of the rpipe protocol for $P = 3$ processes. The pipeline starts at process $p_1$, which writes the data via PIO into the current incoming block of the ring buffer of $p_{P-1}$ (action 1). At the same time, $p_2$ combines the related data of its local send vector with the data in the combine block (action 2), and forwards data from the for[…]

The transfer time $T_{reduce}$ of the rpipe protocol can be determined as
The first two lines in (3) describe the fill time of the pipeline (at $p_1$, $p_0$ and the $P - 2$ other processes). This time is longer than for the pcast protocol, as there are nearly twice as many stages. Additionally, these stages perform two different operations. The remaining lines relate to the overlapped processing of the data blocks once the pipeline is filled: only the maximum of the two times (transfer via DMA or combine operation by the CPU) is relevant.

Closely related to MPI_Reduce is MPI_Scan, a non-exclusive prefix reduction. In contrast to a plain reduction, every process $p_i$ will have the combined vectors of all processes $p_j$, $j \leq i$. This operation is also performed via the rpipe protocol. The only difference is that each combine block is not only forwarded to the next process, but is also copied into the local receive buffer. Therefore, $T_{scan}$ is identical to $T_{reduce}$ except for the addition of $l_{cpy}(b)$ to the factor of copy operation latencies for the full pipeline.
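The relation between the pipelined reduction and the prefix reduction can be sketched as follows: each block travels along a chain of processes, every process combines it with its own data, and for a scan it additionally keeps a copy of the intermediate result. The chain order (rank 0 first), the use of addition as combine operation and the function name are assumptions for illustration; the loop runs sequentially, whereas the real protocol overlaps the handling of different blocks.

```python
def pipelined_scan(vectors, block):
    """Inclusive prefix reduction, processed block by block along a chain.

    Returns one result vector per process; the last entry is also the
    result that a plain reduction would deliver."""
    n_procs, length = len(vectors), len(vectors[0])
    results = [[None] * length for _ in range(n_procs)]
    for start in range(0, length, block):
        carry = None  # the block that is forwarded from process to process
        for p in range(n_procs):
            local = vectors[p][start:start + block]
            if carry is None:
                carry = local
            else:
                carry = [a + b for a, b in zip(carry, local)]  # combine
            results[p][start:start + block] = carry  # extra local copy (scan)
    return results

data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
for row in pipelined_scan(data, block=2):
    print(row)
```

Dropping the line that stores the intermediate result turns the scan back into a plain reduction, which mirrors the statement that the two operations differ only by one local copy per block.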
Pipelined Global Reduce Operation
A global reduce operation (MPI_Allreduce) can be performed by calling MPI_Reduce followed by MPI_Bcast. If this is done with the pipelined implementations, two pipeline fill operations will occur. In contrast, an algorithm as implemented by Rabenseifner runs "continuously", but without overlapping of combine and communication operations.
It is possible to define a single-pipeline protocol for MPI_Allreduce by running the pcast pipeline directly after the rpipe pipeline. This would remove one of the currently two pipeline fill times. However, such a protocol would require either passing the data two times through each node (at the same time) or buffering the complete vector at the root. For the first variant, the single DMA engine on the PSA will be a bottleneck. The second variant seems more worthwhile, but has not yet been implemented.
Performance Evaluation
Firstly, we evaluate the implementation of these protocols and compare the results with the generic algorithms. We will then evaluate certain characteristics of the pipeline protocols by applying our models.
Experimental Results
We have measured the optimized collective operations by running the Pallas MPI Benchmark [14] on the test cluster described above. The topology of the SCI interconnect used in this cluster is a single ring. However, the topology is not relevant for the performance of the presented pipelined data transfer protocols, as each node only communicates with its direct neighbor. This means that for each inter-node communication, a different, independent SCI link segment is used.
For each type of collective operation, we compare the pipelined version with the generic algorithm found in MPICH [4]. The generic algorithms use PIO-based point-to-point communication and tree-oriented communication topologies like binary or binomial trees, which give a scaling property of $O(\log N)$. The results are depicted in […] counts $P$. By default, the pipelined protocols are used for $v \geq 32$K; we also show the results for shorter vectors (using the generic algorithm) to see if this threshold is valid for all COs.
The results for MPI_Bcast show that the performance of the generic algorithm decreases with every new level of the binary tree used as communication topology (steps at $P = 3$ and $P = 5$). For $P$ = […] and $v$ = […]M, a value of 3[…] MB/s is achieved. Using the pcast protocol, the corresponding performance decreases only slightly for increasing values of $P$, from […] MB/s at $P = 3$ down to […] MB/s. The performance of the pcast protocol is higher than for the generic algorithm for all tests performed, even for small values of $P$ and $v$. Depending on the vector length, the pcast protocol is observed to be between 20% and more than 100% faster.
We observe the highest bandwidth values for $P = 2$ for both protocol variants. For the generic algorithm, $B_{P=2} \approx 2 \cdot B_{P=3}$ applies because for $P = 3$, two serialized transfers of the complete message have to be performed instead of just one. In contrast, the pcast protocol performs a pipelined transfer (in this case, between just two processes) and thereby achieves a bandwidth which is about 10% higher than for the generic algorithm. The significant performance decrease of about 25% for the transition from $P = 2$ to $P = 3$ shows that the bottleneck of the pcast protocol is not the first or last process in the pipeline, but the processes which need to forward the data. This applies if a value of $b$ is chosen which has a lower transfer bandwidth for DMA than for PIO (see Figure ?? for reference). Increasing $b$ might reduce this effect, but leads to longer pipeline fill delays. This will be further evaluated by the results of the models presented in Chapter 5.3.
Finally, the performance of the pcast protocol increases with the vector length $v$. This can be expected from pipeline processing: the impact of the $P$ pipeline processing phases for the fill time decreases relative to the number of end-to-end data throughput stages $v/b$ (see (2), first factor).
The bandwidth of MPI_Reduce for the generic algorithm is about 50% of the bandwidth of the generic MPI_Bcast, but the run of the curve looks similar. This shows that the amount of time for sending the vector is about equal to the time needed to combine the two vectors. For the rpipe protocol, the run of the curve looks different. For […]
The results for MPI_Scan are similar to the results for MPI_Reduce for both protocols. However, due to the additional local copy operation, both protocols suffer a 25 to 33% performance decrease relative to MPI_Reduce. The peak performance ratio exceeds 4 for $P$ = […] and $v$ = […]0K: the rpipe protocol achieves […]0 MB/s, while only 1[…] MB/s can be delivered by the generic algorithm.
The last CO that was improved via pipelining is MPI_Allreduce, which uses both protocols, rpipe followed by pcast. In Figure 4, we show not only the results for the generic algorithm and the pipelined protocols, but also the results for the improved generic algorithm as proposed by Rabenseifner. The numbers for both the generic algorithm and the pipelined protocols reflect the serialized execution of MPI_Reduce and MPI_Bcast, which results in
The results for the Rabenseifner implementation look different. As it uses a binary exchange pattern to create a higher parallelism for the combine operation, it works best for process counts which are a power of 2. For other process counts, the performance decreases significantly, as extra communication steps are required. Compared with the generic algorithm using a tree topology, this leads to about a 100% performance increase for $P = 2^n$, but to only slightly better performance for other values of $P$. For $P = 2^n$, the pipelined protocols deliver a performance up to 33% higher than Rabenseifner for $v \geq 2$M. Below this threshold, Rabenseifner's implementation is about 20% faster. For other values of $P$, however, the pipelined protocols are always faster than Rabenseifner and deliver a performance which is up to twice as high.
Comparison with ScaMPI
In addition to the comparison with the generic algorithms, it is worthwhile to compare the new pipelined protocols of SCI-MPICH with ScaMPI's performance for the same operations. ScaMPI [6] is a commercial MPI implementation for the SCI interconnect. We could run a direct comparison on a cluster of Pentium 4 systems with Intel i860 chipset. This experiment took place in March 2002, using the most current version of ScaMPI at that time. Due to space limitations, we can only give the key performance values shown in Table 1 (the complete comparison is available at http://www.mp-mpich.de). Although the test platform has a low DMA performance of only 1[…]0 MB/s, it shows […]
Results from Modeling
The modeling of the pipeline protocols opens a wide range of possible explorations of the performance effects of varied runtime parameters and of validation of the implementation. For this paper, we confine ourselves to the following questions, as they cannot easily be answered by experiments on the available hardware:
How relevant is the pipeline block size for the effective performance? We vary $b$ over $P$ for different vector sizes $v$. This will show us if a single value for $b$ is sufficient, or how it may be chosen dynamically.

How does the performance develop for increasing values of $P$? We simulate this for different vector sizes $v$. We also compare the performance of the pipelined protocol with the generic algorithm. This will show us if switching points between these protocols should be established.
Which influence does the flow control have on the achieved performance? This will give us a hint if other flow control techniques can be used efficiently.
We have performed the simulation for the pcast protocol as a first approach to evaluating the characteristics of pipelined transfers, using the simpler model. Figure 5 shows three charts with the results which we will use to answer the three questions above.
The top chart shows the bandwidth per process $B^{p}_{bcast}$ for $v$ = 1[…]M over $b$ for different process counts $P$.

We can see that the choice of $b$ has a significant impact on $B^{p}_{bcast}$. The optimal value of $b$ for the evaluated process counts varies between 20[…]K for $P = 8$ and […]K for $P = 256$. Additionally, it shows that the range of $b$ in which it delivers nearly optimal performance decreases from about […]0K down to […]K if 256 instead of 8 processes are used. This means that choosing the value for $b$ becomes more important the more processes are used, as the negative performance impact of the same absolute deviation from the optimal value of $b$ increases. Therefore, $b$ needs to be chosen dynamically for each MPI communicator, depending on the number of processes in this communicator. SCI-MPICH currently does this using a simple approximation via a linear equation; more sophisticated methods are possible.
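The dependence of the best block size on the process count can be reproduced qualitatively with a simplified pipeline model: the transfer takes (number of links + number of blocks - 1) steps, and each step costs a fixed per-block overhead plus the block size divided by the bandwidth. This model, the parameter values and the function names are assumptions for illustration and not the analytical model of this paper.

```python
import math

def pipeline_time(n_procs, vector_bytes, block_bytes, overhead_s, bytes_per_s):
    """Simplified linear-pipeline model with identical stage latencies."""
    n_blocks = math.ceil(vector_bytes / block_bytes)
    step = overhead_s + block_bytes / bytes_per_s
    return (n_procs - 1 + n_blocks - 1) * step

def best_block(n_procs, vector_bytes, overhead_s=5e-6, bytes_per_s=200e6):
    """Pick the power-of-two block size (512 B .. 1 MiB) with minimal time."""
    candidates = [2 ** k for k in range(9, 21)]
    return min(candidates, key=lambda b: pipeline_time(
        n_procs, vector_bytes, b, overhead_s, bytes_per_s))

for procs in (8, 64, 256):
    print(procs, best_block(procs, vector_bytes=1 << 20))
```

In this model, a longer pipeline favours smaller blocks, because the fill phase is paid once per link and its cost grows with the block size; this matches the qualitative observation that the block size should depend on the number of processes in the communicator.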
The middle chart compares the bandwidth per process $B_{bcast}$ of the pcast protocol with the generic algorithm for different vector lengths $v$, varying over the process count $P$. For the generic algorithm, $B_{bcast}$ remains nearly constant for different values of $v$ because the point-to-point message bandwidth is nearly constant for the chosen values of $v$, too. In contrast, the performance of the pcast protocol depends heavily on both the process count $P$ and the vector length $v$. This difference in the scaling characteristics is not surprising, as the generic algorithm scales with $O(\log P)$, while the pcast protocol scales with $O(P)$.
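The different scaling behaviour suggests a switching point between the two approaches. The sketch below contrasts two deliberately simple cost models: a tree in which every level transfers the complete vector, and a linear pipeline with a fixed block size. Both models, the parameter values and the function names are illustrative assumptions, not the models or measurements of this paper.

```python
import math

OVERHEAD_S = 5e-6      # assumed per-transfer overhead
BYTES_PER_S = 200e6    # assumed transfer bandwidth
BLOCK = 16 * 1024      # assumed pipeline block size

def tree_time(n_procs, vector_bytes):
    """Tree model: ceil(log2(P)) levels, each sending the whole vector."""
    levels = (n_procs - 1).bit_length()
    return levels * (OVERHEAD_S + vector_bytes / BYTES_PER_S)

def pipe_time(n_procs, vector_bytes):
    """Pipeline model: fill phase over P - 1 links, then one block per step."""
    n_blocks = math.ceil(vector_bytes / BLOCK)
    return (n_procs - 1 + n_blocks - 1) * (OVERHEAD_S + BLOCK / BYTES_PER_S)

def faster_protocol(n_procs, vector_bytes):
    """Name of the model with the lower predicted completion time."""
    if pipe_time(n_procs, vector_bytes) < tree_time(n_procs, vector_bytes):
        return "pipeline"
    return "tree"

for procs in (8, 64, 256):
    for size in (16 * 1024, 256 * 1024, 4 * 1024 * 1024):
        print(procs, size, faster_protocol(procs, size))
```

For long vectors and few processes the pipeline model wins, while for short vectors and many processes the fill time dominates and the tree model is faster, so the switching point depends on both the process count and the vector length.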
Again, it depends on the performance characteristics of the specific platform which algorithm is better suited for given values of $P$ and $v$. For the shown platform, the pcast protocol should be chosen for $v$ […]K.

The bottom chart answers the last of our questions by showing $B^{p}_{bcast}$ over $b$ for different flow control delays $t$ between each transferred block for $v$ = 1[…]M. On the tested platform, $t$ = […] µs applies. We can see that for the optimal block size of […]K, the achieved bandwidth decreases by about 20% if the flow control delay is increased by a factor of 4. This indicates that it is feasible, but not without impact, to use less efficient means of flow control.
Summary and Outlook
We have shown that it is possible to efficiently implement a number of collective operations in MPI by overlapping CPU-driven data transfer and combine operations inside a node and DMA-driven data transfers between nodes. Using these transfers, we could employ new communication protocols for pipelined 1:N and N:1 operations like MPI_Bcast, MPI_Reduce and MPI_Scan. Even for the N:N operation MPI_Allreduce, we could achieve improvements in comparison with highly optimized non-pipelined algorithms. This shows most significantly for process numbers which are not a power of two.
In addition to the experiments with the implementation of the pipelining protocols for MPI collective operations, we presented a model of these protocols which makes it possible to predict the throughput for arbitrary process counts and data transfer performance settings. This way, we could estimate for which cases pipelining will be more efficient than the conventional tree-based algorithm. However, the wide range of parameters which influence the performance, and their mutual dependencies, make it difficult to determine the best choice, as the underlying performance characteristics vary between platforms. An automatic tuning approach as presented by Vadhiyar et al. [21] seems to be a solution to this problem.
For large process counts, the concept of a single linear pipeline was shown to be less efficient than the tree-based algorithms if the vector length is not big enough. It might be worthwhile to switch to more complex pipeline concepts which combine the advantages of pipelined transfers and reduced communication steps by splitting up the single pipeline into multiple sub-pipelines, as is done in the "fractional tree" algorithm proposed in [17].
While our implementation is based on SCI, the concept can be transferred to other high-speed interconnects with remote memory access (RDMA) capabilities. The complete source of the software used to achieve the results presented in this paper is available at http://www.mp-mpich.de.
