275 research outputs found

    MPI Collective Operations over IP Multicast

    Many common implementations of the Message Passing Interface (MPI) implement collective operations over point-to-point operations. This work examines IP multicast as a framework for collective operations. IP multicast is not reliable: if a receiver is not ready when a message is sent via IP multicast, the message is lost. Two techniques for ensuring that a message is not lost due to a slow receiving process are examined. The techniques are implemented and compared experimentally over both a shared and a switched Fast Ethernet. The average performance of collective operations is improved as a function of the number of participating processes and message size for both networks.

    Efficient Broadcast for Multicast-Capable Interconnection Networks

    The broadcast function MPI_Bcast() from the MPI-1.1 standard is one of the most heavily used collective operations in the message-passing programming paradigm. This diploma thesis uses multicast, a feature supported by several network technologies (such as Ethernet and InfiniBand), to create an efficient MPI_Bcast() implementation, especially for large communicators and small messages. A preceding analysis of existing real-world applications leads to an algorithm that performs well not only on synthetic benchmarks but even better for a wide class of parallel applications. The resulting broadcast has been implemented for the open-source MPI library "Open MPI" using IP multicast. The results show that the new broadcast is usually better than existing point-to-point implementations once the number of MPI processes exceeds 8 nodes. The performance gain reaches a factor of 4.9 on 342 nodes, because the new algorithm scales practically independently of the number of processes involved.
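
As a rough illustration of why a multicast-based broadcast can scale almost independently of the process count, the sketch below counts sequential communication steps under an idealized model. The function names and the one-step-per-message cost model are assumptions made for illustration and are not taken from the thesis; real latency also depends on message size, reliability handling, and the network hardware.

```python
def linear_bcast_steps(num_procs: int) -> int:
    """Point-to-point, naive: the root sends to every other process in turn."""
    return max(num_procs - 1, 0)


def binomial_tree_bcast_steps(num_procs: int) -> int:
    """Point-to-point, tree-based: each round doubles the number of
    processes that hold the data, so ceil(log2(P)) rounds are needed."""
    if num_procs <= 1:
        return 0
    # (P - 1).bit_length() equals ceil(log2(P)) for P >= 2, without floats.
    return (num_procs - 1).bit_length()


def multicast_bcast_steps(num_procs: int) -> int:
    """Idealized multicast: one send reaches all receivers at once
    (assumes no loss and that every receiver is ready)."""
    return 0 if num_procs <= 1 else 1


if __name__ == "__main__":
    for p in (8, 64, 342):
        print(p, linear_bcast_steps(p), binomial_tree_bcast_steps(p),
              multicast_bcast_steps(p))
```

In this simplified model the point-to-point step counts grow with the number of processes while the multicast count stays constant, which is consistent with the scaling behaviour the abstract describes; it does not reproduce the measured factor of 4.9.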

    Evaluation of Real-Time Fiber Communications for Parallel Collective Operations

    Real-Time Fiber Communications (RTFC) is a gigabit-speed network that has been designed for damage-tolerant local area networks. In addition to its damage-tolerant characteristics, it has several features that make it attractive as a possible interconnection technology for parallel applications in a cluster of workstations. These features include support for broadcast and multicast messaging, a memory cache in the network interface card, and support for very fine-grain writes to the network cache. Broadcast data is captured in the network cache of all workstations in the network, providing a distributed shared memory capability. In this paper, RTFC is introduced. The performance of standard MPI collective communications using TCP protocols over RTFC is evaluated and compared experimentally with that of Fast Ethernet. It is found that the MPI message passing libraries over traditional TCP protocols over RTFC perform well with respect to Fast Ethernet. Also, a new approach that uses direct network cache movement of buffers for collective operations is evaluated. It is found that execution time for parallel collective communications may be improved via effective use of the network cache.

    PCODE: an efficient and reliable collective communication protocol for unreliable broadcast domain

    Existing programming environments for clusters are typically built on top of a point-to-point communication layer (send and receive) over local area networks (LANs) and, as a result, suffer from poor performance in collective communication. For example, a broadcast that is implemented using a TCP/IP protocol (which is a point-to-point protocol) over a LAN is inefficient because it does not exploit the fact that the LAN is a broadcast medium. We have observed that the main difference between a distributed computing paradigm and a message passing parallel computing paradigm is that, in a distributed environment, the activity of every processor is independent, while in a parallel environment the collection of the user-communication layers in the processors can be modeled as a single global program. We have formalized the requirements by defining the notion of a correct global program. This notion provides a precise specification of the interface between the transport layer and the user-communication layer. We have developed PCODE, a new communication protocol that is driven by a global program, and proved its correctness. We have implemented the PCODE protocol on a collection of IBM RS/6000 workstations and on a collection of Silicon Graphics Indigo workstations, both communicating via UDP broadcast. The experimental results we obtained indicate that the performance advantage of PCODE over the current point-to-point approach (TCP) can be as high as an order of magnitude on a cluster of 16 workstations.

    Performance evaluation of an open distributed platform for realistic traffic generation

    Network researchers have dedicated a notable part of their efforts to modeling traffic and to implementing efficient traffic generators. We feel that there is a strong demand for traffic generators capable of reproducing realistic traffic patterns according to theoretical models while also delivering high performance. This work presents an open distributed platform for traffic generation, called the distributed internet traffic generator (D-ITG), capable of producing traffic (network, transport and application layer) at packet level and of accurately replicating appropriate stochastic processes for both the inter-departure time (IDT) and packet size (PS) random variables. We implemented two different versions of our distributed generator. In the first one, a log server is in charge of recording the information transmitted by senders and receivers, and these communications are based on either TCP or UDP. In the other one, senders and receivers make use of the MPI library. In this work, a complete performance comparison among the centralized version and the two distributed versions of D-ITG is presented.
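
To make the idea of packet-level generation driven by IDT and PS random variables concrete, here is a minimal sketch. It is not D-ITG's interface: the function name, its parameters, and the choice of an exponential IDT with a uniform PS are all assumptions made purely for illustration.

```python
import random


def generate_schedule(num_packets, mean_idt_s, min_size, max_size, seed=0):
    """Return a list of (send_time_s, packet_size_bytes) pairs.

    Assumed model (hypothetical, for illustration only):
      - IDT: exponentially distributed with mean `mean_idt_s` seconds
      - PS: uniformly distributed integer in [min_size, max_size] bytes
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    send_time = 0.0
    schedule = []
    for _ in range(num_packets):
        # expovariate takes the rate (1 / mean) as its argument.
        send_time += rng.expovariate(1.0 / mean_idt_s)
        size = rng.randint(min_size, max_size)
        schedule.append((send_time, size))
    return schedule


if __name__ == "__main__":
    for t, size in generate_schedule(5, mean_idt_s=0.01,
                                     min_size=64, max_size=1500):
        print(f"{t:.6f} s  {size} bytes")
```

A real generator would hand each pair to a sender that waits until the scheduled time and then emits a packet of the given size; swapping in other distributions only changes the two sampling lines.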

    Design and Evaluation of a Collective IO Model for Loosely Coupled Petascale Programming

    Loosely coupled programming is a powerful paradigm for rapidly creating higher-level applications from scientific programs on petascale systems, typically using scripting languages. This paradigm is a form of many-task computing (MTC) which focuses on the passing of data between programs as ordinary files rather than messages. While it has the significant benefits of decoupling producer and consumer and allowing existing application programs to be executed in parallel with no recoding, its typical implementation using shared file systems places a high performance burden on the overall system and on the user who will analyze and consume the downstream data. Previous efforts have achieved great speedups with loosely coupled programs, but have done so with careful manual tuning of all shared file system access. In this work, we evaluate a prototype collective IO model for file-based MTC. The model enables efficient and easy distribution of input data files to computing nodes and gathering of output results from them. It eliminates the need for such manual tuning and makes the programming of large-scale clusters using a loosely coupled model easier. Our approach, inspired by in-memory approaches to collective operations for parallel programming, builds on fast local file systems to provide high-speed local file caches for parallel scripts, uses a broadcast approach to handle distribution of common input data, and uses efficient scatter/gather and caching techniques for input and output. We describe the design of the prototype model and its implementation on the Blue Gene/P supercomputer, and present preliminary measurements of its performance on synthetic benchmarks and on a large-scale molecular dynamics application. Comment: IEEE Many-Task Computing on Grids and Supercomputers (MTAGS08) 200