MPI Collective Operations over IP Multicast
Many common implementations of Message Passing Interface (MPI) implement collective operations over point-to-point operations. This work examines IP multicast as a framework for collective operations. IP multicast is not reliable. If a receiver is not ready when a message is sent via IP multicast, the message is lost. Two techniques for ensuring that a message is not lost due to a slow receiving process are examined. The techniques are implemented and compared experimentally over both a shared and a switched Fast Ethernet. The average performance of collective operations is improved as a function of the number of participating processes and message size for both networks.
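As a rough illustration of why collectives built on point-to-point operations grow with the number of processes, the sketch below builds a binomial-tree broadcast schedule, one common point-to-point strategy, and counts its rounds and messages. It is not taken from the paper: the function name `binomial_broadcast_schedule` is hypothetical, and the comparison with a single multicast send ignores the reliability handling that the paper studies.

```python
def binomial_broadcast_schedule(num_procs):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs.

    Rank 0 is the root. In the round with stride `step`, every rank r < step
    (which already holds the message) sends to rank r + step, if it exists.
    """
    rounds = []
    step = 1
    while step < num_procs:
        sends = [(r, r + step) for r in range(step) if r + step < num_procs]
        rounds.append(sends)
        step *= 2
    return rounds


if __name__ == "__main__":
    for p in (2, 8, 16, 64):
        schedule = binomial_broadcast_schedule(p)
        messages = sum(len(r) for r in schedule)
        # An ideal multicast needs one send from the root, regardless of p.
        print(f"P={p}: {len(schedule)} rounds, {messages} point-to-point messages")
```

The tree needs about log2(P) rounds and P-1 messages, whereas an ideal multicast needs a single send from the root; the cost of making that send reliable is what the two examined techniques address.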
Efficient Broadcast for Multicast-Capable Interconnection Networks
The broadcast function MPI_Bcast() from the
MPI-1.1 standard is one of the most heavily
used collective operations for the message
passing programming paradigm.
This diploma thesis makes use of a feature called
"Multicast", which is supported by several
network technologies (like Ethernet or
InfiniBand), to create an efficient MPI_Bcast()
implementation, especially for large communicators
and small-sized messages.
A preceding analysis of existing real-world
applications leads to an algorithm that performs
well not only in synthetic benchmarks
but even better in a wide class of
parallel applications. The resulting
broadcast has been implemented for the
open source MPI library "Open MPI" using
IP multicast.
The results show that
the new broadcast usually outperforms
existing point-to-point implementations
as soon as the number of MPI processes exceeds
8 nodes. The performance gain reaches
a factor of 4.9 on 342 nodes, because the
new algorithm scales practically independently
of the number of involved processes.
Evaluation of Real-Time Fiber Communications for Parallel Collective Operations
Real-Time Fiber Communications (RTFC) is a gigabit-speed network that has been designed for damage-tolerant local area networks. In addition to its damage-tolerant characteristics, it has several features that make it attractive as a possible interconnection technology for parallel applications in a cluster of workstations. These characteristics include support for broadcast and multicast messaging, memory cache in the network interface card, and support for very fine-grained writes to the network cache. Broadcast data is captured in the network cache of all workstations in the network, providing a distributed shared memory capability. In this paper, RTFC is introduced. The performance of standard MPI collective communications using TCP protocols over RTFC is evaluated and compared experimentally with that of Fast Ethernet. It is found that the MPI message passing libraries over traditional TCP protocols over RTFC perform well with respect to Fast Ethernet. Also, a new approach that uses direct network cache movement of buffers for collective operations is evaluated. It is found that execution time for parallel collective communications may be improved via effective use of network cache.
Heterogeneous Cloud Systems Based on Broadband Embedded Computing
Computing systems continue to evolve from homogeneous systems of commodity-based servers within a single data-center towards modern Cloud systems that consist of numerous data-center clusters virtualized at the infrastructure and application layers to provide scalable, cost-effective and elastic services to devices connected over the Internet. There is an emerging trend towards heterogeneous Cloud systems driven by growth in wired as well as wireless devices that incorporate the potential of millions, and soon billions, of embedded devices enabling new forms of computation and service delivery. Service providers such as broadband cable operators continue to contribute towards this expansion with growing Cloud system infrastructures combined with deployments of increasingly powerful embedded devices across broadband networks. Broadband networks enable access to service provider Cloud data-centers and the Internet from numerous devices. These include home computers, smart-phones, tablets, game-consoles, sensor-networks, and set-top box devices. With these trends in mind, I propose the concept of broadband embedded computing as the utilization of a broadband network of embedded devices for collective computation in conjunction with centralized Cloud infrastructures. I claim that this form of distributed computing results in a new class of heterogeneous Cloud systems, service delivery and application enablement. To support these claims, I present a collection of research contributions in adapting distributed software platforms that include MPI and MapReduce to support simultaneous application execution across centralized data-center blade servers and resource-constrained embedded devices. Leveraging these contributions, I develop two complete prototype system implementations to demonstrate an architecture for heterogeneous Cloud systems based on broadband embedded computing.
Each system is validated by executing experiments with applications taken from bioinformatics and image processing as well as communication and computational benchmarks. This vision, however, is not without challenges. The questions of how to adapt standard distributed computing paradigms such as MPI and MapReduce for implementation on potentially resource-constrained embedded devices, and how to adapt cluster computing runtime environments to enable heterogeneous process execution across millions of devices, remain open-ended. This dissertation presents methods to begin addressing these open-ended questions through the development and testing of both experimental broadband embedded computing systems and in-depth characterization of broadband network behavior. I present experimental results and comparative analysis that offer potential solutions for optimal scalability and performance for constructing broadband embedded computing systems. I also present a number of contributions enabling practical implementation of both heterogeneous Cloud systems and novel application services based on broadband embedded computing.
PCODE: an efficient and reliable collective communication protocol for unreliable broadcast domain
Existing programming environments for clusters are typically built on top of a point-to-point communication layer (send and receive) over local area networks (LANs) and, as a result, suffer from poor performance in the collective communication part. For example, a broadcast that is implemented using a TCP/IP protocol (which is a point-to-point protocol) over a LAN is obviously inefficient as it is not utilizing the fact that the LAN is a broadcast medium. We have observed that the main difference between a distributed computing paradigm and a message passing parallel computing paradigm is that, in a distributed environment the activity of every processor is independent while in a parallel environment the collection of the user-communication layers in the processors can be modeled as a single global program. We have formalized the requirements by defining the notion of a correct global program. This notion provides a precise specification of the interface between the transport layer and the user-communication layer. We have developed PCODE, a new communication protocol that is driven by a global program and proved its correctness.
We have implemented the PCODE protocol on a collection of IBM RS/6000 workstations and on a collection of Silicon Graphics Indigo workstations, both communicating via UDP broadcast. The experimental results we obtained indicate that the performance advantage of PCODE over the current point-to-point approach (TCP) can be as high as an order of magnitude on a cluster of 16 workstations.
Performance evaluation of an open distributed platform for realistic traffic generation
Network researchers have dedicated a notable part of their efforts
to the area of modeling traffic and to the implementation of efficient traffic
generators. We feel that there is a strong demand for traffic generators
capable of reproducing realistic traffic patterns according to theoretical
models while also delivering high performance. This work presents an open
distributed platform for traffic generation that we called distributed
internet traffic generator (D-ITG), capable of producing traffic (network,
transport and application layer) at packet level and of accurately replicating
appropriate stochastic processes for both inter-departure time (IDT) and
packet size (PS) random variables. We implemented two different versions of
our distributed generator. In the first one, a log server is in charge of
recording the information transmitted by senders and receivers and these
communications are based either on TCP or UDP. In the other one, senders and
receivers make use of the MPI library. In this work a complete performance
comparison among the centralized version and the two distributed versions of
D-ITG is presented.
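To make the idea of driving a generator from stochastic processes for IDT and PS concrete, here is a minimal sketch that draws a packet schedule from two random variables. It is not D-ITG code: the function name `generate_packets` is hypothetical, and the exponential IDT and uniform packet-size distributions are assumptions chosen only for illustration.

```python
import random


def generate_packets(n, mean_idt_s, min_size, max_size, seed=0):
    """Return n (departure_time_s, size_bytes) tuples.

    Assumptions: IDT is exponentially distributed with mean `mean_idt_s`,
    and packet size is a uniform integer in [min_size, max_size].
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    t = 0.0
    packets = []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_idt_s)  # next inter-departure time
        packets.append((t, rng.randint(min_size, max_size)))
    return packets


if __name__ == "__main__":
    for departure, size in generate_packets(5, 0.01, 64, 1500):
        print(f"t={departure:.4f}s size={size}B")
```

A real generator would hand each tuple to a sender that waits until the departure time and then emits a packet of that size; swapping in other distributions only changes the two sampling lines.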
Design and Evaluation of a Collective IO Model for Loosely Coupled Petascale Programming
Loosely coupled programming is a powerful paradigm for rapidly creating
higher-level applications from scientific programs on petascale systems,
typically using scripting languages. This paradigm is a form of many-task
computing (MTC) which focuses on the passing of data between programs as
ordinary files rather than messages. While it has the significant benefits of
decoupling producer and consumer and allowing existing application programs to
be executed in parallel with no recoding, its typical implementation using
shared file systems places a high performance burden on the overall system and
on the user who will analyze and consume the downstream data. Previous efforts
have achieved great speedups with loosely coupled programs, but have done so
with careful manual tuning of all shared file system access. In this work, we
evaluate a prototype collective IO model for file-based MTC. The model enables
efficient and easy distribution of input data files to computing nodes and
gathering of output results from them. It eliminates the need for such manual
tuning and makes the programming of large-scale clusters using a loosely
coupled model easier. Our approach, inspired by in-memory approaches to
collective operations for parallel programming, builds on fast local file
systems to provide high-speed local file caches for parallel scripts, uses a
broadcast approach to handle distribution of common input data, and uses
efficient scatter/gather and caching techniques for input and output. We
describe the design of the prototype model, its implementation on the Blue
Gene/P supercomputer, and present preliminary measurements of its performance
on synthetic benchmarks and on a large-scale molecular dynamics application.
Comment: IEEE Many-Task Computing on Grids and Supercomputers (MTAGS08) 200
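The broadcast and scatter/gather pattern described in this abstract can be sketched as a simple placement plan. This is only an illustration under stated assumptions, not the prototype's implementation: the names `plan_distribution` and `gather_outputs` are hypothetical, and round-robin assignment is an assumed policy.

```python
def plan_distribution(common_files, per_task_files, num_nodes):
    """Broadcast common inputs to every node; scatter the rest round-robin."""
    plan = {
        node: {"broadcast": list(common_files), "scatter": []}
        for node in range(num_nodes)
    }
    for i, name in enumerate(per_task_files):
        plan[i % num_nodes]["scatter"].append(name)
    return plan


def gather_outputs(per_node_outputs):
    """Concatenate per-node output lists into one list, in node order."""
    return [out for node in sorted(per_node_outputs)
            for out in per_node_outputs[node]]


if __name__ == "__main__":
    plan = plan_distribution(["params.dat"], [f"in{i}.dat" for i in range(5)], 2)
    print(plan)
    print(gather_outputs({1: ["out1"], 0: ["out0", "out2"]}))
```

Each node would copy its planned files to a local cache before running its tasks, so the shared file system is touched once per file rather than once per task.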