Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations
This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. 
This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with the Virtual Interface Architecture and TCP/IP over Ethernet are offered.
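The central pattern the dissertation advocates, overlapping communication with computation, can be illustrated with a small Python sketch. This is not MPI/Pro or MPI code: the "transfer" is a simulated delay progressed by a helper thread, standing in for a nonblocking send (MPI_Isend) completed later by a wait (MPI_Wait). All names and timings here are illustrative assumptions.

```python
import threading
import time

COMM_TIME = 0.2   # simulated network transfer time (made-up value)
COMP_TIME = 0.2   # simulated local computation time (made-up value)

def communicate():
    # Stand-in for an asynchronous message transfer progressed
    # independently of the main thread (as by a progress thread).
    time.sleep(COMM_TIME)

def compute():
    # Stand-in for useful local work done while the message is in flight.
    time.sleep(COMP_TIME)

def sequential():
    # Blocking style: the CPU idles for the whole transfer.
    start = time.perf_counter()
    communicate()
    compute()
    return time.perf_counter() - start

def overlapped():
    # Nonblocking style: "post" the transfer, compute, then "wait".
    start = time.perf_counter()
    t = threading.Thread(target=communicate)
    t.start()        # analogous to posting MPI_Isend
    compute()        # computation overlaps the transfer
    t.join()         # analogous to MPI_Wait
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"sequential: {sequential():.2f}s, overlapped: {overlapped():.2f}s")
```

With equal transfer and computation times, the overlapped version approaches half the elapsed time of the sequential one, which is the effect the dissertation's performance metrics aim to quantify.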
Implementation of MP_Lite for the VI Architecture
MP_Lite is a lightweight message-passing library designed to deliver maximum performance to applications in a portable and user-friendly manner. The Virtual Interface (VI) architecture is a user-level communication protocol that bypasses the operating system to provide much better performance than traditional network architectures. By combining the high efficiency of MP_Lite with the high performance of the VI architecture, it is possible to implement a high-performance message-passing library with much lower latency and better throughput. This thesis discusses the design and implementation of MP_Lite for M-VIA, a modular implementation of the VI architecture on Linux. By using an eager protocol for sending short messages, MP_Lite M-VIA achieves much lower latency on both Fast Ethernet and Gigabit Ethernet. The handshake protocol and RDMA mechanism provide double the throughput that MPICH can deliver for long messages. MP_Lite M-VIA can also channel-bond multiple network interface cards to increase the potential bandwidth between nodes. Using multiple Fast Ethernet cards can double or even triple the maximum throughput without greatly increasing the cost of a PC cluster.
Optimizing message-passing performance within symmetric multiprocessor systems
The Message Passing Interface (MPI) has been widely used in the area of parallel computing due to its portability, scalability, and ease of use. Message passing within Symmetric Multiprocessor (SMP) systems is an important part of any MPI library, since it enables parallel programs to run efficiently on SMP systems, or on clusters of SMP systems when combined with other means of communication such as TCP/IP. Most message-passing implementations use a shared memory pool as an intermediate buffer to hold messages, lock mechanisms to protect the pool, and synchronization mechanisms for coordinating the processes. However, performance varies significantly depending on how these are implemented. The work here implements two SMP message-passing modules, using lock-based and lock-free approaches, for MP_Lite, a compact library that implements a subset of the most commonly used MPI functions. Various optimization techniques have been used to tune their performance. The two modules are evaluated using a communication performance analysis tool called NetPIPE and compared with the implementations in other MPI libraries such as MPICH, MPICH2, LAM/MPI and MPI/PRO. Performance tools such as PAPI and VTune are used to gather runtime information at the hardware level. This information, together with cache theory and the hardware configuration, is used to explain various performance phenomena. Tests using a real application show how the different implementations perform in practice. These results all show the improvements of the new techniques over existing implementations.
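The lock-free approach mentioned above typically rests on the single-producer/single-consumer ring buffer: when exactly one process writes and one reads, each index has a single owner and no lock is needed. The thesis's actual modules operate on shared memory in C; this Python sketch only shows the core idea, and real implementations additionally need memory barriers, elided here.

```python
class SPSCQueue:
    """Single-producer, single-consumer ring buffer.

    `tail` is written only by the producer and `head` only by the
    consumer, so with one of each no lock is required -- the idea a
    lock-free shared-memory message channel builds on.
    """

    def __init__(self, capacity: int):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to read  (consumer-owned)
        self.tail = 0   # next slot to write (producer-owned)

    def put(self, item) -> bool:
        nxt = (self.tail + 1) % self.capacity
        if nxt == self.head:        # full: one slot is kept empty
            return False
        self.buf[self.tail] = item
        self.tail = nxt             # publish the item
        return True

    def get(self):
        if self.head == self.tail:  # empty
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        return item

q = SPSCQueue(4)
for msg in ("a", "b", "c"):
    q.put(msg)
print(q.get(), q.get(), q.get())  # a b c
```

A lock-based module would instead guard a shared pool with a mutex, which is simpler but adds contention, one of the performance differences the thesis measures with NetPIPE.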
Global Arrays over SCI: One-Sided Communication in Clusters
Summary of the graduate thesis "Global Arrays over SCI - Enveis kommunikasjon i clustere" by Kai-Robert Bjørnstad.
Certain application types and algorithms are difficult to implement over the MPI interface. The MPI standard mandates a two-sided communication model in which both the sender and the receiver must explicitly take part in the communication. One-sided communication, by contrast, requires participation only from the initiating process when data is exchanged (a communication style typical of SMP machines). Algorithms and interfaces (e.g. SHMEM) have been developed for machines whose processors share an address space. The main difference between one-sided and two-sided communication lies in the programming model offered, that is, in how the application programmer has to deal with communication between processes.
Clusters do not inherently provide a shared address space, which makes one-sided communication problematic. As a result, many attempts have been made to create an illusion of shared memory using hardware and/or software, so-called DSM (Distributed Shared Memory). In this context, a library called Global Arrays (GA) has been developed. GA is implemented on top of the communication library ARMCI (Aggregate Remote Memory Copy Interface).
ARMCI is a portable communication library focused on implementing efficient one-sided communication operations in clusters and shared-memory machines. Its interface is not standardized, but it is nevertheless used by several applications, e.g. NWCHEM and GAMESS-UK (through GA). ARMCI resembles the SHMEM one-sided interface in many ways; the difference is that ARMCI focuses on transferring non-sequential (strided) data structures.
An important goal of data communication is to make it as efficient as possible: as little time as possible should be spent on communication (overhead), and as much as possible on processing/computation. With this as its starting point, the thesis investigates the possibility of letting ARMCI communicate over the SCI high-speed network. It outlines three main implementation approaches: ARMCI over SCI via TCP/IP, via MPI, and via the SCI driver. The project is based on the SSP (Scali Software Package) from Scali AS, including ScaFun and ScaMPI.
It is shown that TCP/IP implementations such as ScaIP over SCI introduce high overhead and perform worse than existing ARMCI implementations over Gigabit Ethernet. Furthermore, an ARMCI implementation has been developed exclusively on top of a two-sided message-passing standard, MPI; it can be used over SCI through the MPI implementation ScaMPI.
ARMCI has also been implemented directly over SCI using ScaFun. This implementation appears to have the greatest potential with regard to the data transfer itself and the use of zero-copy or one-copy protocols. Nevertheless, the implementation in this thesis is held back by the high overhead associated with SCI interrupts. Using Intel's Hyper-Threading technology and a necessarily thread-safe MPI implementation, the MPI-based approach is shown to introduce minimal overhead while retaining high portability.
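ARMCI's emphasis on strided (non-sequential) data can be illustrated with a toy one-sided put: only the caller participates, and consecutive blocks land a fixed stride apart in the passive target's memory. The function name and signature below are illustrative, loosely in the spirit of ARMCI's strided operations, not its actual API; the "remote memory" is a plain local list.

```python
def put_strided(src, dst, dst_offset, block_len, stride, nblocks):
    """Copy `nblocks` blocks of `block_len` elements from `src` into
    `dst`, placing consecutive blocks `stride` elements apart.
    Only the caller participates -- the 'remote' side is passive."""
    for b in range(nblocks):
        for i in range(block_len):
            dst[dst_offset + b * stride + i] = src[b * block_len + i]

# Scatter a 2x2 tile into the first two columns of a 2x4 row-major
# "remote" array (flattened to a list of 8 elements).
remote = [0] * 8
local = [1, 2, 3, 4]   # the 2x2 tile to transfer
put_strided(local, remote, dst_offset=0, block_len=2, stride=4, nblocks=2)
print(remote)  # [1, 2, 0, 0, 3, 4, 0, 0]
```

Packing such a transfer into one strided operation, instead of one message per block, is exactly the overhead reduction the thesis pursues when mapping ARMCI onto SCI.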
One-Sided Communication for High Performance Computing Applications
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2009
Parallel programming presents a number of critical challenges to application developers. Traditionally, message passing, in which one process explicitly sends data and another explicitly receives it, has been used to program parallel applications. With the recent growth in multi-core processors, the level of parallelism necessary for next-generation machines is cause for concern in the message passing community. The one-sided programming paradigm, in which only one of the two processes involved in communication actively participates in message transfer, has seen increased interest as a potential replacement for message passing.
One-sided communication does not carry the heavy per-message overhead associated with modern message passing libraries. The paradigm offers lower synchronization costs and advanced data manipulation techniques such as remote atomic arithmetic and synchronization operations. These combine to present an appealing interface for applications with random communication patterns, which traditionally present message passing implementations with difficulties.
This thesis presents a taxonomy of both the one-sided paradigm and of applications which are ideal for the one-sided interface. Three case studies, based on real-world applications, are used to motivate both taxonomies and verify the applicability of the MPI one-sided communication and Cray SHMEM one-sided interfaces to real-world problems. While our results show a number of shortcomings with existing implementations, they also suggest that a number of applications could benefit from the one-sided paradigm. Finally, an implementation of the MPI one-sided interface within Open MPI is presented, which provides a number of unique performance features necessary for efficient use of the one-sided programming paradigm.
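One of the one-sided features the abstract highlights, remote atomic arithmetic, can be sketched as a fetch-and-add on an exposed memory window. The class below is purely illustrative (names are assumptions, not the MPI or SHMEM API): a Python lock stands in for the atomicity that real one-sided atomics obtain in hardware or at the target NIC, without involving the target process in the call.

```python
import threading

class Window:
    """Toy exposed-memory window with an atomic fetch-and-add,
    loosely in the spirit of one-sided atomics such as
    MPI_Fetch_and_op. Illustrative only."""

    def __init__(self, size):
        self._mem = [0] * size
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def fetch_and_add(self, index, value):
        # Atomically return the old value and add `value` to it.
        with self._lock:
            old = self._mem[index]
            self._mem[index] = old + value
            return old

win = Window(4)
# Several "origin processes" (threads here) bump a shared counter.
workers = [threading.Thread(target=lambda: [win.fetch_and_add(0, 1) for _ in range(1000)])
           for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
print(win._mem[0])  # 4000
```

This kind of operation is what makes the one-sided paradigm attractive for the random-access communication patterns the thesis studies: updating a remote counter needs no matching receive on the target.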
Rocmeu: Resource Orientation in the Modelling of Parallel Applications and the Cooperative Exploitation of Multi-SAN Clusters
The development of parallel solutions for highly demanding computational problems has been limited to the exploitation of specific computer systems and to the use of abstractions closely related to the architecture of these systems.
These limitations are a strong obstacle to the use of heterogeneous clusters -- clusters that integrate multiple interconnection technologies -- when the goal is to deliver convincing results in terms of both productivity and performance.
This work presents the resource orientation as a new approach to parallel programming, unifying in the resource concept the logical entities spread through cluster nodes by applications and the physical resources that represent computation and communication power.
The paradigm introduces new abstractions for (i) the communication among logical resources and (ii) the manipulation of physical resources from applications.
The first ones guarantee a more convenient interface to the programmer, without compromising the intrinsic performance of modern SAN communication technologies.
The second ones allow the programmer to explicitly establish the effective mapping between logical entities and physical resources, in order to exploit the different levels of locality that we can find in the hierarchy of resources that results from using distinct SAN technologies and multiple SMP nodes.
The proposed paradigm corresponds to a programming methodology materialized in the Meu platform, which aims to integrate the design/development of parallel applications and the process of selecting/allocating physical resources at execution time in multi-application, multi-user environments.
The basis for this platform is RoCl, another platform, developed to offer a single system image.
The first layer of the resultant architecture, which corresponds to RoCl, guarantees the connectivity among logical resources instantiated at different cluster nodes, while the second, corresponding to Meu, allows these logical resources to be organized and manipulated, starting from an initial administrative specification of the available physical resources.
In the context of parallel/distributed programming, Meu integrates adaptations and extensions to the shared memory, message passing and global memory programming paradigms.
Basic capabilities for the manipulation of physical resources, along with facilities for the creation and discovery of entities that support interoperability and cooperation between applications, are also available.
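The single-system-image role attributed to RoCl above amounts to a cluster-wide directory: logical resources are registered under global names and can be located regardless of which node created them. A minimal sketch of that idea follows; the class, method names, and node labels are illustrative assumptions, not RoCl's actual interface.

```python
class Directory:
    """Cluster-wide name directory offering a single-system-image
    view: lookups resolve a logical resource to its hosting node,
    wherever it was registered from."""

    def __init__(self):
        self._entries = {}  # name -> (node, attributes)

    def register(self, name, node, **attrs):
        # Announce a logical resource created on `node`.
        self._entries[name] = (node, attrs)

    def lookup(self, name):
        # Resolve a logical resource, or None if unknown.
        return self._entries.get(name)

d = Directory()
d.register("matrix/block-0", node="node3", tech="Myrinet")
d.register("matrix/block-1", node="node7", tech="Gigabit")
print(d.lookup("matrix/block-0")[0])  # node3
```

Keeping interconnect attributes alongside each entry is one way the upper layer could exploit the locality hierarchy of a multi-SAN cluster when mapping logical entities onto physical resources.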