
    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience in high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because existing message-passing libraries lack adequate support for them, and they are rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. The framework defines four new performance metrics that facilitate new approaches to performance analysis of parallel systems and algorithms. The dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this analysis can be used by designers of parallel system and application software to assess the quality of their implementations and to predict the effective performance benefits of early binding and overlapping. This work also presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. MPI/Pro emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress, which are identified as critical for delivering maximum performance to applications. The experimental section of the dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with the Virtual Interface Architecture and TCP/IP over Ethernet are offered.
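
    The two techniques named in this abstract map directly onto standard MPI calls: early binding corresponds to persistent requests, and overlap comes from starting communication, computing, and only then waiting. A minimal sketch (not taken from MPI/Pro; the halo-exchange pattern, buffer sizes, and compute step are illustrative assumptions):

        /* Sketch: early binding via MPI persistent requests, with computation
         * overlapped between MPI_Startall and MPI_Waitall. Illustrative only. */
        #include <mpi.h>

        #define N 4096

        static void compute_interior(double *u, int n) {   /* placeholder work */
            for (int i = 1; i < n - 1; i++)
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }

        int main(int argc, char **argv) {
            int rank, size;
            double u[N], halo_send[1], halo_recv[1];
            MPI_Request reqs[2];

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            for (int i = 0; i < N; i++) u[i] = rank;
            int right = (rank + 1) % size, left = (rank + size - 1) % size;

            /* Early binding: set the communication channel up once, reuse it. */
            MPI_Send_init(halo_send, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Recv_init(halo_recv, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[1]);

            for (int step = 0; step < 100; step++) {
                halo_send[0] = u[N - 2];
                MPI_Startall(2, reqs);        /* start the pre-bound transfers   */
                compute_interior(u, N);       /* overlap computation with them   */
                MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
                u[0] = halo_recv[0];          /* consume the received halo value */
            }

            MPI_Request_free(&reqs[0]);
            MPI_Request_free(&reqs[1]);
            MPI_Finalize();
            return 0;
        }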

    Optimizing message-passing performance within symmetric multiprocessor systems

    The Message Passing Interface (MPI) has been widely used in the area of parallel computing due to its portability, scalability, and ease of use. Message passing within Symmetric Multiprocessor (SMP) systems is an important part of any MPI library, since it enables parallel programs to run efficiently on SMP systems, or on clusters of SMP systems when combined with other communication methods such as TCP/IP. Most message-passing implementations use a shared memory pool as an intermediate buffer to hold messages, lock mechanisms to protect the pool, and a synchronization mechanism for coordinating the processes, but performance varies significantly depending on how these are implemented. The work here implements two SMP message-passing modules, using lock-based and lock-free approaches, for MP_Lite, a compact library that implements a subset of the most commonly used MPI functions. Various techniques have been used to optimize performance. The two modules are evaluated using the communication performance analysis tool NetPIPE and compared with the implementations in other MPI libraries such as MPICH, MPICH2, LAM/MPI, and MPI/PRO. Performance tools such as PAPI and VTune are used to gather runtime information at the hardware level; this information, together with cache theory and the hardware configuration, is used to explain various performance phenomena. Tests using a real application show how the different implementations perform in practice. Together, these results demonstrate the improvements of the new techniques over existing implementations.
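
    As a hedged illustration of the lock-free alternative mentioned above (this is not MP_Lite's actual code): with one writer and one reader per direction, a ring buffer in the shared memory pool needs no locks, because the head and tail indices each have a single owner and can be published with atomic stores. Slot counts and sizes below are assumptions.

        /* Sketch: single-producer/single-consumer lock-free ring buffer,
         * assumed to live in a shared-memory segment mapped by both
         * processes. Acquire/release atomics replace lock acquisition. */
        #include <stdatomic.h>
        #include <string.h>

        #define SLOTS 64
        #define SLOT_BYTES 4096

        typedef struct {
            _Atomic unsigned head;       /* written only by the receiver */
            _Atomic unsigned tail;       /* written only by the sender   */
            char data[SLOTS][SLOT_BYTES];
            unsigned len[SLOTS];
        } spsc_queue;

        /* Sender: copy the message into the pool, then publish it by
         * advancing tail with a release store. */
        int queue_send(spsc_queue *q, const void *msg, unsigned n) {
            unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
            if (t - atomic_load_explicit(&q->head, memory_order_acquire) == SLOTS)
                return -1;               /* full: caller retries or blocks */
            memcpy(q->data[t % SLOTS], msg, n);
            q->len[t % SLOTS] = n;
            atomic_store_explicit(&q->tail, t + 1, memory_order_release);
            return 0;
        }

        /* Receiver: consume the oldest message, then free its slot by
         * advancing head. */
        int queue_recv(spsc_queue *q, void *buf) {
            unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
            if (atomic_load_explicit(&q->tail, memory_order_acquire) == h)
                return -1;               /* empty */
            unsigned n = q->len[h % SLOTS];
            memcpy(buf, q->data[h % SLOTS], n);
            atomic_store_explicit(&q->head, h + 1, memory_order_release);
            return (int)n;
        }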

    Implementation of MPICH on Top of MP_Lite


    Global Arrays over SCI: One-Sided Communication in Clusters

    Summary of the graduate thesis "Global Arrays over SCI - One-sided communication in clusters" by Kai-Robert Bjørnstad. Certain types of applications and algorithms are difficult to implement over the MPI interface. The MPI standard prescribes a two-sided communication model in which both sender and receiver must participate explicitly in the communication. One-sided communication relies only on the participation of the sending process when data is exchanged (a typical communication method for SMP machines). Algorithms and interfaces (e.g. SHMEM) have been developed for machines whose processors share an address space. The main difference between one-sided and two-sided communication lies in the programming model offered, i.e. in how the application programmer must handle communication between processes. Clusters do not inherently have a shared address space, which creates problems for one-sided communication. As a result, many attempts have been made to create the illusion of shared memory with hardware and/or software, so-called DSM (Distributed Shared Memory). In this context a library called Global Arrays (GA) has been developed. GA is implemented on top of the communication library ARMCI (Aggregate Remote Memory Copy Interface). ARMCI is a portable communication library focused on implementing efficient one-sided communication operations in clusters and in shared-memory machines. The interface is not standardized, but is nevertheless used by several applications, e.g. NWCHEM and GAMESS-UK (through GA). ARMCI resembles the SHMEM one-sided communication interface in many respects; the difference is that ARMCI focuses on the transfer of non-contiguous (strided) data structures. An important part of data communication is making it as efficient as possible: spending as little time as possible on communication overhead and as much as possible on processing and computation. With this as the starting point, the thesis investigates the possibilities for letting ARMCI communicate over the high-speed network SCI. The thesis outlines three main implementation approaches: ARMCI over SCI via TCP/IP, via MPI, and via the SCI driver. The project is based on the SSP (Scali Software Package) from Scali AS, including ScaFun and ScaMPI. It is shown that TCP/IP implementations such as ScaIP over SCI introduce high overhead and give poorer performance than existing ARMCI implementations over Gigabit Ethernet. An ARMCI implementation built exclusively on a two-sided message-passing standard, MPI, has also been developed; it can be used over SCI by means of the MPI implementation ScaMPI. Finally, ARMCI has been implemented directly over SCI using ScaFun. This implementation appears to have the greatest potential for the data communication itself and for the use of zero- or one-copy protocols, but the implementation in this thesis is held back by the high overhead associated with SCI interrupts. Using Intel's Hyper-Threading technology and a necessary thread-safe MPI implementation, the MPI-based approach is shown to introduce minimal overhead while offering high portability.
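
    ARMCI's emphasis on strided transfers shows up directly in its interface: a single call describes a non-contiguous block by strides and counts. A sketch of a one-sided strided put (the calls follow the published ARMCI interface; the 100x100 array, the 8x8 block, and the two-process setup are assumptions):

        /* Sketch: one-sided strided put with ARMCI, copying an 8x8 block out
         * of a 100x100 row-major double array on process 0 into the same
         * position on process 1. Shapes and setup are assumptions. */
        #include <armci.h>
        #include <mpi.h>

        int main(int argc, char **argv) {
            int me, nproc;
            void *base[2];                        /* one pointer per process */

            MPI_Init(&argc, &argv);               /* ARMCI runs on top of MPI */
            ARMCI_Init();
            MPI_Comm_rank(MPI_COMM_WORLD, &me);
            MPI_Comm_size(MPI_COMM_WORLD, &nproc);

            /* Collectively allocate a 100x100 double array on each process;
             * base[i] receives the address of process i's segment. */
            ARMCI_Malloc(base, 100 * 100 * sizeof(double));

            double *a = (double *)base[me];
            for (int i = 0; i < 100 * 100; i++) a[i] = me;

            if (me == 0 && nproc > 1) {
                int count[2]  = { 8 * sizeof(double), 8 };  /* bytes per row, rows */
                int stride[1] = { 100 * sizeof(double) };   /* bytes between rows  */
                /* One call moves the whole non-contiguous block to process 1. */
                ARMCI_PutS(base[0], stride, base[1], stride, count, 1, 1);
                ARMCI_Fence(1);                   /* wait for remote completion */
            }

            ARMCI_Barrier();
            ARMCI_Finalize();
            MPI_Finalize();
            return 0;
        }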

    One-Sided Communication for High Performance Computing Applications

    Thesis (Ph.D.), Indiana University, Computer Sciences, 2009. Parallel programming presents a number of critical challenges to application developers. Traditionally, message passing, in which one process explicitly sends data and another explicitly receives it, has been used to program parallel applications. With the recent growth in multi-core processors, the level of parallelism necessary for next-generation machines is cause for concern in the message-passing community. The one-sided programming paradigm, in which only one of the two processes involved in communication actively participates in message transfer, has seen increased interest as a potential replacement for message passing. One-sided communication does not carry the heavy per-message overhead associated with modern message-passing libraries. The paradigm offers lower synchronization costs and advanced data-manipulation techniques such as remote atomic arithmetic and synchronization operations. These combine to present an appealing interface for applications with random communication patterns, which traditionally present message-passing implementations with difficulties. This thesis presents taxonomies of both the one-sided paradigm and of applications that are ideal for the one-sided interface. Three case studies, based on real-world applications, are used to motivate both taxonomies and to verify the applicability of the MPI one-sided communication and Cray SHMEM one-sided interfaces to real-world problems. While our results show a number of shortcomings in existing implementations, they also suggest that a number of applications could benefit from the one-sided paradigm. Finally, an implementation of the MPI one-sided interface within Open MPI is presented, which provides a number of unique performance features necessary for efficient use of the one-sided programming paradigm.
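
    For reference, the MPI one-sided interface evaluated here can be exercised with a handful of calls: expose memory through a window, open an access epoch, and put data without the target posting a receive. A minimal sketch (the fence synchronization style and the single-double payload are assumptions; run with at least two processes):

        /* Sketch of the MPI-2 one-sided model: process 0 puts a value into
         * process 1's window without process 1 posting a receive. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int rank;
            double local = 0.0, payload = 3.14;
            MPI_Win win;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* Every process exposes one double through the window. */
            MPI_Win_create(&local, sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);

            MPI_Win_fence(0, win);               /* open the access epoch */
            if (rank == 0)                       /* only the origin participates */
                MPI_Put(&payload, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            MPI_Win_fence(0, win);               /* close epoch: transfer done */

            if (rank == 1)
                printf("received %f one-sidedly\n", local);

            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }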

    Rocmeu: resource orientation in the modelling of parallel applications and the cooperative exploitation of multi-SAN clusters

    The development of parallel solutions for highly demanding computational problems has been limited to the exploitation of specific computer systems and to the use of abstractions closely related to the architecture of those systems. These limitations are a strong obstacle to the use of heterogeneous clusters -- clusters that integrate multiple interconnection technologies -- when we intend to give capable answers to both productivity and performance. This work presents resource orientation as a new approach to parallel programming, unifying in the resource concept both the logical entities spread through cluster nodes by applications and the physical resources that represent computation and communication power. The paradigm introduces new abstractions for (i) communication among logical resources and (ii) the manipulation of physical resources from applications.
The former guarantee a more convenient interface for the programmer without compromising the intrinsic performance of modern SAN communication technologies. The latter allow the programmer to establish, explicitly, an effective mapping between logical entities and physical resources, in order to exploit the different levels of locality found in the hierarchy of resources that results from using distinct SAN technologies and multiple SMP nodes. The proposed paradigm corresponds to a programming methodology materialized in the Meu platform, which aims to integrate the design and development of parallel applications with the process of selecting and allocating physical resources at execution time in multi-application, multi-user environments. The basis for this platform is RoCl, another platform developed to offer a single-system image. The first layer of the resulting architecture, which corresponds to RoCl, guarantees connectivity among logical resources instantiated at different cluster nodes, while the second, corresponding to Meu, organizes and manipulates these logical resources starting from an initial administrative specification of the available physical resources. In the context of parallel/distributed programming, Meu integrates adaptations and extensions of the shared-memory, message-passing, and global-memory programming paradigms. Basic capabilities for the manipulation of physical resources, along with facilities for the creation and discovery of entities that support interoperability and cooperation between applications, are also available.
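
    A purely hypothetical sketch of the two layers described above (rocl_register, rocl_lookup, and meu_bind are invented names, not the real RoCl/Meu API): a directory layer that names logical resources cluster-wide, and a binding layer in which the programmer explicitly maps a logical entity onto a physical resource to exploit locality.

        /* Hypothetical illustration only: these names are not the real
         * RoCl/Meu API. A directory gives logical resources cluster-wide
         * names (RoCl's role); the programmer binds each logical entity to
         * a physical resource, choosing node and SAN (Meu's role). */
        #include <stdio.h>
        #include <string.h>

        typedef struct { const char *name; int node; int san; } phys_resource;
        typedef struct { const char *name; phys_resource *placed_on; } logical_entity;

        static logical_entity directory[16];   /* toy single-process directory */
        static int dir_len = 0;

        static void rocl_register(logical_entity e) { directory[dir_len++] = e; }

        static logical_entity *rocl_lookup(const char *name) {
            for (int i = 0; i < dir_len; i++)
                if (strcmp(directory[i].name, name) == 0) return &directory[i];
            return NULL;
        }

        /* Explicit logical-to-physical mapping for locality. */
        static void meu_bind(logical_entity *e, phys_resource *r) { e->placed_on = r; }

        int main(void) {
            phys_resource node0_san0 = { "node0/san0", 0, 0 };
            logical_entity worker = { "solver-worker", NULL };

            meu_bind(&worker, &node0_san0);   /* programmer chooses placement */
            rocl_register(worker);            /* make it discoverable         */

            logical_entity *found = rocl_lookup("solver-worker");
            if (found)
                printf("%s placed on %s\n", found->name, found->placed_on->name);
            return 0;
        }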

    MPI-2 One-Sided Communications on a Giganet SMP Cluster
