11 research outputs found

    Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem

    This paper presents a new low-level communication subsystem called Nemesis. Nemesis has been designed and implemented to be scalable and efficient both in the intranode communication context using shared-memory and in the internode communication case using high-performance networks and is natively multimethod-enabled. Nemesis has been integrated in MPICH2 as a CH3 channel and delivers better performance than other dedicated communication channels in MPICH2. Furthermore, the resulting MPICH2 architecture outperforms other MPI implementations in point-to-point benchmarks

    Adaptation des communications MPI intra-nœud aux architectures multicœurs modernes

    The emergence of multicore processors increases the need for data transfers between processes inside a machine. Like most portable MPI implementations, MPICH2 uses an intra-node communication scheme that relies on several memory copies. This model suffers from intensive processor use and heavy cache pollution, which significantly limit performance. However, the Large Message Transfer programming interface of MPICH2, designed to support a wide range of transfer mechanisms, makes it possible to change this strategy. A direct-copy strategy based on the Linux vmsplice system call improves performance in some cases. We present a second direct-copy strategy that relies on a dedicated kernel module named KNEM. It takes advantage of hardware memory-copy offload capabilities, enabling them dynamically according to the physical characteristics of the caches and the message size. This new solution outperforms the usual transfer methods and the vmsplice strategy when the cores on which the processes run share no cache, or for transfers of very large messages. Collective communication operations show a dramatic improvement, and the NAS IS benchmark obtains a 25% speedup and better cache utilization

    Towards an MPI-like Framework for Azure Cloud Platform

    The Message Passing Interface (MPI) has been widely used for implementing parallel and distributed applications. The emergence of cloud computing offers a scalable, fault-tolerant, on-demand alternative to traditional on-premise clusters. In this thesis, we investigate the possibility of adopting the cloud platform as an alternative to conventional MPI-based solutions. We show that the cloud platform can exhibit competitive performance and benefit its users with a fault-tolerant architecture and on-demand access for a robust solution. Extensive research is done to identify the difficulties of designing and implementing an MPI-like framework for the Azure cloud platform. We present the details of the key components required for implementing such a framework, along with our experimental results for benchmarking multiple basic operations of the MPI standard implemented in the cloud and its practical application in solving well-known large-scale algorithmic problems

    Optimizing MPI Collective Operations for Cloud Deployments

    Cloud infrastructures are increasingly being adopted as a platform for high performance computing (HPC) science and engineering applications. For HPC applications, the Message-Passing Interface (MPI) is widely-used. Among MPI operations, collective operations are the most I/O intensive and performance critical. However, classical MPI implementations are inefficient on cloud infrastructures because they are implemented at the application layer using network-oblivious communication patterns. These patterns do not differentiate between local or cross-rack communication and hence do not exploit the inherent locality between processes collocated on the same node or the same rack of nodes. Consequently, they can suffer from high network overheads when communicating across racks. In this thesis, we present COOL, a simple and generic approach for Message-Passing Interface (MPI) collective operations. COOL enables highly efficient designs for collective operations in the cloud. We then present a system design based on COOL that describes how to implement frequently used collective operations. Our design efficiently uses the intra-rack network while significantly reducing cross-rack communication, thus improving application performance and scalability. We use software-defined networking capabilities to build more efficient network paths for I/O intensive collective operations. Our analytic evaluation shows that our design significantly reduces the network overhead across racks. Furthermore, when compared with OpenMPI and MPICH, our design reduces the latency of collective operations by a factor of log N, where N is the total number of processes, decreases the number of exchanged messages by a factor of N and reduces the network load by up to an order of magnitude. These significant improvements come at the cost of a small increase in the computation load on a few processes
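
To make the locality argument above concrete, here is a minimal sketch that counts cross-rack messages for a reduce to rank 0. It is a toy counting model, not the COOL design: the function names, the `rack_of` list (rack index of each rank), and the choice of a binomial tree as the network-oblivious baseline are all assumptions made for illustration.

```python
def flat_binomial_cross_rack(rack_of):
    """Count cross-rack messages in a binomial-tree reduce to rank 0.

    The tree is built from rank numbers only, so it ignores rack placement.
    """
    n = len(rack_of)
    cross = 0
    step = 1
    while step < n:
        # At this step, each rank that is a multiple of 2*step receives
        # one message from the rank `step` positions above it.
        for receiver in range(0, n, 2 * step):
            sender = receiver + step
            if sender < n and rack_of[receiver] != rack_of[sender]:
                cross += 1
        step *= 2
    return cross


def hierarchical_cross_rack(rack_of):
    """Count cross-rack messages when each rack first reduces locally.

    Assumption: one leader per rack, and every non-root leader sends
    exactly one message toward the root rack.
    """
    return len(set(rack_of)) - 1


# Hypothetical placements of 16 ranks on 4 racks.
round_robin = [rank % 4 for rank in range(16)]
blocked = [rank // 4 for rank in range(16)]

for name, placement in (("round-robin", round_robin), ("blocked", blocked)):
    print(name,
          flat_binomial_cross_rack(placement),
          hierarchical_cross_rack(placement))
```

Under these assumptions, the rank-based tree is only as good as the placement it happens to be given, while the two-level scheme sends the same number of cross-rack messages regardless of how ranks are laid out.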

    Um estudo experimental de coescalonamento em um ambiente de previsão meteorológica

    Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2209. What? The handling, efficiency, communication, and distribution of processes in high-performance computing environments have always been studied with the aim of achieving ever greater performance. As processing technologies advance, the computing power of clusters grows as well, offering ever greater processing capacity. Multicore processors, multiprocessor multicore nodes, and multiprocessor nodes have been on the rise for some time, and several systems for measuring processor performance have been developed, among them the NAS Parallel Benchmark. How? The performance of high-performance computers is tested with benchmarks based on raw performance and on workflow, focused mainly on areas directly related to science and on public and private organizations. Examples of application areas involving large volumes of data in complex knowledge domains include remote sensing, geoprocessing, climate forecasting, planetary exploration, computer vision, prototyping, and robotics. Redundant information or trivial analysis procedures can demand computational resources far beyond currently available capacities. Why? Techniques for distribution, efficient message passing, and interprocess communication must therefore be used so that large volumes of data can be processed, analyzed, made available, and visualized, often in real time.
Based on experiments, and with the help of the NAS Parallel Benchmark together with the MPI library, we evaluate the impact of the characteristics described above in high-performance environments that use multicore nodes, multiprocessor nodes, and multiprocessor multicore nodes. Our work describes the main techniques used for process distribution, efficient message passing, and communication between nodes, and uses a new approach to the subject to demonstrate the effectiveness of the methods

    Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

    Multicore or many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies exist not only in the inter-node layer, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared cache hierarchies. The Message Passing Interface (MPI), the most widely adopted standard in the HPC community, suffers from decreased performance and portability due to this multi-level hardware complexity. We identified three critical issues specific to collective communication. First, there is a gap between logical collective topologies and the underlying hardware topologies. Second, current MPI communications lack efficient shared-memory message delivery approaches. Last, on distributed memory machines such as multicore clusters, a single approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in features such as the ability to perform multiple copies concurrently. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distance into collective algorithms in order to dynamically reshape the communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes in a multi-level hierarchy that maps onto the hardware characteristics. Meanwhile, we took advantage of the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/non-root processes to evenly distribute copy workloads among available cores.
Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms together to express a multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlaps between inter- and intra-node communication. Experimental results have confirmed that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, not only do MPI collectives reach the potential maximum performance on a wide variation of platforms, but they also deliver a level of performance immune to modifications of the underlying process-core binding

    Design, Implementation, and Formal Verification of On-demand Connection Establishment Scheme for TCP Module of MPICH2 Library

    Message Passing Interface (MPI) is a standard library interface for writing parallel programs. The MPI specification is broadly used for solving engineering and scientific problems on parallel computers, and MPICH2 is a popular MPI implementation developed at Argonne National Laboratory. The scalability of MPI implementations is very important for building high performance parallel computing applications. The initial TCP (Transmission Control Protocol) network module developed for Nemesis communication sub-system in the MPICH2 library, however, was not scalable in how it established connections: pairwise connections between all of an application's processes were established during the initialization of the application (the library call to MPI_Init), regardless of whether the connections were eventually needed or not. In this work, we have developed a new TCP network module for Nemesis that establishes connections on-demand. The on-demand connection establishment scheme is designed to improve the scalability of the TCP network module in MPICH2 library, aiming to reduce the initialization time and the use of operating system resources of MPI applications. Our performance benchmark results show that MPI_Init in the on-demand connection establishment scheme becomes a fast constant time operation, and the additional cost of establishing connections later is negligible. The on-demand connection establishment between two processes, especially when two processes attempt to connect to each other simultaneously, is a complex task due to race-conditions and thus prone to hard-to-reproduce defects. To assure ourselves of the correctness of the TCP network module, we modeled its design using the SPIN model checker, and verified safety and liveness properties stated as Linear Temporal Logic claims
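
The on-demand scheme described above can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the MPICH2 TCP module: `Proc`, `Fabric`, and `resolve_head_to_head` are hypothetical names, no real sockets are used, and the tie-break rule (keep the connection initiated by the lower rank) is an assumed policy that the abstract does not specify.

```python
class Proc:
    """One simulated process; connections are opened lazily."""

    def __init__(self, rank, fabric):
        self.rank = rank
        self.fabric = fabric
        self.conns = {}   # peer rank -> rank of the process that initiated
        self.inbox = []   # (sender rank, payload) pairs received so far

    def send(self, peer, payload):
        # Connect on first use only, rather than to every peer at startup.
        if peer not in self.conns:
            self.fabric.connect(self.rank, peer)
        self.fabric.procs[peer].inbox.append((self.rank, payload))


class Fabric:
    """In-memory stand-in for the network; counts established connections."""

    def __init__(self, n):
        # Start-up cost does not depend on n*(n-1)/2 pairwise connections.
        self.procs = [Proc(rank, self) for rank in range(n)]
        self.established = 0

    def connect(self, initiator, target):
        self.procs[initiator].conns[target] = initiator
        self.procs[target].conns[initiator] = initiator
        self.established += 1


def resolve_head_to_head(local_rank, remote_rank):
    """Decide which connection survives when both sides connect at once.

    Assumed rule: the connection initiated by the lower rank is kept, so
    both endpoints reach the same decision without extra messages.
    """
    return "keep_outgoing" if local_rank < remote_rank else "accept_incoming"


fabric = Fabric(4)
fabric.procs[0].send(3, "hello")
fabric.procs[0].send(3, "again")   # reuses the existing connection
print(fabric.established)
print(resolve_head_to_head(2, 5), resolve_head_to_head(5, 2))
```

A deterministic rule of this kind is one common way to settle simultaneous connection attempts; the race conditions mentioned in the abstract arise in the real, asynchronous setting that this synchronous toy does not model.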

    STAPL-RTS: A Runtime System for Massive Parallelism

    Modern High Performance Computing (HPC) systems are complex, with deep memory hierarchies and increasing use of computational heterogeneity via accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage machine resources, writing programs using low level primitives from multiple APIs (e.g., MPI+OpenMP), creating efficient but rigid, difficult to extend, and non-portable implementations. Alternatively, users can adopt higher level programming environments, often at the cost of lost performance. Our approach is to maintain the high level nature of the application without sacrificing performance by relying on the transfer of high level, application semantic knowledge between layers of the software stack at an appropriate level of abstraction and performing optimizations on a per-layer basis. In this dissertation, we present the STAPL Runtime System (STAPL-RTS), a runtime system built for portable performance, suitable for massively parallel machines. While the STAPL-RTS abstracts and virtualizes the underlying platform for portability, it uses information from the upper layers to perform the appropriate low level optimizations that restore the performance characteristics. We outline the fundamental ideas behind the design of the STAPL-RTS, such as the always distributed communication model and its asynchronous operations. Through appropriate code examples and benchmarks, we prove that high level information allows applications written on top of the STAPL-RTS to attain the performance of optimized, but ad hoc solutions. Using the STAPL library, we demonstrate how this information guides important decisions in the STAPL-RTS, such as multi-protocol communication coordination and request aggregation using established C++ programming idioms. 
Recognizing that nested parallelism is of increasing interest for both expressivity and performance, we present a parallel model that combines asynchronous, one-sided operations with isolated nested parallel sections. Previous approaches to nested parallelism targeted either static applications through the use of blocking, isolated sections, or dynamic applications by using asynchronous mechanisms (i.e., recursive task spawning) which come at the expense of isolation. We combine the flexibility of dynamic task creation with the isolation guarantees of the static models by allowing the creation of asynchronous, one-sided nested parallel sections that work in tandem with the more traditional, synchronous, collective nested parallelism. This allows selective, run-time customizable use of parallelism in an application, based on the input and the algorithm