85 research outputs found

    Exploring Fully Offloaded GPU Stream-Aware Message Passing

    Modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects. Communication libraries that support efficient data transfers involving GPU memory buffers typically require the CPU to orchestrate the data transfer operations. A new offload-friendly communication strategy, stream-triggered (ST) communication, was explored to allow offloading the synchronization and data movement operations from the CPU to the GPU. A Message Passing Interface (MPI) one-sided active target synchronization based implementation was used as an exemplar to illustrate the proposed strategy. A latency-sensitive nearest neighbor microbenchmark was used to explore the various performance aspects of the implementation. The offloaded implementation shows significant on-node performance advantages over standard MPI active RMA (36%) and point-to-point (61%) communication. The current multi-node improvement is smaller (23% faster than standard active RMA but 11% slower than point-to-point), but plans are in progress to pursue further improvements. Comment: 12 pages, 17 figures

    Improving MPI Threading Support for Current Hardware Architectures

    Threading support for the Message Passing Interface (MPI) has been defined in the MPI standard for more than twenty years. While many standard-compliant MPI implementations fully support multithreading, threaded MPI still cannot match the performance of the non-threaded environment. This performance disparity leads to a low adoption rate among applications and, eventually, less interest in optimizing MPI threading support. However, with current advances in computing hardware, the number of CPU cores per package is growing drastically, and shared-memory MPI communication has become more costly. MPI threading without local communication is one of the alternatives, and some interest is shifting back toward threading in MPI. In this work, we investigate different approaches to leveraging thread parallelism, and tools to help us raise multi-threaded MPI performance to a reasonable level. We propose a novel multi-threaded MPI benchmark with multiple communication patterns that stress multiple points of the MPI implementation, with the ability to switch between MPI processes and threads for quick comparison between the two modes. This enables us and other MPI developers to stress-test their implementation designs. We address the interoperability between MPI implementations and threading frameworks by introducing the thread-synchronization object, an object that gives the MPI implementation more control over user-level threads, allowing for more thread utilization in MPI. In our implementation, the synchronization object relieves lock contention on the internal progress engine and achieves up to 7x the performance of the original implementation. Moving forward, we explore the possibility of harnessing true thread concurrency and propose several strategies to address the bottlenecks in MPI implementations. In our evaluation, our novel threading optimizations achieve up to 22x the performance of legacy MPI designs.

    Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery

    Early-bird communication is a communication/computation overlap technique that combines fine-grained communication with partitioned communication to improve application run-time. Communication is divided among the compute threads such that each individual thread can initiate transmission of its portion of the data as soon as it is complete rather than waiting for all of the threads. However, the benefit of early-bird communication depends on the completion timing of the individual threads. In this paper, we measure and evaluate the potential overlap, that is, the idle time each thread experiences between finishing its computation and the final thread finishing. These measurements help us understand whether a given application could benefit from early-bird communication. We present our technique for gathering this data and evaluate data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To characterize the behavior of these workloads, we study the thread timings at both a macro level, i.e., across all threads across all runs of an application, and a micro level, i.e., within a single process of a single run. We observe that these applications exhibit significantly different behavior. While MiniFE and MiniQMC appear to be well-suited for early-bird communication because of their wider thread distribution and more frequent laggard threads, the behavior of MiniMD may limit its ability to leverage early-bird communication.
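
The potential-overlap metric described in this abstract (the idle time between a thread finishing and the last thread finishing) can be illustrated with a minimal Python sketch. The function name `potential_overlap` and the sample finish times are hypothetical and are not taken from the paper; the paper's actual measurement technique is not reproduced here.

```python
from typing import List


def potential_overlap(finish_times: List[float]) -> List[float]:
    """Return, per thread, the idle time between that thread finishing
    and the last thread finishing (hypothetical helper, not from the paper)."""
    if not finish_times:
        return []
    last = max(finish_times)  # completion time of the laggard thread
    return [last - t for t in finish_times]


# Made-up finish times (seconds) for four compute threads in one region.
finish = [1.0, 1.5, 4.0, 1.25]
idle = potential_overlap(finish)
print(idle)  # [3.0, 2.5, 0.0, 2.75]
```

A wide spread of idle times, as in this made-up sample with one laggard thread, is the situation in which early threads could start sending their partitions while the laggard is still computing.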

    Scalability of the NewMadeleine Communication Library for Large Numbers of MPI Point-to-Point Requests

    New kinds of applications, with many threads or irregular communication patterns, that rely heavily on point-to-point MPI communication have emerged. They stress the MPI library with potentially many simultaneous MPI send and receive requests. When dealing with large numbers of simultaneous requests, the bottleneck lies in two main mechanisms: tag matching (the algorithm that matches an incoming packet with a posted receive request) and the progression engine. In this paper, we propose algorithms and implementations that overcome these issues so as to scale up to thousands of requests if needed. In particular, our algorithms are able to perform constant-time tag matching even with any-source and any-tag support. We have implemented these mechanisms in our NewMadeleine communication library. Through micro-benchmarks and computation kernel benchmarks, we demonstrate that our MPI library exhibits better performance than state-of-the-art MPI implementations in cases with many simultaneous requests.
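
To make the tag-matching problem concrete, here is a minimal Python sketch of one generic way to match an incoming message against posted receives with wildcard support using hash tables. It is not the NewMadeleine algorithm, which the abstract does not detail. The class `PostedReceives`, the `ANY` marker, and the request labels are all hypothetical. Each match inspects at most four hash buckets, and a global sequence number keeps the earliest posted receive first.

```python
from collections import defaultdict, deque
from itertools import count

ANY = -1  # hypothetical wildcard marker standing in for any-source / any-tag


class PostedReceives:
    """Hash-bucketed queue of posted receives (illustrative sketch only)."""

    def __init__(self):
        self._queues = defaultdict(deque)  # (source, tag) -> deque of (seq, request)
        self._seq = count()                # global posting order

    def post(self, source, tag, request):
        """Post a receive; source and/or tag may be ANY."""
        self._queues[(source, tag)].append((next(self._seq), request))

    def match(self, source, tag):
        """Match an incoming message with a concrete source and tag.

        Only four buckets can hold a compatible receive; the oldest head
        among them wins, so posting order is respected.
        """
        best_key, best_seq = None, None
        for key in ((source, tag), (source, ANY), (ANY, tag), (ANY, ANY)):
            queue = self._queues.get(key)
            if queue and (best_seq is None or queue[0][0] < best_seq):
                best_key, best_seq = key, queue[0][0]
        if best_key is None:
            return None  # no posted receive matches: the message is unexpected
        return self._queues[best_key].popleft()[1]


prq = PostedReceives()
prq.post(ANY, 7, "recv-A")   # any-source receive posted first
prq.post(3, 7, "recv-B")     # fully specified receive posted second
print(prq.match(3, 7))  # recv-A
print(prq.match(3, 7))  # recv-B
print(prq.match(3, 7))  # None
```

The sketch covers only the posted-receive side; a full implementation would also need an unexpected-message queue and thread safety.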

    Techniques for High Performance Matching

    With the growth of big data application demands, improving high-performance computing (HPC) has become an essential industry task. High-performance matching is a critical performance path for HPC communications because it significantly affects both computing and networking performance. This dissertation focuses on improving high-performance matching in HPC networks to keep up with the increasingly heavy demands of evolving applications, and tackles the matching problem from both the computational and the network side. On the one hand, the Message Passing Interface (MPI) is a de facto standard for the communication of parallel processes in an HPC network [1]. MPI has delivered excellent performance for large-scale scientific applications on petascale systems. Beyond petascale, exascale systems are evolving to run even larger applications, where the computing job size increases dramatically. This trend lengthens the message queues and degrades MPI message matching performance, which makes MPI message matching a critical performance path for HPC communications. On the other hand, with the rapid development of networking techniques and the fast-growing size of network applications, users are seeking more enhanced, secure, and varied network services. An HPC cluster comprises multiple interconnected nodes in a switched network. With the integration of software-defined networking (SDN) technology into the HPC network, both computational and network resources can be allocated efficiently according to the applications’ requirements. Thus, SDN switches are deployed in HPC networks to support high-performance, differentiated network services and meet diverse user needs, such as firewalls, load balancing, and quality of service [2].
In an SDN switch, packet classification, a core switch function, assigns incoming packets to flows according to the rules generated in the control plane. Packet classification is therefore a critical performance path for the HPC network. First, this dissertation presents GenMatcher, a generic, software-only arbitrary matching framework for fast and efficient searches in packet classification. The goal is to represent arbitrary rules with efficient prefix-based tries. To generate efficient trie groupings and expansions that support all arbitrary rules, we propose a clustering-based grouping algorithm that groups rules by their bit-level similarities. Our algorithm generates near-optimal trie groupings with low configuration times and provides significantly higher match throughput than prior techniques. Experiments with synthetic traffic show that our method can achieve a 58.9X speedup compared to the baseline on a single-core processor under a given memory constraint [3]. Second, to further improve on GenMatcher's performance, this dissertation proposes GenSMatcher, an efficient Single Instruction Multiple Data (SIMD) and cache-friendly arbitrary matching framework. GenSMatcher adopts a trie node with a fixed high fanout and a span that varies per node depending on the data distribution. The layout of the trie node leverages the cache and modern processor features such as SIMD instructions. To support arbitrary matching, we interpret arbitrary rules as three fields (value, mask, and priority) and then propose the GenSMatcher extraction algorithm, which processes the wildcard bits to support wildcards at arbitrary positions in the rules. Finally, we add an array of wildcard entries to the leaf entries, which stores the wildcard rules and guarantees matching results. Experiments show that GenSMatcher outperforms GenMatcher on large rule sets and key sets in terms of search time, insert time, and memory cost.
Specifically, with 5M rules, our method achieves a 2.7X speedup in search time, and insertion takes about 7.3 seconds, a 1.38X speedup; meanwhile, the memory cost is reduced by up to 6.17X. Third, to guarantee the MPI ordering feature and high-performance matching for large applications in MPI tag matching, this dissertation introduces a new hybrid data structure and matching mechanism that address the performance challenges by reducing the matching operation time in the posted receive queue (PRQ) and unexpected message queue (UMQ). The hybrid data structures are composed of tries and hash maps. We evaluate our mechanism on microbenchmarks and existing MPI applications with different numbers of processes. Experiments with synthetic message flows show that our method can achieve a 20X search time speedup compared to the single-core processor’s baseline. For the PICSARlite application, we integrated our Hybrid and Intel mechanism into the MPICH library and evaluated their performance on the Ada cluster of Texas A&M University, which has 793 general compute nodes. The experiment outcome shows that our proposed Hybrid mechanism can achieve up to 1.55X speedup compared to the MPICH library method.
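
The value, mask, and priority representation of arbitrary rules mentioned in this abstract can be illustrated with a short Python sketch of the matching semantics. This is a plain linear scan, the kind of baseline a trie-based design aims to beat; it is not GenMatcher or GenSMatcher. The names `Rule`, `matches`, and `classify` are hypothetical, and treating a larger priority number as the winner is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    value: int     # bit pattern to compare against
    mask: int      # 1 bits are compared, 0 bits are wildcards
    priority: int  # assumption: the larger number wins


def matches(rule: Rule, key: int) -> bool:
    """A key matches when it agrees with the rule on every non-wildcard bit."""
    return (key & rule.mask) == (rule.value & rule.mask)


def classify(rules: Iterable[Rule], key: int) -> Optional[Rule]:
    """Linear-scan baseline: return the highest-priority matching rule, if any."""
    best = None
    for rule in rules:
        if matches(rule, key) and (best is None or rule.priority > best.priority):
            best = rule
    return best


rules = [
    Rule(value=0b1010, mask=0b1111, priority=1),  # exact match on 1010
    Rule(value=0b1000, mask=0b1100, priority=5),  # 10** with two wildcard bits
    Rule(value=0b0000, mask=0b0000, priority=0),  # catch-all
]
print(classify(rules, 0b1010).priority)  # 5
print(classify(rules, 0b0110).priority)  # 0
```

Because wildcard bits can sit at any position in the mask, such rules do not map directly onto prefix tries, which is the gap the grouping and extraction algorithms in the abstract are described as addressing.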

    Communication Architectures for Scalable GPU-centric Computing Systems

    In recent years, power consumption has become the main concern in High Performance Computing (HPC). This has led to heterogeneous computing systems in which Central Processing Units (CPUs) are supported by accelerators, such as Graphics Processing Units (GPUs). While GPUs used to be seen as slave devices to which the main processor offloads computation, today’s systems tend to deploy more GPUs than CPUs. Eventually, the GPU will become a first-class processor, bearing increasing responsibilities. Promoting the GPU to a first-class processor comes with many challenges, such as progress guarantees, dynamic memory management, and scheduling. However, one of the main challenges is the GPU’s inability to orchestrate communication, which is currently handled entirely by the CPU. This work addresses that issue and presents solutions that allow GPUs to source and sink network traffic independently. Many important aspects are addressed, ranging from the application level to how networking hardware is accessed. First, important large-scale exascale applications are studied to better understand their communication behavior and requirements. Several metrics are presented, including time spent in communication, message sizes, and the length of the queues required to match messages with receive requests. One finding of the analysis is that messages become smaller at scale, which makes the matching of messages and receive requests an important problem to address. The next part analyzes how the GPU can directly access the network, with various communication models being presented and benchmarked. It is shown that a flat address space of distributed GPU memories delivers higher bandwidth than put/get communication or CPU-controlled message passing, but less communication can be overlapped with computation. Overall, GPU-controlled communication is always superior, in terms of both time-to-solution and energy consumption.
The final part addresses communication management on GPUs, which is required to provide high-level communication abstractions. Besides other fundamental building blocks, an algorithm for message matching is presented that yields performance similar to that of CPUs. However, it is also shown that the messaging protocol can be relaxed to improve performance significantly, leveraging the massive amount of parallelism provided by the GPU’s architecture.

    Full Issue


    Peace made, peace built?: Participation, countryside, and politics in the 2010s Colombian peace process

    This thesis argues that, on the one hand, pursuing the participation and inclusion of the whole of society and informing the citizenry well about the terms of the accord are vital to peacemaking; and that, on the other hand, rural restructuring, changing political parties’ informal coercive institutions, and shifting the social norm of war towards peacebuilding are crucial coordinates for charting genuine development for Colombia, a nation that during the 2010s faced the challenge of ending its long-standing civil war between the government and the Revolutionary Armed Forces of Colombia − People's Army (FARC-EP) rebels. I advance the argument in two parts. The first part, on peacemaking, is divided into two chapters. The first examines participation and inclusion in the 2016 peace settlement, drawing on democratic innovation and the ladder of citizen participation; it argues, in a constructivist way and applying hermeneutics, that inclusion does not necessarily mean civil society's control over the peacemaking process, the participation of political society and the insurgency being a precondition. The second chapter focuses on the 2016 peace plebiscite. Conceptually, it argues that personal, relational, cultural, and structural causes are intimately related to voters’ attitudes. Quantitatively, it shows from municipal data that areas with rural poverty, coca crops, victims, remoteness from the centre, and an intense rebel presence were positively associated with the yes vote, that the warring parties had a heterogeneous influence, and that the no vote won in places with larger populations and high abstention. The second part of this thesis addresses peacebuilding in three chapters.
The first argues that civil war has been encouraged by grievances over rural poverty; drawing on Latin American Structuralism and original data, it empirically finds a paradox of land redistribution, strongly positive effects of technical progress in overcoming rural poverty, a dependency that undermines rural living standards, widening gaps between centre and periphery, and the egregious effects of forced displacement on the countryside. The second chapter of this part examines the brutality, narcotics trafficking, and corruption enforced by active Colombian political parties (19 parties and one social movement) from 2011 to 2020. To do so, I address historical contingencies of party politics and build a novel panel data set in which the brutality composite indicator, the corruption indicator, and coca crops are response variables for the explanatory matrix of political parties elected to executive branch positions. The findings unmask the political parties that enforced or rejected these three coercive and violent informal institutions, as well as divergent causes. Lastly, chapter five, the third chapter of the second part, posits eight individual political preferences (kinship, funding, perpetuation, ideology, decision-making, religion, military, and media) that cement the norm of civil war. I therefore carry out an experiment with all members of the 2018-2022 Colombian Congress cohort (102 subjects in the Senate and 170 in the House of Representatives).
The results indicate that the population is dominated by a selfish adapted community with heterogeneous preferences according to subjects’ chamber or experimental group (i.e., self-enforcers, dodgers, and scofflaws).