85 research outputs found
Exploring Fully Offloaded GPU Stream-Aware Message Passing
Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs, and
high-speed network interconnects. Communication libraries supporting efficient
data transfers involving memory buffers from the GPU memory typically require
the CPU to orchestrate the data transfer operations. A new offload-friendly
communication strategy, stream-triggered (ST) communication, was explored to
allow offloading the synchronization and data movement operations from the CPU
to the GPU. A Message Passing Interface (MPI) one-sided active target
synchronization based implementation was used as an exemplar to illustrate the
proposed strategy. A latency-sensitive nearest neighbor microbenchmark was used
to explore the various performance aspects of the implementation. The offloaded
implementation shows significant on-node performance advantages over standard
MPI active RMA (36%) and point-to-point (61%) communication. The current
multi-node improvement is less (23% faster than standard active RMA but 11%
slower than point-to-point), but plans are in progress to pursue further
improvements. Comment: 12 pages, 17 figures
Improving MPI Threading Support for Current Hardware Architectures
Threading support for the Message Passing Interface (MPI) has been defined in the MPI standard for more than twenty years. While many standard-compliant MPI implementations fully support multithreading, threaded MPI still cannot match the performance of the non-threaded environment. This performance disparity leads to a low adoption rate among applications and, eventually, less interest in optimizing MPI threading support. However, with current advances in computing hardware, the number of CPU cores per package is growing drastically, and shared-memory MPI communication has become more costly. MPI threading without local communication is one alternative, and some interest is shifting back toward threading in MPI. In this work, we investigate different approaches to leveraging thread parallelism, and tools that help us raise multi-threaded MPI performance to a reasonable level. We propose a novel multi-threaded MPI benchmark with multiple communication patterns that stress multiple points of the MPI implementation, with the ability to switch between MPI processes and threads for quick comparison between the two modes, enabling us and other MPI developers to stress test their implementation designs. We address the interoperability between MPI implementations and threading frameworks by introducing the thread-synchronization object, an object that gives the MPI implementation more control over user-level threads, allowing for more thread utilization in MPI. In our implementation, the synchronization object relieves lock contention on the internal progress engine and achieves up to 7x the performance of the original implementation. Moving forward, we explore the possibility of harnessing true thread concurrency and propose several strategies to address the bottlenecks in MPI implementations.
From our evaluation, with our novel threading optimization, we can achieve up to 22x the performance of legacy MPI designs
Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery
Early-bird communication is a communication/computation overlap technique
that combines fine-grained communication with partitioned communication to
improve application run-time. Communication is divided among the compute
threads such that each individual thread can initiate transmission of its
portion of the data as soon as it is complete rather than waiting for all of
the threads. However, the benefit of early-bird communication depends on the
completion timing of the individual threads. In this paper, we measure and
evaluate the potential overlap, the idle time each thread experiences between
finishing their computation and the final thread finishing. These measurements
help us understand whether a given application could benefit from early-bird
communication. We present our technique for gathering this data and evaluate
data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To
characterize the behavior of these workloads, we study the thread timings at
both a macro level, i.e., across all threads across all runs of an application,
and a micro level, i.e., within a single process of a single run. We observe
that these applications exhibit significantly different behavior. While MiniFE
and MiniQMC appear to be well-suited for early-bird communication because of
their wider thread distribution and more frequent laggard threads, the behavior
of MiniMD may limit its ability to leverage early-bird communication
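The "potential overlap" described in the abstract above, the idle time between one thread finishing and the final thread finishing, can be illustrated with a small sketch. This is not the authors' measurement tool: the function name `potential_overlap` and the sample finish times are hypothetical and serve only to show the arithmetic.

```python
def potential_overlap(finish_times):
    """Return, for each thread, the gap between its finish time and the
    finish time of the last (laggard) thread in the same parallel region."""
    last = max(finish_times)
    return [last - t for t in finish_times]


# Hypothetical finish times (in seconds) for four compute threads in one process.
finish = [1.0, 1.5, 1.25, 2.0]
idle = potential_overlap(finish)
for thread_id, gap in enumerate(idle):
    print(f"thread {thread_id}: could overlap up to {gap:.2f} s of communication")
```

A wide spread of gaps suggests more room for early-bird transmission; gaps near zero for every thread suggest little benefit.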
Scalability of the NewMadeleine Communication Library for Large Numbers of MPI Point-to-Point Requests
New kinds of applications with many threads or irregular communication patterns, which rely heavily on point-to-point MPI communication, have emerged. They stress the MPI library with potentially many simultaneous MPI send and receive requests. With large numbers of simultaneous requests, the bottleneck lies in two main mechanisms: tag matching (the algorithm that matches an incoming packet with a posted receive request) and the progression engine. In this paper, we propose algorithms and implementations that overcome these issues so as to scale up to thousands of requests if needed. In particular, our algorithms are able to perform constant-time tag matching even with any-source and any-tag support. We have implemented these mechanisms in our NewMadeleine communication library. Through micro-benchmarks and computation kernel benchmarks, we demonstrate that our MPI library exhibits better performance than state-of-the-art MPI implementations in cases with many simultaneous requests
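To make the tag-matching step concrete, here is a minimal sketch of the naive list-based approach, in which each incoming message is compared against posted receives in posting order. It is not the constant-time algorithm of the paper above, which is not detailed in the abstract; the class `PostedReceives` and the wildcard constants are hypothetical stand-ins for MPI's any-source and any-tag wildcards.

```python
ANY_SOURCE = -1  # hypothetical wildcard marker (stands in for MPI_ANY_SOURCE)
ANY_TAG = -1     # hypothetical wildcard marker (stands in for MPI_ANY_TAG)


class PostedReceives:
    """Naive posted-receive queue: matching cost grows with queue length."""

    def __init__(self):
        self._queue = []  # (source, tag, request_id) in posting order

    def post(self, source, tag, request_id):
        self._queue.append((source, tag, request_id))

    def match(self, source, tag):
        """Return the id of the oldest posted receive that matches the
        incoming (source, tag) pair, removing it; return None if none match."""
        for i, (src, tg, rid) in enumerate(self._queue):
            if src in (ANY_SOURCE, source) and tg in (ANY_TAG, tag):
                del self._queue[i]
                return rid
        return None


prq = PostedReceives()
prq.post(0, 7, "a")
prq.post(ANY_SOURCE, 7, "b")
print(prq.match(1, 7))  # only the wildcard receive can match source 1
```

The linear scan is what becomes costly with thousands of outstanding requests, which is the scaling problem the abstract describes.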
Techniques for High Performance Matching
With the growth of big data application demands, improving high-performance computing (HPC) becomes an essential industry task. High-performance matching is a critical performance path for HPC communications because it significantly impacts computing performance and profoundly affects networking performance. This dissertation focuses on improving high-performance matching in HPC networks to keep up with the increasingly heavy demands of evolving applications.
This dissertation tackles the matching problem from both the computational and network aspects. On the one hand, the Message Passing Interface (MPI) is a de facto standard for the communication of parallel processes in an HPC network [1]. MPI has delivered excellent performance for running large-scale scientific applications in petascale systems. Along with the petascale system, the exascale system is evolving to run even larger applications where the computing job size increases dramatically. This trend enlarges the message queues and degrades the MPI message matching performance. With the increasing requirement of big data applications, MPI message matching is a critical performance path for HPC communications. On the other hand, with the blooming of network techniques and the fast-growing size of network applications, users are seeking more enhanced, secure, and varied network services. In an HPC network, the HPC cluster comprises multiple interconnected nodes in a switched network. With the integration of software-defined networking (SDN) technology into the HPC network, both the computational and network resources can be allocated efficiently according to the applications’ requirements. Thus, SDN switches are deployed in HPC networks to support high-performance, differentiated network services and meet users’ diverse needs, such as firewall, load balancing, and quality of service [2]. In an SDN switch, packet classification, a core switch function, classifies incoming packets into flows according to the rules generated in the control plane. Therefore, packet classification becomes a critical performance path for the HPC network.
First, this dissertation presents GenMatcher, a generic and software-only arbitrary matching framework for fast and efficient searches on packet classification. The goal is to represent arbitrary rules with efficient prefix-based tries. In order to generate efficient trie groupings and expansions to support all arbitrary rules, we propose a clustering-based grouping algorithm to group rules based upon their bit-level similarities. Our algorithm generates near-optimal trie groupings with low configuration times and provides significantly higher match throughput than prior techniques. Experiments with synthetic traffic show that our method can achieve a 58.9X speedup compared to the baseline on a single-core processor under a given memory constraint [3].
Second, to further improve the GenMatcher performance, this dissertation proposes GenSMatcher, an efficient Single Instruction Multiple Data (SIMD) and cache-friendly arbitrary matching framework. GenSMatcher adopts a trie node with a fixed high fanout and a varying span for each node depending on the data distribution. The layout of the trie node leverages the cache and modern processor features such as SIMD instructions. To support arbitrary matching, we interpret arbitrary rules as three fields: value, mask, and priority, and then propose the GenSMatcher extraction algorithm to process the wildcard bits, so that wildcards can be positioned anywhere in arbitrary rules. Finally, we add an array of wildcard entries to the leaf entries, which stores the wildcard rules and guarantees matching results. Experiments show that GenSMatcher outperforms GenMatcher on large rulesets and key sets in terms of search time, insert time, and memory cost. Specifically, with 5M rules, our method achieves a 2.7X speedup on search time, and the insertion time takes ∼ 7.3 seconds, gaining a 1.38X speedup; meanwhile, the memory cost reduction is up to 6.17X.
Third, to guarantee the MPI ordering feature and high-performance matching for big applications on MPI tag matching, this dissertation introduces a new hybrid data structure and matching mechanism to address the performance challenges, reducing the matching operation time in the posted receive queue (PRQ) and unexpected message queue (UMQ). The hybrid data structures are composed of tries and hash maps. We evaluate our mechanism on microbenchmarks and existing MPI applications with different numbers of processes. Experiments with synthetic message flow show that our method can achieve a 20X search time speedup compared to the single-core processor’s baseline. For the PICSARlite application, we integrated our Hybrid and Intel mechanism into the MPICH library and evaluated their performance on the Ada cluster of Texas A&M University, which has 793 general compute nodes. The experiment outcome shows that our proposed Hybrid mechanism can achieve up to 1.55X speedup compared to the MPICH library method
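The value, mask, and priority representation of arbitrary rules mentioned in the abstract above can be illustrated with a brute-force classifier. This is only a reference sketch of ternary matching semantics, not the GenMatcher or GenSMatcher trie structures; the `Rule` type, the `classify` function, and the convention that a larger priority wins are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    value: int     # bit pattern to compare against
    mask: int      # 1 = bit is significant, 0 = wildcard bit
    priority: int  # assumption: larger number means higher priority


def classify(key: int, rules: Sequence[Rule]) -> Optional[Rule]:
    """Linear-scan baseline: return the highest-priority rule whose
    significant bits all equal the corresponding bits of the key."""
    best = None
    for rule in rules:
        if (key & rule.mask) == (rule.value & rule.mask):
            if best is None or rule.priority > best.priority:
                best = rule
    return best


rules = [
    Rule(value=0b1010, mask=0b1111, priority=1),  # exact match on 1010
    Rule(value=0b1000, mask=0b1100, priority=5),  # 10** (two wildcard bits)
    Rule(value=0b0000, mask=0b0000, priority=0),  # **** (catch-all)
]
print(classify(0b1010, rules))
```

Because a mask can zero out any bit positions, wildcards are not limited to suffixes, which is what separates arbitrary matching from plain prefix matching.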
Communication Architectures for Scalable GPU-centric Computing Systems
In recent years, power consumption has become the main concern in High Performance Computing (HPC). This has led to heterogeneous computing systems in which Central Processing Units (CPUs) are supported by accelerators, such as Graphics Processing Units (GPUs). While GPUs used to be seen as slave devices to which the main processor offloads computation, today’s systems tend to deploy more GPUs than CPUs. Eventually, the GPU will become a first-class processor, bearing increasing responsibilities.
Promoting the GPU to a first-class processor comes with many challenges, such as progress guarantees, dynamic memory management, and scheduling. However, one of the main challenges is the GPU’s inability to orchestrate communication, which is currently entirely handled by the CPU. This work addresses that issue and presents solutions to allow GPUs to source and sink network traffic independently. Many important aspects are addressed, ranging from the application level to how networking hardware is accessed.
First, important large-scale exascale applications are studied to better understand their communication behavior and requirements. Several metrics are presented, including time spent on communication, message sizes, and the length of the queues that are required to match messages with receive requests. One aspect the analysis revealed is that messages are becoming smaller at scale, which renders the matching of messages and receive requests an important problem to address.
The next part analyzes how the GPU can directly access the network, with various communication models being presented and benchmarked. It is shown that a flat address space of distributed GPU memories offers higher bandwidth than put/get communication or CPU-controlled message passing, but less communication can be overlapped with computation. Overall, GPU-controlled communication is always superior, both in terms of time-to-solution and energy consumption.
The final part addresses communication management on GPUs, which is required to provide high-level communication abstractions. Besides other fundamental building blocks, an algorithm for message matching is presented that yields performance similar to that of CPUs. However, it is also shown that the messaging protocol can be relaxed to improve performance significantly, leveraging the massive amount of parallelism provided by the GPU’s architecture
Peace made, peace built?: Participation, countryside, and politics in the 2010s Colombian peace process
This thesis argues that pursuing the participation and inclusion of the whole of society, and
keeping the citizenry well informed about the terms of the accord, is vital to peacemaking on
the one hand; and that rural restructuring, changing political parties’ informal coercive
institutions, and shifting the social norm of war towards peacebuilding are, on the other,
crucial coordinates for charting a genuine path of development for Colombia, a nation that
during the 2010s faced the challenge of ending its long-standing civil war between the
government and the Revolutionary Armed Forces of Colombia − People's Army (FARC-EP) rebels.
I advance the argument in two parts. The first, on peacemaking, is divided into two chapters.
One examines participation and inclusion in the 2016 peace settlement, drawing on democratic
innovation and the ladder of citizen participation; taking a constructivist approach and
applying hermeneutics, it argues that inclusion does not necessarily mean civil society's
control over the peacemaking process, the participation of political society and the
insurgency being a precondition. The second chapter of this part focuses on the 2016 peace
plebiscite. Conceptually, it argues that personal, relational, cultural, and structural causes
are intimately related to voters’ attitudes. Quantitatively, it shows from municipal data that
places with rural poverty, coca crops, victims, remoteness from the centre, and an intense
rebel presence were positively associated with the yes vote; that the warring parties had a
heterogeneous influence; and that the no vote won in places with higher population and high
abstention. The second part of this thesis addresses peacebuilding in three chapters. The
first argues that civil war has been fuelled by the grievance over rural poverty; drawing on
Latin American Structuralism and original data, it empirically finds a paradox of land
redistribution, strong positive effects of technical progress in defeating rural poverty, a
dependency that undermines better rural living standards, widening gaps between centre and
periphery, and the egregious effects of forced displacement on the countryside. The second
chapter of this part examines the brutality, narcotics trafficking, and corruption enforced by
active Colombian political parties (19 parties and one social movement) from 2011 to 2020. To
do so, I address historical contingencies of party politics and build a novel panel data set
in which the brutality composite indicator, the corruption indicator, and coca crops are
response variables for the explanatory matrix of political parties elected to executive branch
positions. The findings unmask political parties that enforced or rejected these three
coercive and violent informal institutions, as well as divergent causes. Lastly, chapter five,
the third chapter of part two, posits eight individual political preferences
(kinship, funding, perpetuation, ideology, decision-making, religion, military, and media) that
cement the norm of civil war. Hence, I carry out an experiment with all members of the
2018-2022 Colombian Congress cohort (102 subjects in the Senate and 170 in the House of
Representatives). The results indicate that the population is dominated by a selfish adapted
community with heterogeneous preferences according to subjects’ chamber or the experimental
groups (i.e., self-enforcers, dodgers, and scofflaws).