18 research outputs found

    Topology and affinity aware hierarchical and distributed load-balancing in Charm++

    Get PDF
    International audienceThe evolution of massively parallel supercomputers make palpable two issues in particular: the load imbalance and the poor management of data locality in applications. Thus, with the increase of the number of cores and the drastic decrease of amount of memory per core, the large performance needs imply to particularly take care of the load-balancing and as much as possible of the locality of data. One mean to take into account this locality issue relies on the placement of the processing entities and load balancing techniques are relevant in order to improve application performance. With large-scale platforms in mind, we developed a hierarchical and distributed algorithm which aim is to perform a topology-aware load balancing tailored for Charm++ applications. This algorithm is based on both LibTopoMap for the network awareness aspects and on TREEMATCH to determine a relevant placement of the processing entities. We show that the proposed algorithm improves the overall execution time in both the cases of real applications and a synthetic benchmark as well. For this last experiment, we show a scalability up to one millions processing entities

    Placement d'applications parallèles en fonction de l'affinité et de la topologie

    Get PDF
    Computer simulation is one of the pillars of Sciences and industry. Climate simulation,cosmology, or heart modeling are all areas in which computing power needs are constantlygrowing. Thus, how to scale these applications ? Parallelization and massively parallel supercomputersare the only ways to do achieve. Nevertheless, there is a price to pay consideringthe hardware topologies incessantly complex, both in terms of network and memoryhierarchy. The issue of data locality becomes central : how to reduce the distance betweena processing entity and data to which it needs to access ? Application placement is one ofthe levers to address this problem. In this thesis, we present the TreeMatch algorithmand its application for static mapping, that is to say at the lauchtime of the application,and the dynamic placement. For this second approach, we propose the awareness of datalocality within a load balancing algorithm. The different approaches discussed are validatedby experiments both on benchmarking codes and on real applications.La simulation numérique est un des piliers des Sciences et de l’industrie. La simulationmétéorologique, la cosmologie ou encore la modélisation du coeur humain sont autantde domaines dont les besoins en puissance de calcul sont sans cesse croissants. Dès lors,comment passer ces applications à l’échelle ? La parallélisation et les supercalculateurs massivementparallèles sont les seuls moyens d’y parvenir. Néanmoins, il y a un prix à payercompte tenu des topologies matérielles de plus en plus complexes, tant en terme de réseauque de hiérarchie mémoire. La question de la localité des données devient ainsi centrale :comment réduire la distance entre une entité logicielle et les données auxquelles elle doitaccéder ? Le placement d’applications est un des leviers permettant de traiter ce problème.Dans cette thèse, nous présentons l’algorithme de placement TreeMatch et ses applicationsdans le cadre du placement statique, c’est-à-dire au lancement de l’application, et duplacement dynamique. Pour cette seconde approche, nous proposons la prise en comptede la localité des données dans le cadre d’un algorithme d’équilibrage de charge. Les différentesapproches abordées sont validées par des expériences réalisées tant sur des codesd’évaluation de performances que sur des applications réelles

    Matching communication pattern with underlying hardware architecture

    Get PDF
    International audienceMATCHING COMMUNICATION PATTERN WITH UNDERLYING HARDWARE ARCHITECTUR

    An Overview of Process Mapping Techniques and Algorithms in High-Performance Computing

    Get PDF
    International audienceDue to the advent of modern hardware architectures of high-performance comput- ers, the way the parallel applications are laid out is of paramount importance for performance. This chapter surveys several techniques and algorithms that efficiently address this issue: the mapping of the application's virtual topology (for instance its communication pattern) onto the physical topology. Using such strategy enables to improve the application overall execution time significantly. The chapter concludes by listing a series of open issues and problems

    A design method for supporting the development and integration of ARTful global schedulers into multiple programming models

    Get PDF
    Dissertação (mestrado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2019.Plataformas de execução em ambientes de alto desempenho estão tornando-se cada vez mais diversas com o desenvolvimento de novas arquiteturas e ferramentas para se beneficiar do paralelismo inerente das aplicações. Estas novas opções de ferramentas oferecem possibilidades de melhoria no desempenho de aplicações científicas e de engenharia. A crescente gama de plataformas torna mais complexa a distribuição das tarefas da aplicação no ambiente, passo que deve ser gerido pelos sistemas de execução e seus balanceadores de carga, de modo a não prejudicar a portabilidade da aplicação. No entanto, o desenvolvimento e implantação de novos escalonadores de tarefas em sistemas amplamente utilizados na indústria como OpenMP e MPI sofrem com a falta de suporte de frameworks. Este trabalho propõe uma biblioteca, MOGSLib, para auxiliar o desenvolvimento e implantação de escalonadores globais em diferentes sistemas de execução de alto desempenho. MOGSLib aplica abstrações reutilizáveis, independentes e testáveis na forma de classes template em C++ para representar as políticas de escalonamento, sua relação com o sistema de execução e a taxonomia da solução de escalonamento. Nós avaliamos o sobrecusto de nossas estratégias portáveis em comparação com suas versões nativas nos sistemas de execução em dois ambientes distintos, os sistemas Charm++ e OpenMP. Em nosso experimentos, conseguimos dar suporte à definição de estratégias de escalonamento do usuário e sua implantação como escalonadores de laço na LibGOMP e balanceadores de carga no Charm++ através da seleção de implementação das abstrações que os compõem. Nós verificamos que nossas versões dos escalonadores oferecem tempos de execução de aplicação equivalentes aos balanceadores nativos para a classe de escalonadores centralizados e cientes de carga aplicados em kernels de dinâmica molecular. Por fim, a flexibilidade de incorporar funcionalidades e políticas de escalonamento do usuário em sistemas de execução com modificações limitadas nos códigos de sistemas de execução mostra que é possível construir balanceadores de carga flexíveis com pouco sobrecusto até para ambientes de alto desempenho.Abstract: Execution platforms for high performance computing are becoming diverse as a result of new architectures and tools to benefit from the parallel behavior of applications. These new options showcase performance enhancing opportunities for scientific and engineering applications. The execution platform diversity and the mapping of an application's tasks must be implicitly handled by runtime systems and their global schedulers as to enable the application performance and implementation portability. However, the development and integration of novel global schedulers into industry standards systems like OpenMP and MPI lack framework support. This work proposes a library, MOGSLib, to support the development and integration of global schedulers into different high performance runtime systems. MOGSLib employs reusable, independent and testable abstractions represented as C++ template structures to express scheduling policies, their relationship to runtime systems and the scheduling solution taxonomy. This approach allows a bottom-up development process based on the incremental composition of abstractions through template specializations. We evaluate the overhead of employing our portable policies in comparison to their system-native counterparts in two environments, the Charm++ and OpenMP systems. Throughout our experiments we achieved development support for user-defined scheduling policies that can be implanted both as loop schedulers in LibGOMP and load balancers in Charm++ through the selection of abstraction implementations. We leveraged the overhead by employing workload-aware policies on molecular dynamics kernels which resulted in equivalent application makespan for both native and the MOGSLib scheduler versions. Ultimately, the flexibility to incorporate user-defined structures and scheduling policies into runtime systems with limited alterations into runtime system code bases hint that the definition of flexible global schedulers is available with neglible overheads even for high performance environments

    Communication models insights meet simulations

    Get PDF
    International audienceIt is well-known that taking into account communications while scheduling jobs in large scale parallel computing platforms is a crucial issue. In modern hierarchical platforms, communication times are highly different when occurring inside a cluster or between clusters. Thus, allocating the jobs taking into account locality constraints is a key factor for reaching good performances. However, several theoretical results prove that imposing such constraints reduces the solution space and thus, possibly degrades the performances. In practice, such constraints simplify implementations and most often lead to better results. Our aim in this work is to bridge theoretical and practical intuitions, and check the differences between constrained and unconstrained schedules (namely with respect to locality and node contiguity) through simulations. We have developped a generic tool, using SimGrid as the base simulator, enabling interactions with external batch schedulers to evaluate their scheduling policies. The results confirm that insights gained through theoretical models are ill-suited to current architectures and should be reevaluated

    Adaptive Data Migration in Load-Imbalanced HPC Applications

    Get PDF
    Distributed parallel applications need to maximize and maintain computer resource utilization and be portable across different machines. Balanced execution of some applications requires more effort than others because their data distribution changes over time. Data re-distribution at runtime requires elaborate schemes that are expensive and may benefit particular applications. This dissertation discusses a solution for HPX applications to monitor application execution with APEX and use AGAS migration to adaptively redistribute data and load balance applications at runtime to improve application performance and scaling behavior. This dissertation provides evidence for the practicality of using the Active Global Address Space as is proposed by the ParalleX model and implemented in HPX. It does so by using migration for the transparent moving of objects at runtime and using the Autonomic Performance Environment for eXascale library with experiments that run on homogeneous and heterogeneous machines at Louisiana State University, CSCS Swiss National Supercomputing Centre, and National Energy Research Scientific Computing Center

    Programming Abstractions for Data Locality

    Get PDF
    The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, we are rapidly moving to an era that computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow to describe how to decompose and how to layout data in the memory.Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy to enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal
    corecore