1,234 research outputs found

    Scheduling Dynamic OpenMP Applications over Multicore Architectures

    Get PDF
    International audienceApproaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel section, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the ForestGOMP OpenMP platform. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization

    A design method for supporting the development and integration of ARTful global schedulers into multiple programming models

    Get PDF
    Dissertação (mestrado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2019.Plataformas de execução em ambientes de alto desempenho estão tornando-se cada vez mais diversas com o desenvolvimento de novas arquiteturas e ferramentas para se beneficiar do paralelismo inerente das aplicações. Estas novas opções de ferramentas oferecem possibilidades de melhoria no desempenho de aplicações científicas e de engenharia. A crescente gama de plataformas torna mais complexa a distribuição das tarefas da aplicação no ambiente, passo que deve ser gerido pelos sistemas de execução e seus balanceadores de carga, de modo a não prejudicar a portabilidade da aplicação. No entanto, o desenvolvimento e implantação de novos escalonadores de tarefas em sistemas amplamente utilizados na indústria como OpenMP e MPI sofrem com a falta de suporte de frameworks. Este trabalho propõe uma biblioteca, MOGSLib, para auxiliar o desenvolvimento e implantação de escalonadores globais em diferentes sistemas de execução de alto desempenho. MOGSLib aplica abstrações reutilizáveis, independentes e testáveis na forma de classes template em C++ para representar as políticas de escalonamento, sua relação com o sistema de execução e a taxonomia da solução de escalonamento. Nós avaliamos o sobrecusto de nossas estratégias portáveis em comparação com suas versões nativas nos sistemas de execução em dois ambientes distintos, os sistemas Charm++ e OpenMP. Em nosso experimentos, conseguimos dar suporte à definição de estratégias de escalonamento do usuário e sua implantação como escalonadores de laço na LibGOMP e balanceadores de carga no Charm++ através da seleção de implementação das abstrações que os compõem. Nós verificamos que nossas versões dos escalonadores oferecem tempos de execução de aplicação equivalentes aos balanceadores nativos para a classe de escalonadores centralizados e cientes de carga aplicados em kernels de dinâmica molecular. Por fim, a flexibilidade de incorporar funcionalidades e políticas de escalonamento do usuário em sistemas de execução com modificações limitadas nos códigos de sistemas de execução mostra que é possível construir balanceadores de carga flexíveis com pouco sobrecusto até para ambientes de alto desempenho.Abstract: Execution platforms for high performance computing are becoming diverse as a result of new architectures and tools to benefit from the parallel behavior of applications. These new options showcase performance enhancing opportunities for scientific and engineering applications. The execution platform diversity and the mapping of an application's tasks must be implicitly handled by runtime systems and their global schedulers as to enable the application performance and implementation portability. However, the development and integration of novel global schedulers into industry standards systems like OpenMP and MPI lack framework support. This work proposes a library, MOGSLib, to support the development and integration of global schedulers into different high performance runtime systems. MOGSLib employs reusable, independent and testable abstractions represented as C++ template structures to express scheduling policies, their relationship to runtime systems and the scheduling solution taxonomy. This approach allows a bottom-up development process based on the incremental composition of abstractions through template specializations. We evaluate the overhead of employing our portable policies in comparison to their system-native counterparts in two environments, the Charm++ and OpenMP systems. Throughout our experiments we achieved development support for user-defined scheduling policies that can be implanted both as loop schedulers in LibGOMP and load balancers in Charm++ through the selection of abstraction implementations. We leveraged the overhead by employing workload-aware policies on molecular dynamics kernels which resulted in equivalent application makespan for both native and the MOGSLib scheduler versions. Ultimately, the flexibility to incorporate user-defined structures and scheduling policies into runtime systems with limited alterations into runtime system code bases hint that the definition of flexible global schedulers is available with neglible overheads even for high performance environments

    Topology-aware equipartitioning with coscheduling on multicore systems

    Get PDF
    Over the last decade, multicore architectures have become omnipresent. Today, they are used in the whole product range from server systems to handheld computers. The deployed software still undergoes the slow transition from sequential to parallel. This transition, however, is gaining more and more momentum due to the increased availability of more sophisticated parallel programming environments, which replace the some-times crude results of ad-hoc parallelization. Combined with the ever increasing complexity of multicore architectures, this results in a scheduling problem that is different from what it has been, because features such as non-uniform memory access, shared caches, or simultaneous multithreading have to be considered. In this paper, we compare different ways of scheduling multiple parallel applications. Due to emerging parallel programming environments, we only consider malleable applications, i. e., applications where the parallelism degree can be changed on the fly. We propose a topology-aware scheduling scheme that combines equipartitioning and coscheduling. It does not suffer from the drawbacks of the individual concepts and also allows to run applications at different degrees of parallelisms without compromising fairness. We find that topology-awareness increases performance for all evaluated workloads. The combination with coscheduling is more sensitive towards the executed workloads. However, the gained versatility allows new use cases to be explored, which were not possible before

    LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing

    Get PDF
    LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.Peer ReviewedPostprint (author's final draft
    corecore