12 research outputs found

    StarVZ: Performance Analysis of Task-Based Parallel Applications

    High-performance computing (HPC) applications enable the solution of compute-intensive problems in feasible time. Among many HPC paradigms, task-based programming has gathered community attention in recent years. This paradigm enables constructing an HPC application using a more declarative approach, structuring it as a directed acyclic graph (DAG). The performance evaluation of these applications is as hard as in any other programming paradigm. Understanding how to analyze these applications, employing the DAG and runtime metrics, presents opportunities to improve their performance. This article describes the StarVZ R package, available on CRAN, for performance analysis of task-based applications. StarVZ transforms runtime trace data into different visualizations of the application behavior. An analyst can understand their applications' performance limitations and compare multiple executions. StarVZ has been successfully applied to several case studies, showing its applicability in a number of scenarios.

    Stratégies de distribution d'applications basées sur des tâches sur des plates-formes hétérogènes

    HPC platforms are highly heterogeneous because of intra-node resources such as accelerators and inter-node heterogeneity when different machines are combined. The applications that use these resources are already very complex, with many distinct operations and phases, and developers must consider all sets of diverse computational resources. The task-based programming paradigm is a modern alternative to increase the computational efficiency of intra-node heterogeneous resources while maintaining relative development simplicity. The application defines a Directed Acyclic Graph of tasks, and a dynamic runtime asynchronously schedules them onto the resources while respecting task dependencies. However, handling different types of nodes requires new specific strategies to distribute an application in this asynchronous and heterogeneous environment. This thesis studies the problem of distributing this type of complex task-based application over those diverse system-level resources, proposing strategies to divide their load correctly, considering computational heterogeneity, multiple-phase asynchronism, and adaptability. This work uses real applications to validate its results with experiments conducted in large testbeds and a supercomputer. The thesis' main contributions are the following. (i) Strategies for distributing a single application operation considering the trade-off of communication, critical path, and heterogeneous load balancing. (ii) A set of optimizations for improving asynchronous phase overlap in applications. (iii) A methodology for computing the relative power of each phase on each heterogeneous group of nodes considering the phase overlap. (iv) A distribution strategy for an antecedent phase that reduces communication redistribution. (v) A strategy for the application to adapt dynamically during execution to decide the best subset of nodes for each phase. (vi) An extended comprehensive analysis of the experiments that includes a methodology to analyze the application progress per node that is resilient to heterogeneity and can cluster nodes with similar behavior. Ultimately, this thesis is a step toward efficiently exploiting and combining any of these diverse resources, using them to better handle applications' distinct needs and improving their overall performance.

    Strategies for Distributing Task-Based Applications on Heterogeneous Platforms

    HPC platforms are highly heterogeneous because of intra-node resources such as accelerators and inter-node heterogeneity when different machines are combined. The applications that use these resources are already very complex, with many distinct operations and phases, and developers must consider all sets of diverse computational resources. The task-based programming paradigm is a modern alternative to increase the computational efficiency of intra-node heterogeneous resources while maintaining relative development simplicity. The application defines a Directed Acyclic Graph of tasks, and a dynamic runtime asynchronously schedules them onto the resources while respecting task dependencies. However, handling different types of nodes requires new specific strategies to distribute an application in this asynchronous and heterogeneous environment. This thesis studies the problem of distributing this type of complex task-based application over those diverse system-level resources, proposing strategies to divide their load correctly, considering computational heterogeneity, multiple-phase asynchronism, and adaptability. This work uses real applications to validate its results with experiments conducted in large testbeds and a supercomputer. The thesis' main contributions are the following. (i) Strategies for distributing a single application operation considering the trade-off of communication, critical path, and heterogeneous load balancing. (ii) A set of optimizations for improving asynchronous phase overlap in applications. (iii) A methodology for computing the relative power of each phase on each heterogeneous group of nodes considering the phase overlap. (iv) A distribution strategy for an antecedent phase that reduces communication redistribution. (v) A strategy for the application to adapt dynamically during execution to decide the best subset of nodes for each phase. (vi) An extended comprehensive analysis of the experiments that includes a methodology to analyze the application progress per node that is resilient to heterogeneity and can cluster nodes with similar behavior. Ultimately, this thesis is a step toward efficiently exploiting and combining any of these diverse resources, using them to better handle applications' distinct needs and improving their overall performance.

    Estratégias para análise do desempenho de memória em nível de runtime para aplicações baseadas em tarefas sobre plataformas heterogêneas

    Programming parallel applications for heterogeneous High Performance Computing platforms is easier when using the task-based programming paradigm, where a Directed Acyclic Graph (DAG) of tasks models the application behavior. The simplicity exists because a runtime, like StarPU, takes care of many activities usually carried out by the application developer, such as task scheduling, load balancing, and memory management. This memory management refers to the runtime's responsibility for handling memory operations, like copying the necessary data to the location where a given task is scheduled to execute. Poor scheduling or lack of appropriate information may lead to inadequate memory management by the runtime. Discovering whether an application presents memory-related performance problems is complex. Programmers of task-based applications and runtimes would benefit from specialized performance analysis methodologies that check for possible memory management problems. To this end, this work proposes methods and tools to investigate heterogeneous CPU-GPU-Disk memory management of the StarPU runtime, a popular task-based middleware for HPC applications. These methods are based on the execution traces collected by the runtime. These traces provide information about the runtime decisions and the system performance; however, even a simple application can produce huge amounts of trace data that need to be analyzed and converted to useful metrics or visualizations. The use of a methodology specific to task-based applications could lead to a better understanding of memory behavior and possible performance optimizations. The proposed strategies are applied to three different problems: a dense Cholesky solver, a CFD simulation, and a sparse QR factorization. On the dense Cholesky solver, the strategies found a problem in StarPU whose correction leads to a 66% performance improvement. On the CFD simulation, the strategies guided the insertion of extra information on the DAG and data, leading to performance gains of 38%. These results indicate the effectiveness of the proposed analysis methodology in identifying problems that lead to relevant optimizations.

    Estratégias para a distribuição de aplicações baseadas em tarefas em plataformas heterogêneas

    HPC platforms are highly heterogeneous because of intra-node resources such as accelerators and inter-node heterogeneity when different machines are combined. The applications that use these resources are already very complex, with many distinct operations and phases, and developers must consider all sets of diverse computational resources. The task-based programming paradigm is a modern alternative to increase the computational efficiency of intra-node heterogeneous resources while maintaining relative development simplicity. The application defines a Directed Acyclic Graph of tasks, and a dynamic runtime asynchronously schedules them onto the resources while respecting task dependencies. However, handling different types of nodes requires new specific strategies to distribute an application in this asynchronous and heterogeneous environment. This thesis studies the problem of distributing this type of complex task-based application over those diverse system-level resources, proposing strategies to divide their load correctly, considering computational heterogeneity, multiple-phase asynchronism, and adaptability. This work uses real applications to validate its results with experiments conducted in large testbeds and a supercomputer. The thesis' main contributions are the following. (i) Strategies for distributing a single application operation considering the trade-off of communication, critical path, and heterogeneous load balancing. (ii) A set of optimizations for improving asynchronous phase overlap in applications. (iii) A methodology for computing the relative power of each phase on each heterogeneous group of nodes considering the phase overlap. (iv) A distribution strategy for an antecedent phase that reduces communication redistribution. (v) A strategy for the application to adapt dynamically during execution to decide the best subset of nodes for each phase. (vi) An extended comprehensive analysis of the experiments that includes a methodology to analyze the application progress per node that is resilient to heterogeneity and can cluster nodes with similar behavior. Ultimately, this thesis is a step toward efficiently exploiting and combining any of these diverse resources, using them to better handle applications' distinct needs and improving their overall performance.

    Visual Performance Analysis of Memory Behavior in a Task-Based Runtime on Hybrid Platforms

    Programming parallel applications for heterogeneous HPC platforms is much more straightforward when using the task-based programming paradigm. The simplicity exists because a runtime takes care of many activities usually carried out by the application developer, such as task mapping, load balancing, and memory management operations. In this paper, we present a visualization-based performance analysis methodology to investigate the CPU-GPU-Disk memory management of the StarPU runtime, a popular task-based middleware for HPC applications. We detail the design of novel graphical strategies that were fundamental to recognizing performance problems in four case studies. We first identify poor management of data handles when GPU memory is saturated, leading to low application performance. Our experiments using the dense tile-based Cholesky factorization show that our fix leads to performance gains of 66% and better scalability for larger input sizes. In the other three cases, we study scenarios where the main memory is insufficient to store all the application's data, forcing the runtime to store data out-of-core. Using our methodology, we pinpoint different behaviors among schedulers and identify a crucial problem in the application code regarding initial block placement, which leads to poor performance.

    Summarizing task-based applications behavior over many nodes through progression clustering

    Visualization strategies are a valuable tool in the performance evaluation of HPC applications. Although traditional Gantt charts are a widespread and enlightening strategy, they present scalability problems and may misguide the analysis by focusing on resource utilization alone. This paper proposes an overview strategy to indicate nodes of interest for further investigation with classical visualizations like Gantt charts. For this, it uses a progression metric that captures work done per node inferred from the task-based structure, a time-step clustering of those metrics to decrease redundant information, and a more scalable visualization technique. We demonstrate with six scenarios and two applications that such a strategy can indicate problematic nodes more straightforwardly while using the same visualization space. We also provide examples where it correctly captures application work progression, showing application problems earlier and offering an easy way to compare nodes in cases where traditional methods are misleading.
