    PaSTeL. Une implantation parallèle de la STL pour les architectures multi-coeurs : une analyse des performances

    Get PDF
    In this article, we propose the PaSTeL library, a parallel implementation of part of the STL, the standard library of the C++ language. PaSTeL provides both a programming model for building parallel algorithms and an execution model based on work stealing. Particular attention was paid to the use of optimized mechanisms for thread synchronization and activation. The performance of PaSTeL is evaluated on a desktop machine with a dual-core processor as well as on a 16-core machine. Notably, PaSTeL outperforms other STL implementations even for short runs on small data sets.
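
    PaSTeL's own code is not reproduced here. As a rough illustration of the execution model described above, the following sketch implements a work-stealing parallel for_each in Python: each worker owns a deque of index ranges, and an idle worker steals a range from the opposite end of another worker's deque. The function and variable names are invented for this sketch and are not PaSTeL's API, and the lock-based deques stand in for the optimized synchronization mechanisms the paper discusses.

```python
# Minimal work-stealing for_each sketch (illustrative only, not PaSTeL's API).
# Each worker owns a deque of index ranges; when its deque is empty it tries
# to steal a range from the opposite end of another worker's deque.
import collections
import random
import threading

def parallel_for_each(data, func, num_workers=4, chunk=64):
    deques = [collections.deque() for _ in range(num_workers)]
    locks = [threading.Lock() for _ in range(num_workers)]

    # Distribute contiguous chunks round-robin over the workers' deques.
    for i, start in enumerate(range(0, len(data), chunk)):
        deques[i % num_workers].append((start, min(start + chunk, len(data))))

    def run(me):
        while True:
            task = None
            with locks[me]:                      # pop own work from the front
                if deques[me]:
                    task = deques[me].popleft()
            if task is None:                     # no local work: try to steal
                victims = [v for v in range(num_workers) if v != me and deques[v]]
                if not victims:
                    return                       # nothing left anywhere
                v = random.choice(victims)
                with locks[v]:                   # steal from the victim's back
                    if deques[v]:
                        task = deques[v].pop()
                if task is None:
                    continue
            for idx in range(task[0], task[1]):
                func(data[idx])

    threads = [threading.Thread(target=run, args=(w,)) for w in range(num_workers)]
    for t in threads: t.start()
    for t in threads: t.join()

if __name__ == "__main__":
    out = [0] * 1000
    parallel_for_each(range(1000), lambda i: out.__setitem__(i, i * i))
    assert out[999] == 999 * 999
```

    Stealing from the back of the victim's deque keeps thieves away from the cache-warm ranges the owner consumes from the front, which is the usual rationale for this design.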

    Faithful Performance Prediction of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures

    Get PDF
    Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far build on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem shifts to choosing the task granularity and task graph structure and to optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge: getting accurate measurements requires reserving the target system for the whole duration of the experiments, and observations are limited to the few systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator of distributed systems. This approach yields performance predictions of classical dense linear algebra kernels that are accurate to within a few percent and obtained in a matter of seconds, which allows both runtime and application designers to quickly decide which optimizations to enable or whether it is worth investing in higher-end GPUs. It also makes it possible to conduct robust and extensive scheduling studies in a controlled environment whose characteristics are very close to those of real platforms, while keeping the behavior reproducible.
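
    The actual StarPU-over-SimGrid machinery is far more involved than what can be shown here; the toy sketch below only conveys the central idea of replacing real kernel executions with calibrated per-resource timing models and advancing simulated clocks according to a scheduler's decisions. The task graph, the durations, and the earliest-finish-time rule are all invented for illustration.

```python
# Toy coarse-grain simulation: tasks are not executed, their (calibrated)
# per-resource durations are used to advance simulated clocks instead.

# duration[task_type][resource_kind] would come from calibration runs.
duration = {"potrf": {"cpu": 5.0, "gpu": 3.0},
            "trsm":  {"cpu": 2.0, "gpu": 0.8},
            "gemm":  {"cpu": 4.0, "gpu": 0.5}}

# (name, type, dependencies) -- a tiny slice of a tiled Cholesky-like DAG.
tasks = [("p0", "potrf", []),
         ("t1", "trsm", ["p0"]),
         ("t2", "trsm", ["p0"]),
         ("g3", "gemm", ["t1", "t2"])]

resources = {"cpu0": "cpu", "cpu1": "cpu", "gpu0": "gpu"}

def simulate(tasks, resources):
    ready_at = {r: 0.0 for r in resources}       # when each resource is free
    finish = {}                                  # simulated finish time per task
    for name, kind, deps in tasks:               # tasks listed in topological order
        dep_done = max((finish[d] for d in deps), default=0.0)
        # Greedy earliest-finish-time choice across all resources.
        best = min(resources,
                   key=lambda r: max(ready_at[r], dep_done) + duration[kind][resources[r]])
        start = max(ready_at[best], dep_done)
        finish[name] = start + duration[kind][resources[best]]
        ready_at[best] = finish[name]
        print(f"{name} ({kind}) -> {best}: {start:.1f}..{finish[name]:.1f}")
    return max(finish.values())                  # predicted makespan

print("predicted makespan:", simulate(tasks, resources))
```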

    Fast and Accurate Simulation of Multithreaded Sparse Linear Algebra Solvers

    Get PDF
    The ever-growing complexity and scale of parallel architectures force classical monolithic HPC scientific applications and libraries to be rewritten, as their portability and performance optimization come only at a prohibitive cost. There is thus a recent and general trend toward a modular approach in which numerical algorithms are written, at a high level and independently of the hardware architecture, as Directed Acyclic Graphs (DAGs) of tasks. A task-based runtime system then dynamically schedules the resulting DAG on the different computing resources, automatically taking care of data movement and accounting for possible speed heterogeneity and variability. Evaluating the performance of such complex and dynamic systems is extremely challenging, especially for irregular codes. In this article, we explain how we crafted a faithful simulation, both in terms of performance and memory usage, of the behavior of qr_mumps, a fully-featured sparse linear algebra library, on multi-core architectures. In our approach, the target high-end machines are calibrated only once to derive sound performance models. These models can then be used at will to quickly predict and study, in a reproducible way, the performance of such irregular and resource-demanding applications using only a commodity laptop.
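
    As a hedged illustration of the "calibrate once, predict at will" idea, the sketch below fits a simple linear model of kernel duration versus task size from a few made-up calibration samples and then reuses it to predict durations for unseen sizes. The performance models actually used for qr_mumps tasks are richer than this.

```python
# Illustrative calibration: fit duration = a * flops + b from sample timings,
# then predict the duration of unseen task sizes. Data points are made up.
import numpy as np

# (flop count, measured seconds) gathered once on the target machine.
samples = np.array([[1e8, 0.031], [4e8, 0.118], [9e8, 0.262], [1.6e9, 0.455]])

a, b = np.polyfit(samples[:, 0], samples[:, 1], 1)   # least-squares linear fit

def predict_seconds(flops):
    return a * flops + b

for flops in (2.5e8, 1.2e9):
    print(f"{flops:.1e} flops -> {predict_seconds(flops)*1e3:.1f} ms (predicted)")
```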

    Narrowing the Search Space of Applications Mapping on Hierarchical Topologies

    Get PDF
    Processor architectures at exascale and beyond are expected to continue to suffer from non-uniform access to in-die and node-wide shared resources. Mapping applications onto these resource hierarchies is an ongoing performance concern, requiring specific care to increase locality and resource sharing while also accounting for the ensuing contention. Application-agnostic approaches to searching for efficient mappings are based on heuristics: the size of the search space makes it impractical to find optimal solutions today, and the situation will only worsen as the complexity of computing systems increases. In this paper we leverage the hierarchical structure of modern compute nodes to reduce the size of this search space. As a result, we facilitate the search for optimal mappings and improve the ability to evaluate existing heuristics. Using widely known benchmarks, we show that permuting thread and process placement within each node of a hierarchical topology leads to similar performance. The mapping search space can therefore be narrowed down by several orders of magnitude when performing exhaustive search. This reduced search space enables new approaches, including exhaustive search and automatic exploration, and provides new insights into heuristic-based approaches, including better upper bounds and a smaller solution space.
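
    The following sketch illustrates the counting argument on a deliberately tiny example: if performance depends on which processes share a node rather than on their ordering inside the node, all mappings that differ only by a per-node permutation can be collapsed onto one canonical representative. The process and node counts are arbitrary, and the code is not the paper's tooling.

```python
# Enumerate placements of processes onto nodes and count how many remain
# once mappings differing only by within-node ordering are merged.
from itertools import permutations

processes = range(6)          # 6 processes ...
nodes, per_node = 2, 3        # ... on 2 nodes with 3 slots each (arbitrary sizes)

raw, canonical = set(), set()
for perm in permutations(processes):
    # slot i of the permutation goes to node i // per_node
    mapping = tuple(perm[i * per_node:(i + 1) * per_node] for i in range(nodes))
    raw.add(mapping)
    # canonical form: ignore ordering inside each node
    canonical.add(tuple(frozenset(group) for group in mapping))

print(len(raw), "raw mappings")              # 720
print(len(canonical), "canonical mappings")  # 20 = C(6,3)
```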

    Improving Simulations of MPI Applications Using A Hybrid Network Model with Topology and Contention Support

    Get PDF
    Proper modeling of collective communications is essential for understanding the behavior of medium-to-large scale parallel applications, and even minor deviations in their implementation can adversely affect the prediction of real-world performance. We propose a hybrid network model extending LogP-based approaches to account for topology and contention in high-speed TCP networks. This model is implemented and validated within SMPI, an MPI implementation provided by the SimGrid simulation toolkit. With SMPI, standard MPI applications can be compiled and run in a simulated network environment, and traces can be captured without the errors introduced by tracing overheads or poor clock synchronization in physical experiments. SMPI also provides features for simulating applications that require large amounts of time or resources, including selective execution, RAM folding, and off-line replay of execution traces. We validate our model by comparing traces produced by SMPI with those from other simulation platforms as well as from real-world environments, and we show the gain obtained over the more classical models used in competing tools.
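
    The sketch below gives the flavor of such a piecewise, LogP-like model: message time is a latency term plus size over bandwidth, with regime-dependent coefficients (for example below and above an eager/rendezvous threshold) and a crude contention factor when several flows share a link. All constants and thresholds are placeholders rather than SMPI's calibrated values.

```python
# Toy piecewise network model in the LogP spirit: per-regime latency/bandwidth
# coefficients plus a naive contention factor. All constants are placeholders.

# (max message size in bytes, latency in s, bandwidth in bytes/s)
REGIMES = [(8 * 1024,     2.0e-6, 2.0e9),    # eager / small messages
           (64 * 1024,    4.0e-6, 4.0e9),    # medium
           (float("inf"), 1.0e-5, 6.0e9)]    # rendezvous / large messages

def msg_time(size, concurrent_flows=1):
    for max_size, latency, bandwidth in REGIMES:
        if size <= max_size:
            # flows sharing a link each get a fraction of its bandwidth
            return latency + size / (bandwidth / concurrent_flows)
    raise AssertionError("unreachable")

for size in (1024, 32 * 1024, 4 * 1024 * 1024):
    print(f"{size:>8} B alone: {msg_time(size)*1e6:8.1f} us, "
          f"with 4 flows: {msg_time(size, 4)*1e6:8.1f} us")
```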

    Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures

    Get PDF
    Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far build on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem shifts to choosing the task granularity and task graph structure and to optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge: getting accurate measurements requires reserving the target system for the whole duration of the experiments, and observations are limited to the few systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator of distributed systems. This approach yields performance predictions accurate to within a few percent on classical dense linear algebra kernels in a matter of seconds, which allows both runtime and application designers to quickly decide which optimizations to enable or whether it is worth investing in higher-end GPUs.
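
    As a minimal illustration of the kind of what-if question such predictions answer, the sketch below compares predicted makespans with the current GPU and with a hypothetical GPU that runs kernels 1.5x faster, under a deliberately crude model where the CPU and GPU parts fully overlap. All numbers are invented.

```python
# Toy what-if study: is a GPU running kernels 1.5x faster worth buying?
cpu_side_s = 90.0     # predicted time of the work that stays on the CPUs
gpu_side_s = 120.0    # predicted time of the kernels offloaded to the GPU

def predicted_makespan(gpu_speedup=1.0):
    # crude model: CPU and GPU parts fully overlap, the slower side dominates
    return max(cpu_side_s, gpu_side_s / gpu_speedup)

current = predicted_makespan()
upgraded = predicted_makespan(gpu_speedup=1.5)
print(f"current GPU: {current:.0f} s, faster GPU: {upgraded:.0f} s "
      f"({100 * (current - upgraded) / current:.0f}% gain)")
```

    Here the upgrade pays off because the GPU side dominates; with a CPU-bound mix the same calculation would show no gain, which is the kind of conclusion such simulations let designers reach without buying the hardware.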

    ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales

    Full text link
    As we enter the exascale computing era, efficiently using power and optimizing the performance of scientific applications under power and energy constraints have become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime and power/energy for energy-efficient application execution. We then use this framework to autotune four ECP proxy applications: XSBench, AMG, SWFFT, and SW4lite. Our approach uses Bayesian optimization with a Random Forest surrogate model to effectively search parameter spaces with up to 6 million different configurations on two large-scale production systems, Theta at Argonne National Laboratory and Summit at Oak Ridge National Laboratory. The experimental results show that our autotuning framework has low overhead at large scales and achieves good scalability. Using the proposed framework to identify the best configurations, we achieve up to 91.59% performance improvement, up to 21.2% energy savings, and up to 37.84% improvement in energy-delay product (EDP) on up to 4,096 nodes.
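
    ytopt itself is not reproduced here; the sketch below only illustrates the general pattern of Bayesian optimization with a Random Forest surrogate, using scikit-learn: fit the surrogate on the configurations evaluated so far, then evaluate the configuration the surrogate predicts to be best, with a little random exploration mixed in. The objective function and the parameter space are synthetic stand-ins for a real application's tunables and its measured runtime or energy.

```python
# Sketch of Bayesian-style autotuning with a Random Forest surrogate.
# The "measurement" below is a synthetic objective standing in for measured
# runtime or energy; the parameter space is likewise invented.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def measure(cfg):
    """Pretend measurement: (num_threads, block_size) -> cost to minimize."""
    threads, block = cfg
    return (threads - 24) ** 2 / 50 + (block - 128) ** 2 / 4000 + rng.normal(0, 0.1)

space = np.array([(t, b) for t in range(1, 65) for b in (32, 64, 128, 256, 512)])

# Start from a few random configurations, then iterate: fit the surrogate,
# evaluate the most promising untried configuration, repeat.
tried_idx = list(rng.choice(len(space), size=8, replace=False))
costs = [measure(space[i]) for i in tried_idx]

for _ in range(20):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(space[tried_idx], costs)
    preds = model.predict(space)
    preds[tried_idx] = np.inf                   # do not re-evaluate known points
    nxt = int(np.argmin(preds)) if rng.random() > 0.1 else int(rng.integers(len(space)))
    if nxt in tried_idx:
        continue
    tried_idx.append(nxt)
    costs.append(measure(space[nxt]))

best = tried_idx[int(np.argmin(costs))]
print("best configuration found:", space[best], "cost", min(costs))
```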

    The Mont-Blanc prototype: an alternative approach for high-performance computing systems

    Get PDF
    High-performance computing (HPC) is recognized as one of the pillars for further advances in science, industry, medicine, and education. Current HPC systems are being developed to overcome emerging challenges in order to reach the exascale level of performance, which is expected by the year 2020. The much larger embedded and mobile market allows for rapid development of IP blocks and provides more flexibility in designing an application-specific SoC, in turn making it possible to balance performance, energy efficiency, and cost. In the Mont-Blanc project, we advocate that HPC systems be built from such commodity IP blocks, currently used in embedded and mobile SoCs. As a first demonstrator of this approach, we present the Mont-Blanc prototype: the first HPC system built with commodity SoCs, memories, and NICs from the embedded and mobile domain, combined with off-the-shelf HPC networking, storage, cooling, and integration solutions. We present the system's architecture and an evaluation covering both performance and energy efficiency, and we compare the system's abilities against a production-level supercomputer. Finally, we discuss parallel scalability and estimate the maximum scalability point of this approach across a set of HPC applications.
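
    The paper's own scalability analysis is not reproduced here. As a loose, hedged illustration of what estimating a maximum scalability point can look like, the sketch below uses an Amdahl-style model with an added per-node overhead term and finds where the modeled speedup peaks; the serial fraction and overhead values are invented, not Mont-Blanc measurements.

```python
# Hedged illustration only: one simple way to estimate where scaling stops
# paying off, using an Amdahl-like model with a per-node overhead term.
# The serial fraction and overhead below are invented, not Mont-Blanc data.
def speedup(nodes, serial_fraction=0.02, overhead_per_node=1e-4):
    return 1.0 / (serial_fraction + (1 - serial_fraction) / nodes
                  + overhead_per_node * nodes)

best = max(range(1, 4097), key=speedup)
print("speedup peaks around", best, "nodes (model estimate)")
```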