173 research outputs found

    Master of Science

    Get PDF
    thesisAt the beginning of the 21st century, it became apparent that the performance gains associated with continual die shrinks and the resulting increases in core central processing unit (CPU) speeds were beginning to flatten. This realization has gradually shifted the focus of CPU design away from single core speed increases and toward the idea of obtaining performance through increased concurrency. The resulting design paradigm has given us multi- and many-core CPUs, vector processing units, and more recently, programmable, massively parallel hardware coprocessors, such as graphics processing units from nVidia and Advanced Micro Devices, along with more recent general purpose devices such as Intel's "Knights Corner." One of the most significant resulting challenges in high-performance computing is to provide a framework in which the software development process is platform agnostic to its end users, while at the same time being capable of scaling efficiently on diverse hardware configurations. This thesis will present an improved approach for the analysis and scheduling of computational tasks within a heterogeneous hardware environment, while removing implementation details from end users. This will be presented within the context of the "Expressions" framework, a component within a computational fluid dynamics solver, known as "Wasatch," developed at the University of Utah

    Network Coding Parallelization Based on Matrix Operations for Multicore Architectures

    Get PDF

    Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

    Full text link

    Scheduling strategies for parallel patterns on heterogeneous architectures

    Get PDF
    To help shrink the programmability-performance efficiency gap, we discuss that adaptive runtime systems can be used to facilitate the management of heterogeneous architectures. A runtime system can provide a significant performance boost while reducing the energy consumption, because it is aware of processors’ architectures and application’s requirements. We analyse how applications map onto hardware by inspecting built-in processor counters, and therefore build models to describe the observed behaviour. In this thesis, we discuss how parallel patterns, such as parallel for loops and pipelines, can be decomposed and efficiently executed on heterogeneous plat- forms. We propose several scheduling strategies aiming at reducing execution time and energy consumption. We demonstrate how applications can be run faster by mapping the application level parallelism onto the hardware process- ing units that best fit the application requirements, and by selecting the right task size. First, we devise a load balancing technique, that targets heterogeneous CPU and multi-GPU architectures. It monitors the relative speed of each processing unit, and distributes the remaining workload based on these relative speeds. By making all processing units to finish at same time, we avoid unnecessary waits between processors. Along with this load balancing technique, we propose a performance-sensitive partitioner that adapts the amount of computation offloaded to the accelerator for better performance and utilisation. We also present an accurate performance model for streaming applications, such as face recognition or object tracking. This model targets pipelined applications, as a series of stages, and performs a scalability analysis of each stage by using coarse and medium grain parallelism. Additionally, it also considers executing the stage on the GPU or not. By applying the model, we always find the best pipeline configuration among all possible, and get substantial performance and energy savings. All experiments in this thesis have been performed by using state-of-the-art hardware accelerators and benchmarks of the field of HPC. Specifically, we use the Rodinia and SHOC benchmark suites, for the evaluation of the parallel for partitioner. Moreover, we use the the ViVid application, along with tracking and SRAD applications from Rodinia Benchmark Suite, all of them are good candidates of vision applications. Finally, we rely on Intel Threading Building Blocks, the core engine of our schedulers; the Intel OpenCL SDK and CUDA SDK to offload computations to the GPU accelerators and Intel PCM library to monitor energy consumption and cache memory metrics.During the last decade, power consumption and energy efficiency have become key aspects in processor design. Nowadays, the power consumption is the principal limitation for further scaling of chip multiprocessors design (CMPs). In general, the research community agrees that current chip multiprocessor technology trends will not scale performance without an increase of power budget. Hardware design innovations as the recent Heterogeneous Architectures and Near Threshold Computing are needed to cope with the performance-power barrier. As a result of this, there has been a shift away from chip multiprocessors to heterogeneous processor architectures. Recently, we have witnessed an explosion in the availability of this kind of architectures. Many hardware vendors have released a number of heterogeneous processors to overcome the aforementioned limitations. However, software also requires changes to allow further performance scaling on these architectures. With the advent of heterogeneous architectures, hardware manufactures have impose the burden of explicit accelerator management on software developers. In general, programmers are used to sequential programming, but writing high-performance programs for heterogeneous architectures is a complex task. Programming for this kind of platforms requires the understanding of new hardware concepts, orchestration of different parallelism levels, the explicit management of different memory spaces and synchronisations between processing units, and finally the usage of low-level programming models such as OpenCL or CUDA. Moreover, heterogeneous architectures suffer from performance portability, as one program can exhibit unequal performance on different devices

    Heterogeneity-aware scheduling and data partitioning for system performance acceleration

    Get PDF
    Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity. Significant effort has been invested in this problem, but either focuses on a specific type of heterogeneous systems or algorithm, or a high-level framework without insight into the difference in heterogeneity between different types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity. This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity from on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaborative based approach to co-design the task selector and core allocator on OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations and asymmetric physical connection, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a topological-ranking heuristic based schedule for a multi-FPGA based reconfigurable cluster. Experiments on both a full system simulator (GEM5) and real systems (Sunway Taihulight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on variety of workloads."This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement

    Task-based multifrontal QR solver for heterogeneous architectures

    Get PDF
    Afin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. Dans cette étude, nous explorons la conception de solveurs directes creux à base de tâches, qui représentent une charge de travail extrêmement irrégulière, avec des tâches de granularités et de caractéristiques différentes ainsi qu'une consommation mémoire variable, au-dessus d'un moteur d'exécution. Dans le cadre du solveur qr mumps, nous montrons dans un premier temps la viabilité et l'efficacité de notre approche avec l'implémentation d'une méthode multifrontale pour la factorisation de matrices creuses, en se basant sur le modèle de programmation parallèle appelé "flux de tâches séquentielles" (Sequential Task Flow). Cette approche, nous a ensuite permis de développer des fonctionnalités telles que l'intégration de noyaux dense de factorisation de type "minimisation de cAfin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. Dans cette étude, nous explorons la conception de solveurs directes creux à base de tâches, qui représentent une charge de travail extrêmement irrégulière, avec des tâches de granularités et de caractéristiques différentes ainsi qu'une consommation mémoire variable, au-dessus d'un moteur d'exécution. Dans le cadre du solveur qr mumps, nous montrons dans un premier temps la viabilité et l'efficacité de notre approche avec l'implémentation d'une méthode multifrontale pour la factorisation de matrices creuses, en se basant sur le modèle de programmation parallèle appelé "flux de tâches séquentielles" (Sequential Task Flow). Cette approche, nous a ensuite permis de développer des fonctionnalités telles que l'intégration de noyaux dense de factorisation de type "minimisation de cAfin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. !!br0ken!!ommunications" (Communication Avoiding) dans la méthode multifrontale, permettant d'améliorer considérablement la scalabilité du solveur par rapport a l'approche original utilisée dans qr mumps. Nous introduisons également un algorithme d'ordonnancement sous contraintes mémoire au sein de notre solveur, exploitable dans le cas des architectures multicoeur, réduisant largement la consommation mémoire de la méthode multifrontale QR avec un impacte négligeable sur les performances. En utilisant le modèle présenté ci-dessus, nous visons ensuite l'exploitation des architectures hétérogènes pour lesquelles la granularité des tâches ainsi les stratégies l'ordonnancement sont cruciales pour profiter de la puissance de ces architectures. Nous proposons, dans le cadre de la méthode multifrontale, un partitionnement hiérarchique des données ainsi qu'un algorithme d'ordonnancement capable d'exploiter l'hétérogénéité des ressources. Enfin, nous présentons une étude sur la reproductibilité de l'exécution parallèle de notre problème et nous montrons également l'utilisation d'un modèle de programmation alternatif pour l'implémentation de la méthode multifrontale. L'ensemble des résultats expérimentaux présentés dans cette étude sont évalués avec une analyse détaillée des performance que nous proposons au début de cette étude. Cette analyse de performance permet de mesurer l'impacte de plusieurs effets identifiés sur la scalabilité et la performance de nos algorithmes et nous aide ainsi à comprendre pleinement les résultats obtenu lors des tests effectués avec notre solveur.To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. In this study we investigate the design of task-based sparse direct solvers which constitute extremely irregular workloads, with tasks of different granularities and characteristics with variable memory consumption on top of runtime systems. In the context of the qr mumps solver, we prove the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method allowing for better scalability compared to the original approach used in qr mumps. In addition we introduced a memory-aware algorithm to control the memory behaviour of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint for the multifrontal QR factorization with a small impact on performance. Following this approach, we move to heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources. Finally we present a study on the reproducibility of executions and the use of alternative programming models for the implementation of the multifrontal method. All the experimental results presented in this study are evaluated with a detailed performance analysis measuring the impact of several identified effects on the performance and scalability. Thanks to this original analysis, presented in the first part of this study, we are capable of fully understanding the results obtained with our solver

    Solveur multifrontal QR à base de tâches pour architectures hétérogènes

    Get PDF
    To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. In this study we investigate the design of task-based sparse direct solvers which constitute extremely irregular workloads, with tasks of different granularities and characteristics with variable memory consumption on top of runtime systems. In the context of the qr mumps solver, we prove the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method allowing for better scalability compared to the original approach used in qr mumps. In addition we introduced a memory-aware algorithm to control the memory behaviour of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint for the multifrontal QR factorization with a small impact on performance. Following this approach, we move to heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources. Finally we present a study on the reproducibility of executions and the use of alternative programming models for the implementation of the multifrontal method. All the experimental results presented in this study are evaluated with a detailed performance analysis measuring the impact of several identified effects on the performance and scalability. Thanks to this original analysis, presented in the first part of this study, we are capable of fully understanding the results obtained with our solver.Afin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. Dans cette étude, nous explorons la conception de solveurs directes creux à base de tâches, qui représentent une charge de travail extrêmement irrégulière, avec des tâches de granularités et de caractéristiques différentes ainsi qu'une consommation mémoire variable, au-dessus d'un moteur d'exécution. Dans le cadre du solveur qr mumps, nous montrons dans un premier temps la viabilité et l'efficacité de notre approche avec l'implémentation d'une méthode multifrontale pour la factorisation de matrices creuses, en se basant sur le modèle de programmation parallèle appelé "flux de tâches séquentielles" (Sequential Task Flow). Cette approche, nous a ensuite permis de développer des fonctionnalités telles que l'intégration de noyaux dense de factorisation de type "minimisation de cAfin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. Dans cette étude, nous explorons la conception de solveurs directes creux à base de tâches, qui représentent une charge de travail extrêmement irrégulière, avec des tâches de granularités et de caractéristiques différentes ainsi qu'une consommation mémoire variable, au-dessus d'un moteur d'exécution. Dans le cadre du solveur qr mumps, nous montrons dans un premier temps la viabilité et l'efficacité de notre approche avec l'implémentation d'une méthode multifrontale pour la factorisation de matrices creuses, en se basant sur le modèle de programmation parallèle appelé "flux de tâches séquentielles" (Sequential Task Flow). Cette approche, nous a ensuite permis de développer des fonctionnalités telles que l'intégration de noyaux dense de factorisation de type "minimisation de cAfin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application
    • …
    corecore