58 research outputs found

    GPRM: a high performance programming framework for manycore processors

    Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks and their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, simply by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called "Global Sharing", which improves performance in multiprogramming situations.
We use OpenMP, the most popular model for shared-memory parallel programming, as the main competitor to GPRM for solving three well-known problems on both platforms: LU factorisation of sparse matrices, image convolution, and linked-list processing. We focus on proposing solutions that best fit GPRM's model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for LU factorisation yields a notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM's task creation and distribution for very short computations using the image convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for linked-list processing and performs better than the OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.
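The abstract's central tuning knob, combining many short tasks into fewer, larger ones to amortise task-creation overhead, can be sketched in plain Python. This is an illustrative toy, not GPRM itself: `process`, the chunk size, and the thread-pool backend are all assumptions standing in for the real per-pixel work and runtime.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical short work item standing in for e.g. one convolution output.
def process(item):
    return item * item

def run_fine_grained(items, workers=4):
    # One task per item: task creation/distribution overhead dominates
    # when each computation is very short.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, items))

def run_coarse_grained(items, workers=4, chunk=1024):
    # Combine many small tasks into larger ones, as the thesis suggests:
    # each submitted task now covers `chunk` items.
    chunks = [items[i:i + chunk] for i in range(0, len(items), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: [process(x) for x in c], chunks)
    return [y for part in results for y in part]
```

Both variants compute the same result; only the number of scheduled tasks, and hence the scheduling overhead, differs.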

    Solving dense linear systems on architectures composed of multicore processors and accelerators

    In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems on hybrid architectures combining multicore processors and accelerators. We focus on methods based on the LU factorization, and our code development takes place in the context of the MAGMA library. We study different hybrid CPU/GPU solvers based on the LU factorization which aim at reducing the communication overhead due to pivoting. The first is based on a communication-avoiding pivoting strategy (CALU), while the second uses a random preconditioning of the original system to avoid pivoting altogether (RBT). We show that both of these methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPU architectures. We also present new randomization-based solvers for hybrid architectures with Nvidia GPUs or Intel Xeon Phi coprocessors. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allows us to perform the randomization of the linear system at a very low computational cost compared to the cost of the factorization. Finally, we investigate the impact of non-uniform memory access (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We show how these placements can also improve performance when applied to hybrid multicore/GPU solvers.
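The trade-off the thesis exploits, LU without pivoting is cheap and communication-free but can break on a zero pivot, can be seen in a few lines. This is a toy Doolittle factorization in pure Python; the actual RBT preconditioning with recursive butterfly matrices is not reproduced here, its role is simply to make the zero-pivot failure mode unlikely so that the following routine can be applied safely.

```python
def lu_no_pivot(A):
    """Doolittle LU factorization without pivoting: A = L * U.

    Divides by (near-)zero when a bad pivot appears, which is exactly
    the failure that pivoting normally prevents and that RBT-style
    randomization makes unlikely instead.
    """
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]  # unit lower
    U = [row[:] for row in A]                                   # upper (in place)
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]          # no row exchange, hence no communication
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U
```

Because no rows are exchanged, there is no pivot search and no row swap to communicate, which is the source of the speedup on hybrid CPU/GPU architectures.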

    Extensions of Task-based Runtime for High Performance Dense Linear Algebra Applications

    On the road to exascale computing, the gap between hardware peak performance and application performance is increasing as the system scale, chip density and inherent complexity of modern supercomputers expand. Even if we put aside the difficulty of expressing algorithmic parallelism and of efficiently executing applications at large scale, other open questions remain. The ever-growing scale of modern supercomputers induces a fast decline of the Mean Time To Failure. A generic, low-overhead, resilient extension therefore becomes a desirable capability for any programming paradigm. This dissertation addresses these two critical issues: designing an efficient unified linear algebra development environment using a task-based runtime, and extending a task-based runtime with fault-tolerant capabilities to build a generic framework providing both soft- and hard-error resilience to the task-based programming paradigm. To bridge the gap between hardware peak performance and application performance, a unified programming model is designed to take advantage of a lightweight task-based runtime to manage the resource-specific workload, and to control the data flow and parallel execution of tasks. Under this unified development, linear algebra tasks are abstracted across different underlying heterogeneous resources, including multicore CPUs, GPUs and Intel Xeon Phi coprocessors. Performance portability is guaranteed, and this programming model is adapted to a wide range of accelerators, supporting both shared- and distributed-memory environments. To address the resilience challenges on large-scale systems, fault tolerance mechanisms are designed for a task-based runtime to protect applications against both soft and hard errors. For soft errors, three additions to a task-based runtime are explored.
The first recovers the application by re-executing the minimum number of tasks; the second logs intermediary data between tasks to minimize the necessary re-execution; the last takes advantage of algorithmic properties to recover the data without re-execution. For hard errors, we propose two generic approaches, both of which augment the data-logging mechanism used for soft errors. The first utilizes a non-volatile storage device to save the logged data, while the second saves locally logged data on a remote node to protect against node failure. Experimental results confirm that our soft- and hard-error fault tolerance mechanisms exhibit the expected correctness and efficiency.
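The second soft-error approach, logging intermediary data between tasks so that only the failed task needs re-execution, can be sketched as follows. All names here (`run_with_log`, the task/dependency dictionaries) are hypothetical illustrations, not the dissertation's runtime API.

```python
# Toy sketch of the "log intermediary data between tasks" idea:
# every task's output is logged; after a soft error corrupts one result,
# only that task is re-executed, while its predecessors' results are
# replayed from the log instead of being recomputed.

def run_with_log(tasks, deps, log=None):
    """tasks: name -> function(inputs dict); deps: name -> list of deps.
    Task names are assumed to be listed in topological order."""
    log = {} if log is None else log
    for name in tasks:
        if name in log:                      # result already logged: replay it
            continue
        inputs = {d: log[d] for d in deps[name]}
        log[name] = tasks[name](inputs)      # execute and log the output
    return log
```

After a full run, deleting one entry from the log (simulating a corrupted result) and calling `run_with_log` again re-executes only that task.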

    A fine-grained programming model for parallelizing sparse linear solvers

    Solving large sparse linear systems is an essential part of numerical simulations. These resolutions can take up to 80% of the total simulation time. An efficient parallelization of sparse linear kernels therefore leads to better performance. In distributed memory, the parallelization of these kernels is often done by changing the numerical scheme. In shared memory, by contrast, a more efficient form of parallelism can be used. It is thus necessary to use two levels of parallelism: a first one between the nodes of a cluster and a second inside each node. When using iterative methods in shared memory, task-based programming makes it possible to describe the parallelism naturally, using the work on one row of the matrix as the granularity of a task. Unfortunately, this granularity is too fine and does not yield good performance, because of the overhead of the task scheduler. In this thesis, we study the granularity problem of task-based parallelization. We propose to increase the grain size of computational tasks by creating aggregates of tasks, which themselves become tasks. The new, coarser task graph is composed of these aggregates and of the new dependencies between aggregates. A task scheduler then schedules this new graph to obtain better performance. We use the incomplete LU factorization of a sparse matrix as an example and show the improvements brought by this method. We then focus on NUMA architectures. When a memory-bandwidth-bound algorithm runs on such an architecture, it is worthwhile to reduce NUMA effects by placing the data oneself. We show how to take these effects into account in a task-based runtime in order to improve the performance of a parallel program.
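The graph-coarsening step described above, aggregates become tasks, and dependencies between rows become dependencies between aggregates, can be sketched generically. This is an illustrative helper under assumed data structures (dicts of predecessor sets), not the thesis's actual runtime.

```python
def coarsen(deps, groups):
    """Aggregate a fine-grained task DAG into a coarser one.

    deps:   task -> set of predecessor tasks (the fine graph)
    groups: task -> aggregate id (the chosen partition)
    Returns aggregate -> set of predecessor aggregates; dependencies
    between two tasks of the same aggregate disappear (no self-edges).
    """
    coarse = {}
    for task, preds in deps.items():
        g = groups[task]
        coarse.setdefault(g, set())
        for p in preds:
            if groups[p] != g:          # intra-aggregate edges are absorbed
                coarse[g].add(groups[p])
    return coarse
```

For example, four row-tasks chained r0 -> r1 -> r2 -> r3, grouped pairwise into aggregates A and B, coarsen to the two-node graph A -> B, which is what the scheduler then sees.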

    Productive and efficient computational science through domain-specific abstractions

    In an ideal world, scientific applications are computationally efficient, maintainable and composable and allow scientists to work very productively. We argue that these goals are achievable for a specific application field by choosing suitable domain-specific abstractions that encapsulate domain knowledge with a high degree of expressiveness. This thesis demonstrates the design and composition of domain-specific abstractions by abstracting the stages a scientist goes through in formulating a problem of numerically solving a partial differential equation. Domain knowledge is used to transform this problem into a different, lower level representation and decompose it into parts which can be solved using existing tools. A system for the portable solution of partial differential equations using the finite element method on unstructured meshes is formulated, in which contributions from different scientific communities are composed to solve sophisticated problems. The concrete implementations of these domain-specific abstractions are Firedrake and PyOP2. Firedrake allows scientists to describe variational forms and discretisations for linear and non-linear finite element problems symbolically, in a notation very close to their mathematical models. PyOP2 abstracts the performance-portable parallel execution of local computations over the mesh on a range of hardware architectures, targeting multi-core CPUs, GPUs and accelerators. Thereby, a separation of concerns is achieved, in which Firedrake encapsulates domain knowledge about the finite element method separately from its efficient parallel execution in PyOP2, which in turn is completely agnostic to the higher abstraction layer. As a consequence of the composability of those abstractions, optimised implementations for different hardware architectures can be automatically generated without any changes to a single high-level source. Performance matches or exceeds what is realistically attainable by hand-written code. 
Firedrake and PyOP2 are combined to form a tool chain that is demonstrated to be competitive with, or faster than, available alternatives on a wide range of finite element problems.
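The separation of concerns described above, a higher layer supplies a local kernel while a lower layer decides how to execute it over the mesh, can be illustrated with a minimal sketch. The names here (`par_loop`, `cell_sum`) are hypothetical stand-ins and not the real PyOP2 API; a real execution layer would generate optimised CPU or GPU code rather than map over a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def par_loop(kernel, cells, data):
    """Execution layer: apply `kernel` to each cell's local data.

    The kernel is completely agnostic to how and where it runs; this
    layer is free to swap in any backend (threads here, GPUs in PyOP2).
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda cell: kernel([data[v] for v in cell]), cells))

# Higher layer: a "local assembly" kernel expressed independently of its
# parallel execution -- e.g. summing the values at a cell's vertices.
cell_sum = lambda vertex_values: sum(vertex_values)
```

With two triangular cells `[(0, 1, 2), (1, 2, 3)]` over vertex data `[1.0, 2.0, 3.0, 4.0]`, `par_loop(cell_sum, ...)` gathers each cell's vertex values and reduces them locally, one task per cell.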

    Lattice Quantum Chromodynamics on Intel Xeon Phi based supercomputers

    Preface. The aim of this master's thesis project was to expand the QPhiX library for twisted-mass fermions with and without clover term. To this end, I continued work initiated by Mario Schröck et al. [63]. In writing this thesis, I was following two main goals. Firstly, I wanted to stress the intricate interplay of the four pillars of High Performance Computing: algorithms, hardware, software and performance evaluation. Surely, algorithmic development is utterly important in Scientific Computing, in particular in LQCD, where it has even outweighed the improvements made in hardware architecture in the last decade (cf. the section about the computational costs of LQCD). It is strongly influenced by the available hardware (think of the advent of parallel algorithms), but in turn has also influenced the design of hardware itself; the IBM BlueGene series is only one of many examples in LQCD. Furthermore, there will be no benefit from the best algorithms when one cannot implement the ideas in correct, performant, user-friendly, readable and maintainable (sometimes over several decades) software code. But again, truly outstanding HPC software cannot be written without a profound knowledge of its target hardware. Lastly, an HPC software architect and computational scientist has to be able to evaluate and benchmark the performance of a software program in the often very heterogeneous environment of supercomputers with multiple software and hardware layers. My second goal in writing this thesis was to produce a self-contained introduction to the computational aspects of LQCD and, in particular, to the features of QPhiX, so the reader would be able to compile, read and understand the code of one truly amazing pearl of HPC [40]. It is a pleasure to thank S. Cozzini, R. Frezzotti, E. Gregory, B. Joó, B. Kostrzewa, S. Krieg, T. Luu, G. Martinelli, R. Percacci, S. Simula, M. Ueding, C. Urbach, M. Werner, the Intel company for providing me with a copy of [55], and the Jülich Supercomputing Center for granting me access to their KNL test cluster DEE
