4 research outputs found

    Optimisation de l'utilisation du cache dans EUROPLEXUS

    In this paper we propose a new organization of the data structure of EUROPLEXUS, a simulation code developed by the CEA for the fast dynamics of fluids and structures. The organization is built so that the data accessed by a processor working on one portion of the calculation during a time step Ti are as contiguous as possible, so that they fit in that processor's cache. This layout helps minimize the number of cache misses relative to the current organization of the data structure. Performance studies have validated the gain achieved with the new organization on large-scale problems.
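
The abstract does not describe the actual EUROPLEXUS layout, so the following is only a minimal sketch of one common way to make the data touched together contiguous: renumbering mesh nodes in the order the elements first use them. The toy mesh, the function names, and the first-touch ordering are all assumptions made for illustration, not the authors' method.

```python
def first_touch_renumbering(elements):
    """Map old node ids to new ids in the order the elements first use them."""
    new_id = {}
    for element in elements:
        for node in element:
            if node not in new_id:
                new_id[node] = len(new_id)
    return new_id


def reorder_node_data(node_data, new_id):
    """Return the node data as a list stored in the new (first-touch) order."""
    reordered = [None] * len(new_id)
    for old, new in new_id.items():
        reordered[new] = node_data[old]
    return reordered


# Hypothetical toy mesh: each element lists the node ids it reads.
elements = [(7, 2, 9), (9, 2, 4), (4, 0, 7)]
node_data = {0: "n0", 2: "n2", 4: "n4", 7: "n7", 9: "n9"}

new_id = first_touch_renumbering(elements)
reordered = reorder_node_data(node_data, new_id)
# Elements rewritten with the new ids: consecutive elements now read
# node data from neighbouring positions of `reordered`.
renumbered_elements = [tuple(new_id[n] for n in e) for e in elements]
```

After renumbering, the first element reads positions 0, 1, 2 of the reordered array instead of the scattered ids 7, 2, 9, which is the kind of locality the abstract is aiming for.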

    Communication-optimal Parallel and Sequential Cholesky Decomposition

    Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.
    Comment: 29 pages, 2 tables, 6 figures
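
The abstract does not say which implementations attain the lower bounds, so the following is only a hedged sketch of a textbook blocked (right-looking) Cholesky factorization, the general family in which working on b-by-b blocks increases data reuse per word moved. The function name `cholesky_blocked` and the block size `b` are assumptions for illustration. Whether such a scheme is communication-optimal depends on the block size and the storage layout, which this sketch does not address.

```python
import math


def cholesky_blocked(a, b):
    """Return lower-triangular L with L * L^T = a, one block column at a time.

    `a` is a symmetric positive definite matrix given as a list of lists;
    `b` is the (hypothetical) block size.
    """
    n = len(a)
    l = [[float(x) for x in row] for row in a]  # work on a copy
    for k in range(0, n, b):
        ke = min(k + b, n)
        # 1. Factor the diagonal block with unblocked Cholesky.
        for j in range(k, ke):
            l[j][j] = math.sqrt(l[j][j] - sum(l[j][p] ** 2 for p in range(k, j)))
            for i in range(j + 1, ke):
                s = sum(l[i][p] * l[j][p] for p in range(k, j))
                l[i][j] = (l[i][j] - s) / l[j][j]
        # 2. Triangular solve for the panel below the diagonal block.
        for i in range(ke, n):
            for j in range(k, ke):
                s = sum(l[i][p] * l[j][p] for p in range(k, j))
                l[i][j] = (l[i][j] - s) / l[j][j]
        # 3. Symmetric update of the trailing lower triangle.
        for i in range(ke, n):
            for j in range(ke, i + 1):
                l[i][j] -= sum(l[i][p] * l[j][p] for p in range(k, ke))
    # Clear the strictly upper triangle, which was never updated.
    for i in range(n):
        for j in range(i + 1, n):
            l[i][j] = 0.0
    return l


# Small symmetric positive definite example.
a = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
factor = cholesky_blocked(a, 2)
```

The three numbered steps correspond to the usual diagonal factorization, panel solve, and trailing update; the result does not depend on the block size, only the order in which entries are touched does.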

    Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves
