
    Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

    Get PDF
    We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared-memory architectures. In contrast to the classical cubic algorithms of parallel numerical linear algebra, we focus here on recursive algorithms and coarse-grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms, which makes coarse-grain block algorithms perform more efficiently than fine-grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse-grain block decomposition. We incrementally build efficient kernels, first for matrix multiplication, then for triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants, either iterative or recursive, and different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive-iterative methods for triangular system solving, and tile recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32-core NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms, and that exact dense linear algebra matches the performance of full-rank reference numerical software even in the presence of rank deficiencies.
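
    As a rough illustration of why coarse-grain blocking pays off over a finite field, the sketch below accumulates each block product in ordinary floating-point arithmetic and applies the modular reduction only once per output block. It is a minimal Python/NumPy sketch, valid only when k·(p−1)² fits in the float64 mantissa; the function name, block size and strategy are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def matmul_mod_p(A, B, p, block=256):
    """Blocked matrix product over Z/pZ with delayed modular reduction.

    Products are accumulated in ordinary float64 arithmetic and reduced
    modulo p only once per output block, which is exact as long as
    k * (p - 1)**2 < 2**53 (k = inner dimension, 53 = float64 mantissa bits).
    A toy illustration of why coarse-grain blocking amortizes the cost of
    modular reductions, not a real finite-field library kernel.
    """
    A = (np.asarray(A) % p).astype(np.float64)
    B = (np.asarray(B) % p).astype(np.float64)
    n, k = A.shape
    _, m = B.shape
    assert k * (p - 1) ** 2 < 2 ** 53, "delayed reduction would lose precision"
    C = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            acc = np.zeros((min(block, n - i), min(block, m - j)))
            for l in range(0, k, block):
                acc += A[i:i+block, l:l+block] @ B[l:l+block, j:j+block]
            C[i:i+block, j:j+block] = acc % p  # single reduction per block
    return C
```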

    Symmetric indefinite triangular factorization revealing the rank profile matrix

    Get PDF
    We present a novel recursive algorithm for reducing a symmetric matrix to a triangular factorization that reveals the rank profile matrix. That is, the algorithm computes a factorization P^T A P = L D L^T, where P is a permutation matrix, L is lower triangular with a unit diagonal, and D is symmetric block diagonal with 1×1 and 2×2 antidiagonal blocks. The novel algorithm requires O(n^2 r^(ω−2)) arithmetic operations. Furthermore, experimental results demonstrate that our algorithm can be slightly more than twice as fast as state-of-the-art unsymmetric Gaussian elimination in most cases; since the symmetric factorization performs about half the arithmetic work, this means it achieves approximately the same computational speed. By adapting the pivoting strategy developed in the unsymmetric case, we show how to recover the rank profile matrix from the permutation matrix and the support of the block-diagonal matrix. There is an obstruction in characteristic 2 to revealing the rank profile matrix, which requires relaxing the shape of the block diagonal by allowing the 2-dimensional blocks to have a nonzero bottom-right coefficient. This relaxed decomposition can then be transformed into a standard P L D L^T P^T decomposition at a negligible cost.
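
    For readers unfamiliar with the factorization shape, the following minimal sketch computes an unpivoted LDL^T factorization over Z/pZ with 1×1 pivots only. It assumes every leading principal minor is nonzero and deliberately omits the symmetric pivoting, the 2×2 antidiagonal blocks and the rank-deficiency handling that are the actual contributions of the paper; the names are illustrative.

```python
def ldlt_mod_p(A, p):
    """Unpivoted LDL^T factorization of a symmetric matrix over Z/pZ.

    Returns (L, D) with A = L * diag(D) * L^T (mod p), assuming every
    leading principal minor is nonzero.  This only shows the shape of the
    factorization; the paper's algorithm additionally pivots symmetrically,
    handles rank deficiency and characteristic 2, and costs O(n^2 r^(omega-2)).
    """
    n = len(A)
    A = [[a % p for a in row] for row in A]
    L = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    D = [0] * n
    for k in range(n):
        D[k] = A[k][k]
        inv_d = pow(D[k], -1, p)           # pivot assumed nonzero
        for i in range(k + 1, n):
            L[i][k] = A[i][k] * inv_d % p
        for i in range(k + 1, n):          # symmetric rank-1 update of trailing block
            for j in range(k + 1, n):
                A[i][j] = (A[i][j] - L[i][k] * D[k] * L[j][k]) % p
    return L, D
```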

    Exascale Ready Work-Optimal Matrix Inversion

    Get PDF
    In this diploma thesis I develop a new algorithm, OPT, for matrix inversion. I prove properties of OPT's parallel running time and work. OPT combines Strassen's inversion algorithm with Newton approximation and is built on top of a matrix multiplication subroutine. OPT is work-optimal, i.e., it requires at most a constant factor more work than any other (work-optimal) algorithm. Moreover, OPT needs only polylogarithmic time on at most O(n^3) processors, where the processor count is determined by the multiplication routine. It thus combines these two advantages of Strassen's algorithm and of Newton approximation. I prove a new bound on the numerical stability of Strassen's algorithm combined with Newton approximation. As part of the thesis I implemented OPT, Strassen's inversion algorithm and Newton approximation, together with a matrix container class, in a flexible test program. I describe the design of the implementation as well as the use of, and difficulties with, BLAS for the matrix multiplication routine. In the experimental part I compare the running time and numerical stability of OPT with the corresponding routine from the Intel Math Kernel Library (MKL). The constant factors in OPT's work turn out to be no more than twice those of the MKL routine. As predicted, OPT scales very well: even on a computer with only eight cores it is already clearly faster than the MKL routine. Regarding numerical stability, OPT and Strassen's algorithm do not live up to the latter's bad reputation; instead, they produce results comparable to those of the MKL routine. I discover an unexpected instability of Newton approximation that makes it produce worse results than all other algorithms in the implementation, and I present some further experiments on this instability.
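
    The Newton approximation referred to above is the classical Newton-Schulz iteration, sketched below with the standard starting guess X₀ = Aᵀ/(‖A‖₁‖A‖∞). This is a minimal sketch of one ingredient of the thesis's OPT algorithm, not OPT itself; tolerance and iteration limits are illustrative.

```python
import numpy as np

def newton_inverse(A, tol=1e-12, max_iter=100):
    """Newton-Schulz iteration for the inverse of a nonsingular matrix A.

    X_{k+1} = X_k (2I - A X_k), started from X_0 = A^T / (||A||_1 ||A||_inf),
    converges quadratically once ||I - A X_k|| < 1.  Each step is two matrix
    products, so any fast (sub-cubic) multiplication routine speeds it up.
    """
    n = A.shape[0]
    I = np.eye(n)
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(max_iter):
        R = I - A @ X                      # residual of the current iterate
        if np.linalg.norm(R, "fro") < tol:
            break
        X = X + X @ R                      # equals X (2I - A X)
    return X
```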

    A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

    Full text link
    As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the need for loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU and QR factorizations in which the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks that completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and with vendor implementations.
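
    The tile formulation can be made concrete with a small sketch: a right-looking tiled Cholesky factorization whose loop bodies are the tile tasks (POTRF, TRSM, SYRK/GEMM). The sketch below runs the tasks sequentially in loop order; the papers' point is to schedule them dynamically as soon as their input tiles are ready. The assumption that n is a multiple of the tile size, and all names, are illustrative.

```python
import numpy as np

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky factorization A = L L^T (L lower).

    The matrix is viewed as square nb x nb tiles and the factorization is a
    sequence of small tile tasks (POTRF, TRSM, SYRK/GEMM).  Dynamically
    scheduling these tasks on their data dependencies is what hides the
    intrinsically sequential POTRF tasks.  Assumes n is a multiple of nb.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    assert n % nb == 0
    t = n // nb
    T = [[A[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy() for j in range(t)] for i in range(t)]
    for k in range(t):
        T[k][k] = np.linalg.cholesky(T[k][k])             # POTRF on the diagonal tile
        for i in range(k + 1, t):                         # TRSM on the panel tiles
            T[i][k] = np.linalg.solve(T[k][k], T[i][k].T).T
        for i in range(k + 1, t):                         # SYRK/GEMM trailing update
            for j in range(k + 1, i + 1):
                T[i][j] -= T[i][k] @ T[j][k].T
    L = np.zeros_like(A)
    for i in range(t):
        for j in range(i + 1):
            L[i*nb:(i+1)*nb, j*nb:(j+1)*nb] = T[i][j]
    return L
```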

    Parallel methods for linear systems solution in extreme learning machines: an overview

    Get PDF
    This paper aims to present an updated review of parallel algorithms for solving square and rectangular, single- and double-precision matrix linear systems using multi-core central processing units and graphics processing units. It briefly describes methods for the solution of linear systems based on matrix operations, factorizations and iterations. The methodology of this article is a documentary review, based on about 17 papers reported in the literature during the last five years (2016-2020). The findings demonstrate the potential of parallelism to significantly decrease extreme learning machine training times for problems with large amounts of data, since training requires the computation of the Moore-Penrose pseudoinverse. Implementing parallel algorithms for the pseudoinverse can therefore contribute significantly to applications in diverse areas, as it accelerates the training of extreme learning machines while preserving optimal results.
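
    To make the role of the pseudoinverse concrete, here is a minimal sketch of extreme learning machine training in which the only expensive step is the linear solve for the output weights. All names (W, b, H, beta), the hidden-layer size and the ridge term are illustrative assumptions, not taken from any specific reviewed paper.

```python
import numpy as np

def elm_train(X, T, hidden=256, reg=1e-6, seed=0):
    """Single-hidden-layer ELM training via a Moore-Penrose-style solve.

    The input weights W and biases b are random and stay fixed; training
    reduces to one linear least-squares solve for the output weights beta,
    beta ~ H^+ T, which is the step the reviewed parallel methods accelerate.
    Here the ridge-regularized normal equations stand in for the pseudoinverse.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, hidden))
    b = rng.standard_normal(hidden)
    H = np.tanh(X @ W + b)                                   # hidden-layer activations
    beta = np.linalg.solve(H.T @ H + reg * np.eye(hidden), H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```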

    Reconstructing Rational Functions with FireFly

    Full text link
    We present the open-source C++ library FireFly for the reconstruction of multivariate rational functions over finite fields. We discuss the algorithms involved and their implementation. As an application, we use FireFly in the context of integration-by-parts reductions and compare runtime and memory consumption with a fully algebraic approach using the program Kira. Comment: 46 pages, 3 figures, 6 tables; v2: matches published version.
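
    A flavour of black-box reconstruction over a finite field is given by its simplest building block: recovering a univariate polynomial from probes modulo a prime via Newton interpolation. The sketch below is illustrative only and does not reflect the actual algorithms FireFly uses for rational functions and multiple variables.

```python
def newton_interpolate_mod_p(black_box, xs, p):
    """Univariate Newton interpolation of a polynomial over Z/pZ.

    Recovers the coefficient list (lowest degree first) of a polynomial f
    from its values black_box(x) at the distinct points xs, working modulo
    the prime p throughout.
    """
    ys = [black_box(x) % p for x in xs]
    n = len(xs)
    dd = list(ys)                                  # divided differences modulo p
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            dd[i] = (dd[i] - dd[i - 1]) * pow(xs[i] - xs[i - j], -1, p) % p
    coeffs = [0] * n                               # expand Newton form into monomials
    for i in reversed(range(n)):
        new = [0] * n                              # new = coeffs * (x - xs[i]) + dd[i]
        for k in range(n - 1):
            new[k + 1] = (new[k + 1] + coeffs[k]) % p
            new[k] = (new[k] - coeffs[k] * xs[i]) % p
        new[0] = (new[0] + dd[i]) % p
        coeffs = new
    return coeffs

# example: recover f(x) = 3x^2 + 5x + 7 modulo the prime 2^31 - 1
p = 2**31 - 1
f = lambda x: (3 * x * x + 5 * x + 7) % p
print(newton_interpolate_mod_p(f, [0, 1, 2], p))   # [7, 5, 3]
```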

    BCYCLIC: A parallel block tridiagonal matrix cyclic solver

    Get PDF
    13 pages, 6 figures. A block tridiagonal matrix is factored with minimal fill-in using a cyclic reduction algorithm that is easily parallelized. Storage of the factored blocks allows the application of the inverse to multiple right-hand sides which may not be known at factorization time. Scalability with the number of block rows is achieved with cyclic reduction, while scalability with the block size is achieved using multithreaded routines (OpenMP, GotoBLAS) for block matrix manipulation. This dual scalability is a noteworthy feature of this new solver, as is its ability to efficiently handle arbitrary (non-power-of-2) numbers of block rows and processors. A comparison with a state-of-the-art parallel sparse solver is presented. It is expected that this new solver will allow many physical applications to make optimal use of the parallel resources on current supercomputers. Example usage of the solver in magnetohydrodynamic (MHD) three-dimensional equilibrium solvers for high-temperature fusion plasmas is cited. This research has been sponsored by the US Department of Energy under Contract DE-AC05-00OR22725 with UT-Battelle, LLC. This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.
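
    The cyclic reduction idea is easiest to see in the scalar case: each level eliminates the odd-indexed unknowns of a tridiagonal system, leaving an independent half-size tridiagonal system; those eliminations are mutually independent, which is what a block solver like BCYCLIC parallelizes, with dense blocks in place of the scalars below. A minimal sketch assuming the pivots stay nonzero (e.g., a diagonally dominant system), not the BCYCLIC implementation.

```python
import numpy as np

def cyclic_reduction_solve(a, b, c, d):
    """Solve a tridiagonal system by cyclic reduction (scalar sketch).

    a, b, c are the sub-, main- and super-diagonals (a[0] = c[-1] = 0) and
    d is the right-hand side.  Each level eliminates the odd-indexed
    unknowns and recurses on the remaining even-indexed ones.
    """
    a, b, c, d = (np.array(v, dtype=float) for v in (a, b, c, d))
    n = len(b)
    if n == 1:
        return d / b
    ev = np.arange(0, n, 2)                     # indices of kept (even) unknowns
    a2, b2, c2, d2 = a[ev].copy(), b[ev].copy(), c[ev].copy(), d[ev].copy()
    for j, i in enumerate(ev):
        if i - 1 >= 0:                          # eliminate x[i-1] via equation i-1
            m = a[i] / b[i - 1]
            a2[j] = -m * a[i - 1]
            b2[j] -= m * c[i - 1]
            d2[j] -= m * d[i - 1]
        if i + 1 < n:                           # eliminate x[i+1] via equation i+1
            m = c[i] / b[i + 1]
            c2[j] = -m * c[i + 1]
            b2[j] -= m * a[i + 1]
            d2[j] -= m * d[i + 1]
    x = np.zeros(n)
    x[ev] = cyclic_reduction_solve(a2, b2, c2, d2)   # recurse on the half-size system
    for i in range(1, n, 2):                    # back-substitute the odd unknowns
        x[i] = (d[i] - a[i] * x[i - 1] - (c[i] * x[i + 1] if i + 1 < n else 0.0)) / b[i]
    return x
```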

    A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization

    Full text link
    We present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by matrices of low numerical rank. Here, we use Hierarchically Semi-Separable (HSS) representations. Such matrices appear in many applications, e.g., finite element methods and boundary element methods. Exploiting this structure allows for fast solution of linear systems and fast computation of matrix-vector products, which are two main building blocks of matrix computations. The compression algorithm that we use, which computes the HSS form of an input dense matrix, relies on randomized sampling with a novel adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the parallelization of the structured matrix-vector product, structured factorization and solution routines. The efficiency of the approach is demonstrated on large problems from different academic and industrial applications, on up to 8,000 cores. This work is part of a more global effort, the STRUMPACK (STRUctured Matrices PACKage) software package for computations with sparse and dense structured matrices. Hence, although useful in their own right, the routines also represent a step toward a distributed-memory sparse solver.
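
    The randomized sampling primitive behind such compression can be sketched compactly: probe the numerically low-rank block with a random matrix, orthonormalize the samples, and re-express the block in that basis. The adaptive choice of the sample count and the HSS tree bookkeeping, which are the paper's contributions, are omitted here; the names and the oversampling parameter are illustrative.

```python
import numpy as np

def randomized_low_rank(A, rank, oversample=10, rng=None):
    """Randomized low-rank approximation A ~= Q @ B.

    A is probed with a Gaussian random matrix, an orthonormal basis Q of the
    sampled range is extracted with a QR factorization, and A is projected
    onto that basis.  This is the basic sampling building block of
    randomized compression schemes, with a fixed (non-adaptive) sample count.
    """
    rng = rng or np.random.default_rng(0)
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Y = A @ Omega                     # sample the range of A
    Q, _ = np.linalg.qr(Y)            # orthonormal basis of the samples
    B = Q.T @ A                       # re-express A in that basis
    return Q, B

# usage sketch: compress a numerically low-rank block and apply it to a vector
# Q, B = randomized_low_rank(A_block, rank=32)
# y = Q @ (B @ x)                     # approximates A_block @ x
```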