
    Application-tailored Linear Algebra Algorithms: A Search-Based Approach

    In this paper, we tackle the problem of automatically generating algorithms for linear algebra operations by taking advantage of problem-specific knowledge. In most situations, users possess much more information about the problem at hand than current libraries and computing environments can accept; evidence shows that, if properly exploited, such information leads to unexpected speedups. We introduce a knowledge-aware linear algebra compiler that allows users to input matrix equations together with properties of the operands and of the problem itself; for instance, they can specify that the equation is part of a sequence, and how successive instances relate to one another. The compiler exploits all this information to guide the generation of algorithms, to limit the size of the search space, and to avoid redundant computations. We applied the compiler to equations arising in sensitivity and genome studies; the algorithms produced exhibit, respectively, 100- and 1000-fold speedups.
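    To illustrate the kind of reuse such a compiler can discover, here is a minimal Python sketch, assuming a GLS-style equation b = (X^T M^-1 X)^-1 X^T M^-1 y, typical of genome studies, in which the SPD matrix M is shared across a sequence of vectors y_i. The function and variable names are illustrative, not the compiler's output:

        import numpy as np
        from scipy.linalg import cho_factor, cho_solve

        def gls_sequence(X, M, ys):
            """Solve b_i = (X^T M^-1 X)^-1 X^T M^-1 y_i for a sequence of y_i.

            Problem knowledge exploited: M is symmetric positive definite and
            identical across the sequence, so its Cholesky factorization is
            computed once rather than once per instance.
            """
            cM = cho_factor(M)        # O(n^3), amortized over the sequence
            W = cho_solve(cM, X)      # M^-1 X, also reused for every y_i
            cA = cho_factor(X.T @ W)  # X^T M^-1 X is small (p x p) and SPD
            return [cho_solve(cA, W.T @ y) for y in ys]  # cheap per instance

        # Usage: 100 GLS instances sharing one covariance matrix M.
        rng = np.random.default_rng(0)
        n, p = 200, 4
        X = rng.standard_normal((n, p))
        B = rng.standard_normal((n, n))
        M = B @ B.T + n * np.eye(n)   # SPD by construction
        bs = gls_sequence(X, M, [rng.standard_normal(n) for _ in range(100)])

    A generic library call would refactor M for every instance; declaring the sequence structure up front is precisely the kind of information the compiler accepts.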

    Evaluation and tuning of the Level 3 CUBLAS for graphics processors

    The increase in performance of the last generations of graphics processors (GPUs) has made this class of platform a coprocessing tool with remarkable success in certain types of operations. In this paper we evaluate the performance of the Level 3 operations in CUBLAS, the implementation of BLAS for NVIDIA® GPUs with unified architecture. From this study, we gain insights on the quality of the kernels in the library and we propose several alternative implementations that are competitive with those in CUBLAS. Experimental results on a GeForce 8800 Ultra compare the performance of CUBLAS and the new variants.
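    For context, a comparable Level 3 measurement can be scripted today in Python. This sketch assumes CuPy, whose matmul dispatches to cuBLAS GEMM; the paper's own harness drives CUBLAS directly on a GeForce 8800 Ultra, so this is a stand-in, not the authors' code:

        import cupy as cp

        def gemm_gflops(n, reps=10, dtype=cp.float32):
            """Rough GFLOPS of cuBLAS GEMM as invoked through CuPy's matmul."""
            A = cp.random.rand(n, n, dtype=dtype)
            B = cp.random.rand(n, n, dtype=dtype)
            _ = A @ B                      # warm-up: allocations, handle setup
            start, stop = cp.cuda.Event(), cp.cuda.Event()
            start.record()
            for _ in range(reps):
                C = A @ B                  # dispatches to cublas<t>gemm
            stop.record()
            stop.synchronize()
            ms = cp.cuda.get_elapsed_time(start, stop) / reps
            return 2.0 * n**3 / (ms * 1e-3) / 1e9   # 2n^3 flops per GEMM

        for n in (1024, 2048, 4096):
            print(n, round(gemm_gflops(n), 1), "GFLOPS")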

    Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

    The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, its two composition constructs, $\parallel$ (parallel) and $;$ (serial), are insufficient for expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct, $\leadsto$, to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms, ranging from dense linear algebra to dynamic programming, in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., parallel memory hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers to the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time of the algorithms in this paper is $O\left(\frac{\sum_{i=0}^{h-1} Q^{*}(\mathsf{t};\sigma\cdot M_i)\cdot C_i}{p}\right)$, where $Q^{*}$ is the cache complexity of task $\mathsf{t}$, $C_i$ is the cost of a cache miss at the level-$i$ cache, which is of size $M_i$, $\sigma\in(0,1)$ is a constant, and $p$ is the number of processors in an $h$-level cache hierarchy.
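    To make "partial dependencies" concrete: in a tiled dynamic-programming table, tile (i, j) needs only tiles (i-1, j) and (i, j-1), whereas a fork-join formulation inserts a barrier per anti-diagonal. The Python sketch below expresses the dataflow version with futures; it illustrates only the dependence structure (the paper's space-bounded schedulers achieve this without blocking a thread per task):

        from concurrent.futures import ThreadPoolExecutor

        def wavefront(nb, work):
            """Run work(i, j) over an nb x nb tile grid, dataflow style:
            tile (i, j) starts as soon as (i-1, j) and (i, j-1) finish,
            instead of waiting for its whole anti-diagonal."""
            futs = {}
            with ThreadPoolExecutor(max_workers=nb * nb) as pool:
                def run(i, j):
                    if i > 0:
                        futs[i - 1, j].result()   # wait on the two
                    if j > 0:
                        futs[i, j - 1].result()   # predecessors only
                    return work(i, j)
                for i in range(nb):        # row-major submission, so each
                    for j in range(nb):    # task waits only on earlier ones
                        futs[i, j] = pool.submit(run, i, j)
                return futs[nb - 1, nb - 1].result()

        # Usage: each tile reports itself; real code would update a DP block.
        print(wavefront(4, lambda i, j: f"tile ({i},{j}) done"))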

    Deriving Algorithms for Triangular Tridiagonalization of a (Skew-)Symmetric Matrix

    We apply the FLAME methodology to derive algorithms, hand in hand with their proofs of correctness, for the computation of the $LTL^T$ decomposition (with and without pivoting) of a skew-symmetric matrix. The approach yields known as well as new algorithms, presented using the FLAME notation. A number of BLAS-like primitives are exposed at the core of blocked algorithms that can attain high performance. The insights can be easily extended to yield algorithms for computing the $LTL^T$ decomposition of a symmetric matrix.
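    For readers unfamiliar with the operation, the following unblocked Python sketch, a Parlett-Reid-style reduction written from the definition rather than taken from the paper, shows what A = L T L^T computes for a skew-symmetric A. It uses no pivoting, which is exactly the weakness the paper's pivoted and blocked variants address:

        import numpy as np

        def ltlt_skew(A):
            """Unblocked A = L T L^T for skew-symmetric A (no pivoting).

            L is unit lower triangular (first column e_0); T is skew-symmetric
            tridiagonal. Raises on a zero pivot, where pivoting is required.
            """
            A = A.astype(float).copy()
            n = A.shape[0]
            L = np.eye(n)
            for k in range(n - 2):
                piv = A[k + 1, k]
                if piv == 0.0:
                    raise ZeroDivisionError("zero pivot: needs pivoting")
                l = A[k + 2:, k] / piv                    # multipliers
                A[k + 2:, :] -= np.outer(l, A[k + 1, :])  # congruence: rows,
                A[:, k + 2:] -= np.outer(A[:, k + 1], l)  # then columns
                L[k + 2:, k + 1] = l
            T = np.triu(np.tril(A, 1), -1)    # reduced A is now tridiagonal
            return L, T

        # Usage: verify the factorization on a random skew-symmetric matrix.
        rng = np.random.default_rng(1)
        B = rng.standard_normal((6, 6))
        A = B - B.T
        L, T = ltlt_skew(A)
        assert np.allclose(L @ T @ L.T, A)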

    Trading Off Performance for Energy in Linear Algebra Operations with Applications in Control Theory

    We analyze the performance-power-energy balance of a conventional Intel Xeon multicore processor and two low-power architectures, an Intel Atom processor and a system pairing a quad-core ARM Cortex-A9 with an NVIDIA Quadro 1000M, using a high-performance implementation of Gauss-Jordan elimination (GJE) for matrix inversion. The blocked version of this algorithm employed in the experimental evaluation mostly comprises matrix-matrix products, so the results carry beyond simple matrix inversion and are representative of a wide variety of dense linear algebra operations and codes.
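    As a reference for what is being timed, here is an unblocked, in-place Gauss-Jordan inversion in Python (no pivoting; an illustration, not the authors' code). The blocked variant in the paper applies the same transformation to panels of b columns, turning the rank-1 update below into the matrix-matrix products that dominate its runtime:

        import numpy as np

        def gje_invert(A):
            """In-place, unblocked Gauss-Jordan inversion (no pivoting)."""
            A = A.astype(float).copy()
            n = A.shape[0]
            for k in range(n):
                p = A[k, k]
                A[k, :] /= p               # scale pivot row (A[k,k] becomes 1)
                f = A[:, k].copy()
                f[k] = 0.0                 # keep the pivot row itself intact
                A -= np.outer(f, A[k, :])  # rank-1 update: eliminate column k
                A[:, k] = -f / p           # column k of the growing inverse
                A[k, k] = 1.0 / p
            return A

        # Usage: invert a diagonally dominant matrix (safe without pivoting).
        rng = np.random.default_rng(2)
        n = 100
        A = rng.standard_normal((n, n)) + n * np.eye(n)
        assert np.allclose(gje_invert(A) @ A, np.eye(n))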