
21 research outputs found

Application-tailored Linear Algebra Algorithms: A search-based Approach

Author: Bientinesi Paolo
Fabregat-Traver Diego
Publication date: 26/11/2012

In this paper, we tackle the problem of automatically generating algorithms for linear algebra operations by taking advantage of problem-specific knowledge. In most situations, users possess much more information about the problem at hand than what current libraries and computing environments accept; evidence shows that if properly exploited, such information leads to uncommon/unexpected speedups. We introduce a knowledge-aware linear algebra compiler that allows users to input matrix equations together with properties about the operands and the problem itself; for instance, they can specify that the equation is part of a sequence, and how successive instances are related to one another. The compiler exploits all this information to guide the generation of algorithms, to limit the size of the search space, and to avoid redundant computations. We applied the compiler to equations arising as part of sensitivity and genome studies; the algorithms produced exhibit, respectively, 100- and 1000-fold speedups.

arXiv.org e-Print Archive

Publikationsserver der RWTH Aachen University

Evaluation and tuning of the Level 3 CUBLAS for graphics processors

Author: Enrique S. Quintana-Ortí
Francisco D. Igual
Maribel Castillo
Rafael Mayo
Sergio Barrachina
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2008

The increase in performance of the last generations of graphics processors (GPUs) has made this class of platform a coprocessing tool with remarkable success in certain types of operations. In this paper we evaluate the performance of the Level 3 operations in CUBLAS, the implementation of BLAS for NVIDIA® GPUs with unified architecture. From this study, we gain insights on the quality of the kernels in the library and we propose several alternative implementations that are competitive with those in CUBLAS. Experimental results on a GeForce 8800 Ultra compare the performance of CUBLAS and the new variants.

CiteSeerX

Crossref

An economist's guide to mechanized reasoning or My computer just proved 84 impossibility theorems

Author: Kerber Manfred
Lange Christoph
Rowat Colin

University of Birmingham Research Portal

Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

Author: Dinh David
Simhadri Harsha Vardhan
Tang Yuan
Publication date: 14/02/2016

The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "$\parallel$" (parallel) and "$;$" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct "$\leadsto$" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic-programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is $O\left(\frac{\sum_{i=0}^{h-1} Q^{*}({\mathsf t};\sigma\cdot M_i)\cdot C_i}{p}\right)$, where $Q^{*}$ is the cache complexity of task ${\mathsf t}$, $C_i$ is the cost of cache miss at level-$i$ cache which is of size $M_i$, $\sigma\in(0,1)$ is a constant, and $p$ is the number of processors in an $h$-level cache hierarchy.

arXiv.org e-Print Archive

Crossref
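
The running-time expression quoted in the abstract above can be evaluated numerically once a cache-complexity function is supplied. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and `matmul_q_star` uses an assumed toy model, Q*(t; M) = n^3 / sqrt(M), with constants and block size dropped. It computes only the expression inside the O(.), so the result is meaningful up to a constant factor.

```python
import math


def nd_running_time_bound(q_star, cache_sizes, miss_costs, sigma, p):
    """Evaluate sum_{i=0}^{h-1} Q*(t; sigma * M_i) * C_i / p.

    q_star      -- callable mapping a cache size M to Q*(t; M) (assumed model)
    cache_sizes -- [M_0, ..., M_{h-1}], one size per cache level
    miss_costs  -- [C_0, ..., C_{h-1}], cost of a miss at each level
    sigma       -- constant in the open interval (0, 1)
    p           -- number of processors
    """
    if not 0 < sigma < 1:
        raise ValueError("sigma must lie in (0, 1)")
    if len(cache_sizes) != len(miss_costs):
        raise ValueError("need one miss cost per cache level")
    if p <= 0:
        raise ValueError("p must be positive")
    total = sum(q_star(sigma * m) * c for m, c in zip(cache_sizes, miss_costs))
    return total / p


def matmul_q_star(n):
    """Hypothetical toy model for n x n matrix multiplication: n^3 / sqrt(M)."""
    return lambda m: n ** 3 / math.sqrt(m)


if __name__ == "__main__":
    # Two cache levels (h = 2), two processors; all numbers are illustrative.
    bound = nd_running_time_bound(
        matmul_q_star(4), cache_sizes=[64, 1024], miss_costs=[1, 10],
        sigma=0.25, p=2,
    )
    print(bound)
```

Passing the cache-complexity model as a callable keeps the sketch independent of any particular algorithm, so a different Q* can be swapped in without touching the sum.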

Deriving Algorithms for Triangular Tridiagonalization a (Skew-)Symmetric Matrix

Author: Matthews Devin
Myers Maggie
van de Geijn Robert
Xu RuQing G.
Publication date: 17/11/2023

We apply the FLAME methodology to derive algorithms hand in hand with their proofs of correctness for the computation of the $LTL^T$ decomposition (with and without pivoting) of a skew-symmetric matrix. The approach yields known as well as new algorithms, presented using the FLAME notation. A number of BLAS-like primitives are exposed at the core of blocked algorithms that can attain high performance. The insights can be easily extended to yield algorithms for computing the $LTL^T$ decomposition of a symmetric matrix.

arXiv.org e-Print Archive

Trading Off Performance for Energy in Linear Algebra Operations with Applications in Control Theory

Author: Clei
Journal
Publication date: 31/03/2020

We analyze the performance-power-energy balance of a conventional Intel Xeon multicore processor and two low-power architectures (an Intel Atom processor and a system with a quad-core ARM Cortex A9 + NVIDIA Quadro 1000M) using a high-performance implementation of Gauss-Jordan elimination (GJE) for matrix inversion. The blocked version of this algorithm employed in the experimental evaluation mostly comprises matrix-matrix products, so the results of the evaluation carry beyond simple matrix inversion and are representative of a wide variety of dense linear algebra operations and codes.

CiteSeerX
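
For readers unfamiliar with the Gauss-Jordan elimination (GJE) mentioned in the last abstract, the following is a minimal sketch of matrix inversion by GJE with partial pivoting. It is an unblocked, pure-Python illustration, not the blocked high-performance implementation evaluated in that paper; the function name and the `tol` parameter are assumptions made for this example.

```python
def gauss_jordan_inverse(a, tol=1e-12):
    """Invert a square matrix (list of rows) by Gauss-Jordan elimination."""
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("matrix must be square")

    # Build the augmented matrix [A | I].
    aug = [
        [float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(a)
    ]

    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]

        # Scale the pivot row so the pivot entry becomes 1.
        piv = aug[col][col]
        aug[col] = [x / piv for x in aug[col]]

        # Eliminate the column from every other row (above and below).
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]

    # The right half of [I | A^{-1}] now holds the inverse.
    return [row[n:] for row in aug]


if __name__ == "__main__":
    for row in gauss_jordan_inverse([[4, 7], [2, 6]]):
        print(row)
```

Unlike LU-based inversion, GJE eliminates above and below the pivot in a single sweep. In a blocked formulation those updates become matrix-matrix products, which is the property the abstract relies on.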