Application-tailored Linear Algebra Algorithms: A search-based Approach
In this paper, we tackle the problem of automatically generating algorithms
for linear algebra operations by taking advantage of problem-specific
knowledge. In most situations, users possess much more information about the
problem at hand than what current libraries and computing environments accept;
evidence shows that if properly exploited, such information leads to
uncommon/unexpected speedups. We introduce a knowledge-aware linear algebra
compiler that allows users to input matrix equations together with properties
about the operands and the problem itself; for instance, they can specify that
the equation is part of a sequence, and how successive instances are related to
one another. The compiler exploits all this information to guide the generation
of algorithms, to limit the size of the search space, and to avoid redundant
computations. We applied the compiler to equations arising as part of
sensitivity and genome studies; the algorithms produced exhibit, respectively,
100- and 1000-fold speedups.
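
As a rough illustration of what exploiting problem-specific knowledge can mean, the sketch below assumes two pieces of knowledge that a general-purpose solver would not be told: the matrix is symmetric positive definite, and the same matrix recurs across a sequence of right-hand sides. Under those assumptions, the Cholesky factor is computed once and reused. The function names are hypothetical, and this is not the compiler described in the abstract, only a hand-written instance of the kind of redundancy it aims to avoid.

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L * L^T (A assumed SPD)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][p] ** 2 for p in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_with_factor(L, b):
    """Solve (L * L^T) x = b by forward and backward substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][p] * y[p] for p in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[p][i] * x[p] for p in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def solve_sequence(A, rhs_list):
    """Factor A once, then reuse the factor for every right-hand side."""
    L = cholesky(A)  # O(n^3) work, done a single time
    return [solve_with_factor(L, b) for b in rhs_list]  # O(n^2) each
```

For a sequence of k systems of size n, this replaces k cubic-cost factorizations with one, which is the simplest form of avoiding redundant computation across related problem instances.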
Evaluation and tuning of the Level 3 CUBLAS for graphics processors
The increase in performance of the last generations of graphics processors (GPUs) has made this class of platform a coprocessing tool with remarkable success in certain types of operations. In this paper we evaluate the performance of the Level 3 operations in CUBLAS, the implementation of BLAS for NVIDIA® GPUs with unified architecture. From this study, we gain insights on the quality of the kernels in the library and we propose several alternative implementations that are competitive with those in CUBLAS. Experimental results on a GeForce 8800 Ultra compare the performance of CUBLAS and the new variants.
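
For readers unfamiliar with Level 3 BLAS, the sketch below shows the semantics of its central operation, GEMM (C := alpha*A*B + beta*C), written as a tiled loop nest in plain Python. The function name `gemm_blocked` and the `block` parameter are illustrative assumptions; this is neither CUBLAS code nor one of the variants proposed in the paper, only a CPU-side picture of the blocking idea that tuned kernels build on.

```python
def gemm_blocked(alpha, A, B, beta, C, block=2):
    """Return alpha*A*B + beta*C, computed tile by tile.

    A is m x k, B is k x n, C is m x n (lists of lists).
    """
    m, k, n = len(A), len(B), len(B[0])
    # Start from beta*C so that tile products can be accumulated on top.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i0 in range(0, m, block):          # tile rows of C
        for j0 in range(0, n, block):      # tile columns of C
            for p0 in range(0, k, block):  # tiles along the inner dimension
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = 0.0
                        for p in range(p0, min(p0 + block, k)):
                            acc += A[i][p] * B[p][j]
                        out[i][j] += alpha * acc
    return out
```

Each tile of the result is updated from small tiles of A and B, which is what lets an optimized implementation keep operands in fast memory while they are reused.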
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
The nested parallel (a.k.a. fork-join) model is widely used for writing
parallel programs. However, the two composition constructs, i.e. "$\parallel$"
(parallel) and "$;$" (serial), are insufficient in expressing "partial
dependencies" or "partial parallelism" in a program. We propose a new dataflow
composition construct "$\leadsto$" to express partial dependencies in
algorithms in a processor- and cache-oblivious way, thus extending the Nested
Parallel (NP) model to the Nested Dataflow (ND) model. We redesign
several divide-and-conquer algorithms ranging from dense linear algebra to
dynamic-programming in the ND model and prove that they all have optimal span
while retaining optimal cache complexity. We propose the design of runtime
schedulers that map ND programs to multicore processors with multiple levels of
possibly shared caches (i.e., Parallel Memory Hierarchies) and provide
theoretical guarantees on their ability to preserve locality and load balance.
For this, we adapt space-bounded (SB) schedulers for the ND model. We show that
our algorithms have increased "parallelizability" in the ND model, and that SB
schedulers can use the extra parallelizability to achieve asymptotically
optimal bounds on cache misses and running time on a greater number of
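processors than in the NP model. The running time for the algorithms in this
paper is $O\left(\frac{\sum_{i=0}^{h-1} Q^{*}(t;\sigma\cdot M_i)\cdot C_i}{p}\right)$, where $Q^{*}$ is the cache complexity of task $t$,
$C_i$ is the cost of a cache miss at the level-$i$ cache, which is of size $M_i$,
$\sigma\in(0,1)$ is a constant, and $p$ is the number of processors in an
$h$-level cache hierarchy.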
Deriving Algorithms for Triangular Tridiagonalization a (Skew-)Symmetric Matrix
We apply the FLAME methodology to derive algorithms hand in hand with their
proofs of correctness for the computation of the $LTL^T$ decomposition
(with and without pivoting) of a skew-symmetric matrix. The approach yields
known as well as new algorithms, presented using the FLAME notation. A number
of BLAS-like primitives are exposed at the core of blocked algorithms that can
attain high performance. The insights can be easily extended to yield
algorithms for computing the $LTL^T$ decomposition of a symmetric matrix.
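
To make the target factorization concrete, the following is a minimal unblocked sketch that reduces a symmetric (or skew-symmetric) matrix to tridiagonal form T by congruence with a unit lower triangular L, so that A = L T L^T. It follows the classical Parlett-Reid style of elimination without pivoting, not the FLAME-derived algorithms of the paper; the function name `ltlt_unpivoted` is an assumption, and the routine simply fails if a zero pivot is met.

```python
def ltlt_unpivoted(A):
    """Return (L, T) with A = L * T * L^T.

    L is unit lower triangular with first column e_1; T is tridiagonal.
    A is assumed symmetric or skew-symmetric; no pivoting is performed.
    """
    n = len(A)
    T = [[float(v) for v in row] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        pivot = T[k + 1][k]  # subdiagonal entry used to eliminate below it
        for i in range(k + 2, n):
            if T[i][k] == 0.0:
                continue  # nothing to eliminate in this row
            if pivot == 0.0:
                raise ZeroDivisionError("zero pivot; pivoting would be needed")
            m = T[i][k] / pivot
            L[i][k + 1] = m
            # Apply the Gauss transform from the left: row i -= m * row k+1.
            for j in range(n):
                T[i][j] -= m * T[k + 1][j]
            # Apply its transpose from the right: col i -= m * col k+1.
            for j in range(n):
                T[j][i] -= m * T[j][k + 1]
    return L, T
```

Applying each row operation together with the matching column operation keeps the working matrix (skew-)symmetric, which is why the result is tridiagonal rather than merely upper Hessenberg.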
Trading Off Performance for Energy in Linear Algebra Operations with Applications in Control Theory
We analyze the performance-power-energy balance of a conventional Intel Xeon multicore processor and two low-power architectures (an Intel Atom processor and a system with a quad-core ARM Cortex A9 + NVIDIA Quadro 1000M) using a high-performance implementation of Gauss-Jordan elimination (GJE) for matrix inversion. The blocked version of this algorithm employed in the experimental evaluation mostly comprises matrix-matrix products, so the results of the evaluation carry beyond simple matrix inversion and are representative of a wide variety of dense linear algebra operations and codes.
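
The underlying method can be sketched as follows: an unblocked Gauss-Jordan elimination with partial pivoting that inverts a small dense matrix. This is only an illustration under the assumption of a square, nonsingular input; the name `gje_inverse` is hypothetical, and the blocked, matrix-product-rich variant evaluated in the paper is not reproduced here.

```python
def gje_inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination on [A | I]."""
    n = len(A)
    # Build the augmented matrix [A | I] in floating point.
    M = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(A)
    ]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[piv][col] == 0.0:
            raise ValueError("matrix is singular")
        M[col], M[piv] = M[piv], M[col]
        # Scale the pivot row so the diagonal entry becomes 1.
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        # Eliminate the column above and below the pivot.
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in M]
```

Unlike LU-based inversion, every step updates all rows at once; in a blocked formulation those updates become matrix-matrix products, which is the property the abstract relies on.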