96 research outputs found
Real-Time, Dynamic Hardware Accelerators for BLAS Computation
This paper presents an approach to increasing the capability of scientific computing through the use of real-time, partially reconfigurable hardware accelerators that implement basic linear algebra subprograms (BLAS). The use of reconfigurable hardware accelerators for computing linear algebra functions has the potential to increase floating point computation while at the same time providing an architecture that minimizes data movement latency and increase power efficiency. While there has been significant work by the computing community to optimize BLAS routines at the software level, optimizing these routines in hardware using reconfigurable fabrics is in its infancy. This paper begins with a comprehensive overview of the history and evolution of BLAS for use in scientific computing. In the reviews current successes in using reconfigurable computing architectures achieve acceleration. It then presents an investigation of an accelerator approach with a granularity at the logic circuit level through real-time, partial reconfiguration of a programmable fabric with static accelerator cache memory to minimize data movement. Empirical data is presented for a study on a single-FPGA
Developing numerical libraries in Java
The rapid and widespread adoption of Java has created a demand for reliable
and reusable mathematical software components to support the growing number of
compute-intensive applications now under development, particularly in science
and engineering. In this paper we address practical issues of the Java language
and environment which have an effect on numerical library design and
development. Benchmarks which illustrate the current levels of performance of
key numerical kernels on a variety of Java platforms are presented. Finally, a
strategy for the development of a fundamental numerical toolkit for Java is
proposed and its current status is described.Comment: 11 pages. Revised version of paper presented to the 1998 ACM
Conference on Java for High Performance Network Computing. To appear in
Concurrency: Practice and Experienc
The iterative solver template library
The numerical solution of partial differential equations frequently requires the solution of large and sparse linear systems. Using generic programming techniques like in C++ one can create solver libraries that allow efficient realization of "fine grained interfaces", i. e. with functions consisting only of a few lines, like access to individual matrix entries. This prevents code replication and allows programmers to work more efficiently.
In this paper we present the "Iterative Solver Template Library" (ISTL) which is part of the "Distributed and Unified Numerics Environment" (DUNE). It applies generic programming in C++ to the domain of iterative solvers of linear systems stemming from finite element discretizations. Those discretizations exhibit a lot of structure. Our matrix and vector interface supports a block recursive structure. I. E. each sparse matrix entry can be a sparse or a small dense matrix itself. Based on this interface we present efficient solvers that use the recursive block structure via template metaprogramming
Adaptive libraries for multicore architectures with explicitly-managed memory hierarchies
Programming of commodity multicore processors is a challenging task and it becomes even harder when the processor has an explicitly-managed memory hierarchy (EMMA). Software libraries in the field of matrix algebra try to keep pace with this challenge by using the dataflow model of computation and constructing tiled algorithms. A new approach to high-performance software library construction is proposed, which moves scheduling decisions to compile-time and is portable between different EMMA platforms. Performance and scalability analyses both demonstrate promising results. Experiments demonstrate near linear speedup on a synthetic multicore architecture, incorporating up to 16 working computational cores. Performance of a generated code is competitive with vendor BLAS implementations for the Cell processor
Using machine learning to improve dense and sparse matrix multiplication kernels
This work is comprised of two different projects in numerical linear algebra. The first project is about using machine learning to speed up dense matrix-matrix multiplication computations on a shared-memory computer architecture. We found that found basic loop-based matrix-matrix multiplication algorithms tied to a decision tree algorithm selector were competitive to using Intel\u27s Math Kernel Library for the same computation. The second project is a preliminary report about re-implementing an encoding format for spare matrix-vector multiplication called Compressed Spare eXtended (CSX). The goal for the second project is to use machine learning to aid in encoding matrix substructures in the CSX format without using exhaustive search and a Just-In-Time compiler
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense, matrix-multiplication using the conventional
algorithm, where the input matrices were too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as (#arithmetic operations / ), where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain.Comment: 27 pages, 2 table
Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance
We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they be-
come rich in operations that can achieve near-peak performance on a modern processor. The key is a
novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular
vector matrix. This algorithm is then implemented with optimizations that (1) leverage vector instruction
units to increase floating-point throughput, and (2) fuse multiple rotations to decrease the total number of
memory operations. We demonstrate the merits of these new QR algorithms for computing the Hermitian
eigenvalue decomposition (EVD) and singular value decomposition (SVD) of dense matrices when all eigen-
vectors/singular vectors are computed. The approach yields vastly improved performance relative to the
traditional QR algorithms for these problems and is competitive with two commonly used alternatives—
Cuppen’s Divide and Conquer algorithm and the Method of Multiple Relatively Robust Representations—
while inheriting the more modest O(n) workspace requirements of the original QR algorithms. Since the
computations performed by the restructured algorithms remain essentially identical to those performed by
the original methods, robust numerical properties are preserved
- …