88 research outputs found

PB-BLAS: a set of parallel block basic linear algebra subprograms

Author: David W. Walker
Jack J. Dongarra
Jaeyoung Choi
Publication venue: Wiley
Publication date: 01/01/2005

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

Author: Buttari Alfredo
Dongarra Jack
Kurzak Jakub
Langou Julien
Publication date: 01/01/2007

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU and QR factorizations in which the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in out-of-order execution of the tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and with vendor implementations.
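
The idea of representing a factorization as a sequence of small tasks on square blocks can be illustrated with a minimal sketch. It assumes the textbook right-looking tiled Cholesky scheme and labels tasks with the standard LAPACK/BLAS kernel names (POTRF, TRSM, SYRK, GEMM); the function name `tiled_cholesky_tasks` and the tuple layout are hypothetical and are not taken from the paper.

```python
def tiled_cholesky_tasks(nt):
    """List the tile-level tasks of a right-looking tiled Cholesky
    factorization of a matrix split into nt x nt square tiles.

    Each task is (kernel, step, (row, col)), where (row, col) is the
    tile the task writes. The list is one valid sequential order; a
    dynamic scheduler may reorder tasks as long as it respects the
    dependencies between tiles.
    """
    tasks = []
    for k in range(nt):
        # Factor the diagonal tile of step k.
        tasks.append(("POTRF", k, (k, k)))
        # Triangular solves for the tiles below the diagonal tile.
        for i in range(k + 1, nt):
            tasks.append(("TRSM", k, (i, k)))
        # Update the trailing submatrix (lower triangle only).
        for i in range(k + 1, nt):
            tasks.append(("SYRK", k, (i, i)))
            for j in range(k + 1, i):
                tasks.append(("GEMM", k, (i, j)))
    return tasks


if __name__ == "__main__":
    for task in tiled_cholesky_tasks(3):
        print(task)
```

For a 3 x 3 grid of tiles this sketch yields 10 tasks: 3 POTRF, 3 TRSM, 3 SYRK and 1 GEMM. Only the POTRF tasks form a strictly sequential chain, which is the kind of step an out-of-order schedule can overlap with updates from earlier steps.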

arXiv.org e-Print Archive

CiteSeerX

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

MIMS EPrints

The University of Manchester - Institutional Repository

Maintaining High Performance Across All Problem Sizes and Parallel Scales Using Microkernel-based Linear Algebra

Author: Hasan Md Rakib
Publication venue: LSU Digital Commons
Publication date: 01/01/2017

Linear algebra underlies a large proportion of computational problems. With the continuous increase of scale on modern hardware, performance of small-sized linear algebra has become increasingly important. To overcome the shortcomings of conventional approaches, we employ a new approach using a microkernel framework provided by ATLAS to improve the performance of a few linear algebra routines for all problem sizes. Our initial research consists of improving the performance of parallel LU factorization in ATLAS, for which we were able to achieve up to 2.07x and 2.66x speedup for small problems, and up to 91% and 87% of theoretical peak performance for asymptotic problems, on a 12-core Intel Xeon and a 32-core AMD Opteron machine, respectively, outperforming all the state-of-the-art libraries at the time. Such performance was achieved via an exhaustive search of all the tuning parameters, which could take days. This motivated us to try to develop a computational model for our LU factorization that could predict those parameters by combining some basic empirical timings and a theoretical model based on the amount of required computations. While our model provided good prediction for mid-to-asymptotic sized problems, there were some unknown factors for small problems that could possibly be answered by extending the ATLAS tuning framework. While this extension is underway, we decided to pursue the model research using a simpler serial BLAS-based approach. We investigated and implemented two Level-3 BLAS routines, TRSM and TRMM, which are widely used, primarily by LAPACK operations such as the aforementioned LU factorization. With the microkernel-based approach, we were able to improve the performance of both routines by up to 15% and 73% for square and fat problems, respectively, over prior ATLAS implementations on modern hardware.
Finally, in collaborative research with ARM Inc., we improved the performance of the most important Level-3 BLAS operation, GEMM, in ATLAS by up to 53% by implementing microkernels for two 64-bit ARM architectures. This automatically improves other BLAS and LAPACK routines that rely on GEMM for high performance.

Louisiana State University

Clearer, Simpler and more Efficient LAPACK Routines for Symmetric Positive Definite Band Factorization

Author: Gustavson Fred G.
Quintana-Orti Gregorio
Quintana-Orti Enrique S.
Remon Alfredo
Wasniewski Jerzy
Publication venue: Technical University of Denmark, DTU Informatics, Building 321
Publication date: 01/01/2008

Online Research Database In Technology

Recommended from our members

PB-BLAS: A set of parallel block basic linear algebra subprograms

Author: Choi Jaeyoung
Dongarra J.
Walker D. W.
Publication venue: Oak Ridge National Laboratory
Publication date: 31/12/1994

We propose a new library of routines for performing dense linear algebra computations on block-partitioned matrices. The routines are referred to as the Block Basic Linear Algebra Subprograms, and their use is restricted to computations in which one or more of the matrices involved consists of a single row or column of blocks, and in which no more than one of the matrices consists of an unrestricted two-dimensional array of blocks. The functionality of the block BLAS routines can also be provided by Level 2 and 3 BLAS routines. However, for Non-Uniform Memory Access machines the use of the block BLAS permits certain optimizations in memory access to be taken advantage of. This is particularly true for distributed memory machines, for which the block BLAS are referred to as the Parallel Block Basic Linear Algebra Subprograms (PB-BLAS). The PB-BLAS are the main focus of this paper, and for a block-cyclic data distribution, a single row or column of blocks lies in a single row or column of the processor template. The PB-BLAS consist of calls to the sequential BLAS for local computations, and calls to the BLACS for communication. The PB-BLAS are the building blocks for implementing ScaLAPACK, the distributed-memory version of LAPACK, and provide the same ease-of-use and portability for ScaLAPACK that the BLAS provide for LAPACK. The PB-BLAS consist of all nine Level 3 BLAS routines, four of the Level 2 BLAS routines, and two auxiliary transpose routines. The PB-BLAS are currently available for all numeric data types, i.e., single and double precision real and complex.
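
The statement that a single row or column of blocks lies in a single row or column of the processor template can be checked with a small sketch. It assumes the processor template is a P x Q process grid, that blocks are dealt out cyclically in both dimensions, and that block (0, 0) belongs to process (0, 0); the function names `block_owner` and `owners_of_block_row` are hypothetical and do not come from the PB-BLAS or ScaLAPACK APIs.

```python
def block_owner(bi, bj, p, q):
    """Return the (process row, process column) that owns block (bi, bj)
    under a 2D block-cyclic distribution on a p x q process grid.

    Assumption: block (0, 0) is owned by process (0, 0).
    """
    return (bi % p, bj % q)


def owners_of_block_row(bi, num_block_cols, p, q):
    """Return the set of processes owning the blocks of block row bi."""
    return {block_owner(bi, bj, p, q) for bj in range(num_block_cols)}


if __name__ == "__main__":
    # Block row 5 of a matrix with 7 block columns on a 2 x 3 grid.
    print(sorted(owners_of_block_row(5, 7, 2, 3)))
```

Every owner of block row `bi` has the same process-row coordinate `bi % p`, so an operation on one row of blocks involves only one row of the grid; the same argument applies to block columns and process columns.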

UNT Digital Library

Performance Improvements of Common Sparse Numerical Linear Algebra Computations

Author: Luszczek Piotr Rafal
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/01/2003

Manufacturers of computer hardware are able to continuously sustain an unprecedented pace of progress in the computing speed of their products, partially due to increased clock rates but also because of ever more complicated chip designs. With new processor families appearing every few years, it is increasingly hard to achieve high performance rates in sparse matrix computations. This research proposes new methods for sparse matrix factorizations and applies generalizations of known concepts from related disciplines in an iterative code. The proposed solutions and extensions are implemented in ways that tend to deliver efficiency while retaining the ease of use of existing solutions. The implementations are thoroughly timed and analyzed using a commonly accepted set of test matrices. The tests were conducted on modern processors that seem to have gained an appreciable level of popularity and are fairly representative of a wider range of processor types that are available on the market now or in the near future. The new factorization technique formally introduced in the early chapters is later on proven to be quite competitive with state-of-the-art software currently available. Although not totally superior in all cases (as probably no single approach could possibly be), the new factorization algorithm exhibits a few promising features. In addition, an all-embracing optimization effort is applied to an iterative algorithm that stands out for its robustness. This also gives satisfactory results on the tested computing platforms in terms of performance improvement. The same set of test matrices is used to enable an easy comparison between both investigated techniques, even though they are customarily treated separately in the literature. Possible extensions of the presented work are discussed. They range from easily conceivable merging with existing solutions to rather more evolved schemes dependent on hard-to-predict progress in theoretical and algorithmic research.

University of Tennessee, Knoxville: Trace

CiteSeerX

Efficient Numerical Algorithms for Balanced Stochastic Truncation

Author: Enrique S. Quintana-Ortí
Gregorio Quintana-Ortí
Peter Benner
Publication date: 01/01/2001

We propose an efficient numerical algorithm for relative error model reduction based on balanced stochastic truncation. The method uses full-rank factors of the Gramians to be balanced versus each other and exploits the fact that for large-scale systems these Gramians are often of low numerical rank. We use the easy-to-parallelize sign function method as the major computational tool in determining these full-rank factors and demonstrate the numerical performance of the suggested implementation of balanced stochastic truncation model reduction.

CiteSeerX

Biblioteka Nauki - repozytorium artykułów

MPG.PuRe