88 research outputs found

    A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

    Full text link
    As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms where parallelism can only be exploited at the level of the BLAS operations and vendor implementations

    Maintaining High Performance Across All Problem Sizes and Parallel Scales Using Microkernel-based Linear Algebra

    Get PDF
    Linear algebra underlies a large proportion of computational problems. With the continuous increase of scale on modern hardware, performance of small sized linear algebra has become increasingly important. To overcome the shortcomings of conventional approaches, we employ a new approach using a microkernel framework provided by ATLAS to improve the performance of a few linear algebra routines for all problem sizes. Our initial research consists of improving the performance of parallel LU factorization in ATLAS for which we were able to achieve up to 2.07x and 2.66x speedup for small problems, up to 91% and 87% of theoretical peak performance for asymptotic problems on a 12-core Intel Xeon and a 32-core AMD Opteron machine, respectively, outperforming all the state-of-the-art libraries at the time. Such performance was achieved via an exhaustive search of all the tuning parameters, which could take days. This motivated us to try to develop a computational model for our LU factorization that could predict those parameters by combining some basic empirical timings and a theoretical model based on the amount of required computations. While our model provided good prediction for mid-to-asymptotic sized problems, there were some unknown factors for small problems that could possibly be answered by extending the ATLAS tuning framework. While this extension is underway, we decided to pursue the model research using simpler serial BLAS-based approach. We investigated and implemented two Level-3 BLAS routines: TRSM and TRMM that are widely used primarily by LAPACK operations like the aforementioned LU factorization. With the microkernel-based approach, we were able to improve the performance of both routines by up to 15% and 73% for square and fat problems, respectively, over prior ATLAS implementations on modern hardware. Finally, with a collaborative research with ARM Inc., we improved the performance of the most important Level-3 BLAS operation GEMM in ATLAS by up to 53% via implementing microkernels for two 64-bit ARM architectures. This automatically improves other BLAS and LAPACK routines that rely on GEMM for high performance

    Performance Improvements of Common Sparse Numerical Linear Algebra Computations

    Get PDF
    Manufacturers of computer hardware are able to continuously sustain an unprecedented pace of progress in computing speed of their products, partially due to increased clock rates but also because of ever more complicated chip designs. With new processor families appearing every few years, it is increasingly harder to achieve high performance rates in sparse matrix computations. This research proposes new methods for sparse matrix factorizations and applies in an iterative code generalizations of known concepts from related disciplines. The proposed solutions and extensions are implemented in ways that tend to deliver efficiency while retaining ease of use of existing solutions. The implementations are thoroughly timed and analyzed using a commonly accepted set of test matrices. The tests were conducted on modern processors that seem to have gained an appreciable level of popularity and are fairly representative for a wider range of processor types that are available on the market now or in the near future. The new factorization technique formally introduced in the early chapters is later on proven to be quite competitive with state of the art software currently available. Although not totally superior in all cases (as probably no single approach could possibly be), the new factorization algorithm exhibits a few promising features. In addition, an all-embracing optimization effort is applied to an iterative algorithm that stands out for its robustness. This also gives satisfactory results on the tested computing platforms in terms of performance improvement. The same set of test matrices is used to enable an easy comparison between both investigated techniques, even though they are customarily treated separately in the literature. Possible extensions of the presented work are discussed. They range from easily conceivable merging with existing solutions to rather more evolved schemes dependent on hard to predict progress in theoretical and algorithmic research

    Efficient Numerical Algorithms for Balanced Stochastic Truncation

    Get PDF
    We propose an efficient numerical algorithm for relative error model reduction based on balanced stochastic truncation. The method uses full-rank factors of the Gramians to be balanced versus each other and exploits the fact that for large-scale systems these Gramians are often of low numerical rank. We use the easy-to-parallelize sign function method as the major computational tool in determining these full-rank factors and demonstrate the numerical performance of the suggested implementation of balanced stochastic truncation model reduction
    • 

    corecore