Search CORE

6 research outputs found

Tiled Algorithms for Matrix Computations on Multicore Architectures

Author: Elizabeth R. Jessup
Gita Alaghband
Henricus M Bouwmeester
Henricus M Bouwmeester
Julien Langou Advisor
Stephen Billups
Publication venue
Publication date: 13/03/2013
Field of study

The current computer architecture has moved towards the multi/many-core structure. However, the algorithms in the current sequential dense numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multi/many-core architectures. A new family of algorithms, the tile algorithms, has recently been introduced to circumvent this problem. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. The goal of this thesis is to study tiled algorithms in a multi/many-core setting and to provide new algorithms which exploit the current architecture to improve performance relative to current state-of-the-art libraries while maintaining the stability and robustness of these libraries.Comment: PhD Thesis, 2012 http://math.ucdenver.ed

arXiv.org e-Print Archive

CiteSeerX

Fault Tolerant and Energy Efficient One-Sided Matrix Decompositions on Heterogeneous Systems with GPUs

Author: Chen Jieyang
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Heterogeneous computing system with both CPUs and GPUs has become a class of widely used hardware architecture in supercomputers. As heterogeneous systems delivering higher computational performance, they are being built with an increasing number of complex components. This is anticipated that these systems will be more susceptible to hardware faults with higher power consumption. Numerical linear algebra libraries are used in a wide spectrum of high-performance scientific applications. Among numerical linear algebra operations, one-sided matrix decompositions can sometimes take a large portion of execution time or even dominate the whole scientific application execution. Due to the computational characteristic of one-sided matrix decompositions, they are very suitable for computation platforms such as heterogeneous systems with CPUs and GPUs. Many works have been done to implement and optimize one-sided matrix decompositions on heterogeneous systems with CPUs and GPUs. However, it is challenging to enable stable and high performance one-sided matrix decompositions running on computing platforms that are unreliable and high energy consumption. So, in this thesis, we aim to develop novel fault tolerance and energy efficiency optimizations for one-sided matrix decompositions on heterogeneous systems with CPUs and GPUs.To improve reliability and energy efficiency, extensive researches have been done on developing and optimizing fault tolerance methods and energy-saving strategies for one-sided matrix decompositions. However, current designs still have several limitations: (1) Little has been done on developing and optimizing fault tolerance method for one-sided matrix decompositions on heterogeneous systems with GPUs; (2) Limited by the protection coverage and strength, existing fault tolerance works provide insufficient protection when applied to one-sided matrix decompositions on heterogeneous systems with GPUs; (3) Lack the knowledge of algorithms, existing system level energy saving solutions cannot achieve the optimal energy savings due to potentially inaccurate and high-cost workload prediction they rely on when they are used in one-sided matrix decompositions; (4) It is challenging to apply both fault tolerance techniques and energy saving strategies to one-side matrix decompositions at the same time given that their current designs are not naturally compatible with each other.To address the first problem, based on the original (Algorithm Based Fault Tolerance) ABFT, we develop the first ABFT for matrix decomposition on heterogeneous systems with GPUs together with the novel storage errors protection and several optimization techniques specifically for GPUs. As for the second problem, we design a novel checksum scheme for ABFT that allows data stored in matrices to be encoded in two dimensions. This stronger checksum encoding mechanism enables much stronger protection including enhanced error propagation protection. In addition, we introduce a more efficient checking scheme. By prioritizing the checksum verification according to the sensitivity of matrix operations to soft errors with optimized checksum verification kernel for GPUs, we can achieve strong protect to matrix decompositions with comparable overhead. For the third problem, to improve energy efficiency for one-sided matrix decompositions, we introduce an algorithm-based energy-saving approach designed to maximize energy savings by utilizing algorithmic characteristics. Our approach can predict program execution behavior much more accurately, which is difficult for system level solutions for applications with variable execution characteristics. Experiments show that our approach can lead to much higher energy saving than existing works. Finally, for the fourth problem, we propose a novel energy saving approach for one-sided matrix decompositions on heterogeneous systems with GPUs. It allows energy saving strategies and fault tolerance techniques to be enabled at the same time without brings performance impact or extra energy cost

Ezid

eScholarship - University of California

Recommended from our members

Fast and Reliable Methods in Numerical Linear Algebra, Signal Processing, and Image Processing

Author: Folberth James
Publication venue: University of Colorado Boulder
Publication date: 01/01/2018
Field of study

In this dissertation we consider numerical methods for a problem in each of numerical linear algebra, digital signal processing, and image processing for super-resolution fluorescence microscopy. We consider first a fast, randomized mixing operation applied to the unpivoted Householder QR factorization. The method is an adaptation of a slower randomized operation that is known to provide a rank-revealing factorization with high probability. We perform a number of experiments to highlight possible uses of our method and give evidence that our algorithm likely also provides a rank-revealing factorization with high probability. In the next chapter we develop fast algorithms for computing the discrete, narrowband cross-ambiguity function (CAF) on a downsampled grid of delay values for the purpose of quickly detecting the location of peaks in the CAF surface. Due to the likelihood of missing a narrow peak on a downsampled grid of delay values, we propose methods to make our algorithms robust against missing peaks. To identify peak locations to high accuracy, we propose a two-step approach: first identify a coarse peak location using one of our delay-decimated CAF algorithms, then compute the CAF on a fine, but very small, grid around the peak to find its precise location. Runtime experiments with our C++ implementations show that our delay-decimated algorithms can give more than an order of magnitude improvement in overall runtime to detect peaks in the CAF surface when compared against standard CAF algorithms. In the final chapter we study non-negative least-squares (NNLS) problems arising from a new technique in super-resolution fluorescence microscopy. The image formation task involves solving many tens of thousands of NNLS problems, each using the same matrix, but different right-hand sides. We take advantage of this special structure by adapting an optimal first-order method to efficiently solve many NNLS problems simultaneously. Our NNLS problems are extremely ill-conditioned, so we also experiment with using a block-diagonal preconditioner and the alternating direction method of multipliers (ADMM) to improve convergence speed. We also develop a safe feature elimination strategy for general NNLS problems. It eliminates features only when they are guaranteed to have weight zero at an optimal point. Our strategy is inspired by recent works in the literature for ℓ₁-regularized least-squares, but a notable exception is that we develop our method to use an inexact, but feasible, primal-dual point pair. This allows us to use feature elimination reliably on the extremely ill-conditioned NNLS problems from our microscopy application. For an example image reconstruction, we use our feature elimination strategy to certify that the reconstructed super-resolved image is unique.</p

CU Scholar Institutional Repository

System- and Data-Driven Methods and Algorithms

Author
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 11/01/2022
Field of study

An increasing complexity of models used to predict real-world systems leads to the need for algorithms to replace complex models with far simpler ones, while preserving the accuracy of the predictions. This two-volume handbook covers methods as well as applications. This first volume focuses on real-time control theory, data assimilation, real-time visualization, high-dimensional state spaces and interaction of different reduction techniques

Directory of Open Access Books (DOAB)

Natural gradient algorithms for training deep neural networks

Author: Puiu Constantin Octavian
Publication venue
Publication date: 23/07/2024
Field of study

In this thesis, we reduce the computation cost of a particular class of learning algorithms for Deep Neural Networks (DNNs; human-brain inspired artificial intelligence models). Natural Gradient Descent (NGD) algorithms have favorable properties as compared to Stochastic Gradient methods. K-FAC (Martens and Grosse, 2015) is a feasible implementation of NGD for DNNs which retains some desirable properties of NGD. Still, K-FAC has cubic time-complexity in layer size, due to the computation of Kronecker factors (K-factors) inverses. This results in slow run-times for large K-factors. The thesis solves the time-complexity issue in two steps. First, we propose low-rank inverse representations of K-factors, obtainable in quadratic-time with randomized Numerical Linear Algebra (NLA). Our theory shows that at least for sufficiently large layers, we can also control the K-factors representation error, owing to the exponentially averaged (EA) construction paradigm. Numerical results show that theory is pessimistic: the error can be controlled better in practical settings. By improving the inverse-application subroutine to scale quadratically as well, we propose two quadratic time-complexity K-FAC variants: R-KFAC and SRE-KFAC. Secondly, by using Brand's algorithm for online NLA to exploit the EA construction paradigm of K-factors, we propose to maintain online low-rank inverse representations of K-factors, which only takes linear time. The approach is limited to sufficiently large layers, which in practice is typically satisfied by fully connected layers. By further improving the inverse-application subroutine to scale linearly as well in this case, we propose a linear time-complexity K-FAC variant: B-KFAC. B-KFAC also has linear memory-complexity in layer size, unlike quadratic for K-FAC. We also propose to algorithmically control the B-KFAC representation error, at the price of having quadratic time costs on some iterations (most iterations still take linear time). This yields three other algorithms. Our improved-complexity K-FAC variants speed up K-FAC 3 - 4 times for most investigated settings, in both single-GPU and distributed settings

Oxford University Research Archive