6 research outputs found
Tiled Algorithms for Matrix Computations on Multicore Architectures
The current computer architecture has moved towards the multi/many-core
structure. However, the algorithms in the current sequential dense numerical
linear algebra libraries (e.g. LAPACK) do not parallelize well on
multi/many-core architectures. A new family of algorithms, the tile algorithms,
has recently been introduced to circumvent this problem. Previous research has
shown that it is possible to write efficient and scalable tile algorithms for
performing a Cholesky factorization, a (pseudo) LU factorization, and a QR
factorization. The goal of this thesis is to study tiled algorithms in a
multi/many-core setting and to provide new algorithms which exploit the current
architecture to improve performance relative to current state-of-the-art
libraries while maintaining the stability and robustness of these libraries.Comment: PhD Thesis, 2012 http://math.ucdenver.ed
Fault Tolerant and Energy Efficient One-Sided Matrix Decompositions on Heterogeneous Systems with GPUs
Heterogeneous computing system with both CPUs and GPUs has become a class of widely used hardware architecture in supercomputers. As heterogeneous systems delivering higher computational performance, they are being built with an increasing number of complex components. This is anticipated that these systems will be more susceptible to hardware faults with higher power consumption. Numerical linear algebra libraries are used in a wide spectrum of high-performance scientific applications. Among numerical linear algebra operations, one-sided matrix decompositions can sometimes take a large portion of execution time or even dominate the whole scientific application execution. Due to the computational characteristic of one-sided matrix decompositions, they are very suitable for computation platforms such as heterogeneous systems with CPUs and GPUs. Many works have been done to implement and optimize one-sided matrix decompositions on heterogeneous systems with CPUs and GPUs. However, it is challenging to enable stable and high performance one-sided matrix decompositions running on computing platforms that are unreliable and high energy consumption. So, in this thesis, we aim to develop novel fault tolerance and energy efficiency optimizations for one-sided matrix decompositions on heterogeneous systems with CPUs and GPUs.To improve reliability and energy efficiency, extensive researches have been done on developing and optimizing fault tolerance methods and energy-saving strategies for one-sided matrix decompositions. However, current designs still have several limitations: (1) Little has been done on developing and optimizing fault tolerance method for one-sided matrix decompositions on heterogeneous systems with GPUs; (2) Limited by the protection coverage and strength, existing fault tolerance works provide insufficient protection when applied to one-sided matrix decompositions on heterogeneous systems with GPUs; (3) Lack the knowledge of algorithms, existing system level energy saving solutions cannot achieve the optimal energy savings due to potentially inaccurate and high-cost workload prediction they rely on when they are used in one-sided matrix decompositions; (4) It is challenging to apply both fault tolerance techniques and energy saving strategies to one-side matrix decompositions at the same time given that their current designs are not naturally compatible with each other.To address the first problem, based on the original (Algorithm Based Fault Tolerance) ABFT, we develop the first ABFT for matrix decomposition on heterogeneous systems with GPUs together with the novel storage errors protection and several optimization techniques specifically for GPUs. As for the second problem, we design a novel checksum scheme for ABFT that allows data stored in matrices to be encoded in two dimensions. This stronger checksum encoding mechanism enables much stronger protection including enhanced error propagation protection. In addition, we introduce a more efficient checking scheme. By prioritizing the checksum verification according to the sensitivity of matrix operations to soft errors with optimized checksum verification kernel for GPUs, we can achieve strong protect to matrix decompositions with comparable overhead. For the third problem, to improve energy efficiency for one-sided matrix decompositions, we introduce an algorithm-based energy-saving approach designed to maximize energy savings by utilizing algorithmic characteristics. Our approach can predict program execution behavior much more accurately, which is difficult for system level solutions for applications with variable execution characteristics. Experiments show that our approach can lead to much higher energy saving than existing works. Finally, for the fourth problem, we propose a novel energy saving approach for one-sided matrix decompositions on heterogeneous systems with GPUs. It allows energy saving strategies and fault tolerance techniques to be enabled at the same time without brings performance impact or extra energy cost
Recommended from our members
Fast and Reliable Methods in Numerical Linear Algebra, Signal Processing, and Image Processing
In this dissertation we consider numerical methods for a problem in each of numerical linear algebra, digital signal processing, and image processing for super-resolution fluorescence microscopy. We consider first a fast, randomized mixing operation applied to the unpivoted Householder QR factorization. The method is an adaptation of a slower randomized operation that is known to provide a rank-revealing factorization with high probability. We perform a number of experiments to highlight possible uses of our method and give evidence that our algorithm likely also provides a rank-revealing factorization with high probability.
In the next chapter we develop fast algorithms for computing the discrete, narrowband cross-ambiguity function (CAF) on a downsampled grid of delay values for the purpose of quickly detecting the location of peaks in the CAF surface. Due to the likelihood of missing a narrow peak on a downsampled grid of delay values, we propose methods to make our algorithms robust against missing peaks. To identify peak locations to high accuracy, we propose a two-step approach: first identify a coarse peak location using one of our delay-decimated CAF algorithms, then compute the CAF on a fine, but very small, grid around the peak to find its precise location. Runtime experiments with our C++ implementations show that our delay-decimated algorithms can give more than an order of magnitude improvement in overall runtime to detect peaks in the CAF surface when compared against standard CAF algorithms.
In the final chapter we study non-negative least-squares (NNLS) problems arising from a new technique in super-resolution fluorescence microscopy. The image formation task involves solving many tens of thousands of NNLS problems, each using the same matrix, but different right-hand sides. We take advantage of this special structure by adapting an optimal first-order method to efficiently solve many NNLS problems simultaneously. Our NNLS problems are extremely ill-conditioned, so we also experiment with using a block-diagonal preconditioner and the alternating direction method of multipliers (ADMM) to improve convergence speed. We also develop a safe feature elimination strategy for general NNLS problems. It eliminates features only when they are guaranteed to have weight zero at an optimal point. Our strategy is inspired by recent works in the literature for ℓ₁-regularized least-squares, but a notable exception is that we develop our method to use an inexact, but feasible, primal-dual point pair. This allows us to use feature elimination reliably on the extremely ill-conditioned NNLS problems from our microscopy application. For an example image reconstruction, we use our feature elimination strategy to certify that the reconstructed super-resolved image is unique.</p
System- and Data-Driven Methods and Algorithms
An increasing complexity of models used to predict real-world systems leads to the need for algorithms to replace complex models with far simpler ones, while preserving the accuracy of the predictions. This two-volume handbook covers methods as well as applications. This first volume focuses on real-time control theory, data assimilation, real-time visualization, high-dimensional state spaces and interaction of different reduction techniques
Natural gradient algorithms for training deep neural networks
In this thesis, we reduce the computation cost of a particular class of learning algorithms for Deep Neural Networks (DNNs; human-brain inspired artificial intelligence models). Natural Gradient Descent (NGD) algorithms have favorable properties as compared to Stochastic Gradient methods. K-FAC (Martens and Grosse, 2015) is a feasible implementation of NGD for DNNs which retains some desirable properties of NGD. Still, K-FAC has cubic time-complexity in layer size, due to the computation of Kronecker factors (K-factors) inverses. This results in slow run-times for large K-factors. The thesis solves the time-complexity issue in two steps.
First, we propose low-rank inverse representations of K-factors, obtainable in quadratic-time with randomized Numerical Linear Algebra (NLA). Our theory shows that at least for sufficiently large layers, we can also control the K-factors representation error, owing to the exponentially averaged (EA) construction paradigm. Numerical results show that theory is pessimistic: the error can be controlled better in practical settings. By improving the inverse-application subroutine to scale quadratically as well, we propose two quadratic time-complexity K-FAC variants: R-KFAC and SRE-KFAC.
Secondly, by using Brand's algorithm for online NLA to exploit the EA construction paradigm of K-factors, we propose to maintain online low-rank inverse representations of K-factors, which only takes linear time. The approach is limited to sufficiently large layers, which in practice is typically satisfied by fully connected layers. By further improving the inverse-application subroutine to scale linearly as well in this case, we propose a linear time-complexity K-FAC variant: B-KFAC. B-KFAC also has linear memory-complexity in layer size, unlike quadratic for K-FAC. We also propose to algorithmically control the B-KFAC representation error, at the price of having quadratic time costs on some iterations (most iterations still take linear time). This yields three other algorithms.
Our improved-complexity K-FAC variants speed up K-FAC 3 - 4 times for most investigated settings, in both single-GPU and distributed settings