Search CORE

1,432 research outputs found

CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra

Author: Finzi Marc
Pleiss Geoff
Potapczynski Andres
Wilson Andrew Gordon
Publication venue
Publication date: 06/09/2023
Field of study

Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.Comment: Code available at https://github.com/wilson-labs/col

arXiv.org e-Print Archive

A New Computational Framework for Efficient Parallelization and Optimization of Large Scale Graph Matching

Author: Marefat Navaz Sahar
Publication venue: LSU Digital Commons
Publication date: 01/01/2017
Field of study

There are so many applications in data fusion, comparison, and recognition that require a robust and efficient algorithm to match features of multiple images. To improve accuracy and get a more stable result is important to take into consideration both local appearance and the pairwise relationship of features. Graphs are a powerful and flexible data structure, allowing for the description of complex relationships between data elements, whose nodes correspond to salient features and edges correspond to relational aspects between features. Therefore, the problem of graph matching is to find a mapping between the two sets of nodes that preserves the relationships between them as much as possible. This graph-matching problem is mathematically formulated as an IQP problem which solving it is NP-hard, and obtaining exact Optima only plausible for very small data. Therefore, handling large-scale scientific visual data is quite limited, necessitating both efficient serial algorithms, as well as scalable parallel formulations. In this thesis, we first focused on exploring techniques to reduce the computation cost as well as memory usage of Pairwise graph matching by adopting a heuristic pruning strategy together with a redundancy pattern suppression scheme. We also modified the structure of the affinity matrix for minimizing memory requirement and parallelizing our algorithm by employing CPU’s and GPU’s accelerated libraries. Any pair of features with similar distance from first image results in same sub-matrices, therefore instead of constructing the whole affinity matrix, we only built the sub-blocked affinity for those distinct feature distances. By employing this scheme not only saved large memory and reduced computation time tremendously but also, the matrix-vector multiplication of gradient computation performed in parallel, where each block-vector calculation computed independently without synchronization. The accelerated libraries such as MKL, cuSparse, cuBlas and thrust applied to solving the GM problem, following the scheme of the spectral matching algorithm. We also extended our work for Multi-graph imaging, since many tasks require finding correspondences across multiple images. Also, considering more graph improves the matching accuracy. Most algorithms obtain approximate solutions for solving the GM NP-hard problem, result in a weak optimal solution. Therefore, we proposed a new solver, which iteratively modified the affinity matrix and binarized the solution by optimizing the original problem with its integer constraints

Louisiana State University

Novel Monte Carlo Methods for Large-Scale Linear Algebra Operations

Author: Ji Hao
Publication venue: ODU Digital Commons
Publication date: 01/07/2016
Field of study

Linear algebra operations play an important role in scientific computing and data analysis. With increasing data volume and complexity in the Big Data era, linear algebra operations are important tools to process massive datasets. On one hand, the advent of modern high-performance computing architectures with increasing computing power has greatly enhanced our capability to deal with a large volume of data. One the other hand, many classical, deterministic numerical linear algebra algorithms have difficulty to scale to handle large data sets. Monte Carlo methods, which are based on statistical sampling, exhibit many attractive properties in dealing with large volume of datasets, including fast approximated results, memory efficiency, reduced data accesses, natural parallelism, and inherent fault tolerance. In this dissertation, we present new Monte Carlo methods to accommodate a set of fundamental and ubiquitous large-scale linear algebra operations, including solving large-scale linear systems, constructing low-rank matrix approximation, and approximating the extreme eigenvalues/ eigenvectors, across modern distributed and parallel computing architectures. First of all, we revisit the classical Ulam-von Neumann Monte Carlo algorithm and derive the necessary and sufficient condition for its convergence. To support a broad family of linear systems, we develop Krylov subspace Monte Carlo solvers that go beyond the use of Neumann series. New algorithms used in the Krylov subspace Monte Carlo solvers include (1) a Breakdown-Free Block Conjugate Gradient algorithm to address the potential rank deficiency problem occurred in block Krylov subspace methods; (2) a Block Conjugate Gradient for Least Squares algorithm to stably approximate the least squares solutions of general linear systems; (3) a BCGLS algorithm with deflation to gain convergence acceleration; and (4) a Monte Carlo Generalized Minimal Residual algorithm based on sampling matrix-vector products to provide fast approximation of solutions. Secondly, we design a rank-revealing randomized Singular Value Decomposition (R3SVD) algorithm for adaptively constructing low-rank matrix approximations to satisfy application-specific accuracy. Thirdly, we study the block power method on Markov Chain Monte Carlo transition matrices and find that the convergence is actually depending on the number of independent vectors in the block. Correspondingly, we develop a sliding window power method to find stationary distribution, which has demonstrated success in modeling stochastic luminal Calcium release site. Fourthly, we take advantage of hybrid CPU-GPU computing platforms to accelerate the performance of the Breakdown-Free Block Conjugate Gradient algorithm and the randomized Singular Value Decomposition algorithm. Finally, we design a Gaussian variant of Freivalds’ algorithm to efficiently verify the correctness of matrix-matrix multiplication while avoiding undetectable fault patterns encountered in deterministic algorithms

Old Dominion University

X-ray CT Image Reconstruction on Highly-Parallel Architectures.

Author: McGaffin Madison G.
Publication venue
Publication date: 01/01/2015
Field of study

Model-based image reconstruction (MBIR) methods for X-ray CT use accurate models of the CT acquisition process, the statistics of the noisy measurements, and noise-reducing regularization to produce potentially higher quality images than conventional methods even at reduced X-ray doses. They do this by minimizing a statistically motivated high-dimensional cost function; the high computational cost of numerically minimizing this function has prevented MBIR methods from reaching ubiquity in the clinic. Modern highly-parallel hardware like graphics processing units (GPUs) may offer the computational resources to solve these reconstruction problems quickly, but simply "translating" existing algorithms designed for conventional processors to the GPU may not fully exploit the hardware's capabilities. This thesis proposes GPU-specialized image denoising and image reconstruction algorithms. The proposed image denoising algorithm uses group coordinate descent with carefully structured groups. The algorithm converges very rapidly: in one experiment, it denoises a 65 megapixel image in about 1.5 seconds, while the popular Chambolle-Pock primal-dual algorithm running on the same hardware takes over a minute to reach the same level of accuracy. For X-ray CT reconstruction, this thesis uses duality and group coordinate ascent to propose an alternative to the popular ordered subsets (OS) method. Similar to OS, the proposed method can use a subset of the data to update the image. Unlike OS, the proposed method is convergent. In one helical CT reconstruction experiment, an implementation of the proposed algorithm using one GPU converges more quickly than a state-of-the-art algorithm converges using four GPUs. Using four GPUs, the proposed algorithm reaches near convergence of a wide-cone axial reconstruction problem with over 220 million voxels in only 11 minutes.PhDElectrical Engineering: SystemsUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113551/1/mcgaffin_1.pd

CiteSeerX

Deep Blue Documents at the University of Michigan

GPU Accelerated Approach to Numerical Linear Algebra and Matrix Analysis with CFD Applications

Author: Phillips Adam
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/05/2014
Field of study

A GPU accelerated approach to numerical linear algebra and matrix analysis with CFD applications is presented. The works objectives are to (1) develop stable and efficient algorithms utilizing multiple NVIDIA GPUs with CUDA to accelerate common matrix computations, (2) optimize these algorithms through CPU/GPU memory allocation, GPU kernel development, CPU/GPU communication, data transfer and bandwidth control to (3) develop parallel CFD applications for Navier Stokes and Lattice Boltzmann analysis methods. Special consideration will be given to performing the linear algebra algorithms under certain matrix types (banded, dense, diagonal, sparse, symmetric and triangular). Benchmarks are performed for all analyses with baseline CPU times being determined to find speed-up factors and measure computational capability of the GPU accelerated algorithms. The GPU implemented algorithms used in this work along with the optimization techniques performed are measured against preexisting work and test matrices available in the NIST Matrix Market. CFD analysis looked to strengthen the assessment of this work by providing a direct engineering application to analysis that would benefit from matrix optimization techniques and accelerated algorithms. Overall, this work desired to develop optimization for selected linear algebra and matrix computations performed with modern GPU architectures and CUDA developer which were applied directly to mathematical and engineering applications through CFD analysis

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Attention and Anticipation in Fast Visual-Inertial Navigation

Author: Carlone Luca
Karaman Sertac
Publication venue
Publication date: 22/03/2018
Field of study

We study a Visual-Inertial Navigation (VIN) problem in which a robot needs to estimate its state using an on-board camera and an inertial sensor, without any prior knowledge of the external environment. We consider the case in which the robot can allocate limited resources to VIN, due to tight computational constraints. Therefore, we answer the following question: under limited resources, what are the most relevant visual cues to maximize the performance of visual-inertial navigation? Our approach has four key ingredients. First, it is task-driven, in that the selection of the visual cues is guided by a metric quantifying the VIN performance. Second, it exploits the notion of anticipation, since it uses a simplified model for forward-simulation of robot dynamics, predicting the utility of a set of visual cues over a future time horizon. Third, it is efficient and easy to implement, since it leads to a greedy algorithm for the selection of the most relevant visual cues. Fourth, it provides formal performance guarantees: we leverage submodularity to prove that the greedy selection cannot be far from the optimal (combinatorial) selection. Simulations and real experiments on agile drones show that our approach ensures state-of-the-art VIN performance while maintaining a lean processing time. In the easy scenarios, our approach outperforms appearance-based feature selection in terms of localization errors. In the most challenging scenarios, it enables accurate visual-inertial navigation while appearance-based feature selection fails to track robot's motion during aggressive maneuvers.Comment: 20 pages, 7 figures, 2 table

arXiv.org e-Print Archive

DSpace@MIT

Crossref

ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems

Author: Di Napoli Edoardo
Springer Paul
Winkelmann Jan
Publication venue
Publication date: 25/05/2018
Field of study

Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present ChASE, a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree which further reduces the necessary FLOPs. ChASE is written in C++ using the modern software engineering concepts which favor a simple integration in application codes and a straightforward portability over heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and it is easily extensible to other parallel computing architectures.Comment: 33 pages. Submitted to ACM TOM

arXiv.org e-Print Archive

Publikationsserver der RWTH Aachen University

Juelich Shared Electronic Resources

Single and Dual-GPU Generalized Sparse Eigenvalue Solvers for Finding a Few Low-Order Resonances of a Microwave Cavity Using the Finite-Element Method

Author: Dziekonski Adam
Mrozowski Michal
Publication venue: 'Brno University of Technology'
Publication date: 01/12/2018
Field of study

This paper presents two fast generalized eigenvalue solvers for sparse symmetric matrices that arise when electromagnetic cavity resonances are investigated using the higher-order finite element method (FEM). To find a few low-order resonances, the locally optimal block conjugate gradient (LOBPCG) algorithm with null-space deflation is applied. The computations are expedited by using one or two graphical processing units (GPUs) as accelerators. The performance of the solver is tested for single and dual GPU hardware setups, making use of two types of GPU: NVIDIA Kepler K40s and NVIDIA Pascal P100s. The speed of the GPU-accelerated solvers is compared to a multithreaded implementation of the same algorithm using a multicore central processing unit (CPU, Intel Xeon E5-2680 v3 with twelve cores). It was found that, even for the least efficient setups, the GPU-accelerated code is approximately twice as fast as a parallel CPU-only implementation

Directory of Open Access Journals

Digital library of Brno University of Technology