779 research outputs found
On Characterizing the Data Access Complexity of Programs
Technology trends will cause data movement to account for the majority of
energy expenditure and execution time on emerging computers. Therefore,
computational complexity will no longer be a sufficient metric for comparing
algorithms, and a fundamental characterization of data access complexity will
be increasingly important. The problem of developing lower bounds for data
access complexity has been modeled using the formalism of Hong & Kung's
red/blue pebble game for computational directed acyclic graphs (CDAGs).
However, previously developed approaches to lower-bound analysis for the
red/blue pebble game are of limited effectiveness when applied to the CDAGs of
real programs, whose computations comprise multiple sub-computations with
differing DAG structures. We address this problem by developing an approach for
effectively composing lower bounds based on graph decomposition. We also
develop a static analysis algorithm to derive the asymptotic data-access lower
bounds of programs, as a function of the problem size and cache size.
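The red/blue pebble model can be made concrete with a small simulator: red pebbles are slots in a fixed-size fast memory, every operand of a node must hold a red pebble before the node can be computed, and each reload of an evicted value costs one unit of data movement. The sketch below is purely illustrative — the DAG, the schedule, and the LRU replacement choice are our assumptions (the paper derives lower bounds that hold over all schedules and replacement decisions), and stores back to slow memory are not counted:

```python
from collections import OrderedDict

def count_io(dag, schedule, cache_size):
    """Count loads from slow memory while evaluating a CDAG.

    dag: node -> list of predecessors (graph inputs have []).
    schedule: evaluation order of the computed nodes.
    cache_size: number of values fast memory holds ("red pebbles").
    Eviction is LRU; stores back to slow memory are not counted.
    """
    cache = OrderedDict()
    io = 0

    def place(node):
        cache[node] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)     # drop least recently used pebble

    def load(node):
        nonlocal io
        if node in cache:
            cache.move_to_end(node)       # already has a red pebble
        else:
            io += 1                       # one transfer from slow memory
            place(node)

    for node in schedule:
        for pred in dag[node]:
            load(pred)                    # operands need red pebbles
        place(node)                       # result is produced in fast memory

    return io

# Tiny CDAG: a binary reduction tree over four inputs.
dag = {"x0": [], "x1": [], "x2": [], "x3": [],
       "s0": ["x0", "x1"], "s1": ["x2", "x3"], "r": ["s0", "s1"]}
io3 = count_io(dag, ["s0", "s1", "r"], cache_size=3)  # s0 evicted, then reloaded
io4 = count_io(dag, ["s0", "s1", "r"], cache_size=4)  # everything fits
```

Shrinking the fast memory from four slots to three forces the intermediate value s0 to be reloaded for the final addition — exactly the kind of cache-size-dependent traffic that data-access lower bounds characterize.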
Interactive Medical Image Registration With Multigrid Methods and Bounded Biharmonic Functions
Interactive image registration is important in some medical applications, since automatic image registration is often slow and sometimes error-prone. We consider interactive registration methods that incorporate user-specified local transforms around control handles. The deformation between handles is interpolated by smooth functions that minimize variational energies. Besides smoothness, we expect the impact of a control handle to be local. We therefore choose bounded biharmonic weight functions, a cutting-edge technique in computer graphics, to blend the local transforms. However, medical images are usually huge, and this technique is too slow to be practical for interactive image registration.
To expedite this process, we use a multigrid active set method to solve for bounded biharmonic functions (BBF). The multigrid approach serves two purposes: refining the active set from coarse to fine resolutions, and solving the linear systems constrained by the working active sets. We've implemented both the weighted Jacobi method and successive over-relaxation (SOR) in the multigrid solver. Since the problem has box constraints, we cannot directly use the regular updates of the Jacobi and SOR methods. Instead, we choose a descent step size and clamp each update to satisfy the box constraints. We explore ways to choose step sizes and discuss their relation to the spectral radii of the iteration matrices. The relaxation factors, which are closely related to the step sizes, are estimated by analyzing the eigenvalues of the bilaplacian matrices. We prove that our algorithm terminates and provide some theoretical error bounds.
Another, minor problem we address is registering large images on a GPU with limited memory. We've implemented an image registration algorithm with virtual image slices on the GPU: an image slice is treated similarly to a page in virtual memory, and we execute a wavefront of subtasks together to reduce the number of data transfers.
Our main contribution is a fast multigrid method for interactive medical image registration that uses bounded biharmonic functions to blend local transforms. We report a novel multigrid approach that refines the active set quickly and uses clamped updates based on weighted Jacobi and SOR. This multigrid method can also be used to efficiently solve other quadratic programs whose active sets are distributed over continuous regions.
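The clamped-update idea can be sketched for a generic box-constrained quadratic program. Everything below — the 1-D Laplacian standing in for the bilaplacian, the relaxation factor, and the bounds — is an illustrative assumption, not the thesis's actual solver:

```python
import numpy as np

def projected_weighted_jacobi(A, b, lo, hi, omega=0.6, iters=500):
    """Weighted Jacobi with clamped updates for the box-constrained QP
    min 1/2 x^T A x - b^T x  subject to  lo <= x <= hi.

    The relaxation factor omega plays the role of the descent step size;
    after each Jacobi sweep the iterate is clamped back onto the box.
    """
    x = np.clip(np.zeros_like(b), lo, hi)
    d = np.diag(A)                    # Jacobi uses only the diagonal of A
    for _ in range(iters):
        x = np.clip(x + omega * (b - A @ x) / d, lo, hi)
    return x

# Illustrative stand-in: a 1-D Laplacian instead of the bilaplacian,
# with the upper bound active in the interior of the domain.
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = projected_weighted_jacobi(A, b, lo=0.0, hi=2.0)
```

For this instance the unconstrained solution exceeds the upper bound everywhere in the interior, so the iterates lock onto the bound there (the active set), while the endpoints remain free and converge to their KKT values.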
I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels
In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-k update (SYRK), with the classical three-nested-loop algorithms for these kernels. In addition, we consider a machine model with a fast memory of size M and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and the goal is to minimize the amount of communication between slow and fast memories. As the set of computations is fixed by the choice of the algorithm, only the ordering of the computations (the schedule) directly influences the volume of communication. We prove lower bounds on the communication volume of the Cholesky factorization of an n × n symmetric positive definite matrix, and of the SYRK computation A·Aᵀ, where A is an n × k matrix. Both bounds improve the best known lower bounds from the literature by a constant factor. In addition, we present two out-of-core, sequential algorithms with matching communication volume: TBS for SYRK and LBC for Cholesky. Both algorithms improve over the best known algorithms from the literature by a constant factor, and prove that the leading terms in our lower bounds cannot be improved further. This work shows that the operational intensity of symmetric kernels like SYRK or Cholesky is intrinsically higher (by a constant factor) than that of the corresponding non-symmetric kernels (GEMM and LU factorization).
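The extra reuse available in symmetric kernels comes from the fact that only the lower (or upper) triangle of the output is ever computed. A minimal blocked SYRK sketch — with an illustrative fixed block size, rather than one derived from the fast-memory size M as a true out-of-core algorithm would use:

```python
import numpy as np

def tiled_syrk_lower(A, block):
    """Blocked SYRK, C = A @ A.T, lower triangle only.

    Only tiles C[i, j] with i >= j are computed or written; this halved
    output footprint is the source of the extra data reuse over a
    non-symmetric GEMM. In an out-of-core setting, one C tile would stay
    in fast memory while tiles of A stream in.
    """
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, block):
        for j in range(0, i + 1, block):          # lower-triangular tiles only
            for k in range(0, A.shape[1], block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ A[j:j+block, k:k+block].T)
    return C

A = np.arange(12.0).reshape(4, 3)
C = tiled_syrk_lower(A, block=2)
```

The result agrees with the reference product on the lower triangle; the strictly upper tiles are simply never touched.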
Computer Aided Verification
This open access two-volume set LNCS 13371 and 13372 constitutes the refereed proceedings of the 34th International Conference on Computer Aided Verification, CAV 2022, which was held in Haifa, Israel, in August 2022. The 40 full papers presented together with 9 tool papers and 2 case studies were carefully reviewed and selected from 209 submissions. The papers were organized in the following topical sections: Part I: invited papers; formal methods for probabilistic programs; formal methods for neural networks; software verification and model checking; hyperproperties and security; formal methods for hardware, cyber-physical, and hybrid systems. Part II: probabilistic techniques; automata and logic; deductive verification and decision procedures; machine learning; synthesis and concurrency. This is an open access book.
On Algorithmic Cache Optimization
We study matrix-matrix multiplication of two matrices, A and B, each of
size n × n. This operation results in a matrix C of size n × n.
Our goal is to produce C as efficiently as possible given a cache: a limited,
one-dimensional set of data values that we can work with to perform elementary
operations (additions, multiplications, etc.). That is, we attempt to reuse the
maximum amount of data from A, B, and C during our computation (or,
equivalently, to utilize data in the fast-access cache as often as possible).
Firstly, we introduce the matrix-matrix multiplication algorithm. Secondly, we
present a standard two-memory model to simulate the architecture of a computer,
and we explain the LRU (Least Recently Used) cache policy, which is standard in
most computers. Thirdly, we introduce a basic Cache Simulator, whose high time
complexity limits us to small values of n. Then we discuss and model the LFU
(Least Frequently Used) cache policy and the explicit-control cache policy.
Finally, we introduce the main result of this paper, an improved Cache
Simulator, and use it to compare, experimentally, the savings in time, energy,
and communication obtained by the ideal cache-efficient algorithm for
matrix-matrix multiplication. The Cache Simulator simulates the amount of data
movement that occurs between the main memory and the cache of the computer.
One of the findings of this project is that, in some cases, there is a
significant discrepancy in communication volume between an LRU cache algorithm
and explicit cache control. We propose to alleviate this problem by
``tricking'' the LRU cache algorithm: we update the timestamp of the data we
want to keep in cache. This gives us the benefits of an explicit cache policy
while remaining within the LRU paradigm (the realistic policy on a CPU).
Comment: 20 pages, 3 figures, 2 tables
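The timestamp "trick" the abstract describes can be illustrated with a minimal LRU simulator; the access pattern and cache capacity below are made up purely for the demonstration:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache simulator: tracks which addresses are resident
    and counts misses (i.e., data movement between main memory and cache)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)       # refresh the LRU timestamp
        else:
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False) # evict least recently used

# The "trick": re-accessing the data we want to keep (here addresses 0
# and 1) refreshes their timestamps, so the streaming scan of 2..9 can
# no longer evict them before they are reused.
naive, tricked = LRUCache(4), LRUCache(4)
for _ in range(2):
    for a in range(2, 10):                     # streaming accesses
        naive.access(a)
        tricked.access(a)
        tricked.access(0); tricked.access(1)   # keep the hot set warm
    naive.access(0); naive.access(1)           # hot set reused once per round
    tricked.access(0); tricked.access(1)
```

In this run the naive cache re-loads addresses 0 and 1 every round because the streaming scan evicts them, while the tricked cache keeps them resident, so it incurs fewer misses overall.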
Polyhedral+Dataflow Graphs
This research presents an intermediate compiler representation that is designed for optimization and that emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both regular and irregular scientific domains.
The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.
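A toy rendering of the dataflow-graph idea — assuming nothing about the paper's actual IR or specification language: statement nodes carry polyhedral-style iteration domains, edges record producer/consumer data mappings, and any topological order of the graph is a legal execution schedule:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical statement nodes with iteration domains and the data they
# write; the names and structure here are illustrative only.
stmts = {
    "init":    {"domain": "0 <= i < N",   "writes": "t[i]"},
    "stencil": {"domain": "1 <= i < N-1", "writes": "u[i]"},
    "reduce":  {"domain": "0 <= i < N",   "writes": "s"},
}
# Dataflow edges: consumer -> set of producers it reads from.
edges = {"stencil": {"init"}, "reduce": {"stencil"}}

# Any topological order of the dataflow graph is a legal schedule.
order = list(TopologicalSorter(edges).static_order())
```

A real polyhedral IR would additionally transform the iteration domains themselves (tiling, fusion, etc.); this sketch only captures the graph-level scheduling constraint.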