
    On Characterizing the Data Access Complexity of Programs

    Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of developing lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower-bound analysis for the red/blue pebble game are very limited in effectiveness when applied to the CDAGs of real programs, whose computations comprise multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive asymptotic data-access lower bounds of programs as a function of the problem size and cache size.
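
    As an illustration of the kind of bound such an analysis produces (the classical Hong & Kung result for matrix multiplication, quoted here only as a reference point, not as part of the work described above): multiplying two $n \times n$ matrices with a fast memory of size $S$ requires $Q(n, S) = \Omega\!\left(n^{3}/\sqrt{S}\right)$ words moved between slow and fast memory, and the static analysis described above aims to derive bounds of this asymptotic form automatically.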

    Interactive Medical Image Registration With Multigrid Methods and Bounded Biharmonic Functions

    Interactive image registration is important in some medical applications, since automatic image registration is often slow and sometimes error-prone. We consider interactive registration methods that incorporate user-specified local transforms around control handles. The deformation between handles is interpolated by smooth functions that minimize variational energies. Besides smoothness, we expect the impact of a control handle to be local. Therefore we choose bounded biharmonic weight functions, a cutting-edge technique in computer graphics, to blend the local transforms. However, medical images are usually huge, and this technique is too slow to be practical for interactive image registration. To expedite the process, we use a multigrid active set method to solve for the bounded biharmonic functions (BBF). The multigrid approach serves two purposes: refining the active set from coarse to fine resolutions, and solving the linear systems constrained by the working active sets. We have implemented both the weighted Jacobi method and successive over-relaxation (SOR) in the multigrid solver. Since the problem has box constraints, we cannot directly use the regular updates of the Jacobi and SOR methods. Instead, we choose a descent step size and clamp the update to satisfy the box constraints. We explore ways to choose step sizes and discuss their relation to the spectral radii of the iteration matrices. The relaxation factors, which are closely related to the step sizes, are estimated by analyzing the eigenvalues of the bilaplacian matrices. We give a proof of the termination of our algorithm and provide some theoretical error bounds. Another, minor, problem we address is registering large images on a GPU with limited memory. We have implemented an image registration algorithm with virtual image slices on the GPU: an image slice is treated similarly to a page in virtual memory, and we execute a wavefront of subtasks together to reduce the number of data transfers. Our main contribution is a fast multigrid method for interactive medical image registration that uses bounded biharmonic functions to blend local transforms. We report a novel multigrid approach to refine the active set quickly and use clamped updates based on weighted Jacobi and SOR. This multigrid method can also be used to efficiently solve other quadratic programs whose active sets are distributed over continuous regions.
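
    The clamped update described above can be illustrated with a minimal projected weighted-Jacobi sketch (a generic box-constrained solver, not the paper's multigrid active-set method; the matrix, bounds, relaxation factor, and names below are placeholder choices):

        import numpy as np

        def clamped_weighted_jacobi(A, b, x0, lo=0.0, hi=1.0, omega=0.6, iters=200):
            """Projected (clamped) weighted Jacobi iteration for the box-constrained
            quadratic program  min 0.5*x^T A x - b^T x  subject to  lo <= x <= hi.
            For bounded biharmonic weights, A would be a bilaplacian-type SPD matrix
            and the box would be [0, 1]."""
            x = x0.copy()
            d = np.diag(A)                               # Jacobi preconditioner: diagonal of A
            for _ in range(iters):
                r = b - A @ x                            # residual of the unconstrained system
                x = np.clip(x + omega * r / d, lo, hi)   # damped step, clamped to the box
            return x

        # Tiny usage example on a diagonally dominant SPD system.
        rng = np.random.default_rng(0)
        M = rng.standard_normal((50, 50))
        A = M @ M.T + 500 * np.eye(50)
        w = clamped_weighted_jacobi(A, rng.standard_normal(50), x0=np.full(50, 0.5))
        print(w.min(), w.max())                          # iterates stay within [0, 1]

    The same clamping idea carries over to SOR by sweeping through the unknowns one at a time and projecting each individual update onto its bounds.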

    I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels

    In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three-nested-loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and the goal is to minimize the amount of communication between slow and fast memories. As the set of computations is fixed by the choice of the algorithm, only the ordering of the computations (the schedule) directly influences the volume of communication. We prove lower bounds of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}}$ for the communication volume of the Cholesky factorization of an $N\times N$ symmetric positive definite matrix, and of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}}$ for the SYRK computation of $\mathbf{A}\cdot\mathbf{A}^{T}$, where $\mathbf{A}$ is an $N\times M$ matrix. Both bounds improve the best known lower bounds from the literature by a factor $\sqrt{2}$. In addition, we present two out-of-core, sequential algorithms with matching communication volume: TBS for SYRK, with a volume of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}} + \mathcal{O}(NM\log N)$, and LBC for Cholesky, with a volume of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}} + \mathcal{O}(N^{5/2})$. Both algorithms improve over the best known algorithms from the literature by a factor $\sqrt{2}$, and prove that the leading terms in our lower bounds cannot be improved further. This work shows that the operational intensity of symmetric kernels like SYRK or Cholesky is intrinsically higher (by a factor $\sqrt{2}$) than that of the corresponding non-symmetric kernels (GEMM and LU factorization).
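
    The role of symmetry in these kernels can be seen in a generic blocked SYRK sketch (not the paper's TBS algorithm; the tile size and names are placeholders): only tiles on or below the diagonal of $C$ are computed, reflecting the symmetry that gives these kernels their higher operational intensity.

        import numpy as np

        def blocked_syrk_lower(A, tile):
            """Blocked computation of the lower triangle of C = A @ A.T.
            The tile size would be chosen so that a few tiles fit in a fast
            memory of size S (roughly tile ~ sqrt(S))."""
            N, M = A.shape
            C = np.zeros((N, N))
            for i in range(0, N, tile):            # row block of C
                for j in range(0, i + 1, tile):    # column block, lower triangle only
                    for k in range(0, M, tile):    # reduction over the inner dimension
                        C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ A[j:j+tile, k:k+tile].T
            return np.tril(C)

        # Usage: agrees with the reference result on the lower triangle.
        A = np.random.default_rng(1).standard_normal((8, 12))
        assert np.allclose(blocked_syrk_lower(A, tile=4), np.tril(A @ A.T))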

    Computer Aided Verification

    This open access two-volume set, LNCS 13371 and 13372, constitutes the refereed proceedings of the 34th International Conference on Computer Aided Verification, CAV 2022, which was held in Haifa, Israel, in August 2022. The 40 full papers presented together with 9 tool papers and 2 case studies were carefully reviewed and selected from 209 submissions. The papers were organized in the following topical sections: Part I: invited papers; formal methods for probabilistic programs; formal methods for neural networks; software verification and model checking; hyperproperties and security; formal methods for hardware, cyber-physical, and hybrid systems. Part II: probabilistic techniques; automata and logic; deductive verification and decision procedures; machine learning; synthesis and concurrency. This is an open access book.

    On Algorithmic Cache Optimization

    We study matrix-matrix multiplication of two matrices, $A$ and $B$, each of size $n \times n$. This operation results in a matrix $C$ of size $n \times n$. Our goal is to produce $C$ as efficiently as possible given a cache: a 1-D limited set of data values that we can work with to perform elementary operations (additions, multiplications, etc.). That is, we attempt to reuse the maximum amount of data from $A$, $B$ and $C$ during our computation (or equivalently, to utilize data in the fast-access cache as often as possible). Firstly, we introduce the matrix-matrix multiplication algorithm. Secondly, we present a standard two-memory model to simulate the architecture of a computer, and we explain the LRU (Least Recently Used) cache policy (which is standard in most computers). Thirdly, we introduce a basic Cache Simulator, which has an $\mathcal{O}(M)$ time complexity (meaning we are limited to small $M$ values). Then we discuss and model the LFU (Least Frequently Used) cache policy and the explicit-control cache policy. Finally, we introduce the main result of this paper, the $\mathcal{O}(1)$ Cache Simulator, and use it to compare, experimentally, the savings in time, energy, and communication achieved by the ideal cache-efficient algorithm for matrix-matrix multiplication. The Cache Simulator simulates the amount of data movement that occurs between the main memory and the cache of the computer. One of the findings of this project is that, in some cases, there is a significant discrepancy in communication values between an LRU cache algorithm and explicit cache control. We propose to alleviate this problem by "tricking" the LRU cache algorithm: updating the timestamp of the data we want to keep in cache (namely entries of matrix $C$). This enables us to have the benefits of an explicit cache policy while being constrained by the LRU paradigm (a realistic policy on a CPU). Comment: 20 pages, 3 figures, 2 tables
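
    A minimal sketch of the per-access LRU bookkeeping described above (an illustrative toy, not the paper's $\mathcal{O}(1)$ simulator; the class and function names are ours, and one cache line per matrix element is a simplifying assumption):

        from collections import OrderedDict

        class LRUCacheSim:
            """Toy LRU cache simulator that counts misses over a stream of addresses."""

            def __init__(self, capacity):
                self.capacity = capacity
                self.lines = OrderedDict()   # key = address, ordered by recency of use
                self.misses = 0

            def access(self, addr):
                if addr in self.lines:
                    self.lines.move_to_end(addr)        # hit: refresh the timestamp
                else:
                    self.misses += 1                    # miss: fetch from main memory
                    self.lines[addr] = True
                    if len(self.lines) > self.capacity:
                        self.lines.popitem(last=False)  # evict the least recently used line

        def naive_matmul_trace(n, cache):
            """Access trace of the textbook i-j-k matrix multiply."""
            for i in range(n):
                for j in range(n):
                    for k in range(n):
                        cache.access(('A', i, k))
                        cache.access(('B', k, j))
                        cache.access(('C', i, j))   # the accumulated entry of C is re-touched

        cache = LRUCacheSim(capacity=64)
        naive_matmul_trace(16, cache)
        print("misses:", cache.misses, "of", 3 * 16 ** 3, "accesses")

    Re-touching C[i][j] in the inner loop keeps it resident under LRU, in the spirit of the timestamp-refreshing trick mentioned above, without any explicit cache control.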

    Polyhedral+Dataflow Graphs

    This research presents an intermediate compiler representation that is designed for optimization and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.