58 research outputs found

    Automatic derivation and implementation of fast convolution algorithms

    Get PDF
    This thesis surveys algorithms for computing linear and cyclic convolution. Algorithms are presented in a uniform mathematical notation that allows automatic derivation, optimization, and implementation. Using the tensor product and Chinese Remainder Theorem (CRT), a space of algorithms is defined and the task of finding the best algorithm is turned into an optimization problem over this space of algorithms. This formulation led to the discovery of new algorithms with reduced operation count. Symbolic tools are presented for deriving and implementing algorithms, and performance analyses (using both operation count and run-time as metrics) are carried out. These analyses show the existence of a window where CRT-based algorithms outperform other methods of computing convolutions. Finally a new method that combines the Fast Fourier transform with the CRT methods is derived. This latter method is shown to be faster for some very large size convolutions than either method used alone.Ph.D., Computer Science -- Drexel University, 200

    Linear analysis and optimization of stream programs

    Get PDF
    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.Includes bibliographical references (p. 123-127).As more complex DSP algorithms are realized in practice, there is an increasing need for high-level stream abstractions that can be compiled without sacrificing efficiency. Toward this end, we present a set of aggressive optimizations that target linear sections of a stream program. Our input language is StreamIt, which represents programs as a hierarchical graph of autonomous filters. A filter is linear if each of its outputs can be represented as an affine combination of its inputs. Linearity is common in DSP components; examples include FIR filters, expanders, compressors, FFTs and DCTs. We demonstrate that several algorithmic transformations, traditionally handtuned by DSP experts, can be completely automated by the compiler. First, we present a linear extraction analysis that automatically detects linear filters from the C-like code in their work function. Then, we give a procedure for combining adjacent linear filters into a single filter, a specialized caching strategy to remove redundant computations, and a method for translating a linear filter to operate in the frequency domain. We also present an optimization selection algorithm, which finds the sequence of combination and frequency transformations that yields the maximal benefit. We have completed a fully-automatic implementation of the above techniques as part of the StreamIt compiler. Using a suite of benchmarks, we show that our optimizations remove, on average, 86% of the floating point instructions required. In addition, we demonstrate an average execution time decrease of 450% and an 800% decrease in the best case.by Andrew Allinson Lamb.M.Eng

    Model-driven search-based loop fusion optimization for handwritten code

    Get PDF
    The Tensor Contraction Engine (TCE) is a compiler that translates high-level, mathematical tensor contraction expressions into efficient, parallel Fortran code. A pair of optimizations in the TCE, the fusion and tiling optimizations, have proven successful for minimizing disk-to-memory traffic for dense tensor computations. While other optimizations are specific to tensor contraction expressions, these two model-driven search-based optimization algorithms could also be useful for optimizing handwritten dense array computations to minimize disk to memory traffic. In this thesis, we show how to apply the loop fusion algorithm to handwritten code in a procedural language. While in the TCE the loop fusion algorithm operated on high-level expression trees, in a standard compiler it needs to operate on abstract syntax trees. For simplicity, we use the fusion algorithm only for memory minimization instead of for minimizing disk-to-memory traffic. Also, we limit ourselves to handwritten, dense array computations in which loop bounds expressions are constant, subscript expressions are simple loop variables, and there are no common subexpressions. After type-checking, we canonicalize the abstract syntax tree to move side effects and loop-invariant code out of larger expressions. Using dataflow analysis, we then compute reaching definitions and add use-def chains to the abstract syntax tree. After undoing any partial loop fusion, a generalized loop fusion algorithm traverses the abstract syntax tree together with the use-def chains. Finally, the abstract syntax tree is rewritten to reflect the loop structure found by the loop fusion algorithm. We outline how the constraints on loop bounds expressions and array index expressions could be removed in the future using an algebraic cost model and an analysis of the iteration space using a polyhedral model

    The Design and Implementation of FFTW3

    Full text link

    Search-based Model-driven Loop Optimizations for Tensor Contractions

    Get PDF
    Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. The Tensor Contraction Engine (TCE) is a high-level program synthesis system that facilitates the generation of high-performance parallel programs from tensor contraction equations. We are developing a new software infrastructure for the TCE that is designed to allow experimentation with optimization algorithms for modern computing platforms, including for heterogeneous architectures employing general-purpose graphics processing units (GPGPUs). In this dissertation, we present improvements and extensions to the loop fusion optimization algorithm, which can be used with cost models, e.g., for minimizing memory usage or for minimizing data movement costs under a memory constraint. We show that our data structure and pruning improvements to the loop fusion algorithm result in significant performance improvements that enable complex cost models being use for large input equations. We also present an algorithm for optimizing the fused loop structure of handwritten code. It determines the regions in handwritten code that are safe to be optimized and then runs the loop fusion algorithm on the dependency graph of the code. Finally, we develop an optimization framework for generating GPGPU code consisting of loop fusion optimization with a novel cost model, tiling optimization, and layout optimization. Depending on the memory available on the GPGPU and the sizes of the tensors, our framework decides which processor (CPU or GPGPU) should perform an operation and where the result should be moved. We present extensive measurements for tuning the loop fusion algorithm, for validating our optimization framework, and for measuring the performance characteristics of GPGPUs. Our measurements demonstrate that our optimization framework outperforms existing general-purpose optimization approaches both on multi-core CPUs and on GPGPUs

    Advanced data management system analysis techniques study

    Get PDF
    The state of the art of system analysis is reviewed, emphasizing data management. Analytic, hardware, and software techniques are described

    Using Machine Learning to Automate Compiler Optimisation

    Get PDF
    Institute for Computing Systems ArchitectureMany optimisations in modern compilers have been traditionally based around using analysis to examine certain aspects of the code; the compiler heuristics then make a decision based on this information as to what to optimise, where to optimise and to what extent to optimise. The exact contents of these heuristics have been carefully tuned by experts, using their experience, as well as analytical tools, to produce solid performance. This work proposes an alternative approach – that of using proper statistical analysis to drive these optimisation goals instead of human intuition, through the use of machine learning. This work shows how, by using a probabilistic search of the optimisation space, we can achieve a significant speedup over the baseline compiler with the highest optimisation settings, on a number of different processor architectures. Additionally, there follows a further methodology for speeding up this search by being able to transfer our knowledge of one program to another. This thesis shows that, as is the case in many other domains, programs can be successfully represented by program features, which can then be used to gauge their similarity and thus the applicability of previously learned off-line knowledge. Employing this method, we are able to gain the same results in terms of performance, reducing the time taken by an order of magnitude. Finally, it is demonstrated how statistical analysis of programs allows us to learn additional important optimisation information, purely by examining the features alone. By incorporating this additional information into our model, we show how good results can be achieved in just one compilation. This work is tested on real hardware, for both the embedded and general purpose domain, showing its wide applicability
    • …
    corecore