62 research outputs found

    Common Subexpression-based Compression and Multiplication of Sparse Constant Matrices

    Full text link
    In deep learning inference, model parameters are pruned and quantized to reduce the model size. Compression methods and common subexpression (CSE) elimination algorithms are applied to sparse constant matrices to deploy the models on low-cost embedded devices. However, state-of-the-art CSE elimination methods do not scale well to large matrices: they take hours to extract the CSEs in a 200×200 matrix, and their matrix multiplication algorithms execute longer than conventional matrix multiplication methods. Moreover, there exist no compression methods for matrices that utilize CSEs. As a remedy, this paper proposes a random search-based algorithm that extracts CSEs in the column pairs of a constant matrix; it produces an adder tree for a 1000×1000 matrix in a minute. To compress the adder tree, the paper presents a compression format that extends Compressed Sparse Row (CSR) to include CSEs. Compression rates of more than 50% can be achieved compared to the original CSR format, while simulations for a single-core embedded system show that the matrix multiplication execution time can be reduced by 20%.
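    A minimal sketch of the core saving, using a made-up 0/1 matrix rather than anything from the paper: when two rows share the same column pair, the sum over that pair is a common subexpression that can be computed once, which is the kind of reuse an adder tree over column pairs exploits.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Conventional CSR matrix-vector product y = A @ x."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A 0/1 constant matrix where rows 0 and 2 both contain columns {1, 3}:
#     [0 1 0 1]
# A = [1 0 0 0]
#     [0 1 0 1]
values  = [1.0, 1.0, 1.0, 1.0, 1.0]
col_idx = [1, 3, 0, 1, 3]
row_ptr = [0, 2, 3, 5]

def cse_matvec(x):
    """Same product with the shared pair (x[1] + x[3]) extracted once.

    The pair sum is computed a single time and reused by both rows,
    replacing two multiply-accumulate chains with one precomputed term.
    """
    t = x[1] + x[3]          # common subexpression for rows 0 and 2
    return [t, x[0], t]

x = [2.0, 3.0, 5.0, 7.0]
assert csr_matvec(values, col_idx, row_ptr, x) == cse_matvec(x)
```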

    Some Optimizations of Hardware Multiplication by Constant Matrices

    Get PDF
    This paper presents some improvements on the optimization of hardware multiplication by constant matrices. We focus on the automatic generation of circuits that involve constant matrix multiplication, i.e. multiplication of a vector by a constant matrix. The proposed method, based on number recoding and dedicated common subexpression factorization algorithms, was implemented in a VHDL generator. Our algorithms and generator have been extended to the case of some digital filters based on multiplication by a constant matrix and delay operations. The results obtained for several applications have been implemented on FPGAs and compared to previous solutions. Up to 40% area and speed savings are achieved.
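    As an illustration of the "number recoding" ingredient, here is a generic canonical signed-digit (CSD) recoding, not necessarily the authors' exact variant, which reduces a constant multiplier to shifts and adds/subtracts, the primitive operations such circuit generators emit.

```python
def csd(n):
    """Canonical signed-digit recoding of a positive integer (LSB first).

    Each digit is -1, 0, or +1 and no two nonzero digits are adjacent,
    minimising the number of adders/subtractors in a shift-add multiplier.
    """
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def multiply_by_constant(x, c):
    """x * c using only shifts and adds/subtracts, as hardware would."""
    acc = 0
    for shift, d in enumerate(csd(c)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc

# 119 = 2**7 - 2**3 - 2**0: three terms instead of the six set bits
# in the plain binary expansion 1110111.
assert csd(119) == [-1, 0, 0, -1, 0, 0, 0, 1]
assert multiply_by_constant(5, 119) == 595
```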

    Model-driven search-based loop fusion optimization for handwritten code

    Get PDF
    The Tensor Contraction Engine (TCE) is a compiler that translates high-level, mathematical tensor contraction expressions into efficient, parallel Fortran code. A pair of optimizations in the TCE, the fusion and tiling optimizations, has proven successful for minimizing disk-to-memory traffic for dense tensor computations. While other optimizations are specific to tensor contraction expressions, these two model-driven search-based optimization algorithms could also be useful for optimizing handwritten dense array computations to minimize disk-to-memory traffic. In this thesis, we show how to apply the loop fusion algorithm to handwritten code in a procedural language. While in the TCE the loop fusion algorithm operated on high-level expression trees, in a standard compiler it needs to operate on abstract syntax trees. For simplicity, we use the fusion algorithm only for memory minimization instead of for minimizing disk-to-memory traffic. Also, we limit ourselves to handwritten, dense array computations in which loop bound expressions are constant, subscript expressions are simple loop variables, and there are no common subexpressions. After type-checking, we canonicalize the abstract syntax tree to move side effects and loop-invariant code out of larger expressions. Using dataflow analysis, we then compute reaching definitions and add use-def chains to the abstract syntax tree. After undoing any partial loop fusion, a generalized loop fusion algorithm traverses the abstract syntax tree together with the use-def chains. Finally, the abstract syntax tree is rewritten to reflect the loop structure found by the loop fusion algorithm. We outline how the constraints on loop bound expressions and array index expressions could be removed in the future using an algebraic cost model and an analysis of the iteration space using a polyhedral model.
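    A before/after sketch of the transformation's memory effect, written in Python for brevity even though the thesis operates on abstract syntax trees in a procedural language: fusing the two loops shrinks the N-element intermediate array to a scalar.

```python
N = 1_000

def unfused(a):
    t = [0.0] * N            # N-element intermediate lives across both loops
    for i in range(N):
        t[i] = 2.0 * a[i]
    s = 0.0
    for i in range(N):
        s += t[i] + 1.0
    return s

def fused(a):
    s = 0.0
    for i in range(N):       # one loop; each t[i] shrinks to a scalar
        ti = 2.0 * a[i]
        s += ti + 1.0
    return s

a = [float(i) for i in range(N)]
assert unfused(a) == fused(a)   # same result, O(1) instead of O(N) temporary storage
```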

    Techniques for Efficient Implementation of FIR and Particle Filtering

    Full text link

    High level synthesis of memory architectures

    Get PDF

    High-level synthesis of VLSI circuits

    Get PDF

    Taylor Expansion Diagrams: A Canonical Representation for Verification of Data Flow Designs

    Full text link

    Compilers that learn to optimise: a probabilistic machine learning approach

    Get PDF
    Compiler optimisation is the process of making a compiler produce better code, i.e. code that, for example, runs faster on a target architecture. Although numerous program transformations for optimisation have been proposed in the literature, these transformations are not always beneficial and they can interact in very complex ways. Traditional approaches adopted by compiler writers fix the order of the transformations and decide when and how these transformations should be applied to a program by using hard-coded heuristics. However, these heuristics require a lot of time and effort to construct and may sacrifice performance on programs they have not been tuned for. This thesis proposes a probabilistic machine learning solution to the compiler optimisation problem that automatically determines "good" optimisation strategies for programs. This approach uses predictive modelling in order to search the space of compiler transformations. Unlike most previous work that learns when/how to apply a single transformation in isolation or a fixed-order set of transformations, the techniques proposed in this thesis are capable of tackling the general problem of predicting "good" sequences of compiler transformations. This is achieved by exploiting transference across programs with two different techniques: Predictive Search Distributions (PSD) and multi-task Gaussian process prediction (multi-task GP). While the former directly addresses the problem of predicting "good" transformation sequences, the latter learns regression models (or proxies) of the performance of the programs in order to rapidly scan the space of transformation sequences. Both methods, PSD and multi-task GP, are formulated as general machine learning techniques. In particular, the PSD method is proposed in order to speed up search in combinatorial optimisation problems by learning a distribution over good solutions on a set of problem instances and using that distribution to search the optimisation space of a problem that has not been seen before. Likewise, multi-task GP is proposed as a general method for multi-task learning that directly models the correlation between several machine learning tasks, exploiting the shared information across the tasks. Additionally, this thesis presents an extension to the well-known analysis of variance (ANOVA) methodology in order to deal with sequence data. This extension is used to address the problem of optimisation space characterisation by identifying and quantifying the main effects of program transformations and their interactions. Finally, the machine learning methods proposed are successfully applied to a data set that has been generated as a result of the application of source-to-source transformations to 12 C programs from the UTDSP benchmark suite.
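    As a rough illustration of searching transformation-sequence space with a learned distribution, here is a cross-entropy-style stand-in with a synthetic objective; the transformation names and scoring are invented, and this is not the thesis's actual PSD or multi-task GP formulation.

```python
import random

TRANSFORMS = ["fuse", "tile", "unroll", "none"]
SEQ_LEN = 4

def speedup(seq):
    """Stand-in objective; a real system would compile and run the program."""
    score = 1.0 if seq[0] == "fuse" else 0.0       # fusing first helps here
    score += 0.5 if "tile" in seq else 0.0
    score -= 0.2 * seq.count("unroll")             # over-unrolling hurts
    return score

def sample(dist):
    return [random.choices(TRANSFORMS, weights=dist[p])[0] for p in range(SEQ_LEN)]

# Start from a uniform distribution at every sequence position.
dist = [[1.0] * len(TRANSFORMS) for _ in range(SEQ_LEN)]
for _ in range(20):
    candidates = [sample(dist) for _ in range(50)]
    elite = sorted(candidates, key=speedup, reverse=True)[:10]
    for p in range(SEQ_LEN):                       # refit to the elite set
        for t in range(len(TRANSFORMS)):
            count = sum(seq[p] == TRANSFORMS[t] for seq in elite)
            dist[p][t] = 0.1 + count               # smoothed counts

best = max((sample(dist) for _ in range(50)), key=speedup)
print(best, speedup(best))
```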

    Energy efficient hardware acceleration of multimedia processing tools

    Get PDF
    The world of mobile devices is experiencing an ongoing trend of feature enhancement and general-purpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks. Based on the survey that this thesis presents on modern video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at the algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work in this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high-level techniques such as redundant computation elimination, parallelism and low-switching computation structures. Both architectures compare favourably against the relevant prior art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation, namely Constant Matrix Multiplication (CMM). Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search-space reductions. Results show an improvement on state-of-the-art algorithms with future potential for even greater savings.
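    A small sketch of the kind of adder sharing a CMM search algorithm scores candidates on, using a made-up 2×2 constant matrix rather than the thesis's genetic-programming method: both outputs need the partial sum 3*x0, so computing it once saves an adder.

```python
def cmm_naive(x0, x1):
    # y0 = 3*x0 + 2*x1, y1 = 3*x0 + 4*x1; the term 3*x0 is built twice.
    y0 = (x0 << 1) + x0 + (x1 << 1)      # 2 adders
    y1 = (x0 << 1) + x0 + (x1 << 2)      # 2 adders -> 4 total
    return y0, y1

def cmm_shared(x0, x1):
    t = (x0 << 1) + x0                   # 3*x0 computed once (1 adder)
    y0 = t + (x1 << 1)                   # 1 adder
    y1 = t + (x1 << 2)                   # 1 adder -> 3 total
    return y0, y1

assert cmm_naive(5, 7) == cmm_shared(5, 7)   # (15+14, 15+28) = (29, 43)
```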