13 research outputs found

    Inter-Tile Reuse Optimization Applied to Bandwidth Constrained Embedded Accelerators

    The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on the interconnect. In this paper we drastically reduce accelerator communication by exploring computation reordering and local buffer usage. Consequently, we present a new analytical methodology to optimize nested loops for inter-tile data reuse with loop transformations such as interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement by up to 2.1x compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel-i7 processor.
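
The difference between intra-tile and inter-tile reuse mentioned in the abstract above can be illustrated with a toy count of input loads for a tiled 1D filter. This is only a sketch under stated assumptions, not the paper's analytical methodology: the function names, the 1D filter, and the load-counting model are all hypothetical. With intra-tile reuse only, every tile reloads its overlap (halo) with the previous tile; with inter-tile reuse, the halo is assumed to stay in a local buffer.

```python
def loads_intra_tile(n_outputs, taps, tile):
    """Input loads when each tile fetches its full window, halo included."""
    total = 0
    for start in range(0, n_outputs, tile):
        width = min(tile, n_outputs - start)  # last tile may be smaller
        total += width + taps - 1             # outputs plus halo of taps-1 inputs
    return total


def loads_inter_tile(n_outputs, taps, tile):
    """Input loads when the halo from the previous tile is kept in a buffer."""
    total = 0
    for start in range(0, n_outputs, tile):
        width = min(tile, n_outputs - start)
        if start == 0:
            total += width + taps - 1         # first tile has nothing to reuse
        else:
            total += width                    # only the new inputs are fetched
    return total


if __name__ == "__main__":
    n, k, t = 64, 5, 8                        # hypothetical problem sizes
    print(loads_intra_tile(n, k, t), loads_inter_tile(n, k, t))
```

In this toy model the inter-tile count always equals `n_outputs + taps - 1`, independent of the tile size, while the intra-tile count grows as tiles get smaller.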

    Hardware aware tiling optimization for multi-core systems

    This paper proposes a new tool that improves tiling efficiency for a given hardware architecture. It also describes the correlation between changes in hardware architecture and methods of software optimization. The first chapter gives a short description of the changes in hardware architecture that have occurred in the last 10 years. The second chapter provides an overview of the tools that will be used in further research. The subsequent sections describe the proposed hardware-aware tool for optimal tiling.

    Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures

    Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix-Dense Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state-of-the-art.
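
For readers unfamiliar with the two kernels named in the abstract above, the following is a minimal untiled reference sketch of what SpMM and SDDMM compute. It is not the paper's tiled implementation or its signature-based cost model; the dictionary-of-coordinates sparse format and the function names are assumptions chosen for brevity.

```python
def spmm(sparse, dense, n_rows):
    """SpMM: multiply a sparse matrix by a dense matrix.

    sparse: dict mapping (row, col) -> nonzero value (assumed format).
    dense:  list of rows; row k of dense pairs with column k of sparse.
    """
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for (i, k), value in sparse.items():
        for j in range(n_cols):
            out[i][j] += value * dense[k][j]
    return out


def sddmm(sparse, a, b):
    """SDDMM: compute (a @ b) only at the nonzero positions of sparse,
    scaling each sampled entry by the corresponding sparse value."""
    inner = len(b)
    return {
        (i, j): value * sum(a[i][k] * b[k][j] for k in range(inner))
        for (i, j), value in sparse.items()
    }


if __name__ == "__main__":
    s = {(0, 0): 2.0, (1, 1): 3.0}            # hypothetical 2x2 sparse matrix
    d = [[1.0, 2.0], [3.0, 4.0]]
    print(spmm(s, d, n_rows=2))
    print(sddmm(s, d, [[1.0, 0.0], [0.0, 1.0]]))
```

Both loops touch only the stored nonzeros, which is why the sparsity structure determines how much reuse of dense rows and columns a tiling can achieve.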

    Software system for automating the creation of optimized user applications for parallel embedded systems

    The master's dissertation contains 102 pages, 75 figures, 14 tables, 1 appendix, and 30 sources. Object of research: parallel embedded systems. Purpose of the master's dissertation: to increase the efficiency of application optimization by the tiling method. Subject of research: an automated system for creating optimized user applications for parallel embedded systems. The scientific novelty of the results obtained in the master's dissertation lies in improving the efficiency of application optimization by the tiling method, namely in implementing the search for optimal tile sizes with a genetic algorithm.

    Analytical cost metrics: days of future past

    2019 Summer. Includes bibliographical references. Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators, special-purpose hardware that will increase the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general purpose workstations, tablets, phones and other media devices. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. This work builds analytical cost models for cost metrics such as time, energy, memory access, and silicon area. These models are used to predict the performance of applications, for performance tuning, and chip design. The idea is to work with domain specific accelerators where analytical cost models can be accurately used for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in a few ways. For stencil applications and GPU architectures, the analytical cost models are developed for execution time as well as energy. The models are used for performance tuning over existing architectures, and are coupled with silicon area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed form solutions for off-chip data movement are built and used to minimize the total data movement cost of a minimum op count tree.
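
The "minimum op count tree" for a matrix chain product mentioned in the abstract above is usually found with the textbook dynamic program sketched below. This covers only the operation-count part; the thesis's closed-form data-movement models are not reproduced here, and the function name and example dimensions are hypothetical.

```python
def min_chain_ops(dims):
    """Minimum scalar multiplications to evaluate a matrix chain.

    dims has length n + 1; matrix i has shape dims[i] x dims[i + 1].
    """
    n = len(dims) - 1
    # cost[i][j] = cheapest way to multiply matrices i..j (inclusive)
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):            # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)          # split point between k and k + 1
            )
    return cost[0][n - 1]


if __name__ == "__main__":
    # Hypothetical chain: (10x30) * (30x5) * (5x60)
    print(min_chain_ops([10, 30, 5, 60]))
```

Among all parenthesizations with this minimum operation count, a data-movement model would then be needed to choose the evaluation order and tiling; that step is outside this sketch.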