Search CORE

6 research outputs found

Inter-tile reuse optimization applied to bandwidth constrained embedded accelerators

Author: Corporaal H.
Mesman B.
Peemen M.C.J.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2015
Field of study

The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale-up performance accelerators require huge amounts of data, and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on interconnect. In this paper we drastically reduce accelerator communication by exploration of computation reordering and local buffer usage. Consequently, we present a new analytical methodology to optimize nested loops for inter-tile data reuse with loop transformations like interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement up to 2.1x compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel-i7 processor

Repository TU/e

Crossref

Pure OAI Repository

Inter-Tile Reuse Optimization Applied to Bandwidth Constrained Embedded Accelerators

Author: Corporaal H.
Mesman B.
Peemen M.C.J.
Publication venue: 'EDAA'
Publication date: 01/01/2015
Field of study

Crossref

Maximizing CNN Accelerator Efficiency Through Resource Partitioning

Author: Alwani M.
Krizhevsky Alex
Li Huimin
van den Oord Aäron
Publication venue
Publication date: 12/04/2018
Field of study

Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach on evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x

arXiv.org e-Print Archive

Crossref

Demystifying the 16 × 16 thread-block for stencils on the GPU

Author: Fernandez
Gomez-Luna
Podlozhnyuk
Ragan-Kelley
Tabik
Torres
White
Zhao
Publication venue: 'Wiley'
Publication date
Field of study

Crossref

Automatic mapping of convolutional networks on the Neuro Vector Engine

Author: Pramadi W.
Publication venue
Publication date: 01/01/2015
Field of study

Repository TU/e

Pure OAI Repository