    A data-reuse aware accelerator for large-scale convolutional networks

    This paper presents a clustered SIMD accelerator template for Convolutional Networks. These networks significantly outperform other methods in detection and classification tasks in the vision domain. Due to their excessive compute and data transfer requirements, these applications benefit greatly from a dedicated accelerator. The proposed accelerator reduces memory traffic through loop transformations such as tiling and fusion, which merges successive layers. Although fusion can introduce redundant computation, it often reduces data transfer and can therefore remove performance bottlenecks. The SIMD cluster is mapped to a Xilinx Zynq FPGA, where it achieves 6.4 Gops with a small amount of resources. Performance can be scaled further by using multiple clusters.
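
    As a hedged illustration of the tiling-and-fusion idea (a sketch, not the paper's accelerator template; all sizes and names below are invented), the C fragment fuses two 3x3 convolution layers so that the intermediate feature map lives only in a small on-chip row buffer and is never written to external memory, at the cost of recomputing halo rows:

        /* Hedged sketch of tiling + layer fusion (not the paper's
         * template; sizes and names are illustrative). Two 3x3 "valid"
         * convolution layers are fused: for each output row, the three
         * intermediate rows it needs are recomputed into a small row
         * buffer, so the full intermediate feature map never goes to
         * external memory. */
        #define W 64                     /* image width  (made up) */
        #define H 64                     /* image height (made up) */

        void fused_conv(const float in[H][W], const float k1[3][3],
                        const float k2[3][3], float out[H - 4][W - 4])
        {
            float mid[3][W - 2];         /* on-chip tile buffer */

            for (int oy = 0; oy < H - 4; oy++) {
                /* Layer 1: (re)compute the 3 intermediate rows needed
                 * by output row oy -- redundant work traded for less
                 * off-chip traffic. */
                for (int r = 0; r < 3; r++)
                    for (int x = 0; x < W - 2; x++) {
                        float acc = 0.0f;
                        for (int dy = 0; dy < 3; dy++)
                            for (int dx = 0; dx < 3; dx++)
                                acc += in[oy + r + dy][x + dx] * k1[dy][dx];
                        mid[r][x] = acc;
                    }
                /* Layer 2 consumes the buffered rows directly. */
                for (int x = 0; x < W - 4; x++) {
                    float acc = 0.0f;
                    for (int dy = 0; dy < 3; dy++)
                        for (int dx = 0; dx < 3; dx++)
                            acc += mid[dy][x + dx] * k2[dy][dx];
                    out[oy][x] = acc;
                }
            }
        }

    A real implementation would rotate the row buffer so that only one new intermediate row is computed per step; the sketch recomputes all three for clarity.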

    Optimal iteration scheduling for intra- and inter-tile reuse in nested loop accelerators

    High-Level Synthesis tools have reduced accelerator design time. However, a complex scaling problem that remains is the data transfer bottleneck. Accelerators require huge amounts of data and are often limited by interconnect resources. Local buffers can reduce communication by exploiting data reuse, but the data access order has a substantial impact on the amount of reuse that can be exploited. Loop transformations such as interchange and tiling can modify the data access order. However, for real applications the design space is huge, and finding the best set of transformations is often intractable. Therefore, we present a new methodology that minimizes data transfer by loop interchange and tiling. In contrast to other methods, we take inter-tile reuse and loop bounds into account. For real-world applications we show buffer size trade-offs that can give speedups of up to 14x; alternatively, they can substantially reduce the required FPGA resources.
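
    To make the interchange-and-tiling idea concrete, here is a minimal C sketch (our own example, not the paper's methodology; the matrix-multiply kernel and tile size T are assumptions). Tiling bounds the working set so it fits a fixed-size local buffer, and interchanging the inner loops changes which operand is streamed and which is reused:

        /* Hedged sketch of loop interchange + tiling (illustrative
         * only). Each element of the A and B tiles is reused T times
         * before it is evicted from the local buffer. */
        #include <string.h>

        #define N 256                    /* problem size (made up) */
        #define T 32                     /* tile size    (made up) */

        void matmul_tiled(const float A[N][N], const float B[N][N],
                          float C[N][N])
        {
            memset(C, 0, sizeof(float) * N * N);
            for (int ii = 0; ii < N; ii += T)
                for (int jj = 0; jj < N; jj += T)
                    for (int kk = 0; kk < N; kk += T)
                        /* Interchanged inner loops (i-k-j order):
                         * A[i][k] is held across the j loop and row k
                         * of B is streamed sequentially. */
                        for (int i = ii; i < ii + T; i++)
                            for (int k = kk; k < kk + T; k++)
                                for (int j = jj; j < jj + T; j++)
                                    C[i][j] += A[i][k] * B[k][j];
        }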

    Inter-tile reuse optimization applied to bandwidth constrained embedded accelerators

    The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on the interconnect. In this paper we drastically reduce accelerator communication by exploring computation reordering and local buffer usage. To this end, we present a new analytical methodology that optimizes nested loops for inter-tile data reuse with loop transformations such as interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement by up to 2.1x compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% of FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel i7 processor.
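
    The following C sketch illustrates the inter-tile reuse concept in its simplest 1D form (our own example, not the paper's analytical method; K and T are made-up parameters). An intra-tile-only scheme would fetch T + K - 1 inputs per tile; keeping the K - 1 overlap samples in the local buffer means only T new inputs cross the interconnect per tile:

        /* Hedged sketch of inter-tile reuse for a K-tap filter computed
         * in tiles of T outputs. Assumes n_out is a multiple of T and
         * 'in' holds n_out + K - 1 samples. */
        #include <string.h>

        #define K 5                      /* filter taps      (made up) */
        #define T 64                     /* outputs per tile (made up) */

        void filter_tiled(const float *in, const float *coef,
                          float *out, int n_out)
        {
            float buf[T + K - 1];        /* local buffer */

            memcpy(buf, in, (K - 1) * sizeof buf[0]);  /* prime overlap */
            for (int t = 0; t < n_out; t += T) {
                /* Fetch only the T new samples; K - 1 are reused. */
                memcpy(buf + K - 1, in + t + K - 1, T * sizeof buf[0]);
                for (int i = 0; i < T; i++) {
                    float acc = 0.0f;
                    for (int k = 0; k < K; k++)
                        acc += buf[i + k] * coef[k];
                    out[t + i] = acc;
                }
                /* Slide the overlap to the front for the next tile. */
                memmove(buf, buf + T, (K - 1) * sizeof buf[0]);
            }
        }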

    Enabling MPSoC design space exploration on FPGAs

    Future applications for embedded systems demand chip multiprocessor designs to meet real-time deadlines. These multiprocessors are increasingly heterogeneous for reasons of cost and power. Design space exploration (DSE) of application mappings becomes a major design decision in such systems, and the time spent in DSE grows further when multiple applications execute concurrently. Methods have been proposed to automate the generation of multiprocessor designs and prototype them on FPGAs, but only a few support heterogeneous platforms. This is because heterogeneous processors require different types of inter-processor communication interfaces: when a different processor is chosen for a particular task, the communication infrastructure of that processor also has to change. In this paper, we present a module that integrates into a multiprocessor design generation flow and enables heterogeneous platform generation. The module is area-efficient and fast. The DSE shows that up to 31% of FPGA area can be saved when a heterogeneous design is used instead of a homogeneous platform, while application performance also improves significantly.
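
    To give a feel for what mapping DSE means in miniature (a toy sketch, entirely our own; it is not the paper's flow, and all latency and area numbers are invented), the C program below enumerates every assignment of four tasks to two processor types and keeps the cheapest mapping that meets a deadline:

        /* Toy mapping DSE: exhaustive enumeration over a tiny design
         * space. Each task gets its own core; numbers are made up. */
        #include <stdio.h>

        #define NTASKS 4
        #define NTYPES 2                 /* 0 = small core, 1 = big core */

        static const int lat[NTASKS][NTYPES] = {
            {8, 3}, {6, 2}, {9, 4}, {5, 2}   /* cycles per task/type */
        };
        static const int area[NTYPES] = { 100, 450 };  /* e.g. LUTs */

        int main(void)
        {
            int best_area = -1, best_map = -1;
            for (int m = 0; m < (1 << NTASKS); m++) {  /* bit i = type of task i */
                int latency = 0, a = 0;
                for (int i = 0; i < NTASKS; i++) {
                    int ty = (m >> i) & 1;
                    latency += lat[i][ty];             /* serial schedule */
                    a += area[ty];
                }
                if (latency <= 20 && (best_area < 0 || a < best_area)) {
                    best_area = a;
                    best_map = m;
                }
            }
            printf("best mapping 0x%x, area %d\n", best_map, best_area);
            return 0;
        }

    Real flows replace the exhaustive loop with heuristics, since the space grows exponentially with tasks and processor types.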

    DC-SIMD: dynamic communication for SIMD processors

    SIMD (single instruction multiple data) processors have proven very efficient in image processing applications, because their repetitive structure can exploit the huge amount of data-level parallelism in pixel-type operations at a relatively low energy consumption rate. However, current SIMD architectures lack support for dynamic communication between processing elements, which is needed to map a set of non-linear algorithms efficiently. An architecture for dynamic communication support has been proposed, but it needs large amounts of buffering to function properly. In this paper, three architectures are presented that support dynamic communication without the need for large amounts of buffering, requiring 98% less buffer space. Cycle-true communication architecture simulators have been developed to accurately predict the performance of the different architectures. Simulations with several test algorithms show a performance improvement of up to 5x compared to a locally connected SIMD processor. Detailed area models estimate the three proposed architectures to have an area overhead of 30-70% compared to a locally connected SIMD architecture (like the IMAP); when memory is taken into account as well, the overhead is estimated at 13-28%.
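
    The contrast between static and dynamic communication can be sketched in a few lines of C (conceptual only; this is not the DC-SIMD design, and NPE and the function names are our own). A locally connected array can only shift data to a fixed neighbour, so a data-dependent pattern costs many shift steps, whereas dynamic communication delivers each value directly:

        #define NPE 16   /* number of processing elements (made up) */

        /* Static nearest-neighbour link: one step moves every value
         * one PE to the right. */
        void shift_right(int v[NPE])
        {
            for (int p = NPE - 1; p > 0; p--) v[p] = v[p - 1];
            v[0] = 0;
        }

        /* Dynamic communication: PE p sends its value to PE dst[p] in
         * one idealized step; a real design needs routing hardware and
         * some buffering to resolve conflicting destinations. */
        void dynamic_send(const int v[NPE], const int dst[NPE],
                          int out[NPE])
        {
            for (int p = 0; p < NPE; p++) out[p] = 0;
            for (int p = 0; p < NPE; p++) out[dst[p]] = v[p];
        }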

    Fast Huffman decoding by exploiting data level parallelism

    The frame rates and resolutions of digital video keep rising, pushing the compression ratios of video coding standards to their limits and resulting in more complex, computationally demanding algorithms. Programmable solutions are gaining interest as a way to keep up with evolving video coding standards by reducing the time-to-market of upcoming video products. However, to compete with hardwired solutions, parallelism needs to be exploited on as many levels as possible. In this paper the focus is on data-level parallelism. Huffman coding is very efficient and therefore commonly applied in many coding standards. However, due to its inherently sequential nature, parallelizing Huffman decoding is considered hard. The proposed fully flexible and programmable solution exploits the available data-level parallelism in Huffman decoding. Our implementation achieves a decoding speed of 106 MBit/s on a 250 MHz processor, a speed-up of 24× compared to our sequential reference implementation.
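
    As a hedged sketch of how data-level parallelism can be exposed in Huffman decoding (a common block-partitioning approach, not necessarily the paper's method; the tree layout and function names are assumptions), the bitstream can be split into blocks with known bit offsets so that each block decodes independently, e.g. one block per SIMD lane or thread:

        /* Plain tree-walk Huffman decoder; decode_block() calls on
         * different blocks are independent and can run in parallel. */
        #include <stdint.h>
        #include <stddef.h>

        typedef struct Node {            /* Huffman tree node */
            int16_t child[2];            /* 0/1 child index, -1 = leaf */
            int16_t symbol;              /* valid at leaves */
        } Node;

        static int get_bit(const uint8_t *bits, size_t pos)
        {
            return (bits[pos >> 3] >> (7 - (pos & 7))) & 1;
        }

        /* Decode n_sym symbols starting at bit offset 'start'; the
         * tree root is assumed to sit at index 0. */
        size_t decode_block(const Node *tree, const uint8_t *bits,
                            size_t start, int n_sym, int16_t *out)
        {
            size_t pos = start;
            for (int s = 0; s < n_sym; s++) {
                int n = 0;                        /* start at the root */
                while (tree[n].child[0] >= 0)     /* internal node */
                    n = tree[n].child[get_bit(bits, pos++)];
                out[s] = tree[n].symbol;
            }
            return pos;   /* next unread bit */
        }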