Search CORE

3 research outputs found

AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

Author: Cong Jason
Wei Peng
Yu Cody Hao
Zhang Peng
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 29/07/2018
Field of study

CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs whose programming is conventionally a RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve the FPGA programmability, it still leaves programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels, and more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations based on the result of the analytical-model-based design space exploration to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels

arXiv.org e-Print Archive

Crossref

Scipedia

An Optimal Microarchitecture for Stencil Computation Acceleration Based on Nonuniform Partitioning of Data Reuse Buffers

Author: Bingjun Xiao
Jason Cong
Peng Li
Peng Zhang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Extracting Data-Level Parallelism in High-Level Synthesis for Reconfigurable Architectures

Author: Escobedo Contreras Juan Andres
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2020
Field of study

High-Level Synthesis (HLS) tools are a set of algorithms that allow programmers to obtain implementable Hardware Description Language (HDL) code from specifications written high-level, sequential languages such as C, C++, or Java. HLS has allowed programmers to code in their preferred language while still obtaining all the benefits hardware acceleration has to offer without them needing to be intimately familiar with the hardware platform of the accelerator. In this work we summarize and expand upon several of our approaches to improve the automatic memory banking capabilities of HLS tools targeting reconfigurable architectures, namely Field-Programmable Gate Arrays or FPGA\u27s. We explored several approaches to automatically find the optimal partition factor and a usable banking scheme for stencil kernels including a tessellation based approach using multiple families of hyperplanes to do the partitioning which was able to find a better banking factor than current state-of-the-art methods and a graph theory methodology that allowed us to mathematically prove the optimality of our banking solutions. For non-stencil kernels we relaxed some of the conditions in our graph-based model to propose a best-effort solution to arbitrarily reduce memory access conflicts (simultaneous accesses to the same memory bank). We also proposed a non-linear transformation using prime factorization to convert a small subset of non-stencil kernels into stencil memory accesses, allowing us to use all previous work in memory partition to them. Our approaches were able to obtain better results than commercial tools and state-of-the-art algorithms in terms of reduced resource utilization and increased frequency of operation. We were also able to obtain better partition factors for some stencil kernels and usable baking schemes for non-stencil kernels with better performance than any applicable existing algorithm

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)