Using Reduced Graphs for Efficient HLS Scheduling
High-Level Synthesis (HLS) is the process of inferring a digital circuit from a high-level algorithmic description provided as a software implementation, usually in C/C++. HLS tools parse the input code and then perform three main steps: allocation, scheduling, and binding. This results in a hardware architecture which can then be represented as a Register-Transfer Level (RTL) model using a Hardware Description Language (HDL), such as VHDL or Verilog. Allocation determines the amount of resources needed, scheduling finds the order in which operations should occur, and binding maps operations onto the allocated hardware resources. Two main challenges of scheduling are its computational complexity and its memory requirements. Finding an optimal schedule is an NP-hard problem, so many tools use elaborate heuristics to find a solution which satisfies prescribed implementation constraints. These heuristics require the Control/Data Flow Graph (CDFG), a representation of all operations and their dependencies, which must be stored in its entirety and therefore uses large amounts of memory.
This thesis presents a new scheduling approach for the HLS tool chain. The new technique schedules operations using an algorithm that operates on a reduced representation of the graph, which does not need to retain individual dependency information to generate a schedule. By using the simplified graph, the complexity of scheduling is significantly reduced, resulting in lower memory usage and lower computational effort. This new scheduler is implemented and compared to the existing scheduler in the open-source version of the LegUp HLS tool. The results demonstrate an average 16x speedup in the time required to determine the schedule, with only a fraction of the memory usage (1/5 on average). All of this is achieved at a cost of only 0 to 6% added final hardware execution time.
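As a rough illustration of why a reduced graph representation helps, consider the following sketch. It is hypothetical, not the thesis's actual algorithm: the full CDFG is collapsed to per-level operation counts (each level only depends on the levels before it), so a resource-constrained schedule can be computed without storing any dependency edges.

```python
def schedule_reduced(levels, unit_count):
    """Greedy resource-constrained scheduling over a reduced graph.

    Instead of a full CDFG with explicit dependency edges, we keep only
    the number of operations at each dependency depth (`levels[i]` is
    the operation count at depth i). Returns a list of
    (cycle, level, ops_started) tuples.
    """
    schedule = []
    cycle = 0
    for level, n_ops in enumerate(levels):
        # All ops at this depth depend only on earlier depths, so they
        # may start once every earlier depth has been fully issued.
        remaining = n_ops
        while remaining > 0:
            started = min(remaining, unit_count)
            schedule.append((cycle, level, started))
            remaining -= started
            cycle += 1
    return schedule

# 3 ops at depth 0 and 2 ops at depth 1, with 2 functional units:
print(schedule_reduced([3, 2], 2))  # [(0, 0, 2), (1, 0, 1), (2, 1, 2)]
```

A conventional list scheduler must track every edge of the CDFG; here the memory footprint is proportional to the number of levels only, which is the kind of reduction the abstract describes, at the price of being more conservative than edge-accurate scheduling.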
Platform-Aware FPGA System Architecture Generation based on MLIR
FPGA acceleration is becoming increasingly important to meet the performance
demands of modern computing, particularly in big data or machine learning
applications. As such, significant effort is being put into the optimization of
the hardware accelerators. However, integrating accelerators into modern FPGA
platforms, with key features such as high bandwidth memory (HBM), requires
manual effort from a platform expert for every new application. We propose the
Olympus multi-level intermediate representation (MLIR) dialect and Olympus-opt,
a series of analysis and transformation passes on this dialect, for
representing and optimizing platform aware system level FPGA architectures. By
leveraging MLIR, our automation is extensible and reusable across many input
sources and many platform-specific back-ends.
Comment: Accepted for presentation at the CPS workshop 2023
(http://www.cpsschool.eu/cps-workshop)
Iris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization
Optimizing data movements is becoming one of the biggest challenges in
heterogeneous computing to cope with the data deluge and, consequently, big data
applications. When creating specialized accelerators, modern high-level
synthesis (HLS) tools are increasingly efficient in optimizing the
computational aspects, but data transfers have not been adequately improved. To
combat this, novel architectures such as High-Bandwidth Memory with wider data
busses have been developed so that more data can be transferred in parallel.
Designers must tailor their hardware/software interfaces to fully exploit the
available bandwidth. HLS tools can automate this process, but the designer must
follow strict coding-style rules. If the bus width is not evenly divisible by
the data width (e.g., when using custom-precision data types) or if the arrays
are not power-of-two length, the HLS-generated accelerator will likely not
fully utilize the available bandwidth, demanding even more manual effort from
the designer. We propose a methodology to automatically find and implement a
data layout that, when streamed between memory and an accelerator, uses a
higher percentage of the available bandwidth than a naive or HLS-optimized
design. We borrow concepts from multiprocessor scheduling to achieve such high
efficiency.
Comment: Accepted for presentation at ASPDAC'2
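To see why non-power-of-two data widths waste bandwidth, consider a back-of-the-envelope sketch. The numbers are illustrative, not taken from the paper: a 512-bit memory bus carrying 12-bit elements, comparing a naive layout that pads each element to a 16-bit container against a dense layout that packs elements back to back.

```python
def bus_utilization(bus_bits, elem_bits, padded_bits=None):
    """Fraction of bus bandwidth carrying payload per bus beat.

    A naive layout rounds each element up to a power-of-two container
    of `padded_bits`; a dense layout (padded_bits=None) packs elements
    back to back across the full bus word.
    """
    container = elem_bits if padded_bits is None else padded_bits
    elems_per_beat = bus_bits // container
    return elems_per_beat * elem_bits / bus_bits

# 12-bit fixed-point elements on a 512-bit bus:
naive = bus_utilization(512, 12, padded_bits=16)  # 32 elems/beat -> 0.75
dense = bus_utilization(512, 12)                  # 42 elems/beat -> ~0.98
```

In this toy example the dense layout moves 42 elements per bus beat instead of 32, raising utilization from 75% to about 98%; automatically finding such a layout, and generating the matching packing/unpacking logic on both sides of the stream, is the problem the abstract describes.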
Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics
Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts.
In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example.
Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth.
We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations.
Compiler Infrastructure for Specializing Domain-Specific Memory Templates
Specialized hardware accelerators are becoming important for more and more
applications. Thanks to specialization, they can achieve high performance and
energy efficiency, but their design is complex and time-consuming. This problem
is exacerbated when large amounts of data must be processed, like in modern big
data and machine learning applications. The designer must not only optimize
the accelerator logic but also produce efficient memory architectures. To
simplify this process, we propose a multi-level compilation flow that
specializes a domain-specific memory template to match data, application, and
technology requirements.
From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics
Many applications increasingly require numerical simulations for solving complex problems. Most of these numerical algorithms are massively parallel and often implemented on parallel high-performance computers. However, classic CPU-based platforms suffer due to the demand for higher resolutions and the exponential growth of data. FPGAs offer a powerful and flexible alternative that can host accelerators to complement such platforms. Developing such application-specific accelerators is still challenging because it is hard to provide efficient code for hardware synthesis. In this paper, we study the challenges of porting a numerical simulation kernel onto FPGA.
We propose an automated tool flow from a domain-specific language (DSL) to generate accelerators for computational fluid dynamics on FPGA. Our DSL-based flow simplifies the exploration of parameters and constraints such as on-chip memory usage.
We also propose a decoupled optimization of memory and logic resources, which allows us to better use the limited FPGA resources.
In our preliminary evaluation, this enabled doubling the number of parallel kernels, increasing the accelerator speedup versus ARM execution from 7 to 12 times.