Empowering parallel computing with field programmable gate arrays
After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminating the instruction overhead and to optimizing the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.
Development of Lifting-based VLSI Architectures for Two-Dimensional Discrete Wavelet Transform
Two-dimensional discrete wavelet transform (2-D DWT) has evolved as an essential
part of a modern compression system. It offers superior compression with good image
quality and overcomes the disadvantage of the discrete cosine transform, which suffers
from blocking artifacts that reduce the quality of the image. The amount of
computation involved in 2-D DWT is enormous and cannot be handled by general-purpose
processors when real-time processing is required. Therefore, a high-speed,
low-power VLSI architecture that computes 2-D DWT effectively is needed. In this
research, several VLSI architectures have been developed that meet real-time
requirements for 2-D DWT applications. This research initially started off by
implementing a software simulation program that decorrelates the original image and
reconstructs the original image from the decorrelated image. Then, based on the
information gained from implementing the simulation program, a new approach for
designing lifting-based VLSI architectures for 2-D forward DWT is introduced. As a
result, two high-performance VLSI architectures that perform 2-D DWT for 5/3 and
9/7 filters are developed based on overlapped and nonoverlapped scan methods. Then,
the intermediate architecture is developed, which aims at reducing the power
consumption of the overlapped areas without using the expensive line buffer. In order
to best meet real-time applications of 2-D DWT with demanding requirements in
terms of speed and throughput, parallelism is explored. The single pipelined
intermediate and overlapped architectures are extended to 2-, 3-, and 4-parallel
architectures to achieve speed factors of 2, 3, and 4, respectively. To further
demonstrate the effectiveness of the approach, single and parallel VLSI architectures
for 2-D inverse discrete wavelet transform (2-D IDWT) are developed. Furthermore,
2-D DWT memory architectures, which have been overlooked in the literature, are
also developed. Finally, to show that the architectural models developed for 2-D DWT are
simple to control, the control algorithms for the 4-parallel architecture based on the first
scan method are developed. To validate the architectures developed in this work, five
architectures are implemented and simulated on an Altera FPGA.
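As a rough software illustration of the lifting idea mentioned in this abstract, the sketch below implements one level of the reversible integer 5/3 lifting transform (the variant standardized in JPEG 2000) with symmetric boundary extension, plus a separable row-then-column 2-D pass. Whether the thesis uses exactly this integer variant and boundary handling is an assumption; the function names `fwd53`, `inv53`, and `fwd53_2d` are hypothetical and chosen only for this example. Even-length inputs are assumed.

```python
def fwd53(x):
    """One level of the integer 5/3 lifting transform on an even-length list.

    Returns (s, d): low-pass (approximation) and high-pass (detail) halves.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "sketch assumes an even-length signal"
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Predict step: each odd sample minus the floor-average of its even neighbours.
    # At the right edge the missing neighbour mirrors to the last even sample.
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
         for i in range(half)]
    # Update step: each even sample plus a rounded quarter of adjacent details.
    # At the left edge the missing detail mirrors to d[0].
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s, d


def inv53(s, d):
    """Inverse of fwd53: undo the update step, then undo the predict step."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x


def fwd53_2d(img):
    """Separable one-level 2-D transform: rows first, then columns.

    Returns four subbands. Subband naming conventions vary between texts;
    here the first letter refers to the row filter, the second to the column filter.
    """
    rows = [fwd53(list(r)) for r in img]
    low = [s for s, _ in rows]
    high = [d for _, d in rows]

    def columns(block):
        cols = [fwd53(list(c)) for c in zip(*block)]
        s_part = [list(r) for r in zip(*[s for s, _ in cols])]
        d_part = [list(r) for r in zip(*[d for _, d in cols])]
        return s_part, d_part

    ll, lh = columns(low)
    hl, hh = columns(high)
    return ll, lh, hl, hh
```

Because each lifting step only adds or subtracts a function of the other half of the samples, the inverse simply replays the steps in reverse order with opposite signs, which is why `inv53(*fwd53(x))` returns `x` exactly for integer input.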
In compliance with the terms of the Copyright Act 1987 and the IP Policy of the
university, the copyright of this thesis has been reassigned by the author to the legal
entity of the university, Institute of Technology PETRONAS Sdn Bhd.
Due acknowledgement shall always be made of the use of any material contained
in, or derived from, this thesis.
High-level synthesis optimization for blocked floating-point matrix multiplication
In the last decade, floating-point matrix multiplication on FPGAs has been studied extensively, and efficient architectures as well as detailed performance models have been developed. By design these IP cores take a fixed footprint, which does not necessarily optimize the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme, which maximizes the data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices.
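The block-ordering idea in this abstract can be sketched in plain software as follows. The abstract does not give the exact ordering, so this example assumes a reflected mixed-radix Gray sequence over block indices `(k, i, j)`, in which consecutive block products differ in exactly one index. The helper names (`reflected_order`, `count_block_loads`, `blocked_matmul`) are hypothetical, the "FPGA" block product is a plain Python stand-in, and the load count assumes the device keeps only the most recent A block and B block.

```python
from itertools import product


def reflected_order(radices):
    """Yield index tuples so that neighbours differ in one position by +/-1."""
    if not radices:
        yield ()
        return
    head, tail = radices[0], radices[1:]
    for digit in range(head):
        sub = list(reflected_order(tail))
        if digit % 2 == 1:
            sub.reverse()  # reflect every other pass to avoid a wrap-around jump
        for rest in sub:
            yield (digit,) + rest


def count_block_loads(order):
    """Count operand-block transfers when only the last A and B block are cached."""
    loads = 0
    prev_a = prev_b = None
    for k, i, j in order:
        a_id, b_id = (i, k), (k, j)
        loads += (a_id != prev_a) + (b_id != prev_b)
        prev_a, prev_b = a_id, b_id
    return loads


def matmul(a, b):
    """Dense product of two list-of-lists matrices (stand-in for the block kernel)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block(m, r, c, bs):
    """Extract the bs-by-bs block at block-row r, block-column c."""
    return [row[c * bs:(c + 1) * bs] for row in m[r * bs:(r + 1) * bs]]


def blocked_matmul(a, b, bs, order):
    """Multiply square matrices block by block, accumulating partial products."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for k, i, j in order:
        partial = matmul(block(a, i, k, bs), block(b, k, j, bs))  # "device" side
        for r in range(bs):                                       # "host" side sum
            for s in range(bs):
                c[i * bs + r][j * bs + s] += partial[r][s]
    return c


def lexicographic_order(radices):
    """Baseline ordering: plain nested loops with the last index fastest."""
    return product(*(range(r) for r in radices))
```

Comparing `count_block_loads(reflected_order((nb, nb, nb)))` with `count_block_loads(lexicographic_order((nb, nb, nb)))` for a matrix split into `nb` by `nb` blocks shows fewer operand transfers for the reflected ordering, because the plain nested loops change two indices at once whenever an inner loop wraps around.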
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.