2,263 research outputs found

    Empowering parallel computing with field programmable gate arrays

    Get PDF
    After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements

    Development of Lifting-based VLSI Architectures for Two-Dimensional Discrete Wavelet Transform

    Get PDF
    Two-dimensional discrete wavelet transform (2-D DWT) has evolved as an essential part of a modem compression system. It offers superior compression with good image quality and overcomes disadvantage of the discrete cosine transform, which suffers from blocks artifacts that reduces the quality of the inage. The amount of computations involve in 2-D DWT is enormous and cannot be processed by generalpurpose processors when real-time processing is required. Th·"efore, high speed and low power VLSI architecture that computes 2-D DWT effectively is needed. In this research, several VLSI architectures have been developed that meets real-time requirements for 2-D DWT applications. This research iaitially started off by implementing a software simulation program that decorrelates the original image and reconstructs the original image from the decorrelated image. Then, based on the information gained from implementing the simulation program, a new approach for designing lifting-based VLSI architectures for 2-D forward DWT is introduced. As a result, two high performance VLSI architectures that perform 2-D DWT for 5/3 and 9/7 filters are developed based on overlapped and nonoverlapped scan methods. Then, the intermediate architecture is developed, which aim a·: reducing the power consumption of the overlapped areas without using the expensive line buffer. In order to best meet real-time applications of 2-D DWT with demanding requirements in terms of speed and throughput parallelism is explored. The single pipelined intermediate and overlapped architectures are extended to 2-, 3-, and 4-parallel architectures to achieve speed factors of 2, 3, and 4, respectively. To further demonstrate the effectiveness of the approach single and para.llel VLSI architectures for 2-D inverse discrete wavelet transform (2-D IDWT) are developed. Furthermore, 2-D DWT memory architectures, which have been overlooked in the literature, are also developed. Finally, to show the architectural models developed for 2-D DWT are simple to control, the control algorithms for 4-parallel architecture based on the first scan method is developed. To validate architectures develcped in this work five architectures are implemented and simulated on Altera FPGA. In compliance with the terms of the Copyright Act 1987 and the IP Policy of the university, the copyright of this thesis has been reassigned by the author to the legal entity of the university, Institute of Technology PETRONAS Sdn bhd. Due acknowledgement shall always be made of the use of any material contained in, or derived from, this thesis

    High-level synthesis optimization for blocked floating-point matrix multiplication

    Get PDF
    In the last decade floating-point matrix multiplication on FPGAs has been studied extensively and efficient architectures as well as detailed performance models have been developed. By design these IP cores take a fixed footprint which not necessarily optimizes the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An\ exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme which maximizes the data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS
    corecore