Search CORE

4,032 research outputs found

High-level synthesis optimization for blocked floating-point matrix multiplication

Author: D'Hollander Erik
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016
Field of study

In the last decade floating-point matrix multiplication on FPGAs has been studied extensively and efficient architectures as well as detailed performance models have been developed. By design these IP cores take a fixed footprint which not necessarily optimizes the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An\ exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme which maximizes the data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices

Ghent University Academic Bibliography

BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing

Author: Rasnayake Lahiru
Sjalander Magnus
Umuroglu Yaman
Publication venue
Publication date: 01/01/2018
Field of study

Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. We present BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing. BISMO utilizes the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We characterize the resource usage and performance of BISMO across a range of parameters to build a hardware cost model, and demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.Comment: To appear at FPL'1

arXiv.org e-Print Archive

Crossref

NORA - Norwegian Open Research Archives

Efficient FPGA implementation of high-throughput mixed radix multipath delay commutator FFT processor for MIMO-OFDM

Author: A. AMIRA
A. GUESSOUM
Ayinala
Bingham
Boopal
Chen
Fu
Garrido
Garrido
Gesbert
Li
Lin
Lin
M. DALI
N. RAMZAN
R. M. GIBSON
Sampath
Shousheng He
Shousheng He
Song-Nien Tang
Swartzlander
Tang
Tsai
Uzun
Wang
Wang
Yang
Yu-Wei Lin
Publication venue: 'Universitatea Stefan cel Mare din Suceava'
Publication date: 01/01/2017
Field of study

This article presents and evaluates pipelined architecture designs for an improved high-frequency Fast Fourier Transform (FFT) processor implemented on Field Programmable Gate Arrays (FPGA) for Multiple Input Multiple Output Orthogonal Frequency Division Multiplexing (MIMO-OFDM). The architecture presented is a Mixed-Radix Multipath Delay Commutator. The presented parallel architecture utilizes fewer hardware resources compared to Radix-2 architecture, while maintaining simple control and butterfly structures inherent to Radix-2 implementations. The high-frequency design presented allows enhancing system throughput without requiring additional parallel data paths common in other current approaches, the presented design can process two and four independent data streams in parallel and is suitable for scaling to any power of two FFT size N. FPGA implementation of the architecture demonstrated significant resource efficiency and high-throughput in comparison to relevant current approaches within literature. The proposed architecture designs were realized with Xilinx System Generator (XSG) and evaluated on both Virtex-5 and Virtex-7 FPGA devices. Post place and route results demonstrated maximum frequency values over 400 MHz and 470 MHz for Virtex-5 and Virtex-7 FPGA devices respectively

Crossref

Directory of Open Access Journals

Research Repository and Portal - University of the West of Scotland

ResearchOnline@GCU

FIR filter optimization for video processing on FPGAs

Author
Publication venue: Springer
Publication date: 25/05/2013
Field of study

Springer - Publisher Connector

Memory-efficient and fast run-time reconfiguration of regularly structured designs

Author: Al Farisi Brahim
Bruneel Karel
Heyse Karel
Stroobandt Dirk
Publication venue
Publication date: 01/01/2011
Field of study

Previous work has shown that run-time reconfiguration of FPGAs benefits greatly from the use of Tunable LUT (TLUT) circuits. These can be rapidly transformed into a specialized LUT circuit and are also very memory efficient when representing regularly structured designs, where the same hardware module is instantiated many times. However, the memory requirements and reconfiguration time of a run-time reconfigurable application are also dependent on the reconfiguration mechanism. In this paper, we will show that the memory requirements of conventional ICAP reconfiguration grow very fast with the number of modules, resulting in excessive memory usage. We propose to use Shift-Register-LUT (SRL) reconfiguration which is faster and results in a memory usage that is independent of the number of modules

Crossref

Ghent University Academic Bibliography