Search CORE

2,002 research outputs found

A general framework for efficient FPGA implementation of matrix product

Author: Amira A.
Bensaali F.
Sotudeh R.
Publication venue
Publication date: 01/01/2007
Field of study

Original article can be found at: http://www.medjcn.com/ Copyright Softmotor LimitedHigh performance systems are required by the developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Filed-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high performance systems at an economical price. Given the importance and the use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for matrix algorithm cores generation is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, ranging in three different matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general purpose or a specific application core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles while the second one is a parallel floating-point matrix multiplier designed for 3D affine transformations.Peer reviewe

University of Hertfordshire Research Archive

Recommended from our members

Efficient FPGA implementation and power modelling of image and signal processing IP cores

Author: Chandrasekaran Shrutisagar
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2007
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Field Programmable Gate Arrays (FPGAs) are the technology of choice in a number ofimage and signal processing application areas such as consumer electronics, instrumentation, medical data processing and avionics due to their reasonable energy consumption, high performance, security, low design-turnaround time and reconfigurability. Low power FPGA devices are also emerging as competitive solutions for mobile and thermally constrained platforms. Most computationally intensive image and signal processing algorithms also consume a lot of power leading to a number of issues including reduced mobility, reliability concerns and increased design cost among others. Power dissipation has become one of the most important challenges, particularly for FPGAs. Addressing this problem requires optimisation and awareness at all levels in the design flow. The key achievements of the work presented in this thesis are summarised here. Behavioural level optimisation strategies have been used for implementing matrix product and inner product through the use of mathematical techniques such as Distributed Arithmetic (DA) and its variations including offset binary coding, sparse factorisation and novel vector level transformations. Applications to test the impact of these algorithmic and arithmetic transformations include the fast Hadamard/Walsh transforms and Gaussian mixture models. Complete design space exploration has been performed on these cores, and where appropriate, they have been shown to clearly outperform comparable existing implementations. At the architectural level, strategies such as parallelism, pipelining and systolisation have been successfully applied for the design and optimisation of a number of cores including colour space conversion, finite Radon transform, finite ridgelet transform and circular convolution. A pioneering study into the influence of supply voltage scaling for FPGA based designs, used in conjunction with performance enhancing strategies such as parallelism and pipelining has been performed. Initial results are very promising and indicated significant potential for future research in this area. A key contribution of this work includes the development of a novel high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called Functional Level Power Analysis and Modelling (FLPAM). FLPAM is scalable, platform independent and compares favourably with existing approaches. A hybrid, top-down design flow paradigm integrating FLPAM with commercially available design tools for systematic optimisation of IP cores has also been developed

Brunel University Research Archive

Design & Implementation of FPGA-based Multi-standard Software Radio Receiver

Author: Alam Muhammad Mahtab
Awan Mehmood
Publication venue: Aalborg University
Publication date: 01/01/2007
Field of study

VBN

Joint Optimization of Low-power DCT Architecture and Effcient Quantization Technique for Embedded Image Compression

Author: Alfalou Ayman
Jridi Maher
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/09/2010
Field of study

International audienceThe Discrete Cosine Transform (DCT)-based image com- pression is widely used in today's communication systems. Signi cant research devoted to this domain has demonstrated that the optical com- pression methods can o er a higher speed but su er from bad image quality and a growing complexity. To meet the challenges of higher im- age quality and high speed processing, in this chapter, we present a joint system for DCT-based image compression by combining a VLSI archi- tecture of the DCT algorithm and an e cient quantization technique. Our approach is, rstly, based on a new granularity method in order to take advantage of the adjacent pixel correlation of the input blocks and to improve the visual quality of the reconstructed image. Second, a new architecture based on the Canonical Signed Digit and a novel Common Subexpression Elimination technique is proposed to replace the constant multipliers. Finally, a recon gurable quantization method is presented to e ectively save the computational complexity. Experimental results obtained with a prototype based on FPGA implementation and com- parisons with existing works corroborate the validity of the proposed optimizations in terms of power reduction, speed increase, silicon area saving and PSNR improvement

HAL Descartes

Hal-Diderot

Design of approximate overclocked datapath

Author: Shi Kan
Publication venue: Electrical and Electronic Engineering, Imperial College London
Publication date: 01/03/2016
Field of study

Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design. In this thesis, we describe an alternative circuit design methodology when considering trade-offs between accuracy, performance and silicon area. We compare two different approaches that could trade accuracy for performance. One is the traditional approach where the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantisation errors. Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit serial operation, and synthesize unrolled digit parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, these result in errors in least significant digits with online arithmetic, causing less impact than conventional implementations.Open Acces

Spiral - Imperial College Digital Repository

High Speed and Low Latency ECC Implementation over GF(2m) on FPGA

Author: Benaissa M.
Khan Z.U.A.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

In this paper, a novel high-speed elliptic curve cryptography (ECC) processor implementation for point multiplication (PM) on field-programmable gate array (FPGA) is proposed. A new segmented pipelined full-precision multiplier is used to reduce the latency, and the Lopez-Dahab Montgomery PM algorithm is modified for careful scheduling to avoid data dependency resulting in a drastic reduction in the number of clock cycles (CCs) required. The proposed ECC architecture has been implemented on Xilinx FPGAs' Virtex4, Virtex5, and Virtex7 families. To the best of our knowledge, our single- and three-multiplier-based designs show the fastest performance to date when compared with reported works individually. Our one-multiplier-based ECC processor also achieves the highest reported speed together with the best reported area-time performance on Virtex4 (5.32 μs at 210 MHz), on Virtex5 (4.91 μs at 228 MHz), and on the more advanced Virtex7 (3.18 μs at 352 MHz). Finally, the proposed three-multiplier-based ECC implementation is the first work reporting the lowest number of CCs and the fastest ECC processor design on FPGA (450 CCs to get 2.83 μs on Virtex7)

Crossref

White Rose Research Online

Modified DA based FIR Filter in Multirate DSP systems on FPGA

Author: Mrs. Bhagyalakshmi. N, Dr. Rekha K. R, Dr. Nataraj K. R
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 31/05/2016
Field of study

Multirate systems are popular in DSP.Systems which employ multiple sampling rates in the processing of digital signals are called Multirate DSP systems, which are used in audio, video processing and communication systems. Multirate DSP systems that employ different chips for different frequency signal results in more area and power utilization. The setback can be avoided by implementing Multirate system, based on Distributed Arithmetic FIR filter. Using such systems, we can achieve computation efficiency and improve the system performance. Modified DA based FIR Filter using Multirate systems includes decimation, interpolation process implemented on FPGA with 53% less LUT utilization compared to existing Multirate system

International Journal on Recent and Innovation Trends in Computing and Communication

FPGA BASED PARALLEL IMPLEMENTATION OF STACKED ERROR DIFFUSION ALGORITHM

Author: Kora Venugopal Rishvanth
Publication venue: UKnowledge
Publication date: 01/01/2010
Field of study

Digital halftoning is a crucial technique used in digital printers to convert a continuoustone image into a pattern of black and white dots. Halftoning is used since printers have a limited availability of inks and cannot reproduce all the color intensities in a continuous image. Error Diffusion is an algorithm in halftoning that iteratively quantizes pixels in a neighborhood dependent fashion. This thesis focuses on the development and design of a parallel scalable hardware architecture for high performance implementation of a high quality Stacked Error Diffusion algorithm. The algorithm is described in ‘C’ and requires a significant processing time when implemented on a conventional CPU. Thus, a new hardware processor architecture is developed to implement the algorithm and is implemented to and tested on a Xilinx Virtex 5 FPGA chip. There is an extraordinary decrease in the run time of the algorithm when run on the newly proposed parallel architecture implemented to FPGA technology compared to execution on a single CPU. The new parallel architecture is described using the Verilog Hardware Description Language. Post-synthesis and post-implementation, performance based Hardware Description Language (HDL), simulation validation of the new parallel architecture is achieved via use of the ModelSim CAD simulation tool

University of Kentucky

A high-performance inner-product processor for real and complex numbers.

Author: Wang Guoping.
Publication venue
Publication date: 01/01/2003
Field of study

A novel, high-performance fixed-point inner-product processor based on a redundant binary number system is investigated in this dissertation. This scheme decreases the number of partial products to 50%, while achieving better speed and area performance, as well as providing pipeline extension opportunities. When modified Booth coding is used, partial products are reduced by almost 75%, thereby significantly reducing the multiplier addition depth. The design is applicable for digital signal and image processing applications that require real and/or complex numbers inner-product arithmetic, such as digital filters, correlation and convolution. This design is well suited for VLSI implementation and can also be embedded as an inner-product core inside a general purpose or DSP FPGA-based processor. Dynamic control of the computing structure permits different computations, such as a variety of inner-product real and complex number computations, parallel multiplication for real and complex numbers, and real and complex number division. The same structure can also be controlled to accept redundant binary number inputs for multiplication and inner-product computations. An improved 2's-complement to redundant binary converter is also presented

SHAREOK repository

Symbol Synchronization for SDR Using a Polyphase Filterbank Based on an FPGA

Author: Fiala P.
Linhart R.
Publication venue: 'Brno University of Technology'
Publication date: 01/09/2015
Field of study

This paper is devoted to the proposal of a highly efficient symbol synchronization subsystem for Software Defined Radio. The proposed feedback phase-locked loop timing synchronizer is suitable for parallel implementation on an FPGA. The polyphase FIR filter simultaneously performs matched-filtering and arbitrary interpolation between acquired samples. Determination of the proper sampling instant is achieved by selecting a suitable polyphase filterbank using a derived index. This index is determined based on the output either the Zero-Crossing or Gardner Timing Error Detector. The paper will extensively focus on simulation of the proposed synchronization system. On the basis of this simulation, a complete, fully pipelined VHDL description model is created. This model is composed of a fully parallel polyphase filterbank based on distributed arithmetic, timing error detector and interpolation control block. Finally, RTL synthesis on an Altera Cyclone IV FPGA is presented and resource utilization in comparison with a conventional model is analyzed

Directory of Open Access Journals

Digital library of Brno University of Technology