53 research outputs found

    HEVC 2D-DCT architectures comparison for FPGA and ASIC implementations

    Get PDF
    This paper compares ASIC and FPGA implementations of two commonly used architectures for 2-dimensional discrete cosine transform (DCT), the parallel and folded architectures. The DCT has been designed for sizes 4x4, 8x8, and 16x16, and implemented on Silterra 180nm ASIC and Xilinx Kintex Ultrascale FPGA. The objective is to determine suitable low energy architectures to be used as their characteristics greatly differ in terms of cells usage, placement and routing methods on these platforms. The parallel and folded DCT architectures for all three sizes have been designed using Verilog HDL, including the basic serializer-deserializer input and output. Results show that for large size transform of 16x16, ASIC parallel architecture results in roughly 30% less energy compared to folded architecture. As for FPGAs, folded architecture results in roughly 34% less energy compared to parallel architecture. In terms of overall energy consumption between 180nm ASIC and Xilinx Ultrascale, ASIC implementation results in about 58% less energy compared to the FPGA

    Low energy video processing and compression hardware designs

    Get PDF
    Digital video processing and compression algorithms are used in many commercial products such as mobile devices, unmanned aerial vehicles, and autonomous cars. Increasing resolution of videos used in these commercial products increased computational complexities of digital video processing and compression algorithms. Therefore, it is necessary to reduce computational complexities of digital video processing and compression algorithms, and energy consumptions of digital video processing and compression hardware without reducing visual quality. In this thesis, we propose a novel adaptive 2D digital image processing algorithm for 2D median filter, Gaussian blur and image sharpening. We designed low energy 2D median filter, Gaussian blur and image sharpening hardware using the proposed algorithm. We propose approximate HEVC intra prediction and HEVC fractional interpolation algorithms. We designed low energy approximate HEVC intra prediction and HEVC fractional interpolation hardware. We also propose several HEVC fractional interpolation hardware architectures. We propose novel computational complexity and energy reduction techniques for HEVC DCT and inverse DCT/DST. We designed high performance and low energy hardware for HEVC DCT and inverse DCT/DST including the proposed techniques. VII We quantified computation reductions achieved and video quality loss caused by the proposed algorithms and techniques. We implemented the proposed hardware architectures in Verilog HDL. We mapped the Verilog RTL codes to Xilinx Virtex 6 and Xilinx ZYNQ FPGAs, and estimated their power consumptions using Xilinx XPower Analyzer tool. The proposed algorithms and techniques significantly reduced the power and energy consumptions of these FPGA implementations in some cases with no PSNR loss and in some cases with very small PSNR loss

    A Comparative Performance of Discrete Wavelet Transform Implementations Using Multiplierless

    Get PDF
    Using discrete wavelet transform (DWT) in high-speed signal-processing applications imposes a high degree of care to hardware resource availability, latency, and power consumption. In this chapter, the design aspects and performance of multiplierless DWT is analyzed. We presented the two key multiplierless approaches, namely the distributed arithmetic algorithm (DAA) and the residue number system (RNS). We aim to estimate the performance requirements and hardware resources for each approach, allowing for the selection of proper algorithm and implementation of multi-level DAA- and RNS-based DWT. The design has been implemented and synthesized in Xilinx Virtex 6 ML605, taking advantage of Virtex 6’s embedded block RAMs (BRAMs)

    An FPGA Implementation of HW/SW Codesign Architecture for H.263 Video Coding

    Get PDF
    Chapitre 12 http://www.intechopen.com/download/pdf/pdfs_id/1574

    FGPA implementations of motion estimation algorithms using Vivado high level synthesis

    Get PDF
    Joint collaborative team on video coding (JCT-VC) recently developed a new international video compression standard called High Efficiency Video Coding (HEVC). HEVC has 50% better compression efficiency than previous H.264 video compression standard. HEVC achieves this video compression efficiency by significantly increasing the computational complexity. Motion estimation is the most computationally complex part of video encoders. Integer motion estimation and fractional motion estimation account for 70% of the computational complexity of an HEVC video encoder. High-level synthesis (HLS) tools are started to be successfully used for FPGA implementations of digital signal processing algorithms. They significantly decrease design and verification time. Therefore, in this thesis, we proposed the first FPGA implementation of HEVC full search motion estimation using Vivado HLS. Then, we proposed the first FPGA implementations of two fast search (diamond search and TZ search) algorithms using Vivado HLS. Finally, we proposed the first FPGA implementations of HEVC fractional interpolation and motion estimation using Vivado HLS. We used several HLS optimization directives to increase performance and decrease area of these FPGA implementations

    Design of 2D discrete cosine transform using CORDIC architectures in VHDL

    Get PDF
    The Discrete Cosine Transform is one of the most widely transform techniques in digital signal processing. In addition, this is also most computationally intensive transforms which require many multiplications and additions. Real time data processing necessitates the use of special purpose hardware which involves hardware efficiency as well as high throughput. Many DCT algorithms were proposed in order to achieve high speed DCT. Those architectures which involves multipliers, for example Chen’s algorithm has less regular architecture due to complex routing and requires large silicon area. On the other hand, the DCT architecture based on distributed arithmetic (DA) which is also a multiplier less architecture has the inherent disadvantage of less throughputs because of the ROM access time and the need of accumulator. Also this DA algorithm requires large silicon area if it requires large ROM size. Systolic array architecture for the real-time DCT computation may have the large number of gates and clock skew problem. The other ways of implementation of DCT which involves in multiplierless, thus power efficient and which results in regular architecture and less complicated routing, consequently less area, simultaneously lead to high throughput. So for that purpose CORDIC seems to be a best solution. CORDIC offers a unified iterative formulation to efficiently evaluate the rotation operation. This thesis presents the implementation of 2D Discrete Cosine Transform (DCT) using the Angle Recoded (AR) Cordic algorithm, the new scaling less CORDIC algorithm and the conventional Chen’s algorithm which is multiplier dependant algorithm. The 2D DCT is implemented by exploiting the Separability property of 2D Discrete Cosine Transform. Here first one dimensional DCT is designed first and later a transpose buffer which consists of 64 memory elements, fully pipelined is designed. Later all these blocks are joined with the help of a controller element which is a mealy type FSM which produces some status signals also. The three resulting architectures are all well synthesized in Xilinx 9.1ise, simulated in Modelsim 5.6f and the power is calculated in Xilinx Xpower. Results prove that AR Cordic algorithm is better than Chen’s algorithm, even the new scaling less CORDIC algorithm

    Efficient architectures for multidimensional discrete transforms in image and video processing applications

    Get PDF
    PhD ThesisThis thesis introduces new image compression algorithms, their related architectures and data transforms architectures. The proposed architectures consider the current hardware architectures concerns, such as power consumption, hardware usage, memory requirement, computation time and output accuracy. These concerns and problems are crucial in multidimensional image and video processing applications. This research is divided into three image and video processing related topics: low complexity non-transform-based image compression algorithms and their architectures, architectures for multidimensional Discrete Cosine Transform (DCT); and architectures for multidimensional Discrete Wavelet Transform (DWT). The proposed architectures are parameterised in terms of wordlength, pipelining and input data size. Taking such parameterisation into account, efficient non-transform based and low complexity image compression algorithms for better rate distortion performance are proposed. The proposed algorithms are based on the Adaptive Quantisation Coding (AQC) algorithm, and they achieve a controllable output bit rate and accuracy by considering the intensity variation of each image block. Their high speed, low hardware usage and low power consumption architectures are also introduced and implemented on Xilinx devices. Furthermore, efficient hardware architectures for multidimensional DCT based on the 1-D DCT Radix-2 and 3-D DCT Vector Radix (3-D DCT VR) fast algorithms have been proposed. These architectures attain fast and accurate 3-D DCT computation and provide high processing speed and power consumption reduction. In addition, this research also introduces two low hardware usage 3-D DCT VR architectures. Such architectures perform the computation of butterfly and post addition stages without using block memory for data transposition, which in turn reduces the hardware usage and improves the performance of the proposed architectures. Moreover, parallel and multiplierless lifting-based architectures for the 1-D, 2-D and 3-D Cohen-Daubechies-Feauveau 9/7 (CDF 9/7) DWT computation are also introduced. The presented architectures represent an efficient multiplierless and low memory requirement CDF 9/7 DWT computation scheme using the separable approach. Furthermore, the proposed architectures have been implemented and tested using Xilinx FPGA devices. The evaluation results have revealed that a speed of up to 315 MHz can be achieved in the proposed AQC-based architectures. Further, a speed of up to 330 MHz and low utilisation rate of 722 to 1235 can be achieved in the proposed 3-D DCT VR architectures. In addition, in the proposed 3-D DWT architecture, the computation time of 3-D DWT for data size of 144×176×8-pixel is less than 0.33 ms. Also, a power consumption of 102 mW at 50 MHz clock frequency using 256×256-pixel frame size is achieved. The accuracy tests for all architectures have revealed that a PSNR of infinite can be attained
    corecore