8 research outputs found

    Joint Optimization of Low-power DCT Architecture and Efficient Quantization Technique for Embedded Image Compression

    The Discrete Cosine Transform (DCT)-based image compression is widely used in today's communication systems. Significant research devoted to this domain has demonstrated that the optical compression methods can offer a higher speed but suffer from bad image quality and a growing complexity. To meet the challenges of higher image quality and high speed processing, in this chapter, we present a joint system for DCT-based image compression by combining a VLSI architecture of the DCT algorithm and an efficient quantization technique. Our approach is, firstly, based on a new granularity method in order to take advantage of the adjacent pixel correlation of the input blocks and to improve the visual quality of the reconstructed image. Second, a new architecture based on the Canonical Signed Digit and a novel Common Subexpression Elimination technique is proposed to replace the constant multipliers. Finally, a reconfigurable quantization method is presented to effectively save the computational complexity. Experimental results obtained with a prototype based on FPGA implementation and comparisons with existing works corroborate the validity of the proposed optimizations in terms of power reduction, speed increase, silicon area saving and PSNR improvement.

    Investigation of a Novel Common Subexpression Elimination Method for Low Power and Area Efficient DCT Architecture

    There is wide interest in finding a low-power and area-efficient hardware design of the discrete cosine transform (DCT) algorithm. This research work proposes a novel Common Subexpression Elimination (CSE) based pipelined architecture for the DCT, aimed at reducing the cost metrics of power and area while maintaining high speed and accuracy in DCT applications. The proposed design combines Canonical Signed Digit (CSD) representation and CSE to implement a multiplier-less method for fixed constant multiplication of DCT coefficients. Furthermore, symmetry in the DCT coefficient matrix is used with CSE to further decrease the number of arithmetic operations. This architecture needs a single-port memory to feed the inputs instead of a multiport memory, which reduces hardware cost and area. From the analysis of experimental results and performance comparisons, it is observed that the proposed scheme uses minimal logic, utilizing a mere 340 slices and 22 adders. Moreover, this design meets the real-time constraints of different video/image coders and peak-signal-to-noise-ratio (PSNR) requirements. The proposed technique also has significant advantages over recent well-known methods, along with accuracy, in terms of power reduction, silicon area usage, and maximum operating frequency by 41%, 15%, and 15%, respectively.
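The multiplier-less idea in this abstract can be sketched in a few lines. The example below is only an illustration under stated assumptions: the constants 45 and 21 are hypothetical and are not DCT coefficients from the paper, and the function names are invented here. It shows how a shared subexpression (5x) lets two constant products be built from three additions instead of five.

```python
def multiply_naive(x: int) -> tuple[int, int]:
    """Shift-and-add products 45*x and 21*x, one term per nonzero bit."""
    # 45 = 0b101101 has four nonzero bits, so three additions are needed.
    y45 = (x << 5) + (x << 3) + (x << 2) + x
    # 21 = 0b10101 has three nonzero bits, so two additions are needed.
    y21 = (x << 4) + (x << 2) + x
    return y45, y21


def multiply_cse(x: int) -> tuple[int, int]:
    """Same products, sharing the common bit pattern 101 (that is, 5*x)."""
    t = (x << 2) + x      # shared subexpression 5*x, one addition
    y45 = (t << 3) + t    # 40*x + 5*x, one addition
    y21 = (t << 2) + x    # 20*x + x, one addition
    return y45, y21
```

Both functions return the same pair of products; the second reuses `t`, which is the kind of saving that CSE looks for across the rows of a coefficient matrix.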

    Energy efficient hardware acceleration of multimedia processing tools

    The world of mobile devices is experiencing an ongoing trend of feature enhancement and general-purpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks. Based on the survey that this thesis presents on modern video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work in this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high level techniques such as redundant computation elimination, parallelism and low switching computation structures. Both architectures compare favourably against the relevant prior art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation: both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search space reductions. Results show an improvement on state of the art algorithms with future potential for even greater savings.

    Compiling dataflow graphs into hardware

    Department Head: L. Darrell Whitley. 2005 Fall. Includes bibliographical references (pages 121-126).
    Conventional computers are programmed by supplying a sequence of instructions that perform the desired task. A reconfigurable processor is "programmed" by specifying the interconnections between hardware components, thereby creating a "hardwired" system to do the particular task. For some applications such as image processing, reconfigurable processors can produce dramatic execution speedups. However, programming a reconfigurable processor is essentially a hardware design discipline, making programming difficult for application programmers who are only familiar with software design techniques. To bridge this gap, a programming language, called SA-C (Single Assignment C, pronounced "sassy"), has been designed for programming reconfigurable processors. The process involves two main steps. First, the SA-C compiler analyzes the input source code and produces a hardware-independent intermediate representation of the program, called a dataflow graph (DFG). Second, this DFG is combined with hardware-specific information to create the final configuration. This dissertation describes the design and implementation of a system that performs the DFG to hardware translation. The DFG is broken up into three sections: the data generators, the inner loop body, and the data collectors. The second of these, the inner loop body, is used to create a computational structure that is unique for each program. The other two sections are implemented by using prebuilt modules, parameterized for the particular problem. Finally, a "glue module" is created to connect the various pieces into a complete interconnection specification. The dissertation also explores optimizations that can be applied while processing the DFG, to improve performance. A technique for pipelining the inner loop body is described that uses an estimation tool for the propagation delay of the nodes within the dataflow graph. A scheme is also described that identifies subgraphs within the dataflow graph that can be replaced with lookup tables. The lookup tables provide a faster implementation than random logic in some instances.
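The lookup-table replacement mentioned at the end of this abstract can be pictured with a small software analogy. This is a sketch only: the `subgraph` function below is a made-up chain of operations on an 8-bit value, not taken from the dissertation, and the names are hypothetical. The point is that a side-effect-free subgraph with a small input domain can be evaluated once for every input and then replaced by a table read.

```python
def subgraph(pixel: int) -> int:
    """Hypothetical chain of dataflow nodes acting on an 8-bit input."""
    t = (pixel * 3 + 7) >> 1      # multiply, add, shift nodes
    return min(255, t ^ (t >> 2))  # xor node followed by saturation


# Precompute the result for all 256 possible 8-bit inputs.
LUT = [subgraph(v) for v in range(256)]


def subgraph_lut(pixel: int) -> int:
    """Table-based replacement: one indexed read instead of several nodes."""
    return LUT[pixel]
```

The table only pays off when the input width is small, since its size grows as 2 to the power of the number of input bits.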

    VLSI implementation of discrete cosine transform using a new asynchronous pipelined architecture.

    Lee Chi-wai. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 191-196). Abstracts in English and Chinese.

    Contents:
    Abstract
    Abstract (in Chinese)
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
      1.1 Synchronous Design
      1.2 Asynchronous Design
      1.3 Discrete Cosine Transform
      1.4 Motivation
      1.5 Organization of the Thesis
    Chapter 2. Asynchronous Design Methodology
      2.1 Overview
      2.2 Background
      2.3 Past Designs
      2.4 Micropipeline
      2.5 New Asynchronous Architecture
    Chapter 3. DCT/IDCT Processor Design Methodology
      3.1 Overview
      3.2 Hardware Architecture
      3.3 DCT Algorithm
      3.4 Used Architecture and DCT Algorithm
        3.4.1 Implementation on Programmable DSP Processor
        3.4.2 Implementation on Dedicated Processor
    Chapter 4. New Techniques for Operating Dynamic Logic in Low Frequency
      4.1 Overview
      4.2 Background
      4.3 Traditional Technique
      4.4 New Technique - Refresh Control Circuit
        4.4.1 Principle
        4.4.2 Voltage Sensor
        4.4.3 Ring Oscillator
        4.4.4 Counter, Latch and Comparator
        4.4.5 Recalibrate Circuit
        4.4.6 Operation Monitoring Circuit
        4.4.7 Overall Circuit
    Chapter 5. DCT Implementation on Programmable DSP Processor
      5.1 Overview
      5.2 Processor Architecture
        5.2.1 Arithmetic Unit
        5.2.2 Switching Network
        5.2.3 FIFO Memory
        5.2.4 Instruction Memory
      5.3 Programming
      5.4 DCT Implementation
    Chapter 6. DCT Implementation on Dedicated DCT Processor
      6.1 Overview
      6.2 DCT Chip Architecture
        6.2.1 1D DCT Core
          6.2.1.1 Core Architecture
          6.2.1.2 Flow of Operation
          6.2.1.3 Data Replicator
          6.2.1.4 DCT Coefficients Memory
        6.2.2 Combination of IDCT to 1D DCT core
        6.2.3 Accuracy
      6.3 Transpose Memory
        6.3.1 Architecture
        6.3.2 Address Generator
        6.3.3 RAM Block
    Chapter 7. Results and Discussions
      7.1 Overview
      7.2 Refresh Control Circuit
        7.2.1 Implementation Results and Performance
        7.2.2 Discussion
      7.3 Programmable DSP Processor
        7.3.1 Implementation Results and Performance
        7.3.2 Discussion
      7.4 1D DCT/IDCT Core
        7.4.1 Simulation Results
        7.4.2 Measurement Results
        7.4.3 Discussion
      7.5 Transpose Memory
        7.5.1 Simulated Results
        7.5.2 Measurement Results
        7.5.3 Discussion
    Chapter 8. Conclusions
    Appendix
      Operations of switches in DCT implementation of programmable DSP processor
      C Program for evaluating the error in DCT/IDCT core
      Pin Assignments of the Programmable DSP Processor Chip
      Pin Assignments of the 1D DCT/IDCT Core Chip
      Pin Assignments of the Transpose Memory Chip
      Chip Microphotograph of the 1D DCT/IDCT Core
      Chip Microphotograph of the Transpose Memory
      Measured Waveforms of 1D DCT/IDCT Chip
      Measured Waveforms of Transpose Memory Chip
      Schematics of Refresh Control Circuit
      Schematics of Programmable DSP Processor
      Schematics of 1D DCT/IDCT Core
      Schematics of Transpose Memory
    References
    Design Libraries - CD-ROM

    Optimized Architecture Using a Novel Subexpression Elimination on Loeffler Algorithm for DCT-Based Image Compression

    The canonical signed digit (CSD) representation of constant coefficients is a unique signed data representation containing the fewest number of nonzero bits. Consequently, for constant multipliers, the number of additions and subtractions is minimized by CSD representation of constant coefficients. This technique is mainly used for finite impulse response (FIR) filters, where it reduces the number of partial products. In this paper, we use CSD with a novel common subexpression elimination (CSE) scheme on the optimal Loeffler algorithm for the computation of the discrete cosine transform (DCT). To meet the challenges of low-power and high-speed processing, we present an optimized image compression scheme based on the two-dimensional DCT. Finally, a novel and simple reconfigurable quantization method combined with DCT computation is presented to effectively reduce the computational complexity. We present here a new DCT architecture based on the proposed technique. From the experimental results obtained from the FPGA prototype we find that the proposed design has several advantages in terms of power reduction, speed performance, and saving of silicon area, along with PSNR improvement, over the existing designs as well as the Xilinx core.
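As a rough illustration of the CSD property described in this abstract, the sketch below converts a non-negative integer into signed digits drawn from -1, 0, +1 such that no two adjacent digits are nonzero. It uses the standard non-adjacent-form recurrence; the function names are chosen here for illustration and do not come from the paper.

```python
def to_csd(n: int) -> list[int]:
    """CSD digits of a non-negative integer, least significant digit first."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 gives digit +1; n mod 4 == 3 gives digit -1,
            # which forces the next digit to be zero.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def from_csd(digits: list[int]) -> int:
    """Rebuild the integer from its CSD digits (least significant first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))
```

For example, 7 is 111 in plain binary (three nonzero bits, two additions when used as a multiplier constant), while its CSD form is 8 - 1 (two nonzero digits, one subtraction).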
