24 research outputs found

    A distributed arithmetic based CORDIC algorithm and its use in the FPGA implementation of the 2-D IDCT

    Get PDF
    The discrete cosine transform (DCT) based image compression techniques play an important role in today's digital applications. A video codec chip requires an integration of high-speed DCT and inverse DCT (IDCT) hardware units in a limited silicon space. This thesis presents a distributed arithmetic based CORDIC algorithm for the computation of the 1-D IDCT and an FPGA implementation of a cost-effective architecture for a 2-D IDCT processor using the proposed algorithm. The processor consisting of two 1-D IDCT cores, a transpose memory and a control logic block performs the 2-D IDCT computation by using the row-column decomposition approach. The basis of the proposed scheme is a combined use of the distributed arithmetic and the CORDIC algorithm in order to provide a small access time to the lookup tables and a reduced complexity for its architecture. In the proposed design, the deep pipeline structure of an existing CORDIC based architecture is replaced by much smaller DA-based ROM accumulators. In the proposed design, a bit-level digit-serial structure based on the redundant number system using an on-line algorithm is employed

    Joint Optimization of Low-power DCT Architecture and Effcient Quantization Technique for Embedded Image Compression

    Get PDF
    International audienceThe Discrete Cosine Transform (DCT)-based image com- pression is widely used in today's communication systems. Signi cant research devoted to this domain has demonstrated that the optical com- pression methods can o er a higher speed but su er from bad image quality and a growing complexity. To meet the challenges of higher im- age quality and high speed processing, in this chapter, we present a joint system for DCT-based image compression by combining a VLSI archi- tecture of the DCT algorithm and an e cient quantization technique. Our approach is, rstly, based on a new granularity method in order to take advantage of the adjacent pixel correlation of the input blocks and to improve the visual quality of the reconstructed image. Second, a new architecture based on the Canonical Signed Digit and a novel Common Subexpression Elimination technique is proposed to replace the constant multipliers. Finally, a recon gurable quantization method is presented to e ectively save the computational complexity. Experimental results obtained with a prototype based on FPGA implementation and com- parisons with existing works corroborate the validity of the proposed optimizations in terms of power reduction, speed increase, silicon area saving and PSNR improvement

    Energy efficient hardware acceleration of multimedia processing tools

    Get PDF
    The world of mobile devices is experiencing an ongoing trend of feature enhancement and generalpurpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks Based on the survey that this thesis presents on modem video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work m this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high level techniques such as redundant computation elimination, parallelism and low switching computation structures. Both architectures compare favourably against the relevant pnor art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation - namely, both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search space reductions .Results show an improvement on state of the art algorithms with future potential for even greater savings

    ASIC Implementation of Multiplexer Based DAA

    Get PDF
    ABSTRACT: In Digital Image Processing Point, Line and Edge detection are performed through software approach. The proposed Architecture performs these operations through hardware approach using Distributed Arithmetic. Distributed arithmetic (DA) has been widely used to implement inner product computations with fixed inputs. Conventional ROM-based DA suffers from large ROM requirements. To reduce the memory requirements, Adder based DA uses pre-defined structure for computation. But both the methods are suitable only if at least one input is constant. This project aims to implement a new Distributed Arithmetic Architecture for point detection, line detection and edge detection in DIP when both the inputs are variable. The new architecture is termed as Multiplexer based Distributed Arithmetic (MUX based DA). The proposed architecture takes the advantage of Multiplexer and DA for inner product computations when both the inputs are variable. In addition it reduces ROM requirement and complexity in constructing Adder based architecture for higher order inputs. Here, the performance of proposed Architecture with ROM based DA, Adder based DA and with multiplier based implementation are compared. The MUX based DA reduces power up to 81% and needs 40% of area as compared with multiplier based implementation. KEYWORDS: ROM based DA,ADDER based DA,MULTIPLEXER based DA, CADENCE 180nm Technology. I.INTRODUCTION Distributed Arithmetic (DA) has been widely adopted for its computational efficiency in many digital signal processing applications. The most frequently used form of computation in digital signal processing is a sum of products which is dot-product or inner-product generation. DA is generally abit-serial computation operation that forms a product of two vectors in one clock cycle. The typical applications include DCT, DFT (Discrete Fourier Transform), FIR (Finite Impulse Response), and DHT (Discrete Hartley Transform) which can be found in main stream multimedia standards and telecommunication protocols. The advantage of DA is its special non multiplication mechanization which uses adder replacing multiplication and therefore simplifies the hardware implementation. The idea behind the conventional DA, called ROM based, is to replace multiplication operations by pre-computing all possible values and storing these in a ROM. The Adder based DA uses a fixed architecture which can be obtained by distributing fixed variable is used for inner product computation. The DA technique distributes arithmetic operation rather than lumps themas multipliers do. Conventional DA called ROM based DA decomposes the variable input of the inner product into bit level to generate pre-computed data.ROM based DA uses a ROM table to store the pre-computed data, which makesit regular and efficient in silicon area in VLSI implementation. However, when the size of the inner product increases the ROM area increases exponentially and becomes impractically large, even using ROM partition. In contrast to conventional DA, Adder based DA decomposes the other operand of inner product into bit level, distributes the multiplication operation, and shares the common summation terms .The adder based DA exploits the distribution of binary value pattern and may maximize the hardware sharing possibility in the implementation. Although the Adder based DA requires less hardware area and smaller computation cycle time than ROM based DA, both the existing method operates only on one input as fixed but the proposed MUX base DA computes result with both the input as variable as same as MAC. The direct implementation of the filter requires more number of resources, to reduce the number of resources Distributed Arithmetic came into existence which replaces multiplications by additions and siftings. The proposed DA algorithm came into existence which uses multiplexers to remove the usage of ROM memory and complexity in constructing fixed architecture for higher order inputs. The proposed MUX based D

    Dynamically Reconfigurable Architectures and Systems for Time-varying Image Constraints (DRASTIC) for Image and Video Compression

    Get PDF
    In the current information booming era, image and video consumption is ubiquitous. The associated image and video coding operations require significant computing resources for both small-scale computing systems as well as over larger network systems. For different scenarios, power, bitrate and image quality can impose significant time-varying constraints. For example, mobile devices (e.g., phones, tablets, laptops, UAVs) come with significant constraints on energy and power. Similarly, computer networks provide time-varying bandwidth that can depend on signal strength (e.g., wireless networks) or network traffic conditions. Alternatively, the users can impose different constraints on image quality based on their interests. Traditional image and video coding systems have focused on rate-distortion optimization. More recently, distortion measures (e.g., PSNR) are being replaced by more sophisticated image quality metrics. However, these systems are based on fixed hardware configurations that provide limited options over power consumption. The use of dynamic partial reconfiguration with Field Programmable Gate Arrays (FPGAs) provides an opportunity to effectively control dynamic power consumption by jointly considering software-hardware configurations. This dissertation extends traditional rate-distortion optimization to rate-quality-power/energy optimization and demonstrates a wide variety of applications in both image and video compression. In each application, a family of Pareto-optimal configurations are developed that allow fine control in the rate-quality-power/energy optimization space. The term Dynamically Reconfiguration Architecture Systems for Time-varying Image Constraints (DRASTIC) is used to describe the derived systems. DRASTIC covers both software-only as well as software-hardware configurations to achieve fine optimization over a set of general modes that include: (i) maximum image quality, (ii) minimum dynamic power/energy, (iii) minimum bitrate, and (iv) typical mode over a set of opposing constraints to guarantee satisfactory performance. In joint software-hardware configurations, DRASTIC provides an effective approach for dynamic power optimization. For software configurations, DRASTIC provides an effective method for energy consumption optimization by controlling processing times. The dissertation provides several applications. First, stochastic methods are given for computing quantization tables that are optimal in the rate-quality space and demonstrated on standard JPEG compression. Second, a DRASTIC implementation of the DCT is used to demonstrate the effectiveness of the approach on motion JPEG. Third, a reconfigurable deblocking filter system is investigated for use in the current H.264/AVC systems. Fourth, the dissertation develops DRASTIC for all 35 intra-prediction modes as well as intra-encoding for the emerging High Efficiency Video Coding standard (HEVC)

    A VLSI implementation of a new simultaneous images compression and encryption method

    Full text link
    International audienceIn this manuscript, we describe a fully pipelined single chip architecture for implementing a new simultaneous image compression and encryption method suitable for real-time applications. The proposed method exploits the DCT properties to achieve the compression and the encryption simultaneously. First, to realize the compression, 8-point DCT applied to several images are done. Second, contrary to traditional compression algorithms, only some special points of DCT outputs are multiplexed. For the encryption process, a random number is generated and added to some specific DCT coefficients. On the other hand, to enhance the material implementation of the proposed method, a special attention is given to the DCT algorithm. In fact, a new way to realize the compression based on DCT algorithm and to reduce, at the same time, the material requirements of the compression process is presented. Simulation results show a compression ratio higher than 65% and a PSNR about 28 dB. The proposed architecture can be implemented in FPGA to yield a throughput of 206 MS/s which allows the processing of more than 30 frames per second for 1024x1024 images

    Domain-specific and reconfigurable instruction cells based architectures for low-power SoC

    Get PDF

    Implementation of a real time Hough transform using FPGA technology

    Get PDF
    This thesis is concerned with the modelling, design and implementation of efficient architectures for performing the Hough Transform (HT) on mega-pixel resolution real-time images using Field Programmable Gate Array (FPGA) technology. Although the HT has been around for many years and a number of algorithms have been developed it still remains a significant bottleneck in many image processing applications. Even though, the basic idea of the HT is to locate curves in an image that can be parameterized: e.g. straight lines, polynomials or circles, in a suitable parameter space, the research presented in this thesis will focus only on location of straight lines on binary images. The HT algorithm uses an accumulator array (accumulator bins) to detect the existence of a straight line on an image. As the image needs to be binarized, a novel generic synchronization circuit for windowing operations was designed to perform edge detection. An edge detection method of special interest, the canny method, is used and the design and implementation of it in hardware is achieved in this thesis. As each image pixel can be implemented independently, parallel processing can be performed. However, the main disadvantage of the HT is the large storage and computational requirements. This thesis presents new and state-of-the-art hardware implementations for the minimization of the computational cost, using the Hybrid-Logarithmic Number System (Hybrid-LNS) for calculating the HT for fixed bit-width architectures. It is shown that using the Hybrid-LNS the computational cost is minimized, while the precision of the HT algorithm is maintained. Advances in FPGA technology now make it possible to implement functions as the HT in reconfigurable fabrics. Methods for storing large arrays on FPGA’s are presented, where data from a 1024 x 1024 pixel camera at a rate of up to 25 frames per second are processed

    Power and Energy Aware Heterogeneous Computing Platform

    Get PDF
    During the last decade, wireless technologies have experienced significant development, most notably in the form of mobile cellular radio evolution from GSM to UMTS/HSPA and thereon to Long-Term Evolution (LTE) for increasing the capacity and speed of wireless data networks. Considering the real-time constraints of the new wireless standards and their demands for parallel processing, reconfigurable architectures and in particular, multicore platforms are part of the most successful platforms due to providing high computational parallelism and throughput. In addition to that, by moving toward Internet-of-Things (IoT), the number of wireless sensors and IP-based high throughput network routers is growing at a rapid pace. Despite all the progression in IoT, due to power and energy consumption, a single chip platform for providing multiple communication standards and a large processing bandwidth is still missing.The strong demand for performing different sets of operations by the embedded systems and increasing the computational performance has led to the use of heterogeneous multicore architectures with the help of accelerators for computationally-intensive data-parallel tasks acting as coprocessors. Currently, highly heterogeneous systems are the most power-area efficient solution for performing complex signal processing systems. Additionally, the importance of IoT has increased significantly the need for heterogeneous and reconfigurable platforms.On the other hand, subsequent to the breakdown of the Dennardian scaling and due to the enormous heat dissipation, the performance of a single chip was obstructed by the utilization wall since all cores cannot be clocked at their maximum operating frequency. Therefore, a thermal melt-down might be happened as a result of high instantaneous power dissipation. In this context, a large fraction of the chip, which is switched-off (Dark) or operated at a very low frequency (Dim) is called Dark Silicon. The Dark Silicon issue is a constraint for the performance of computers, especially when the up-coming IoT scenario will demand a very high performance level with high energy efficiency. Among the suggested solution to combat the problem of Dark-Silicon, the use of application-specific accelerators and in particular Coarse-Grained Reconfigurable Arrays (CGRAs) are the main motivation of this thesis work.This thesis deals with design and implementation of Software Defined Radio (SDR) as well as High Efficiency Video Coding (HEVC) application-specific accelerators for computationally intensive kernels and data-parallel tasks. One of the most important data transmission schemes in SDR due to its ability of providing high data rates is Orthogonal Frequency Division Multiplexing (OFDM). This research work focuses on the evaluation of Heterogeneous Accelerator-Rich Platform (HARP) by implementing OFDM receiver blocks as designs for proof-of-concept. The HARP template allows the designer to instantiate a heterogeneous reconfigurable platform with a very large amount of custom-tailored computational resources while delivering a high performance in terms of many high-level metrics. The availability of this platform lays an excellent foundation to investigate techniques and methods to replace the Dark or Dim part of chip with high-performance silicon dissipating very low power and energy. Furthermore, this research work is also addressing the power and energy issues of the embedded computing systems by tailoring the HARP for self-aware and energy-aware computing models. In this context, the instantaneous power dissipation and therefore the heat dissipation of HARP are mitigated on FPGA/ASIC by using Dynamic Voltage and Frequency Scaling (DVFS) to minimize the dark/dim part of the chip. Upgraded HARP for self-aware and energy-aware computing can be utilized as an energy-efficient general-purpose transceiver platform that is cognitive to many radio standards and can provide high throughput while consuming as little energy as possible. The evaluation of HARP has shown promising results, which makes it a suitable platform for avoiding Dark Silicon in embedded computing platforms and also for diverse needs of IoT communications.In this thesis, the author designed the blocks of OFDM receiver by crafting templatebased CGRA devices and then attached them to HARP’s Network-on-Chip (NoC) nodes. The performance of application-specific accelerators generated from templatebased CGRAs, the performance of the entire platform subsequent to integrating the CGRA nodes on HARP and the NoC traffic are recorded in terms of several highlevel performance metrics. In evaluating HARP on FPGA prototype, it delivers a performance of 0.012 GOPS/mW. Because of the scalability and regularity in HARP, the author considered its value as architectural constant. In addition to showing the gain and the benefits of maximizing the number of reconfigurable processing resources on a platform in comparison to the scaled performance of several state-of-the-art platforms, HARP’s architectural constant ensures application-independent figure of merit. HARP is further evaluated by implementing various sizes of Discrete Cosine transform (DCT) and Discrete Sine Transform (DST) dedicated for HEVC standard, which showed its ability to sustain Full HD 1080p format at 30 fps on FPGA. The author also integrated self-aware computing model in HARP to mitigate the power dissipation of an OFDM receiver. In the case of FPGA implementation, the total power dissipation of the platform showed 16.8% reduction due to employing the Feedback Control System (FCS) technique with Dynamic Frequency Scaling (DFS). Furthermore, by moving to ASIC technology and scaling both frequency and voltage simultaneously, significant dynamic power reduction (up to 82.98%) was achieved, which proved the DFS/DVFS techniques as one step forward to mitigate the Dark Silicon issue

    Survey of FPGA applications in the period 2000 – 2015 (Technical Report)

    Get PDF
    Romoth J, Porrmann M, Rückert U. Survey of FPGA applications in the period 2000 – 2015 (Technical Report).; 2017.Since their introduction, FPGAs can be seen in more and more different fields of applications. The key advantage is the combination of software-like flexibility with the performance otherwise common to hardware. Nevertheless, every application field introduces special requirements to the used computational architecture. This paper provides an overview of the different topics FPGAs have been used for in the last 15 years of research and why they have been chosen over other processing units like e.g. CPUs
    corecore