135 research outputs found

    A unified 4/8/16/32-point integer IDCT architecture for multiple video coding standards

    (4096x2048) 30fps video sequence at 191MHz working frequency, with 93K gate count and 18944-bit SRAM. We suggest a normalized criterion called design efficiency to compare with previous works. It shows that this design is 31% more efficient than previous work

    Energy-efficient acceleration of MPEG-4 compression tools

    We propose novel hardware accelerator architectures for the most computationally demanding algorithms of the MPEG-4 video compression standard: motion estimation, binary motion estimation (for shape coding), and the forward/inverse discrete cosine transforms (incorporating shape-adaptive modes). These accelerators have been designed using general low-energy design philosophies at the algorithmic/architectural abstraction levels. The themes of these philosophies are avoiding waste and trading area/performance for power and energy gains. Each core has been synthesised targeting TSMC 0.09 μm TCBN90LP technology, and the experimental results presented in this paper show that the proposed cores improve upon the prior art

    Energy efficient hardware acceleration of multimedia processing tools

    The world of mobile devices is experiencing an ongoing trend of feature enhancement and general-purpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks. Based on the survey that this thesis presents on modern video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work in this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high-level techniques such as redundant computation elimination, parallelism and low-switching computation structures. Both architectures compare favourably against the relevant prior art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation: both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search space reductions. Results show an improvement on state-of-the-art algorithms with future potential for even greater savings

    Register transfer level design of transpose memory for the two-dimension inverse discrete cosine transform for high efficiency video coding

    The rapid evolution of consumer devices has produced a variety of emerging video coding applications, which place aggressive demands on video compression, and the required compression efficiency keeps rising. The Advanced Video Coding (AVC) standard has been succeeded by the High Efficiency Video Coding (HEVC) standard, which offers a major advance in compression over its predecessor. However, the improved coding efficiency of HEVC comes at the cost of increased computational complexity. The Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) are therefore essential accelerators in an HEVC hardware implementation, and their hardware design has become more complicated because of the flexibility the new standard allows. This project aimed to design a transpose memory for Two-Dimension Inverse Discrete Cosine Transform (2D IDCT) hardware using a hardware description language. First, a transpose memory supporting different transform block dimensions (4x4, 8x8, 16x16 and 32x32 transform units) was implemented, in both a register-based design and a RAM-based design. Second, a test bench was designed to validate the functionality of the RTL design. Third, the 1D IDCT building block was integrated with the designed transpose memory and the overall system functionality was validated. Finally, the trade-offs in performance, resources and power between the register-based and dedicated RAM-based transpose memories were analysed. The results show that the register-based 2D IDCT has 2.24 times better throughput and 35.6% lower energy consumption than the RAM-based 2D IDCT, but uses 30 times more resources. Thus, the RAM-based 2D IDCT is more suitable for small electronic devices. If area cost is negligible and performance is needed, the register-based 2D IDCT can be considered
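
The role of the transpose memory in the abstract above can be illustrated with a minimal row-column sketch: a 1D IDCT over the rows, a transpose, and a second 1D pass. This is not the project's RTL. The function names are hypothetical, and the floating-point orthonormal DCT basis is an assumption made for readability (HEVC itself specifies integer transform matrices).

```python
import math

def idct_1d(coeffs):
    """Orthonormal 1D inverse DCT (floating point; an assumption, not HEVC's integer transform)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            acc += scale * c * math.cos((2 * i + 1) * k * math.pi / (2 * n))
        out.append(acc)
    return out

def transpose(block):
    """Software stand-in for the transpose memory: rows become columns."""
    return [list(col) for col in zip(*block)]

def idct_2d(block):
    """Row-column 2D IDCT: 1D pass on rows, transpose, 1D pass again, transpose back."""
    first_pass = [idct_1d(row) for row in block]
    second_pass = [idct_1d(row) for row in transpose(first_pass)]
    return transpose(second_pass)
```

The same `idct_1d` routine serves both passes, which is why a hardware design can reuse one 1D block and only needs a buffer that is written row-wise and read column-wise.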

    A Cost Shared Quantization Algorithm and its Implementation for Multi-Standard Video CODECS

    The current trend of digital convergence creates the need for a video encoder and decoder system, or codec for short, that supports multiple video standards on a single platform. In a modern video codec, quantization is a key unit used for video compression. In this thesis, a generalized quantization algorithm and hardware implementation is presented to compute quantized coefficients for six different video codecs, including the emerging High Efficiency Video Coding (HEVC) codec. HEVC, successor to H.264/MPEG-4 AVC, aims to substantially improve coding efficiency compared to AVC High Profile. The thesis presents a high-performance, circuit-shared architecture that can perform the quantization operation for HEVC, H.264/AVC, AVS, VC-1, MPEG-2/4 and Motion JPEG (MJPEG). Since HEVC is still in the drafting stage, the architecture was designed in such a way that any final changes can be accommodated into the design. The proposed quantizer architecture is completely division-free, as the division operation is replaced by multiplication, shift and addition operations. The design was implemented on FPGA and later synthesized in CMOS 0.18 μm technology. The results show that the proposed design satisfies the requirements of all codecs, with a maximum decoding capability of 60 fps for 1080p HD video at 187.3 MHz on a Xilinx Virtex4 LX60 FPGA. The scheme is also suitable for low-cost implementation in modern multi-codec systems
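
The division-free idea mentioned in the abstract above can be sketched as follows: division by the quantization step is replaced with a multiplication by a precomputed fixed-point reciprocal, an optional rounding addition, and a right shift. This is a generic illustration, not the thesis's architecture. `SHIFT`, `reciprocal` and `quantize` are hypothetical names, and the 20-bit precision is an assumption chosen so the result is exact for coefficient magnitudes below 4096 and steps up to 64.

```python
SHIFT = 20  # assumed fixed-point precision of the stored reciprocal

def reciprocal(step: int) -> int:
    """ceil(2**SHIFT / step); in hardware this would come from a lookup table."""
    return -(-(1 << SHIFT) // step)

def quantize(coef: int, step: int, rounding_offset: int = 0) -> int:
    """Quantize coef by step using only multiply, add and shift at run time."""
    magnitude = ((abs(coef) + rounding_offset) * reciprocal(step)) >> SHIFT
    return -magnitude if coef < 0 else magnitude
```

For example, `quantize(100, 7)` gives the same value as `100 // 7`. The only division left is in building the reciprocal table, which happens once per step size and not once per coefficient.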

    Performance evaluation of H.264/AVC decoding and visualization using the GPU

    The coding efficiency of the H.264/AVC standard makes the decoding process computationally demanding. This has limited the availability of cost-effective, high-performance solutions. Modern computers are typically equipped with powerful yet cost-effective Graphics Processing Units (GPUs) to accelerate graphics operations. These GPUs can be addressed by means of a 3-D graphics API such as Microsoft Direct3D or OpenGL, using programmable shaders as generic processing units for vector data. The new CUDA (Compute Unified Device Architecture) platform of NVIDIA provides a straightforward way to address the GPU directly, without the need for a 3-D graphics API in the middle. In CUDA, a compiler generates executable code from C code with specific modifiers that determine the execution model. This paper first presents our own H.264/AVC renderer, which is capable of executing motion compensation (MC), reconstruction, and Color Space Conversion (CSC) entirely on the GPU. To steer the GPU, Direct3D combined with programmable pixel and vertex shaders is used. Next, we also present a GPU-enabled decoder utilizing the new CUDA architecture from NVIDIA. This decoder performs MC, reconstruction, and CSC on the GPU as well. Our results compare both GPU-enabled decoders, as well as a CPU-only decoder, in terms of speed, complexity, and CPU requirements. Our measurements show that a significant speedup is possible, relative to a CPU-only solution. As an example, real-time playback of high-definition video (1080p) was achieved with our Direct3D and CUDA-based H.264/AVC renderers

    Performance evaluation of MPEG-4 Video encoder on Adres

    Academic year 2006-2007. Nowadays, a typical embedded system (e.g. a mobile phone) requires high performance to perform tasks such as video encoding/decoding at run-time. It should consume little energy so that it can work for hours or even days on a lightweight battery. It should be flexible enough to integrate multiple applications and standards in a single device. It has to be designed and verified within a short time-to-market despite substantially increased complexity. Designers are struggling to meet these challenges, which call for innovations in both architectures and design methodology. Coarse-grained reconfigurable architectures (CGRAs) are emerging as potential candidates to meet these challenges, and many have been proposed in recent years. Their coarse granularity greatly reduces delay, area, power and configuration time compared with FPGAs. On the other hand, compared with traditional "coarse-grained" programmable processors, their massive computational resources enable them to achieve high parallelism and efficiency. However, existing CGRAs have not yet been widely adopted, mainly because such complex architectures are difficult to program. ADRES is a novel CGRA designed by the Interuniversity Micro-Electronics Center (IMEC). It tightly couples a very-long instruction word (VLIW) processor and a coarse-grained array by providing two functional views on the same physical resources. It brings advantages such as high performance, low communication overhead and ease of programming. Finally, ADRES is a template rather than a concrete architecture. With the retargetable compilation support from DRESC (Dynamically Reconfigurable Embedded System Compiler), architectural exploration becomes possible to discover better architectures or to design domain-specific architectures. This thesis presents the implementation of an MPEG-4 encoder on ADRES and evaluates its performance. It shows the evolution of the code towards a good implementation for a given architecture, and presents the main features of ADRES and its compiler (DRESC). The objectives are to reduce as far as possible the number of cycles (time) spent encoding video in MPEG-4 and to examine the difficulties of working in the ADRES environment. The results show that the cycle count falls by 67% between the initial and final code in VLIW mode, and by 84% between the initial code in VLIW mode and the final code in CGA mode. Director: Moisès Serra i Serra. Supervisor: Eric Delfoss

    Joint Optimization of Low-power DCT Architecture and Efficient Quantization Technique for Embedded Image Compression

    The Discrete Cosine Transform (DCT)-based image compression is widely used in today's communication systems. Significant research devoted to this domain has demonstrated that the optical compression methods can offer a higher speed but suffer from bad image quality and a growing complexity. To meet the challenges of higher image quality and high speed processing, in this chapter, we present a joint system for DCT-based image compression by combining a VLSI architecture of the DCT algorithm and an efficient quantization technique. Our approach is, firstly, based on a new granularity method in order to take advantage of the adjacent pixel correlation of the input blocks and to improve the visual quality of the reconstructed image. Second, a new architecture based on the Canonical Signed Digit and a novel Common Subexpression Elimination technique is proposed to replace the constant multipliers. Finally, a reconfigurable quantization method is presented to effectively save the computational complexity. Experimental results obtained with a prototype based on FPGA implementation and comparisons with existing works corroborate the validity of the proposed optimizations in terms of power reduction, speed increase, silicon area saving and PSNR improvement
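
The Canonical Signed Digit (CSD) representation mentioned in the abstract above can be sketched as follows. A constant is recoded into digits from {-1, 0, +1} with no two adjacent non-zero digits, so multiplying by it needs only shifts and a small number of additions or subtractions. This is the textbook recoding, not the chapter's architecture, and it leaves out the Common Subexpression Elimination step. The function names `to_csd` and `csd_multiply` are hypothetical.

```python
def to_csd(constant: int) -> list:
    """Return the CSD digits of a non-negative constant, least significant first."""
    digits = []
    n = constant
    while n != 0:
        if n & 1:
            # +1 when n mod 4 == 1, -1 when n mod 4 == 3 (avoids adjacent non-zeros)
            digit = 2 - (n & 3)
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits

def csd_multiply(x: int, constant: int) -> int:
    """Multiply x by constant using only shifts, additions and subtractions."""
    result = 0
    for position, digit in enumerate(to_csd(constant)):
        if digit == 1:
            result += x << position
        elif digit == -1:
            result -= x << position
    return result
```

For instance, 7 is recoded as 8 - 1, so `csd_multiply(x, 7)` uses one shift and one subtraction in place of the two additions that plain binary 111 would need.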