25 research outputs found

    CABAC accelerator architectures for video compression in future multimedia: a survey

    The demand for high quality, real-time performance and multi-format video support in consumer multimedia products is ever increasing. In particular, future multimedia systems require efficient video coding algorithms and corresponding adaptive high-performance computational platforms. The H.264/AVC video coding algorithms provide high enough compression efficiency to be utilized in these systems, and multimedia processors are able to provide the required adaptability, but the algorithms' complexity demands more efficient computing platforms. Heterogeneous (re-)configurable systems composed of multimedia processors and hardware accelerators constitute the main part of such platforms. In this paper, we survey the hardware accelerator architectures for Context-based Adaptive Binary Arithmetic Coding (CABAC) of the Main and High profiles of H.264/AVC. The purpose of the survey is to deliver critical insight into the proposed solutions, and thereby facilitate further research on accelerator architectures, architecture development methods and supporting EDA tools. The architectures are analyzed, classified and compared based on the core hardware acceleration concepts, algorithmic characteristics, video resolution support and performance parameters, and some promising design directions are discussed. The comparative analysis shows that the parallel pipeline accelerator architecture seems to be the most promising

    A Deeply Pipelined CABAC Decoder for HEVC Supporting Level 6.2 High-tier Applications

    High Efficiency Video Coding (HEVC) is the latest video coding standard that specifies video resolutions up to 8K Ultra-HD (UHD) at 120 fps to support the next decade of video applications. This results in high-throughput requirements for the context adaptive binary arithmetic coding (CABAC) entropy decoder, which was already a well-known bottleneck in H.264/AVC. To address the throughput challenges, several modifications were made to CABAC during the standardization of HEVC. This work leverages these improvements in the design of a high-throughput HEVC CABAC decoder. It also supports the high-level parallel processing tools introduced by HEVC, including tile and wavefront parallel processing. The proposed design uses a deeply pipelined architecture to achieve a high clock rate. Additional techniques such as state prefetch logic, latched-based context memory, and separate finite state machines are applied to minimize stall cycles, while multi-bypass-bin decoding is used to further increase the throughput. The design is implemented in an IBM 45nm SOI process. After place-and-route, its operating frequency reaches 1.6 GHz. The corresponding throughputs reach up to 1696 and 2314 Mbin/s under common and theoretical worst-case test conditions, respectively. The results show that the design is sufficient to decode in real time high-tier video bitstreams at level 6.2 (8K UHD at 120 fps), or main-tier bitstreams at level 5.1 (4K UHD at 60 fps) for applications requiring sub-frame latency, such as video conferencing

    Video decoder for H.264/AVC main profile power efficient hardware design.

    Yim, Ka Yee. Thesis (M.Phil.), Chinese University of Hong Kong, 2011. Includes bibliographical references (p. 43). Abstracts in English and Chinese. Contents: Chapter 1, Introduction (1.1 Motivation; 1.2 Overview; 1.3 H.264 Overview). Chapter 2, CABAC (2.1 Introduction; 2.2 CABAC Decoder Implementation Review; 2.3 CABAC Algorithm Review; 2.4 Proposed CABAC Decoder Implementation; 2.5 FSM Method Bin Matching; 2.6 CABAC Experimental Results; 2.7 Summary). Chapter 3, Integration (3.1 Introduction; 3.2 Reused Baseline Decoder Review; 3.3 Integration; 3.4 Proposed Solution for Motion Vector Decoding; 3.5 Synthesis Result and Performance Analysis). Chapter 4, Conclusion (4.1 Main Contribution; 4.2 Reflection on the Development; 4.3 Future Work). Bibliography.

    Parallel algorithms and architectures for low power video decoding

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 197-204). Parallelism coupled with voltage scaling is an effective approach to achieve high processing performance with low power consumption. This thesis presents parallel architectures and algorithms designed to deliver the power and performance required for current and next generation video coding. Coding efficiency, area cost and scalability are also addressed. First, a low power video decoder is presented for the current state-of-the-art video coding standard H.264/AVC. Parallel architectures are used along with voltage scaling to deliver high definition (HD) decoding at low power levels. Additional architectural optimizations such as reducing memory accesses and multiple frequency/voltage domains are also described. An H.264/AVC Baseline decoder test chip was fabricated in 65-nm CMOS. It can operate at 0.7 V for HD (720p, 30 fps) video decoding with a measured power of 1.8 mW. The highly scalable decoder can trade off power and performance across a >100x range. Second, this thesis demonstrates how serial algorithms, such as Context-based Adaptive Binary Arithmetic Coding (CABAC), can be redesigned for parallel architectures to enable high throughput with low coding efficiency cost. A parallel algorithm called the Massively Parallel CABAC (MP-CABAC) is presented that uses syntax element partitions and interleaved entropy slices to achieve better throughput-coding efficiency and throughput-area tradeoffs than H.264/AVC. The parallel algorithm also improves scalability by providing a third dimension to trade off coding efficiency for power and performance. Finally, joint algorithm-architecture optimizations are used to increase performance and reduce area with almost no coding penalty.
The MP-CABAC is mapped to a highly parallel architecture with 80 parallel engines, which together deliver >10x higher throughput than existing H.264/AVC CABAC implementations. An MP-CABAC test chip was fabricated in 65-nm CMOS to demonstrate the power-performance-coding efficiency tradeoff. By Vivienne Sze. Ph.D.

    Network-on-Chip Based H.264 Video Decoder on a Field Programmable Gate Array

    This thesis develops the first fully network-on-chip (NoC) based H.264 video decoder implemented in real hardware on a field programmable gate array (FPGA). It starts with an overview of the H.264 video coding standard and an introduction to the NoC communication paradigm. Following this, a series of processing elements (PEs) are developed which implement the component algorithms making up the H.264 video decoder. These PEs, described primarily in VHDL with some Verilog and C, are then mapped to an NoC which is generated using the CONNECT NoC generation tool. To demonstrate the scalability of the proposed NoC based design, a second NoC based video decoder is implemented on a smaller FPGA using the same PEs on a more compact NoC topology. The performance of both decoders, as well as their component PEs, is evaluated on real hardware. An analysis of the performance results is conducted and recommendations for future work are made based on the results of this analysis. Aside from the development of the proposed decoder, a major contribution of this thesis is the release of all source materials for this design as open source hardware and software. The release of these materials will allow other researchers to more easily replicate this work, as well as create derivative works in the areas of NoC based designs for FPGA, video coding and decoding, and related areas

    Domain-specific and reconfigurable instruction cells based architectures for low-power SoC

    System-on-Chip design of a high performance low power full hardware CABAC encoder in H.264/AVC

    Ph.D. (Doctor of Philosophy)

    On the use of deep learning and parallelism techniques to significantly reduce the HEVC intra-coding time

    It is well known that each new video coding standard significantly increases in computational complexity with respect to previous standards, and this is particularly true for the HEVC and VVC video coding standards. The development of techniques for reducing the required complexity without affecting the rate/distortion (R/D) performance is therefore always a topic of intense research interest. In this paper, we propose a combination of two powerful techniques, deep learning and parallel computing, to significantly reduce the complexity of the HEVC encoding engine. Our experimental results show that a combination of deep learning to reduce the CTU partitioning complexity with parallel strategies based on frame partitioning is able to achieve speedups of up to 26× when 16 threads are used. The R/D penalty in terms of the BD-BR metric depends on the video content, the compression rate and the number of OpenMP threads, and was consistently between 0.35% and 10% for the video sequence test set used in our experiment

    Parallelism and the software-hardware interface in embedded systems

    This thesis by publications addresses issues in the architecture and microarchitecture of next generation, high performance streaming Systems-on-Chip through quantifying the most important forms of parallelism in current and emerging embedded system workloads. The work consists of three major research tracks, relating to data level parallelism, thread level parallelism and the software-hardware interface, which together reflect the research interests of the author as they have been formed in the last nine years. Published works confirm that parallelism at the data level is widely accepted as the most important performance leverage for the efficient execution of embedded media and telecom applications and has been exploited via a number of approaches, the most efficient being vector/SIMD architectures. A further, complementary and substantial form of parallelism exists at the thread level but this has not been researched to the same extent in the context of embedded workloads. For the efficient execution of such applications, exploitation of both forms of parallelism is of paramount importance. This calls for a new architectural approach in the software-hardware interface as its rigidity, manifested in all desktop-based and the majority of embedded CPUs, directly affects the performance of vectorized, threaded codes. The author advocates a holistic, mature approach where parallelism is extracted via automatic means while at the same time, the traditionally rigid hardware-software interface is optimized to match the temporal and spatial behaviour of the embedded workload. This ultimate goal calls for the precise study of these forms of parallelism for a number of applications executing on theoretical models such as instruction set simulators and parallel RAM machines as well as the development of highly parametric microarchitectural frameworks to encapsulate that functionality. EThOS - Electronic Theses Online Service, United Kingdom

    Parallel architectures for entropy coding in a dual-standard ultra-HD video encoder

    Thesis (S.M.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Includes bibliographical references (p. 97-98). The mismatch between the rapid increase in resolution requirements and the slower increase in energy capacity demands more aggressive low-power circuit design techniques to maintain the battery life of hand-held multimedia devices. As the operating voltage is lowered to reduce power consumption, the maximum operating frequency of the system must also decrease while the performance requirements remain constant. To meet these performance constraints imposed by the high resolution and complex functionality of video processing systems, novel techniques for increasing throughput are explored. In particular, the entropy coding functional block faces the most stringent requirements to deliver the necessary throughput due to its highly serial nature, especially to sustain real-time encoding. This thesis proposes parallel architectures for high-performance entropy coding for high-resolution, dual-standard video encoding. To demonstrate the most aggressive techniques for achieving standard reconfigurability, two markedly different video compression standards (H.264/AVC and VC-1) are supported. Specifically, the entropy coder must process data generated from a quad full-HD (4096x2160 pixels per frame, the equivalent of four full-HD frames) video at a frame rate of 30 frames per second and perform lossless compression to generate an output bitstream. This block will be integrated into a dual-standard video encoder chip targeted for operation at 0.6 V, which will be fabricated following the completion of this thesis. Parallelism, as well as other techniques applied at the syntax element or bit level, is used to achieve the overall throughput requirements.
Three frames of video data are processed in parallel at the system level, and varying degrees of parallelism are employed within the entropy coding block for each standard. The VC-1 entropy encoder block encodes 735M symbols per second with a gate count of 136.6K and power consumption of 304.5 pW, and the H.264 block encodes 4.97G binary symbols per second through three-frame parallelism and a 6-bin cascaded pipelining architecture with a critical path delay of 20.05 ns. By Bonnie K. Y. Lam. S.M.