7 research outputs found

    Parallel deblocking filtering in MPEG-4 AVC/H.264 on massively parallel architectures

    Get PDF
    The deblocking filter in the MPEG-4 AVC/H.264 standard is computationally complex because of its high content adaptivity, resulting in a significant number of data dependencies. These data dependencies interfere with parallel filtering of multiple macroblocks (MBs) on massively parallel architectures. In this letter, we introduce a novel MB partitioning scheme for concurrent deblocking in the MPEG-4 AVC/H. 264 standard, based on our idea of deblocking filter independency, a corrected version of the limited error propagation effect proposed in the letter. Our proposed scheme enables concurrent MB deblocking of luma samples with limited synchronization effort, independently of slice configuration, and is compliant with the MPEG-4 H.264/AVC standard. We implemented the method on the massively parallel architecture of the graphics processing unit (GPU). Experimental results show that our GPU implementation achieves faster-than real-time deblocking at 1309 frames per second for 1080p video pictures. Both software-based deblocking filters and state-of-the-art GPU-enabled algorithms are outperformed in terms of speed by factors up to 10.2 and 19.5, respectively, for 1080p video pictures

    Bit-rate Aware Reconfigurable Architecture For H.264/avc Deblocking Filter

    Get PDF
    In H.264/AVC, DeBlocking Filter (DBF) achieves bit rate savings and it is used to improve visual quality by reducing the presence of blocking artifacts. However, these advantages come at the expense of increasing computational complexity of the DBF due to highly adaptive mode decision and small 4x4 block size. The DBF easily accounts for one third of the computational complexity of the decoder. The computational complexity required for various target applications from mobile to high definition video applications varies significantly. Therefore, it becomes apparent to design efficient architecture to adapt to different requirements. In this work, we exploit the scalability on both the hardware level and the algorithmic level to synergize the performance and to reduce computational complexity. First, we propose a modular DBF architecture which can be scaled to adapt to the required computing capability for various bit-rates, resolutions, and frame rates of video sequences. The scalable architecture is based on FPGA using dynamic partial reconfiguration. This desirable feature of FPGAs makes it possible for different hardware configurations to be implemented during run-time. The proposed design can be scaled to filter up to four different edges simultaneously, resulting in significant reduction of total processing time. Secondly, our experiments show by lowering the bit rate of video sequences, significant reduction in computational complexity can be achieved by the increased presence of skipped macroblocks, thus, avoiding redundant filtering operations. The implemented architecture has been evaluated using Xilinx Virtex-4 ML410 FPGA board. The design can operate at a maximum frequency of 103 MHz. The reconfiguration is done through Internal Configuration Access Port (ICAP) to achieve maximum performance needed by real time applications

    Power consumption reduction techniques for H.264 video compression hardware

    Get PDF
    Video compression systems are used in many commercial products such as digital camcorders, cellular phones and video teleconferencing systems. H.264 / MPEG4 Part 10, the recently developed international standard for video compression, offers significantly better compression efficiency than previous video compression standards. However, this compression efficiency comes with an increase in encoding complexity and therefore in power consumption. Since portable devices operate with battery, it is important to reduce power consumption so that battery life can be increased. In addition, consuming excessive power degrades the performance of integrated circuits, increases packaging and cooling costs, reduces reliability and may cause device failures. In this thesis, we propose novel computational complexity and power reduction techniques for intra prediction, deblocking filter (DBF), and intra mode decision modules of an H.264 video encoder hardware, and intra prediction with template matching (TM) hardware. We quantified the computation reductions achieved by these techniques using H.264 Joint Model reference software encoder. We designed efficient hardware architectures for these video compression algorithms and implemented them in Verilog HDL. We mapped these hardware implementations to Xilinx Virtex FPGAs and estimated their power consumptions using Xilinx XPower Analyzer tool. We integrated the proposed techniques to these hardware implementations and quantified their impact on the power consumptions of these hardware implementations on Xilinx Virtex FPGAs. The proposed techniques significantly reduced the power consumptions of these FPGA implementations in some cases with no PSNR loss and in some cases with very small PSNR loss

    Parallel algorithms and architectures for low power video decoding

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Cataloged from PDF version of thesis.Includes bibliographical references (p. 197-204).Parallelism coupled with voltage scaling is an effective approach to achieve high processing performance with low power consumption. This thesis presents parallel architectures and algorithms designed to deliver the power and performance required for current and next generation video coding. Coding efficiency, area cost and scalability are also addressed. First, a low power video decoder is presented for the current state-of-the-art video coding standard H.264/AVC. Parallel architectures are used along with voltage scaling to deliver high definition (HD) decoding at low power levels. Additional architectural optimizations such as reducing memory accesses and multiple frequency/voltage domains are also described. An H.264/AVC Baseline decoder test chip was fabricated in 65-nm CMOS. It can operate at 0.7 V for HD (720p, 30 fps) video decoding and with a measured power of 1.8 mW. The highly scalable decoder can tradeoff power and performance across >100x range. Second, this thesis demonstrates how serial algorithms, such as Context-based Adaptive Binary Arithmetic Coding (CABAC), can be redesigned for parallel architectures to enable high throughput with low coding efficiency cost. A parallel algorithm called the Massively Parallel CABAC (MP-CABAC) is presented that uses syntax element partitions and interleaved entropy slices to achieve better throughput-coding efficiency and throughput-area tradeoffs than H.264/AVC. The parallel algorithm also improves scalability by providing a third dimension to tradeoff coding efficiency for power and performance. Finally, joint algorithm-architecture optimizations are used to increase performance and reduce area with almost no coding penalty. The MP-CABAC is mapped to a highly parallel architecture with 80 parallel engines, which together delivers >10x higher throughput than existing H.264/AVC CABAC implementations. A MP-CABAC test chip was fabricated in 65-nm CMOS to demonstrate the power-performance-coding efficiency tradeoff.by Vivienne. Sze.Ph.D
    corecore