442 research outputs found

    Side information exploitation, quality control and low complexity implementation for distributed video coding

    Get PDF
    Distributed video coding (DVC) is a new video coding methodology that shifts the highly complex motion search components from the encoder to the decoder, such a video coder would have a great advantage in encoding speed and it is still able to achieve similar rate-distortion performance as the conventional coding solutions. Applications include wireless video sensor networks, mobile video cameras and wireless video surveillance, etc. Although many progresses have been made in DVC over the past ten years, there is still a gap in RD performance between conventional video coding solutions and DVC. The latest development of DVC is still far from standardization and practical use. The key problems remain in the areas such as accurate and efficient side information generation and refinement, quality control between Wyner-Ziv frames and key frames, correlation noise modelling and decoder complexity, etc. Under this context, this thesis proposes solutions to improve the state-of-the-art side information refinement schemes, enable consistent quality control over decoded frames during coding process and implement highly efficient DVC codec. This thesis investigates the impact of reference frames on side information generation and reveals that reference frames have the potential to be better side information than the extensively used interpolated frames. Based on this investigation, we also propose a motion range prediction (MRP) method to exploit reference frames and precisely guide the statistical motion learning process. Extensive simulation results show that choosing reference frames as SI performs competitively, and sometimes even better than interpolated frames. Furthermore, the proposed MRP method is shown to significantly reduce the decoding complexity without degrading any RD performance. To minimize the block artifacts and achieve consistent improvement in both subjective and objective quality of side information, we propose a novel side information synthesis framework working on pixel granularity. We synthesize the SI at pixel level to minimize the block artifacts and adaptively change the correlation noise model according to the new SI. Furthermore, we have fully implemented a state-of-the-art DVC decoder with the proposed framework using serial and parallel processing technologies to identify bottlenecks and areas to further reduce the decoding complexity, which is another major challenge for future practical DVC system deployments. The performance is evaluated based on the latest transform domain DVC codec and compared with different standard codecs. Extensive experimental results show substantial and consistent rate-distortion gains over standard video codecs and significant speedup over serial implementation. In order to bring the state-of-the-art DVC one step closer to practical use, we address the problem of distortion variation introduced by typical rate control algorithms, especially in a variable bit rate environment. Simulation results show that the proposed quality control algorithm is capable to meet user defined target distortion and maintain a rather small variation for sequence with slow motion and performs similar to fixed quantization for fast motion sequence at the cost of some RD performance. Finally, we propose the first implementation of a distributed video encoder on a Texas Instruments TMS320DM6437 digital signal processor. The WZ encoder is efficiently implemented, using rate adaptive low-density-parity-check accumulative (LDPCA) codes, exploiting the hardware features and optimization techniques to improve the overall performance. Implementation results show that the WZ encoder is able to encode at 134M instruction cycles per QCIF frame on a TMS320DM6437 DSP running at 700MHz. This results in encoder speed 29 times faster than non-optimized encoder implementation. We also implemented a highly efficient DVC decoder using both serial and parallel technology based on a PC-HPC (high performance cluster) architecture, where the encoder is running in a general purpose PC and the decoder is running in a multicore HPC. The experimental results show that the parallelized decoder can achieve about 10 times speedup under various bit-rates and GOP sizes compared to the serial implementation and significant RD gains with regards to the state-of-the-art DISCOVER codec

    Generalized parallelization methodology for video coding

    Get PDF
    This paper describes a generalized parallelization methodology for mapping video coding algorithms onto a multiprocessing architecture, through systematic task decomposition, scheduling and performance analysis. It exploits data parallelism inherent in the coding process and performs task scheduling base on task data size and access locality with the aim to hide as much communication overhead as possible. Utilizing Petri-nets and task graphs for representation and analysis, the method enables parallel video frame capturing, buffering and encoding without extra communication overhead. The theoretical speedup analysis indicates that this method offers excellent communication hiding, resulting in system efficiency well above 90%. A H.261 video encoder has been implemented on a TMS320C80 system using this method, and its performance was measured. The theoretical and measured performances are similar in that the measured speedup of the H.261 is 3.67 and 3.76 on four PP for QCIF and 352×240 video, respectively. They correspond to frame rates of 30.7 frame per second (fps) and 9.25 fps, and system efficiency of 91.8% and 94% respectively. As it is, this method is particularly efficient for platforms with small number of parallel processors.published_or_final_versio

    Generalized parallelization methodology for video coding

    Get PDF
    This paper describes a generalized parallelization methodology for mapping video coding algorithms onto a multiprocessing architecture, through systematic task decomposition, scheduling and performance analysis. It exploits data parallelism inherent in the coding process and performs task scheduling base on task data size and access locality with the aim to hide as much communication overhead as possible. Utilizing Petri-nets and task graphs for representation and analysis, the method enables parallel video frame capturing, buffering and encoding without extra communication overhead. The theoretical speedup analysis indicates that this method offers excellent communication hiding, resulting in system efficiency well above 90%. A H.261 video encoder has been implemented on a TMS320C80 system using this method, and its performance was measured. The theoretical and measured performances are similar in that the measured speedup of the H.261 is 3.67 and 3.76 on four PP for QCIF and 352×240 video, respectively. They correspond to frame rates of 30.7 frame per second (fps) and 9.25 fps, and system efficiency of 91.8% and 94% respectively. As it is, this method is particularly efficient for platforms with small number of parallel processors.published_or_final_versio

    A General Hardware/Software Co-design Methodology for Embedded Signal Processing and Multimedia Workloads

    Get PDF
    This paper presents a hardware/software co-design methodology for partitioning real-time embedded multimedia applications between software programmable DSPs and hardware based FPGA coprocessors. By following a strict set of guidelines, the input application is partitioned between software executing on a programmable DSP and hardware based FPGA implementation to alleviate computational bottlenecks in modern VLIW style DSP architectures used in embedded systems. This methodology is applied to channel estimation firmware in 3.5G wireless receivers, as well as software based H.263 video decoders. As much as an 11x improvement in runtime performance can be achieved by partitioning performance critical software kernels in these workloads into a hardware based FPGA implementation executing in tandem with the existing host DSP.Nokia Inc.Texas InstrumentsNational Science Foundatio

    Parallelization methodology for video coding - an implementation on the TMS320C80

    Get PDF
    This paper presents a parallelization methodology for video coding based on the philosophy of hiding as much communications by computation as possible. It models the task/data size, processor cache capacity, and communication contention, through a systematic decomposition and scheduling approach. With the aid of Petri-nets and task graphs for representation and analysis, it employs a triple buffering scheme to enable the functions of frame capture, management, and coding to be performed in parallel. The theoretical speedup analysis indicates that this method offers excellent communication hiding, resulting in system efficiency well above 90%. To prove its practicality, a H.261 video encoder has been implemented on a TMS320C80 system using the method. Its performance was measured, from which the speedup and efficiency figures were calculated. The only difference detected between the theoretical and measured data is the program control overhead that has not been accounted for in the theoretical model. Even with this, the measured speedup of the H.261 is 3.67 and 3.76 on four parallel processors (PPs) for QCIF and 352 × 240 video, respectively, which correspond to frame rate of 30.7 and 9.25 frames per second, and system efficiency of 91.8% and 94%, respectively. This method is particularly efficient for platforms with small number of parallel processors.published_or_final_versio

    Exploring Processor and Memory Architectures for Multimedia

    Get PDF
    Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using today’s most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture

    VLSI architecture design approaches for real-time video processing

    Get PDF
    This paper discusses the programmable and dedicated approaches for real-time video processing applications. Various VLSI architecture including the design examples of both approaches are reviewed. Finally, discussions of several practical designs in real-time video processing applications are then considered in VLSI architectures to provide significant guidelines to VLSI designers for any further real-time video processing design works
    corecore