
    Spatial and temporal data parallelization of the H.261 video coding algorithm

    In this paper, the parallelization of the H.261 video coding algorithm on the IBM SP2 multiprocessor system is described. The effects of parallelizing computations and communications in the spatial, temporal, and combined spatial-temporal domains are considered through the study of frame rate, speedup, and implementation efficiency, which are modeled and measured with respect to the number of nodes (n) and the parallel methods used. Four parallel algorithms were developed, of which the first two exploit the spatial parallelism in each frame, and the last two exploit both the temporal and spatial parallelism over a sequence of frames. The two spatial algorithms differ in that one uses a single communication master, while the other distributes communications across three masters. The spatial-temporal algorithms, in turn, use a pipeline structure to exploit the temporal parallelism together with either a single master or multiple masters. The best median speedup (frame rate) achieved was close to 15 [15 frames per second (fps)] for 352 × 240 video on 24 nodes, and 13 (37 fps) for QCIF video, by the spatial algorithm with distributed communications. For n 10, with efficiency up to 70%. The spatial-temporal algorithms achieved average speedup performance, but are the most scalable for large n.

    Parallelization of the H.261 video coding algorithm on the IBM SP2(R) multiprocessor system

    In this paper, the parallelization of the H.261 video coding algorithm on the IBM SP2 multiprocessor system is described. Using domain decomposition as a framework, data partitioning, data dependencies and communication issues are carefully assessed. From these, two parallel algorithms were developed: the first maximizes processor utilization and the second minimizes communications. Our analysis shows that the first algorithm exhibits poor scalability and high communication overhead, while the second exhibits good scalability and low communication overhead. A best median speedup of 13.72, or 11 frames/sec, was achieved on 24 processors.

    Generalized parallelization methodology for video coding

    This paper describes a generalized parallelization methodology for mapping video coding algorithms onto a multiprocessing architecture, through systematic task decomposition, scheduling and performance analysis. It exploits the data parallelism inherent in the coding process and performs task scheduling based on task data size and access locality, with the aim of hiding as much communication overhead as possible. Using Petri nets and task graphs for representation and analysis, the method enables parallel video frame capturing, buffering and encoding without extra communication overhead. The theoretical speedup analysis indicates that this method offers excellent communication hiding, resulting in system efficiency well above 90%. An H.261 video encoder has been implemented on a TMS320C80 system using this method, and its performance was measured. The theoretical and measured performances are similar: the measured speedup of the H.261 encoder is 3.67 and 3.76 on four parallel processors (PPs) for QCIF and 352×240 video, respectively. These correspond to frame rates of 30.7 frames per second (fps) and 9.25 fps, and system efficiencies of 91.8% and 94%, respectively. As it stands, this method is particularly efficient for platforms with a small number of parallel processors.

    Fast and parallel video encoding by workload balancing

    The issue of balancing the macroblock (MB) computing workload across the processors is explored. This includes the prediction of the workload based on the previous frame's workload and the scheduling of the MBs bounded by the locality constraint. The algorithm was implemented on an IBM SP2, and the results showed that the reduction in the worst-case delay is around 19-23%, with both the prediction and scheduling overhead taken into account. Because of the critical path reduction, the overall processor utilization was increased and the overall coding rate improved.

    Parallelization methodology for video coding - an implementation on the TMS320C80

    This paper presents a parallelization methodology for video coding based on the philosophy of hiding as much communication behind computation as possible. It models the task/data size, processor cache capacity, and communication contention through a systematic decomposition and scheduling approach. With the aid of Petri nets and task graphs for representation and analysis, it employs a triple buffering scheme to enable the functions of frame capture, management, and coding to be performed in parallel. The theoretical speedup analysis indicates that this method offers excellent communication hiding, resulting in system efficiency well above 90%. To prove its practicality, an H.261 video encoder has been implemented on a TMS320C80 system using the method. Its performance was measured, from which the speedup and efficiency figures were calculated. The only difference detected between the theoretical and measured data is the program control overhead, which is not accounted for in the theoretical model. Even with this, the measured speedup of the H.261 encoder is 3.67 and 3.76 on four parallel processors (PPs) for QCIF and 352 × 240 video, respectively, which correspond to frame rates of 30.7 and 9.25 frames per second, and system efficiencies of 91.8% and 94%, respectively. This method is particularly efficient for platforms with a small number of parallel processors.
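The efficiency figures quoted in this abstract are consistent with the usual definition of parallel efficiency as speedup divided by the number of processors. The following is a minimal sketch of that arithmetic, assuming this standard definition is the one the authors used; the function name `efficiency` is illustrative and does not come from the paper.

```python
def efficiency(speedup: float, processors: int) -> float:
    """Parallel efficiency, assumed here to be speedup / number of processors."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Speedups reported in the abstract for four parallel processors.
qcif = efficiency(3.67, 4)    # about 0.9175, reported as 91.8%
larger = efficiency(3.76, 4)  # about 0.94, reported as 94%
print(f"QCIF: {qcif:.4f}, 352x240: {larger:.4f}")
```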

    Adaptive parallel video-coding algorithm

    Parallel encoding of video inevitably gives varying frame rate performance due to dynamically changing video content and motion field, since the encoding process of each macroblock, especially motion estimation, is data dependent. A multiprocessor schedule optimized for a particular frame with certain macroblock encoding times may not be optimal for another frame with different encoding times, which degrades the performance of the parallelization. To tackle this problem, we propose a method based on a batch of near-optimal schedules generated at compile time and a run-time mechanism to select the schedule giving the shortest predicted critical path length. This method has the advantage of being near-optimal using compile-time schedules while involving only run-time selection rather than re-scheduling. Implementation on the IBM SP2 multiprocessor system using 24 processors gives an average speedup of about 13.5 (frame rate of 38.5 frames per second) for a CIF sequence consisting of segments of 6 different scenes. This is equivalent to an average improvement of about 16.9% over the single-schedule scheme with the schedule adapted to each of the scenes. Using an open test sequence consisting of 8 video segments, the average improvement achieved is 13.2%, i.e. an average speedup of 13.3 (35.6 frames per second).
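The run-time selection step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes macroblocks are independent, so that the predicted critical path of a schedule reduces to the load of its busiest processor, and it assumes each precomputed schedule is a list mapping macroblock index to processor index. The names `predicted_length` and `select_schedule` are hypothetical.

```python
from typing import Sequence


def predicted_length(schedule: Sequence[int], mb_times: Sequence[float]) -> float:
    """Predicted finish time of one schedule: the load of the busiest processor.

    schedule[i] is the processor assigned to macroblock i (assumed format).
    """
    loads: dict[int, float] = {}
    for mb, proc in enumerate(schedule):
        loads[proc] = loads.get(proc, 0.0) + mb_times[mb]
    return max(loads.values())


def select_schedule(schedules: Sequence[Sequence[int]],
                    mb_times: Sequence[float]) -> int:
    """Return the index of the precomputed schedule with the shortest prediction."""
    return min(range(len(schedules)),
               key=lambda i: predicted_length(schedules[i], mb_times))


# Two compile-time schedules for four macroblocks on two processors.
batch = [[0, 0, 1, 1], [0, 1, 0, 1]]
# Per-macroblock times predicted at run time, e.g. from the previous frame.
print(select_schedule(batch, [4.0, 1.0, 3.0, 1.0]))
print(select_schedule(batch, [4.0, 4.0, 1.0, 1.0]))
```

Selection costs one pass over each candidate schedule, which is the property the abstract highlights: no schedule is rebuilt at run time.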

    Optimization of 3-D Wavelet Decomposition on Multiprocessors

    In this work we discuss various ideas for the optimization of 3-D wavelet/subband decomposition on shared memory MIMD computers. We theoretically evaluate the characteristics of these approaches and verify the results on parallel computers. Experiments are conducted on a shared memory as well as a virtual shared memory architecture.

    VLSI architectures design for encoders of High Efficiency Video Coding (HEVC) standard

    The growing popularity of high resolution video and the continuously increasing demand for high quality video on mobile devices are producing a stronger need for more efficient video encoders. In response, HEVC, the newest video coding standard, has been developed by a joint team formed by ISO/IEC MPEG and ITU-T VCEG. Its design goal is to achieve a 50% compression gain over its predecessor H.264 with equal or even higher perceptual video quality. Motion Estimation (ME), one of the most critical modules in video coding, contributes almost 50%-70% of the computational complexity of the video encoder. This high consumption of computational resources limits the performance of encoders, especially for full HD or ultra HD videos, in terms of coding speed, bit-rate and video quality. Thus the major part of this work concentrates on reducing the computational complexity and improving the timing performance of motion estimation algorithms for the HEVC standard. First, a new strategy to calculate the SAD (Sum of Absolute Differences) for motion estimation is designed based on statistics on the properties of pixel data in video sequences. These statistics demonstrate that the size relationship between the sums of two sets of pixels has a determined connection with the distribution of the size relationships between individual pixels from the two sets. Taking advantage of this observation, only a small proportion of pixels needs to be involved in the SAD calculation. Simulations show that the amount of computation required in the full search algorithm is reduced by about 58% on average and up to 70% in the best case. Secondly, from the perspective of parallelization, an enhanced TZ search for HEVC is proposed using novel schemes of multiple MVPs (motion vector predictors) and shared MVP. Specifically, using multiple MVPs, the initial search process is performed in parallel at multiple search centers, and the ME processing engines for the PUs within one CU are parallelized based on the MVP sharing scheme at the CU (coding unit) level. Moreover, the SAD module of the ME engine is also implemented in parallel for a PU size of 32×32. Experiments indicate that it achieves an appreciable improvement in the throughput and coding efficiency of the HEVC video encoder. In addition, the other part of this thesis is devoted to the VLSI architecture design for finding the first W maximum/minimum values, targeting high speed and low hardware cost. The architecture based on the novel bit-wise AND scheme has only half the area of the best reference solution, and its critical path delay is comparable with other implementations. The FPCG (full parallel comparison grid) architecture, which uses the optimized comparator-based structure, is 3.6 times faster on average and up to 5.2 times faster than the reference architectures. Finally, the architecture using the partial sorting strategy reaches a good balance between timing performance and area, with a speed slightly lower than or comparable to that of the FPCG architecture and an acceptable hardware cost.
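For readers unfamiliar with the baseline this abstract improves on, the sketch below shows a plain SAD computed over every pixel of a block, and an exhaustive full search over a small window. It is only the conventional reference method, not the reduced-pixel SAD strategy or the enhanced TZ search proposed in the thesis. Frames are assumed to be lists of rows of integer luma values, and the helper names (`sad`, `get_block`, `full_search`) are illustrative.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(cur, ref, top, left, size, search_range):
    """Exhaustive search; returns (best_sad, dy, dx) for the block at (top, left)."""
    target = get_block(cur, top, left, size)
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, get_block(ref, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


ref = [[0, 0, 9, 8], [0, 0, 7, 6], [0, 0, 0, 0], [0, 0, 0, 0]]
cur = [[0, 0, 0, 0], [0, 9, 8, 0], [0, 7, 6, 0], [0, 0, 0, 0]]
print(full_search(cur, ref, top=1, left=1, size=2, search_range=1))
```

Every candidate position costs a full block of absolute differences, which is why reducing the number of pixels per SAD, as the thesis proposes, directly reduces the cost of the search.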


    In this paper a new approach for coding moving pictures is presented. Because of the large number of calculations, the conventional solution uses tightly coupled multiprocessors working in parallel to achieve real-time processing (encoding 25-30 pictures per second). A new idea is to distribute the workload among workstations connected to a network, where a software package (e.g. PVM, Parallel Virtual Machine) supports the communication between the machines. In contrast with the present hard-wired structures, this loosely coupled system provides more flexibility in coding algorithms and has better cost/performance. The paper describes the main parallel structures already used in video processing, and discusses the possibility of mapping them to this new parallel system. Simulations were also carried out to examine the performance of the most computationally intensive operations (DCT, the Discrete Cosine Transform, and motion estimation). The tests were performed on a cluster of SUN Sparc 2s connected via Ethernet. It was found that DCT did not show any speed-up because of the extremely low CC ratio. However, motion estimation worked well if either a full or hierarchical search was used. This research work was carried out in 1994 at the Information Theory Group of the Department of Electrical Engineering, Technical University of Delft.