Abstract-H.264/AVC applies a mode decision technique with high computational complexity in order to reduce the temporal redundancies of video sequences. Several algorithms have been proposed in the literature in recent years with the aim of accelerating this part of the encoding process. Recently, with the emergence of many-core processors and accelerators, a new approach can be adopted for reducing the complexity of the H.264/AVC encoding algorithm. This paper focuses on reducing the complexity of the inter prediction adopted in H.264/AVC and proposes a GPU-based implementation using CUDA. Experimental results show that the proposed approach reduces the complexity by as much as 99% (a speedup of 100x) while maintaining the coding efficiency.
I. INTRODUCTION
H.264/AVC [1] is the most recent predictive video compression standard and outperforms previous video codecs [2]. It builds on those earlier coding standards to achieve a compression gain of about 50%, largely at the cost of increased encoder complexity [2]. These compression gains are mainly due to variable and smaller block size motion compensation, improved entropy coding, multiple reference frames, and a smaller block transform, among others.
On the other hand, in the past few years new heterogeneous architectures have been introduced into high-performance computing [3]. Examples of such architectures include Graphics Processing Units (GPUs). GPUs are small accelerator devices with hundreds of similar processing cores, designed and organized with the goal of achieving higher performance. The hardware design also includes bigger memory sizes and better interconnection networks. GPUs are thus highly parallel and are normally used as coprocessors to assist the Central Processing Unit (CPU) in processing massive amounts of data.
In order to assist programmers, the main GPU manufacturers provide different tools. In this context, Nvidia proposes a powerful GPU architecture called Compute Unified Device Architecture (CUDA) [4]. A CUDA device is basically a Single Instruction Multiple Data (SIMD) computing device.
Against this background, this paper presents a GPU implementation of part of the H.264/AVC encoding algorithm to assist the CPU. Most of the complexity of the H.264/AVC encoding algorithm lies in inter prediction, the procedure that removes temporal redundancies in video sequences. Motion Estimation (ME) is performed as part of H.264/AVC inter prediction and is its most time-consuming process. In the proposed algorithm, ME is executed in parallel on the GPU. The ME procedure fits the SIMD philosophy well because it performs the same operations over a large amount of data. The proposed parallel ME algorithm is optimized for CUDA architectures: it uses a large number of threads that can be executed over the GPU processing cores and make good use of its resources. As a result, the execution time is greatly reduced with a negligible Rate-Distortion (RD) penalty. This drop in RD is mainly due to Motion Vector (MV) prediction: the current Macro-Block (MB) cannot access the neighbouring predictions because they are being calculated in another processing unit. The present approach is therefore further improved to mitigate the effect of these missing predictors, which are the major impairment when the ME algorithm is run in parallel. The proposed algorithm efficiently builds in parallel a memory matrix that may also be read concurrently to generate an approximation to the optimal MB partition mode as part of the inter prediction algorithm. This efficient use of memory resources makes it possible to use multiple reference frames for ME and high resolutions with reasonable execution times. In this paper the performance evaluation is carried out for VGA and High Definition (HD) video sequences such as 720p. The results show a remarkable time reduction of up to 99% with a negligible coding efficiency penalty.
Moreover, the proposed architecture outperforms one of the fastest ME algorithms proposed in the literature [5] in terms of both coding efficiency and time savings.
The rest of the paper is organized as follows: Section II contains a brief overview of H.264/AVC and GPU programming; Section III reviews some related proposals; Section IV gives details of the approach presented in this paper; Section V describes the performance evaluation; and, finally, conclusions are given in Section VI.
II. BACKGROUND
For inter prediction, the H.264/AVC standard adopts many video coding techniques, such as variable block size, quarter-sample accurate motion compensation, multiple reference frames and weighted prediction. In particular, variable block size ME searches for the best matching block and is thus able to eliminate the temporal redundancy between two or more adjacent frames. In a nutshell, inter prediction in H.264/AVC supports motion compensation block sizes of 16x16, 16x8, 8x16 and 8x8, where each of the sub-divided regions is an MB partition. If the 8x8 mode is chosen, each of the four 8x8 block partitions within the MB may be further split in 4 ways: 8x8, 8x4, 4x8 or 4x4, which are known as sub-MB partitions. Moreover, H.264/AVC defines the method for forming the Motion Vector prediction (MVp), which depends on the motion compensation partition size and on the availability of nearby vectors. This MVp, which is generated from the neighbouring MBs, is added to the current MV to obtain the best matching block.
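As a small illustration of the partition sizes listed above, the following Python sketch counts how many blocks of each size tile one 16x16 MB. The helper name `blocks_per_mb` is hypothetical and introduced only for this example; the total it yields is the number of MB partitions/sub-partitions that the ME process has to evaluate per MB.

```python
# Illustrative sketch: count the blocks of each inter-prediction size in one 16x16 MB.
MB_SIZE = 16

# (width, height) of every MB partition and sub-MB partition size named in the text
PARTITION_SHAPES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def blocks_per_mb(width, height):
    """Number of non-overlapping width x height blocks that tile a 16x16 MB."""
    return (MB_SIZE // width) * (MB_SIZE // height)


counts = {f"{w}x{h}": blocks_per_mb(w, h) for w, h in PARTITION_SHAPES}
total_blocks = sum(counts.values())

print(counts)
print(total_blocks)  # 41
```

The count 1 + 2 + 2 + 4 + 8 + 8 + 16 = 41 treats every size exhaustively over the whole MB, which is how a parallel full-search implementation would typically evaluate them.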
GPUs are small accelerator devices with hundreds of cores which are organized in several SIMD blocks and designed with the goal of achieving high performance. They originate from multimedia and gaming applications, but GPUs have moved from being used exclusively in graphics applications to being used in what is now called General Purpose Computing on GPU (GPGPU) [6]. GPUs are characterized by a high level of parallelism and are usually used as coprocessors to assist the CPU in processing massive amounts of data. The main feature of these devices is a large number of processing elements integrated into a single chip at the expense of a significant reduction in cache memory. For instance, the architecture of Nvidia® GPUs consists of a set of SIMD multiprocessors called Streaming Multiprocessors (SMs). Each SM has 8, 32 or 48 processing elements called cores and a set of resources shared by all cores: 32-bit registers, a configurable local shared memory, a cache for texture GPU memory and a cache for constant GPU memory. Each core executes the same instruction at the same clock cycle but on different data. Device memory can be classified into three kinds, depending on its access mode. More detail about the Nvidia GPU architecture can be found in [4].
III. RELATED WORK
Many approaches have been proposed in the literature to accelerate the H.264/AVC encoding algorithm. Most of them are based on estimating data with faster algorithms, determining which MB partitions are unlikely to be selected (based on some features), or defining stopping criteria. Up to now, however, few solutions make use of GPUs to accelerate this highly complex algorithm, which is the main focus of this paper: exploiting the powerful GPU architecture to accelerate traditional video coding algorithms such as H.264/AVC. Regarding parallel video coding, in 2004 Chen et al. [6] exploited Hyper-Threading architectures to parallelize the H.264/AVC encoder using the OpenMP programming model for Intel architectures. The authors obtained speedups of up to 4x, but this implementation was a hyper-threading approach rather than a GPU-based one. In 2006, Ho et al. [7] presented an ME algorithm for H.264/AVC using GPUs that operates on a block-by-block basis. In 2007, Lee et al. [8] presented a multi-pass and frame-parallel algorithm to accelerate H.264/AVC ME using a GPU. They unroll and rearrange the multiple nested loops using the multi-pass method, and multiple reference frames are handled at a frame-parallel level using the SIMD vector operations of the GPU. In 2008, Ryoo et al. [9] and Chen and Hang [10] presented some optimization principles for a multithreaded GPU using CUDA. The algorithm in [11] is an efficient block-level parallel algorithm for variable block size motion estimation in H.264/AVC; its authors decompose the H.264/AVC motion estimation algorithm into 5 steps in order to achieve highly parallel computation. The major failing of all these approaches is that they do not report RD performance: although the speedup and time reduction are acceptable, they are only meaningful if the RD performance stays as close as possible to that of the sequential approach.
More recently, in 2010, Cheung et al. proposed a GPU implementation of fast ME based on simplified unsymmetrical multi-hexagon search (smpUMHexagonS) [5]. The authors divide the current frame into multiple tiles. Each tile is processed by a single GPU thread, and different tiles are processed concurrently by independent threads on the GPU. They report significant bitrate increases (4%) with a penalty in quality (0.4 dB), depending on the sequence and the tile length. On the other hand, the speedup is around 3x. Another work related to GPUs and video coding, but focusing on intra prediction, is proposed in [13].
IV. PROPOSED ALGORITHM
This section describes the algorithm for implementing the H.264/AVC inter prediction on a GPU. The ME that forms part of the inter prediction process can be carried out in the JM reference software [14] by means of various ME techniques; this work starts from the one known as Full Search. Although it is well known that the technique proposed in [5] offers a better trade-off between coding efficiency and time consumption than Full Search, the proposed algorithm outperforms [5], as will be shown in the next section. The approach presented in this paper focuses only on full-pixel ME, but the algorithm could also be extended to sub-pixel ME, which would further increase the time saving at the expense of a small penalty in coding efficiency.
The idea of the proposed algorithm is to distribute all the computations efficiently over the cores of the GPU. To achieve this, the proposed ME process is divided into three steps (kernels); they need to be executed sequentially, but each one follows a highly parallel procedure on the GPU. The goal of the first kernel is to calculate the Sum of Absolute Differences (SAD) between the current MB (split into sixteen 4x4 partitions) and all MB positions in the reference frame inside the search range. The second kernel then uses these 4x4 block SAD values to obtain the SAD costs for the different partitions and sub-partitions. Finally, the last kernel reduces the SAD costs to a single SAD cost for each of the 41 MB partitions of each MB.
In the first kernel, the goal is to obtain the SAD costs that are needed to build the structured motion tree in the next step. In this step, all threads of a thread block cooperate to copy the assigned MB and the corresponding search area from texture memory to the local shared memory of the multiprocessor. The shared memory is declared as integer, which allows contiguous threads of a multiprocessor to read from contiguous memory banks without bank conflicts. The SAD calculation is carried out on 4x4 blocks; therefore each MB is divided into sixteen 4x4 blocks for each search area position. These SAD costs are stored in the texture memory. The complete search area is computed by rows, with one or more rows corresponding to a thread block, so contiguous search area positions for a certain MB are computed by the same thread block, normally 256 positions per thread block.
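The computation performed by the first kernel can be sketched sequentially in Python. This is only an illustration of the arithmetic under stated assumptions, not the CUDA kernel itself: the function names, the row-major indexing of the sixteen 4x4 blocks and the toy 20x20 frame are all hypothetical, and on the GPU each candidate position would be handled by a thread rather than by a loop.

```python
# Illustrative sequential sketch of the 4x4 SAD step (function names are hypothetical).

def sad_4x4_blocks(cur_mb, ref_frame, x, y):
    """Sixteen 4x4 SADs between a 16x16 MB and the 16x16 candidate block of
    ref_frame whose top-left corner is (x, y). Block index = 4 * block_row + block_col."""
    sads = [0] * 16
    for r in range(16):
        for c in range(16):
            block = 4 * (r // 4) + (c // 4)
            sads[block] += abs(cur_mb[r][c] - ref_frame[y + r][x + c])
    return sads


def sad_all_positions(cur_mb, ref_frame, sa_x, sa_y, width, height):
    """4x4 SADs for every candidate position of a width x height search window
    whose first candidate has its top-left corner at (sa_x, sa_y)."""
    return {
        (dx, dy): sad_4x4_blocks(cur_mb, ref_frame, sa_x + dx, sa_y + dy)
        for dy in range(height)
        for dx in range(width)
    }


# Toy data: a synthetic 20x20 reference frame; the current MB is copied from (2, 3).
ref_frame = [[(7 * x + 13 * y) % 256 for x in range(20)] for y in range(20)]
cur_mb = [row[2:18] for row in ref_frame[3:19]]

costs = sad_all_positions(cur_mb, ref_frame, 0, 0, 5, 5)
print(len(costs))     # 25 candidate positions
print(costs[(2, 3)])  # sixteen zeros: the MB matches the reference exactly there
```

Keeping the sixteen 4x4 values per position, instead of a single 16x16 SAD, is what allows the next step to derive the costs of all larger partitions by simple additions.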
The basic purpose of the second step is to build the structured motion tree, obtaining the SAD cost for all MB partitions/sub-partitions. This step also carries out a first reduction because of the large amount of data generated. The sixteen 4x4 SAD costs associated with the 64 positions of any thread block are allocated in shared memory. All threads of the same thread block cooperate to allocate the data, build the SAD costs in the SM local shared memory for all MB partitions/sub-partitions (Figure 1) and iterate with binary reductions per partition in order to obtain the best SAD cost for all partitions/sub-partitions. In the proposed algorithm, each thread block calculates 64 positions inside the search area. Finally, the last kernel obtains the best SAD cost for each of the MB partitions/sub-partitions of each MB, using the same binary reduction procedure as kernel 2.
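The following Python sketch illustrates the two ideas of this step: combining sixteen 4x4 SADs into the costs of all 41 partitions/sub-partitions, and selecting the best candidate position with a binary (tree) reduction. It is a sequential approximation under assumptions: the names are hypothetical, partitions are indexed in raster order over the whole MB (the standard orders sub-MB partitions inside each 8x8 block), and ties are broken in favour of the lower position index.

```python
# Illustrative sketch of the motion-tree construction and the binary reduction.

# (mode name, block width, block height), both measured in 4x4 blocks
MODES = [("16x16", 4, 4), ("16x8", 4, 2), ("8x16", 2, 4), ("8x8", 2, 2),
         ("8x4", 2, 1), ("4x8", 1, 2), ("4x4", 1, 1)]


def partition_sads(sad4x4):
    """Combine sixteen 4x4 SADs (row-major 4x4 grid) into the 41 partition SADs."""
    out = {}
    for name, bw, bh in MODES:
        idx = 0
        for top in range(0, 4, bh):
            for left in range(0, 4, bw):
                out[(name, idx)] = sum(sad4x4[4 * r + c]
                                       for r in range(top, top + bh)
                                       for c in range(left, left + bw))
                idx += 1
    return out


def binary_min_reduction(costs):
    """Tree-style minimum: each pass folds the upper half onto the lower half,
    as cooperating threads would do in shared memory. Returns (cost, index)."""
    items = [(cost, i) for i, cost in enumerate(costs)]
    n = len(items)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n - half):
            items[i] = min(items[i], items[i + half])
        n = half
    return items[0]


def best_per_partition(sad4x4_by_position):
    """Best (cost, position) for each of the 41 partitions over all positions."""
    positions = list(sad4x4_by_position)
    per_pos = [partition_sads(sad4x4_by_position[p]) for p in positions]
    best = {}
    for key in per_pos[0]:
        cost, i = binary_min_reduction([d[key] for d in per_pos])
        best[key] = (cost, positions[i])
    return best


example = {(0, 0): [1] * 16, (1, 0): [0] * 8 + [2] * 8, (0, 1): [3] * 16}
best = best_per_partition(example)
print(len(best))            # 41
print(best[("16x8", 0)])    # (0, (1, 0))
```

In this toy input the upper 16x8 partition is matched best at position (1, 0) while the 16x16 partition is matched equally well at (0, 0), which shows why each partition needs its own reduction.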
The use of multiple reference frames in the ME procedure is one of the improvements introduced by H.264/AVC with the aim of reducing the bit rate of the encoded sequence. In this case, our 3-step algorithm is carried out once per reference frame. Figure 2 shows a simplified activity diagram of our parallel ME proposal (which also supports multiple reference frames).
Fig. 2. Proposal Activity diagram
In the H.264 standard, the MVp is the median of the MVs of the adjacent left, top and top-right blocks. Therefore, the MVs of the neighbouring MBs need to be determined first, and this dependency makes it difficult to use GPUs for ME.
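The median rule can be illustrated with a few lines of Python. This is a simplified sketch: the function name is hypothetical, and the special cases defined by the standard (unavailable neighbours, directional rules for 16x8 and 8x16 partitions) are deliberately left out.

```python
# Illustrative sketch of the component-wise median MV predictor (simplified).

def median_mv_predictor(mv_left, mv_top, mv_top_right):
    """Component-wise median of three neighbouring motion vectors given as (x, y)."""
    xs = sorted((mv_left[0], mv_top[0], mv_top_right[0]))
    ys = sorted((mv_left[1], mv_top[1], mv_top_right[1]))
    return (xs[1], ys[1])


mvp = median_mv_predictor((4, -2), (1, 0), (3, 5))
print(mvp)  # (3, 0)
```

Because all three inputs are final MVs of neighbouring blocks, the predictor for one MB cannot be computed until those neighbours have finished their own ME, which is the serial dependency mentioned above.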
Furthermore, the approach presented in this paper also tries to mitigate the effect of the missing MVp, which is one of the biggest challenges when performing the ME process in parallel. The idea is to reuse the MV of the previous frame to adjust the MB search area. The MVp does not have a high impact for low resolution and/or low motion sequences, but the lack of MV predictors for higher resolutions and/or high motion sequences may result in a significant PSNR drop and an increase in the bitrate required to encode the sequence.
At the beginning of the first step, in which all 4x4 SAD costs for all MBs in a frame are calculated in parallel, the search area starting coordinates ($SA_x$ and $SA_y$) are calculated using Equations 1 and 2. Both equations depend on the MB position inside the frame ($MB_x$ and $MB_y$) and on the prediction ($pred_x$ and $pred_y$) derived from previous frames (reference frames only); obtaining this prediction is the main challenge of this proposal.
$$SA_x = MB_x + pred_x \tag{1}$$
$$SA_y = MB_y + pred_y \tag{2}$$
The search area in our work is the same for all MB partitions and sub-partitions involved in the ME algorithm, so we need one MV predictor per MB. We tested several ways of obtaining this predictor, and the result was that the best way to capture the motion is to use only the MV of the biggest MB partition (16x16) from the previous frame as the MV predictor. However, as a consequence of using motion information from the previous frame instead of information from adjacent MBs, we use half of the 16x16 MV, because the distance between the two frames is 2 and the prediction is therefore less reliable. The 16x16 MVs are kept in GPU DRAM from one frame to the next to be used as MV predictors in the next frame. As a consequence of the prediction, a better MV can be found, but the MV may be longer, which increases the number of bits required to code it. It is therefore necessary to include a penalization for large MVs before starting the reduction in the second kernel. The penalization is based on Equation 3:
$$\mathit{newSAD\_cost} = \mathit{SAD\_cost} + K \cdot \mathit{vector\_bits} \tag{3}$$
where newSAD_cost is the penalized SAD cost, SAD_cost is the SAD cost without penalization, K is a constant which depends on the QP with which the sequence is being coded and is defined by the JM reference software [14], and vector_bits is the number of bits needed to code the MV associated with each position.
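The predictor and penalization steps described above can be sketched in Python as follows. Several points are assumptions of this sketch rather than details given in the text: the function names are hypothetical, the halved MV is rounded towards zero, the penalty is taken to be additive (SAD cost plus K times the MV bits), and the numeric values of K and of the bit counts are arbitrary.

```python
# Illustrative sketch: previous-frame MV predictor, search area start, MV penalty.

def mv_predictor_from_previous(prev_mv_16x16):
    """Half of the 16x16 MV of the previous frame (rounding towards zero is assumed)."""
    return (int(prev_mv_16x16[0] / 2), int(prev_mv_16x16[1] / 2))


def search_area_start(mb_x, mb_y, pred):
    """Search area starting coordinates: MB position shifted by the predictor."""
    return (mb_x + pred[0], mb_y + pred[1])


def penalized_sad(sad_cost, k, vector_bits):
    """Assumed additive penalty: SAD cost plus K times the bits needed for the MV."""
    return sad_cost + k * vector_bits


pred = mv_predictor_from_previous((7, -5))
start = search_area_start(32, 48, pred)
print(pred)   # (3, -2)
print(start)  # (35, 46)

# Two hypothetical candidates: a lower SAD with a long MV versus a slightly
# higher SAD with a short MV. With K = 4 the short vector wins.
cost_long = penalized_sad(100, 4, 12)
cost_short = penalized_sad(110, 4, 2)
print(cost_long, cost_short)  # 148 118
```

The example shows the intended effect of the penalty: a candidate with a slightly worse match can be preferred when its MV is much cheaper to code.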
V. PERFORMANCE EVALUATION
The main challenge of this proposal is to maintain the encoding efficiency while reducing the time consumed by the inter prediction process of the H.264/AVC encoding algorithm. The proposed approach has been developed and tested in the JM 17.2 implementation of the H.264/AVC encoder, but it could also be applied to other implementations of the standard, such as x264 [15], with similar results. The evaluation has been carried out with a recent and powerful GPU, the Nvidia GTX480, whose characteristics are given in Table I. For testing, four VGA and four 720p sequences were encoded using a GOP pattern composed of one I frame followed by eleven P frames, and the Quantization Parameter (QP) was set to 28, 32, 36 and 40 in order to perform the evaluation according to [16]. The length of all test sequences is 300 frames.
The configuration parameters used in the H.264/AVC encoder configuration file were the defaults of the baseline profile, except for the following changes made for the evaluation: the search range was set to 32, which means a search area with 4096 (64 x 64) positions; the number of reference frames was set to 1, 3 and 5; the frame rate parameter was set to 30 for VGA sequences and 50 for 720p sequences (the sequences were sampled at 30 Hz and 50 Hz respectively); and RD-optimization was disabled.
The performance evaluation of our proposed H.264/AVC encoder, based on JM 17.2, was carried out in an environment composed of an Intel® Core™ i7 930 running at 2.80 GHz, with 6 GB of DDR3 memory and the Nvidia GTX480 GPU. The operating system was Ubuntu 10.04 with the Nvidia GPU driver 260.19.
A. Metrics
In order to evaluate the time saved by the proposed algorithm with respect to the reference H.264/AVC encoder, two metrics were used: Time Reduction (TR), defined in Equation 4, and Speedup, defined in Equation 5.
$$TR\,(\%) = \frac{T_{JM} - T_{FI}}{T_{JM}} \times 100 \tag{4}$$
$$Speedup = \frac{T_{JM}}{T_{FI}} \tag{5}$$
where $T_{JM}$ denotes the coding time used by the reference software and $T_{FI}$ is the time taken by the algorithm proposed in this paper. $T_{JM}$ and $T_{FI}$ can refer either to the time needed to encode the full sequence or to the time needed to carry out the ME. In both cases $T_{FI}$ also includes the computational cost of all the operations needed to prepare the information required by our proposal.
B. Results
Table II shows the RD performance and time reduction (both full sequence and ME) for VGA sequences according to the number of reference frames used. As the results show, the RD performance is almost the same as that of the reference, and for the Waterfall sequence it is even better. Table III shows the RD performance and time reduction of our proposed algorithm for 720p sequences. The RD performance obtained is practically the same as that of the reference software, sometimes slightly better and sometimes slightly worse, regardless of the number of reference frames. Furthermore, for both resolutions our proposal obtains considerable time reductions, which increase as more reference frames are included. The average full-sequence time reduction is always above 88%, while the average ME time reduction is above 98% in all cases.
Figure 3 shows the RD results for VGA sequences using 1, 3 and 5 reference frames, with QP ranging from 28 to 40. Both the Waterfall and Mobile sequences are shown in this figure, with one curve per number of reference frames. The performance of the proposed GPU-based algorithm is very close to that of the reference algorithm, with a remarkable time reduction. Figure 4 shows the RD results for the reference and the proposed approach for different sequences in 720p format using 1, 3 and 5 reference frames, with QP ranging from 28 to 40. As seen from the figure, the PSNR against bit rate obtained with the proposed encoder deviates only slightly from the results of the sequential reference encoder.
As expected, Figures 3 and 4 show that using more reference frames increases the quality of the H.264/AVC bit stream while reducing the number of bits required to encode it. Due to space limitations, only a subset of the complete set of sequences is shown in both figures.
Figures 5 and 6 show the speedup for the full-sequence encoding process and for the ME process obtained by our proposal compared with the reference software. The results show that our algorithm obtains ME speedups of over 97x and 83x for VGA and 720p resolutions respectively, which correspond to speedups of over 17x and 14x for the full-sequence encoding procedure. Finally, Figure 7 compares, for the Mobile sequence, the performance of our GPU-based approach with that of the smpUMHexagonS approach [5], which is implemented in JM 17.2 [14].
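The relation between the two metrics can be checked with a short Python sketch. It assumes the conventional definitions, in which the time reduction is the relative time saved with respect to the reference encoder (in percent) and the speedup is the ratio of reference time to proposed time; the function names are hypothetical.

```python
# Illustrative sketch relating time reduction (TR) and speedup.

def time_reduction(t_ref, t_prop):
    """Time reduction in percent, relative to the reference encoder time."""
    return 100.0 * (t_ref - t_prop) / t_ref


def speedup(t_ref, t_prop):
    """Ratio of the reference encoder time to the proposed encoder time."""
    return t_ref / t_prop


def time_reduction_from_speedup(s):
    """TR in percent implied by a given speedup s."""
    return 100.0 * (1.0 - 1.0 / s)


print(time_reduction(100.0, 1.0))  # 99.0
print(speedup(100.0, 1.0))         # 100.0

for s in (97, 83, 17, 14):
    print(s, round(time_reduction_from_speedup(s), 2))
```

Under these definitions, speedups of 97x and 83x correspond to time reductions of roughly 99.0% and 98.8%, and speedups of 17x and 14x to roughly 94.1% and 92.9%, which is consistent with the average ME and full-sequence time reductions quoted in the text.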
SmpUMHexagonS is the fastest ME technique available in this reference software. The results show that our algorithm obtains a speedup over SmpUMHexagonS of more than 3x and more than 2x for VGA and 720p resolutions respectively. In terms of coding efficiency, the proposed algorithm outperforms the SmpUMHexagonS technique in both PSNR and bitrate, as Table IV shows. In fact, our GPU-based algorithm reduces the bitrate by 1.21% and 3.87% for VGA and 720p respectively, while improving the quality of the sequence by 0.041 dB and 0.115 dB for VGA and 720p respectively.
Fig. 7. Speedup of the proposed GPU-based algorithm against SmpUMHexagonS [5]
VI. CONCLUSION
This paper presents an algorithm that executes the ME of the H.264/AVC inter prediction procedure concurrently on an accelerator-based many-core system (GPU). The algorithm is based on an efficient construction of data structures that can be generated and read in parallel in order to carry out the inter prediction procedure of H.264/AVC. Exploiting the computational capacity of current GPUs provides another way to accelerate inter prediction and thus to develop faster video encoders. The algorithm is also adapted to ME with multiple reference frames and to high video resolutions. The results show practically the same RD performance as the reference encoder together with a considerable time reduction.
