Abstract
Introduction
H.264/AVC is a recent video compression standard that can often achieve a much higher compression ratio than other video codecs [1]. To achieve such performance, it employs a very complex encoding process in which motion estimation (ME) is the most time-consuming part, taking about 60%-80% of the total computational load [2].
To improve H.264/AVC coding efficiency, many researchers have focused on accelerating ME in either software [3, 4] or hardware [5, 6]. Both types of technique have shortcomings. Software algorithms often check only a few candidate positions to save computing time; they are suited to CPUs, which run at very high clock frequencies and have good branching capability. Compared with software algorithms, hardware approaches are inflexible and require long development time when the underlying algorithm changes.
At present, powerful multi-core programmable processors are becoming available to the consumer market. Although the Graphics Processing Unit (GPU), which usually employs a large number of parallel processors, was originally designed for computer graphics, it has been widely used for general-purpose computing because of its cost and programmability [8]-[10]. Modern GPUs may consist of hundreds of highly decoupled processing cores to achieve immense parallel computing performance. For example, the NVIDIA GeForce 8800 GTS processor, which is used in our simulations, consists of 96 individual stream processors, each running at 1.2 GHz [11].
Some recent work has focused on using the GPU to perform ME in H.264 video encoding [12]-[14]. Y. Lin et al. [12] proposed a multi-pass algorithm to accelerate motion estimation on a traditional GPU architecture. By using the multi-pass method to unroll and rearrange the multiple nested loops in motion estimation, speed-ups of about 2 times and 14 times can be achieved for integer-pixel ME and half-pixel ME respectively. However, the multi-pass algorithm [12], being based on a traditional GPU architecture, cannot make full use of the powerful computing resources of the GPU. Wei-Nien Chen and Hsueh-Ming Hang [13] presented an efficient block-level parallel algorithm on the compute unified device architecture (CUDA) platform, which starts multiple kernels in parallel to compute and compare the SAD (sum of absolute differences) costs of different sub-block modes, achieving a 12 times speed-up for variable block size full search (FS) motion estimation. But this approach still has high memory access latency, because it does not take full advantage of the high-speed on-chip memory of the GPU for data transfer. The SAD costs generated by the kernels must be written to device DRAM each time, so that later kernels can use them to compute further SAD values. In this paper, taking the memory architecture of the GPU [7] into consideration, we make full use of the high-speed on-chip memory of the GPU, such as registers and shared memory, to optimize the implementation of the parallel FS algorithm.
The rest of this paper is organized as follows: Section II presents the CUDA hardware and programming models, Section III briefly describes the FS algorithm for motion estimation in H.264/AVC and its parallel implementation on CUDA, Section IV shows the experimental results and analyzes the performance of our algorithm, and Section V concludes.
CUDA hardware architecture and programming model
Hardware architecture
CUDA is a powerful GPU architecture with multiple streaming multiprocessors (SMs) inside. The hardware architecture is shown in Fig.1. Each SM includes multiple streaming processors (SPs) and four types of on-chip memory [7]. In CUDA programming, the GPU acts as a coprocessor to the CPU, and a program is composed of two parts: the control program running on the host (CPU) and the kernel running on the device (GPU). When executing the kernel, data must first be transferred from the host DRAM to the device DRAM (global memory); when the computations are completed, the results must be transferred back from the device DRAM to the host DRAM. Nowadays most graphics cards are connected to the host through the PCI Express (PCI-E) bus. For example, the NVIDIA GeForce 8800 GTS processor, which is used in our simulations, supports the PCI-E 2.0 specification. Theoretically, the uplink and downlink bandwidths of PCI-E 2.0 ×16 are each only 8 GB/s, which is much lower than the bandwidth of the device DRAM and on-chip memory of the GPU. If there is too much data transfer between the host and the device, the bandwidth of the PCI Express bus becomes a bottleneck that hinders the performance of the program. Applications should therefore strive to minimize data transfer between the host and the device.
Although device DRAM has large bandwidth, its access latency is high, at around 200-300 clock cycles. The GPU therefore provides several types of high-speed on-chip memory to lower the latency. One example is shared memory, which is on-chip and much faster than local and global memory. Accessing shared memory is fast as long as there are no bank conflicts between the threads, in which case the access latency is only 1/100 of that of local and global memory. The texture memory space resides in device memory and is cached in the texture cache, which provides higher bandwidth when texture fetches exhibit locality. Texture memory is very suitable for processing images and look-up tables, and it also speeds up random and non-aligned access to large amounts of data [7]. For these reasons, the proposed parallel algorithm binds the current frame and the reference frame to texture memory; the intermediate results produced in the kernel are stored in registers and shared memory to further speed up memory access.
Programming model
As shown in Fig.2, the CUDA programming model is divided into three levels: thread, thread block and grid. Threads execute the data-parallel computations of the kernel and are clustered into blocks of threads referred to as thread blocks. These thread blocks are further clustered into grids. During implementation, the designer can configure the number of threads that constitute a block as well as the number of blocks that constitute a grid. Each thread inside a block has its own registers and local memory. The threads within a thread block can cooperate with each other through the shared memory and can synchronize their execution to coordinate their memory accesses. Nevertheless, threads in different thread blocks are unable to access the same shared memory and thus run independently.
When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. Each thread block is split into SIMD (Single-Instruction-Multiple-
Full search algorithm for motion estimation
Block-matching motion estimation is widely adopted by various video compression standards to exploit the high temporal redundancy among successive frames. It divides frames into non-overlapping blocks of equal size and, for each block in the current frame, finds the displacement of the best-matched block in the reference frame within a search window; this displacement is the motion vector of the block. Matching is performed by minimizing a matching criterion, which in most cases is the sum or mean of absolute differences between a pair of blocks; the best-matched block is the one that produces the minimum distortion.
The full search (FS) algorithm is the most straightforward and accurate way to find the optimal motion vectors: it exhaustively evaluates all possible candidate points within the search window. In searching for the best match, the correlation window is moved to each candidate position within the search window. A total of (2p+1) × (2p+1) positions need to be examined, where p is the search range for the block. The minimum dissimilarity gives the best match. While FS gives the globally optimal solution to motion estimation, it demands a substantial amount of computation, which limits its application in real-time video compression.
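The exhaustive search described above can be illustrated with a minimal CPU sketch in Python. It is not the CUDA implementation of this paper: the function names, the small search range p = 2, the synthetic frames, and the decision to skip candidates that fall outside the frame are all assumptions made for illustration.

```python
import random

def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between a block of `cur` and a block of `ref`."""
    total = 0
    for i in range(size):
        for j in range(size):
            total += abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
    return total

def full_search(cur, ref, top, left, size=16, p=2):
    """Test every displacement (dy, dx) with |dy|, |dx| <= p.

    Returns (best_sad, (dy, dx)). Candidates that leave the frame are
    skipped, which is a simplification of this sketch.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(cur, ref, top, left, y, x, size)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best

# Synthetic frames: the current frame is the reference displaced by (1, 2),
# so the block at (16, 16) should match the reference at (17, 18) exactly.
random.seed(0)
N = 48
ref = [[random.randrange(256) for _ in range(N)] for _ in range(N)]
cur = [[ref[(i + 1) % N][(j + 2) % N] for j in range(N)] for i in range(N)]
best_sad, best_mv = full_search(cur, ref, top=16, left=16)
```

With p = 2 the inner loops visit at most (2p+1) × (2p+1) = 25 candidates per block; the cost grows quadratically with p, which is why the exhaustive search is a natural target for parallelization.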
Parallel full search algorithm for motion estimation
H.264/AVC employs a tree-structured motion compensation algorithm. The basic unit in the H.264/AVC motion estimation process is a 16×16 macroblock (MB). The current frame is partitioned into 16×16 MBs. Each 16×16 MB can be split into 16×8, 8×16, and 8×8 blocks, and each 8×8 block can be further partitioned into 8×4, 4×8 and 4×4 blocks. The MB partition is shown in Fig.3. After macroblock partition, the best motion vector (MV) of each possible mode (block size) is calculated. Because of the numerous candidate modes, the H.264/AVC motion estimation algorithm is generally extremely complex and time-consuming. Chen and Hang's algorithm therefore starts multiple kernels to compute and compare the SAD values of different sub-block modes in parallel [13]. The algorithm first divides an MB into sixteen 4×4 blocks, and the SAD value of each 4×4 block is calculated in parallel for all candidate motion vectors (positions) within the search range on the reference frame. Then, the 8×4-SAD and 4×8-SAD costs are obtained by adding two 4×4-SAD costs. The 8×8-SAD costs are obtained by adding two 4×8-SAD costs. The 16×8-SAD and 8×16-SAD costs are obtained by adding two 8×8-SAD costs, and finally the 16×16-SAD costs are obtained by adding two 8×16-SAD costs. As a result, the SAD costs of the different sub-partitions have to be transferred between different kernels. Because threads in different thread blocks run independently, the SAD costs produced by each kernel must be written back to the device DRAM every time so that they can be reused by other kernels. This approach therefore has large memory access latency. In contrast, our parallelized approach integrates the whole motion estimation process in a single CUDA kernel. No intermediate data need to be sent back and forth between host and device memory, and the limited bandwidth between them does not become a bottleneck that slows down the algorithm.
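The hierarchical merging of SAD costs described above can be sketched on the CPU as follows. This is an illustrative Python sketch for a single candidate motion vector, not the kernel code of [13] or of this paper; the function name, the row/column indexing of the 4×4 blocks, and the convention that W×H means width × height are assumptions.

```python
def merge_sads(s4x4):
    """Build the larger-partition SADs from sixteen 4x4 SADs.

    s4x4[r][c] is the SAD of the 4x4 block in row r, column c of the
    macroblock (0 <= r, c < 4), all for the same candidate position.
    """
    # 8x4 (8 wide, 4 high): two horizontally adjacent 4x4 blocks
    s8x4 = [[s4x4[r][2 * c] + s4x4[r][2 * c + 1] for c in range(2)] for r in range(4)]
    # 4x8 (4 wide, 8 high): two vertically adjacent 4x4 blocks
    s4x8 = [[s4x4[2 * r][c] + s4x4[2 * r + 1][c] for c in range(4)] for r in range(2)]
    # 8x8: two horizontally adjacent 4x8 blocks
    s8x8 = [[s4x8[r][2 * c] + s4x8[r][2 * c + 1] for c in range(2)] for r in range(2)]
    # 16x8: two horizontally adjacent 8x8 blocks (top half, bottom half)
    s16x8 = [s8x8[r][0] + s8x8[r][1] for r in range(2)]
    # 8x16: two vertically adjacent 8x8 blocks (left half, right half)
    s8x16 = [s8x8[0][c] + s8x8[1][c] for c in range(2)]
    # 16x16: the two 8x16 halves
    s16x16 = s8x16[0] + s8x16[1]
    return {"4x4": s4x4, "8x4": s8x4, "4x8": s4x8, "8x8": s8x8,
            "16x8": s16x8, "8x16": s8x16, "16x16": s16x16}

# Toy input: the 4x4 SADs are simply 0..15 in raster order.
example = [[4 * r + c for c in range(4)] for r in range(4)]
costs = merge_sads(example)
```

Every larger cost is a sum of two smaller ones, so once the sixteen 4×4 SADs are available no pixel has to be read again; only additions remain.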
Implementation of parallel full search algorithm for motion estimation on CUDA
First, the reference frame and the current frame are loaded into texture memory, which resides in device memory and is cached in the texture cache to speed up memory access. Besides the higher bandwidth, we can also make use of texture memory's linear filtering and automatic type conversion, which run on non-programmable hardware resources and so avoid occupying the programmable units [7]. Likewise, for the boundary pixels of a frame, we can directly employ the special functions of texture memory to simplify the handling of out-of-range texture coordinates.
Our parallelized approach maps the motion estimation of one macroblock to one thread block in the CUDA programming model. Therefore, if the resolution of the frame is Width × Height, the total number of thread blocks is:
$$\mathrm{BLOCK\_NUM} = \frac{\mathrm{Width}}{16} \times \frac{\mathrm{Height}}{16}$$
Given the limited size of the registers and shared memory, only 256 threads are set up in a thread block to carry out the motion estimation of a macroblock in parallel. The kernel is then invoked. Initially, the macroblock (MB) and search area (SA) pixels are cached into the shared memory from the texture memory based only on the block ID and thread ID, as illustrated in the accompanying figure. Once the MB and SA pixels are cached in the shared memory, the threads start to calculate the SAD values for the basic 4×4 blocks. There are (Width/16) × (Height/16) MBs in a frame. The 32×32 search window leads to 1024 candidate positions (MVs) for each macroblock. Our parallelized approach maps the motion estimation of a macroblock to a thread block, and the different candidates within the search window to the threads of that thread block, as illustrated in Fig.5. Since a thread block contains 256 threads, each thread needs to compute the SAD values of 4 candidates. The SAD value of a 4×4 sub-block is computed using the __sad(x, y, z) intrinsic instruction. Our experiments show that the best performance is obtained when this instruction is used, in contrast to alternative approaches that use integer or float instructions. In our method, the whole full-pixel motion estimation for an MB is integrated in a single CUDA kernel, so the SAD values obtained for the 4×4 sub-blocks are stored in registers and immediately reused for hierarchically computing the SAD values of the larger sub-blocks. The amount of memory access is thereby considerably reduced.
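The arithmetic behind this mapping can be checked with a short Python sketch. The strided assignment of candidates to threads, the raster-order mapping from candidate index to a motion vector in the range -16..15, and all function names are assumptions for illustration; the paper states only that each of the 256 threads handles 4 of the 1024 candidates. The helper `sad_intrinsic` is a Python stand-in for the CUDA `__sad(x, y, z)` intrinsic, which returns |x - y| + z.

```python
MB_SIZE = 16    # macroblock edge in pixels
SEARCH = 32     # search window edge: 32 x 32 = 1024 candidates
THREADS = 256   # threads per thread block

def block_num(width, height):
    """Number of thread blocks, one per 16x16 macroblock."""
    return (width // MB_SIZE) * (height // MB_SIZE)

def candidates_for_thread(tid):
    """Candidate indices handled by thread `tid` (assumed strided layout)."""
    per_thread = (SEARCH * SEARCH) // THREADS   # 1024 / 256 = 4
    return [tid + k * THREADS for k in range(per_thread)]

def candidate_to_mv(index):
    """Assumed raster-order mapping from candidate index to (dx, dy)."""
    return (index % SEARCH - SEARCH // 2, index // SEARCH - SEARCH // 2)

def sad_intrinsic(x, y, z):
    """Stand-in for CUDA's __sad(x, y, z): |x - y| + z."""
    return abs(x - y) + z

def sad_4x4(cur, ref):
    """Accumulate a 4x4 SAD by chaining the intrinsic through its third operand."""
    acc = 0
    for row_cur, row_ref in zip(cur, ref):
        for a, b in zip(row_cur, row_ref):
            acc = sad_intrinsic(a, b, acc)
    return acc

formats = {"QCIF": (176, 144), "CIF": (352, 288), "4CIF": (704, 576)}
blocks = {name: block_num(w, h) for name, (w, h) in formats.items()}
covered = sorted(i for t in range(THREADS) for i in candidates_for_thread(t))
```

For the three test formats this gives 99, 396 and 1584 thread blocks respectively, and the 256 threads together cover each of the 1024 candidates exactly once.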
After the variable block-size SAD costs are generated, all 1024 SAD costs of one block are compared and the candidate with the least SAD is chosen as the integer-pixel motion vector. Since the threads within a thread block can cooperate through shared memory and can synchronize their execution to coordinate their memory accesses, the proposed approach uses shared memory to cache intermediate results when comparing SAD values, minimizing device DRAM access. First, each thread reads its 4 SAD values from the registers and produces the least of them. These temporary SAD values and their indexes are stored in the shared memory. Then, we activate 128 threads, each of which compares two SAD values and stores the smaller one back to the shared memory. In the next iteration, the number of threads is halved. This process is repeated until the final winner (the best candidate) is obtained (see Fig.6). In this process, a SAD value and an MV are packed into one 32-bit integer: the low 16 bits store the SAD value and the high 16 bits store the MV. After 7 iterations, the smallest SAD is identified, and the corresponding MV is stored back to the host memory.
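The packing and the halving comparison described above can be simulated sequentially in Python. This is a sketch under stated assumptions rather than the kernel itself: the function names are hypothetical, the MV is represented by a candidate index, the inner loop stands in for threads that would run in parallel with a barrier after each step, and ties are resolved in favor of the lower slot.

```python
def pack(sad, mv_index):
    """Pack a SAD (low 16 bits) and an MV index (high 16 bits) into one word."""
    return (mv_index << 16) | (sad & 0xFFFF)

def unpack(word):
    """Return (sad, mv_index) from a packed 32-bit word."""
    return word & 0xFFFF, word >> 16

def reduce_min(shared):
    """Halving reduction over packed words; returns the winning (sad, mv_index)."""
    shared = list(shared)           # stands in for the shared-memory array
    active = len(shared) // 2       # 128 active threads for 256 entries
    while active >= 1:
        for tid in range(active):   # parallel on the GPU, then a barrier
            a, b = shared[tid], shared[tid + active]
            # Compare only the SAD field, since it sits in the low bits.
            if (b & 0xFFFF) < (a & 0xFFFF):
                shared[tid] = b
        active //= 2
    return unpack(shared[0])

# Toy data: one per-thread minimum for each of 256 threads.
sads = [(i * 37 + 11) % 5000 + 100 for i in range(256)]
sads[77] = 5                        # plant a unique global minimum
words = [pack(s, i) for i, s in enumerate(sads)]
best = reduce_min(words)
```

A 16×16 SAD of 8-bit pixels is at most 256 × 255 = 65280, so it fits in the low 16 bits. Because the SAD occupies the low half of the word, the comparison has to mask out the MV field rather than compare the packed words directly.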
Experimental results
To demonstrate the performance of the algorithm proposed in this paper, the following development environment is used: Intel Core2 6320 1.86 GHz with 1 GB memory, NVIDIA GeForce 8800 GTS with 320 MB DRAM, CUDA toolkit and SDK 2.0, and the NVIDIA driver for Microsoft Windows XP (178.15). The popular MPEG test sequences in QCIF, CIF and 4CIF formats (100 frames) are examined with a 32×32 search window.
Our parallelized approach integrates the whole full-pixel motion estimation process in a single CUDA kernel. With the help of the high-speed registers and shared memory, the number of device DRAM accesses is greatly reduced. As shown in Table 1, when the SA pixels are cached in shared memory, the processing time decreases by a factor of nearly 12. For further comparison, experimental results are presented in Table 2, Table 3 and Table 4 respectively. From these tables, we can see that our implementation on CUDA demonstrates substantial improvement over its CPU counterpart. In addition, the speed-up is greater when the application is more computationally intensive. For example, the speed-up for the 4CIF format is greater than that for the CIF and QCIF formats.
Integer pixel SAD comparison 12 9
Total 48 74
Conclusions
In this paper, we proposed a new CUDA-based parallelized approach to implement the most time-consuming H.264 coding process, FS motion estimation. CUDA is a powerful GPU architecture that offers parallel computation capability through hundreds of highly decoupled processing cores to accelerate arithmetic-intensive applications. In our proposed algorithm, the whole full-pixel motion estimation for an MB is integrated in a single CUDA kernel, which calculates and compares the variable block-size SAD values in parallel. At the same time, this method takes full advantage of the high-speed on-chip memory of the GPU, such as registers and shared memory, to minimize the amount of device DRAM access. Experimental results show that the proposed approach can be 50 times faster than the traditional CPU implementation, and that even 4CIF sequences come close to real-time processing.
