Abstract-The high efficiency video coding standard provides excellent coding performance but is also very complex. Especially, the intra mode decision is very time-consuming due to the large number of available prediction modes and the flexible block partitioning scheme. In this paper, a highly parallel intra prediction algorithm for heterogeneous CPU+graphics processing unit (GPU) platforms is proposed, which accelerates the encoder dramatically. It is targeted toward high-quality high definition (HD) and ultra HD applications and utilizes prediction based on original samples (POSs), where the reference samples are generated from original pixels. This makes it possible to perform intramode prediction for all prediction blocks of a video frame concurrently. In addition, parallel-friendly cost functions are proposed which enable parallel rate distortion optimization with no synchronization overhead. A detailed statistical analysis of both POS and the proposed GPU intramethod is provided and the coding performance of the presented prototype is evaluated based on a large amount of experimental data. It is shown that the complexity of the intramode selection on the CPU is reduced by up to 78.03%. This translates to significant encoding time reductions of up to 64.52% for a single-threaded encoder and up to 94.82% in combination with wavefront parallel processing. In high bitrate ranges, average rate increases of only 2.11%-4.26% and 0.80%-2.34% are observed for the proposed high-speed and high-quality configurations, respectively. Furthermore, GPU intra is shown to be extremely efficient in lossless coding scenarios, where up to 53.37% time is saved with an average bitrate increase of only 0.55% among all test cases.
Display devices with 4K UHD resolution are already available and it is merely a matter of time until 8K capable televisions and monitors become the norm. In order to support such high resolutions in today's networked consumer electronics devices, state of the art video coding standards are necessary. The High Efficiency Video Coding (HEVC) standard was designed with the described scenario in mind. It was developed by the Joint Collaborative Team on Video Coding (JCT-VC) and the first draft was completed in early 2013. The standardization process steadily pursued a very ambitious goal: HEVC should provide a 50% better compression performance than its direct predecessor the H.264 Advanced Video Coding standard (H.264/AVC). This goal has been achieved and it is hence very likely that HEVC, with all its tools and extensions, will play a significant role in the foreseeable future [1] . It has been furthermore shown that HEVC can compress UHD videos with excellent quality under existing bandwidth constraints, where test subjects were barely able to tell the difference between the uncompressed source material and the coded videos [2] .
Like most other video codecs, HEVC features both intraand inter-frame prediction. Optimal performance is of course only achieved when both tools are utilized to the fullest. Nonetheless, there are a number of use cases in which intraonly configurations are required. These include ultra low delay applications like video game streaming services where players send input commands to dedicated servers and receive live video footage of the game they are currently playing. Furthermore, intra-only coded videos allow for better recovery from data loss in wireless networks, since every individual frame is self-contained. Lossless compression is a potential additional requirement in many environments, which imposes additional challenges on the employed video codecs.
Fortunately, HEVC's intra module and its lossless coding mode can cope with these demands. Cai et al. [3] have compared the intra and lossless performance of HEVC with numerous state-of-the-art intra-only video codecs and found that HEVC is particularly efficient at higher bitrates and that it can provide significant coding gains. Another study conducted by Nguyen and Marpe [4] has investigated HEVC's potential for the compression of photographic still images. Again, the new standard provides excellent performance and proves to be extremely efficient and useful.
All previous approaches to accelerate HEVC intra encoding focus on the acceleration of a single-threaded encoder and do not target parallel processing (see Section III). However, modern hardware is increasingly parallel, even in the realm of 0018-9316 c 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
size-and power-constrained mobile devices. Home PCs even have the capability to perform massively parallel computations if both their multi-core CPU and their Graphics Processing Unit (GPU) are utilized efficiently. Parallel intra encoding is, however, very difficult and challenging because of the strictly sequential nature of the employed algorithms. Every step of the intra Mode Decision (MD) heavily relies on various data from adjacent PUs. It is first necessary to wait for the neighboring blocks to be fully coded and reconstructed since their pixels are needed to fill the reference sample arrays, and, in turn, to form the predictions. Then, a Rough Mode Decision (RMD) stage must be able to derive the Most Probable Modes (MPMs) (see Section II) based on the completed mode decisions of nearby PUs. Afterwards, the entropy context present after the last block was coded is needed to perform Rate-Distortion Optimization (RDO) in order to select the best mode. This led to the motivation for this work, which presents a parallel HEVC intra prediction algorithm for heterogeneous CPU+GPU systems. The key contributions of the proposed framework in this paper are
• the Prediction using Original Samples (POS) for certain stages of the MD process which for the first time in an HEVC encoder is entirely executed on the GPU, as well as the analysis of the impact of POS to the RD results, • a Parallel RDO (PRDO) scheme that entirely runs on the GPU and uses cost functions that have been adapted to massively parallel evaluation and do not depend on Context-Adaptive Binary Arithmetic Coding (CABAC), as well as the analysis of its impact, • a flexible yet scalable integration of one or more GPUs and one or more CPUs with minimal synchronization overhead. To the best of the authors knowledge, this work is among the first to present a GPU intra algorithm for HEVC. Since this work concentrates on intra prediction, all comparisons were carried out in respect to intra-only encoders. It is demonstrated that GPU-support alone can reduce the encoding time by up to 64.52% and that the full framework running on a multi-core CPU and four GPUs reduces the encoding time by up to 94.82%. This corresponds to speedups of 2.82 times and 19.32 times, respectively. The acceleration is achieved with acceptable Rate-Distortion (RD) degradation, especially in high bitrate scenarios. Furthermore, methods are proposed that sacrifice some speed gain for better RD performance. The average bitrate increase is thus kept between 0.80-2.34% for high-quality use cases. Finally, the lossless encoder setting is accelerated by up to 53.37% with a relative rate increase of only 0.55% on average. The Prediction using POS and also the PRDO scheme lead to very good results for high quality bitrates and for lossless coding.
The remainder of this work is structured as follows. In Section II the flexible intra mode decision approach of HEVC is described. Section III explains the challenges of parallel intra encoding and analyzes related work. The performance impact of prediction based on original samples is inspected in Section IV by means of statistical investigation. In Sections V and VI the integration of the proposed algorithm into the HM-13.0 reference encoder and the communication between the CPU and the GPU is described, respectively. The GPU intra algorithm itself is outlined in Section VII and its performance in lossy and lossless coding configurations is investigated based on a large set of experimental result data in Section IX. Finally, conclusions are drawn in Section X.
II. BACKGROUND
The high efficiency of the HEVC standard is enabled by a flexible block partitioning approach in combination with a big number of intra modes and transform sizes. More specifically, HEVC subdivides images into Coding Tree Units (CTUs) with a maximum size of 64 × 64 luma samples. These are further subdivided into Coding Units (CUs) using a quadtree structure with a maximum depth of three. Every CU contains a Prediction Unit (PU) and a Transform Unit (TU), which allows handling prediction-and transform-operations separately. PUs are further split into Prediction Blocks (PBs), whereby only 2Nx2N and NxN splitting is allowed for intra blocks. Furthermore, NxN partitioning is restricted to CUs of the minimal possible size 8 × 8. TUs are subdivided into Transform Blocks (TBs) using a quadtree structure with possible transform sizes of 32 × 32, 16 × 16, 8 × 8, and 4 × 4. More information can be found in the respective encoder description [5] .
The intra prediction module of HEVC follows the same paradigm as the one used in H.264/AVC, but offers increased flexibility and more options. First, a reference sample array is constructed based on the reconstructed pixels of spatially adjacent blocks. The array is then used to create prediction samples whereby 35 modes are available for every block size. These include a planar and a DC mode to estimate smooth image content, as well as 33 angular modes. The standard also defines how MPMs can be derived based on neighboring blocks. This allows for more efficient mode signaling and can also be used to accelerate the MD process in the encoder. The chroma mode of a PU may be determined based on its respective luma mode and can even be derived directly from it [6] .
To achieve the best coding performance, it would be necessary to test every possible mode with RDO. This would, however, be too complex considering that many possible combinations are available. The HEVC Reference Test Model 13.0 (HM-13.0) [5] therefore reduces the number of candidate modes for RDO in a RMD process. Here, all 35 modes are tested and the N most promising ones are selected based on the Sum of Absolute Transformed Difference (SATD) between the original pixels and their respective prediction samples. N is set to {3, 3, 3, 8, 8} for the PB sizes {64×64, 32×32, 16×16, 8×8, 4×4}. Afterwards, the MPMs are added to the candidate list which is then processed using full RDO. TU splitting is limited to the maximum size in this stage. The full TU structure is only determined for one final mode in the concluding Residual Quadtree (RQT) stage [7] .
III. RELATED WORK
Even with the RMD optimization, the encoder complexity is still considerably high. Ye et al. [8] propose an alternative RMD method where a list of candidate modes is created based on the PU's gradient information. They report an average time reduction of 38% with reasonable coding losses. In [9] , Shi et al. use edge information based image preprocessing as basis for a bottom-up PU partitioning decision scheme and the subsequent prediction mode selection. Their approach saves 21% time on average while causing only negligible losses. Both works use the sobel operator to derive gradient information as basis for their algorithms. The proposal of da Silva et al. [10] exploits the correlation between the edge information of already coded PUs at lower tree depth levels to reduce the number of candidate modes for RDO. This leads to an average time reduction of 38% at comparatively small bitrate gains. Ismail et al. [11] state that the intra algorithm of HEVC spends 17% and 42% of the time in the RMD-and RDO-stage on average and therefore both stages need to be accelerated to achieve optimal performance. Their approach addresses this by analyzing PU correlations to build a candidate mode list and for early termination of the RDO stage. However, the achieved improvements are still around 28% time reduction. Shen et al. [12] have statistically analyzed the intra MD algorithm. They found that there are strong relations among RMD, MPM, the best modes of adjacent CUs, and the final MD. Based on these findings, they propose an algorithm that can save around 21% computational complexity. An alternative scheme built around statistical observations is proposed by Zhang and Ma [13] . It is based on Hadamard cost and early RDO skip and can save around 60% encoding time with acceptable RD degradation. Recently, Hu and Yang [14] have presented an outlier-based fast intra MD algorithm with entropy coding refinement, which can save 50% time with almost no performance degradation. Interestingly, they obtain the outlier information from the original frame, thus showing that reasonable prediction data can be derived from the pixel domain. However, their approach requires changes to the decoder because of the modified CABAC.
HEVC is still a relatively new standard and only a few studies on parallel intra prediction exist, even fewer exist on massively parallel GPU-based algorithms. As an inspiration for our scheme, it is reasonable to analyze related work dealing with parallel intra in H.264/AVC, since the underlying challenges are comparable. Cheung et al. [15] have identified that parallel intra MD is nontrivial for two reasons. Firstly, reconstructed data of neighboring blocks are required, and secondly, bitrate estimation relies on context-based entropy coding. They propose a GPU-encoder with greedy-based block processing order, whose bitrate is estimated based on analytical expressions. The encoder is reported to be 80 times faster than the H.264/AVC test model with small RD performance degradation. Another H.264/AVC GPU encoder is presented by Su et al. [16] . They exploit parallelism both between and within macroblocks. They report high degrees of achievable concurrency with the encoder being more than 20 times faster than the H.264/AVC test model. However, they also explain that the communication between the CPU and GPU heavily bottlenecks the system. The discussed approaches have recently been further developed by Jiang et al. [17] . Their system relies on finer-grained parallelism, unified predictor functions, lookup tables, and fast MD, and achieves even higher performance.
Some related work about HEVC are covered in the following. Zhao et al. [18] explain how CTU level parallelism can be achieved and present a respective prototype based on a simplified open source encoder. However, their approach is very similar to Wavefront Parallel Processing (WPP), which already is an integral part of the HEVC standard. A more sophisticated system is proposed by Yan et al. [19] , who utilize parallel intra prediction units, CTU level parallelism, as well as CTU size selection based on a Support Vector Machine (SVM) classifier. An eight times speedup compared to sequential execution is reported running on a 64-core processor. The VLSI architecture presented by Tung et al. [20] increases the degree of parallelism by utilizing original prediction samples for certain PBs which would otherwise stall the system. They claim to achieve a high theoretical throughput but unfortunately do not present any data about the impact on the RD performance. In this work, the entire algorithm is designed around the concept of prediction using original samples and a comprehensive analysis of the impact this has on the MD, as well as on the RD performance, is presented. Worth mentioning is furthermore the work of Sheng et al. [21] . They have identified that the RDO algorithms employed in HEVC are very hardware-unfriendly and propose two lowcomplexity alternatives based on quantized coefficients and transformed coefficients, respectively. They report theoretical ROO complexity reductions of 46% and 64%, compared to HEVC's original RDO procedure. Similar approaches for RD cost determination, though more specifically tailored towards GPUs, are presented in Section VII of this work.
It can be concluded that a number of useful studies dealing with highly-parallel H.264/AVC intra coding exist, which show the great potential of such systems. At the same time, much work remains to be done in terms of efficient parallel HEVC intra prediction. This work presents a novel and highly-parallel intra prediction algorithm for HEVC running on GPUs. Furthermore, the impact of MD based on original samples is investigated thoroughly since this is the basic underlying concept of the presented architecture.
IV. ANALYSIS OF PREDICTION USING ORIGINAL SAMPLES
One of the two limiting factors for parallel intra MD, as identified in Section III, is the reliance on reconstructed samples from neighboring blocks. These samples are needed to fill the reference sample array, which is then extrapolated using the 35 different modes to form the respective predictions. If, on the other hand, original samples are used instead, the order dependency between the PBs is completely resolved. It is therefore possible to create predictions for all blocks in a frame in parallel. This does, however, lead to slightly different results because the reconstructed samples are subject to lossy compression.
A. Experiment Setup
An experiment was conducted with a modified version of HM-13.0, which uses POS in the RMD and RDO stages. RQT is then applied to the final selected mode for every PB using reconstructed samples as usual. Every block is hence properly coded and reconstructed and no drift is introduced. Only the mode selection is made based on original samples, not the subsequent encoding and reconstruction. The mode decisions for every PB size are statistically compared with the ones made by the unmodified HM-13.0. Three HEVC test sequences Kimono (a), BQTerrace (b), and Cactus (c), all with a resolution of 1920 × 1080, were used as test material. Those were chosen because sequence (a) has very little texture, sequence (b) has rich texture and strong directional patterns, and sequence (c) has rich texture with only some directional patterns. The underlying assumption is that very detailed areas suffer heavily from POS, while the general directionality of the image remains intact, even at low bitrates. The first eight frames of each sequence were considered for the statistics. It was confirmed beforehand that incorporating more frames only has negligible impact on the obtained results. A large set of Quantization Parameters (QPs) {7,12,...,47} was considered in order to cover even the most extreme cases where the original and the reconstructed samples are mostly similar or vastly different, respectively. Fig. 1 illustrates the results.
B. Statistical Metrics
The following metrics were utilized for the analysis. All results are expressed in percentages, indicating how often the applied metric gave the same result for both encoders at a certain PB size. 
C. Statistical Evaluation
The results of RMDO, shown in the top row in Fig. 1 , clearly indicate that the decisions from POS are overall fairly accurate in the RMD stage. However, the accuracy largely depends on the PB size, where large blocks are predicted with much greater precision than smaller ones. It can also be observed that larger QPs lead to higher discrepancies between the two encoders as the graphs generally descend to the right. This is not surprising, since the original and reconstructed samples differ more with larger quantizers. Some sequencespecific conclusions can be drawn as well. Because sequence (a) has little texture and almost no directional patterns, POS more often makes different decisions, even for large PBs. It is furthermore very interesting that the RMDO of small blocks reaches an equilibrium level from medium to large QPs for sequences (a) and (c). Apparently, the prediction error for small PBs does no longer increase above QPs of around 22-27. Sequence (b), on the other hand, shows very different characteristics. As mentioned earlier, it has very strong directional patterns. Here, all block sizes are almost affected linearly with increased QPs. Overall, the RMDO results lead to the assumption that videos with rich texture suffer very little from POS at low QPs, while less detailed images work best at medium QP ranges.
The RDOO data, displayed in the middle row, reveals that the RDO stage is more strongly affected by POS than the RMD. The overlapping is below 80% for all block sizes and QPs among all sequences. However, the accuracy is considerably less dictated by the quantizers than before. The graphs of sequence (c), featuring very rich texture overall, are almost flat lines. The discrepancy between different PB sizes is also slightly less significant than in RMD, though larger blocks still work better than smaller ones. Judging the RDOO results so far, it would seem that the coding performance does indeed suffer from POS, but is at the same time not as QP-dependent as one would initially imagine.
How precise the angular modes are predicted is revealed by the RDOADD diagrams, shown in the bottom row. Clearly, the angular distance distribution is considerably more sequencedependent than the other two metrics. For (a), the same angle is mostly only selected for large PBs at medium QPs. However, the angular distance (AD) is still 1 or less in over 90% of the cases. One relatively unexpected tendency is that very small QPs, of 22 or below, are apparently not ideal for POS. This is most likely due to microdetails in the image, like, e.g., film grain, which strongly affect the frequency of the transformed coefficients in high bitrate scenarios. Therefore, even the slightest misprediction done by POS, has considerable impact on the RD cost and therefore also on the respective MD. In contrast, sequence (b) gives significantly different results than (a). The same angle is selected over 50% of the time even for the smallest blocks. PBs of 64 × 64 are even predicted with over 90% accuracy for almost all QPs. This data indicates that the general directionality of an image does indeed remain mostly intact with lossy coding and can hence very precisely be predicted using POS. Finally, the results of sequence (c) appear to be in-between the ones of (a) and (b) so all of the previously made observations apply to varying degrees.
Finally, the direct comparison of RDOO and RDOADD indirectly reveals that angular modes are way more accurately predicted than the modes DC and planar. Take sequence (b) as an example. The absolute best mode is only selected less than 80% of the time but the exact same angular mode is chosen in over 90% of cases. This can only mean that POS causes greater prediction differences for DC and planar. This might be caused by the fact that all samples in the reference array equally contribute in fine nuances to the DC and planar predictions. More specifically, DC averages all reference samples to a single value and planar averages the predictions from both the horizontal and vertical part of the reference sample array to produce a smooth pattern. In contrast, angular prediction modes are used to reproduce contrast rich directional patterns and edges. Due to the high contrasts involved, the slightly different reference samples do not matter as much here and still lead to fairly accurate predictions. Simply put, POS works well for distinct directional patterns with high contrast.
In concluding this analysis, it can be noted that POS has clear statistical impact on the intra MD. The analysis does not show, however, how POS affects the coding performance. This of course depends on how wrong the mispredictions actually are in terms of RD. For example, an angular MD that is only 1 mode off, might still make very little difference, since the angular distance between neighboring modes is very small. Additionally, the RD performance of HEVC is generally very high, so slight displacements can very well be acceptable in many use cases. A comprehensive discussion of how POS, as used in the presented GPU intra algorithm, affects the coding performance is presented in Section IX.
V. INTEGRATION OF GPU INTRA
The proposed algorithm pre-calculates intra prediction for all PBs of a video frame in a highly-parallel manner. It does so based on original pixels and by employing cost functions which do not rely on CABAC. Like the original encoder, it first performs RMD using SATD. Afterwards, PRDO is used to calculate the final mode costs. For every PB, GPU intra thus returns a RMD list and the PRDO costs of all contained modes. This result data is then used to speed up the MD process conducted by the CPU. Where usually RMD and RDO would be performed, the pre-calculated GPU result data is used instead, thus significantly accelerating the selection procedure. To identify the best mode, only the pre-calculated costs need to be compared. The final selected mode is then coded based on reconstructed pixels in the RQT stage as usual. This ensures that no drift is introduced in the coded signal. Fig. 2 compares the original MD algorithm with the CPU part of the proposed one. At that time RMD and RDO results have already been calculated on the GPU as described in Section VII. The diagram also shows different possible configurations, which are explained in the upcoming paragraphs.
The GPU intra algorithm can also be configured to skip the RMD stage. In this case, all 35 modes are tested using PRDO and the costs are calculated respectively. The implementation of PRDO on the GPU is fast and efficient enough so that this does not cause any additional waiting time for the CPU. Still, the complexity of the MD is slightly increased since more costs need to be compared. The described behavior is clearly indicated by the experimental result data and will be discussed more thoroughly in Section IX. It is furthermore possible to select more than one final mode based on the pre-calculated costs. The N selected modes will then be processed using RDO to determine the final candidate for RQT. This does not influence the behavior of the GPU code at all, it is merely an alternative way of incorporating the result data into the MD process. While this obviously puts a larger burden on the CPU, it also leads to a better RD performance. It is often the case that the actual best mode is among the two best modes identified by PRDO. Hence, if N = 2, it is very likely for the most optimal decision to be made, even though PRDO slightly misestimates the costs. The actual impact of this approach on the coding speed and performance will be analyzed in Section IX.
VI. GPU COMMAND SCHEDULING
Programs for heterogeneous CPU+GPU platforms all follow the same principal structure. More specifically, the Compute Unified Device Architecture (CUDA) was used for this work. Basically, a single-or multi-threaded program runs on the host, while GPU functions, called kernels, are executed asynchronously on one or more physically separate devices. The host consists of the CPU and the RAM and each device has a massively parallel GPU and its own Video RAM (VRAM) [22] . Data between a host and a device must be exchanged over the PCIe bus and synchronization is necessary to guarantee the proper completion of kernel invocations and data copy operations. Therefore, efficient load balancing algorithms are required to minimize scheduling overheads in order to achieve optimal performance [23] . The scheduling algorithm presented in this work meets these requirements and supports any number of GPUs in a scalable manner.
Firstly, every video frame is logically subdivided into lines of CTUs. These build work packages for the devices to process and are referred to as jobs from hereon after. Full lines are used for 1920×1080 sequences and half lines for 3840×2160 videos. The work package size is thus equal in both cases. A dedicated command thread is employed to schedule the jobs for the devices. The thread processes one job after another and issues all respectively associated commands to an available idle device. These include kernel invocations as well as asynchronous data copy instructions. For optimal GPU resource utilization, the commands which comprise each job are inserted into multiple device streams to enable the overlapping of kernel execution and data transfers. For every job, a considerable number of kernels with various thread numbers are scheduled (see Section VII for details), which leads too an adequately high GPU utilization and optimal kernel overlapping. After the kernels and the data copy instructions, an asynchronous cudaStreamCallback is inserted, which will be triggered by the device after it has finished its work. The command thread later receives the callback and marks the result data of the CTU line as available for the CPU to read. A hard device synchronization with the function cudaDeviceSynchronize is only performed after all jobs are finished and their data has been fully transferred to the host. More specifically, hard synchronization only occurs once for every GPU at the very end of each frame. Therefore, the CPU never has to wait for any kind of blocking CUDA calls, which guarantees minimal waiting time and thus optimal performance.
The main encoding thread runs on the host as usual. It checks if the aforementioned result data is marked as available for every CTU it encounters. The jobs are finished quickly enough by the devices so that the CPU usually never has to wait at this point. To avoid synchronization overhead at the very beginning of a frame, the first four CTUs of the topmost line are compressed without the use of GPU result data. Hence the fifth CTU is the first possible synchronization point. In a wavefront parallel encoder, the first two CTUs of every line are handled as just described. This way, single-and multi-threaded encoders can profit from GPU intra without any synchronization-induced idling.
Another considerable advantage of the utilized job system is that it helps to minimize the amount of device memory required to process the prediction algorithm. A device's VRAM is logically subdivided into global and constant memory and is of course a limited resource. Simultaneously, global memory is needed to store input and output data for kernels. For example, the kernel that derives the reference samples reads original image pixels from the global memory and stores its result samples there for subsequent kernels to use. The total required storage space can easily become very large since many PBs are processed in parallel. This is especially true for UHD video sequences. Fortunately, every device only processes one job at the time and therefore only needs enough space to store the data required to compute one CTU line. The presented algorithm thus only needs less than 37 MB of global memory to process 1920 × 1080 video sequences. This is an easily affordable amount of VRAM, even on older graphics accelerators.
VII. GPU INTRA ALGORITHM
In order to process a job as described in Section VI, a number of kernel invocations, synchronization barriers, data copy instructions, and a terminating device callback is scheduled to a GPU by the command thread. More specifically, five kernels named initAdiPattern, createPrediction, fillRMDModeLists, createResidual_fTQ_iQT, and calculateSplitRDCost form the entire algorithm. An overview of the respective command and data flow is illustrated in Fig. 3 , while a detailed description of the individual kernels, their configuration, and the in-between transitions is provided in the upcoming subsections.
The CUDA [22] is the employed GPU platform, which has the following implications. A multi-threaded GPU program is organized in blocks of threads, which can both be one-two-, or three-dimensional. The blocks, as well as the contained threads, are identifiable by unique numerical indices, which are typically mapped to parts of the data to be processed by every thread. The thread configuration is individually applied for every single kernel. A physical CUDA GPU consists of a number of Streaming Multiprocessors (SMs), each with a set of computational cores, a set of 32-bit registers, and 48KB of parallel data cache called shared memory. The thread blocks are assigned to the available SMs for execution and are scheduled by them in units of 32 threads called warps. Thread blocks which are N × warpsize in dimension can be processed most efficiently by the hardware. The performance of a kernel is predominantly dictated by how efficiently the available GPU resources are utilized. More information can be found in the CUDA programming guidelines [22] .
A. Reference Sample Generation From Original Pixels
To fill the reference sample arrays, the kernel initAdiPattern is invoked for every PB size. It is configured to use one uni-dimensional thread block for every PB with ceil 32 (5 × w) threads. The function ceil 32 rounds up to the next multiple of 32 and w is the width of the PB. Every thread therefore computes one pixel of the reference sample array. First, the kernel determines the availability of the neighboring PBs and stores the information in shared memory. The availability is dependent on the position of the current PB in the picture and on its Zig-Zag scan Order (Z-Order) inside of the CTU. To keep divergent branching of the code to a minimum, lookup tables with Z-Order information are maintained in the constant memory. The tables are extended by two additional rows and columns to simplify the handling of border cases. For example, the table for 32 × 32 PBs looks like this:
Here, the values {0,1,2,3} correspond to the respective Z-Orders of the four 32 × 32 PBs contained in the CTU and the borders are filled with {-1,4}. The availability of a neighboring PB can then be easily tested by checking whether the Z-Order of the respective adjacent PB index is smaller than the one of the current PB. It should be noted that actually all pixels are always available because original samples are used. However, in order to form more accurate predictions, the algorithm follows the same restrictions and obeys the same rules as the reference implementation based on reconstructed pixels. After the availabilities are known, the reference sample array is filled, whereby thread indices are used instead of sequential loops over the data buffers and synchronization barriers are inserted between the processing steps if necessary. Some parts have been reformulated to be more parallel-friendly. The reference sample substitution, for example, is implemented with three nested loops in HM-13.0. On the GPU, three individual loops are used instead: two to identify the latest available sample in the horizontal and vertical direction, and one to fill the unavailable samples with the respective substitution values in a parallel fashion. In all steps, temporary data is stored in the shared memory and the final sample array is written to the global memory to be available for subsequent kernels. Finally, the filtered version of the sample array is generated using the same principles as just mentioned and is also stored in the global memory. To sum up, the kernel outputs the filtered and unfiltered version of the reference sample array for every PB. It should be stressed at this point, that the output data is completely identical to what the reference encoder would produce when operating on original pixels, like the modified version of HM-13.0 explained in Section IV.
B. Parallel Optimized Intra Sample Prediction
The kernel createPrediction is responsible for the generation of the predictions from the reference sample arrays. It is invoked once for every PB size, internally loops over all 35 modes and creates the respective predictions. Either the filtered or unfiltered sample array is used as input and post-processing is applied to the prediction results if necessary [5] . Again, the output data is equivalent to what HM-13.0 with POS would produce. The kernel uses the shared memory to store the original pixels of the PB, for the prediction generation and for temporary data buffers. Of course, the final prediction samples for every mode are transferred to the global memory at the end of every loop iteration, so that the data remain persistent for subsequent kernels to use. Data structures like the prediction angles or the filter coefficients are located in the constant memory.
A technique best described as Thread Index Padding (TIP) is used to guarantee optimal GPU resource utilization. The idea is that every thread index is used to address one output pixel, while at the same time the total number of threads remains flexible. The thread block size is configured to be {w, h/2 p }, where w and h are the width and height of the PB and p is a positive integer number. Based on experimentation, the optimal values for p were found to be {3, 3, 2, 1, 0} for the PB sizes {64 × 64, 32 × 32, 16 × 16, 8 × 8, 4 × 4}. Inside of the kernel code, the thread indices are padded and thus calculated as outlined by the following pseudo code:
for (pady=0; pady<(PBHeight/blockDim.y); ++pady) { y = threadIdx.y * (PBHeight/blockDim.y) + pady; x = threadIdx.x; // operate on data... }
The threads essentially operate on (PBHeight/blockDim.y) chunks of data. Within these chunks the memory addresses are accessed consecutively according to the threads' x indices. With TIP, the thread numbers can therefore easily be limited to configure the system for efficient GPU resource utilization while maximizing coalescing through optimal access patterns. It might seem counter-intuitive to artificially reduce the thread numbers at first; however, using fewer threads per block means more blocks can reside on each SM, which can therefore lead to significantly higher degrees of concurrency and thus much faster processing speed. This is mostly because using fewer threads implicitly reduces the number of registers required by each block, which leads to the described performance benefits.
C. Parallel-Friendly Rough Mode Decision
The RMD is spread between two kernels. The previously explained function createPrediction calculates the costs and subsequently the kernel fillRMDModeLists fills the RMD mode lists and adds the MPMs. The latter is only performed if RMD is activated as explained in Section V.
In [5] , the cost function used for RMD is defined as
where SATD is the absolute difference of the Hadamard transformed samples, λ pred is the Lagrangian constant value, and B pred is the bit cost of the respective MD. The kernel createPrediction computes the cost at the end of every iteration of its mode loop. One thread is used to calculate the Hadamard transformation for every sub-block of the PB and parallel reduction is employed to sum up the final SATD. Unutilized threads are deactivated in this stage. Since both the original samples and the prediction reside in the shared memory, the SATD can be processed very efficiently. Then, thread {0, 0} computes the final cost using (1) and writes it to the global memory. However, B pred cannot be accurately found without CABAC, so the following simple algorithmic approximation is used instead:
function xModeBitsIntra( uiMode,uiPU,uiPartOffset,uiDepth,uiInitTrDepth) { modeBits = 0; if (uiMode==1 ) modeBits += 1; if (uiMode>1 ) modeBits += 5; if (uiDepth==3) modeBits += 1; if (uiPU>0 || uiPartOffset>0 || uiDepth>0 || uiInitTrDepth>0) modeBits += 1; return modeBits; } The variable names in the above pseudo code correspond to the ones used in the reference implementation. In principle, the intra mode index, the CTU depth, and the PB index within the PU all correspond to the final bit cost to varying degrees depending on their value combination. The above algorithm was derived from statistical data based on experimentation. It has the tendency to rather underestimate the actual cost than to overestimate it. Therefore, the SATD has a greater impact on J pred,SATD than in HM-13.0.
The actual selection of the most promising modes is conducted in the kernel fillRMDModeLists, which is directly called after createPrediction. It is configured to use one-dimensional thread blocks of 16 threads each, with one block per PB. The kernel uses thread {0, 0} to iterate over the mode costs and puts N of them into an ordered array in the shared memory. N is set to {2, 2, 2, 5, 5} for the PB sizes {64 × 64, ..., 4 × 4}, respectively. Afterwards, N threads are used to copy the list into global memory. Finally, thread {0, 0} adds the MPMs, if the individual modes are not already contained. Because the real MPMs can not be derived in a highly-parallel environment, modes {0, 1, 10, 26} are always used instead. These correspond to planar, DC, horizontal and vertical, which are in general very likely to produce comparatively good predictions.
As the concluding step of RMD, the resulting mode list is copied back to the host. This is performed asynchronously in an overlapping device stream and the data will be available for the CPU to use after the device callback (explained in Section VI) was received by the command thread. The cost J pred,SATD of the mode decisions is not copied to the host since it is no longer relevant at this point. It was already used to sort the RMD list and all the contained candidates will later be evaluated using PRDO anyway.
D. Transform and Quantization Architecture
Residual creation, forward transform, quantization, dequantization, inverse transform, and reconstruction are all handled in the kernel createResidual_fTQ_iQT. All of these operations involve the same number of input and output samples and can thus be efficiently processed using identical thread numbers. However, the kernel operates on TBs rather than PBs and is therefore only invoked for block sizes of {32 × 32, ..., 4 × 4}. The TIP principle is used again, with p = {2, 2, 1, 0} for the various TB sizes. Firstly, the original samples of the TB are copied into the shared memory. Then, the kernels loops over the modes in the RMD+MPM list. For 32 × 32 TBs, it is necessary to traverse two lists: one containing the 32 × 32 RMD results and one containing the respective 64 × 64 data. This needs to be done because 32 × 32 PBs can belong to a 32 × 32 or can be part of a 64 × 64 PB. Redundant modes are skipped in this process since they do not need to be calculated twice. If RMD is deactivated, all 35 modes are iterated.
Within the mode loop, the prediction samples are firstly copied from the global into the shared memory. All consecutive steps operate exclusively on the shared memory to ensure minimal access latency. The residual samples are created next, and are afterwards DCT-transformed. A regular matrix multiplication approach is used instead of the partial butterfly method implemented in HM-13.0. While the result is exactly the same, the normal matrix multiplication can be computed more efficiently on the GPU because it involves less conditional branching and more straightforward access patterns. The transform matrices are kept in the constant memory for easy access and reuse.
The quantization is performed as explained in [24] while accessing the quantizer scales from the constant memory. However, in contrast to HM-13.0, RDO Quantization (RDOQ) is not applied because this is not possible without CABAC. The Absolute Sum of the Quantized Transform Coefficients (ASQTC) is calculated using parallel reduction, before the signs of the output coefficients are determined based on the signs of the respective input coefficients. If the ASQTC is 0, the signal can directly be reconstructed from the prediction samples. In all other cases, the subsequent dequantization, inverse transformation and reconstruction are performed using the same methodology as before. The reconstructed residual samples, stored in the shared memory, are considered to be the final result data at this point, since they are needed to calculate the distortion, as explained in the upcoming subsection.
A special situation is the case of 4 × 4 TBs. An alternate Discrete Sine Transform (DST) is used here and transform skip is evaluated as well. In the kernel, the DST is treated just as the DCT and every mode loop iteration is repeated once to also test the transform skip case. Consequently, two different versions of the reconstructed residual samples are obtained for every 4 × 4 TB.
E. Parallel-Friendly Rate Distortion Cost Function
PRDO is used to determine the individual mode costs as follows. The costs are calculated in the kernel createResidual_fTQ_iQT using the data located in the shared memory as explained above. In the reference encoder [5] , the RD cost J mode is computed as
where SSE is the sum of square error between the original and the reconstructed pixels, λ mode is the Lagrangian constant value, and B mode is the bit cost of the block when it is encoded with CABAC. The SSE is simply computed in the kernel after the reconstruction step with once again using parallel reduction to determine the final sum. However, B mode must be approximated since it requires CABAC. Therefore, the following modified version of the RD cost function is proposed in order to enable PRDO:
Here, QTC ij refers to a quantized transform coefficient i, j with no RDOQ applied and w, h are the TB's width and height. Conveniently, the absolute sum of the coefficients QTC ij has already been calculated previously since it is the same as ASQTC. The value of ASQTC can therefore be reused for the cost calculation. After the distortion has been determined using SSE, the single thread with index {0, 0} computes the RD cost J mode as just described and writes it to the global memory. For 4 × 4 TBs, two RD costs are determined, one for each of the two different residuals obtained via DST and transform skip, respectively. As a consequence, the CPU later has to consider both costs in the mode decision process. This corresponds to the MD procedure used in HM-13.0, which also covers both transform options for 4 × 4 TBs. The costs for the 64 × 64 PBs need to be computed separately because they are formed from four 32 × 32 TBs. Two additional processing steps are therefore implemented. The kernel createResidual_fTQ_iQT writes the values of ASQTC and SSE into the global memory for all 32 × 32 TBs. Afterwards, another very light-weight kernel with only one thread block, named calculateSplitRDCost, is launched. It uses one thread for every 64×64 PBs and no shared memory. Every thread reads the four ASQTC and SSE values of its respective PB, sums them up, and then computes J mode accordingly. Using such a simple kernel has the advantage of it being able to run in parallel to the subsequent invocations of createResidual_fTQ_iQT. Because the kernel requires only a minimal amount of GPU resources, its processing practically adds no additional computation time.
Finally, the RD costs of every mode are copied back to the host in an asynchronous transfer operation. The device callback mentioned in Section VI, is placed right after the copy command. It is hence triggered immediately after the transfer has finished and the data is available for the CPU to read. This final step concludes the GPU intra algorithm.
F. Lossless Coding Configuration
If lossless coding is enabled, only the kernel createResidual_fTQ_iQT is affected. The same RD cost metrics are used in principle, but transform and quantization steps are of course skipped. The ASQTC therefore becomes the sum of absolute residual samples and the SSE-based distortion is obviously zero since no information is lost due to quantization. The rest of the algorithm works in the same way as described earlier.
G. Algorithm Summary
To summarize, the kernel initAdiPattern determines the reference samples just like an encoder embracing POS would. The predictions for all 35 modes are formed in the kernel createPrediction and it performs RMD in cooperation with fillRMDModeLits. The function createResidual_fTQ_iQT is the most complex part of the algorithm because it performs the entire transform and quantization process. Together with calculateSplitRDCost it furthermore calculates the RD costs using PRDO. All computations are conducted in a highly-concurrent manner and parallel-friendly cost functions are used accordingly. Fig. 3 gives an overview of the complete GPU intra scheme.
The algorithm puts heavy emphasis on shared memory utilization. While using too much shared memory can potentially limit the degree of parallelism, it was verified with the CUDA Profiler [25] that the employed configuration lead to the optimal performance in all cases. Using shared memory is beneficial here, because the memory bandwidth has a larger impact compared to the instruction throughput since mostly integer arithmetic is used. Furthermore, the allocated shared memory is reused in consecutive processing steps and is thus limited to a reasonable amount.
VIII. STATISTICAL ANALYSIS OF GPU INTRA
In order to gain a general understanding of the behavior and characteristics of the presented algorithm, the experiment explained in Section IV was repeated. The same test conditions and metrics were used to compare the mode decisions of the GPU intra encoder with the ones made by HM-13.0. Fig. 4 illustrates the results. In principle, the GPU intra algorithm introduces coding losses because it employs POS and approximated cost functions. It is therefore particularly interesting to compare Fig. 1 , showing the impact of POS alone, with Fig. 4 here. This makes it possible to isolate the influence of the modified cost functions and hence to assess their accuracy and suitability.
The results of RMDO are shown in the top row. RMDO 3 is not evaluated here since GPU intra only selects two modes for large PBs in the RMD stage before adding the MPMs. It can be observed that the mode decisions are almost unaffected by the CABAC-free RMD performed by the GPU. The graphs are extremely similar overall, they just decline slightly more along the QP spectrum. Large PBs are once again less affected compared with smaller ones. Judging from this data, it can be assumed that the RMD stage is highly accurate and should hence cause very little RD degradation.
The RDOO result data, displayed in the middle row, shows much larger differences. Overall, the areas under the graphs are notably smaller, indicating a less accurate prediction. Like before, small PBs are more strongly affected, which is equally apparent for all three video sequences. This phenomenon can be explained when looking at equations (2) and (3), which are used to calculate the RD cost by HM-13.0 and by the GPU with PRDO. The RD-optimized cost J mode is constructed out of two parts, the distortion SSE and the weighted bit cost B mode of the MD in question. For small PBs, the latter is often the predominant factor of the equation and the resulting absolute cost value is relatively small, since not many samples are contributing to the distortion. Hence, even a slight misestimation of the bit cost leads to completely different mode decisions. In addition, the GPU cannot perform RDOQ due to the lack of CABAC. As a consequence, very insignificant coefficients might contribute considerably to the overall mode cost, especially for small PBs in combination with large QPs. Finally, the RDOO results of sequence (a) show very interesting characteristics at very small QPs of 17 and below, where even 64 × 64 PBs are predicted with 20% less accuracy. The widespread blurry areas of the video are coded using predominantly large PBs. It can furthermore be assumed that multiple different modes lead to very similar degrees of distortion in these low texture areas. Therefore, the bit cost also greatly affects the MD here. However, this very likely has little impact on the video quality because of the similar distortion values.
How much the angular mode decisions diverge is indicated by the RDOADD shown in the bottom row. For the sequences (b) and (c), only very little differences compared with the POS encoder can be observed for PB sizes 64 × 64, 32 × 32, and 16 × 16, where the curves have the same form but the areas under them are slightly smaller. It was discovered in Section IV that angular modes can be predicted very accurately using POS, while the non-angular modes planar and DC suffer from larger prediction errors. The RDOADD results therefore indicate that the ASQTC used as an analytic approximation for the bit cost is reasonably accurate if the prediction itself is highly precise. Once again, smaller PBs suffer more for the same reasons explained above. Also as before, sequence (a) is more strongly affected than the other two videos. This further underlines the earlier assumption that multiple different modes can be used to form fairly accurate predictions for blurry areas. When the distortion caused by these modes is comparatively small, the misestimated bit cost once again becomes the dominant part of the cost equation and thus dictates the final decision. However, it can again be assumed that this only has a small impact on the RD performance because of the tiny relative errors involved.
The results of the statistical analysis of the mode decisions made by GPU intra can be summarized as follows. Very little difference compared with a POS encoder can be observed in the RMD stage. Simultaneously, the PRDO stage is strongly dependent on the characteristics of the encoded video. Very different mode decisions are being made for the low-texture sequence (a), while the decision differences are much smaller for sequence (b), which features clear and strong directional patterns. The results obtained for the feature-rich sequence (c) are located in between the other two in terms of their accuracy. It is furthermore interesting to note that the ideal operating zone of GPU intra seems to be in the QP-range 22-37, which is exactly the spectrum suggested in the common test conditions [26] . This indicates that GPU intra can be a valuable fast-encoding option in many popular use cases.
IX. EXPERIMENTAL EVALUATION
A comprehensive set of experiments was conducted to evaluate the practical utility of GPU intra for real-life applications. In this respect, the main considerations are the achieved encoding time savings and the RD performance. The upcoming subsections provide a detailed description of the experiments and an in-depth discussion of their results.
A. Development Environment
The following hardware platform was used to run the experiments: 64-bit Operating System (OS); two CPUs clocked at 2.7 GHz, each with 12 computational cores and HyperThreading (HT) technology; 64 GB RAM; four identical GeForce GTX TITAN Black GPUs running at a base clock rate of 889 MHz, each with 15 SMs and 6 GB VRAM. 
B. Test Sequences
The presented algorithm is designed for high-quality HD and UHD use cases. At the same time, the utilized POS principle performs very differently depending on the characteristics of the encoded videos, as was identified in Sections IV and VIII. Therefore, three HD and three UHD videos with high and low texture detail, and with weak and strong directionality have been selected in order to identify the strengths and weaknesses of GPU intra. All sequences and their distinct properties are listed in Table I . Sequences (a,b,c) are test material used by the JCT-VC and (d,e,f) are from the 4K video dataset provided by Song et al. [27] .
C. Encoder Configurations
The HEVC reference test model HM-13.0 [5] builds the foundation for the proposed GPU intra framework and serves as a point of reference for speed gains and RD performance changes. According to the common test conditions [26] , QPs {22, 27, 32, 37} were used in combination with the configuration profile intra main. In every single experiment, the first 50 frames of each test sequences have been encoded.
GPU intra can be configured in different ways as explained in Sections V and VII. The four different settings listed in Table II have been applied. The configurations (A) to (D) are sorted according to the expected encoding speed, from fastest to slowest. With RMD = off, the algorithm tests all 35 modes on the GPU using PRDO. The number of modes to be tested by the CPU using full RDO Mode Decision (RDOMD) is set to 0 and 2, respectively. The RQT stage for the one final mode selected for every PB is always conducted on the CPU to ensure proper encoding and reconstruction with no drift. The whole idea behind these different configurations is to sacrifice speed gains for better RD performance if needed. Settings (B) and (D) are included to test if GPU intra is able to make better decisions if all 35 modes are tested using PRDO in a brute force fashion. Furthermore, the statistical evaluations in Sections IV and VIII have shown that the proposed encoder sometimes makes different mode decisions compared with HM-13.0. At the same time, these decisions are often very close to the optimal ones, as indicated by the RMDO, RDOO, and RDOADD results. Hence, if the actual best mode is among the two final candidates identified by GPU intra, configurations (C) and (D) would still be able to select the actual best mode using CPU-based RDO.
It is important to consider what time savings are theoretically possible with the presented approach in order to put the achieved results into perspective. The encoding modules RMD, MPM, and RDO are completely or partially offloaded to the GPU as described above. Therefore, the maximum single-threaded speed gain corresponds to the accumulated processing times the HM-13.0 CPU encoder spends in these modules. Figure 5 provides a respective overview for all utilized test sequences. It can be seen that RMD, MPM, and RDO combined make up around 65% of the processing time in all cases. This is therefore the maximum theoretical speed gain that can be expected with GPU intra. However, since the MD is still done by the CPU (see Section V), the actually achievable speed gain is minimally lower. The result data evaluation in Section IX-D reveals that this optimal performance level is actually reached by the proposed GPU intra implementation. This in turn proves that the algorithm is not limited by CPU/GPU communication overhead in any way.
A real product implementation of HEVC would most likely utilize multiple CPU cores to achieve higher encoding speeds. It is therefore mandatory to test GPU intra not only in a single-, but also in a multi-threaded environment. Various parallelization schemes are defined by the standard including slices, tiles, and WPP [28] . WPP is known to cause very little RD degradation because it only directly affects CABAC, while all spatial and temporal prediction dependencies remain intact. Moreover, WPP automatically scales with higher video resolutions since it computes individual CTU lines concurrently. For these highlighted reasons, it was decided to use WPP for the CPU-parallel experiments. The method proposed in [29] was used to enable WPP in HM-13.0, with the additional required configuration parameter WaveFrontSynchro = true. For all single-threaded experiments, only one GPU was activated, while the multi-threaded tests utilized all four.
D. Complexity Reduction and Encoding Time Saves
Two metrics were considered in order to evaluate the time efficiency of GPU intra. The complexity reduction of the intra MD on the CPU and the overall encoding time savings. The Intra Complexity IC is defined as the accumulated time spent in the function TEncSearch::estIntraPredQT. It was measured in HM-13.0 (IC HM13 ) and in the single-threaded GPU encoder (IC GPU ). Based on that, the average Intra Complexity Reduction ICR was calculated as
with i referring to one experiment with a particular QP i ∈ {22, 27, 32, 37}. In a similar fashion, the average Encoding Time Reduction ETR was calculated as
where ET HM13 and ET GPU are the overall encoding times of HM-13.0 and the proposed GPU intra encoder, respectively. For the WPP experiments, ETR WPP was calculated additionally using the same formula, but with the WPP-enabled HM-13.0 [29] as a point of reference. Therefore, ETR WPP shows how much the multi-threaded reference encoder is accelerated by GPU intra. Tables III and IV list the complete set of ICR and ETR result data for the lossy coding configurations in their respective rightmost columns. As shown in Table III , the time reductions are very significant among all experiments. The ICR reaches up to 78.03% and the highest measured ETR is 64.52%. In other words, the proposed GPU intra encoder is up to 2.82 times faster than HM-13.0. Even in the worst case, ICR and ETR are still as high as 57.28% and 47.07%. The highest improvements are achieved with configuration (A), while (D) brings the lowest speed gains. It should be noted here that this is not caused by the higher GPU workload in configurations (B) and (C), nor is it caused by additional synchronization overhead between TABLE III  PERFORMANCE OF GPU INTRA   TABLE IV  PERFORMANCE OF GPU INTRA WITH WPP the host and the device. The highly efficient GPU command scheduling (see Section VI) in combination with the thoroughly optimized GPU intra kernels (see Section VII) make this possible. There was no synchronization-induced CPU idle time observed over the course of the entire experiments. The differences in speed solely originate from the varying degrees of time the encoder spends in the function TEncSearch::estIntraPredQT, as explained in Section V. With setting (A) and (C), only the pre-calculated costs of 4 to 9 modes need to be compared to identify the one mode to be processed using RQT. In contrast, the costs of all 35 modes are considered in configurations (B) and (D), which of course takes a little bit more time. Settings (C) and (D) are notably slower because they identify the best two modes using the pre-calculated costs and then make the final decision between them using RDOMD.
In contrast to the used configuration, the characteristics of the encoded sequences have very little effect on the speedups. Within each experiment group, the maximum difference between the respective highest and lowest gains ranges at around 1%. The texture-rich sequences (b,c,e,f) seem to benefit slightly more from GPU intra in terms of speed, but the differences are overall negligible. Table IV shows the result data of GPU intra in combination with WPP. In principle, the same observations as above also apply here; however, there are certain peculiarities. The time savings relative to the WPP-enabled HM-13.0 are very high, but comparatively smaller with regard to the single-threaded encoder. The reasons for this are two-fold. Firstly, the leftmost two CTUs of every line are encoded without GPU support, as explained in Section VI. This means a lower percentage of the entire picture area profits from the GPU acceleration. Secondly, the speedups achieved for every single WPP thread do not necessarily translate to an equal degree of overall acceleration. With WPP, it is mandatory to maintain an offset of two CTUs between consecutive lines in order to respect the spatial prediction dependencies. This creates an unavoidable overhead between the individual wavefront threads, especially if the CTUs have varying completion times. Consequently, the described phenomenon is slightly more apparent for the 4K sequences (d,e,f), where more threads need to be synchronized. This is a downside of WPP in general and not a shortcoming of GPU intra. Like with the single-threaded experiments, no additional communication delay between the host and the four devices was observed.
The ETR compared with the normal sequential HM-13.0 is very significant among all configurations. Up to 94.82% time is saved, which corresponds to a 19.32 times encoder speedup. Here, the improvements are notably higher for the 4K sequences. This is not surprising, since more WPP threads can be used due to the higher resolution. The speed differences between sequences (a,b,c) and (d,e,f) correspond to the diminishing returns, which are expected with increasing WPP thread numbers [29] . Here, the maximum number of concurrently active threads is 15 and 30 respectively. This shows that the proposed GPU intra architecture is capable of significantly increasing the speed of even highly multi-threaded CPU encoders.
E. Rate Distortion Performance
The RD performance is evaluated based on Peak Signal-toNoise Ratio (PSNR) and average increase in bitrate. Firstly, the PSNR for every single experiment was calculated as suggested in [28] using the following equation:
The result is the weighted average PSNR over the luma and chroma components given by PSNR Y , PSNR U , and PSNR V . The changes in bitrate were evaluated based on the BD-PSNR metric [30] , which computes the average difference in rate or PSNR between two RD curves. However, since the proposed algorithm is targeted towards high-quality HD and UHD applications, the improved version of BD-PSNR [31] is favored. The original BD-metric [30] was designed when quantizer steps of three were commonly used. The improved version [31] allows comparing the distinct differences in low and high bitrate ranges, which makes it better suited for experiments using larger delta QPs, like the one presented here. With the improved BD-metric, three values are obtained for every experiment with QP ∈ {22, 27, 32, 37}, namely %Bit, %Bit low, and %Bit high. The respective results are listed in Tables III and IV , where the two single-and multithreaded encoders are compared, respectively. In other words, the unmodified HM-13.0 serves as a benchmark for the singlethreaded GPU intra algorithm and the WPP-enabled HM-13.0 for GPU intra + WPP. Furthermore, RD plots for the GPU configurations (A) and (C) are provided by Fig. 6 , which show negligible discrepancies from the RD plots of the HM-13.0 test model. As shown in Table III , the following two conclusions can be made. Firstly, the coding losses notably differ for the various test sequences, and secondly, low bitrates are much more affected than higher ones. Both phenomena are related to the implications from the statistical analysis of Section VIII. Sequences (a) and (d) suffer the least coding losses, because their blurry areas are mostly coded using large PBs, which are predicted quite accurately by GPU intra. The other, more texture-rich videos, are overall more affected. This is due to the fact that their encoding requires many small PBs with higher statistical prediction error as shown above. Furthermore, it can be observed that on average, sequence (b) suffers from a smaller coding penalty than sequences (c,e,f). This indicates that clear directional patterns in sequence (b) can indeed be very accurately reproduced by the proposed GPU intra scheme. The results for sequence (e) show that despite its apparent directionality, the sequence is dominated by the high degree of detail of the individual images, e.g., the fine brickwork. Thus, the bitrate increases are similar to the ones measured in sequences (c) and (e).
Among all sequences, there is a very notable difference between %Bit low and %Bit high, which reaches over 3% in some cases. This is caused by POS, which understandably produces more accurate results in high bitrate ranges. This is due to the fact that the original and the reconstructed samples are very similar in such ranges. While the overall coding losses, given by %Bit, are still acceptable in many of the test cases, the strength of the GPU intra scheme and architecture clearly lies in the coding of high-quality video with QPs of 27 or lower. It should be mentioned here, that %Bit high operates between QPs 27 and 22 [31] . In this range, the increase in bitrate is as low as 0.8% for some experiments, which is negligible considering the significant speed gains.
The BD-BR performance results achieved with the WPP-based intra GPU encoder are shown in Table IV .
Overall, the characteristics of the data are very similar; however, the coding losses are slightly higher on average. This can be attributed to the less efficient entropy coding associated with WPP. The reduction in efficiency is caused by the CABAC state propagation between subsequent CTU rows [28] . This in turn impacts on the GPU intra coding performance since it directly affects the cost calculation given by (2) . If the entropy coding is less effective, B mode contributes more significantly to the mode cost J mode . Thus, the alternative bit cost estimation employed by GPU intra potentially affects the mode decision more with WPP than it does in a traditional single-threaded encoder with full CABAC support. However, the losses are still acceptable especially for high bitrates, considering that the proposed WPP-activated GPU intra encoder is 86.59-94.82% faster than HM-13.0.
The closer analysis of the different GPU intra configurations (A-D) reveals a few more interesting aspects. The reasons why (A) is the fastest configuration and (D) the slowest have already been discussed in Subsection IX-D. However, the respective worst and best BD-BR performance is not achieved with (A) and (D) configurations but with (B) and (C) configurations instead. This is despite the fact that (B) and (D) configurations skip RMD and test all 35 modes with PRDO. Section VIII revealed that the RMD of GPU intra is highly accurate, while the results of the RDO stages diverge more from the reference encoder. This divergence is caused because the quality of PRDO is strongly affected by the prediction errors accumulated due to POS and the lack of RDOQ. In addition, the errors tend to accumulate further with each potentially loss-inducing processing step: reference sample derivation, filtering, prediction, post-processing, quantization, dequantization, and finally the cost determination. These circumstances seem to negatively affect the overall coding performance in the tested scenarios, even at high bitrates. It is thus not necessarily helpful to test all 35 modes on the GPU with PRDO for lossy coding environments. Configuration (C), which uses GPU intra with RMD in combination with two final candidate modes being processed by the CPU with RDOMD, is shown to provide excellent coding performance among all experiments. In addition, despite being the fastest configuration, setting (A) still vastly outperforms (B) and (D) in terms of coding efficiency. Based on these observations, it is therefore recommended to use the high-speed setting (A) for time-critical applications and the high-quality setting (C) for optimal coding performance. With these configurations, the penalties in the range %Bit high are sufficiently small, with only 2.11-4.16% and 0.80-2.34%, respectively.
For the purpose of further evaluation, the RD plots of the two recommended configurations (A) and (C) are illustrated in Fig. 6 . In correspondence with the BD-BR data, it can clearly be seen that the distances between the curves of all experiments get smaller with higher bitrates. This is particularly visible for sequences (b,c,e). For almost all experiments, the upper third of the plots of configuration (C) and HM-13.0 almost exactly overlap, thus showing the excellent RD performance of the presented prototype. The only exception is sequence (f), where GPU intra performs best with QP 27 instead of 22. Sequence (f) has almost no directional patterns but many high-frequency areas, where even slightly wrong estimations of the RD cost considerably affect the RD performance especially at high bitrates. It is furthermore interesting to observe that the RD curves corresponding to configurations (A) and (C) also have the tendency to converge with decreasing quantizers. This means that even the high-speed setting (A) is almost equally viable in terms of quality as the slightly slower configuration (C) for QPs ≤ 22.
Examining the graphs very closely reveals that the RD curves of configurations (A) and (C) compared with HM-13.0 are shifted more to the right rather than downwards. This means that the bitrate is slightly higher, while the PSNR is only minimally affected. It was explained earlier that POS performs very well in combination with low QPs. At the same time, GPU intra is able to accurately calculate the distortion based on SSE, while it determines the bit cost B mode differently. On average, B mode is underestimated because only the coefficients are taken into account and the header information as well as the coded bit flag are being ignored. The GPU intra encoder therefore has the tendency to favor mode decisions causing low distortion over ones resulting in low bit costs. Hence, the bitrate is increased while the quality remains mostly unchanged. This phenomenon is even further emphasized by the chroma channels. GPU intra solely operates on luma samples, but the encoder determines the chroma mode of a PB based on its respective luma mode [6] . Consequently, if the luma mode is selected favoring optimal distortion over low bit cost, chroma modes with lower distortion may be selected as well. The raw result data also directly reflect this, because GPU intra achieved a slightly higher PSNR U and PSNR V than HM-13.0, approximately 50% of the time. Interestingly enough, this was mostly apparent for the texturerich sequences (b,c,e,f). While the measured improvements are small, often below 0.050 dB, it is still worth observing that the proposed scheme can have a positive impact on the quality of the U and V channels.
In conclusion, the RD performance of GPU intra is superior for the targeted use case, which is the high-speed encoding of HD and UHD videos with high-quality settings.
F. Lossless Coding Performance
The previously described single-threaded experiments have additionally also been conducted with a lossless compression setting. To enable lossless coding, the following parameters have been set: TransquantBypassEnableFlag = 1, CUTransquantBypassFlagForce = 1, LoopFilterDisable = 1, SAO = 0. The same metrics for ICR and ETR were used and the relative difference in bitrate was computed as
where B GPU and B HM13 correspond to the bitrate measured with the proposed and the HM-13.0 reference encoder respectively. The full set of results is listed in Table V . The data reveals that both ICR and ETR are a little lower here compared with the lossy coding experiments. The most complex parts of the MD, namely transform and quantization, are not performed when lossless coding is used, which is why ICR is smaller here. ETR is lower as a direct consequence of that. In addition, all other encoder modules have to operate on much larger amounts of data, which further reduces the relative time reduction. Still, GPU intra manages to save up to 53.37% encoding time.
The bitrate increases are at the same time negligible. On average across all experiments, the bitrate only increases by 0.55%. With lossless coding, the POS creates 100% accurate predictions because the original and reconstructed samples are identical. Hence, the losses are solely caused by the CABAC-free cost functions. It is interesting to note that the configurations (B) and (D) perform much better here. In fact, the average bitrate increases for the different configurations can be summarized as A > B > C > D. This means that the PRDO scheme is very accurate in lossless coding, where POS causes no prediction errors. It is therefore indeed beneficial to test all 35 modes in a brute force fashion on the GPU here. The additional workload can be handled comfortably by the GPU due to the efficient implementation of PRDO and, at the same time, the additional burden on the CPU is minimal. It follows from these results that GPU intra is ideally suited for lossless coding applications.
X. CONCLUSION
In this work, a novel parallel HEVC intra prediction algorithm for heterogeneous CPU+GPU platforms has been presented. It is targeted towards HD and UHD video applications with high quality and low delay encoding requirements. The proposed architecture is compatible with traditional single-threaded encoders and can furthermore be utilized as an efficient extension to the Wavefront Parallel Processing (WPP) mechanism. Communication between the CPU and the GPU is kept to a minimum and virtually no synchronization overhead is introduced. The achieved maximum encoding time reductions are 64.52% for a single-threaded system and 94.82% in combination with WPP.
The algorithm utilizes Prediction based on Original Samples (POS) and parallel-friendly cost functions to conduct intra prediction on one or more GPUs in a highly-parallel manner based on a Parallel Rate Distortion Optimization (PRDO) scheme. The acquired mode costs can then be used by the CPU in various ways to significantly accelerate the intra mode decision process. Different settings for varying speed-and quality-requirements are proposed, which make GPU intra a valuable tool for many real-life applications.
The presented framework is furthermore evaluated based on a large set of experimental data, covering a variety of different test sequences and configurations. The impacts that POS and the proposed cost functions have on the various stages of the intra mode decision process are statistically analyzed and the framework's efficiency is furthermore assessed in terms of RD performance. For high bitrate ranges, respective rate increases of 2.11-4.26% and 0.80-2.34% are observed for the suggested high-speed and high-quality configurations, with almost unaffected video quality. The bitrate increases in lossless coding settings are as low as 0.09%.
To further explore the possibilities of the presented algorithm, it is aimed to address some of its limitations as part of future work. These include the combination with inter prediction, the optimization for low-quality requirements, and solutions for sequences with very fine image details. In addition, the parallel scalability can potentially also be improved to make the algorithm more suitable for even larger spatial resolutions and many-core CPUs.
In conclusion, the presented GPU-based parallel intra prediction framework can speed up HEVC encoding significantly. At the same time, a reasonable balance between bitrate and video quality is maintained for the specified high-fidelity HD and UHD use cases.
