Abstract-HEVC has emerged as the new video coding standard promising increased compression ratios compared to its predecessors. This performance improvement comes at a high computational cost. For this reason, HEVC offers three coarse grained parallelization potentials namely, wave front, slices and tiles. In this paper we focus on tile parallelism which is a relatively new concept with its effects not yet fully explored. Particularly, we investigate the problem of partitioning a frame into tiles so that in a resulting one on one tile-CPU core assignment the cores are load balanced, thus, maximum speedup can be achieved. We propose various heuristics for the problem with a focus on low delay coding and evaluate them against state of the art approaches. Results demonstrate that particular heuristic combinations clearly outperform their counterparts in the literature.
I. INTRODUCTION
Tiles were introduced in HEVC [1] as an alternative way to achieve coarse grained parallelism that didn't previously exist in H.264/AVC [2] . Under tile partitioning a frame is split into M horizontal zones, each comprised of consecutive full rows of CTUs (Coding Tree Units) and N vertical zones each comprised of a number of consecutive full CTU columns. The intersections of the M horizontal zones with the N vertical ones form the M×N tile partitioning of the frame. An example is given in Fig. 1 .
Dependencies are broken on tile boundaries, allowing each tile to be encoded separately from the rest and creating the potential for parallelization. Since all tiles of a frame must be encoded before proceeding with the following frame, it is straightforward that the frame encoding time will be dictated by the slowest tile encoding task. Of particular interest is the case where enough CPU cores exist to perform a one on one tile-core assignment. In this case balancing core load is equivalent to balancing the size (in computational complexity terms) of the tiles. This process can be split into two parts. The first one is to estimate the CTU computational cost based on the information available, while the second one is to define the required M×N tile partitioning so that the maximum tile cost, defined by aggregating the costs of the corresponding CTUs is minimized.
In this paper we propose a heuristic partitioning scheme and evaluate its combinations with various CTU cost estimators, as well as with other tile partitioning approaches from the literature. Our contributions include:
• we propose a heuristic (Iterative Optimal 1D Partitioning, IOP for short), which invokes a 1D optimal algorithm initially proposed for the array partitioning problem [3] in the theory field, and adapt it so as to tackle the related 2D problem in a practical, fast manner for HEVC encoding;
• we evaluate the most promising heuristic combinations for a number of commonly used test sequences and compare them against the state of the art solutions.
Results indicate that in Low Delay coding, the IOP partitioning heuristic coupled with the estimator that uses CTU time together with GOP structure outperforms its counterparts in the related literature, with an average speedup improvement approaching 0.9 with 12 cores. The rest of the paper is organized as follows. Section II discusses the related work. Section III illustrates the partitioning algorithm and the cost estimation heuristics. Performance evaluation is done in Section IV. Finally, Section V concludes the paper.
II. RELATED WORK
Video coding parallelization has attracted much interest in the past. Works in the area include both fine and coarse grained approaches.
Fine grained methods usually rely on applying SIMD (Single Instruction Multiple Data) optimizations at one or more stages of the codec. Parallelizing DCT with SIMD instructions was one of the targets of [4] . In [5] a combined CPU-GPUs approach for parallel motion estimation is presented, while [6] includes a comparative study between motion estimation parallelism using CUDA cores, MPI and OpenMP. Such works are rather orthogonal to ours, since they tackle parallelism at a level lower than tiles, thus, can be (in principle) incorporated into tile parallelism approaches.
The coarse-grained category includes parallelism at the wavefront level, at slice and tile levels. In wavefront [7] each thread is assigned a CTU row and can commence the encoding of the corresponding CTUs once the first two CTUs of the previous CTU row are encoded. Slice parallelism also existed in H.264/AVC and has attracted significant interest. In [8] Macroblock assignment to slices was done based on the weighted past average of Macroblock coding times. In [9] the problem of balancing slices was tackled by introducing more slices than the number of available cores. In HEVC, the authors in [10] evaluated slice parallelism using fixed slices under various encoding scenarios, while in [4] the CTU cost estimation used for adapting slice size, was based on weighting upon the depth of each CU comprising the CTU. Last, in [11] the GOP structure for LD setting was used to estimate CTU cost. Although these works are not directly applicable for tile partitioning, we evaluate the IOP heuristic proposed in this paper using as CTU cost estimators the ones proposed in [4] , [8] and [11] .
More closely related are the works on tile parallelism in HEVC. In [12] the merits of tiles in HEVC are discussed focusing on evaluation using uniform static tile partitions. Another work presenting results for tile parallelization but for the case of intra encoding is [13] . There too, only fixed uniformly sized tiles are considered. The motivation for the tile partitioning algorithm in [14] is to use more tiles compared to the available cores in order to facilitate load balancing. The method is based on deriving a static tile partition based (among others) on pixel variance and the required throughput. Tiles are then assigned to cores using a bin packing technique. Since it is well documented, e.g., [15] , that increasing the number of tiles has a negative quality effect on compression (albeit smaller than slices), we followed an alternative path whereby there was one on one correspondence between tiles and CPU cores. In [15] an adaptive content tile partitioning algorithm is proposed. The size of tiles is decided so as to reduce the losses in coding efficiency generated by the use of tiles. Instead, we focus on improving the encoding time by reducing tile load imbalances, thus, this work is orthogonal to ours.
Perhaps the closest work is [16], where the authors advocate the use of a CTU weighting scheme similar to [4] for tile partitioning. In the paper we compare the performance of this estimator, both with the partitioning algorithm used in [4] and with the one newly proposed here. Results demonstrate that the estimator we propose, which is based on CTU coding time and GOP structure, is superior.
III. HEURISTICS
In this paper we consider the problem of defining an M×N tile partitioning such that the maximum tile cost is minimized. Both M and N are assumed to be known and remain unchanged throughout the encoding process. As already motivated the problem consists of two sub-problems. The first one is to estimate the CTU costs while the second is to define the partitioning given the estimated costs. In the sequel we present the heuristics evaluated in the paper.
A. CTU Cost Estimation
Before proceeding with the presentation of the heuristics, we first introduce some notation. We denote by $t_{ij}$ the encoding time of the $j$-th CTU in the $i$-th frame and by $w_{ij}$ the cost that an estimator predicted before the encoding of the $i$-th frame.
Previous frame heuristic (PF). PF estimates $w_{ij}$ as being equal to $t_{(i-1)j}$, i.e., it uses the coding time the CTU exhibited in the last coded frame to estimate its cost in the following frame. The idea is similar to the one presented for slice parallelism in [8].
Weighted CU depth (Weight). The algorithm proposed in [4] and [16] is based on assigning a weight cost on every CU depending on whether in the previous frame it was encoded as Skip, Inter or Intra and its corresponding depth in the quadtree. Table I reproduces the weight matrix for convenience. The heuristic then calculates CTU weight as the summation of the CU weights it consists of.
Low Delay estimator (LDE). One of the common test conditions is LD (Low Delay), which uses a hierarchical GOP structure. In all the experiments of the paper we used the default configuration for hierarchical P frames in the reference software HM 16.7, which is also depicted in Fig. 2. The intuition behind LDE is that the time complexity of frames belonging to the base layer, such as P4 and P8 in Fig. 2, will be better predicted by the preceding frame of the base layer rather than by the previous frame number wise. In the example, this means that P8 will be estimated using P4 rather than P7. Another change of LDE versus PF concerns the estimation of the frame that immediately follows a base layer frame. Instead of using the base layer frame, it uses the frame immediately preceding it. For instance, the estimation of P9 (not shown in Fig. 2) will be done using the CTU times exhibited in P7 instead of P8 as PF would have done.
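The frame selection rules of PF and LDE can be sketched as follows. This is only an illustration and not the implementation used in the experiments: the function names are hypothetical, base layer frames are assumed to sit at multiples of the GOP size (with frame 0 being the I frame), and the fallback to the previous frame near the start of the sequence is our own assumption, since the text does not cover that case.

```python
def pf_reference_frame(i):
    """PF: predict frame i from the frame coded just before it."""
    return i - 1


def lde_reference_frame(i, gop_size=4):
    """LDE: index of the frame whose CTU times predict frame i.

    Assumption: base layer frames are those with i % gop_size == 0.
    """
    if i <= 0:
        raise ValueError("frame 0 has no earlier frame to predict from")
    if i % gop_size == 0:
        ref = i - gop_size   # base layer frame: use previous base layer frame
    elif i % gop_size == 1:
        ref = i - 2          # right after a base layer frame: skip that frame
    else:
        ref = i - 1          # all other frames: same choice as PF
    # Assumed fallback (not from the paper): near the start use PF's choice.
    return ref if ref >= 0 else i - 1


def estimate_costs(times, i, estimator="LDE"):
    """Return the estimated CTU costs w_ij for frame i.

    times[f][j] holds the measured encoding time of CTU j in frame f.
    """
    ref = lde_reference_frame(i) if estimator == "LDE" else pf_reference_frame(i)
    return list(times[ref])
```

With these rules, P8 is predicted from P4 and P9 from P7, matching the examples in the text, whereas PF would use P7 and P8 respectively.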
B. Tile Partitioning
Having defined the estimated costs of the CTUs in the frame that is to be encoded, the tile partitioning algorithm attempts to define an optimal M×N tile partitioning such that the maximum tile cost is minimized. It is worth noting that the problem has been tackled in the theory area, in the context of array partitioning [3], where it was shown to be NP-hard. Nevertheless, to the best of our knowledge, heuristics from the theory area have not yet been adopted in video coding. This is presumably due to the fact that the proposed approximation schemes are rather complex in nature and might therefore compromise running time. Here, we follow an alternative approach whereby we adapt a fast 1D optimal scheme to the 2D tile case.
The 1D optimal algorithm. Assume we have an array of size S which must be split into M bins so that the maximum cost of a bin, defined as the aggregate cost of its cells, is minimized. This problem can be solved to optimality by the algorithm that follows.
Given a specific bin size B, it is identified whether all array elements can fit into M bins in a consecutive manner, or not. For instance, assuming the following array  A={10,12,15,5,8}, M=3 and B=20, it is clear that bin1={10}, bin2={12}, bin3={15,5}, leaving the last array element unassigned (assigning it to bin1 and bin2 is illegal, while in bin3 capacity violation occurs). We will refer to such bin assignments that leave some array elements unassigned as infeasible.
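The feasibility test described above amounts to a greedy scan over the array. The sketch below is a minimal illustration under our own naming (`is_feasible` is a hypothetical helper, not taken from the paper or from HM): elements are placed in order, and a new bin is opened whenever the next element would exceed the bin size.

```python
def is_feasible(costs, num_bins, capacity):
    """Check whether all costs fit, in order, into num_bins bins.

    Each bin holds consecutive elements whose total cost must not
    exceed capacity. Returns False as soon as an element is left over.
    """
    bins_used = 1
    load = 0
    for c in costs:
        if c > capacity:
            return False          # a single element already overflows a bin
        if load + c > capacity:
            bins_used += 1        # close the current bin, open the next one
            load = c
            if bins_used > num_bins:
                return False      # this element would remain unassigned
        else:
            load += c
    return True
```

For the array of the example, `is_feasible([10, 12, 15, 5, 8], 3, 20)` reports an infeasible assignment, since the last element cannot be placed, while a bin size of 22 is feasible.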
Let $C$ be the total cost of all elements in the array. Clearly, the optimal maximum bin cost can't be lower than $B_{min} = C/M$. It also can't be larger than $C$. It is straightforward to observe that, given the range $[B_{min}, C]$, the optimal bin size $B_{opt}$ is the first integer in this range that produces a feasible bin assignment. More accurately, $B_{min} \le B_{opt} \le C$, all assignments with $B < B_{opt}$ are infeasible and all assignments with $B \ge B_{opt}$ are feasible. In order to identify $B_{opt}$, binary search can be used. To reduce the number of iterations, instead of searching in the range $[B_{min}, C]$, the algorithm first checks the feasibility of $B = B_{min}$ (in case $C/M$ isn't an integer, the assignment can be identified as infeasible without checking bin assignments). If the assignment is feasible, $B_{opt} = B_{min}$ and the algorithm terminates. Otherwise, it checks the feasibility of $B = 2B_{min}$. If $2B_{min}$ also results in an infeasible placement, it checks $2(2B_{min})$ and so forth, until a feasible assignment is identified for $B = 2^k B_{min}$. Then a binary search approach is used to identify $B_{opt}$ in the range $(B_{min}, 2^k B_{min}]$. We note that in almost all cases in the experiments $k=1$.
2D application. Assume M×N is the required tile split. The role of the 2D array elements is taken by the CTU cost estimations. The algorithm proposed in this paper (Iterative Optimal 1D Partitioning, IOP) performs the following steps. First, it defines the horizontal split into M zones by running the previously described 1D algorithm over a 1D array of size equal to the frame height in CTUs, with the cost of each element derived by summing up the costs of the CTUs in the corresponding CTU row of the frame. Next, it defines the vertical split into N zones using the 1D algorithm over an array of size equal to the frame width in CTUs, with the cost of each element derived by summing up the costs of the CTUs in the corresponding CTU column of the frame.
The intersections of the M and N zones form the initial tile partitioning. The process is illustrated in Fig. 3 (a) and (b). For instance, defining the vertical split into N=3 zones is equivalent to solving the 1D problem using 3 bins over the array A={75, 89, 97, 109, 125, 137}, which gives $B_{opt}=261$ and the corresponding vertical cut depicted in Fig. 3 (b).
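The 1D search and the initial split can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are ours, costs are assumed to be non-negative integers, the starting bound is taken as the integer part of C/M, and the 4×6 cost array is the one recovered from Fig. 3 (its column sums reproduce the array A above).

```python
def is_feasible(costs, num_bins, capacity):
    """Greedy check: do all costs fit, in order, into num_bins bins?"""
    bins_used, load = 1, 0
    for c in costs:
        if c > capacity:
            return False
        if load + c > capacity:
            bins_used, load = bins_used + 1, c
        else:
            load += c
    return bins_used <= num_bins


def optimal_bin_size(costs, num_bins):
    """Smallest integer bin size giving a feasible assignment (1D case)."""
    total = sum(costs)
    b_min = total // num_bins
    # If C/M is not an integer, B_min is infeasible without any check.
    if total % num_bins == 0 and is_feasible(costs, num_bins, b_min):
        return b_min
    # Doubling phase: find the first feasible 2^k * B_min.
    lo, hi = b_min, max(1, 2 * b_min)
    while not is_feasible(costs, num_bins, hi):
        hi *= 2
    # Binary search in (lo, hi]: lo is infeasible, hi is feasible.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_feasible(costs, num_bins, mid):
            hi = mid
        else:
            lo = mid
    return hi


# CTU cost estimates of the example frame (values as read from Fig. 3).
ctu_costs = [
    [15, 20, 15, 35, 15, 25],
    [20, 35, 40, 26, 51, 40],
    [15, 22, 24, 18, 31, 37],
    [25, 12, 18, 30, 28, 35],
]
row_sums = [sum(row) for row in ctu_costs]        # input of the horizontal split
col_sums = [sum(col) for col in zip(*ctu_costs)]  # input of the vertical split

b_rows = optimal_bin_size(row_sums, 2)   # M = 2 horizontal zones
b_cols = optimal_bin_size(col_sums, 3)   # N = 3 vertical zones
```

Here `b_cols` evaluates to 261, in agreement with the value quoted in the text for N=3.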
The derived partitioning is further improved in an iterative manner as follows. First, the vertical split is optimized using the horizontal split previously derived (Fig. 3 (c)). To do so, a variation of the 1D approach is used, whereby the participating bins equal the number of tiles (6 in the example).

[Fig. 3, panels (a) to (d): each panel shows the same 4×6 array of CTU costs, with rows (15, 20, 15, 35, 15, 25), (20, 35, 40, 26, 51, 40), (15, 22, 24, 18, 31, 37) and (25, 12, 18, 30, 28, 35).]

In calculating feasibility for a particular bin size $B$, columns are added one by one, up to the point where adding a column would violate the capacity of one of the M bins as defined by the horizontal partition (M=2 in the example). The rest of the algorithm remains the same, i.e., it searches for the first feasible assignment using binary search in the range $[B_{min}, 2^k B_{min}]$ with $B_{min} = C/(MN)$. So for instance, having defined the horizontal partitioning of the solid line in Fig. 3 (c), the vertical repartitioning is done as follows (assuming T00 is the top leftmost tile and W00 the total cost of its elements). The total cost of the elements is $C=632$, $B_{min}=105$ and, since $C/(MN)$ isn't an integer, it leads to an infeasible assignment. Then the algorithm checks $B=210$. The first column is added, giving W00=35, W10=40 (the cost of the other tiles is 0). Since both are less than 210, the second column is checked. Adding it results in W00=90 and W10=74. Then the third column gives W00=145, W10=116 and the fourth column results in W00=206, W10=164. Adding the fifth column would result in a capacity violation, therefore the first two tiles are defined as having a width of 4 columns. The last two columns will be assigned to T01 and T11 and all the elements of the array will be assigned using the 6 available bins/tiles (in fact just 4). Since the assignment is feasible, in a binary search manner B=152 will be checked next and so forth, until $B_{opt}=131$ is defined, leading to the partition shown with dashed lines in Fig. 3 (c).
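The vertical repartitioning step can be sketched as below. This is an illustrative reading of the description, not the authors' implementation: the function names are hypothetical, the horizontal split is passed as a list of row indices at which a new zone starts, and the starting bound is taken as the integer part of C/(MN). The midpoints visited by this binary search may differ from those quoted in the text, although the search converges to the same bin size.

```python
def vertical_feasible(ctu_costs, row_cuts, n_zones, capacity):
    """Add columns one by one; open a new vertical zone when any of the
    tiles of the current zone would exceed capacity."""
    bounds = [0] + list(row_cuts) + [len(ctu_costs)]
    spans = list(zip(bounds, bounds[1:]))       # row range of each horizontal zone
    loads = [0] * len(spans)
    zones_used = 1
    for col in range(len(ctu_costs[0])):
        add = [sum(ctu_costs[r][col] for r in range(a, b)) for a, b in spans]
        if max(add) > capacity:
            return False
        if any(l + a > capacity for l, a in zip(loads, add)):
            zones_used += 1
            if zones_used > n_zones:
                return False
            loads = add
        else:
            loads = [l + a for l, a in zip(loads, add)]
    return True


def vertical_bin_size(ctu_costs, row_cuts, n_zones):
    """Smallest tile capacity for which the vertical repartition is feasible."""
    total = sum(sum(row) for row in ctu_costs)
    num_tiles = (len(row_cuts) + 1) * n_zones
    b_min = total // num_tiles
    if total % num_tiles == 0 and vertical_feasible(ctu_costs, row_cuts, n_zones, b_min):
        return b_min
    lo, hi = b_min, max(1, 2 * b_min)
    while not vertical_feasible(ctu_costs, row_cuts, n_zones, hi):
        hi *= 2
    while hi - lo > 1:                          # lo infeasible, hi feasible
        mid = (lo + hi) // 2
        if vertical_feasible(ctu_costs, row_cuts, n_zones, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Example of Fig. 3 (c): horizontal split after the second CTU row.
ctu_costs = [
    [15, 20, 15, 35, 15, 25],
    [20, 35, 40, 26, 51, 40],
    [15, 22, 24, 18, 31, 37],
    [25, 12, 18, 30, 28, 35],
]
b_opt = vertical_bin_size(ctu_costs, row_cuts=[2], n_zones=3)
```

On this input `b_opt` is 131, the value reported in the worked example.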
Having defined a new vertical split, the algorithm does the same with the horizontal partitioning as shown in Fig. 3(d) . The algorithm then proceeds in an iterative manner whereby at each iteration, both vertical and horizontal split improvements are considered. It terminates at the first iteration that couldn't improve the partitioning.
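Putting the pieces together, the whole iteration might look like the sketch below. It is one possible reading of the description and not the authors' code: all names are hypothetical, costs are assumed to be non-negative integers, and we assume that a pass (vertical followed by horizontal re-optimization) is accepted only if it lowers the maximum tile cost. The horizontal step reuses the vertical one on the transposed array, and the initial 1D splits are treated as the special case of a single zone in the other dimension.

```python
def collapse(grid, cuts):
    """Sum the rows of grid inside each zone delimited by cuts."""
    bounds = [0] + list(cuts) + [len(grid)]
    return [[sum(col) for col in zip(*grid[a:b])]
            for a, b in zip(bounds, bounds[1:])]


def try_cuts(zones, max_parts, cap):
    """Greedy scan along the zones; return cut indices, or None if infeasible."""
    loads, cuts = [0] * len(zones), []
    for j, col in enumerate(zip(*zones)):
        if max(col) > cap:
            return None
        if any(l + c > cap for l, c in zip(loads, col)):
            cuts.append(j)
            loads = list(col)
            if len(cuts) >= max_parts:
                return None
        else:
            loads = [l + c for l, c in zip(loads, col)]
    return cuts


def best_cuts(zones, max_parts):
    """Cuts for the smallest feasible capacity (doubling, then binary search)."""
    total = sum(map(sum, zones))
    bins = len(zones) * max_parts
    b_min = total // bins
    if total % bins == 0 and try_cuts(zones, max_parts, b_min) is not None:
        return try_cuts(zones, max_parts, b_min)
    lo, hi = b_min, max(1, 2 * b_min)
    while try_cuts(zones, max_parts, hi) is None:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if try_cuts(zones, max_parts, mid) is None:
            lo = mid
        else:
            hi = mid
    return try_cuts(zones, max_parts, hi)


def max_tile_cost(grid, row_cuts, col_cuts):
    cb = [0] + list(col_cuts) + [len(grid[0])]
    return max(sum(zone[a:b]) for zone in collapse(grid, row_cuts)
               for a, b in zip(cb, cb[1:]))


def iop(grid, m, n):
    """Return (row_cuts, col_cuts, max tile cost) for an m x n tile split."""
    transposed = [list(col) for col in zip(*grid)]
    row_cuts = best_cuts(collapse(transposed, []), m)   # 1D split over row sums
    col_cuts = best_cuts(collapse(grid, []), n)         # 1D split over column sums
    best = max_tile_cost(grid, row_cuts, col_cuts)
    while True:
        new_cols = best_cuts(collapse(grid, row_cuts), n)
        new_rows = best_cuts(collapse(transposed, new_cols), m)
        cost = max_tile_cost(grid, new_rows, new_cols)
        if cost >= best:                                 # no improvement: stop
            return row_cuts, col_cuts, best
        row_cuts, col_cuts, best = new_rows, new_cols, cost


GRID = [
    [15, 20, 15, 35, 15, 25],
    [20, 35, 40, 26, 51, 40],
    [15, 22, 24, 18, 31, 37],
    [25, 12, 18, 30, 28, 35],
]
result = iop(GRID, 2, 3)
```

On the example array of Fig. 3 with a 2×3 split, the initial partition has a maximum tile cost of 145, the first pass lowers it to 131, and the second pass brings no further improvement, so the sketch stops there.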
IV. EXPERIMENTS
We implemented the tile parallelism heuristics using the HM 16.7 reference software and OpenMP. We conducted experiments on a Linux server with two 6-core Intel Xeon E5-2630 CPUs running at 2.3GHz using hyper threading. We used Class A and B test sequences together with Bosphorus, which is 4K. The characteristics of the sequences are summarized in Table II. All results were obtained assuming the LD scenario with an initial I frame followed by P frames and a GOP size of 4 with the structure shown in Fig. 2. QP was set to 32, bit depth was 8, CTU size 64×64, max depth for partitioning was set to 4 and search mode to TZ.
We experimented with three different tile numbers (in one slice): 4 (2×2), 8 (4×2) and 12 (4×3). Each tile was assigned a separate CPU core on a one-to-one basis. We compared the combination of IOP together with the PF, Weight and LDE estimators versus a static approach that assigns tile sizes in a uniform manner and doesn't change them (Static) and the algorithm of [16]. In all cases we measured the achievable speedup versus a sequential execution. Table III shows the achievable speedups for each sequence. Bold values represent the algorithmic winner in this particular setting. Notice that IOP-LDE and IOP-PF are winners over the rest, with the first combination achieving slightly better overall performance compared to the second one, as depicted more clearly in Table IV, which illustrates the average performance in speedup terms over all test sequences.
Summarizing the results, the achievable speedup of IOP makes it a clear winner against Static and the method presented in [16], with speedup differences of roughly 0.5 in the 8 core case and 0.9 in the 12 core scenario. Among its variants, the combination with LDE achieves the best performance, with PF following closely. Before ending the section we would like to mention that IOP itself contributed a negligible overhead, hence the improved speedup achieved in the coding process. This was due to the fact that, after the initial partitioning, the number of iterations performed by IOP was never more than four (in 36.24% of the cases it was one and in 60.84% two).
V. CONCLUSIONS
In this paper we considered the problem of defining tile partitioning such that the threads tasked with the encoding of each tile are load balanced. We proposed a partitioning heuristic called IOP and evaluated it with different CTU cost estimators. IOP-LDE and IOP-PF were found to outperform both the static uniform approach and another alternative from the literature. 
