Abstract
Low-power asymmetric multicore processors (AMPs) have attracted considerable attention due to their appealing performance/power ratio for energy-constrained environments. However, these processors pose a significant programming challenge due to the integration of cores with different performance capabilities, calling for an asymmetry-aware scheduling solution that carefully distributes the workload. The recent HEVC standard, which offers several high-level parallelization strategies, is an important application that can benefit from an implementation tailored for the low-power AMPs present in many current mobile or handheld devices. In this scenario, we present an architecture-aware implementation of an HEVC decoder that embeds a criticality-aware scheduling strategy tuned for a Samsung Exynos 5422 System-on-Chip furnished with an ARM big.LITTLE AMP. The performance and energy efficiency of our solution are further enhanced by exploiting the NEON vector engine available in the ARM big.LITTLE architecture. Our experimental results show real-time 1080p HEVC decoding at 24 frames/s and a reduction of energy consumption of over 20 %.
Introduction
High Efficiency Video Coding (HEVC) [6] is the successor of the most widely used video standard, H.264/Advanced Video Coding (AVC) [20], and, therefore, a serious candidate to become the state-of-the-art tool for video compression in the near future.
One crucial requirement for video compression, which shaped the HEVC standard, is adaptability, especially for practical consumer electronics applications. In particular, video content should be preferably distributed in a format that is in accordance with the display and memory capabilities, processing power, and computational constraints of consumer electronics appliances as well as with the network bandwidth.
The Joint Collaborative Team on Video Coding (JCT-VC), which includes video experts from both the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG) and the International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) Video Coding Experts Group (VCEG), designed the new HEVC standard taking into account these requirements. Concretely, HEVC was conceived to deliver high-quality multimedia services, with reasonable efficiency and performance over bandwidth-constrained networks. In addition, a primary goal during the standardization process was to minimize the computational requirements and energy consumption [36]. Although several coding tools were not included in HEVC due to their complexity, the computing and energy demands of HEVC have considerably increased [33] in comparison with its predecessor H.264/AVC [20].
Asymmetric multicore processors (AMPs) have been recently proposed for severely energy-constrained environments, especially for mobile appliances, where heterogeneity in applications is mainstream. These architectures integrate two (or more) types of cores with different capabilities, which share the instruction set architecture (ISA) but differ in their microarchitecture. A practical example is the ARM big.LITTLE AMP included in Samsung's Exynos 5422 System-on-Chip (SoC), which features both ARM Cortex-A15 and Cortex-A7 cores. The former type of core delivers higher performance than the Cortex-A7 counterpart, at the expense of higher power dissipation rate, while the Cortex-A7 cores can potentially deliver a more favorable performance/power ratio.
In this paper, we integrate an asymmetry- and criticality-aware scheduling strategy into the multithreaded implementation of the libde265 library [15], specifically adapted for the Exynos 5422 SoC. We target the wavefront parallel processing (WPP) [19] scheme defined for HEVC, in which multiple "regions" of a single frame can be processed simultaneously. Compared with other alternatives, such as the tiling and slicing approaches, WPP neither limits intra-prediction nor resets context-adaptive binary arithmetic coding (CABAC) probabilities, avoiding a reduction of the encoding efficiency. However, as shown in this paper, this parallelization strategy poses a considerable challenge for an AMP, due to the dependencies when reconstructing the Coding Tree Unit (CTU) rows.
To exploit the asymmetric hardware concurrency, our parallel solution migrates the threads taking into account the dependencies of the tasks in execution, so that those tasks with higher priority are always executed on the fast (big) ARM Cortex-A15 cores, while the non-priority tasks run on the slow (LITTLE) Cortex-A7 cores. Thus, the key to this strategy is a policy that moves tasks between fast and slow cores on-the-fly, as the tasks' priorities vary. The implementation is further enhanced by including ARM-specific NEON optimizations [2] in certain modules of the library, a capability that is not available in the reference HEVC library. The experimental evaluation of our proposal on an ODROID-XU3 board [18], furnished with an Exynos 5422 SoC, demonstrates the advantage of the new decoder in terms of decoded frames per second (FPS) and also from the perspective of energy efficiency.
The rest of the paper is structured as follows. In Sect. 2, we briefly describe the HEVC standard and discuss some related works. In Sect. 3, we experimentally illustrate the poor (computational) performance and energy efficiency of the reference multithreaded implementation of the libde265 library on the target ODROID-XU3 board/Exynos 5422 SoC. In Sect. 4, we introduce our strategies to adapt the original libde265 implementation to the asymmetric ARM big.LITTLE architecture; in Sect. 5 we report the performance and energy efficiency results of the new solution. Finally, Sect. 6 closes the paper with a few concluding remarks.
HEVC [6] introduces new coding tools with respect to its predecessor H.264/AVC, and also improves upon alternative previous encoders. The HEVC standard can reach the same subjective video quality as its predecessor using half the bit rate [29, 36], while notably increasing coding efficiency. Moreover, HEVC is designed to support formats beyond high-definition (HD) resolution.
One important change in HEVC affects image (frame) partitioning. The new standard introduces three new concepts: coding unit (CU), prediction unit (PU), and transform unit (TU). A frame is partitioned into CTUs, each containing a single luma Coding Tree Block (CTB) and two chroma CTB blocks. Each CTU is further partitioned into several square regions, of variable size, called CUs. Each CU, consisting of 8 × 8 to 64 × 64 pixels, may contain one or several PUs and TUs. This partitioning can be performed within each subarea recursively, until it has a size of 8 × 8 pixels; see Fig. 1. At PU level, intra- and inter-predictions are carried out with sizes ranging from 4 × 4 to 64 × 64 pixels. The CUs can be further split into TUs, which contain the coefficients for transformation and quantization in the form of square blocks of pixels. This structure leads to a more flexible coding that suits the particularities of the frame.
Tiles [16] and WPP are two new technologies introduced in the HEVC standard to support high-level parallelism, in both cases by allowing the division of frames for parallel decoding. Additionally, HEVC inherits the slice concept from previous standards. When working by tiles, a frame can be divided into rectangular groups of CTUs, which are treated as independent decoding tasks. Alternatively, with WPP, CTU rows can be decoded in parallel although, due to data dependencies, the decoding of a CTU row must proceed with a delay of two CTUs with respect to that of the previous CTU row. The resulting staggered progression across rows resembles a wavefront, which gives the approach its name.
Related work
In the literature we can find general strategies to accelerate HEVC video decoders, as well as proposals for energy-constrained devices which pursue similar goals to those set for our work.
In [8], the authors use single-instruction multiple-data (SIMD) instructions to accelerate all major modules of an HEVC decoder, obtaining speedups of up to 5× on mobile and desktop platforms, to deliver 1080p real-time decoding. Similarly, the authors of [3, 14, 26] report 720p real-time decoding on an ARM Cortex-A9 platform and 1080p real-time decoding on an Intel platform. Although all these works leverage SIMD optimizations on energy-constrained devices, none of them analyzes the impact of this type of optimization in terms of energy consumption. In [11] the authors introduce a methodology to deal with the dependencies intrinsic to WPP. Concretely, they adapt the WPP approach to process several partitions as well as several pictures in parallel. This presents the advantage of maintaining the amount of (thread-level) parallelism during the execution. This approach is further optimized in [10] and compared with the other high-level parallelization strategies defined in the HEVC standard.
Hardware implementations of the HEVC decoder are also a hot research topic. In [38] , a 40-nm CMOS hardware decoder is introduced that can process 4k videos in real time. In [23] a more energy-efficient 28-nm CMOS hardware decoder is presented. In addition, there are also proposals that focus on a concrete module of the HEVC decoder such as [12, 24, 43] .
In the context of mobile and handheld devices, the authors of [30, 31] describe several strategies for power optimization of a real-time software HEVC decoder on a NEON architecture. These strategies exploit data parallelism, task parallelism, and dynamic voltage-frequency scaling (DVFS); however, only one type of core in an ARM big.LITTLE AMP is used at a time. In [27, 28], the authors reduce the filtering complexity to diminish energy consumption at the expense of a significant degradation in the subjective video quality. Energy reductions of up to 28 % are reported there for an ARM big.LITTLE processor, but the authors only use one type of core. Some complementary work also aims to accelerate a concrete module on these energy-constrained devices [41].
A research topic of interest is scheduling for heterogeneous/asymmetric processors. In [40] the authors propose a scheduling strategy based on a performance impact estimation (PIE) which is able to predict a workload-to-core mapping. On a simulator that mimics the hardware characteristics of a big.LITTLE board, the authors are able to improve the performance by 8.7 % over the base scheduling policy. Moreover, two techniques are presented in [39] to simultaneously meet high performance and fast scheduling time. The first one selects the task with the highest upward rank value at different levels of the application and assigns the selected task to the processor, while the second uses the summation of upward and downward rank values for prioritizing tasks. Other relevant scheduling strategies tested by means of simulators can be found in [7, 25] , while in other works the asymmetry of the processor is enforced by changing the core operating frequencies [25, 34] .
Focusing on scheduling strategies specifically designed for ARM big.LITTLE processors, two main variants can be found in the literature: (1) proposals of scheduling strategies that optimize energy efficiency [13, 35] and (2) proposals which, depending on the requirements of the application, bound the workload to either the big or the LITTLE clusters on boards where only one of them can be active at a given time [17, 44] .
Additionally, we can find some efforts to characterize the execution time and energy consumption of an HEVC decoder, for example: complexity-related aspects in the standardization process [5]; energy consumption of multicore CPUs and hardware-accelerated decoders [4]; and exploiting race-to-idle and slack via DVFS for energy efficiency [9].
3 Multithreaded implementation of the HEVC decoder
The reference libde265 library follows a master-slave approach where a single master thread creates and enqueues tasks, while several (worker) threads process these tasks. This master thread is only responsible for creating the tasks, but does not interact with the worker threads otherwise. The master thread has the highest priority since the tasks should be created as soon as possible in order to continue the decoding process. Given a frame, in WPP the master thread initially creates one task per CTU row and keeps track of task dependencies to further create more tasks. The tasks are responsible for keeping track of the internal dependencies when reconstructing a given CTU row. In particular, if we number the CTUs in the frame from the top-left corner to the bottom-right one starting at (0, 0), where the first index denotes the CTU row, then CTU (i, j) cannot be reconstructed until CTU (i − 1, j + 2) has been completely processed, since in WPP mode each CTU row is processed relative to its preceding CTU row with a delay of two consecutive CTUs. Proceeding in this manner, no dependencies between consecutive CTU rows are broken at the partition boundaries [37]. (This constraint does not apply when the parallelization approach is based on slices or tiles.)
Once all CTU rows have been reconstructed (by the worker threads), in a second step, the tasks corresponding to the Deblocking and Sample Adaptive Offset (SAO) filters are generated, and these filters are applied. The library generates three tasks per CTU row: one to filter the vertical edges, one to filter the horizontal edges, and one for the SAO filter. In this second step, a synchronization of tasks is always needed since, in order to perform the vertical filtering of a concrete CTU, up to three CTUs must be previously horizontally filtered. The same applies to the SAO filter.
Experimental setup
The ODROID-XU3 board employed in our experiments comprises a Samsung Exynos 5422 SoC with an ARM Cortex-A15 quad-core processing cluster running at 2.0 GHz and a Cortex-A7 quad-core processing cluster at 1.4 GHz. Both clusters access a shared DDR3 RAM (2 Gbytes) via 128-bit coherent bus interfaces. Each ARM core (either a Cortex-A15 or a Cortex-A7) has a 32+32-Kbyte L1 (instruction+data) cache. In addition, the four ARM Cortex-A15 cores share a 2-Mbyte L2 cache, while the four ARM Cortex-A7 cores share a smaller 512-Kbyte L2 cache; see Fig. 2.
The target platform runs an Ubuntu 14.04 LTS distribution with the Linux kernel 3.10. The library was compiled with the g++ 4.8.2 compiler at the -O3 optimization level. We ensure that all cores run at their highest frequency during the experiments by setting the appropriate frequency limits in the Linux performance governor.
The codes are instrumented with the pmlib [1] framework, which collects power dissipation samples corresponding to instantaneous power readings from four independent sensors in the board (for the Cortex-A7 cores, Cortex-A15 cores, DRAM DIMMs, and the MALI graphics processing unit (GPU)), with a sampling period of 250 ms. The power readings from all four sensors are added to estimate the instantaneous total power dissipation, and this collection of values is averaged and multiplied by the execution time to obtain the energy consumption.
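As an illustration, the energy estimate described above can be sketched as follows. This is a minimal sketch and not the pmlib API: the function names and the sample values in the usage note are hypothetical, and each sample is assumed to be a tuple with one reading (in Watts) per sensor.

```python
def energy_joules(samples_w, exec_time_s):
    """Estimate energy as (average total power) x (execution time).

    samples_w: list of tuples, one per sampling instant, each holding the
               readings (W) of the four sensors (A7, A15, DRAM, GPU).
    exec_time_s: execution time of the run in seconds.
    """
    # Add the four sensor readings to get the total power at each instant.
    totals = [sum(sample) for sample in samples_w]
    # Average the instantaneous totals over the whole run.
    avg_power_w = sum(totals) / len(totals)
    return avg_power_w * exec_time_s


def energy_per_frame(samples_w, exec_time_s, frames):
    """Energy per frame (EPF), in Joules, for a run that decoded `frames` frames."""
    return energy_joules(samples_w, exec_time_s) / frames
```

For instance, with two hypothetical samples `[(0.5, 3.0, 0.25, 0.25), (0.5, 5.0, 0.25, 0.25)]`, an execution time of 2 s and 40 decoded frames, the average total power is 5 W, the energy is 10 J, and the EPF is 0.25 J.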
Performance of the multithreaded libde265 on the ODROID-XU3
To characterize the performance of the libde265 library, the developers show that, when decoding a 1080p HEVC bit stream via WPP multithreading, on average the gstreamer plug-in is able to process 150 FPS on a server equipped with a recent Intel desktop quad-core CPU. However, as shown next, these decoding rates are much lower on the ODROID-XU3 board. The test set for our experiments includes the five 1080p videos from the JCT-VC benchmark [21]; hereafter, we report average results for all these video sequences. These videos were previously encoded with the HM-16.2 reference encoder [22] using four quantization parameter (QP) points (22, 27, 32, and 37). The HM encoding parameters were those set by default in the random-access main configuration, except that the WPP option is enabled.
The top plot in Fig. 3 shows the FPS (averaged for all five 1080p videos from the JCT-VC test set) on the ODROID-XU3. The bottom plot in the same figure shows the (average) energy (in Joules) per frame (EPF) consumed during these experiments. The first aspect to observe is that the use of more than four threads produces an unexpected decrease of the FPS. The reason for this behavior is the operating system (OS) scheduler implemented in Linux for the ODROID-XU3 board, named heterogeneous multiprocessing (HMP) scheduling [42]. This scheduler proceeds by monitoring the behavior of the application during a period of time. Then, based on that information, on the next cycle it decides whether the application has to be executed on a big or a LITTLE core. In our experiments, we observed that the scheduler simply schedules all the threads to the Cortex-A15 cores. In consequence, when more than four threads are spawned, some Cortex-A15 cores interleave the execution of two threads and, for this particular application, a reduction in the FPS rate is observed. An additional aspect to note is that the FPS does not scale linearly with the number of cores, due to the dependencies when decoding the CTU rows. Concretely, as the videos are encoded to support WPP, the decoding of a CTU row cannot start until a minimum of two CTUs has been completely reconstructed for the previous CTU row. As a result, at the beginning of decoding each frame, all threads except for one will have to wait before they can proceed to reconstruct their assigned CTU rows. To further expose the effect of WPP decoding on a multithreaded execution, let us assume that the decoder spawns four (worker) threads. This means that the decoding process of the fourth CTU row cannot start until a minimum of six CTUs of the first CTU row has been reconstructed, which represents one-fifth of the horizontal resolution for a 1080p video.
Moreover, when eight threads are spawned, the reconstruction of the eighth CTU row cannot begin until a minimum of fourteen CTUs of the first CTU row has been reconstructed, which is almost half of the horizontal resolution of a 1080p video. In summary, with this approach, the level of concurrency is linearly reduced as the multithreading factor grows. This explains why, in our experiments with QP = 37, two threads deliver a speedup of 1.82 over the sequential execution, but four threads offer a speedup of only 3.02.
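The start-up delays quoted above follow from accumulating the two-CTU lag over consecutive rows. The sketch below reproduces that arithmetic; the function names are hypothetical, and the 64 × 64 CTU size is an assumption chosen because it yields 30 CTUs per row for a 1920-pixel-wide frame, which matches the "one-fifth" figure given in the text.

```python
CTU_SIZE = 64   # assumption: 64x64 CTUs (30 CTUs per row at 1920 pixels wide)
WPP_LAG = 2     # each CTU row trails the previous one by two CTUs


def ctus_per_row(width_px, ctu_size=CTU_SIZE):
    """Number of CTUs needed to cover one row of the frame (ceiling division)."""
    return -(-width_px // ctu_size)


def first_row_ctus_before_start(row, lag=WPP_LAG):
    """Minimum CTUs of the first row that must be done before `row` can start.

    `row` is 1-based, so row 1 can start immediately.
    """
    return lag * (row - 1)


def startup_fraction(row, width_px=1920):
    """Fraction of the first row that is complete when `row` is allowed to start."""
    return first_row_ctus_before_start(row) / ctus_per_row(width_px)
```

With these definitions, the fourth row has to wait for 6 of the 30 CTUs in the first row (one-fifth of the row), and the eighth row for 14 of 30 (almost half).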
Regarding energy efficiency, using more than four threads does not significantly improve the energy efficiency of the HEVC decoder. In addition, four threads offer better energy efficiency than one, but the effect is less visible due to the sublinear scalability of the HEVC decoder.
Static manual thread mapping on the Exynos 5422 SoC
Our initial attempt to exploit the computational resources of the Exynos 5422 SoC more efficiently manually binds the (worker) threads to concrete cores, populating first each Cortex-A15 core with a single thread, and from then on mapping threads to the Cortex-A7 cores. The two plots in Fig. 4 display the average FPS and EPF (top and bottom, respectively) attained with the static manual binding (lines labeled as A15 + A7). Additionally, for comparison purposes, we include the performance lines when the execution is limited to the Cortex-A7 cores (label A7), as well as the lines of the "Unmodified" version of the reference library that only uses the Cortex-A15 cores (see previous subsection). In all cases, for clarity, we only include the lines for a concrete QP, but the qualitative conclusions that can be extracted from all other QPs are similar.
Let us focus on the FPS first. As expected, the results obtained with the static manual binding upon spawning four or fewer threads reveal performance rates that are very close to those observed with the unmodified version of the library. The reason is that, in both configurations, we are only using the Cortex-A15 cores (with static manual binding, threads are first mapped to this type of core; with the unmodified configuration, threads are only mapped to them). However, when five threads are used, the performance attained with the static manual binding is significantly reduced, steadily growing from that point with the number of threads but remaining below the performance obtained with the unmodified version of the library. The source of this behavior is that, due to the dependencies during the reconstruction of the CTU rows, the Cortex-A7 cores slow down the threads running on the Cortex-A15 cores, so that the latter basically proceed at the speed of the former. To illustrate this, note that the static manual binding scheme, operating with the full 8 cores, delivers around 18.5 FPS, which is slightly below twice the FPS rate attained with 4 Cortex-A7 cores (around 10.1 FPS). From the point of view of EPF, the static manual binding delivers higher energy efficiency than the unmodified version when five or more threads are spawned. This is explained in Fig. 5, which displays the power dissipation (in Watts) of the different configurations. There, we can observe the large differences in power consumption when only the Cortex-A7 cores are used (around 1.5 W) and when the unmodified version is used (around 5.5 W). We can also note that, when we exploit both types of cores simultaneously via static manual binding, the average power draw decreases. In this scenario, the Cortex-A7 cores, which dissipate significantly less power than the Cortex-A15 cores, are at full load, but the Cortex-A15 cores are not, reducing the overall power rate.
A decrease of about 25-30 % in power dissipation is delivered when both types of cores are used simultaneously in comparison with the unmodified version, which exceeds the FPS decay, ranging from 15 to 28 %. The effect on the energy consumption follows from the fact that energy is the product of execution time and average power.
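To make the relation explicit: since energy is power multiplied by time, and the time per frame is the inverse of the FPS, the energy per frame equals the average power divided by the FPS. The sketch below (hypothetical function name) computes the resulting EPF ratio from fractional drops in power and FPS. The text does not state which power reduction corresponds to which FPS reduction for each thread count, so the pairing used in the usage note is illustrative only.

```python
def relative_epf(power_drop, fps_drop):
    """Return EPF_new / EPF_old given fractional drops in power and FPS.

    EPF = average power / FPS, so lowering the power by `power_drop` and the
    FPS by `fps_drop` (both in [0, 1)) scales EPF by (1 - p) / (1 - f).
    A result below 1.0 means less energy is spent per decoded frame.
    """
    return (1.0 - power_drop) / (1.0 - fps_drop)
```

For example, `relative_epf(0.30, 0.15)` is roughly 0.82, i.e., about 18 % less energy per frame, whereas equal drops in power and FPS leave the EPF unchanged. The EPF improves only when the power reduction is larger than the FPS reduction for the same configuration.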
In conclusion, these experiments naturally motivate the need for an architecture-aware alternative to the original multithreaded implementation that is able to efficiently exploit the asymmetric resources of the ARM big.LITTLE architecture (or any other AMP). Ideally, an appropriate scheduling mechanism that exploits both the Cortex-A15 and Cortex-A7 cores should render two positive effects:
• An increment of FPS. To satisfy this, the Cortex-A7 cores should not slow down the execution of the Cortex-A15 cores but contribute to the global (combined) decoding rate.
• A decay of EPF. The Cortex-A7 cores are more energy efficient than their Cortex-A15 counterparts, so by including the former we should save energy.
4 Architecture-aware optimization of libde265 on the ARM big.LITTLE AMP
The default approach adopted by libde265 on a multithreaded CPU presents three major drawbacks when WPP is applied to simultaneously leverage all the cores of an ARM big.LITTLE AMP:
• Due to the computing requirements of the libde265 library and the scheduling policy of the ODROID-XU3 board (HMP scheduling), the OS scheduler does not map threads to the LITTLE cores.
• When forcing the scheduler to use the Cortex-A7 cores, the dependencies intrinsic to WPP prevent the full potential of the Cortex-A15 cores from being exploited, decreasing the overall FPS.
• For WPP, the performance does not scale linearly with the number of threads, which directly affects the energy efficiency of the solution. This issue is not specific to an AMP but rather affects any multithreaded CPU. There exist several prior strategies that aim to mitigate this effect [10, 11].
In the remainder of this section, we address the first two issues in order to improve the performance of the libde265 library on the Exynos 5422 SoC. We do not consider the third issue as it has been previously tackled by others in order to improve the decoder's scalability, by starting with a new frame before the previous one has been completely reconstructed [10, 11]. We emphasize that the principles of the scheduling algorithm presented in our work are orthogonal to these frame-asynchronous strategies, and our approach can directly accommodate these strategies as the thread dependencies remain in place. This feature is very useful when, for a given frame, there are fewer unfinished CTU rows than idle CPU cores in the platform.
Asymmetry- and criticality-aware scheduling on the Exynos 5422 SoC
Taking into account the results in Sects. 3.2 and 3.3, we have introduced significant modifications in the original multithreaded implementation of the libde265 library in order to migrate the threads between the two types of cores in the Exynos 5422 SoC at execution time. Note that the optimizations described in this section are only beneficial when the execution proceeds on a number of threads that exceeds the number of Cortex-A15 cores, as otherwise our solution only exploits the more powerful Cortex-A15 cores. The fundamental step toward delivering high performance and energy efficiency on the target Exynos 5422 SoC is to carry out an appropriate dynamic binding of threads to cores. Specifically, the master thread is always bound to a Cortex-A15 core since it is in charge of generating the tasks, and these must be available as soon as possible. In addition, the mapping of worker threads to cores is crucial to fully exploit the capabilities of the ARM big.LITTLE processor. For simplicity, let us assume an execution with 8 worker threads. Then, at the beginning of decoding each frame, the four threads assigned to reconstruct the top four CTU rows are initially bound to Cortex-A15 cores, while the next four CTU rows are reconstructed by four threads that are bound to Cortex-A7 cores. This situation is graphically illustrated in Fig. 6. This initial mapping of threads to cores does not differ from the static manual binding previously presented. The key to our new asymmetry-aware solution, however, is to control the migration of threads, in order to ensure that the Cortex-A15 cores remain in charge of the "critical" CTU rows.
Let us consider, for example, that 8 threads, denoted as T1-T8, are spawned, and they start processing the top 8 CTU rows, with the i-th row assigned to thread Ti, the first four threads mapped to the four Cortex-A15 cores (big.{A,B,C,D}), and the next four to the four Cortex-A7 cores (LITTLE.{A,B,C,D}). Consider next that, starting from the initial scenario, thread T1, which processes the first CTU row on big.A, completes this task. In this situation, thread T5, in charge of the fifth CTU row on LITTLE.A, is migrated to big.A, where it continues processing the same row. In addition, thread T1 is migrated to LITTLE.A, where it starts processing the 9th CTU row. As mentioned, the purpose here is to keep all 8 cores/threads busy with work, but to ensure that, from the 8 CTU rows that are (most of the time) on-the-fly, the top four are assigned to threads running on Cortex-A15 cores, while the remaining four are processed by threads mapped to the Cortex-A7 cores. A graphical illustration of the migration is provided in Fig. 7.
From the implementation point of view, the threads bound to the Cortex-A7 cores are responsible for checking whether any Cortex-A15 core becomes "idle." When a task bound to a Cortex-A15 core is completed, the task itself changes the CPU affinity mask of the thread and marks the Cortex-A15 core as "idle." When any of the tasks running on Cortex-A7 cores detects this "idle" Cortex-A15 core, it changes its CPU affinity mask, and it is migrated for execution on the "idle" Cortex-A15 core. The Cortex-A15 core is then marked again as "busy."
To avoid incurring an excessive overhead, this test is done every time a CTU is completely reconstructed. As a result, the migration of the threads from slow to fast cores is not immediate, but can be slightly delayed (note that a 1080p video contains 30 CTUs per CTU row that are processed in about 5-10 ms, meaning that the time between two checking points is considerably smaller than 1 ms). Due to this, a special situation may occur that has to be tackled with care. Assume that the first thread running on a Cortex-A7 core that completes a CTU and, therefore, checks whether there are any Cortex-A15 cores available is T6, running on LITTLE.B. In this scenario, we only allow T6 to migrate to a vacant Cortex-A15 core if there are at least two of them idle as, otherwise, we would be giving higher priority to the reconstruction of the 6th CTU row over the 5th one. A graphical description of this mechanism can be found in Fig. 8, where CTU rows 1 and 2 have been completely reconstructed (the worker threads assigned to process them have been migrated to Cortex-A7 cores, and the Cortex-A15 cores are marked as "idle"), and CTU rows 3 and 4 are in progress on two Cortex-A15 cores. In this situation, if T6 (CTU row number 6) is the first thread that checks for "idle" Cortex-A15 cores, it will be migrated if and only if there exist enough Cortex-A15 cores to process all preceding CTU rows. In Fig. 8, T6 is migrated to a Cortex-A15 core since, at this point, there are two idle Cortex-A15 cores. In addition, T5 will be migrated to a Cortex-A15 core as soon as it finishes the next CTU. Similarly, we allow T7 and T8 to migrate to Cortex-A15 cores if at least three or all four of them, respectively, are idle.
With this strategy, we ensure that the threads responsible for reconstructing the four pending CTU rows with highest priority (i.e., those at the top) are processed on the Cortex-A15 cores, but we simultaneously allow that, when there are Cortex-A15 cores available, those threads processing CTU rows on the Cortex-A7 cores do not have to wait until the top one is migrated.
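The eligibility test performed by a thread on a Cortex-A7 core after each reconstructed CTU can be sketched as follows. This is a minimal model of the rule stated above, not the code of the modified library: the function name and the way the state is represented (a set of row numbers currently processed on LITTLE cores and a count of idle big cores) are assumptions made for illustration.

```python
def may_migrate(row, rows_on_little, idle_big_cores):
    """Decide whether the thread handling `row` may move to a big core.

    row: CTU row processed by the thread that performs the check.
    rows_on_little: CTU rows currently being processed on LITTLE cores.
    idle_big_cores: number of big cores currently marked as idle.

    The thread may migrate only if enough idle big cores remain to also
    host every preceding (higher-priority) row that is still on a LITTLE core.
    """
    preceding = sum(1 for r in rows_on_little if r < row)
    return idle_big_cores >= preceding + 1
```

With rows 5 to 8 on the LITTLE cores, T5 needs one idle Cortex-A15 core, T6 two, T7 three, and T8 four, which matches the thresholds given in the text. In the scenario of Fig. 8, once T6 has moved, `may_migrate(5, {5, 7, 8}, 1)` holds, so T5 can take the remaining idle core at its next check.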
In the final part of the CTU row reconstruction, we ensure that the threads which process the bottom CTU rows of each frame are never migrated to the Cortex-A7 cores, but remain bound to Cortex-A15 cores. As a consequence, in order to start with the in-loop filters part, we have four threads bound to Cortex-A15 cores and four bound to Cortex-A7 cores. When there are fewer unfinished tasks than worker threads, idle threads are blocked until they receive a signal which indicates that new tasks have been generated.
The multithreaded parallelization of the stage that applies the filters is easier as, in this case, there exists a larger volume of tasks and a considerably higher level of concurrency (i.e., fewer dependencies) compared with the prior stage. In consequence, thread scheduling during the application of the in-loop filters is not so crucial/complex. The fundamental modification in the filter stage aims to keep the Cortex-A15 cores always busy with work. Thus, for example, when the number of tasks available for execution is lower than eight, the Cortex-A7 cores first become idle. This test is again carried out by the threads mapped to the Cortex-A7 cores whenever the filters are applied to a complete CTU. If an idle Cortex-A15 core is detected, the thread that found this situation is migrated there.
NEON intrinsics
The libde265 library integrates SSE optimizations for x86-based architectures. Concretely, several key functions for interpolation, motion compensation, and transformation modules of the library are implemented using SSE4 or SSE4.1 intrinsics. However, these instructions are not supported by ARM processors, as these systems have their own SIMD instruction set called NEON [2] . In consequence, in order to exploit the SIMD units in the ARM architecture, it is necessary to transform the SSE-based functions to their NEON-based counterparts.
For this purpose, we have mimicked the work in [32] , which proposes NEON equivalents for a limited subset of the SSE intrinsics. Following an analogous path, we have added new equivalents to the work in [32] in order to deliver NEON-equivalent intrinsics for all SSE intrinsics appearing in the libde265 library. At this point, we would like to emphasize that, when available, we use the equivalents proposed in [32] without modifications. For some of these added SSE intrinsics there exists a direct mapping with a concrete NEON-equivalent intrinsic; however, for some others, this mapping is not possible and the equivalents are implemented using several NEON intrinsics. Table 1 summarizes the NEON equivalents contributed in this work and reports the number of NEON intrinsics required to implement each SSE intrinsic. 
Performance evaluation
We next evaluate the performance of the asymmetry-aware version of the libde265 library, in terms of both FPS and EPF. To analyze the impact of the new scheduling strategy and the adoption of SIMD intrinsics, this section is divided into three parts. In the first subsection, we evaluate the asymmetry-aware mechanism that migrates threads to keep the Cortex-A15 cores busy most of the time; in the second subsection, we study the impact of the NEON intrinsics; and in the final subsection, we summarize the results and analyze all the modifications jointly.
Asymmetry- and criticality-aware scheduling
The plots in Fig. 9 display the average FPS (top) and EPF (bottom) of the asymmetry-aware scheduling with thread migration described in Sect. 4.1 (lines labeled as Affinity). For comparison purposes, the plots also include the results attained by the unmodified version of the library, as well as those of the version where the threads are manually bound to a specific core upon initialization but no migration is allowed (labeled as A15 + A7).
Regarding the FPS rate, this evaluation shows significant benefits when the threads are migrated to keep the "critical" tasks (i.e., the top CTU rows) running on the Cortex-A15 cores. With this configuration, the Cortex-A7 cores do not slow down the Cortex-A15 cores but instead contribute to improving the FPS when the number of threads exceeds the number of Cortex-A15 cores. The FPS still grows nonlinearly with the number of threads because, due to the dependencies implicit in WPP, the threads sometimes have to synchronize before they can reconstruct the assigned CTU rows; in addition, there is a certain overhead due to the difference in performance between the two types of cores.
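The effect of the WPP dependencies on the available parallelism can be illustrated with a small model. The sketch below assumes the usual WPP rule that a CTU can be processed once its left neighbor and the top-right CTU of the row above are complete, and it further assumes unit-cost CTUs and an unlimited number of threads; the function names are hypothetical. Under these idealized assumptions, the CTU at (row, col) can start no earlier than step 2*row + col.

```cpp
#include <algorithm>
#include <vector>

// Idealized WPP schedule: CTU (row, col) may start once (row, col-1) and
// (row-1, col+1) are finished, so its earliest start step is 2*row + col.
// Returns, for every step, the number of CTUs that can run concurrently.
std::vector<int> wpp_concurrency(int rows, int cols) {
    if (rows <= 0 || cols <= 0) return {};
    std::vector<int> per_step(2 * (rows - 1) + cols, 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            ++per_step[2 * r + c];
        }
    }
    return per_step;
}

// Largest number of CTUs that are ready in the same step.
int peak_concurrency(const std::vector<int>& per_step) {
    if (per_step.empty()) return 0;
    return *std::max_element(per_step.begin(), per_step.end());
}
```

As a usage example, a 1080p frame split into 64x64 CTUs (an assumed CTU size) has 17 rows and 30 columns. For this grid, `wpp_concurrency(17, 30)` spans 62 steps with a peak of 15 concurrent CTUs, and fewer than eight CTUs are ready during the first 14 steps. In this model, eight threads therefore cannot all be busy while the wavefront ramps up, which is consistent with the sublinear growth of the FPS.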
From the perspective of the EPF metric, the solution that integrates the enhancements presented in Sect. 4.1 outperforms the unmodified version of the library but, unfortunately, it is still less energy efficient than the approach that statically binds the threads to a specific core (no thread migration). The reason for this behavior is illustrated in Fig. 10 , which shows that the power dissipation grows with the number of threads. In the version enhanced with thread migration, all threads/cores are at full load (or close to it), and the power draw increases more rapidly with the number of threads. The reduction of execution time is not enough to compensate for this growth of power dissipation, resulting in a net increase of energy consumption.
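This trade-off can be made explicit with a short derivation. It assumes that EPF denotes the energy consumed per decoded frame; the symbols $\bar{P}$ (average power), $t$ (execution time), $N$ (number of decoded frames) and $E$ (total energy) are introduced here only for illustration.

```latex
\mathrm{FPS} = \frac{N}{t},
\qquad
\mathrm{EPF} = \frac{E}{N} = \frac{\bar{P}\, t}{N} = \frac{\bar{P}}{\mathrm{FPS}}
```

Hence, a modified version with average power $\bar{P}'$ and frame rate $\mathrm{FPS}'$ lowers the EPF of a baseline only if $\bar{P}'/\bar{P} < \mathrm{FPS}'/\mathrm{FPS}$, that is, only if the power grows by a smaller factor than the frame rate.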
NEON intrinsics
In principle, the use of SIMD intrinsics in general, and of NEON in the particular case of ARM cores, can be expected to increase performance. However, an important aspect to analyze is the effect of this optimization on energy consumption since, as we observed in the previous subsection, a faster execution does not necessarily imply a lower energy consumption. The plots in Fig. 11 report the average FPS and EPF (top and bottom, respectively), now including the version of the library that combines the NEON optimizations with the asymmetry-aware scheduling that enforces thread migration when necessary (labeled as Affinity + NEON). For comparison purposes, we also show the results of the most significant previous versions.
This figure reveals important benefits in terms of both FPS and EPF. The increase in the FPS rate was expected, since the positive effects of SIMD intrinsics have been widely reported in the literature for numerical applications in general and for video encoders/decoders in particular. We note that the results shown for Affinity + NEON with four or fewer threads are equivalent to those obtained when introducing the NEON intrinsics in the unmodified version of the library (Unmodified + NEON), since for these thread counts all versions are equivalent from the point of view of the scheduling strategy. Using more than four threads does not improve the FPS rate of the Unmodified + NEON version because the threads interleave their execution on the Cortex-A15 cores. Consequently, for simplicity, we do not include results for the Unmodified + NEON combination.
On the other hand, since energy consumption is the product of power dissipation and execution time, a reduction in EPF can also be expected provided that the power dissipation does not increase by a factor that exceeds the reduction of execution time. Figure 12 shows the average power dissipation for the experiments presented in Fig. 11 . There we can observe that the adoption of the NEON intrinsics yields a significant reduction of the average power dissipation compared to the asymmetry-aware version of the library without intrinsics. Together with the reduction in execution time, this explains the notable gains in energy efficiency obtained by integrating the NEON intrinsics.
Global comparison
Tables 2 and 3 report the FPS and EPF, respectively, of the evaluated implementations (averaged over all JCT-VC videos and four QPs). Moreover, these two tables quantify the differences (in %) between these implementations, with positive values reflecting an increment in the FPS/EPF and negative values a decrement. Note that all versions which spawn fewer than five threads are equivalent to the unmodified implementation, except when the NEON intrinsics are exploited.
The results in the first table show that, for this particular SoC, the unmodified version of the library is not able to deliver the standard frame rate of 24 FPS for 1080p videos, even though it leaves four Cortex-A7 cores idle that could have been leveraged for this purpose. A simple static manual binding of the threads upon initialization, which does not allow thread migration, does not increase performance; when the Cortex-A7 cores are used, the performance even decreases by between 9 and 25 % with respect to the default (Unmodified) configuration. In contrast, the FPS rates are notably increased when the threads are migrated taking into account the criticality of the top CTU rows, as well as when the NEON units are leveraged. For example, compared with the unmodified version, the thread migration policy yields an increase in performance between 11 and 21 %; the integration of NEON intrinsics (four or fewer threads in the main column labeled as Affinity + NEON) delivers between 17 and 24 % higher performance.
From the perspective of energy efficiency, in the second table we observe that all versions reduce the EPF rate of the unmodified version of the library. However, despite the improvement of the FPS rate attained via thread migration, this version is less energy efficient than the one which carries out a static mapping of threads upon initialization. On the other hand, adding the NEON intrinsics is always beneficial, delivering around 19 % and 9-24 % more energy efficiency than, respectively, the asymmetry-aware and static mapping counterparts which do not exploit the NEON SIMD engine.
Conclusions
We have proposed and evaluated an asymmetry-aware scheduling implementation of a reference HEVC decoder for the ARM big.LITTLE AMP embedded in the Exynos 5422 SoC/ODROID-XU3 board. Our solution follows the parallelization approach dictated by WPP to distribute the workload (CTU rows/tasks) among the fast (big) Cortex-A15 and the slow (LITTLE) Cortex-A7 cores on-the-fly, migrating the threads in charge of executing those tasks with higher priority to the former type of core. Moreover, the new implementation of the HEVC decoder is enhanced with NEON SIMD counterparts for all SSE intrinsics included in the reference implementation of the library.
Our results reveal excellent improvements in performance compared with the execution of the architecture-oblivious reference implementation, which only exploits the big cores and cannot attain 1080p real-time decoding.
In addition, we demonstrate that the exploitation of the Cortex-A7 cores not only enhances the overall performance, but also contributes to improving the energy efficiency of the decoding pipeline. This paper is an important step toward HD real-time HEVC decoding on low-power asymmetric multicore processors, but we believe that a number of issues can be addressed to further increase performance and energy efficiency: (1) exploit more efficiently the task-level parallelism that exists between the processing of the CTU rows and the filtering stage and (2) analyze the source of the energy efficiency gains that were obtained with the ARM NEON intrinsics.
