Abstract-Reducing energy consumption is a challenge that is faced on a daily basis by teams from the high-performance computing as well as the embedded domains. This issue is mostly approached from a hardware perspective by devising architectures that put energy efficiency as a primary target, often at the cost of processing power. Lately, computing platforms have become more and more heterogeneous, but exploiting these additional capabilities is so complex from the application developer's perspective that their optimization is often limited.
In this paper we present a transparent, on-the-fly optimization scheme that allows a generic application to automatically and dynamically exploit the available computing units to partition its computational load. We have called our approach Heterogeneous Platform Accelerator (HPA). The idea is to use profiling to select a computing-intensive candidate for acceleration, and then distribute the computations to the different units by off-loading blocks of code to them. This is done automatically at run-time, thus requiring no effort from the developer and adapting to the current input data and load.
Using an NVIDIA Jetson TK1 board, we validate our proposal on several benchmarks and on a real-world software package, the Unix text editor ed. The results we achieve substantiate our claim that HPA results not only in faster processing, but also in a considerable reduction in energy dissipation.
Index Terms-Energy efficiency; automatic optimization; JIT; LLVM
I. INTRODUCTION
The energy consumption problem is one of the major factors limiting more powerful devices, especially embedded ones, which would otherwise run out of battery very quickly [25], as well as one of the major sources of expenses (and pollution) for big data centers [18], [31]. Proposed solutions range from accepting a performance reduction in return for longer battery life, as in the Intel Atom processor [3], to radical relocations of large data centers to cold regions [30].
In parallel with this drift towards greener forms of computing, the last few years have seen a general trend towards heterogeneous computing platforms, mostly stemming from the acknowledgement of the limits that physics and technology impose on the pursuit of ever faster devices [32], [10], [29]. While having multiple, more energy-efficient accelerators on the same board (or even the same chip) should have led to increased performance and reduced power dissipation (for instance, by using highly-parallel units for highly-parallel tasks), reality shows that the former target is only attained for very specific software customized for the accelerator, while the latter is rarely achieved due to the presence of largely unused resources that nonetheless often leak power. The problem this time lies on the software side: writing software for an inhomogeneous, ever-changing set of targets is difficult, expensive, and requires skilled and motivated developers and constant maintenance. This contrasts with the very nature of software, which usually evolves at a slow pace [12]. As an example, Android developers are accustomed to not knowing on which platform the software they are writing is going to be executed. Indeed, most of the optimizations in Java, C#, Swift, and other widespread programming languages are left to run-time, when the bytecode is first executed and information about the target is available, resulting in an optimized mapping to the underlying platform. However, this strategy reduces the scope of the optimizations available to the developer, forcing him to produce generic code that is unlikely to fully exploit the hardware capabilities. To make software capable of dealing with a heterogeneous environment and broader improvement opportunities, a reasonable approach seems to be automation [20], [21], [4], [6].
In this paper we present a language-agnostic optimization system, called Heterogeneous Platform Accelerator (HPA), that automatically detects and delegates computing-intensive tasks to a dedicated accelerator, taking as an example of such an accelerator the GPU chip available on an NVIDIA Jetson TK1 board. Our system builds on the Low Level Virtual Machine (LLVM)'s Just-In-Time (JIT) compiler MCJIT [9], [7], [8]: executed code chunks are periodically analyzed by the perf_event [36] performance monitor and, once a particular function is classified as computing-intensive, it is run through the Polly tool [33] to investigate whether it is parallelizable or not. If this is the case, the function is off-loaded to the GPU. The big advantage of this approach is that everything is totally transparent to the developer: he does not have to know on which platform his software will be deployed, as the potential accelerators will be discovered at run-time, nor does he have to add any pragmas or adopt a particular library to allow the optimization. We call our proposal "opportunistic" since no thorough mathematical analysis is performed prior to off-loading the code: the system sees the optimization opportunity and tries to take advantage of it. Nonetheless, should the performance analyzer detect that the GPU is slower than the CPU for a given task, or that it is desirable to leave the GPU to another function that would make better use of it, the system can revert its decision and recall the currently off-loaded code. The only limitation of the strategy we propose is the amount of intelligence and previous knowledge we are willing to put in the optimization scheme. To the best of our knowledge, no other approach proposed thus far exhibits the above-mentioned traits.
The paper is organized as follows: we first review previous work in the next section, focusing on the limitations imposed by current approaches. We then delve into the details of our proposed approach before presenting our experimental results on several different tasks. In particular, we first tackle a set of standard toy problems from a widespread benchmark suite, we then analyse an image processing task in detail, we validate our proposal on three algorithms routinely used in many fields, and we conclude by examining the behavior of HPA on a real-world software package, the Unix text editor ed included in the Busybox 1 tool suite. We finally draw our conclusions and outline future research directions.
II. RELATED WORK
In the last few years a remarkable increase in the interest in heterogeneous platforms has been observed; GPUs, by their nature and capabilities, have been at the center of this small revolution. Whilst a large number of proposals have focused, with good results, on devising a language layer that could ease the development on such platforms [1], [17], [13], relatively few implementations that automatically perform a multi-language, multi-target optimization have appeared.
An interesting approach, called BAAR, is presented in [20], [21]. The code to be executed is first statically analyzed using Polly [33], a state-of-the-art polyhedral optimizer for automatic parallelization. If Polly detects that a function deserves being optimized, its code is compiled with Intel's compiler and off-loaded to an Intel Xeon Phi board. All data transfers are dealt with by a software layer built on the Message Passing Interface (MPI). While captivating results are given, the major drawback of this approach is that it lacks workload adaptation: the analysis is performed at application startup, and it does not account for changes in the execution context. The functions to be off-loaded are selected according to a metric that accounts for the number and type of operations to be performed, but this is only a rough indicator of the relative weight of the function in the context of the whole program's execution. As a consequence, this strategy is not reactive with respect to user input, and thus could miss some optimization opportunities; consider, for instance, the multiplication of a matrix whose size is user-specified.
Another proposal, named StarPU [5], provides an API and a pragma-based environment that, coupled with a run-time scheduler for heterogeneous hardware, composes a complete solution. While the main focus of the project is CPU+GPU systems, it can be extended to less standard systems. However, the developer is requested to learn a new API and foresee which parts of the code should be optimized. This analysis is dependent on both the input and the available resources and thus, despite the additional effort required of the programmers, will not always lead to improved results.
In the context of CPU+GPU systems, an approach that also investigates the energy consumption is presented in [6]: it introduces an automatic approach that maps, at run-time, the computations to the available computation units. The programmer is exposed to an API that allows him to write parallelisable code, leaving to the framework the task of generating the corresponding TBB and CUDA sources and adaptively distributing the computations. Interesting energy savings and computation speedups are recorded, but this comes at the expense of forcing the developer to learn a new API and, again, it is limited to the CPU+GPU scenario.
When a manually parallelized implementation is available (CUDA, OpenCL, OpenMP, . . . ), several optimization opportunities appear. For instance, [2] shows how an OpenCL application can be split over a CPU and a GPU maximizing the power-performance tradeoff. In particular, for a given application, models of the CPU and GPU executions are used to compute the expected efficiency and the effect on dynamic voltage-frequency scaling. Although in HPA we assume that we have no OpenCL code available, we could still devise a splitting strategy aimed at maximizing a predefined objective function, and thus obtain results akin to those in [2] . This, however, goes beyond the scope of this paper, and is left as future work.
An approach which is not restricted to a single target type is SoSOC [27]: it consists of a library that presents a friendly interface to the programmer and allows functions to be dispatched to a set of targets based on either the developer's wishes or some statistics computed during early runs. This dynamic aspect represents an improvement with respect to other proposals, such as [28], where this mapping is performed at installation time, or StarPU. However, not only does the developer have to learn (yet) another library, but someone also has to provide handcrafted code for any specialized unit of interest. This is a considerable waste of time and resources, and limits the applicability of the system to the restricted subset of architectures directly supported by the development team. A similar approach, though more focused on learning the mapping from the computations to the computation units, is presented in [24]. Another related framework, which leverages Domain-Specific Languages (DSL), is given in [16]. It features a dynamic run-time that provides automated targeting to heterogeneous computing units, but this capability requires the framework developers to write both the DSL specifications and the corresponding mapping to the underlying hardware.
VPE [4] represents an evolution of SoSOC: its focus is on transparency, which means that the developer does not even have to be aware that his code is going to be accelerated. Starting by executing the code in an LLVM JIT-based framework, VPE detects which functions are computing-intensive and off-loads them to a remote unit. The code deployed is the very same code that is executed on the CPU, therefore no effort is required to develop custom implementations. Results on a board based on the TI-DM3730 chip, which features a C64+ DSP processor, show gains up to 32× in performance. Nonetheless, the implementation is heavily customized for the examined board, since LLVM does not provide a back-end for this platform, and therefore a set of scripts has to compile the functions ahead-of-time using the TI proprietary compiler. When a candidate function is found, VPE off-loads its operations to the DSP by executing the previously-compiled code on the data that have been allocated in a shared memory region.
In this paper we adopt an approach closely related to [4], but with some relevant differences. Besides the different target family (a GPU) and the availability of a back-end that allows us to compile the portions of code of interest on-the-fly, we put a strong bias towards energy efficiency, investigating how the optimization impacts power dissipation. Moreover, we introduce an additional step in which computing-intensive functions are analyzed to detect the possibility of parallelizing them: indeed, it would be of little value to off-load sequential code to the 192-core GPU of the Jetson TK1 board. We quantitatively investigate how the optimizations allowed by the use of a JIT compiler in place of ahead-of-time compilation impact, in the context of our optimization scheme, both processing speed and power dissipation, as well as the overhead introduced by our framework. Finally, we perform our tests not only on a set of toy algorithms from a standard benchmark and an image processing demo, as in [4], but also on numerical algorithms used on a daily basis, such as the Black-Scholes algorithm [11], and on a complete, widely adopted software package, the Unix text editor ed.
Fig. 1. [...] analyzed. Then, functions are identified and callers are created for all the potential candidates for off-loading. After the JIT-compilation step, the code is executed and profiling information is acquired. Once HPA detects that a given function is worth off-loading to the GPU, it transfers the required data and loads (or generates, if this is the first time the function is invoked) the PTX code that will be executed on the accelerator.
III. PROPOSED APPROACH
The fundamental building block of our proposal is LLVM's JIT compiler. LLVM is an alternative to the widely known GCC compiler with a neater separation between the front-end, the optimization, and the back-end steps [7]. This separation is largely due to the adoption of an enriched assembly language, called Intermediate Representation (IR), that acts as a shared language between the different stages [9]. LLVM's community is very active, and recently a new JIT engine, called ORC, has been proposed. The previous engine, called MCJIT, indeed presents a serious issue: it has been designed to operate on modules (that is, aggregates of functions), and once a module has been finalized (a mandatory step for execution), it is not possible to modify its code anymore. ORC solves this problem by allowing on-the-fly changes; however, it is currently under development and available for the x86_64 architecture only. For this reason, we have adopted the old JIT interface and implemented the transition across accelerators using a caller mechanism akin to that of [4]. In particular, we analyze at application startup which functions are available in the code (we automatically detect and exclude all system calls and I/O-based functions from our analysis, as we cannot optimize them with our approach), and we replace function invocations with a caller that, when the function is not off-loaded, simply executes the desired function on the CPU via a function pointer. Once a function is selected for off-loading, we alter the function pointer to make it point to the function ready to be executed on the GPU. Please note that this caller overhead will be removed once ORC is released for a broader selection of platforms, as we have already incorporated the new JIT interface in our framework.
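The caller mechanism described above can be illustrated with a minimal Python sketch. It is not the HPA implementation, which operates on LLVM IR and native function pointers; the class name `Caller`, its methods, and the two `dot_*` functions are hypothetical and only show the indirection that makes off-loading and reverting a single reference update.

```python
class Caller:
    """Hypothetical stand-in for a caller: one level of indirection."""

    def __init__(self, cpu_impl):
        self._cpu_impl = cpu_impl
        self._target = cpu_impl  # plays the role of the function pointer

    def __call__(self, *args):
        # Every invocation goes through the current target.
        return self._target(*args)

    def offload(self, accel_impl):
        # Off-loading only swaps the reference; call sites are unchanged.
        self._target = accel_impl

    def revert(self):
        # Reverting restores the original CPU implementation.
        self._target = self._cpu_impl

    @property
    def offloaded(self):
        return self._target is not self._cpu_impl


def dot_cpu(a, b):
    return sum(x * y for x, y in zip(a, b))


def dot_accel(a, b):
    # Placeholder for an accelerator version; here it runs on the CPU too.
    return sum(x * y for x, y in zip(a, b))


dot = Caller(dot_cpu)
```

Because call sites only ever see `dot`, the decision to off-load or revert takes effect at the next invocation, which mirrors the behavior the text attributes to the function pointer update.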
Detecting whether a function deserves off-loading to the GPU or not is a two-step process: at first, functions that perform heavy computations are identified by the perf_event [36] performance monitor. perf_event collects very detailed statistics about software and hardware counters, and permits us to easily identify performance bottlenecks. In the context of this paper we rely on the CPU usage alone to identify functions which might benefit from being off-loaded, but many more optimizations could be devised. For instance, inspecting memory access patterns could suggest an improved layout for a data structure that reduces cache misses [34]. After having obtained a sorted list of candidate functions for acceleration, we inspect each of them sequentially, checking with Polly [33] whether it is parallelizable or not. Polly operates at the IR level and starts by translating the code to optimize to a polyhedral representation [35]. It then identifies the Static Control Parts (SCoPs) [20] of the code matching a specific canonical form, and through a set of LLVM passes it performs the parallelism detection and optimization. In particular, it places OpenMP [19] pragmas around parallelizable blocks of code.
We detect those pragmas and use them to guide our choices. For the sake of simplicity, we operate at function level, off-loading entire functions to the accelerator. Our approach is, however, independent of the chosen granularity, so we could as well operate at the basic-block level [9], which could be interesting, for instance, for a multi-threaded function.
Once we have selected a function to off-load to the GPU, we generate on-the-fly, using LLVM's back-end, the PTX code [26] that is sent to the GPU. Data are transferred to an accessible memory region using the dedicated CUDA API, and the execution is started. The results are finally transferred back once the computation is finished. Although CUDA 6.5 offers a Unified Memory scheme [23], we experimentally verified that its performance is considerably worse than that of individual cudaMemcpy() operations. We have therefore opted for automatically inserting the required cudaMemcpy() instructions and taking the transfer time into account in our measurements. Please note that, once memory sharing becomes a viable option, the performance of HPA will improve further.
The architecture of HPA is depicted by Fig. 1: the input code undergoes an analysis step, during which functions are identified and callers for the potential candidates are created. The code is then executed by the JIT framework, and performance is monitored to identify optimization opportunities. Once a function deemed to be worth off-loading is detected, the corresponding PTX code is loaded (or generated by LLVM's back-end if this is the first function invocation) and the required data is transferred to the target device. The function is, at this point, ready for execution, and its subsequent invocations will run on the remote target unless the system detects that it is under-performing and thus chooses to revert its off-loading decision to leave the device available to other code blocks.
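The two-step selection described above (rank by CPU usage, then keep only parallelizable functions) can be sketched as follows. This is an illustrative sketch only: the function name `select_candidates`, the 10% threshold, and the `is_parallelizable` predicate (a stand-in for the Polly check) are assumptions, not part of HPA.

```python
def select_candidates(cpu_share, is_parallelizable, threshold=0.10):
    """Return off-loading candidates, hottest first.

    cpu_share: dict mapping function name -> fraction of CPU time (0..1),
               standing in for profiler output.
    is_parallelizable: predicate standing in for the parallelism analysis.
    threshold: assumed cut-off below which a function is not considered hot.
    """
    # Step 1: keep the computing-intensive functions, sorted by CPU usage.
    hot = [(name, share) for name, share in cpu_share.items() if share >= threshold]
    hot.sort(key=lambda item: item[1], reverse=True)
    # Step 2: inspect them in order and keep only the parallelizable ones.
    return [name for name, _ in hot if is_parallelizable(name)]


# Hypothetical profile of a small application.
profile = {"conv2d": 0.62, "running_sum": 0.20, "checksum": 0.15, "parse_args": 0.01}
parallel = {"conv2d", "checksum"}
candidates = select_candidates(profile, lambda name: name in parallel)
```

Keeping the profiling step and the parallelism check separate means the more expensive analysis is only run on functions that are already known to dominate the execution time.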
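The "load, or generate on first invocation" behavior mentioned above amounts to a small cache keyed by function. The sketch below is a simplified illustration under that assumption; `KernelCache` and `fake_codegen` are hypothetical names, and the generated string is a placeholder rather than real PTX.

```python
class KernelCache:
    """Generate device code on first use, then reuse it (hypothetical helper)."""

    def __init__(self, generate):
        self._generate = generate  # stand-in for the back-end code generator
        self._kernels = {}
        self.generated = 0         # how many times the generator actually ran

    def get(self, function_name):
        if function_name not in self._kernels:
            # First invocation: pay the code generation cost once.
            self._kernels[function_name] = self._generate(function_name)
            self.generated += 1
        return self._kernels[function_name]


def fake_codegen(function_name):
    # Placeholder text only; real PTX would come from the LLVM back-end.
    return f"// device code for {function_name}"


cache = KernelCache(fake_codegen)
first = cache.get("convolve")
second = cache.get("convolve")
```

With this arrangement, a function that is off-loaded, reverted, and later off-loaded again would not need its device code regenerated.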
IV. RESULTS
We validate our proposal using an NVIDIA Jetson TK1 board, which features a 4+1 ARM Cortex-A15 32-bit processor, running at 1GHz, with a 192-core GK20A (Kepler) 852MHz GPU. This GPU, with respect to its predecessors, has a particular focus on energy efficiency. The installed Linux distribution is Ubuntu 14.04 with a patched kernel distributed in the L4T (Linux for Tegra) package. We have set the CPU power governor to "performance" to guarantee the CPU the best possible performance. All the code we use has been compiled using Clang with strong optimization turned on (-O3), and we directly compare against statically-compiled code. Tests performed with GCC showed no significant difference with respect to Clang, therefore they will not be reported in this paper.
To estimate power consumption, we attached a shunt resistor to the ground power line and performed current measurements using a Tektronix MSO5104B Mixed Signal Oscilloscope. We recorded the voltage over extensive periods and then computed the average and standard deviation values for each explored case.
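As a rough illustration of how shunt measurements translate into power figures, the sketch below applies Ohm's law to each voltage sample and then summarizes the resulting power values. The resistor value, the supply voltage, and the sample data are assumptions made for the example (the text does not report them), and the small voltage drop across the shunt is neglected.

```python
from statistics import mean, pstdev


def power_stats(shunt_voltages, r_shunt_ohm, v_supply):
    """Average and standard deviation of the power drawn, in watts.

    shunt_voltages: voltage samples measured across the shunt (V).
    r_shunt_ohm:    shunt resistance (ohm), an assumed value here.
    v_supply:       board supply voltage (V), an assumed value here.
    """
    # I = V_shunt / R for each sample, then P = V_supply * I.
    powers = [v_supply * (v / r_shunt_ohm) for v in shunt_voltages]
    return mean(powers), pstdev(powers)


# Hypothetical samples: 50-70 mV across a 0.1 ohm shunt on a 12 V supply.
avg_w, std_w = power_stats([0.05, 0.06, 0.07], r_shunt_ohm=0.1, v_supply=12.0)
```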
A. Validation on CLBG algorithms
We started by performing experiments with a toy test set similar to the one used in [4]: we considered a set of four algorithms from the Computer Language Benchmarks Game (CLBG) 2, namely 2D convolution with a square kernel matrix (CONVOLUTION), multiplication of two square matrices (MAT.MULT), Mandelbrot set generation (MANDELBROT), and search for a nucleotide pattern in an input DNA sequence (PRN.MATCH). Contrary to [4], given the nature of the adopted accelerator, we put no limitations on the use of floating point numbers. To provide a fair comparison, we fixed the amount of data to process and manually parallelized the CPU implementation of the algorithms to fully exploit the available cores. Table I details the time required for the execution of each test in the two situations we consider, that is, classical CPU execution and execution on the CPU+GPU in the HPA framework. Executing on the GPU, even if the code is automatically generated and not hand-crafted for it, results in higher performance in all but the MAT.MULT case. This latter case could be due to a non-GPU-friendly implementation of the standard algorithm, and exposes the major weakness of our approach: as the code we execute on the accelerator was initially conceived for a stand-alone CPU system, we have little hope of achieving the performance of a carefully engineered CUDA algorithm. In our opinion, however, this drawback is largely offset by the complete transparency we offer to the developer. If the system detects a lower-than-expected performance, it can at any time revert its choices and obtain as worst-case scenario the same performance the original code was supposed to achieve. This decision reversal consists again in a pointer value update, which takes effect the next time the function is invoked. The overall cost of a wrong dispatching decision amounts to the slowdown exhibited by the target platform multiplied by the number of function calls that occurred before the low performance is detected, plus the overhead imposed by the data transfer. If we consider the power dissipation we see that, in our experiments, HPA is more energy efficient in all the cases, even when it takes longer to get to the final result, as is the case for MAT.MULT, thanks to dynamic voltage-frequency scaling. We could therefore easily integrate an external policy that dictates which option to favor the most (best performance, best energy efficiency, or a combination of the two) and adapt the algorithm's behavior according to the run-time measurements.
Fig. 2. [...] MAT.MULT algorithm as a function of the matrix size. Although the gap stretches as bigger and bigger matrices are considered, CPU-only execution is still faster than using the CPU+GPU setup. However, if we consider the power dissipation (center) and the fraction of dissipated energy with respect to classical CPU execution (bottom), we see that the execution in the context of HPA is considerably more efficient even when it takes more time to complete the computations.
Fig. 3. Structure of the three filters used in the image processing demo. We have chosen them as they present a different degree of sparsity, and we wanted to investigate the capabilities of a JIT-based framework to exploit this information when optimizing the computations.
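The cost of a wrong dispatching decision, as described in the text (per-call slowdown times the number of calls before detection, plus the data transfer overhead), can be written down directly. The sketch below is one possible reading: it takes the slowdown as the per-call time difference and the transfer overhead as a single total, both of which are interpretive assumptions; the function name and the numbers are hypothetical.

```python
def wrong_dispatch_cost(t_cpu, t_accel, calls_before_detection, transfer_overhead):
    """Extra time spent because a function was off-loaded by mistake.

    t_cpu, t_accel:         per-call execution times on CPU and accelerator.
    calls_before_detection: calls made before the slowdown was noticed.
    transfer_overhead:      total time spent moving data (assumed a lump sum).
    All times must use the same unit.
    """
    slowdown_per_call = t_accel - t_cpu
    return slowdown_per_call * calls_before_detection + transfer_overhead


# Hypothetical figures in milliseconds: 2.0 ms on CPU, 3.5 ms on the
# accelerator, detected after 4 calls, with 1.0 ms of transfers.
cost_ms = wrong_dispatch_cost(2.0, 3.5, 4, 1.0)
```

The expression makes the bound explicit: the penalty grows linearly with the detection latency, so a more frequent performance check directly limits the worst case.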
We have further investigated the MAT.MULT experiment to see the behavior of the system as a function of matrix size. Fig. 2 details the execution time, the power consumption, and the fraction of energy used by the GPU with respect to classical CPU execution for a large set of matrix sizes. We can see that it is more convenient to operate on the CPU for matrices smaller than 200 × 200, as was already observed in [6]. Indeed, the data transfer time and the overhead introduced by the performance analyzer exceed the computation time by several orders of magnitude, and therefore it can be up to 30× more expensive, in energetic terms, to perform these multiplications on the GPU. For bigger sizes, despite the slightly higher execution time, the reduced energy dissipation makes the HPA alternative interesting.
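Why a slower run can still be the more efficient one follows from energy being the product of average power and execution time. The sketch below computes the energy fraction relative to CPU execution and applies a simple selection policy of the kind mentioned earlier; the function names, the policy labels, and all numeric values are hypothetical and do not come from the measurements in this paper.

```python
def energy_fraction(t_hpa, p_hpa, t_cpu, p_cpu):
    """Energy used under HPA as a fraction of CPU-only energy (E = P * t)."""
    return (t_hpa * p_hpa) / (t_cpu * p_cpu)


def pick_target(t_hpa, p_hpa, t_cpu, p_cpu, policy="energy"):
    """Choose where to run according to an assumed external policy."""
    if policy == "time":
        return "hpa" if t_hpa < t_cpu else "cpu"
    if policy == "energy":
        return "hpa" if energy_fraction(t_hpa, p_hpa, t_cpu, p_cpu) < 1.0 else "cpu"
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical case: HPA is 20% slower but draws half the power.
fraction = energy_fraction(t_hpa=1.2, p_hpa=4.0, t_cpu=1.0, p_cpu=8.0)
```

In this made-up case the energy fraction is about 0.6, so an energy-oriented policy would keep the function off-loaded while a time-oriented policy would revert it.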
B. Image processing demo
To further explore the capabilities of our solution, we developed an image-processing demo where different 2D filters are applied on the frames of a 1280 × 720 MPEG-4-encoded video. The image decompression and visualization tasks are performed using the Tegra version of the OpenCV library to avoid saturating the CPU with operations unrelated to the image-processing task itself, though this slightly penalizes HPA as it reduces the GPU resources available for the convolutions. In our experiment, we JIT-compiled the module including the Constant Propagation optimization and performed our measurements using three filters (emboss, Sobel y-derivative, and sharpen) that have a different structure, and in particular different sparsity, as depicted by Fig. 3. In the comparison, we also considered the convolution operation performed by using the Tegra-OpenCV library: this version of the convolution is indeed optimized for the accelerator, and we expected it to outperform the HPA version. Our results are listed in Tab. II. Unsurprisingly, operating on CPU+GPU resulted in significantly faster computations and thus, given a lower power dissipation, a considerable reduction in the required energy with respect to CPU execution.
Fig. 4. PTX code generated by the LLVM back-end for the generic convolution code. The generic convolution case has to perform multiple load, add, and multiply-and-add operations, since it cannot make any hypothesis on the filter structure.
The comparison with the Tegra-OpenCV implementation of the convolution operator is interesting: we can see that the frame rate of the hand-optimized convolution operator is lower than the one achieved by HPA for both the emboss and the Sobel filter. This is due to the JIT compiler performing some small optimizations, and to the generic nature of the OpenCV version of the convolution, which prevents it from adopting architecture-dependent shortcuts.
Recent studies in the compiler field, such as those presented in [22], suggest that the margin between Ahead-Of-Time and Just-In-Time compilation can be narrowed down even in the general case by cleverly manipulating the execution flow and allowing for speculation. As an additional test, we wanted instead to ascertain whether JIT compilation could become a clear advantage for user inputs exhibiting a particular structure, since the code can adapt on-the-fly to the input data. We have observed that, in the sharpening filter case, the OpenCV version can partially benefit from the presence of several zero entries in the filter matrix, but it still cannot outperform the HPA results when Constant Propagation is used in the JIT compilation. The latter can indeed take advantage of the presence of elements with opposite sign, such as 1 and −1, and exploit this information to further speed up computations. An example of these optimizations can be seen by comparing Fig. 4, which shows the generic convolution code that is normally executed by HPA, with Fig. 5, depicting the code produced by the JIT's Constant Propagation in the sharpening filter's case. While the generic convolution code has to perform operations for each element of the convolution kernel, its optimized version can take advantage of the kernel's structure by skipping all the operations on zero entries, significantly reducing the total number of operations performed. These results support our claim that higher efficiency, in both computational and energetic terms, can be achieved by casting a problem in the HPA-JIT framework.
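The effect of specializing the convolution on a known kernel can be mimicked at a high level in Python. This sketch is only an analogy for what constant propagation achieves in the JIT: once the kernel values are known, zero entries are dropped and entries equal to 1 or −1 become plain additions or subtractions. The function names are hypothetical, and the sharpening kernel used here is a commonly used one, assumed for illustration rather than taken from Fig. 3.

```python
def convolve_generic(image, kernel, y, x):
    """Apply the kernel at pixel (y, x) with no knowledge of its values.

    No kernel flip is performed; for the symmetric kernel below this
    makes no difference.
    """
    half = len(kernel) // 2
    acc = 0
    for dy, row in enumerate(kernel):
        for dx, k in enumerate(row):
            acc += k * image[y + dy - half][x + dx - half]
    return acc


def specialize(kernel):
    """Build a per-kernel routine that skips zeros and avoids trivial multiplies."""
    half = len(kernel) // 2
    taps = [(dy - half, dx - half, k)
            for dy, row in enumerate(kernel)
            for dx, k in enumerate(row) if k != 0]

    def convolve_at(image, y, x):
        acc = 0
        for oy, ox, k in taps:
            pixel = image[y + oy][x + ox]
            if k == 1:
                acc += pixel       # multiplication by 1 removed
            elif k == -1:
                acc -= pixel       # multiplication by -1 becomes a subtraction
            else:
                acc += k * pixel
        return acc

    return convolve_at, len(taps)


# A commonly used 3x3 sharpening kernel (assumed for this example).
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
sharpen_at, tap_count = specialize(SHARPEN)
```

For this kernel, only five of the nine entries survive specialization, and four of those need no multiplication at all, which is the kind of saving the text attributes to the specialized PTX code.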
C. Validation on numerical algorithms
We validated our findings on a set of numerical algorithms routinely used in many fields. In particular, we considered the following three tasks 3 :
• Black-Scholes algorithm [11], used to give an estimate of the pricing of European financial products;
• Spectral norm computation [14] on a 4096 × 4096-element matrix;
• Jacobi iterative method [15], a classical algorithm used to solve linear systems of equations, on a 128 × 128-element matrix.
The results of our experiments are depicted by Fig. 6. Again, we can observe that the execution in the HPA framework allows for a significant reduction in execution time (up to almost 20×), almost halving at the same time the power dissipation.
D. Optimizing a real-world software: ed
While the results presented above demonstrate that HPA is well suited to optimize specific algorithms, the interest in these algorithms themselves has pushed several people to write custom implementations for several different platforms. As we have shown in the image processing demo, we have very little chance of performing any better than a hand-made implementation for a given target. However, many software packages currently in use, especially in the industrial domain, are stuck with their single-core, CPU-only implementation, while few others can benefit from a multi-threaded (but still CPU-only) implementation. In these situations the transparency aspect of HPA pays a large dividend with respect to the other approaches we have introduced in the Related Work section: we can, indeed, take a commonly-used software package and simply run it in the HPA framework, thus taking advantage of the increased performance and the reduced power dissipation with no effort on our side. To substantiate our claim, we considered the ed text editor, available on any POSIX-compliant operating system, and we have run it in the HPA framework. As a task, we considered the search for a word in a 122MB file. ed performs a sequential, line-by-line search for the desired word, stopping at its first occurrence and reporting its position. The sought word has been put in several different positions in the text file to simulate a real setting, and we averaged our measurements to provide a fair evaluation. The results we have achieved are presented in Tab. III. Again, we can see that executing the code in the HPA framework gives a clear advantage in terms of both execution time and power dissipation.
3 The source code of these applications can be downloaded here: http://reds-data.heig-vd.ch/publications/optim 2016/src apps.tar.gz
E. Overhead analysis
We finally analyzed the overhead introduced by our framework. In particular, we computed the timestamp difference between the beginning of the JIT compilation and the moment of the execution of the JIT-ed function. This allowed us to compute the time it took to perform the code changes necessary for the execution in the HPA framework (namely, the creation of the callers), as well as the time required for the JIT-compilation of the resulting code. The measured values are given in Tab. IV. It can be seen that these values are limited (below 75ms), making them negligible for multiple invocations of the considered functions. We have then estimated the perf_event overhead by comparing the execution times of HPA with and without statistics collection. From our measurements, perf_event led to an overhead of about 1.8%, which again is negligible if we consider the considerable speed-up given by the accelerator to the overall execution.
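To make concrete why a one-time overhead becomes negligible over multiple invocations, the sketch below computes how many calls are needed before the per-call saving has paid for it. This break-even calculation is an illustration added here, not an analysis from the paper; the function name and the per-call times are hypothetical, and only the 75 ms bound comes from the text.

```python
import math


def break_even_calls(one_time_overhead_ms, t_cpu_ms, t_hpa_ms):
    """Number of calls after which the one-time overhead is recovered.

    Returns None when the accelerated version is not faster per call,
    in which case the overhead is never recovered in terms of time.
    """
    saving_per_call = t_cpu_ms - t_hpa_ms
    if saving_per_call <= 0:
        return None
    return math.ceil(one_time_overhead_ms / saving_per_call)


# Hypothetical per-call times: 10 ms on the CPU, 5 ms in HPA,
# with the 75 ms upper bound on the overhead reported above.
calls_needed = break_even_calls(75, 10, 5)
```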
V. CONCLUSION
In this paper we presented an opportunistic strategy to increase energy efficiency by automatically off-loading computationally intensive fragments of easily-parallelizable code to a GPU accelerator. As a side effect, we obtained a significant increase in the overall performance, as not only are these computations performed faster, but the main CPU load is also relieved and thus it can accomplish further tasks while waiting for the computations to terminate. We supported our claims with thorough experiments on several numerical algorithms, an image processing task, and a POSIX-compliant text editor.
As future work we will focus on defining other optimization strategies and on investigating platforms with a higher degree of heterogeneity, choosing at run-time the target that is expected to give the highest energy efficiency or to best fit a set of user-defined policies.
