Abstract
This paper addresses the efficient exploitation of task-level parallelism, present in many dense linear algebra operations, from the point of view of both computational performance and energy consumption. The strategies considered here, referred to as the Slack Reduction Algorithm (SRA) and the Race-to-Idle Algorithm (RIA), adjust the operating frequency of the cores during the execution of a collection of tasks (into which many dense linear algebra algorithms can be decomposed), following very different approaches to save energy. The procedures are evaluated using an energy-aware simulator, which is in charge of scheduling/mapping the execution of these tasks to the cores, leveraging the dynamic voltage and frequency scaling featured by current technology. Experiments with this tool and the practical integration of the RIA strategy into a runtime show the energy gains for two versions of the QR factorization.
Introduction
Energy consumption is already recognized as the crucial factor that will limit the performance of future microprocessors, leading to the design and adoption of heterogeneous architectures and dark silicon [1, 2] . On the other hand, already today, large-scale HPC facilities are notable consumers of energy, which is employed to operate the computing resources as well as auxiliary systems like backup equipment, air cooling, etc. [3, 4] . Energy consumption has a direct impact on the operation and maintenance costs of these centers, compromising their existence and impairing the installation of new facilities. But the electricity cost is not the only problem; in general, energy consumption results in carbon dioxide emission, a hazard for the environment and public health, and heat, which reduces reliability of hardware components [5] .
The pressure from HPC centers, combined with that of the embedded and mobile market segments, has forced hardware manufacturers to improve the energy efficiency of their designs: CPU, memory chips, network cards and hard drives usually feature low-power modes and, in some cases, the possibility of decreasing frequency and voltage operation (DVFS or Dynamic Voltage Frequency Scaling [6] ), to trade power for performance. While being a mature subject in other segments, the development of energy-aware software for HPC applications, which optimizes both execution time and energy conservation, is still in its beginnings, in spite of the enormous assets it can yield [7] .
Recent work has demonstrated the performance improvements which can be gained by exploiting task-level parallelism in multi-core processors; see, e.g., [8] [9] [10] . Following this trend, numerical dense linear algebra libraries like libflame [11] and LAPACK [12] are being rewritten to leverage task-level parallelism for this class of architectures.
In the libflame and PLASMA [13] projects, (blocked) dense linear algebra algorithms are statically or dynamically decomposed into a collection of tasks (or kernel operations), identifying the data dependencies among them. The result is a Directed Acyclic Graph (DAG) that captures the dependency information implicit in the algorithm, and which is then passed to a scheduler in charge of issuing tasks to the computational resources. As a result, tasks are executed in the order dictated by data dependencies (data-flow parallelism) instead of the order in which they appear in the code (control-flow parallelism), which exposes a richer degree of concurrency. Unfortunately, as of today, these projects address high performance but ignore energy consumption.
In this paper we investigate how to leverage DVFS and energy-friendly processor states during the execution of dense linear algebra algorithms on state-of-the-art multicore processors. In particular, we consider a DAG that represents a collection of tasks and data dependencies among these, corresponding to the computation of a dense linear algebra operation. Our goal is to detect which tasks lie in non-critical paths, in order to adjust the frequency/voltage (DVFS) of the processor cores in charge of their execution, and thus save energy.
The main contributions of the paper include:
- We review the Slack Reduction Algorithm (SRA) [14, 15], which aims at exploiting the slacks (idle periods) existing in the DAG that represents a dense linear algebra operation by carefully tuning the execution frequency of certain tasks.
- In addition, we employ a simulator [14, 15] to validate the theoretical energy gains that can be attained with SRA as well as RIA, an alternative that takes a completely different approach to save energy. The simulator calculates a schedule of the tasks for both strategies, taking into account practical constraints like the actual number of resources (processor cores), the cost of varying processor frequency, the discrete range of frequencies, the granularity of DVFS operation (core- or socket-level), etc.
- In the simulations, we employ experimental performance and power figures for two real recent processors, an AMD Opteron 6128 (two eight-core sockets) and an Intel Xeon E5504 (two four-core sockets), to enhance the accuracy of our results.
- We analyze the energy performance of both strategies by using highly efficient blocked algorithms for two different versions of the QR factorization, an important operation for the solution of linear systems and linear least-squares problems.
- Finally, we incorporate the energy-saving techniques underlying RIA into the runtime integrated with the libflame library [11], experimentally showing the energy savings introduced by this strategy.
This paper extends the results in [14] as follows. First, we consider here the analysis of two variants of the QR factorization: the traditional slab-based procedure [16] as well as the incremental QR factorization introduced in [17] . Therefore, we replace the Cholesky factorization employed in [14] with the traditional QR factorization, an algorithm that is more challenging from the point of view of slack control, as its tasks exhibit varying sizes and costs during its execution. (The Cholesky factorization shares with the incremental QR factorization the property that, once the block size is fixed, the two algorithms can be decomposed into a reduced number of task classes, with the task in each class having a constant cost.) Second, we target the same AMD architecture that was employed in our previous work (using now 16 cores instead of 8), but also include a second architecture, from Intel, fundamentally different in that DVFS is applied at the socket level instead of at the core level. Our experimental results will show the impact of this feature on the attainable theoretical gains. Third, we introduce real experimental data for the performance and energy consumption of the computational tasks appearing in these algorithms on these two architectures. Thus, we expect that our simulation results reflect the trade-off between performance and energy more accurately than those in [14, 15] . Fourth, we validate the energy/time results by integrating the RIA strategy into the runtime system for libflame. (The work in [14, 15] only included simulation results.)
The article is organized as follows. After a brief discussion of related work in the next subsection, Sect. 3 reviews the Critical Path Method and its application to identify tasks which can be delayed without negatively affecting the total execution time of a project, represented by a collection of dependent tasks. The SRA and RIA energy-saving algorithms, introduced in Sects. 4 and 5 respectively, are followed by a brief discussion of the energy-aware simulator, in Sect. 6. Experiments are reported in Sects. 7 (simulation) and 8 (experimental); and a few concluding remarks as well as a summary of future work close the paper in Sect. 9.
Related work
A number of investigations are related to our work. In [18], the authors model a scheduler for clusters that can map tasks and adjust node frequencies, depending on the number of pending jobs. In [19, 20] the authors discuss scheduling of independent tasks (jobs) in a DVFS-enabled processor, while in [21] this technology is used to schedule tasks with dependencies in a multiprocessor setup. The authors of [22] introduce several real-time, energy-aware schedulers for tasks with dependencies. The work in [23] describes a platform that combines real-time mapping with DVFS to reduce energy usage of dependent tasks. The algorithm LPHM in [24] dynamically adjusts the execution time of noncritical tasks using DVFS.
In [25] new heuristics are proposed for an energy-aware task scheduler in a heterogeneous cluster. In [26] a strategy is employed to stretch or reduce the execution time of noncritical jobs. In [27] , the authors perform a similar investigation, but frequency is statically tuned at the beginning of the algorithm, and fixed for its complete duration. The authors of [28] also follow the same strategy, with stretch/compress stages, which are iteratively applied until the consumption of power is below a certain threshold. Finally, algorithm LPHEFT is presented in [29] as a means to reduce energy consumption, based on scheduling of idle time-slots (or gaps).
Our work differs from previous efforts in that we focus on giving practical solutions to a domain-specific problem: the solution of task-parallel blocked algorithms for dense linear algebra, on current multi-core processors.
The critical path method
The Critical Path Method (CPM) is commonly used in the field of management and project planning [30] to control the duration of the project by carefully scheduling so-called "critical" tasks (that is, those tasks which are likely to delay the project execution time in case of a late start/finish). We next discuss how to apply CPM to detect slacks in a DAG that captures the dependencies existing in one of the algorithms for the QR factorization.
Demonstration example
We illustrate the CPM using the incremental QR factorization of a matrix $A \in \mathbb{R}^{m \times n}$, which renders the decomposition $A = QR$, where $Q \in \mathbb{R}^{m \times m}$ is orthogonal and $R \in \mathbb{R}^{m \times n}$ is upper triangular. (For simplicity, hereafter we assume that $A$ is square, i.e., $m = n$.) Consider a partitioning of this matrix into blocks of size $b \times b$ and denote the $(i, j)$ block in this partitioning as $A_{ij}$. (We will also assume hereafter that $n$ is an integer multiple of the block size $b$, i.e., there exists an integer $s$ such that $n = s \cdot b$.) Algorithm 1 presents a blocked procedure to compute the incremental QR factorization of $A$, overwriting the upper triangular part of the matrix with $R$. Each operation (i.e., task) in the algorithm is annotated to the right with its theoretical cost in floating-point arithmetic operations, or flops. (Hereafter we neglect lower order terms in the evaluation of theoretical costs.)
Algorithm 1 Right-looking blocked algorithm for the incremental QR factorization
Figure 1 illustrates the tasks and dependencies obtained for the QR factorization of a blocked 3 × 3 matrix using Algorithm 1. There, "G" denotes the QR factorization, "O" the application of (orthogonal) transformations, "G2" the 2 × 1 QR factorization, and "O2" the 2 × 1 application of (orthogonal) transformations (see Algorithm 1). The name of the task is followed, after the underscore, by a number that uniquely identifies the task in the graph and which can be derived from the loop indices k, i and j of the algorithm. Finally, in parentheses, we include the execution time of the corresponding task (in milliseconds, ms) for a block size b = 256 on a single core of an AMD Opteron 6128 running at 2.00 GHz. This type of information will be used next to illustrate the energy-saving approach of SRA.
We define the slack of a task as the amount of time that it can be delayed without increasing the total execution time of the algorithm.
CPM can now be applied to the task-node graph in order to detect slacks, which yields the profile in Table 1. The application of CPM obtains the following data for each task $T_i$:
- Its cost $C_i$ (execution time).
- The earliest time at which it can start its execution, given by
$$ES_i = \max_k \left( ES_k + C_k \right),$$
for every task (node) k connected to task (node) i by an edge from task k to task i.
- The latest time at which it can finish its execution without increasing the total time of the algorithm, given by
$$LF_i = \min_k \left( LF_k - C_k \right),$$
for every node k connected to node i by an edge from i to k.
- Its slack, obtained as
$$S_i = LF_i - ES_i - C_i.$$
CPM identifies slack times, but does not provide an explicit strategy (procedure) to exploit them. In the following section, we introduce an algorithm that conveniently tailors the operating frequency of the processor cores, to tune the slack of those tasks with $S_i > 0$, yielding a lower power usage. In an ideal case where the cores can operate at an infinite (continuous) range of frequencies, and the transition time (overhead) between any two frequencies is zero, the slack could be adjusted very accurately. In a real case, processor cores run at a limited (discrete) number of frequencies, and switching between any two given frequencies is not immediate, so that the slack can only be adjusted sub-optimally. Furthermore, in some architectures (e.g., those from Intel), changes can only occur at the socket level as opposed to the CPU or core level.
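The CPM quantities discussed in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cpm`, the symbol names `es`/`lf`, and the four-task toy DAG are hypothetical and do not correspond to the tasks or timings of Fig. 1 or Table 1.

```python
from collections import defaultdict

def cpm(cost, edges):
    """Return (es, lf, slack, total) for a DAG.

    cost  : dict mapping task name -> execution time
    edges : list of (predecessor, successor) pairs
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Topological order (Kahn's algorithm).
    indeg = {t: len(preds[t]) for t in cost}
    order, ready = [], [t for t in cost if indeg[t] == 0]
    while ready:
        t = ready.pop()
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # Forward pass: earliest start of a task is the latest earliest
    # finish among its predecessors.
    es = {}
    for t in order:
        es[t] = max((es[p] + cost[p] for p in preds[t]), default=0.0)
    total = max(es[t] + cost[t] for t in cost)

    # Backward pass: latest finish of a task is the earliest latest
    # start among its successors.
    lf = {}
    for t in reversed(order):
        lf[t] = min((lf[s] - cost[s] for s in succs[t]), default=total)

    slack = {t: lf[t] - es[t] - cost[t] for t in cost}
    return es, lf, slack, total

# Hypothetical toy DAG (not the one in Fig. 1): A -> B -> D and A -> C -> D.
cost = {"A": 2.0, "B": 3.0, "C": 1.0, "D": 1.0}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
es, lf, slack, total = cpm(cost, edges)
print(slack)  # tasks on the critical path A, B, D have zero slack
```

In this toy example only task C lies off the critical path, so it is the only candidate for a lower operating frequency.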
Slack reduction algorithm
The SRA [14, 15] assigns to each task a tentative operating frequency, chosen from a discrete set, at which the task will be executed. The algorithm is preceded by an initialization stage that decomposes a given dense linear algebra algorithm into a collection of tasks, and identifies the dependencies among these, resulting in a DAG (like that in Fig. 1).
The SRA is composed of three stages, with the second and third stages being iterative procedures. To illustrate the algorithm, in the following discussion we will refer to the DAG in Fig. 1. We will also consider the discrete collection of frequencies $F_R = \{2.0, 1.5, 1.2, 1.0, 0.8\}$ (in GHz, consistent with the values available on the AMD Opteron 6128).
Frequency assignment
Initially, all tasks are assigned to be run at the maximum frequency.
Critical subpath extraction
In this step the graph is decomposed into a number of critical subpaths. First, the critical path is identified. Next, the graph edges that belong to the critical path (as well as the nodes, if they have no other edge arriving at or leaving from them) are eliminated from the graph. A new critical subpath is then extracted for this subgraph; and the process is repeated until the graph is empty. Figure 2 details the application of the procedure to the DAG contained in Fig. 1. For each iteration of the extraction procedure, we indicate the sequence of nodes in the extracted critical subpath.
Frequency tuning
This stage calculates the (recommended) operating frequency and, therefore, dictates the execution time of the tasks and the overall execution time of the algorithm. The procedure starts by processing the first critical subpath, $CP_1$ = {O_131, O2_231, O_232}, trying to annihilate the slack of the tasks embedded in this subpath (see Table 1). For this purpose, the procedure initially computes the lengths of the longest paths ($LF$) from task G_111 to the first and the last task in the subpath, O_131 and O_232, respectively. These values are 1.256 and 7.102 ms, respectively, and their difference, 5.846 ms, provides a bound on the maximum duration of the execution of the tasks in $CP_1$. Given that the execution time of the tasks in this subpath is 1.214 + 1.597 + 1.214 = 4.026 ms (column C in Table 1), this implies that there is a slack of 5.846 − 4.026 = 1.820 ms that can be distributed among O_131, O2_231 and O_232. How the slack is shared among these tasks is completely arbitrary, provided the individual slacks for these tasks (column S in Table 1) are not exceeded. In our specific implementation, the ratio 5.846/4.026 = 1.452 recommends a rough increase of 45 % in the execution time of the tasks in the subpath.
The procedure is then repeated for CP 2 , CP 3 and so on, yielding a recommended execution time for each one of the tasks of the DAG; see Fig. 3 .
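The frequency tuning step for one critical subpath can be sketched as follows. The stretch ratio is computed as described above; the mapping from that ratio to a discrete frequency is an assumption made purely for illustration (execution time is taken to scale inversely with frequency, whereas the simulations in this paper rely on measured task times). The function names are hypothetical.

```python
FREQS_GHZ = [2.0, 1.5, 1.2, 1.0, 0.8]  # discrete range F_R used in the text

def stretch_ratio(window_ms, costs_ms):
    """Ratio between the time window available to a subpath and the
    time its tasks need at the maximum frequency."""
    return window_ms / sum(costs_ms)

def pick_frequency(ratio, freqs=FREQS_GHZ):
    """Lowest discrete frequency whose slowdown does not exceed `ratio`.

    ASSUMPTION (not from the paper): running at frequency f multiplies
    the execution time by f_max / f.
    """
    ratio = max(ratio, 1.0)          # never ask for a speed-up
    f_max = max(freqs)
    feasible = [f for f in freqs if f_max / f <= ratio]
    return min(feasible)

# Values quoted in the text for CP_1 (window 5.846 ms, task costs from Table 1).
ratio = stretch_ratio(5.846, [1.214, 1.597, 1.214])
freq = pick_frequency(ratio)
print(round(ratio, 3), freq)
```

Under the inverse-scaling assumption, the next lower frequency (1.2 GHz) would stretch the tasks by a factor of about 1.67, which exceeds the available ratio, so the sketch settles on 1.5 GHz and leaves part of the slack unused. This is one way to see why a discrete frequency range only allows a sub-optimal adjustment.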
The race-to-idle algorithm
The alternative strategy to save energy, RIA, leverages the fact that current processors are quite efficient at reducing power when idle. Therefore, it may be more convenient to complete the execution of tasks as soon as possible, so as to "enjoy" longer periods of inactivity. In other words, RIA executes the tasks of the algorithm at the highest possible frequency, while reducing frequency to the lowest possible during idle periods. This strategy can be combined with the use of energy-friendly processor states.
This strategy requires no processing of the DAG that reflects the tasks and dependencies of a blocked algorithm, as all tasks are executed at the highest frequency, while during the idle periods the CPU frequency is reduced as much as possible. Whether this is advantageous depends on the specific trade-off between performance and energy of the computational kernels, the existence and length of idle periods in the dense linear algebra algorithms, and the overhead incurred to change the operating frequency, which is basically dictated by the target hardware platform.
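The trade-off just described can be made concrete with a very small energy model for a single core over a fixed time window. All power and time figures below are hypothetical, chosen only to show that the outcome depends on the idle power; they are not measurements from this paper, and the cost of frequency changes is ignored.

```python
def window_energy(t_busy, p_busy, p_idle, window):
    """Energy (J) of one core over `window` seconds: busy for `t_busy`
    seconds at `p_busy` watts, then idle at `p_idle` watts."""
    if t_busy > window:
        raise ValueError("task does not fit in the window")
    return p_busy * t_busy + p_idle * (window - t_busy)

WINDOW = 1.5  # s: task time at the highest frequency plus its slack (hypothetical)

# Race-to-idle: run fast (1.0 s at 20 W), then idle for the rest of the window.
race_high_idle = window_energy(1.0, 20.0, p_idle=5.0, window=WINDOW)
race_low_idle = window_energy(1.0, 20.0, p_idle=1.0, window=WINDOW)

# Slack reduction: run slower (1.5 s at 14 W) so that no idle time is left.
slow = window_energy(1.5, 14.0, p_idle=5.0, window=WINDOW)

print(race_high_idle, race_low_idle, slow)
```

With these invented figures, stretching the task wins when the idle power is high, while racing to idle wins once the idle power is low enough.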
Simulator
In order to evaluate the performance of the strategies, we employ a flexible, energy-aware simulator [14, 15] which uses the information obtained with SRA and RIA to produce a schedule for a particular target architecture. It also records all frequency variations that occur during the execution, and displays statistics on energy savings in terms of the percentage of time that each computing resource has operated at a given frequency. Therefore, this tool can help to analyze the theoretical performance and energy savings produced by the application of SRA and RIA in different scenarios: DAGs associated with different task-based algorithms, platform setups, excess ratios, frequency ranges, etc. In general, the static (a priori) schedule produced by the simulator is not applicable in practice, as even tiny variations during task execution may render it useless. However, it serves as a demonstrator of the benefits that an energy-aware DVFS-based strategy can yield for the execution of dense linear algebra kernels.
In the following, we describe the possibilities and properties of the simulator in more detail.
Input parameters
The simulator receives the following inputs:
- A DAG capturing tasks and dependencies implicit in the blocked algorithm as well as the frequencies recommended by SRA/RIA to execute the tasks (in the case of SRA, the frequencies are precisely those obtained from the application of that procedure; in RIA, all tasks are to be run at the highest frequency).
- A simple description of the target architecture that specifies the number of sockets (or physical processors) and the number of cores per socket. To mimic current technology from AMD (PowerNow!) and Intel (SpeedStep), the simulator can adjust the frequency at core/socket level, respectively. Furthermore, core frequency cannot be changed if there is a task running on it at the moment.
- A discrete range of core frequencies, $F_R$.
- A collection of real power consumption values for each combination of frequency and {idle/busy} state per core. To obtain these values in our simulations, we executed a benchmark that varied the frequency/load of the cores, trying all combinations.
- The cost (overhead) incurred to perform a frequency variation.
Algorithm 2 Right-looking blocked algorithm for the QR factorization
Scheduler
As a starting point, we have chosen a static priority list scheduler [31, 32]. The reason for this is twofold. First, the approximate duration of the tasks is known in advance (as, in general, is the case for dense linear algebra). Second, the execution of tasks that lie on the critical path must be prioritized. Thus, tasks to be executed at a higher frequency are assigned a higher priority. Among the tasks which have to be run at the same frequency, those which belong to a critical subpath ($CP_i$) with smaller index ($i$) are sorted first. Consider the execution of a task $T_i$ with recommended frequency $f_i$ (set by either the SRA- or RIA-based strategy). The scheduling algorithm maps execution of $T_i$ to the first idle core that satisfies one of the following conditions, checking them in the order they appear next:
1. The core socket is operating at frequency $f_i$.
2. The core socket is varying its operating frequency to $f_o = f_i$. (The task will commence execution when the change is completed.)
3. The core socket is operating at a frequency $f_o > f_i$.
4. The core socket is varying its operating frequency to $f_o > f_i$.
5. All cores in the same socket are idle. If the socket is operating at a frequency $f_o \neq f_i$, a change of frequency to $f_i$ is requested. The socket is reserved so that $T_i$ will be the first task that will run on it when the change is completed.
If none of the above conditions is satisfied, the task remains in the pending queue, waiting for variations in the system. This strategy ensures that the execution of a task takes no longer than dictated by SRA. For that purpose, the schedule guarantees that the task is executed on a core running at the desired frequency or, if that is not possible, at a higher one. For RIA all tasks are run at the highest frequency.
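The five mapping conditions above can be sketched as follows. This is a simplified illustration and not the simulator's code: the `Socket` class and `map_task` function are hypothetical, DVFS is modelled at the socket level only, and the sketch assumes one particular reading of the rule, namely that each condition is tried over all sockets before moving on to the next condition.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Socket:
    freq: float                                     # current frequency (GHz)
    target: Optional[float] = None                  # frequency being switched to, if any
    busy: List[bool] = field(default_factory=list)  # one busy flag per core

def map_task(f_i, sockets):
    """Return (socket index, core index, condition number) for a task with
    recommended frequency f_i, or None if the task must stay pending."""
    conditions = (
        lambda s: s.target is None and s.freq == f_i,    # 1. running at f_i
        lambda s: s.target == f_i,                       # 2. switching to f_i
        lambda s: s.target is None and s.freq > f_i,     # 3. running above f_i
        lambda s: s.target is not None and s.target > f_i,  # 4. switching above f_i
        lambda s: s.target is None and not any(s.busy),  # 5. whole socket idle
    )
    for number, cond in enumerate(conditions, start=1):
        for si, sock in enumerate(sockets):
            if not cond(sock):
                continue
            for ci, is_busy in enumerate(sock.busy):
                if not is_busy:
                    if number == 5 and sock.freq != f_i:
                        sock.target = f_i  # request the change and reserve the socket
                    return si, ci, number
    return None

# Hypothetical two-socket platform: one idle socket at 0.8 GHz, one busy at 2.0 GHz.
platform = [Socket(0.8, busy=[False, False]), Socket(2.0, busy=[True, False])]
print(map_task(1.5, platform))
```

In the usage line, the task is placed on the idle core of the faster socket through condition 3, so it runs at a frequency higher than recommended rather than waiting for a frequency change.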
Simulation results
We next evaluate the performance of SRA and RIA combined with the energy-aware simulator/scheduler using the two blocked algorithms for the QR factorization.
Benchmark algorithms
In our experiments, we analyze the right-looking blocked algorithmic variants for the incremental QR factorization (Algorithm 1) and the traditional (slab-based) QR factorization (Algorithm 2, in Fig. 4 ). These two algorithms represent the state-of-the-art to attain high performance in the solution of linear systems and linear least-squares problems on current multi-core processors [33] . The traditional QR factorization of a matrix A proceeds in panels of b columns (slabs), as illustrated by Algorithm 2. Each operation in the algorithm is annotated to the right with its theoretical cost, which depends on the loop index k. Figure 4 shows the DAG obtained when this algorithm is applied to a blocked 3 × 3 matrix A. There, "G" represents the QR factorization and "O" the application of the orthogonal transformations.
Environment setup
We emulate two target platforms in our experiments. The first one consists of two AMD Opteron 6128 processors (2 sockets with 8 cores for a total of 16 cores), with the range of frequencies $F_R = \{2.00, 1.50, 1.20, 1.00, 0.80\}$ GHz. The second architecture consists of two Intel Xeon E5504 processors (2 sockets with 4 cores for a total of 8 cores), with the range of frequencies $F_R = \{2.00, 1.87, 1.73, 1.60\}$ GHz. Tables 2 and 3 report the latencies incurred to change between any two frequencies in these two platforms, obtained experimentally. Table 4 lists the voltage/frequency pairs for these architectures.
Our experiments consider a variety of (square) matrix dimensions ranging from 384 to 2,944 with the block size b = 128. This size was experimentally determined to be close to the optimal for most kernels involved in these factorizations. For the incremental QR factorization, we measured the execution time of the four different types of tasks ("G", "O", "G2" and "O2") running on a single core of an AMD Opteron 6128 (8 cores)/Intel Xeon E5504 (4 cores) at each possible frequency. For the traditional QR factorization, the execution time of tasks G and O depends on the iteration. To estimate the execution time in this case, we evaluate the time of 1 flop for each type of task involved in the algorithm and, from the theoretical cost of the task, we then determine its execution time. Using this data, we apply SRA to adjust the execution frequency of the tasks.
To obtain an approximation of the power required by the computational tasks, we evaluated the DGEMM kernel (from Intel MKL 11.0) executed at all possible combinations of frequencies on 1, 2, 3, . . . , p cores, with p = 16 on the AMD and p = 8 on the Intel. Thus, we measure the power using p cores, simultaneously running p copies of the kernel at frequencies f 1 , f 2 , . . . , f p , possibly different, while all remaining cores are idle, at the lowest frequency if SRA/RIA are in place or at the highest frequency when no energy-saving strategy is applied. Power measures were obtained using an internal DC powermeter. This is a microcontroller-based design operating at 25 Hz (25 samples per second) directly attached to the lines connecting the power supply unit with the motherboard (chipset plus processors).
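Given a trace of power samples such as those delivered by the powermeter, the energy of a run can be approximated by summing the samples and multiplying by the sampling period. The rectangle-rule integration and the function name below are assumptions made for illustration; the text does not state how the samples are integrated.

```python
SAMPLE_RATE_HZ = 25  # sampling rate of the powermeter, as stated in the text

def energy_joules(power_samples_w, rate_hz=SAMPLE_RATE_HZ):
    """Approximate energy (J) from equally spaced power samples (W)
    using the rectangle rule: each sample covers 1 / rate_hz seconds."""
    return sum(power_samples_w) / rate_hz

# Hypothetical trace: 2 seconds (50 samples) at a constant 100 W.
trace = [100.0] * 50
print(energy_joules(trace))
```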
Metrics
In order to assess the benefits of the proposed solution we employ the following metrics:
- The Impact of SRA/RIA on time ($IT_{SRA/RIA}$) measures the ratio between the execution time of the algorithm operating at the frequencies dictated by SRA/RIA and that obtained when all tasks are run at the highest frequency. This ratio is obtained by dividing the execution times of both variants:
$$IT_{SRA/RIA} = \frac{T_{SRA/RIA}}{T_{exec}},$$
where $T_{SRA/RIA}$ is the execution time under the corresponding strategy and $T_{exec}$ the execution time at the highest frequency. Ideally, this ratio should be 1. In RIA, the overhead due to the frequency changes may render a ratio higher than 1. In SRA, a certain overhead due to frequency changes may appear as well, but when this occurs, it is mostly due to the algorithm being oblivious to the real number of available resources (cores). A resource-aware implementation of SRA would solve this issue and is part of future work.
- The Impact of SRA/RIA on consumption ($IC_{SRA/RIA}$) measures the ratio between the energy consumption of the algorithm operating at the frequencies dictated by SRA/RIA and that obtained when running all tasks at the highest frequency. This ratio is obtained by dividing the energy consumption of both variants:
$$IC_{SRA/RIA} = \frac{C_{SRA/RIA}}{C_{exec}},$$
where $C_{SRA/RIA}$ and $C_{exec}$ are the corresponding energy consumptions. Ideally, this ratio should be close to 0. A longer execution time and the overhead introduced by frequency changes may yield an energy ratio higher than 1.
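The two metrics can be computed directly from a pair of runs, as in the following sketch. The function names and the sample figures are hypothetical and do not correspond to any result reported in this paper.

```python
def impact_on_time(t_strategy, t_max_freq):
    """IT: execution time with SRA/RIA over execution time at the highest frequency."""
    return t_strategy / t_max_freq

def impact_on_consumption(e_strategy, e_max_freq):
    """IC: energy with SRA/RIA over energy at the highest frequency."""
    return e_strategy / e_max_freq

def saving_percent(ratio):
    """Relative saving (%) implied by a ratio; negative values mean a loss."""
    return (1.0 - ratio) * 100.0

# Hypothetical run: 10.2 s / 800 J with the strategy versus 10.0 s / 1000 J without.
it = impact_on_time(10.2, 10.0)
ic = impact_on_consumption(800.0, 1000.0)
print(round(it, 2), round(ic, 2), round(saving_percent(ic), 1))
```

In this invented example the strategy costs 2 % in time and saves 20 % in energy.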
The traditional QR factorization
Figures 5 and 6 report the results for the traditional QR factorization on the two platforms. The usage of SRA produces an increase in the execution time for the largest problem sizes (from n = 2,304 on the AMD and n = 1,536 on the Intel) due to the current resource-unaware implementation. On the other hand, RIA maintains the execution time for all problem dimensions on both platforms, demonstrating that the overhead due to frequency changes is negligible compared with the cost (i.e., time) of the individual tasks (at least for this block size). From the point of view of energy, SRA outperforms RIA as long as there is no increase in the execution time (small to moderate problem sizes). Comparing both architectures, it is important to emphasize the benefits of the flexibility of setting the frequency per core on the AMD platform. Thus, while the energy savings estimated for this platform reach up to 20 % (SRA for n = 1,792), the reductions for the Intel platform are much more modest, with a peak around 5 % (SRA for n = 640).
The incremental QR factorization
Due to the cost of the simulation and the higher complexity of the DAG associated with the algorithm for the incremental QR factorization, in this case we could only evaluate the impact of SRA for problems of dimension n up to 1,408. Figures 7 and 8 report the results on both platforms. In general, the behaviour of the execution time for SRA is similar to that already observed for the previous algorithm, growing with the problem size. The benefits of SRA on energy consumption are evident for the AMD platform, especially for the smallest problem sizes, while this strategy only yields low energy savings for the two smallest problem sizes on the Intel architecture. With the exception of one particular case (n = 896, AMD platform), RIA keeps the ratio of execution time close to 1. The energy savings yielded by RIA are always competitive with or superior to those of SRA on AMD, as well as on Intel when n ≥ 896. In general, the poor results obtained on the Intel platform with either of the two energy-saving strategies are due to the granularity of DVFS, which limits frequency changes to the socket level.
Experimental results
We next describe the practical results of the integration of the RIA strategy with the SuperMatrix runtime that underlies the libflame library. SuperMatrix is an execution framework for dense linear algebra which automatically decomposes an operation into tasks, identifies dependencies among these, and dynamically schedules them for execution on the cores of a parallel platform (i.e., at run time) without direct intervention from the programmer. For this purpose, SuperMatrix proceeds in two stages. During the initial stage, a symbolic execution of the code produces a directed acyclic graph (DAG) containing all tasks and dependencies. This information then dictates the feasible orderings in which tasks can be executed during the subsequent dispatch stage. To monitor progress, the SuperMatrix implementation utilizes a pending list which contains those tasks to be run but which depend on tasks not yet executed. At the beginning of this second stage, all tasks except for that corresponding to the factorization of the first panel are in the pending list. From this structure, a task is moved into the ready list when all its dependencies are fulfilled. Initially this list contains only the factorization of the first panel. Idle threads (one per core) continuously check the ready list for work (busy-wait or polling). When a thread acquires a task, it runs the corresponding job on the associated core and, upon completion, checks the tasks which were in the pending list, moving them to the ready list in case all their dependencies are now satisfied. Details on the operation and implementation of the SuperMatrix runtime can be found in [33].
Accommodating RIA into SuperMatrix
The aim of this technique is to replace the polling state of "inactive" threads by an energy-friendly, blocking one. Whether these theoretical savings yield an actual gain will depend, however, on the existence/length of idle periods during the execution of the algorithm, and the overhead of blocking/activating a thread. In our implementation we employ POSIX semaphores to control the active threads. Now, when a thread polling for a new job from the ready list receives a negative answer (there is no task ready for execution at the moment), it blocks itself (with the system call sem_wait()). When a thread completes the execution of a task, it updates the dependencies of the tasks in the pending list; besides, in case this implies moving k tasks from the pending list to the ready list, this thread will also ensure that there exist k − 1 other active threads (using the system call sem_post() to activate other threads, if necessary). This simple mechanism ensures that there is basically one active thread per task in the ready list and, key to energy conservation, that no continuous polling is being done on an empty list, thus enabling the exploitation of energy-friendly processor states. This technique is also combined with the application of DVFS to reduce the frequency of the cores associated with inactive threads.
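The blocking mechanism can be illustrated with a small sketch. It is not the SuperMatrix implementation: it is written in Python with `threading.Semaphore` standing in for POSIX `sem_wait()`/`sem_post()`, it uses a simplified variant in which the semaphore counts the tasks in the ready list (one wake-up per newly ready task), and it does not model DVFS. The function `run_blocking` and the toy dependency graph are hypothetical.

```python
import threading
from collections import deque

def run_blocking(tasks, deps, n_threads=4):
    """Run `tasks` (name -> callable) respecting `deps` (name -> set of
    prerequisite names, assumed acyclic). Idle workers block on a
    semaphore instead of polling the ready list."""
    if not tasks:
        return []
    waiting = {t: set(deps.get(t, ())) for t in tasks}  # unmet prerequisites
    ready = deque(t for t in tasks if not waiting[t])
    pending = {t for t in tasks if waiting[t]}
    order, lock = [], threading.Lock()
    work = threading.Semaphore(len(ready))   # one permit per ready task

    def worker():
        while True:
            work.acquire()                   # blocks while nothing is ready
            with lock:
                if not ready:                # permit was issued for shutdown
                    return
                name = ready.popleft()
            tasks[name]()                    # execute the task outside the lock
            newly_ready = 0
            with lock:
                order.append(name)
                for t in list(pending):      # update dependencies
                    waiting[t].discard(name)
                    if not waiting[t]:
                        pending.remove(t)
                        ready.append(t)
                        newly_ready += 1
                finished = len(order) == len(tasks)
            for _ in range(newly_ready):
                work.release()               # wake one thread per new ready task
            if finished:
                for _ in range(n_threads):
                    work.release()           # let every worker terminate

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return order

# Hypothetical four-task graph: a precedes b and c, which both precede d.
noop = lambda: None
order = run_blocking({"a": noop, "b": noop, "c": noop, "d": noop},
                     {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}})
print(order)
```

While task a runs, the other three workers sleep inside `acquire()` rather than spinning on an empty ready list, which is the property that allows the operating system to place their cores in a low-power state.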
Environment setup
All experiments reported in this section were obtained using IEEE double-precision arithmetic on a 16-core AMD 6128 platform (two 8-core sockets, 2.0 GHz) with 24 Gbytes of RAM. The system runs a Linux Ubuntu 10.04 distribution. The highly tuned implementations of BLAS and LAPACK were those in MKL 10.2.4. A version of the SuperMatrix runtime in libflame version 5.0-r6719 was developed to leverage the RIA energy-saving technique. Execution times/power measurements correspond to those of routines FLASH_QR_UT for the traditional QR factorization and FLASH_QR_UT_inc for the incremental QR factorization from this library, linked to the original and energy-aware implementations of the runtime. Our evaluation includes a variety of (square) matrix dimensions ranging from 1,024 to 12,288 and the block size b = 128. This block dimension was close to optimal for most kernels involved in these factorizations.
Power was measured using the internal DC powermeter described earlier.
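As an illustration of how a power trace can be turned into the relative energy figures discussed in the results, the following minimal sketch integrates equally spaced power samples with the trapezoidal rule and computes a percentage saving. The function names and the fixed sampling period dt are assumptions made for this example; they do not describe the interface of the powermeter used in the experiments.

```python
def energy_joules(samples, dt):
    """Trapezoidal integration of power samples (watts) taken every dt seconds."""
    if len(samples) < 2:
        return 0.0
    return dt * (sum(samples) - 0.5 * (samples[0] + samples[-1]))

def relative_saving(e_base, e_new):
    """Energy saving of e_new with respect to e_base, as a percentage."""
    return 100.0 * (e_base - e_new) / e_base
```

Note that a runtime which draws less power but runs longer may still consume more energy, which is why both execution time and power enter the comparison.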
Results
The following experiment measures the actual gains that can be attained by our energy-aware approaches, evaluating the real savings introduced by the RIA strategy. To analyze the impact of this technique, we compare the original SuperMatrix runtime with the one that uses semaphores to block idle threads, thereby avoiding polling, and reduces the operating frequency during idle periods. Figure 9 illustrates the effect of the energy-saving strategy on both the execution time and the energy consumption. The graphs on the left-hand side report relative performance results with respect to the original runtime, while the ones on the right-hand side correspond to relative results of energy consumption. The results demonstrate that the RIA strategy introduces a minimal overhead in execution time, due to the period required to wake up blocked threads. For the traditional QR factorization, there is no appreciable impact on execution time due to the application of the energy-saving variant. On the other hand, the effect on energy efficiency is significant. For the smallest problem sizes, the traditional QR attains savings of up to 17 %, and the savings tend to stabilize at 20 % for the largest sizes.
For the incremental QR factorization, there is an increase in execution time of at most 10 % for the smallest problem sizes while, for the largest problem dimensions, there is no appreciable difference between the two versions of the SuperMatrix runtime. On the other hand, for this algorithm there is no energy-saving effect: the parallelism offered by this algorithm leaves negligible periods of thread inactivity, which removes the opportunities to save energy.
At this point, it is worth pointing out the differences between the estimated energy savings calculated with the simulator and the practical savings attained when RIA is integrated into SuperMatrix. A careful investigation of this issue revealed that the scheduler in libflame does not properly prioritize the execution of tasks on the critical path, resulting in periods where a significant portion of the cores are idle. It is precisely by leveraging these idle periods that the integration of the RIA strategy within SuperMatrix yields the observed energy savings, which are higher than those predicted by the simulator (which assumed a perfect scheduling of critical tasks).
Conclusions and future work
In this paper, we provide evidence that it is possible to improve energy consumption during the execution of dense linear algebra algorithms while, in some cases, maintaining their performance. Following the current trend for multicore parallelization (adopted, e.g., in libraries libflame and PLASMA), our algorithms exploit task-level parallelism, considering that dense linear algebra operations are partitioned into a number of tasks with dependencies among them. Our energy-conservation strategies, SRA and RIA, start from the DAG representing the operation and are based on two key observations. First, if all the tasks run at full speed, idle times appear during their execution. Second, present processors include efficient mechanisms to dynamically adjust frequency/voltage (DVFS) and hence the energy consumed.
The SRA is inspired by concepts and methods of project planning theory. Specifically, we first apply CPM to determine the individual slack of each task, and then employ SRA to conveniently slow down the execution frequency of the appropriate tasks, while potentially maintaining the global execution time. RIA, on the other hand, pursues energy conservation from the opposite direction; specifically, this strategy generates periods of inactivity during the execution of the DAG by executing all tasks at the highest frequency, and relies on the energy savings attained via a reduction of the operating frequency during these idle periods. In the end, both alternatives leverage the trade-off between energy and performance. In the paper, the results from these techniques are then fed to a simulator, which is used to produce a feasible schedule of the tasks as well as to tune their execution frequencies to a particular target architecture, assessing the benefits that can be obtained for a given operation.
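The slack computation that SRA starts from can be sketched with the textbook critical path method: a forward pass yields the earliest start of each task, a backward pass yields its latest finish, and the slack is the gap between the two once the task duration is subtracted. The sketch below is a generic CPM illustration, not the SRA implementation; the function cpm_slack and its dictionary-based inputs are hypothetical, and it assumes an unlimited number of cores, as classic CPM does.

```python
from collections import deque

def cpm_slack(duration, deps):
    """Return (slack per task, makespan) for a task DAG.

    duration: task -> execution time at the highest frequency.
    deps:     task -> iterable of predecessor tasks (may omit source tasks).
    """
    succ = {t: [] for t in duration}
    indeg = {t: len(deps.get(t, ())) for t in duration}
    for t, preds in deps.items():
        for p in preds:
            succ[p].append(t)

    # Topological order (Kahn's algorithm).
    queue = deque(t for t in duration if indeg[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    if len(order) != len(duration):
        raise ValueError("the dependency graph contains a cycle")

    # Forward pass: earliest start of each task.
    es = {}
    for t in order:
        es[t] = max((es[p] + duration[p] for p in deps.get(t, ())), default=0)
    makespan = max((es[t] + duration[t] for t in order), default=0)

    # Backward pass: latest finish that does not delay the whole DAG.
    lf = {}
    for t in reversed(order):
        lf[t] = min((lf[s] - duration[s] for s in succ[t]), default=makespan)

    slack = {t: lf[t] - duration[t] - es[t] for t in order}
    return slack, makespan
```

For a diamond-shaped DAG with durations a = 2, b = 3, c = 1, d = 2, where b and c depend on a and d depends on both, only c has non-zero slack (2 time units); under a slack-reduction policy it is the natural candidate to run at a lower frequency, while the tasks on the critical path keep zero slack.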
We have evaluated these two energy-control policies using two algorithms for the QR factorization which are representative of many other high-performance Level 3 BLAS-based dense linear algebra operations. The results of this experimental analysis under realistic conditions show a reduction in energy consumption under certain conditions, and provide some interesting insights. First, they show the superior performance of the RIA policy over the SRA one from the point of view of execution time for problems of large dimension. Second, SRA, which is a more elaborate strategy than RIA, can potentially deliver higher energy savings than RIA under certain circumstances. Third, the study illustrates the importance of selecting the appropriate energy-saving policy, depending on the algorithm, problem dimension and target architecture. Finally, the results demonstrate the impact on the energy savings of flexible hardware that allows DVFS frequencies to be set at the core level (instead of the socket level).
We plan to address several open questions as part of future work. First, scheduling heterogeneous tasks with dependencies among them in environments with a limited number of resources is known to be an NP-hard problem; therefore, efficient new heuristics, tuned for the particular conditions of our problem, can have a considerable impact on the results. Second, our frequency variation strategy is static, deciding the task frequencies in advance; this should be changed into a dynamic policy, which operates at run-time, adapting to variations in the conditions. Third, our ultimate goal is to integrate the results from this research with a practical run-time scheduler for dense linear algebra operations.
Enrique S. Quintana-Ortí received his bachelor and Ph.D. degrees in Computer Sciences from the Universidad Politecnica de Valencia (Spain) in 1992 and 1996. Currently he is professor in Computer Architecture in the Universidad Jaume I of Castellon (Spain). He has published more than 100 papers in international conferences and journals, and has contributed to software libraries like SLICOT and libflame. His research interests include parallel programming, linear algebra, power consumption, as well as advanced architectures and hardware accelerators.
