Abstract: Computing has recently reached an inflection point with the introduction of multicore processors. On-chip thread-level parallelism is doubling approximately every other year. Concurrency lends itself naturally to allowing a program to trade performance for power savings by regulating the number of active cores; however, in several domains, users are unwilling to sacrifice performance to save power. We present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications and a runtime system that uses live program analysis to optimize applications dynamically. We describe a dynamic phase-aware performance prediction model that combines multivariate regression techniques with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. Using our model, we develop a prediction-driven phase-aware runtime optimization scheme that throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each program phase. The use of prediction reduces the overhead of searching the optimization space while achieving near-optimal performance and power savings. A thorough evaluation of our approach shows a reduction in power consumption of 10.8 percent, simultaneous with an improvement in performance of 17.9 percent, resulting in energy savings of 26.7 percent.
INTRODUCTION
Microprocessors crossed an inflection point with the introduction of multicore architectures. Clock rates and instruction-level parallelism have been replaced by the number of execution cores as the key metric that characterizes the performance and drives the marketability of a computer system. Moore's law is now interpreted as "the number of cores on a microprocessor is expected to double every one to two years," and hardware vendors race to pack more cores on a single chip [23], [28].
In the new landscape of highly parallel microprocessors and system architectures, system software appears to be largely unprepared for the transition. The programming effort required for parallelizing and optimizing code practically remains an unresolved issue, even among research communities that have been investigating this problem for decades. At the same time, power dissipation is now a major consideration for system software optimization on parallel architectures [12], [13], [14], [15]. The introduction of many simple cores on a microprocessor has been largely motivated by the poor power efficiency of microarchitectural components that attempt to improve performance at the cost of hardware complexity and reliability [4]. Concurrency not only improves power efficiency but also helps system software steer power and performance simultaneously. The conventional wisdom holds that when concurrency is increased, performance is improved but with an associated increase in power consumption. Conversely, when concurrency is decreased, power consumption is reduced at a cost to performance.
Although there are many situations where it is desirable to trade performance for reduced power consumption, in the domain of high-performance scientific computing, performance remains the primary target. Applications written for high-end computing systems create a challenge for energy-aware system software, which needs to identify opportunities to reduce power consumption with a nonnegative impact on performance. For example, dynamic voltage and frequency scaling (DVFS) is a well-known technique for reducing the dynamic power consumption of a microprocessor in applications with extensive idle time. In well-tuned heavily optimized scientific applications, reduced idle periods and memory latencies may limit the degree to which DVFS can be exploited for energy savings. On the other hand, there are certain cases where inherent program characteristics (such as limited algorithmic concurrency, fine computational granularity, and frequent synchronization) and architectural properties (such as capacity limitations of shared resources) limit the scalability and the maximum degree of exploitable concurrency in an application, resulting in an observed performance loss through the use of more parallelism. In these cases, power and performance can be simultaneously improved by throttling concurrency.
To motivate the work presented in this paper, Fig. 1 shows a breakdown of the parallel execution time of three applications from the NAS Benchmarks Suite [20] into phases. The breakdowns were obtained during the execution of the benchmarks on a quad-processor server with Intel Xeon processors using the Hyperthreading technology. Each chart depicts the (processors, Hyperthreads/processor) configuration that minimizes the execution time of each phase. The fastest configuration is identified experimentally by executing each target phase in all possible hardware configurations of the system. LU-HP-B, SP-A, and MG-B execute optimally with at least one Hyperthread per processor deactivated, thus saving power while simultaneously improving performance, during 95 percent, 84 percent, and 81 percent of their parallel execution times, respectively. LU-HP-B and SP-A execute with at least one entire processor deactivated during 40 percent or more of the optimal execution time.
Despite its appeal, concurrency throttling is an opportunity that may present itself to varying degrees across programs, across phases of the same program, or even across inputs to the same program. Identifying concurrency throttling opportunities statically is hard, because it requires fine-grained analysis of the dynamic behavior of parallel code across and within parallel execution phases. Aside from the problem of identifying and quantifying the opportunities, applying concurrency throttling directly in applications requires exposing the programmer to architectural details such as the number and physical layout of processors, which is widely considered one of the factors that make parallel programming exceptionally difficult [11]. Given the complexity and the inherent drawbacks of delegating concurrency throttling decisions to the user or to a static analysis tool, runtime systems appear to be ideal candidates for the identification and exploitation of concurrency throttling opportunities.
This paper presents the Adaptive Concurrency Throttling Optimization Runtime (ACTOR) system, which seeks the optimal operating point of concurrency in multithreaded programs at the granularity of program phases. In contrast to concurrency throttling schemes based on live empirical search of operating points, ACTOR relies on a novel dynamic phase-aware performance prediction (DPAPP) model. The model predicts the optimal operating point of concurrency on different configurations of processors, cores, and threads, hereafter referred to simply as hardware configurations, through the statistical analysis of hardware event rates. To the best of our knowledge, our methodology is the first to provide a performance prediction of changing concurrency levels and thread placement to an application at runtime. The key contribution of the DPAPP model is that it enables drastic reduction of the overhead associated with searching the optimization space for concurrency throttling.
We use a multivariate regression process for selecting critical hardware events and for training the DPAPP model in assessing the scalability of a program phase across different hardware configurations. The DPAPP training process derives distinct predictors for thread-level, core-level, and processor-level parallelism to account for the presence of multidimensional parallelism and variance in the impact of resource sharing between threads within and across chip boundaries. We use the DPAPP model to steer our runtime concurrency throttler, which succeeds in identifying phases where power consumption can be reduced while sustaining or improving performance. ACTOR operates by controlling the execution of the application, with the first few iterations of the dominant phases of the application executed under specific hardware configurations while selected hardware event counters (HECs) are sampled. After the sampling period, DPAPP is invoked to predict phase performance across configurations, and the remaining traversals of each phase are executed with the predicted optimal configuration. We demonstrate the effectiveness of ACTOR by using the NAS Parallel Benchmark suite on a multiprocessor with multiple SMT processors.
The rest of this paper is organized as follows: In Section 2, we discuss background and related work. Section 3 introduces our model for DPAPP of parallel applications. Section 4 presents our control scheme for dynamic power-aware and performance-aware concurrency adaptation of multithreaded codes. We present a detailed discussion of our experimental methodology and results in Section 5. We conclude this paper in Section 6.
RELATED WORK
Substantial previous research has been performed on optimizing the execution of programs using feedback from HECs; however, it has been predominantly offline and profile guided, for example, NUMA multiprocessor page placement using hardware assistance [32], CPO from IBM, which includes the management of variable page-size systems [5], and case studies of specific applications [2]. In contrast, little work has been done on runtime optimization utilizing hardware counters as the program executes. Existing examples include HEC-based SMT job schedulers [34] and the ADORE runtime optimization system [31]. Our work falls into the category of online dynamic optimization with feedback from hardware counters; however, it targets energy consumption in addition to performance.

Fig. 1. Breakdown of the parallel execution time of the three applications from the NAS Benchmarks Suite on a four-processor server with Intel Hyperthreading processors. Each phase is represented with a gray rectangle. The length of the phase and the hardware configuration (number of processors, number of Hyperthreads/processor) that minimize the execution time of the phase correspond to the width and height of each rectangle, respectively.
Performance prediction of parallel programs has been studied in great depth; however, the majority of research is targeted at offline prediction. Work most similar to ours includes offline research on partial execution-based prediction [40] and statistical simulation of superscalar processors using IPC predictions based on very short code samples [10]. Minimizing the design space evaluation time for processor development has spurred much research on predicting the performance effects of altering various microarchitectural parameters, including regression-based [29] and machine-learning-based approaches [18], [21]. One important distinction with previous work is that once we perform training, the model can be applied to any desired applications, whereas many other approaches perform training and prediction for a single application [18], [21], [29]. To the best of our knowledge, no prior work has considered online predictors of parallel execution performance on shared-memory architectures, using runtime input on IPC and HECs.
High-performance power-aware computing has recently become an important topic of research. Efforts range from power-scalable and power-efficient clusters [12], [13] to runtime systems providing support for dynamic frequency and voltage scaling for parallel applications [14], [25]. Our work is most closely related to the latter, as both attempt to identify opportunities at runtime to achieve power savings without sacrificing performance. Our work differs in that we target shared-memory, rather than distributed-memory, multiprocessors. It should be pointed out that DVFS and concurrency throttling are not necessarily at odds with each other, as they may be applied in a synergistic fashion to achieve still greater energy efficiency [30].
Concurrency throttling has been previously applied for the optimization of multithreaded codes on shared-memory multiprocessors. Specifically, concurrency throttling can enable adaptive execution in multiprogramming environments [1], [38], [41]. Furthermore, stand-alone programs can benefit from concurrency throttling across different phases with potentially different execution and scalability characteristics [17], [42]. In most cases, concurrency throttling is applied in a given phase by the programmer, the runtime system, the hardware, or the compiler. Balasubramonian et al. [3] have considered hardware-based approaches to balance communication and parallelism by throttling the use of clusters on clustered microprocessors. Compiler-based control is generally performed using a simple threshold-based strategy, and the parallel code region is either sequentialized or run with a programmer-specified fixed number of threads [17], [22], [39]. Programmers have long had the ability to manually specify concurrency levels; however, few runtime systems provide the functionality to autonomically manage these decisions from within. Our work provides such a system, offering fully autonomic concurrency throttling based on the performance predictions of each configuration.
Recent work has considered applying concurrency throttling and DVFS on single-chip multiprocessors, with decisions made by search algorithms over the configuration space [30]. This research shares many motivations with our work; however, the suggested solutions to the problem differ significantly. First, we do not explore the potential of DVFS but rather introduce a solution that works on architectures independent of their support for DVFS. Second, our approach is implemented on a real system rather than a simulated one, verifying that our technique works in practice, with all overheads considered. Third, we use performance prediction, rather than empirical search of the configuration space, to reduce the number of test executions necessary to perform adaptation. Furthermore, we show that the overhead of search-based techniques hinders the performance of short-lived codes, particularly when compared to prediction. Additionally, our approach targets multiprocessor systems where the combined energy consumption of the processors plays a much larger role than in uniprocessor systems such as that evaluated in [30].
Springer et al. [37] propose an approach to identify the number of nodes in a cluster and DVFS level to meet a user-specified energy budget. The authors target clusters, where application scalability is considerably better than on SMPs, and thus, they do not attempt to improve performance through adaptation. On the other hand, we exploit poor scalability on SMPs to improve both the power and the performance of an application simultaneously. Additionally, their approach requires multiple offline executions of the target application, whereas we perform all application-specific analysis with minimal overhead during live executions. However, the two approaches could be applied together to determine the optimal concurrency to use per node.
DYNAMIC PHASE-AWARE PERFORMANCE PREDICTION
The goal of DPAPP is to predict the performance of a multithreaded computationally intensive region of code in a program, which we hereafter refer to as a phase, across varying configurations of the processing units on a parallel architecture [6]. We use the term processing units as an umbrella term covering hardware threads, processor cores, or entire processors. As a base hardware substrate, we consider shared-memory multiprocessors with three distinct types of processing units, namely, multicore processors, cores within processors, and threads within cores. We refer to each of these types of processing units as a dimension of parallelism in the system. The dimensions of parallelism that we consider are representative of current commercial multiprocessors [23], [28]. Our DPAPP technique considers phases that are identified as parallel loops, as these structures encapsulate the bulk of parallel code in real scientific applications. Specifically, for the purposes of this work, we define phases to be OpenMP parallel regions.
Our DPAPP model works by predicting the cumulative useful Instructions per Cycle (uIPC) of multithreaded phases. uIPC is defined as the sum of IPCs of the threads used to execute a phase, excluding instructions and cycles expended for synchronization and parallelization. Ignoring parallelization and synchronization overheads makes uIPC inversely proportional to the execution time of a fixed number of instructions on a given hardware configuration. Note that although uIPC ignores instructions for triggering and synchronizing threads, it still considers the effects of interference between threads on shared hardware resources during concurrent execution. The objective of DPAPP is to identify phases where concurrency can be reduced during the execution of useful application computation, with a nonnegative impact on performance.
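As an illustration of the uIPC definition, the following minimal sketch sums per-thread useful IPC after subtracting synchronization and parallelization overheads. The record layout (`ThreadSample` and its field names) and the sample numbers are assumptions made for this example; the paper does not specify how the overhead instructions and cycles are attributed.

```python
from dataclasses import dataclass


@dataclass
class ThreadSample:
    """Counter deltas for one thread over one phase traversal (hypothetical layout)."""
    instructions: int           # all retired instructions
    cycles: int                 # all elapsed cycles
    overhead_instructions: int  # instructions spent on synchronization/parallelization
    overhead_cycles: int        # cycles spent on synchronization/parallelization


def useful_ipc(sample: ThreadSample) -> float:
    """IPC of one thread, counting only useful instructions and cycles."""
    useful_instructions = sample.instructions - sample.overhead_instructions
    useful_cycles = sample.cycles - sample.overhead_cycles
    if useful_cycles <= 0:
        return 0.0  # nothing useful was executed in this sample
    return useful_instructions / useful_cycles


def cumulative_uipc(samples) -> float:
    """uIPC of a phase: the sum of the useful IPCs of its threads."""
    return sum(useful_ipc(s) for s in samples)


# Two threads with made-up counter values.
samples = [
    ThreadSample(1_000_000, 1_000_000, 100_000, 250_000),  # 900k / 750k = 1.2
    ThreadSample(900_000, 1_000_000, 100_000, 200_000),    # 800k / 800k = 1.0
]
phase_uipc = cumulative_uipc(samples)
```

With these made-up counters the phase uIPC is the sum of the two per-thread values, 1.2 and 1.0.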
DPAPP Outline
DPAPP makes distinct predictions of the optimal number of processing units to use at each dimension of parallelism in the system. For ease of presentation, we first describe the operation of DPAPP for a given dimension of parallelism d. We defer the discussion of how DPAPP predicts across dimensions of parallelism until Section 3.5.
DPAPP takes input from live samples of HECs. HECs are sampled at the beginning and end of each phase while the phase is executed on the configuration that activates all processing units at dimension d. The set of hardware events sampled is specific to d, and the events are selected using a formal statistical process according to their contribution to uIPC. We refer to these events as critical events. Samples of critical event rates are fed to a model that estimates uIPC per phase per configuration for all feasible configurations of processing units at dimension d. Intuitively, DPAPP attempts to predict how the rate of retirement of useful instructions (uIPC) will change in a phase when the number of processing units used to execute the phase changes. To make this prediction, DPAPP uses a multivariate regression model, which correlates observed event rates on a sampled configuration and observed uIPC values on all feasible hardware configurations during training runs. The model outputs a set of scaling factors for uIPC and the critical hardware events for each feasible hardware configuration. These outputs are used as constant coefficients during production runs to predict optimal operating points of concurrency for each phase in the code. We describe the model in more detail in Section 3.2 and the process for training the model in Section 3.3. The process for selecting critical events is discussed in Section 3.4.
The objective of DPAPP is to produce performance predictions and adapt the code dynamically as the program executes. Recall that the primary motivation behind DPAPP is the avoidance of the overhead of experimentally searching through hardware configurations to find optimal operating points for phases in the program. To minimize the prediction overhead and to achieve effective code adaptation as early as possible during execution, DPAPP samples HECs for a minimal number of phase traversals. Following phase traversals used for sampling hardware event rates, the runtime system selects the predicted optimal operating point of concurrency for each phase. 
uIPC Prediction Model
The DPAPP predictor estimates the uIPC of a phase on a target configuration $t$ (denoted as $\mathit{uIPC}(t)$) by using input from the execution of the phase on a sampled test configuration $s$. The input from the sampled execution includes the actual uIPC of the sampled configuration, $\mathit{uIPC}(s)$, and a set of $n$ hardware event per cycle rates $(e_1(s), \ldots, e_n(s))$. Each event rate $e_i(s)$, $i = 1, \ldots, n$, is the number of occurrences of event $i$ divided by the number of elapsed clock cycles during the execution of the phase in test configuration $s$. Although in theory, the DPAPP predictor can use any feasible configuration as a sample configuration, we heuristically chose to use the configuration where all processing units at the given dimension of parallelism are active. Intuitively, uIPC and the event rates sampled in this configuration encapsulate the cumulative impact of hardware components on scaling.
We model $\mathit{uIPC}(t)$ of the target configuration as a linear function of $\mathit{uIPC}(s)$ of the source configuration:

$$\mathit{uIPC}(t) = \mathit{uIPC}(s) \cdot \alpha\big(t, e_1(s), \ldots, e_n(s)\big) + \lambda(t) \tag{1}$$

for a set of $n$ critical hardware events, which may function either as enhancers or as impediments of scalability. The selection of the events in this set is discussed further in Section 3.4. Notice that both the scaling factor $\alpha$ and the constant term $\lambda$ of the linear function are specific to and dependent on the target hardware configuration $t$. In other words, each target configuration $t$ exerts its own scaling impact on $\mathit{uIPC}(s)$, which can be positive or negative. To gauge how individual critical events affect scalability, the linear scaling factor is, in turn, modeled as a linear combination of hardware event rates observed during the sampled configuration $s$ as follows:

$$\alpha\big(t, e_1(s), \ldots, e_n(s)\big) = \sum_{i=1}^{n} \big(x_i(t) \cdot e_i(s) + y_i(t)\big) + z(t). \tag{2}$$
The linear model of event rates stems from the empirical observation that a change in the configuration used to execute a program phase will result in changes, either upwards or downwards, of critical hardware event rates, reflecting the contention or effective hardware utilization at each level of parallelism. These event rates are related positively or negatively with the uIPC, and this relationship can be accurately represented using a linear model [26], [33]. We capture this relation in (2) with positive or negative event coefficients, respectively. Our model attempts to estimate these coefficients by using multivariate regression, discussed further in Section 3.3. The advantage of such an empirical model is that it is hardware agnostic; that is, it can be retrained for arbitrary architectures without requiring detailed user-provided domain knowledge about the processor.
Combining (1) and (2), the estimated uIPC for a target configuration $t$ can be calculated as

$$\mathit{uIPC}(t) = \mathit{uIPC}(s) \cdot \sum_{i=1}^{n} \big(x_i(t) \cdot e_i(s)\big) + \mathit{uIPC}(s) \cdot \gamma(t) + \lambda(t) \tag{3}$$

where $\gamma(t)$ is defined as $\sum_{i=1}^{n} y_i(t) + z(t)$. An accurate estimation of uIPC for a target configuration $t$ is thus dependent on the proper approximation of the coefficients $x_i(t)$, $\gamma(t)$, and the constant $\lambda(t)$. Note that the coefficients scale both the event rates and the uIPC of the sampled configuration $s$.
$\mathit{uIPC}(t)$ values for all possible configurations are used directly for the prediction of the optimal operating concurrency for each phase at the given dimension of parallelism. We truncate uIPC predictions that exceed the cumulative maximum capacity $\mathit{uIPC}_{max}$ of all processing units at the given dimension of parallelism to $\mathit{uIPC}_{max}$, which is derived experimentally for any given processor using microbenchmarks. Furthermore, we assume that there is no superlinear speedup across the configurations of a phase, although this case appears in real codes. In practice, phases with superlinear speedup have their optimal operating point of concurrency at the maximum number of processing units and offer no opportunity for concurrency throttling.
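The prediction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coefficient values, the event rates, and the rule of breaking ties toward fewer processing units are all assumptions made for the example. Each target configuration carries its own per-event coefficients, a combined scaling offset (the sum of the per-event offsets and the configuration offset), and a constant term.

```python
def predict_uipc(uipc_s, event_rates, coeffs, uipc_max):
    """Predict uIPC on one target configuration from a sample on configuration s.

    coeffs holds the per-target values: 'x' (one coefficient per critical event),
    'gamma' (combined scaling offset) and 'lam' (constant term).
    """
    scaling = sum(x * e for x, e in zip(coeffs["x"], event_rates)) + coeffs["gamma"]
    predicted = uipc_s * scaling + coeffs["lam"]
    return min(predicted, uipc_max)  # truncate to the measured capacity


def best_configuration(uipc_s, event_rates, coeffs_by_target, uipc_max):
    """Return (target, predictions); ties favor fewer units (an assumption)."""
    predictions = {
        target: predict_uipc(uipc_s, event_rates, coeffs, uipc_max)
        for target, coeffs in coeffs_by_target.items()
    }
    best = min(predictions, key=lambda target: (-predictions[target], target))
    return best, predictions


# Hypothetical coefficients for targets using 1, 2, or 4 processing units.
coeffs_by_target = {
    1: {"x": [-10.0, 50.0], "gamma": 0.4, "lam": 0.0},
    2: {"x": [-10.0, 50.0], "gamma": 1.1, "lam": 0.1},
    4: {"x": [0.0, 0.0], "gamma": 1.0, "lam": 0.0},
}
sampled_uipc = 2.0            # measured with all 4 units active
sampled_rates = [0.01, 0.002]  # two critical event rates (events per cycle)

best, predictions = best_configuration(
    sampled_uipc, sampled_rates, coeffs_by_target, uipc_max=4.0
)
```

In this made-up case the predictions are roughly 0.8, 2.3, and 2.0 for 1, 2, and 4 units, so the phase would be throttled to 2 units.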
Offline Training and Estimation of Coefficients
We use multivariate linear regression on the multithreaded phases of a set of training benchmarks to determine the values of the coefficients in (1). Although more advanced machine learning techniques could be deployed for prediction, the number of cycles invested in making predictions at runtime is a primary concern for DPAPP; therefore, we opt for the simplest linear prediction model. Specifically, training benchmarks are executed under all feasible hardware configurations at all dimensions of parallelism while recording per phase uIPC and the critical hardware events used for prediction (see Section 3.4). The training benchmarks are selected empirically so as to include phases with variance in three characteristics: scalability ranging from poor to perfect, the granularity of parallel computation, ranging from fine to coarse, and the ratio of computation to memory accesses, ranging from low to high. Through this process, patterns in the effects of event rates on scalability are learned statistically, resulting in high accuracy when applied online.
Our multivariate regression analysis uses the events collected under the selected sample configuration $s$ multiplied by the uIPC of the sampled configuration, that is, $e_i(s) \cdot \mathit{uIPC}(s)$, and the actual uIPC alone, $\mathit{uIPC}(s)$, as independent variables to predict the $\mathit{uIPC}(t)$ of each target configuration $t$ as the dependent variable. We use the product of $e_i$ and uIPC of the sampled configuration for coefficient derivation, because our model uses multiplicative effects of events on the observed uIPC rather than additive ones, in accordance with (3). This process estimates the necessary coefficients for each event in function $\alpha(t)$. Regression analysis is performed separately to predict uIPC for each target configuration $t$; therefore, we derive independent sets of coefficients and independent scaling factors for each target configuration. For a system with $p_d$ units in dimension $d$ of parallelism, $d = 1, \ldots, D$, multivariate regression analysis derives a total of $\sum_{d=1}^{D} p_d$ sets of coefficients.
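The regression for a single target configuration can be sketched as below. The independent variables follow the description above (each event rate multiplied by the sampled uIPC, the sampled uIPC itself, and an intercept). Solving ordinary least squares through the normal equations in pure Python is an assumption made to keep the sketch dependency-free; the paper does not state which solver was used, and the training data here are synthetic.

```python
def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: training phases lack diversity")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def fit_target(samples):
    """Fit one target configuration.

    samples: list of (uipc_s, event_rates, uipc_t) tuples, one per training phase.
    Returns per-event coefficients, the scaling offset, and the constant term.
    """
    rows = [[e * u for e in rates] + [u, 1.0] for u, rates, _ in samples]
    targets = [y for _, _, y in samples]
    k = len(rows[0])
    # Normal equations: (A^T A) w = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    w = solve_linear_system(ata, aty)
    return {"x": w[:-2], "gamma": w[-2], "lam": w[-1]}


# Synthetic training phases generated from known (made-up) coefficients.
true_x, true_gamma, true_lam = [-2.0, 1.0], 0.5, 0.1
inputs = [
    (1.0, [0.1, 0.2]), (2.0, [0.3, 0.1]), (1.5, [0.2, 0.4]),
    (3.0, [0.05, 0.3]), (2.5, [0.4, 0.25]), (0.8, [0.15, 0.05]),
]
training = [
    (u, rates, u * (sum(x * e for x, e in zip(true_x, rates)) + true_gamma) + true_lam)
    for u, rates in inputs
]
fitted = fit_target(training)
```

Because the synthetic targets are exactly linear in the chosen variables, the fit recovers the generating coefficients up to rounding. One such fit would be repeated per target configuration and per dimension of parallelism.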
Rigorous Event Set Selection for uIPC Prediction
The accuracy of DPAPP is heavily dependent upon the selection of an effective set of critical events for predicting performance and scalability along each dimension of parallelism. The events should accurately reflect, in a statistical sense, performance and scalability bottlenecks in the system. We have previously considered empirical selection of events that represent known performance-critical components of microprocessors [6]. In this paper, we present a rigorous statistical technique, which automates the event selection process and makes it reproducible and generally applicable to any target architecture. Modern processors generally provide very large sets of events that can be recorded, several of which can typically be recorded at the same time. For example, Intel Pentium 4 provides 40 events, which can be further differentiated by specifying bitmasks to each event, and up to 18 events can be recorded at once. The IBM Power5 provides 500 events and permits up to 6 to be recorded simultaneously. The number of legal sets of events that can be recorded simultaneously on these architectures is far too large for it to be feasible to exhaustively test each set of events as input for prediction. Moreover, although the most effective prediction possible would likely result from the use of all (or at least most) available events, there is an architectural limit on how many events can be recorded simultaneously.
Rather than exhaustively looking at each possible combination of events, our predictor training tool independently looks at the contribution of each event to uIPC. To gauge each event's significance, we initially use multivariate regression on data from the set of training benchmarks to predict $\mathit{uIPC}(t)$ for each target configuration by using all events that are available for monitoring on the processor. We model uIPC as in (3), with the exception that we use a set of $N$ events, where $N \gg n$.
Following the initial uIPC modeling phase, we prune all events that have zero or negligible occurrence rates. We then consider the contribution of each event to the resulting $\mathit{uIPC}(t)$ prediction as a percentage of $\mathit{uIPC}(t)$. The contribution of each event is calculated by multiplying the event rate by its coefficient and by $\mathit{uIPC}(s)$ and dividing the result by $\mathit{uIPC}(t)$. We average the contributions of each event across all feasible configurations and all phases in the training runs and rank the events in descending order of contribution. The actual number of events selected for prediction, $n$, is processor dependent. We set $n$ to be the maximum number of events that the hardware performance monitor of the processor can count simultaneously without time multiplexing of event registers. This selection criterion minimizes the overhead of monitoring hardware events for prediction.
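The pruning and ranking procedure can be sketched as follows. The event names, the pruning threshold `min_mean_rate`, and the choice to rank by the magnitude of the contribution (so that strongly negative effects also rank highly) are assumptions made for this example; the text above only specifies the contribution formula, the averaging, and the descending order.

```python
def select_critical_events(observations, n, min_mean_rate=1e-6):
    """Pick the n events with the largest mean contribution to predicted uIPC.

    observations: one dict per (training phase, target configuration) pair with
      'uipc_s' : uIPC measured on the sampled configuration
      'uipc_t' : uIPC of the target configuration
      'rates'  : {event: events per cycle on the sampled configuration}
      'coeffs' : {event: regression coefficient for this target}
    """
    events = list(observations[0]["rates"])
    count = len(observations)

    # Step 1: prune events that (almost) never occur.
    mean_rate = {e: sum(o["rates"][e] for o in observations) / count for e in events}
    kept = [e for e in events if mean_rate[e] >= min_mean_rate]

    # Step 2: average each event's contribution, coefficient * rate * uIPC(s) / uIPC(t).
    def mean_contribution(event):
        return sum(
            abs(o["coeffs"][event] * o["rates"][event] * o["uipc_s"] / o["uipc_t"])
            for o in observations
        ) / count

    # Step 3: rank in descending order and keep as many as the counters allow.
    return sorted(kept, key=mean_contribution, reverse=True)[:n]


# Two made-up observations over four hypothetical events.
observations = [
    {"uipc_s": 2.0, "uipc_t": 1.0,
     "rates": {"l2_miss": 0.01, "branch_mispredict": 0.02, "bus_trans": 0.005, "unused": 0.0},
     "coeffs": {"l2_miss": -20.0, "branch_mispredict": -2.0, "bus_trans": -10.0, "unused": 100.0}},
    {"uipc_s": 1.5, "uipc_t": 1.5,
     "rates": {"l2_miss": 0.02, "branch_mispredict": 0.01, "bus_trans": 0.004, "unused": 0.0},
     "coeffs": {"l2_miss": -15.0, "branch_mispredict": -3.0, "bus_trans": -12.0, "unused": 100.0}},
]
critical = select_critical_events(observations, n=2)
```

Here the mean contributions are about 0.35 for `l2_miss`, 0.074 for `bus_trans`, and 0.055 for `branch_mispredict`, while `unused` is pruned because it never occurs, so the two selected events are `l2_miss` and `bus_trans`.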
Prediction on Architectures with Multiple Dimensions of Parallelism
On architectures with multiple dimensions of parallelism, resource sharing varies considerably across dimensions. For example, physical processors in an SMP share only the off-chip interconnection network and DRAM. Cores within a processor typically share an on-chip interconnection network and the outermost levels of the on-chip cache. Threads on a single core share most resources of the execution core, including pipelines, branch predictors, TLB, and L1 cache. Contention for these shared resources largely determines performance and scalability.
To capture the implications of multidimensional parallelism, DPAPP uses a distinct set of critical events and derives a distinct set of scaling factors for each dimension of parallelism in the system. DPAPP repeats the processes outlined in Sections 3.2 and 3.4 to obtain prediction event sets and coefficients for each dimension of parallelism. At actuation time, DPAPP makes predictions along each of the dimensions of parallelism and combines these predictions to yield a power-efficient concurrency operating point for each phase in the program.
Predictor Optimization
The accuracy of DPAPP is significantly improved by classifying code phases into buckets according to their observed uIPC during the execution of the sampled configuration. The justification for such an extension is twofold. First, grouping phases based on uIPC allows training and prediction to occur separately for phases with different scalability slopes. As such, the division between buckets is selected such that it divides different degrees of scalability. Second, it is intuitive that the effects of events will vary, depending on the original instruction throughput of each phase. Dividing the phases into buckets and creating separate $\alpha(t)$ scaling functions for each class of phases gives the predictor the opportunity to make more fine-grained and accurate predictions. At runtime, the observed uIPC on the sample configuration determines which set of coefficients will be used for prediction. We use this optimization in our implementation of DPAPP.
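The runtime lookup implied by this optimization can be sketched as a sorted-boundary search. The bucket boundaries and the placeholder coefficient sets below are hypothetical; the text does not give the number of buckets or where they are divided.

```python
from bisect import bisect_right


class BucketedCoefficients:
    """Select a coefficient set by the uIPC observed on the sampled configuration."""

    def __init__(self, boundaries, coefficient_sets):
        # boundaries: ascending uIPC thresholds separating adjacent buckets
        # coefficient_sets: one entry per bucket, so one more than boundaries
        if len(coefficient_sets) != len(boundaries) + 1:
            raise ValueError("need exactly one coefficient set per bucket")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be in ascending order")
        self.boundaries = list(boundaries)
        self.coefficient_sets = list(coefficient_sets)

    def bucket_index(self, uipc_s):
        """Index of the bucket containing uipc_s (a boundary value goes to the upper bucket)."""
        return bisect_right(self.boundaries, uipc_s)

    def coefficients_for(self, uipc_s):
        return self.coefficient_sets[self.bucket_index(uipc_s)]


# Three hypothetical buckets: uIPC < 1.0, 1.0 <= uIPC < 2.0, uIPC >= 2.0.
# Strings stand in for the trained per-bucket coefficient sets.
buckets = BucketedCoefficients([1.0, 2.0], ["low", "mid", "high"])
chosen = buckets.coefficients_for(1.4)
```

Training would be run once per bucket on the phases that fall into it, and the lookup above adds only a binary search to the prediction cost.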
CONCURRENCY THROTTLING FOR PERFORMANCE AND POWER OPTIMIZATION
In this section, we present our phase-aware concurrency throttling algorithm for a two-layer shared-memory multiprocessor such as a multichip multiprocessor with multicore processors. We then discuss the power and energy reduction potential of the algorithm and extensions to the algorithm that account for interphase interference.
ACTOR Runtime System
Scientific codes are dominated by iterative execution of phases, and ACTOR exploits this property to sample hardware event rates in the first few phase traversals and to set the concurrency of each phase to the predicted optimal operating point early during the execution of the program. The live search of the optimization space for operating points of concurrency can also be performed by timing phases at different configurations and running search heuristics such as greedy hill climbing [7], [30] and simulated annealing [27]. However, as the number of feasible hardware configurations increases with the introduction of more cores and threads per processor, direct search methods may spend most of the execution time sampling suboptimal configurations rather than optimizing the program. This disadvantage manifests itself in codes where dominant multithreaded phases are traversed only a few times. Even if direct search methods are used for offline autotuning by repetitive executions of the entire program [11], searching the optimization space for any input on any feasible configuration of processing units may be prohibitive. ACTOR prunes the search space for concurrency optimization to a constant number of samples. Fig. 2 illustrates a DPAPP-driven concurrency throttling algorithm in ACTOR for a multiprocessor with two dimensions of parallelism. The DPAPP-based concurrency throttling algorithm has two parameters, that is, the sampling rate and the dimension of parallelism along which the initial samples are taken. The sampling rate S corresponds to the number of times each phase needs to be executed before deriving a prediction for the optimal operating point and is used to control the sampling overhead. In our prototype, we use a sample rate of S = 2 taken along the innermost dimension of parallelism, that is, threads within a processor, which provides the minimum number of samples needed to capture the effects of using more than one core or thread per processor.
The second parameter is fixed at the training phase of the DPAPP predictor, during which all possible orderings of dimensions of parallelism can be tested. The algorithm in Fig. 2 generalizes to more than two dimensions by repeating the loop in lines 11-17 for each dimension beyond the second.
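The sample-then-predict flow described above can be illustrated with a minimal sketch. It is not the algorithm of Fig. 2: the class and method names are hypothetical, the `predict` callback is a stand-in for DPAPP, and for brevity each sample is summarized by a single observed uIPC value rather than a set of hardware event rates.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Config = Tuple[int, int]  # (processors, threads per processor)


@dataclass
class PhaseState:
    """Per-phase bookkeeping: samples gathered so far and the final decision."""
    samples: Dict[Config, float] = field(default_factory=dict)
    decided: Optional[Config] = None


class ThrottlingController:
    """Hypothetical controller: sample each phase S times, then lock in a config."""

    def __init__(self, sample_configs: List[Config],
                 predict: Callable[[Dict[Config, float]], Dict[Config, float]]):
        # sample_configs holds the S configurations to sample (S = 2 in the text).
        # predict is an assumed stand-in for DPAPP: samples -> predicted uIPC.
        self.sample_configs = sample_configs
        self.predict = predict
        self.phases: Dict[str, PhaseState] = {}

    def config_for(self, phase_id: str) -> Config:
        """Configuration to use for the next traversal of the given phase."""
        state = self.phases.setdefault(phase_id, PhaseState())
        if state.decided is not None:
            return state.decided
        # Still sampling: use the next sample configuration in order.
        return self.sample_configs[len(state.samples)]

    def record(self, phase_id: str, config: Config, observed_uipc: float) -> None:
        """Store one sample; after S samples, predict and fix the best config."""
        state = self.phases.setdefault(phase_id, PhaseState())
        if state.decided is not None:
            return
        state.samples[config] = observed_uipc
        if len(state.samples) == len(self.sample_configs):
            # Compare predictions for unsampled configs with the measured samples.
            candidates = {**self.predict(state.samples), **state.samples}
            state.decided = max(candidates, key=candidates.get)
```

With sample configurations (4, 1) and (4, 2), a phase runs once on each, and every later traversal uses whichever configuration has the highest measured or predicted uIPC. The search cost stays at S traversals per phase regardless of how many configurations the machine offers.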
The structure of the ACTOR system is given in Fig. 3 . The controller is dynamic in the sense that it adapts the program as it executes, with no prior knowledge of program characteristics. Currently, ACTOR requires simple formulaic instrumentation in the application; however, we plan to instead embed all functionalities within the threading substrate. ACTOR estimates optimal operating points of concurrency by using samples of critical hardware event rates from live executions of program phases. Specifically, the library controls the first S phase traversals to execute on the desired sample configurations and collect event rates, as shown in Fig. 4 . At the end of the sampling period, collected event rates are used by DPAPP to predict the uIPC of each phase on alternative configurations. Once predictions for a phase are obtained, all subsequent traversals of a phase are executed at the predicted optimal operating point of concurrency. ACTOR enforces configuration decisions through the Linux processor affinity system call sched_setaffinity() and threading library-specific calls for changing concurrency levels such as omp_set_num_threads() in OpenMP. The library executes at the user level and so does not require administrator privileges. The overhead of using ACTOR in terms of the time spent not executing application code is approximately 500,000 cycles per program phase (250 μs on a 2-GHz processor), which is negligible for any realistic application.
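As an illustration of how a configuration decision might be turned into a processor affinity mask, the sketch below maps a (processors, threads per processor) pair to a set of logical CPU ids and applies it with Python's `os.sched_setaffinity`, which wraps the same Linux system call. The function names and the CPU numbering scheme are assumptions made for this example; real numbering varies between machines, and the threading-library call that changes the thread count is not shown.

```python
import os
from typing import Set


def cpus_for_config(nproc: int, nthr: int,
                    total_procs: int = 4, threads_per_proc: int = 2) -> Set[int]:
    """Map an (nproc, nthr per processor) configuration to logical CPU ids.

    Assumption: logical CPU id = processor * threads_per_proc + thread.
    Actual numbering is machine specific and should be read from the system.
    """
    if not (1 <= nproc <= total_procs and 1 <= nthr <= threads_per_proc):
        raise ValueError("configuration outside the machine's limits")
    return {p * threads_per_proc + t for p in range(nproc) for t in range(nthr)}


def apply_config(nproc: int, nthr: int) -> Set[int]:
    """Restrict the calling process to the CPUs of the chosen configuration."""
    wanted = cpus_for_config(nproc, nthr)
    if hasattr(os, "sched_setaffinity"):  # only available on Linux
        usable = wanted & os.sched_getaffinity(0)
        if usable:  # skip if none of the wanted CPUs exist on this machine
            os.sched_setaffinity(0, usable)
    return wanted
```

Under the assumed numbering, `cpus_for_config(2, 1)` selects one hardware thread on each of the first two processors, leaving the remaining processors idle so that the operating system can halt them.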
Although both concurrency throttling and DVFS target improved energy efficiency, concurrency throttling has the advantage that it will often improve performance, whereas DVFS sacrifices performance to reduce power consumption. Furthermore, DVFS relies on program phases with high memory access rates to avoid degrading performance significantly, whereas concurrency throttling may be applied in other cases as well. In general, however, the two approaches are likely to be highly synergistic and can be applied together to achieve even greater energy efficiency. For example, DVFS could be applied using existing approaches to cores kept active by concurrency throttling. More sophisticated techniques could be devised to optimize both DVFS and concurrency; however, such a solution is beyond the scope of this work.
Certain assumptions are necessary to implement our concurrency throttling system, and we outline those in the following. First, we rely on the capability of the runtime system to change the number of threads used to execute a phase of parallel code at runtime. This capability is available in OpenMP at the granularity of parallel loops and parallel regions. However, changing the number of threads at runtime may not be possible in some applications due to data initialization, which depends on the number of threads used. This pattern is uncommon and is trivial to modify. Second, the phases of an application must be executed at least S times to allow for sampling. Finally, the execution properties of each phase between executions must remain relatively stable. In practice, this is the case in both regular and irregular codes.
Although we have specifically designed ACTOR for use with iterative scientific applications, the approach may apply to other categories of applications as well. The basic principle of ACTOR can be used with any definition of a phase where concurrency can be dynamically adjusted. For example, in noniterative, synchronization-intensive, or heterogeneous multithreaded codes, if an existing phase identification technique can be employed to identify repetitive behavior where concurrency is modifiable, then our approach can be applied. For server workloads, the application may be treated as one large phase, and a limited time frame can be monitored to decide concurrency for the entire application.
Energy Savings Possibilities
Energy savings using adaptive concurrency throttling come through two avenues: the first is by reducing the execution time, because the energy consumed is reduced proportionally, and the second is through the deactivation of processing units, which reduces power consumption. The power consumption of a processing unit is dependent upon its level of utilization, as clock gating limits the power dissipation of functional units when they are idle. Furthermore, a processor can be transitioned to a lower power mode when it is not being used. For example, on Intel Pentium 4 processors, the hlt instruction transitions the processor to a low-power mode, where power consumption is reduced from approximately 9 W when idle to 2 W when halted. Although we do not manually control the transitioning between power states of the processors from within the runtime system, the operating system does so when the processor remains inactive for some time period. We have experimentally verified that in Linux 2.6 kernels, processors are actually transitioned to the halted state during 90 percent of the time during which they have been left idle. Manually transitioning processors would result in minimal additional power savings, so we do not consider this direction further in this work.
Cross-Phase Decision Making
The processes of prediction, decision making, and adaptation are not performed at whole-program granularity; rather, each phase of an application is analyzed independently. This allows phases with different execution properties in the same application to execute with their own locally optimal hardware configurations. Since many programs have behavior that varies across phases [36] , the overall performance can be improved compared to using a single configuration for the entire program. However, a nonnegligible performance penalty may be paid as a result of changing the hardware configuration across adjacent phases at runtime. This performance penalty stems primarily from the migration of working sets of threads between caches [24] . To avoid negative interphase interference, we consider variants of our adaptation scheme that are aware of this interference. We have developed two schemes for cross-phase adaptation. The first of these schemes simply finds the configuration that is the best for the majority of the application's phases and applies this to all phases, regardless of their locally optimal configuration. This scheme avoids cache interference entirely at the expense of using a single configuration for all phases and missing fine-grained optimization opportunities. The second approach is an extension of the first, where phases are allowed to temporarily replace the global optimal configuration with their local optimal configuration only if IPC improvement beyond a preset threshold is predicted by using the local decision. Using this technique, interference will only be tolerated when the phase in question is expected to make up for it in performance gain through the use of an alternative configuration.
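The second cross-phase scheme can be sketched as follows. This is an illustrative reading of the description rather than ACTOR's code: the function names and data layout are hypothetical, the dominant configuration is weighted by execution time share (as in the later evaluation), the default threshold of 0.15 follows the 15 percent value reported in the evaluation, and every phase is assumed to have a predicted IPC for every configuration.

```python
from collections import defaultdict
from typing import Dict, Tuple

Config = Tuple[int, int]  # (processors, threads per processor)


def dominant_config(local_best: Dict[str, Config],
                    time_share: Dict[str, float]) -> Config:
    """Configuration that is locally optimal for the largest share of time."""
    share: Dict[Config, float] = defaultdict(float)
    for phase, cfg in local_best.items():
        share[cfg] += time_share[phase]
    return max(share, key=share.get)


def cross_phase_decisions(predicted_ipc: Dict[str, Dict[Config, float]],
                          time_share: Dict[str, float],
                          threshold: float = 0.15) -> Dict[str, Config]:
    """Use the global configuration unless a phase predicts a large local gain."""
    local_best = {p: max(ipc, key=ipc.get) for p, ipc in predicted_ipc.items()}
    global_cfg = dominant_config(local_best, time_share)
    decisions = {}
    for phase, ipc in predicted_ipc.items():
        # Relative IPC gain of the phase's own optimum over the global choice.
        gain = ipc[local_best[phase]] / ipc[global_cfg] - 1.0
        decisions[phase] = local_best[phase] if gain >= threshold else global_cfg
    return decisions
```

Setting `threshold` to infinity reproduces the first scheme (one configuration everywhere), while a threshold of zero degenerates to purely phase-local decisions.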
EVALUATION
In this section, we perform an evaluation of both the performance prediction model and the adaptive concurrency throttling technique presented in Section 4. In Section 5.1, we present the experimental setup that we used in our evaluation. Then, we present the results of event selection for prediction and the resulting accuracy of the predictor. Finally, we compare the power and performance results of ACTOR with those attained by online techniques based on empirical search and by offline techniques using predetermined concurrency.
Experimental Setup
We performed all of our experimental evaluations on a Dell PowerEdge 6650 server equipped with four Intel Hyperthreaded Xeon processors with 1 Gbyte of main memory. Each processor is a 1.4-GHz two-way SMT equipped with an 8-Kbyte L1 data cache, a 12-Kbyte trace cache, a 256-Kbyte L2 cache, and a 512-Kbyte L3 cache. The operating system on the server is the Linux kernel version 2.6.15.
Experiments were performed with 10 benchmarks that are representative of scientific and engineering applications typically requiring high performance. Nine of the benchmarks originate in the OpenMP version of the NASA Advanced Supercomputing Parallel Benchmarks suite version 3.1 [20] . We use three different problem sizes available in the NAS distribution: W, A, and B. MM5 is an OpenMP implementation of a mesoscale weather prediction model [16] . The benchmarks include a wide variety of program properties, in particular widely varying uIPC scalability across execution phases. Therefore, they are challenging targets for prediction. The benchmark suite includes several benchmarks with a small number of iterations (CG, FT, IS, and MG), in which empirical search strategies may suffer due to a large percentage of the total execution time being spent in exploration, as well as benchmarks with a large number of iterations (BT, LU, LU-HP, SP, UA, and MM5), where search strategies stand to have their search overheads better amortized. Results for FT are not included for class size B, because its working set does not fit in the available memory of our hardware platform. Table 1 lists the benchmarks, along with some pertinent information about their structure. The number of iterations, phases, and percentage of time spent in parallel regions shown are for class size A. The table also outlines the percentage of execution time during which at least one processor can be deactivated with nonnegative impact on performance (that is, the program runs optimally with at most three processors) and the percentage of execution time during which one Hyperthread per processor can be deactivated with nonnegative impact on performance (that is, the program runs optimally with at most one Hyperthread per processor) averaged over all three class sizes. This information is taken from static executions on all feasible hardware configurations.
Performance Prediction Evaluation
In order to evaluate our performance prediction model, we selected two benchmarks for training, specifically UA (compiled to class size A) and MM5. These benchmarks were selected, because the phases that they contain have widely varying execution properties, including IPC, scalability, and locality. Furthermore, they contain enough phases, specifically 119, to serve as a stand-alone training set. These applications were used in the event selection process and the predictor training. Predictions were made for the remaining benchmarks, that is, all remaining NAS benchmarks with class sizes W, A, and B. Sample configurations of one and two threads active on all four processors were selected as input to predict for configurations with fewer processors active. As a result, predictions were made for a total of six configurations.
Event Selection
Selection of an effective set of events to use for performance prediction requires data for all of the available hardware counters on each of the test configurations for all of the training benchmark phases. Furthermore, the uIPC values of all phases on each hardware configuration are necessary as well. There are 40 events on Pentium 4 processors that can be recorded using only a single register each, with further differentiation within each event through the use of bitmask parameters specifying, for example, to record L2 cache misses, hits, or accesses. There is also an event for counting memory accesses, which requires two counter registers. We select one bitmask for each event representing the hardware parameter most likely to have the largest effect on performance, leaving 41 events to consider. Of these, 13 had rates near zero and were thus removed, as described in Section 3.4. The performance-monitoring unit of the Pentium 4 with the Hyperthreading technology shares the 18 counter registers between the two coexecuting threads, leaving nine counters available for each thread. The 28 events that survived pruning provide a total of 99,372 possible architecturally legal sets of events that can be recorded on the nine performance counter registers per thread. Regression analysis was performed on the data from each phase to find the events that contributed the most to the resulting IPC prediction. Table 2 displays the set of events that was selected for prediction from each sampled configuration on our platform. In this discussion, configuration (nproc, nthr/proc) denotes a configuration with nproc processors and nthr/proc threads per processor. It should be pointed out that some events with large contributions have been excluded due to conflicts with more dominant events. That is, the inclusion of one highly contributing event often eliminates other contributing events that interfere with it.
All that can be done in these cases is to select the event with the largest contribution and ignore the conflicting events.
Specifically, three of the top-five events on this architecture cannot be included, because they conflict with the top-two events. This suggests that on architectures where there are no dependencies between events, our prediction approach will likely achieve higher accuracy.
Prediction Accuracy
We perform our evaluation of the accuracy of the online performance predictor by using eight of our 10 benchmarks, excluding the two benchmarks used for training the predictor. We consider the absolute prediction error and the configuration prediction error for each benchmark. We calculate the absolute prediction error as $|uIPC_{pred} - uIPC_{obs}| / uIPC_{obs}$, where $uIPC_{obs}$ is the observed IPC of useful instructions and $uIPC_{pred}$ is the predicted value. The average prediction error for each phase is taken across all target configuration predictions. Configuration prediction accuracy illustrates how often the predictor identifies the local static optimal configuration, which is defined as follows: We execute the benchmarks with each of the eight possible hardware configurations statically, that is, with no concurrency throttling between phases. For each phase, we designate as optimal the configuration that minimizes the execution time of the phase. We should note that the litmus test for our predictor is not uIPC prediction accuracy but configuration prediction accuracy. As long as the predictor consistently predicts the optimal configuration correctly, a potentially high uIPC prediction error can be disregarded.
As discussed in Section 3.6, we utilize phase classification before making predictions. Specifically, we divided phases into two buckets, with uIPC greater than or equal to 1.0 and those less than 1.0 during the sampled configuration. This division is not arbitrary; rather, it provides an approximate value to separate phases with low scalability characteristics versus those that scale well on our experimental platform. During prediction, each phase uses the coefficients derived from the uIPC bucket corresponding to its observed uIPC during the sampled configuration.
The uIPC prediction accuracy can be seen in Fig. 5a . This graph gives the cumulative distribution function of prediction error, that is, the percentage of phases that experience error below each threshold, with threshold samples taken every 5 percent. The median absolute prediction error is 12.6 percent. We note that 24 percent of all predictions have less than 5 percent error and 43 percent of all predictions have less than 10 percent error. On the other hand, only 4 percent of the predictions show errors larger than 50 percent. Although our performance prediction model is purposefully simple to minimize the overhead of applying it at runtime, its results compare favorably with other reported statistical techniques for predicting IPC [9] . The high accuracy of the model stems from the use of statistically selected event rates, which allows predictions to be made based on detailed knowledge of the utilization of specific critical processor resources where programs spend most of their execution cycles. Trends in the relationship between the usage of these resources and the resulting scalability are learned offline through statistical analysis of the training set, so an accurate model is achieved, because the training phase captures a wide range of scalability-event correlation patterns. In terms of the prediction of the optimal configuration for each phase, Fig. 5b shows the percentage of phases for which each possible ranking of configuration was selected. This value is calculated by sorting the configurations by IPC for each phase and identifying which entry was selected by the predictor. For example, a value of 1 indicates that the best configuration was selected, 2 indicates that the second best configuration was selected, etc. This graph shows that in 64 percent of phases, the single best configuration is identified by the predictor. An additional 19 percent of phases have the second best possible configuration selected.
This evaluation shows that optimal configuration identification occurs at a higher rate than what might be expected from the error rate reported. The observed success rate can be partially attributed to the fact that the predictors tend to consistently overpredict or underpredict uIPC by similar margins across configurations for any given target phase. Therefore, the uIPC prediction error does not prevent correct ranking of configurations.
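The two accuracy metrics, and the reason a uniform bias leaves the ranking intact, can be shown with a small sketch. The function names and the numbers in the usage note are illustrative only and are not measurements from the evaluation.

```python
from typing import Dict, Tuple

Config = Tuple[int, int]  # (processors, threads per processor)


def absolute_prediction_error(predicted: float, observed: float) -> float:
    """|uIPC_pred - uIPC_obs| / uIPC_obs for one phase on one configuration."""
    return abs(predicted - observed) / observed


def selected_rank(predicted: Dict[Config, float],
                  observed: Dict[Config, float]) -> int:
    """Rank (1 = best) of the predictor's choice in the observed uIPC ordering."""
    chosen = max(predicted, key=predicted.get)
    ranking = sorted(observed, key=observed.get, reverse=True)
    return ranking.index(chosen) + 1
```

For instance, with hypothetical observed values `{(1, 1): 1.0, (2, 1): 1.6, (4, 2): 1.4}` and predictions that are all 30 percent too high, every absolute prediction error is about 0.3, yet `selected_rank` still returns 1 because scaling all predictions by the same factor preserves their order.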
As a result of the high configuration prediction accuracy, the performance loss in mispredicted regions is usually quite low. Fig. 5c shows the weighted performance loss observed for each benchmark during mispredicted phases. This value is calculated as $\sum_{i=1}^{N_B} w_i \cdot D_i$, where $N_B$ is the number of mispredicted regions in benchmark $B$, $w_i$ is the weight of each mispredicted region expressed as the percentage of the total parallel execution time of $B$ that the specific region accounts for, and $D_i$ is the absolute performance penalty suffered by the mispredicted region $i$. The average penalty across benchmarks is only 1.2 percent. The explanation for the negative performance loss (performance gain) of LU-HP is that by not changing configurations to the optimal in all cases, the cache effects of altering configurations are reduced. These results show that our model is capable of identifying the optimal configuration most of the time, and when it does not, it still manages to find a competitive configuration to use, with minimal performance penalty.
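The weighted performance loss defined above is a plain weighted sum over mispredicted regions. A minimal sketch, with a hypothetical function name and made-up input values, is:

```python
from typing import Iterable, Tuple


def weighted_performance_loss(regions: Iterable[Tuple[float, float]]) -> float:
    """Sum of w_i * D_i over the mispredicted regions of one benchmark.

    Each entry is (w_i, D_i): the region's share of the total parallel
    execution time and the performance penalty it suffered. A negative
    D_i models a region that ran faster than under the nominal optimum.
    """
    return sum(w * d for w, d in regions)
```

As an example, a benchmark with two mispredicted regions, one covering 10 percent of parallel time with a 5 percent penalty and one covering 2 percent with a 1 percent gain, has a weighted loss of 0.10 * 0.05 + 0.02 * (-0.01), which is slightly below half a percent. Negative contributions of this kind are how a benchmark such as LU-HP can end up with a net performance gain.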
Adaptive Concurrency Throttling Evaluation
To measure the power consumption of the benchmarks under various hardware configurations, we utilize a power measurement methodology based on HECs [19] , which has been proven to be highly accurate. This methodology works by first partitioning the processor into components and then determining the maximum power consumption of each component based on the die area that it consumes. The runtime power consumption of each component is the maximum power adjusted by an activity factor. The latter is estimated by looking at corresponding HECs. This amount is added to a nongated clock power associated with each component, which grows nonlinearly with activity. The power consumptions of all components are summed along with a constant base idle power. Additionally, we monitor the number of cycles during which the processor is halted, and we only charge an associated halted power in these cases. It should be noted that we focus only on processor power consumption. For the well-tuned scientific applications that we consider in this paper, processor power is the dominant portion of the total system power consumption [35] . Fig. 6 depicts the execution times and energy consumption of each benchmark under class size A for each static configuration. Static configurations use a single configuration for the entire execution. These graphs show that on our experimental platform, very little additional performance gain is seen through adding processors once two processors are active. The IS benchmark is particularly interesting, since it sees its best performance by using a single thread on only one processor. Furthermore, sometimes, there is a large gain through using the second execution context on each processor, but sometimes, there is a substantial loss. For these reasons, the adaptation of the number of processors and execution contexts stands to improve both execution time and power consumption.
It can be observed that although performance levels out, the energy consumption increases at rather steep rates with more processors. The reader may note that the observed scalability bottlenecks are an artifact of hardware bottlenecks such as limited memory bandwidth. Although this statement is correct, it also reflects a property of a large number of real systems, including state-of-the-art platforms that outdate our experimental system. For example, we performed experiments with the NAS benchmarks on a newly released quad-core Intel processor (Q6600), which have shown that applications still tend not to scale well on even the latest hardware. In particular, several of the benchmarks fail to scale beyond two cores, with maximum speedups saturating well below 2 (see Fig. 7 ). As a result, opportunities for concurrency throttling still exist, even in the newest hardware platforms.
Motivating Examples
As further evidence of the importance of phase-level adaptation, Fig. 8 displays the IPCs for each phase of the LU-HP benchmark at class size B under each static configuration normalized by the IPC of (1, 1). It is evident in the chart that a single application can have optimal configurations varying greatly between phases. LU-HP, in particular, experiences five different optimal configurations across different phases. Therefore, using a technique to execute each phase at its local optimal operating point stands to improve performance. In cases where the optimal configuration occurs on fewer than the available number of processing elements, power savings can occur during the execution of these phases. The goal of our adaptation approach is to exploit these properties with no a priori knowledge of the codes and achieve both power and performance benefits.
Offline Adaptation Strategies
Before discussing the online adaptive strategies and their results, we focus on two offline approaches to adaptation. The first of these, static optimal, uses the single programwide static configuration that results in the lowest execution time. The static optimal configuration for an entire program differs, in general, from the static optimal configurations of phases in a program. The second approach is phase optimal and uses the local static optimal configuration, not considering cross-phase effects, as defined earlier. Due to interference occurring by changing the configurations in phase optimal, the mean execution time of the benchmarks is 1.0 percent higher than static optimal. For this reason, we limit our following evaluation to comparing adaptive strategies to static optimal.
The two offline approaches that we consider have the disadvantage that the optimal configuration may change with different input sizes. For example, IS executes statically optimally on (3,1) for class size W, but (1,1) and (2,1) for class sizes A and B, respectively. For individual phases, the optimal configuration varies by problem size as well. Specifically, only 52.5 percent of the program phases in our benchmarks experience the same optimal configuration, regardless of the input size. This means that the use of these static techniques requires offline analysis that is specific to the application and the input size. In contrast, the online adaptive approaches adapt autonomically at runtime for the current application execution and require no application-specific/input-size-specific offline analysis.
Empirical Search-Based Strategies
For purposes of comparison, we have implemented two alternative dynamic adaptation strategies based on the empirical search of the configuration space at runtime. The first of these is the most straightforward form of adaptation, that is, exhaustive search, where each possible configuration is tested, and the one that provides the lowest execution time is selected for each phase. The second empirical search technique that we implemented is a heuristic search algorithm, which we have previously devised to reduce the overhead of exhaustive search [7] . This algorithm works by applying a hillclimbing heuristic search to find the optimal number of processing elements to use at each dimension of parallelism, one dimension at a time. The algorithm begins by executing the phase on all available processors, with all Hyperthreads active. Then, the number of processors is successively reduced until an increase in execution time is observed. The lowest number of processors that results in a decrease in execution time is used for the corresponding phase. This process is then repeated on the decided-upon number of processors to determine the number of Hyperthreads to use on each processor. Fig. 9 illustrates the normalized arithmetic means of three metrics: execution time, average power consumption during execution, and energy consumption. These metrics are derived for each benchmark under different execution strategies. Each metric is first normalized to the corresponding metric of the (4,2) configuration for the specific benchmark, which exploits all available execution contexts on our experimental platform. We then calculate the means of the metrics for each benchmark.
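The hillclimbing search described above can be sketched as follows. This is an illustrative reconstruction from the prose, not the implementation from [7]: the function names are hypothetical, `time_phase` stands in for timing one traversal of a phase on a given configuration, and ties are treated as "no improvement".

```python
from typing import Callable, Dict, Tuple

Config = Tuple[int, int]  # (processors, threads per processor)


def hillclimb(time_phase: Callable[[Config], float],
              max_procs: int = 4, max_threads: int = 2) -> Tuple[Config, int]:
    """Search one dimension at a time; return the choice and the test count."""
    timings: Dict[Config, float] = {}

    def timed(cfg: Config) -> float:
        # Each configuration is tested at most once (one phase traversal).
        if cfg not in timings:
            timings[cfg] = time_phase(cfg)
        return timings[cfg]

    def descend(start: int, make_cfg: Callable[[int], Config]) -> int:
        best = start
        timed(make_cfg(start))  # begin with all units of this dimension active
        for n in range(start - 1, 0, -1):
            if timed(make_cfg(n)) >= timed(make_cfg(best)):
                break  # execution time stopped decreasing
            best = n
        return best

    # First dimension: number of processors, with all Hyperthreads active.
    procs = descend(max_procs, lambda n: (n, max_threads))
    # Second dimension: Hyperthreads per processor on the chosen processors.
    threads = descend(max_threads, lambda t: (procs, t))
    return (procs, threads), len(timings)
```

On a platform with four processors and two threads per processor, this sketch tests at most five configurations and at least three, which is consistent with the iteration counts reported for hillclimbing in the evaluation.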
As can be seen from Fig. 9 , the average execution time of all benchmarks over all problem sizes using exhaustive search was reduced by 10.9 percent compared to statically using all available processors and execution contexts on the system. Power is reduced by 9.7 percent as well, resulting in a 19.5 percent reduction in the total energy consumption. However, this approach incurs high overhead in the exploration phase due to its testing of each configuration. Exhaustive search needs to execute eight iterations of each phase to reach a decision. This overhead shows up when the results are compared to using the optimal static number of threads for the entire program execution, where exhaustive search is outperformed by 16.1 percent overall and by 31.6 percent in benchmarks with a small number of iterations, that is, MG, CG, FT, and IS. However, for applications with many iterations (BT, SP, LU, and LU-HP), exhaustive search is able to come within 1.1 percent of the static optimal in terms of performance while reducing power consumption by 3.3 percent, because the search overhead can be amortized over a large number of iterations.
Using hillclimbing reduces the number of required test iterations for each phase to 5 in the worst case for our experimental platform and only 3 in the best case. This overhead reduction allows the hillclimbing algorithm to achieve improved performance compared to exhaustive search, because a larger percentage of the iterations will be executed with the decided-upon optimal configuration rather than testing additional suboptimal configurations. Specifically, compared to exhaustive search, hillclimbing achieves a 1.6 percent improvement in execution time overall and a 3.9 percent improvement for applications with few iterations, with a minor 0.5 percent increase in execution time for the applications with many iterations. The slight performance drop in applications with many iterations can be attributed to occasionally selecting slightly worse configurations than exhaustive search. Power consumption is reduced by 1.7 percent, and energy consumption is reduced by 3.6 percent, on the average, compared to exhaustive search. Compared to static optimal, hillclimbing reduces the performance loss to 26.5 percent for applications with a small number of iterations and to 13.9 percent overall. These results show that hillclimbing is able to reach good configuration decisions while requiring fewer exploration iterations, thus introducing less overhead. However, the search overhead is still a factor for applications with few iterations.
Occasionally, power consumption actually increases through the use of adaptation. This result is counterintuitive, since adaptation always maintains or reduces the number of processors and Hyperthreads used. In the large majority of phases, specifically 79 percent, deactivating Hyperthreading reduces power. However, in certain cases, the use of Hyperthreading causes severe destructive interference in the cache between coexecuting threads, which can increase stall time and, therefore, reduce dynamic power consumption [8] . Deactivating Hyperthreading in these cases increases power consumption by reducing stall time; however, energy is reduced due to the reduction in execution time.
Performance-Prediction-Based Adaptation
Through the use of performance prediction, the number of iterations required for adaptation can be further reduced using the algorithm presented in Section 4 to only two iterations in the case of our experimental platform. Furthermore, performance prediction reduces the effects due to changing configurations during the exploration process that can lead to suboptimal decisions by the direct-search strategies. On the downside, uIPC predictions need significantly more processor cycles than direct comparisons of the execution times of phases.
Fig. 9 . Performance of the adaptation strategies in terms of execution time (first group of bars), power (second group of bars), and energy (third group of bars) normalized with respect to the (4, 2) static configuration for each benchmark, averaged over all class sizes.
We first compare a strategy whereby the predicted optimal configuration for each phase is used to strategies that consider cross-phase analysis to make decisions. The best strategy is selected for use with ACTOR and is compared to the offline and direct-search approaches already presented. First, we evaluate our approaches to minimize the harmful effects of using the local optimal configuration for each phase, which occur if changes in the configuration of adjacent phases result in redistribution of working sets between caches [24] . Our experimental results, as shown in Fig. 10 , indicate that simply attempting to avoid cache interference is not inherently effective. Using an approach whereby the configuration selected as the best for the majority of execution time (that is, the dominant configuration) is enforced for all phases causes a slowdown of 1.5 percent compared to the local optimal approach, with an additional 0.9 percent energy consumption. This happens, because in many cases, the benefits of executing a phase with its local optimal configuration outweigh the performance loss suffered as a result of cross-phase interference.
Based on these results, we developed an intermediate adaptation scheme that uses a global dominant policy for most phases, excluding those expected to see substantial performance gains by using their own local optimal. In particular, using this approach, the global decision is enforced, unless a given phase expects at least a 15 percent performance gain, which we experimentally verified to be enough to outweigh the cache effects of changing configurations. When compared to phase-local adaptation, cross-phase decision making allowing for exceptions attains a 1.3 percent average performance improvement. An increase in power consumption of 2 percent is also observed; however, the energy consumption is unchanged, making this policy the best prediction-based adaptation strategy. These results show that concurrency throttling modules must consider the effects of changing configurations across phases, along with the local predictions for each phase.
Using cross-phase decisions while allowing exceptions results in an average 17.9 percent performance improvement over statically using all available execution contexts, further improving performance upon exhaustive search by 8.3 percent and upon hillclimbing by 6.8 percent. Additionally, the average performance loss compared to the statically optimal configuration is reduced to only 2.5 percent overall and 1.3 percent for applications with many iterations, showing that a flexible cross-phase decision policy is able to make performance-effective decisions. More importantly, the results for applications with a small number of iterations are within 3.7 percent of the statically optimal configuration compared to 31.6 percent and 26.5 percent for exhaustive search and hillclimbing, respectively, because of the significantly reduced exploration overhead. Our experimental platform has only eight feasible hardware configurations, and the performance advantage of ACTOR over the empirical search approaches is expected to grow in the future as the available number of processors, cores, and threads in a system rises.
The power-related results for ACTOR are just as substantial as those for performance. Energy consumption is the product of power consumption and execution time, and concurrency throttling attempts to reduce both, decreasing energy consumption by a still larger margin. We observe 10.8 percent and 26.7 percent reductions in power and energy consumption, respectively, compared to using all execution contexts. When compared to using the static optimal configuration, a 2.9 percent average reduction in power is seen, as well as a 0.9 percent reduction in energy. This result may seem surprising; however, it can be explained by the fact that the static optimal uses only a single configuration for the entire program execution rather than further decreasing the number of active processors for individual phases below the global optimal level.
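The relationship between the reported averages can be checked with a short calculation. The sketch below assumes that the 17.9 percent performance improvement can be read as a 17.9 percent reduction in execution time, and it applies E = P × T to the averages rather than to each benchmark, so it is only a rough consistency check. The function name is hypothetical.

```python
def energy_reduction(power_reduction: float, time_reduction: float) -> float:
    """Fractional energy reduction implied by E = P * T.

    Both arguments are fractional reductions (0.108 means 10.8 percent).
    """
    return 1.0 - (1.0 - power_reduction) * (1.0 - time_reduction)


# Assumption: performance improvement is treated as execution-time reduction.
estimate = energy_reduction(power_reduction=0.108, time_reduction=0.179)
print(f"{estimate:.1%}")  # prints 26.8%
```

The estimate of roughly 26.8 percent is close to the measured 26.7 percent. A small gap is plausible because the reported figures are averages over benchmarks, and the product of averages need not equal the average of per-benchmark products.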
ACTOR also sees a 1.1 percent reduction and a 0.8 percent increase in power consumption compared to exhaustive search and hillclimbing, respectively. Further tracing of this result shows that ACTOR executes the benchmarks with an average of 3.13 processors, whereas exhaustive search executes with 3.20, and hillclimbing executes with 3.02 processors. However, ACTOR reduces the total energy consumption by 10.2 percent and 6.3 percent, respectively, because of its performance advantages. These results indicate that prediction-based adaptation is able to make effective decisions both in terms of improving execution time and reducing energy consumption.
Overall, prediction-based adaptation outperforms or matches the performance of direct-search-based adaptation on all fronts. Additionally, it does not require application-specific/input-size-specific offline analysis while still achieving results very close to the static optimal for performance and better results for power and energy. Performance-prediction-based adaptation as utilized in ACTOR thus proves to be an effective strategy for improving the performance and energy consumption of parallel applications.
CONCLUSIONS
The performance and power characteristics of applications on emerging systems demand the consideration of throttling concurrency. In this paper, we have presented a novel approach to adaptive concurrency throttling that uses information collected at runtime to predict the performance of an application across various hardware configurations. By applying multivariate regression analysis to hardware event rates, DPAPP is able to characterize the performance and scalability of a given program phase. Our predictor allows for the online identification of performance-effective and energy-effective concurrency levels and thread placements while keeping the overhead at manageable levels. Over a range of multithreaded scientific benchmarks, the predictor was shown to be effective at locating the optimal configuration for each phase, with a low median error of 12.6 percent.
We also describe ACTOR, a new prediction-based adaptive concurrency throttling system, which we show to outperform adaptation strategies based on empirical searches of the configuration space due to reduced exploration overhead. We further optimize the system by introducing cross-phase awareness into the decision-making process, thereby allowing it to consider potential cache effects of changing configurations between phases. Adaptive concurrency throttling is shown to be significantly more effective than simply using all available execution contexts for all phases, with improvements of 17.9 percent in performance, 10.8 percent in power, and 26.7 percent in energy consumption. The use of ACTOR yields performance results comparable to offline-derived application-specific/input-size-specific decisions, without requiring additional application-specific/input-size-specific offline analysis.
