Abstract: With the dramatic increase in scale expected for Exascale computing, there is a pressing need to tune hardware configurations and software optimizations in unison. However, the expected increase in tunable hardware parameters makes searching the design space for optimal hardware-and-software configurations much more challenging.
I. INTRODUCTION
Conventional optimizations for high-performance computing have focused on software tuning (e.g., compile-time [1]-[3] and runtime optimizations [4], [5]). However, there is growing evidence of performance gains when hardware configurations and software approaches are optimized in unison [6]-[10]. Determining coherent hardware and software optimizations is challenging, partly because of the growth of the search space (i.e., permutations of tunable parameters); brute-force search for optimal hardware configurations and software tuning quickly becomes impractical [7].
We anticipate that this challenge will become much more difficult with the increase in scale for Exascale computing and the expected increase in reconfigurable hardware parameters for emerging architectures [11], such as CPU and memory frequency and number of cores.
A number of techniques have been proposed that are geared towards understanding application behavior so that the findings may be applied to achieve near-optimal performance of applications. These methods include analytical modeling [12]-[14], simulators [15]-[17], emulators [18], and co-design modeling techniques [19], [20].
In the context of Exascale computing, it is paramount that the methods predicting application behavior are accurate, scalable, portable, involve minimal overhead, and are easy to use [21]. While the aforementioned methods offer promising results towards the goals defined by the researchers, most approaches do not take into account all the Exascale computing goals listed above. Although the co-design modeling methods take several goals into account, their lack of automation and limited scalability leave room for improvement.
Towards this end, we propose Prometheus, a composable hardware-software optimization framework. As part of Prometheus, we develop a combination of analytical and machine-learning techniques to capture application characteristics. These application characteristics are subsequently used to determine the hardware-software configuration for near-optimal performance and energy improvements.
Prometheus uses Aspen [22], a domain-specific language for abstract application and machine representation and fast exploration of the optimization search space. We leverage Aspen's ability to represent application and machine behavior and develop an automated optimization-search-space exploration framework around Aspen.
We evaluate Prometheus for its efficacy using two widely used proxy applications: LULESH and CoMD [6], [23]. We demonstrate that Prometheus identifies near-optimal hardware-software configurations and verify the results via brute-force search of the design space.
We present the following contributions in this paper: 1) Analytical and empirical modeling techniques that capture application characteristics, highlighting the sensitivity of performance and energy to hardware parameters; 2) An automation framework that uses Aspen to represent abstract application and machine models, and thereby programmatically explore the tuning search space for the near-optimal performance gains and energy efficiency; and 3) Evaluation of Prometheus' efficacy on a local cluster as well as NERSC's Cori and discussion of relevant concerns using two proxy applications.
II. BACKGROUND & RELATED WORK
We classify the prediction methods into the following classes: analytical modeling, simulation, emulation, empirical-measurement-based modeling, and domain-specific languages. We compare these methods in terms of seven qualitative parameters:
• Accuracy: degree of error (prediction vs. empirical observation).
• Portability: ability to accommodate future architectures.
• Overhead: amount of time and computational effort.
• Scalability: valid results with strong and weak scaling.
• Automation: degree of user intervention required.
• Generality: applicability to a broad range of applications and hardware configurations.
• Ease of use: complexity of the method and degree of effort required to obtain predictions.
Analytical modeling methods (such as [2], [3], [13], [24]-[27]) describe the characteristics of the application and hardware using mathematical models and estimated behavior. These methods include but are not limited to statistical, machine-learning, and black-box modeling techniques. These models are typically developed using a training data set, and therefore their estimates are valid only within the range that the training data set covers. Thus, we lose confidence in the predictions when parameters are varied beyond the range of the training data.
Simulators (such as [16], [17], [28], [29]) provide good flexibility to incorporate changes in the architectures. However, they are costly to develop and use and incur high overheads. As with analytical models, confidence in their predictions is lost when they are used beyond the scope for which they were designed (e.g., scaling).
Like simulators, emulators (such as [15], [18], [30]) mimic the hardware behavior. Although emulators can provide much more accurate predictions, they lack portability and cannot estimate application behavior for new architectures until those architectures are implemented in the emulator.
Empirical modeling methods (such as [3], [13], [31]-[33]) use partial but direct measurements to predict application behavior. These methods use measurements such as execution time, communication costs, and values of various performance counters. Because the measurements are obtained directly from the underlying hardware, the confidence in the predictions is much higher. Some measurement tools exist that can partially automate the process. Such methods are portable if the underlying measurement infrastructure is portable.
Domain-specific languages and co-design frameworks [19], [20], [22], [34], [35] have been developed to explore the combined effects of hardware and software optimizations. The manner in which these frameworks are designed dictates their characteristics, such as accuracy, portability, overheads, scalability, automation, generality, and ease of use.
For example, Aspen [22] on its own is known to be portable and scalable. However, as with other co-design implementations, it requires manual configuration changes in abstract models to explore the design search space and run experiments.
In this paper, we build upon Aspen [22] to explore its composable nature and increase its accuracy by using modeling methods suited to each context. We develop a composable framework around Aspen that embeds application characterization and machine-learning components. We also automate the workflow around the framework to increase its ease of use and enable automatic search of the design space. This automation includes characterizing applications, generating and evaluating performance models, and finally generating abstract application and machine models to evaluate. These workflows are described in detail in §III.
Prometheus provides improved performance by representing the application and hardware using outcomes of machine-learning frameworks combined with Aspen abstractions. §III presents additional details.
• Automatic generation of abstract application and machine models using Aspen.
• Automatic reconfiguration of hardware components for exploration of tuning and performance evaluation.
• Modular and composable implementation of Aspen by integration of performance and energy measurement and modeling methods (in the form of Aspen+).
We use Aspen to represent application and machine models that allow us to explore the variation in configurations and the design space. Prometheus differs from Aspen in its level of automation and in the feature added to Aspen for exploring the hardware design space.
Because of the limited reconfigurable options (e.g., varying clock and memory frequency) in Aspen, we develop a framework that provides us with the ability to experiment with a wide range of significant hardware components in an abstract manner and analyze the outcomes on performance.
First, Prometheus classifies applications (e.g., compute, memory, or I/O bound). Based on the application characteristics, Prometheus determines the hardware components that affect the application performance and energy consumption by analyzing the hardware performance counters. Among the hardware components that influence performance and energy consumption, Prometheus selects the top contributors and analyzes the effects of varying their configurations.
The workflow of Prometheus involves three major steps, as shown in Figure 1 . These steps are:
• Application classification,
• Identifying and exploring the significance of hardware components using a supervised machine-learning algorithm, and
• Automatic application and machine modeling; runtime and energy estimation using Aspen+.
We describe these steps and components below.
A. Application Classification (Step-1)
Application classification (in terms of both performance and energy consumption) groups applications by their inherent characteristics (e.g., compute bound, memory bound). This high-level classification serves as a guideline for the supervised machine-learning algorithm (§III-B).
A number of techniques can be used to determine an application's execution pattern. We use the Roofline method [36] to relate the performance and operational intensity of the application to the platform's peak performance and memory bandwidth.
We use the Roofline model of energy [37] to capture the cost of performing an operation in terms of both time and energy consumption. Here we use the energy per FLOP and the energy per byte in addition to the operational intensity exhibited by the application.
This preliminary analysis of the application serves as a guideline for determining the class of application (i.e., whether the application is compute, memory, or I/O intensive) and therefore enables us to observe the relevant set of performance counters.
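The classification step can be illustrated with a minimal sketch of the Roofline bound, assuming only the standard relation that attainable performance is the smaller of peak compute and bandwidth times operational intensity. The machine numbers, the function names, and the simplified energy expression (operation energy plus data-movement energy, with no constant-power term) are assumptions for illustration, not values or code from Prometheus; I/O-bound behavior is not covered by this sketch.

```python
def attainable_gflops(peak_gflops, peak_bw_gbs, intensity):
    """Roofline bound: min(peak compute, bandwidth * operational intensity)."""
    return min(peak_gflops, peak_bw_gbs * intensity)


def classify(peak_gflops, peak_bw_gbs, intensity):
    """Label a kernel by comparing its intensity with the machine balance."""
    ridge = peak_gflops / peak_bw_gbs  # FLOPs per byte at the ridge point
    return "compute-bound" if intensity >= ridge else "memory-bound"


def energy_joules(flops, bytes_moved, joules_per_flop, joules_per_byte):
    """Simplified energy estimate (assumption: constant power is ignored)."""
    return flops * joules_per_flop + bytes_moved * joules_per_byte


# Hypothetical machine: 500 GFLOP/s peak, 50 GB/s memory bandwidth.
for oi in (0.5, 10.0, 40.0):
    print(oi, attainable_gflops(500.0, 50.0, oi), classify(500.0, 50.0, oi))
```

A kernel whose operational intensity falls below the ridge point is limited by memory bandwidth, which is the kind of signal used here to decide which performance counters to observe.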
B. Exploring Significance of Hardware Components (Step-2)
We develop a machine-learning model by identifying the hardware components that play a significant role in the application's execution using performance counter collection and analysis.
The steps involved in exploring the significance of the hardware components are, in summary:
1) Exploring the activity of hardware components using performance counters.
2) Short-listing significant hardware counters using supervised machine learning.
3) Mapping significant counters to their corresponding hardware components using machine-learning-based classification techniques.
The steps involved in finding the significance of hardware components are explained below in order of execution.
1) Metrics and Counters:
To identify the significant hardware components, we consider the performance counters relevant to the class of application identified in §III-A. In this work, we use LIKWID [38] to collect performance counters. LIKWID is an open-source tool that enables collection of both core and un-core events of an application's run. In Prometheus we use preset counter groups such as UOPS, UOPS_RETIRED, UNCORECLOCK, UOPS_EXEC, FLOPS_DP, and CYCLE_ACTIVITY.
2) Generation of Supervised Machine Learning Model:
The next step is to determine the significance of each collected counter group with respect to runtime and energy consumption. We use a supervised, regression-based machine-learning algorithm to find significant counters. Several conditions need to be fulfilled for machine-learning-based modeling to be effective, e.g., satisfying confidence-interval criteria, avoiding model over- and under-fitting, choosing an appropriate model for the data set, and explaining the outliers. After fulfilling these conditions, we generate supervised machine-learning models for runtime and energy consumption using equation 1, where the outer summation indicates repetition of measurement for statistical rigor and to gather median and quartile statistics.
In equation 1, β is the performance counter coefficient, X is the performance counter type, α is the intercept, F(X) is the estimated energy value, n is the iteration count (the number of samples of performance counters), m is the number of hardware counters, and Z represents the performance counters that are non-zero even when the system is idle, e.g., clock frequency.
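As an illustration of regression-based counter ranking, the following sketch fits an ordinary linear model with an intercept α and one coefficient β per counter, then orders counters by the magnitude of their standardized coefficients. Equation 1 is not reproduced in this text, so the plain least-squares form, the function names, and the sample values are all assumptions; the counter names are reused from the list above purely as labels. Counters that are constant across samples (such as the idle term Z) would need to be removed first, because standardization divides by the standard deviation.

```python
from statistics import mean, pstdev


def standardize(column):
    """Scale a column to zero mean and unit variance so coefficients are comparable."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rank_counters(samples, target, names):
    """Fit target ~ alpha + sum_j beta_j * X_j and rank counters by |beta_j|."""
    cols = [standardize([row[j] for row in samples]) for j in range(len(names))]
    design = [[1.0] + [c[i] for c in cols] for i in range(len(target))]
    k = len(names) + 1
    # Normal equations: (D^T D) coef = D^T y
    dtd = [[sum(r[p] * r[q] for r in design) for q in range(k)] for p in range(k)]
    dty = [sum(r[p] * y for r, y in zip(design, target)) for p in range(k)]
    coef = solve(dtd, dty)
    alpha, betas = coef[0], coef[1:]
    return alpha, sorted(zip(names, betas), key=lambda nb: -abs(nb[1]))
```

Standardizing the counters first is a design choice of this sketch: it makes the coefficients comparable across counters with very different magnitudes, so their absolute values can serve as a rough significance score.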
3) Mapping of Significant Counters to Hardware Components:
After shortlisting significant counters, we map them to their corresponding hardware components using classification analysis. We use a supervised machine-learning algorithm, the fuzzy c-means clustering algorithm [39], to classify performance counters. We classify counters based upon their characteristics or type.
Some counters may correspond to multiple hardware components; for example, L3_to_MEM_BW and L3_to_MEM_Data_Volume correspond to the L3 caches as well as to DRAM bandwidth. Fuzzy c-means allows clusters to overlap (mathematical details can be found in [39]) and therefore works best for our experiments. After a mapping of significant counters to hardware components is generated, we pass this information to Aspen.
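The overlapping-membership idea can be sketched with a small, generic fuzzy c-means implementation using the standard update rules (membership-weighted centers, and memberships derived from relative distances). This is an illustrative sketch, not the Prometheus code: the function name, the random initialization, the fixed iteration count, and the use of plain feature vectors for counters are assumptions.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, fuzziness=2.0, iterations=100, seed=0):
    """Return (centers, memberships); memberships[k][i] is the degree to
    which point k belongs to cluster i, and each row sums to 1."""
    rng = random.Random(seed)
    # Start from random memberships normalised so that each row sums to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    dim = len(points[0])
    exponent = 2.0 / (fuzziness - 1.0)
    centers = []
    for _ in range(iterations):
        # Update centers as membership-weighted means of the points.
        centers = []
        for i in range(n_clusters):
            weights = [u[k][i] ** fuzziness for k in range(len(points))]
            total = sum(weights)
            centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
            )
        # Update memberships from the distances to the new centers.
        for k, p in enumerate(points):
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            u[k] = [
                1.0 / sum((dists[i] / dj) ** exponent for dj in dists)
                for i in range(n_clusters)
            ]
    return centers, u
```

A counter whose feature vector lies between two clusters receives a substantial membership in both, which is how a counter such as L3_to_MEM_BW could be associated with more than one hardware component.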
C. Application and Machine Modeling (Step 3)
After we have identified all significant hardware components, we prune the design space by varying configurations of significant hardware components using Aspen's modeling framework.
1) Automatic Application Model Generation using Compass in Aspen+:
To be evaluated in Aspen, an application and its target hardware must be expressed in Aspen's grammar by abstracting their characteristics. One automated approach is presented in Compass [40]: Lee et al. [40] use the OpenARC compiler to generate an abstract application model from source code. An abstract application model can also be written manually by analyzing the control flow and memory access patterns of the application.
2) Automatic Machine Model Generator in Aspen+:
We use a combination of Linux kernel and user-level functions to extract important hardware configurations, e.g., the number and type of sockets and cores, clock and memory frequencies, memory bandwidth, latency, capacities, and hierarchies. We extract and sort the configuration in a hierarchy similar to that presented in [22], insert Aspen annotations at appropriate positions, and test the machine model for correctness and accuracy.
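One possible way to gather part of this information on Linux is to parse `/proc/cpuinfo`. The sketch below summarizes the socket and core hierarchy from that text; it is an assumption about how such extraction might look, it does not emit Aspen annotations, and the function names and the returned dictionary layout are hypothetical.

```python
from pathlib import Path


def parse_cpuinfo(text):
    """Summarise /proc/cpuinfo-style text as sockets, cores per socket, and model name."""
    sockets = {}
    model = None
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        model = model or fields.get("model name")
        socket = fields.get("physical id", "0")
        # Hardware threads that share a core id are counted once per socket.
        sockets.setdefault(socket, set()).add(fields.get("core id", fields["processor"]))
    return {
        "model": model,
        "sockets": len(sockets),
        "cores_per_socket": {s: len(c) for s, c in sorted(sockets.items())},
    }


def read_machine_summary(path="/proc/cpuinfo"):
    """Read the description of the running system (Linux only)."""
    return parse_cpuinfo(Path(path).read_text())
```

Keeping the parser separate from the file read makes the hierarchy extraction easy to check against saved text from another machine.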
3) Varying Hardware Configurations in Aspen+: Abstract Machine Model Configurator:
Once we have automatically generated the abstract machine model, we can make changes in the hardware configuration using the Abstract Machine Model Configurator (AMMC). Using AMMC, we change the configuration of the significant hardware components identified as an outcome of the classification tree in the previous step.
AMMC changes the configuration between minimum and maximum range using either of the following two ways:
• Using the system specification. For example, to find the reconfigurable range of clock frequency, we can use Linux functions; our experimental system can change its frequency between 1200 MHz and 3500 MHz.
• Using statistical methods to include the manufacturer's provided values and manual variations to explore a range. Doing so ensures that the minimum and maximum values are within the manufacturer specifications.
Using the significant hardware components identified as part of the above process, we change the configuration of each hardware component in isolation as well as in combinations of two hardware components.
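The exploration pattern described above (each component in isolation, then every pair of components) can be sketched as follows. The function name, the dictionary-based configuration, and the candidate values other than the 1200 MHz and 3500 MHz clock limits are hypothetical; this is not the AMMC implementation.

```python
from itertools import combinations, product


def enumerate_configs(baseline, candidates):
    """Yield configurations that vary one component, then every pair of components.

    baseline   -- dict of component name -> default value
    candidates -- dict of component name -> list of values to try
    """
    # Vary each significant component in isolation.
    for name, values in candidates.items():
        for value in values:
            yield {**baseline, name: value}
    # Vary every pair of significant components together.
    for a, b in combinations(candidates, 2):
        for va, vb in product(candidates[a], candidates[b]):
            yield {**baseline, a: va, b: vb}


# Hypothetical baseline and candidate values for three components.
baseline = {"cpu_mhz": 3500, "mem_mhz": 1600, "cores": 8}
candidates = {"cpu_mhz": [1200, 3500], "mem_mhz": [800, 1600], "cores": [4, 8]}
configs = list(enumerate_configs(baseline, candidates))
print(len(configs))
```

Restricting the sweep to single components and pairs keeps the number of evaluated models far below that of a full cross product over all components.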
4) Calculating Runtime and Energy Consumption for Application Execution via Aspen:
As we consider runtime and energy consumption as performance metrics, we use the original runtime modeling framework proposed by Aspen.
a) Runtime Calculation in Aspen:
The runtime calculating code in Aspen takes an application model and a machine model as input. Aspen generates a symbolic expression for how long the application will take to run for a given machine, and then substitutes all of the hardware and application parameters and subsequently evaluates the expression to calculate the runtime using a throughput-based analytical model. More details can be found in Aspen by Spafford et al. [22] .
b) Energy Consumption Calculation in Aspen:
The energy consumption calculation and evaluation is based on [41], [42], which propose activity factors to determine the energy consumption of the system. The model of Gamell et al. [42] comprises the energy consumed by the processor and by memory. More details on the mathematical model can be found in [41], [42].
After following the Prometheus workflow, we test the application in the Aspen framework with the configurations of all significant hardware components identified by the machine-learning model. All significant hardware configurations are changed in isolation and in combinations of two hardware components until all combinations are covered. When the Prometheus workflow completes, the user receives a summary that identifies the hardware component to which runtime and energy are most sensitive.
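The final summary can be pictured as a ranking of components by sensitivity. The sketch below assumes a normalized sensitivity defined as the relative change in the output (runtime or energy) divided by the relative change in the input parameter; that definition, the function names, and the sample numbers are assumptions for illustration rather than the formula used by Prometheus.

```python
def normalized_sensitivity(x_base, x_new, y_base, y_new):
    """Relative change in the output divided by relative change in the input."""
    return ((y_new - y_base) / y_base) / ((x_new - x_base) / x_base)


def rank_by_sensitivity(y_base, runs):
    """Rank components by |sensitivity|.

    runs maps a component name to (x_base, x_new, y_new), where y_new is the
    predicted runtime or energy after changing that component alone.
    """
    scores = {
        name: normalized_sensitivity(x_base, x_new, y_base, y_new)
        for name, (x_base, x_new, y_new) in runs.items()
    }
    return sorted(scores.items(), key=lambda kv: -abs(kv[1]))


# Hypothetical predictions: baseline runtime of 100 s.
runs = {
    "cpu_mhz": (2000, 2400, 85.0),
    "mem_mhz": (1600, 1920, 97.0),
    "cores": (8, 16, 60.0),
}
ranking = rank_by_sensitivity(100.0, runs)
print(ranking[0][0])
```

A negative score means the output decreases as the parameter increases, so for runtime and energy the most negative scores mark the most beneficial changes.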
IV. RESULTS AND ANALYSIS
One strength of Prometheus is its ability to model the behavior of an application on varying hardware configurations and to use this information to optimize the performance design space for current and future architectures.
To study, develop, and evaluate Prometheus, we developed an abstract machine model in Aspen based on Intel's Haswell processor. Each node in the abstraction contains an 8-core 64-bit Intel Haswell processor with 32 GB of system memory. Nodes are interconnected by an InfiniBand network. The abstract machine model implements a multi-core processor with SIMD capability, out-of-order execution, and latency-hiding capabilities. The validation results are presented in §V.
A. Proxy Applications
We analyzed LULESH and CoMD in this paper. LULESH explores the performance of parallel static unstructured mesh applications [23] . It solves the hydrodynamics equation by separating the spatial problem space, using hexahedral mesh generations.
CoMD is a simulation code [6] that calculates the potential among elements. A few significant factors dictating the performance of CoMD are setting up a distance, calculating the potential within that distance, and maintaining a specific number of atoms within a cell under observation.
B. Analysis and Discussion
In this section, we present the results of varying hardware configurations in terms of sensitivity analysis and scalability testing in single-node and multi-node settings. Sensitivity is calculated as the variation in the output parameter corresponding to a change in the input parameter. We present normalized sensitivity results for runtime and energy improvements. For each application, we do the following:
• We investigate the sensitivity of runtime and energy to all the significant hardware components (short-listed by the machine-learning algorithm in Prometheus).
• We present scalability (strong and weak) analysis for the hardware configurations that provide the most sensitivity towards runtime and energy improvement.
• We validate the results of system clock and memory bandwidth on current high-performance computing systems.
• We validate the scalability results on NERSC's Cori system.
1) Case Study-1: LULESH:
LULESH exhibits both compute-intensive and memory-intensive properties; however, it tends towards the memory-intensive group on our experimental system. We followed the Prometheus workflow by first applying the Roofline model [36] and the Roofline model of energy [37] to characterize the application, and then collecting hardware performance counters using LIKWID. We applied the machine-learning algorithm to the training set of performance counters to identify the significant counters. The significant counters are mapped to their corresponding hardware components using the abstract machine model configurator. We changed the configurations of the significant hardware components in Aspen's abstract machine model. Figure 2 shows the effect on runtime and energy of changing the significant hardware components for LULESH, in isolation as well as in combinations of two components.
a) Sensitivity towards runtime and energy improvement:
Figure 3 shows that runtime is most sensitive to system clock and number of cores, individually and in combination. The hourglass kernel is most sensitive to system clock frequency and the stress calculation kernel is most sensitive to the number of cores; these two are among the most influential kernels in the overall application execution. Therefore, LULESH shows improved performance with higher system clock frequency and a larger number of cores. Figure 2 shows that energy efficiency is sensitive to multiple hardware components in isolation, i.e., system clock, memory clock, and number of cores. Energy consumption is also sensitive to the combination of system clock and number of cores and to the combined configuration of system and memory clock. As with runtime, the hourglass calculation, element connectivity, and stress calculation are the most energy-consuming kernels. Each of them has different computation and memory access characteristics, which is why a number of hardware configurations can be used to improve the energy efficiency.
b) Scaling effects for runtime and energy improvement:
Figure 3 shows the strong and weak scaling effects of changing the system clock and the number of cores in combination (identified as the configuration that provides the improved performance) on the runtime and energy of LULESH. Execution time decreases as the number of nodes increases, which means there is considerable opportunity to improve both runtime and energy through parallelism. However, LULESH's runtime does not improve under weak scaling.
Since LULESH performs all the calculations locally, it benefits less from weak scaling. We use the MPI version of LULESH; we believe that a hybrid MPI-OpenMP version would show degraded weak scaling, as more data movement would be needed to avoid race conditions in communication-critical kernels, e.g., the hourglass and stress computations.
2) Case Study-2: CoMD:
a) Sensitivity towards runtime and energy improvement:
Figure 4 shows the sensitivity analysis for runtime. An increased number of cores in isolation, and system clock and memory bandwidth in combination, show a major improvement in runtime. There are two major runtime-dominant components in CoMD: the potential calculation within a node and the communication between nodes. For the potential calculation within a node, potential, velocity, and force are organized as 3-dimensional data, which requires significant memory accesses. Therefore, increasing memory bandwidth together with clock frequency enables more data accesses at a faster rate and, consequently, faster intra-node potential, velocity, and force calculations.
The sensitivity of energy to the hardware components is shown in Figure 4. Energy consumption is most sensitive to memory clock as well as to the combined configuration of system clock and memory clock. The intra-node computations (velocity and position calculations) are among the most time-consuming and energy-consuming components in CoMD. The optimal hardware configuration proposed by Prometheus is an increased memory clock frequency. Increasing the memory clock frequency raises the rate of memory operations, which improves the overall execution time.
b) Scaling effects for runtime and energy improvement:
The scaling effects with the best hardware configuration, i.e., an increase in system clock and memory bandwidth together, are shown in Figure 5. Inter-node computation is a significant component of CoMD execution; therefore, execution time improves as the number of nodes increases.
Weak scaling has no significant effect on CoMD. Because the number of messages transmitted during each iteration grows with the number of processes, communication overhead increases, which raises runtime and energy consumption.
V. VALIDATION RESULTS
We performed three sets of validation experiments. The first two, varied system clock and throttled memory bandwidth, were run on a local cluster composed of two physical CPUs, each containing an 8-core 64-bit Intel Haswell processor (Xeon E5-2637) at 3.50 GHz.
The third, strong and weak scaling, was run on the Cori system at Lawrence Berkeley National Laboratory's NERSC. We used the first phase of Cori, consisting of 1632 dual-socket compute nodes, each with two 2.3 GHz 16-core Intel Haswell processors and 128 GB of DRAM.
A. Varied clock bandwidth validation results
We enabled DVFS using Linux user-level and kernel-level commands. We varied the system clock frequency within the range provided by the system; our system supports clock frequencies between 1200 MHz and 3500 MHz. Table I shows the cumulative effect of varying clock frequency on LULESH and CoMD. The measured results closely match the results predicted by Prometheus. The measured values are always lower than the predicted values, since Prometheus ignores the runtime overhead.
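On Linux systems that expose the cpufreq interface, the supported frequency range can be read from sysfs. The sketch below shows one way to do so; the function name is hypothetical, the files `cpuinfo_min_freq` and `cpuinfo_max_freq` report values in kHz, and whether this matches the commands used in the experiments is an assumption.

```python
from pathlib import Path


def cpufreq_range_mhz(cpu=0, root="/sys/devices/system/cpu"):
    """Return (min_mhz, max_mhz) for one CPU from the cpufreq sysfs files.

    The files report kHz, so the values are divided by 1000.
    The root argument exists so the function can be pointed at a test directory.
    """
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    lo_khz = int((base / "cpuinfo_min_freq").read_text())
    hi_khz = int((base / "cpuinfo_max_freq").read_text())
    return lo_khz / 1000, hi_khz / 1000
```

On a system like the one described above, such a query would be expected to return the 1200 MHz and 3500 MHz limits; setting a frequency additionally requires root privileges and a suitable governor.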
B. Memory bandwidth throttling validation results
Our approach to memory throttling is inspired by the Gremlin emulator [18], a tool that emulates resource restrictions for the architecture of next-generation HPC systems. It provides four classes of resource restriction: power, memory, resilience, and noise.
We tested our system under constrained memory bandwidth using the STREAM benchmark [43]. We ran STREAM simultaneously with the proxy application under observation (LULESH, CoMD) to constrain the memory bandwidth available to it. We also ran varying numbers of STREAM instances to throttle the bandwidth in a piecewise manner. Table I shows the comparison between measured and predicted sensitivities towards memory bandwidth.
The measured values are higher because the model cannot capture the system noise and other system variances that come into effect during measurement.
C. Validation of scalability results for runtime on Cori
We tested scaling results on NERSC's Cori system. We used phase-1 of Cori to test the scalability of the proxy application (Figure 6). We tested CoMD for strong and weak scaling. CoMD's execution time decreases as the number of processes increases, which validates the sensitivity predicted for strong scaling (Figure 5). CoMD's runtime does not improve with weak scaling, as predicted by Prometheus (Figure 5).
VI. CONCLUSION
Hardware resource optimization holds the same importance and priority as application optimization as we move towards the exascale era. It is crucial to understand the effect of hardware components on an application's execution, as it plays a key role in the overall performance improvement of the application and system together. We present Prometheus, a methodology that uses the application signature as a guideline to identify hardware bottlenecks, measure the sensitivity of significant hardware components, and configure selected hardware components to show the effect on application performance and energy consumption. Prometheus encapsulates a combination of analytical and machine-learning techniques, and uses the Aspen DSL as a modular and composable tool to extract the benefits of the various modeling techniques, e.g., Roofline and supervised machine learning. The Aspen DSL also helps us change and test hardware configurations with applications and show their effect on LULESH and CoMD.
