Modeling the performance of scientific applications on emerging hardware plays a central role in achieving extreme-scale computing goals. Analytical models that capture the interaction between applications and hardware characteristics are attractive because even a reasonably accurate model can be useful for performance tuning before the hardware is made available. In this paper, we develop a hardware model for Intel's second-generation Xeon Phi architecture, code-named Knights Landing (KNL), for the SKOPE framework. We validate the KNL hardware model by projecting the performance of minibenchmarks and application kernels. The results show that our KNL model can project the performance with prediction errors of 10% to 20%. The hardware model also provides informative recommendations for code transformations and tuning.
facets of an application on a target system. Approaches for modeling the performance range from discrete-event simulations to queuing models. Analytical modeling encapsulates varying degrees of interaction between applications, system software, and architecture by using closed-form expressions to predict performance metrics. Performance profiling tools (e.g., TAU [7], HPCToolkit [5]) focus mostly on the performance characteristics of a given implementation. These tools rely on the actual execution of the workload on the hardware. Architecture simulators can reveal the performance responses of various hardware configurations, but they treat workloads as black boxes and are time consuming. Application performance models [3, 9] summarize the asymptotic performance characteristics. They can estimate performance bounds at a coarse granularity; however, they express only the characteristics of a given implementation without capturing the internal relationship between the control flow and the data flow.
Recently, performance projection frameworks have been used to reduce the amount of human effort and to streamline the performance projection process [6, 8]. They try to overcome the limitations of general-purpose performance tools and ad hoc modeling practices. These frameworks abstract the workload's behavioral properties and the hardware characteristics and combine both to project performance. The workload properties are data flow, control flow, computation intensity, concurrency, and communication patterns. The hardware characteristics are the available execution units, instruction set, order of execution, memory hierarchy, network, and I/O.
In this paper, we focus on the SKOPE [1] performance projection framework. Given a formalized description of the workload's performance behavior, SKOPE automatically analyzes, tunes, and projects the workload's performance for a given parameterized target hardware. The frontend of SKOPE is a code skeleton language, which provides a uniform description of the semantic behavior of a workload. According to the semantics and the structures in the code skeleton, the backend explores various transformations, synthesizes the performance characteristics of each transformation, and evaluates each transformation with various hardware models.
The SKOPE hardware models perform incremental instruction scheduling. This was sufficient to model in-order architectures such as the IBM Power A2 or Nvidia GPUs; however, most current-generation processors schedule instructions out of order and execute them using multiple, not necessarily symmetric, instruction pipelines. In this paper, we present the SKOPE extension to model these architectural features. We present the KNL hardware model as a case study for validation. Our key contributions are:
• We extend the SKOPE language definition and its parser to enable data dependency specification at the control statement level.
• We develop a data dependency analysis and a scheduling algorithm. The analysis produces a dependency graph by identifying write-after-read, read-after-write, and write-after-write dependencies. The scheduling algorithm applies a critical-path algorithm to the graph and schedules the instructions on multiple pipelines.
• We develop a KNL analytical hardware model using the publicly available data on the KNL processor.
• We demonstrate the effectiveness of the new models by projecting the performance of minibenchmarks and of microkernels. We achieved prediction accuracy between 80% and 90%.
• We describe the procedure to measure the latency at the granularity of a few cycles, which can be used on other architectures.
The paper is organized as follows. An overview of SKOPE and the extensions to the SKOPE hardware model are described in Section 2. In Section 3, the KNL-specific parameters used for the projection are discussed, and the hardware model is validated by comparing its projections with the performance measured on the hardware. In Section 4, we summarize the work and then discuss future work.
BENCHMARKS AND SKOPE EXTENSIONS FOR KNL
The HACC framework uses N-body techniques to simulate the formation of structures under the influence of gravity in the universe.
HACCmk is a key routine that calculates the particle force with an O(N^2) algorithm. The source code and the skeleton of HACCmk are shown in Listings 1 and 2, respectively. Nek5000 is a high-order, incompressible Navier-Stokes solver based on the spectral element method. For our study, we use the two kernels mxf12 and glsc3i from Nekbone, a simplified version of Nek5000. These are a matrix multiplication kernel with a 12x12 size inner product and a vector dot product kernel, respectively.
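To illustrate only the O(N^2) loop structure that such a force routine exhibits, the following minimal sketch accumulates a generic softened gravitational force over all particle pairs. It is not the HACCmk source; the function name `pairwise_forces`, the softening parameter `eps`, and the force law are assumptions made for illustration.

```python
def pairwise_forces(pos, mass, eps=1e-3):
    """Accumulate a softened inverse-square force on every particle.

    pos  : list of [x, y, z] coordinates (hypothetical input layout)
    mass : list of particle masses
    eps  : softening length (assumption, avoids division by zero)
    The double loop over i and j is what makes the cost O(N^2).
    """
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps * eps
            inv_r3 = r2 ** -1.5
            for k in range(3):
                forces[i][k] += mass[j] * dx[k] * inv_r3
    return forces
```

The inner loop body is dominated by floating-point multiplies, adds, and a reciprocal power, which is why a kernel of this shape is compute-intensive rather than memory-bound.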
SKOPE
SKOPE is a performance projection framework. Given a formalized description of the workload's performance behavior, SKOPE automatically analyzes, tunes, and projects the workload's performance for a target hardware. The SKOPE language [1] is the front-end of the framework. Its syntax allows the modeler to specify how input data may affect the control flow and the data flow. The key aspect of the SKOPE language is that it expresses what the workload needs to do algorithmically, without specifying how it is done in the current implementation. The resulting description is referred to as a code skeleton. The SKOPE framework has been used to model not only parallel applications and parallel architectures [2] but also distributed workflows [4]. To model the KNL architecture, which uses out-of-order execution on multiple instruction pipelines, we have extended SKOPE with data dependency analysis and a scheduling algorithm.
The dependency analysis module checks for read-after-write, write-after-read, and write-after-write dependencies. In fp, xp, and fma statements, the dependencies are expressed as 3-tuples (dependent variable, ":", a list of independent variables). For each BST, based
For out-of-order scheduling with multiple pipelines, we use the dependency analysis graph G, the latency of each node n_i in G, and the number of execution units on the hardware. First, the unweighted G is converted to a weighted G by assigning to each edge e_ij a weight w_ij equal to the execution latency of the node n_i. The source n_i and the destination n_j of an edge e_ij are called the parent node and the child node, respectively; a node without a parent is called an entry node; a node without a child is called an exit node. A node cannot start execution before all its parent nodes have finished execution. The objective is to assign the nodes of G to the execution units such that the total schedule time is minimized without violating the dependencies. A schedule is efficient if the execution units are used without idle cycles. We adopt critical path and list scheduling heuristics on G for out-of-order scheduling on multiple pipelines.
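The dependency analysis and the critical-path list scheduling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SKOPE implementation: statements are given as hypothetical (written variable, list of read variables) pairs in program order, each execution unit is assumed to be occupied for the full latency of a node, and all function names are invented for this example.

```python
from collections import defaultdict

def build_dependency_graph(stmts):
    """stmts: list of (written_var, [read_vars]) in program order.
    Returns edges as a mapping: parent index -> set of child indices."""
    edges = defaultdict(set)
    for j, (w_j, reads_j) in enumerate(stmts):
        for i in range(j):
            w_i, reads_i = stmts[i]
            raw = w_i in reads_j   # read-after-write: j reads what i wrote
            war = w_j in reads_i   # write-after-read: j overwrites what i read
            waw = w_i == w_j       # write-after-write: both write the same name
            if raw or war or waw:
                edges[i].add(j)
    return edges

def critical_path_lengths(n, edges, latency):
    """Longest latency-weighted path from each node to an exit node.
    Edges only point from lower to higher indices, so reverse index
    order is a valid reverse topological order."""
    rank = [0] * n
    for i in reversed(range(n)):
        rank[i] = latency[i] + max((rank[c] for c in edges.get(i, ())), default=0)
    return rank

def list_schedule(n, edges, latency, num_units):
    """Greedy list scheduling: among ready nodes, pick the one with the
    longest remaining critical path and place it on the earliest-free unit.
    Returns (total schedule length in cycles, finish cycle per node)."""
    rank = critical_path_lengths(n, edges, latency)
    parents = defaultdict(set)
    for p, children in edges.items():
        for c in children:
            parents[c].add(p)
    finish = {}                  # node -> cycle at which its result is ready
    unit_free = [0] * num_units  # cycle at which each unit becomes free
    remaining = set(range(n))
    while remaining:
        ready = [v for v in remaining if all(p in finish for p in parents[v])]
        v = max(ready, key=lambda x: (rank[x], -x))
        u = min(range(num_units), key=lambda k: unit_free[k])
        start = max([unit_free[u]] + [finish[p] for p in parents[v]])
        finish[v] = start + latency[v]
        unit_free[u] = finish[v]
        remaining.remove(v)
    return max(finish.values(), default=0), finish
```

For example, the four statements `[("a", ["x"]), ("b", ["y"]), ("c", ["a", "b"]), ("a", ["z"])]` with latencies `[2, 2, 3, 1]` can be passed to `build_dependency_graph` and then to `list_schedule` with one or two units to compare the resulting schedule lengths. Treating a unit as busy for the whole latency is a simplification; a pipelined unit that accepts a new instruction every cycle would need a separate issue-slot model.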
VALIDATION ON THE KNL HARDWARE
We use the KNL hardware model in SKOPE to project the performance of compute kernels. The first challenge involves validating the results when KNL is not yet available. Since the architecture simulator is not public, we validated the model on the KNL hardware once it became available. The second challenge is related to accurate benchmarking. Since a single call of a compute kernel consumes only a few thousand cycles, accurately measuring the time is important. We developed a low-overhead microbenchmarking methodology to accurately measure the cycle and instruction counts. We validated the hardware model using microbenchmarks first and then using the HACCmk and Nekbone compute kernels.
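One common way to keep measurement overhead from distorting short kernel timings is to estimate the cost of the timer itself, repeat the measurement, and keep the minimum. The sketch below illustrates that general idea only; it is an assumption and not the methodology used in this work. In particular, it uses Python's wall-clock `time.perf_counter_ns` as a stand-in, whereas cycle and instruction counts would be read from hardware counters. The names `timer_overhead_ns` and `measure_ns` are hypothetical.

```python
import time

def timer_overhead_ns(samples=1000):
    """Smallest observed gap between two back-to-back timer reads."""
    best = None
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        delta = t1 - t0
        best = delta if best is None else min(best, delta)
    return best

def measure_ns(kernel, repeats=100, warmup=10):
    """Minimum elapsed time of kernel() with the timer overhead removed.

    Warm-up calls are discarded so that first-touch effects do not
    dominate; the minimum over repeats filters out interference.
    """
    for _ in range(warmup):
        kernel()
    overhead = timer_overhead_ns()
    best = None
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        kernel()
        t1 = time.perf_counter_ns()
        delta = t1 - t0
        best = delta if best is None else min(best, delta)
    return max(best - overhead, 0)  # clamp in case overhead exceeds the sample
```

A call such as `measure_ns(lambda: sum(range(1000)))` returns a nonnegative integer number of nanoseconds; the value itself depends on the machine.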
HACCmk and Nekbone Kernel Validation
We use the code skeletons of the HACCmk and Nekbone kernels as input to the KNL hardware model to derive performance projections. We validate the projected performance by running these benchmarks on the KNL hardware.
For a given code skeleton, the abstract execution model in SKOPE experiments with different code transformations and picks the best code transformation heuristically. One such code transformation is loop unrolling. Table 1 shows the projected cycles for the HACCmk microkernel with three input sizes using different unroll factors from 1 to 16. The hardware model picks four as the best unroll factor based on the analysis of the number of stalled cycles in the execution and the number of data elements (representing the registers on the hardware) saved. Next, we compare the cycles projected by the hardware model using the unroll factor of four with the measurement on the hardware. Table 2 shows the cycles measured on the hardware and the corresponding prediction error. In Figure 2, we show the percentage of prediction error as a function of the problem size. We observe that the error ranges from 21.5% for a smaller problem size to 9.8% for a larger problem size. These reasonably high prediction accuracies validate the KNL hardware model, and the unroll factor predicted by the hardware model concurs with the loop unroll factor selected by the compiler/hardware.
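The relation between projected cycles, measured cycles, and the reported error percentages can be made concrete with a small helper. The sign convention used here, (projected - measured) / measured, is an assumption, since the text does not define it; the function names and the numbers in the usage note are hypothetical and are not taken from the tables.

```python
def prediction_error_pct(projected_cycles, measured_cycles):
    """Signed relative error of the projection, in percent.

    Assumed convention: negative values mean the model projected
    fewer cycles than were measured on the hardware.
    """
    if measured_cycles <= 0:
        raise ValueError("measured_cycles must be positive")
    return 100.0 * (projected_cycles - measured_cycles) / measured_cycles

def prediction_accuracy_pct(projected_cycles, measured_cycles):
    """Accuracy as the complement of the absolute error, in percent."""
    return 100.0 - abs(prediction_error_pct(projected_cycles, measured_cycles))
```

With hypothetical values of 900 projected cycles and 1000 measured cycles, the error is -10% and the accuracy is 90%, which shows how a 10% to 20% error band corresponds to 80% to 90% accuracy.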
The model extensions discussed in Section 2 focus primarily on instruction scheduling and out-of-order execution. Intuitively, this information should be sufficient to project the performance with reasonable accuracy for compute-intensive kernels such as HACCmk. The primitive memory model that is part of the hardware model is crucial for memory-intensive kernels such as the Nekbone kernels. The memory model accounts for the latency of a memory operation for the first touch, and it accounts for the locality in the stride-1 memory access streams.
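A primitive memory model of this kind can be sketched as a cost function that charges a long latency for the first touch of each cache line and a short latency for the remaining accesses in a stride-1 stream. The sketch below is an illustration under assumptions, not the SKOPE memory model: the function name, the parameter names, and the default latency values are hypothetical placeholders rather than KNL parameters, and the stream is assumed to start at a cache-line boundary.

```python
def stride1_stream_cycles(num_elements, element_bytes, line_bytes=64,
                          first_touch_latency=100, hit_latency=4):
    """Estimated cycles to read a contiguous (stride-1) stream once.

    The first access to each cache line pays first_touch_latency;
    every other access to that line pays hit_latency. Both latency
    defaults are placeholders chosen for illustration.
    """
    if num_elements <= 0:
        return 0
    elems_per_line = max(line_bytes // element_bytes, 1)
    lines_touched = -(-num_elements // elems_per_line)  # ceiling division
    hits = num_elements - lines_touched
    return lines_touched * first_touch_latency + hits * hit_latency
```

For instance, 16 eight-byte elements span two 64-byte lines, so the estimate charges two first-touch latencies and 14 hit latencies.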
We now use the hardware model to project the performance of the Nekbone microkernels mxf12 and glsc3i. Tables 3 and 4 show the prediction errors for the projections of kernels mxf12 and glsc3i as -10.33% and -8.30%, respectively. The hardware model was able to project the performance well even for these kernels. These reasonably good prediction accuracies validate the memory modeling aspects of the hardware model. Table 3 shows the validation for the mxf12 kernel with three matrix sizes. While the prediction error for the smallest matrix size is relatively high compared with the bigger matrix sizes, the overall prediction accuracy is reasonably good. Also, the instructions projected by the hardware model match closely with the instructions retired as measured on the hardware. As shown in Table 4, the hardware model selects 4 as the best unroll factor for the glsc3i kernel as well.
CONCLUSION AND FUTURE WORK
We have developed a hardware model for the second-generation Intel Xeon Phi architecture, code-named Knights Landing, by extending the SKOPE execution model to support out-of-order instruction scheduling and pipelining. The model was used to project the performance of the HACCmk and Nekbone kernels, which are derived from critical regions of two exascale applications, HACC and Nek5000, respectively, and it was validated by using runs on the hardware. The model can be used to project application performance and suggest effective code transformations on the target hardware even before the production runs. The model can help performance engineers and hardware designers set performance-tuning goals, select code optimizations, and, above all, identify which hardware features are more suitable for their applications. This information can help them advocate for certain hardware features proposed by the vendors for future microarchitectures such as Knights Hill. Future work includes extending the SKOPE framework to support a wider selection of hardware models so that it can be used to study various computer architectures. We also plan to extend the skeleton language to model applications with a specific focus on data-flow-based optimizations, asymmetric pipelines, and more code transformations such as loop splitting and loop fusion.
