Design space exploration (DSE) is becoming increasingly complex and 3D integration compounds the problem by imposing a more complex design space. Moreover, 3D design is a new frontier of CPU design, expected to rely more heavily on statistical modeling than designer intuition. Past work has used regression modeling with random sampling of the design space. In this paper we propose a directed simulation technique where intermediate predictions more efficiently direct simulation resources. Our results show over 98% model accuracy while simulating less than 5% of the design space and reducing simulation time more than 4.5x compared to random sampling.
INTRODUCTION
Design space exploration (DSE) involves the evaluation of a multitude of design choices prior to detailed implementation. Such a technique is necessary to identify regions of interest in the design space and perform educated trade-off analysis of conflicting objectives. In its simplest form, DSE can be performed by exhaustively simulating the entire design space. However as CPU designs become ever more complex in the pursuit of Moore's law performance scaling, the DSE problem has become increasingly intractable as the design space grows combinatorially in the number of design parameters. Exhaustive simulation across such large design spaces is inefficient and potentially infeasible or unaffordable in terms of run-time.
Past work has attempted to overcome the computational infeasibility of exhaustive simulation in two ways. One is to reduce simulation time by orders of magnitude using techniques such as host-compiled simulation [2] or statistical simulation [3]. Although these approaches can make exhaustive simulation possible, the accuracy of such fast simulation techniques is reduced, and the applicability of the techniques is limited in scope. Another approach to the DSE problem is to simulate only a small subset of the full design space and use modeling techniques to predict the properties of un-simulated designs. Modeling approaches [8, 7, 11, 9] have shown promising results on large architectural design spaces.
Vertical integration of circuits (3D ICs) is an up-and-coming technology that shows great potential for improving circuit power and performance as well as facilitating new CPU architecture paradigms such as stacked memory and highly connected on-chip networks [18]. However, 3D ICs also bring new challenges, chief among them thermal management [15, 19], and move the architectural design problem into uncharted territory where traditional domain knowledge and designer intuition may no longer apply. Moreover, past work [18, 19] has shown that significant portions of the 3D CPU design space can be infeasible due to physical constraints.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. GLSVLSI '17, May 10-12, 2017
• 3D integration brings many new architectural opportunities that compound the intractability of exhaustive simulation.
• The effects of these new architectures on the design trade-off space are currently not well understood.
• 3D ICs are more thermally sensitive to architectural changes than equivalent 2D chips due to their physical structure.
• 3D ICs can eliminate communication bottlenecks that are inherent in 2D ICs, making performance and power more sensitive to architectural changes.
• Ad hoc fixes late in the design cycle due to poor architectural design choices can be more costly in 3D ICs because of resource conflicts between transistors and TSVs.
Physically aware DSE is becoming more important, especially in the context of 3D ICs. Past work [13, 19] has examined the effect of physical constraints on a CPU design space, but has only done so with exhaustive simulation over a small design space. On the other hand, the literature on design space modeling [8, 7, 11, 9] has only attempted to model optimization variables such as performance or energy efficiency with no consideration of physical constraints.
In this paper we introduce a modeling and simulation technique for 3D CPUs. The proposed technique models physical properties (e.g., power, area and temperature) and traditional optimization metrics (e.g., instructions per second or energy-delay-product) and uses these models to direct simulation effort towards user-defined regions of interest in the design space for the purpose of identifying interesting trends such as the Pareto optimal trade-off curve. Our models accurately predict the performance and temperature of a diverse 3D CPU design space and identify the optimal feasible design point (Pareto optimal design set) with 100% (98%) accuracy while simulating less than 2% (5%) of the design space.
In the rest of this paper we discuss the related work in the literature, describe our power-performance-temperature simulation methodology and design space modeling technique, and demonstrate its effectiveness and accuracy using two case studies.
Previous Work
Methodologies to facilitate large scale DSE have taken two orthogonal approaches: drastically reduce simulation time or predict the properties of un-simulated design points through design space modeling. Towards the former, techniques such as statistical simulation [3] and host-compiled simulation [2] have been pursued. Both techniques massively reduce simulation time, but at the cost of reduced accuracy and limited applicability.
Design space modeling likewise trades off accuracy for decreased simulation time by predicting rather than actually simulating significant portions of the design space. Historically, design space modeling techniques [8, 7, 11, 9] have used uniform random sampling to build models of the entire design space. Early work by Joseph et al. [9] used linear regression to model instructions per cycle (IPC) across a 23-variable CPU design space. However, only two factors of each variable were considered, which is not sufficient to evaluate the accuracy of the technique. Two similar works were published the same year by Lee and Brooks [11] and İpek et al. [7]. These applied spline regression and artificial neural network models to accommodate multi-factor non-linear design space properties. They yielded average errors less than 10% and maximum error around 50%. More recent work by Jia et al. [8] applied spline regression to GPUs. This technique reduced maximum error to around 15% and had average error in the single-digit range. However, a limitation of the current state of the literature is a lack of focus on application of the proposed modeling techniques to efficiently solve DSE problems.
Moreover, in the aforementioned prior art there is a missed opportunity. A significant potential advantage of modeling approaches is the ability to control the accuracy of the model in different regions of the design space, which we refer to as directed simulation. This is important because it is often the case that accuracy of the simulations is only important in a small subset of the design space, such as the Pareto front for the design objectives at hand, or the region of physically feasible design points. Directed simulation can improve the efficiency of design space modeling by optimizing the amount of useful information in each simulation.
Contributions
This paper makes the following contributions:
• We propose a directed simulation technique that builds regression models to identify the region of the design space that is of interest to the designer and predict optimization metrics and physical properties within that region while only simulating a small subset of the design space.
• To the best of our knowledge our work is the first to apply design space modeling techniques to 3D CPUs. 3D CPU design is expected to rely more on design space modeling than traditional CPU architectures due to a lack of designer experience and intuition regarding this emerging technology and architectural paradigm.
• To the best of our knowledge our work is the first to apply design space modeling to physical properties such as temperature to predict the feasibility region of a design space. This is extremely important for designing 3D CPUs which are known to be heavily thermally constrained.
3D CPU SIMULATION FLOW
The simulation flow used to evaluate power, performance, area and temperature of a specific 3D CPU architectural design point is shown in Figure 1 . The details of each component of the simulation flow are explained in the following subsections.
Performance Simulation
Performance simulation is performed by Multi2Sim (M2S) [22] , a cycle accurate CPU simulator. Architectural definitions are passed to the simulator that describe parameters such as: number of cores, pipeline width, buffer/queue/register size, cache size/associativity/latency, network-on-chip (NOC) topology/latency etc. Cache and register latencies are determined using CACTI [21] to provide realistic architectural setups to the simulator. DRAM latency is calculated using the model proposed in [18] and NOC topology/latency is calculated as explained in Section 2.4. M2S simulates the execution of an x86 binary file on the described CPU. The simulator outputs a list of performance statistics such as IPC, memory reads, writes, hits and misses, branch prediction rate etc. The specific architectural configurations and software workloads simulated for this study are discussed in Section 4.
Power and Area Estimation
Power and area estimation is performed by McPAT [12] , a multicore power, area and timing estimation tool. Based on the provided architectural parameters, the area and energy per access and leakage power density of each CPU component is estimated for a selected technology node. Using the performance statistics generated by M2S, the dynamic power dissipation of each CPU component can be estimated. The leakage power density is estimated for a nominal temperature, but is thermally scaled during thermal simulation (Section 2.5).
Floorplan
Each core is partitioned into a set of components (e.g., data cache, register alias table (RAT), branch predictor, instruction queue and execution unit) which are interconnected by an abstract netlist. Detailed descriptions of each component can be found in [12] . A general core floorplan (FP) topology was generated offline which minimizes the wirelength between connected components.
Core Tiling and NOC Design
The core floorplan is replicated on an $i \times j \times k$ grid such that $i \cdot j \cdot k = n$, where $n$ is the total number of cores. The values $i$, $j$ and $k$ are chosen such that:
• Total footprint area $(i \cdot width_{core}) \cdot (j \cdot height_{core}) < A_{max}$.
• Total number of layers is minimized.
• Layer aspect ratio $(i \cdot width_{core}) / (j \cdot height_{core})$ is relatively square.
NOC topology is defined as an $i \times j \times k$ 3D mesh [18] and NOC latency is defined as the wire delay of length $\max(width_{core}, height_{core})$, calculated using the wire delay model from [16] with technology parameters extracted from McPAT source code. NOC topology and latency are fed back into the performance simulator to get accurate inter-core communication simulations.
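The three tiling criteria above can be illustrated with a brute-force search. The following is a minimal sketch, not the authors' implementation: the function name `choose_tiling`, the priority order (fewest layers first, then squareness) and the use of |log(aspect ratio)| as the squareness measure are all assumptions made for illustration.

```python
from itertools import product
from math import log


def choose_tiling(n, width_core, height_core, a_max):
    """Return (i, j, k) with i*j*k == n, or None if no tiling fits.

    Assumed priority: fewest layers k first, then the most nearly
    square layer, measured by |log(aspect ratio)|.
    """
    best, best_key = None, None
    for i, j in product(range(1, n + 1), repeat=2):
        if n % (i * j) != 0:
            continue  # i*j must divide n so that k is an integer
        k = n // (i * j)
        w, h = i * width_core, j * height_core
        if w * h >= a_max:
            continue  # footprint constraint: area must be below A_max
        key = (k, abs(log(w / h)))
        if best_key is None or key < best_key:
            best, best_key = (i, j, k), key
    return best


# Eight 2x2 cores with a footprint limit of 20 area units.
print(choose_tiling(8, 2.0, 2.0, 20.0))
```

An exhaustive search is adequate here because the number of factorizations of a realistic core count is tiny compared with the cost of a single simulation.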
Thermal Model
Once the chip floorplan has been constructed and component power estimation is complete, we have a power density map for each tier of the 3D stack. Power density maps are converted into thermal maps using our compact thermal model [20] . A 3D grid is constructed representing the physical structure of the 3D IC. Likewise the power map is discretized into a 3D grid and the total power of each power grid is assigned to the respective physical grid.
Leakage Model
McPAT reports a base leakage value for each CPU component which is estimated at a fixed temperature $T_0$. To obtain more accurate leakage power estimates we iteratively solve our thermal model and then scale leakage estimates at each grid based on the estimated temperature of that grid after the previous iteration.
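The leakage-temperature iteration can be sketched as a fixed-point loop. This is an illustrative sketch only: the exponential scaling law, the coefficient `beta`, the reference temperature `t0`, the tolerance, and the `solve_thermal` callable (a stand-in for the compact thermal model) are assumptions, since the text does not specify them.

```python
import math


def solve_leakage(dynamic, leak_t0, solve_thermal,
                  t0=45.0, beta=0.02, tol=0.01, max_iter=50):
    """Alternate thermal solves and leakage scaling until temperatures settle.

    dynamic, leak_t0: per-grid-cell dynamic power and leakage power at t0.
    solve_thermal: hypothetical callable mapping per-cell power to temperature.
    """
    temps = [t0] * len(dynamic)
    leak = list(leak_t0)
    for _ in range(max_iter):
        # Scale each cell's leakage using its temperature from the last solve
        # (assumed exponential dependence on temperature).
        leak = [l * math.exp(beta * (t - t0)) for l, t in zip(leak_t0, temps)]
        power = [d + l for d, l in zip(dynamic, leak)]
        new_temps = solve_thermal(power)
        if max(abs(a - b) for a, b in zip(new_temps, temps)) < tol:
            return new_temps, leak
        temps = new_temps
    return temps, leak


# Toy thermal model: each cell heats up 2 degrees per watt, independently.
temps, leak = solve_leakage([1.0], [0.5], lambda p: [45.0 + 2.0 * x for x in p])
print(temps, leak)
```

The loop stops when the largest per-cell temperature change falls below `tol`, which is one reasonable convergence test among several.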
DSE MODELING TECHNIQUE
In this section we introduce our modeling and simulation technique for 3D CPU DSE subject to physical constraints. We use the smoothing spline analysis of variance (SS-ANOVA) [5] modeling technique to build a model for each design property of interest (e.g., performance, temperature and power). Models are composed from cubic spline functions evaluated on combinations of design variables (i.e., model terms). First we give some background on SS-ANOVA modeling and then describe our proposed modeling and directed simulation technique. Figure 2 illustrates the overall flow of our modeling and simulation technique, and details are given in the subsections below. The basic flow is an iterative back-and-forth between model building and choosing new simulation points based on the intermediate model predictions.
SS-ANOVA Modeling
A spline is a piecewise polynomial function [5]. Splines are both differentiable and continuous at the piecewise boundaries [5]. The smoothing spline is a regression technique to smooth noisy data by fitting a spline function to the data. Analysis of variance (ANOVA) is a statistical technique for analyzing the underlying source of variations in a population [5]. Multi-factor ANOVA can be used to generate models of a data set as a function of descriptive properties of each observation. An observation $f$ can be modeled as a function of the variables $x = (x_1, x_2, \ldots, x_n)$ as shown in Equation (1) [5]. SS-ANOVA limits the set of functions $\{f_1, \ldots, f_{1,2,\ldots,n}\}$ to be spline functions which operate on some subset of the variables in $x$. Each unique subset of input variables is called a term, and the order of a term is the size of the subset.
In this work we use the gss [4] package for the statistical computing environment R [17] to generate a unique smoothing spline model for each design property of interest. To generate each model, gss requires a set of simulation data and a set of model terms. However, choosing the appropriate simulation points and model terms is a nontrivial problem. The choice of model terms and simulation points strongly affects the quality of the model, and suboptimal choices have a high cost in terms of total simulation time and model complexity. Our iterative technique for model term and simulation point selection is explained in detail in the following subsections.
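For reference, the functional ANOVA decomposition that this description corresponds to is usually written as follows. This is the standard textbook form expressed in the notation used here, offered as an assumption about the shape of Equation (1) rather than a transcription of it; the constant term $f_0$ is part of the textbook form.

```latex
f(x) = f_0 + \sum_{i=1}^{n} f_i(x_i)
     + \sum_{1 \le i < j \le n} f_{i,j}(x_i, x_j)
     + \cdots
     + f_{1,2,\ldots,n}(x_1, x_2, \ldots, x_n)
```

In this form each $f_i$ is a first-order term, each $f_{i,j}$ a second-order interaction term, and so on up to the single term of order $n$.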
Choosing Model Terms
The maximum number of terms associated with $n$ variables is $2^n$. However, as a rule of thumb a model is unreliable when the number of terms is greater than $s/20$ [6], where $s$ is the number of simulated points. If too many model terms are used, the model can suffer from over-fitting, making it very accurate with respect to the observed data, but a poor predictor of the un-simulated data we wish to predict. Thus the number of model terms must be kept relatively small in order to maintain model accuracy when the number of simulations is small. The intended goal of the modeling and simulation approach is to build accurate models while requiring only a small number of simulations, so avoidance of the over-fitting problem is of critical importance.
The coefficient of determination ($R^2$) is a commonly used metric to evaluate model quality. However, by construction $R^2$ monotonically increases as new terms are added to a model [8]. Thus optimization of $R^2$ itself would inevitably lead to inclusion of all model terms, unnecessarily complicating the model and potentially causing over-fitting. Adjusted $R^2$ ($\bar{R}^2$) (Equation (2)) scales $R^2$ relative to the number of model terms, $m$, and the number of data points, $s$, allowing the significant terms to be identified.
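A commonly used definition of adjusted $R^2$, written with the symbols $m$ and $s$ from the text, is shown below. It is given as the conventional form and is an assumption about Equation (2), which may differ in detail.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right) \frac{s - 1}{s - m - 1}
```

Because the factor $(s-1)/(s-m-1)$ grows with $m$, a new term raises $\bar{R}^2$ only if its improvement to $R^2$ outweighs the penalty for the extra term.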
We use a forward selection technique based on $\bar{R}^2$ to select the terms in the model. The model building technique is shown in the bottom half of Figure 2. Starting with an empty model we consider each model consisting of one first order term. We evaluate the $\bar{R}^2$ metric for each model and accept the one with the largest value. We then consider adding each remaining first order term (in decreasing order of improvement) and accept the terms that increase the quality of the model by at least θ.
Every time a new first order term is added to the model, we consider all second order interaction terms created by combining the new term with any other terms already in the model. Amongst all new second order terms generated this way we add (in decreasing order of improvement) any that cause the model quality to improve by at least θ. The model is complete once all first order terms have been added to the model, or when adding any new first order terms causes model quality to improve less than θ.
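The term-selection procedure described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions: `score` is a hypothetical callable that fits a model with the given terms and returns its adjusted $R^2$ (in the paper this role is played by the gss model fit), terms are represented as tuples of variable names, first order candidates are re-ranked after every accepted term, and a term is accepted only on a strict improvement greater than θ.

```python
def forward_select(variables, score, theta=0.0):
    """Greedy forward selection of first and second order terms.

    score(terms) -> adjusted R^2 of a model built from `terms`
    (hypothetical callable supplied by the caller).
    """
    model, current = [], float("-inf")
    remaining = list(variables)
    while remaining:
        # Best remaining first order term given the current model.
        best = max(remaining, key=lambda v: score(model + [(v,)]))
        new = score(model + [(best,)])
        if new - current <= theta:
            break  # no first order term improves the model enough
        others = [t[0] for t in model if len(t) == 1]
        model.append((best,))
        current = new
        remaining.remove(best)
        # Second order interactions between the new term and earlier ones,
        # tried in decreasing order of the score they would produce.
        pairs = [tuple(sorted((best, o))) for o in others]
        pairs.sort(key=lambda p: score(model + [p]), reverse=True)
        for pair in pairs:
            new = score(model + [pair])
            if new - current > theta:
                model.append(pair)
                current = new
    return model, current


# Toy score: each term contributes a fixed gain; unknown terms cost 0.01.
gains = {("a",): 0.5, ("b",): 0.3, ("a", "b"): 0.1}
toy_score = lambda terms: sum(gains.get(t, -0.01) for t in terms)
print(forward_select(["a", "b", "c"], toy_score))
```

Only interactions that involve the newly added variable are considered at each step, which keeps the number of candidate models far below the $2^n$ possible terms.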
Adding Simulation Points
The designer defines a discovery metric, which defines the point(s) in the design space they are interested in identifying. Some examples of potential discovery metrics are the optimal design point subject to a set of constraints (e.g., design space optimization), or the set of Pareto optimal designs (e.g., trade-off analysis). The optimality metric (e.g., performance or energy efficiency), constraints (e.g., temperature, power, area or timing) and Pareto metrics are defined by the designer. The goal of our proposed modeling and simulation technique is to identify these points by iteratively predicting them and concentrating simulator effort around the predicted point(s) to improve the accuracy of the prediction.
Initial models are built using a random sampling of η simulation points. Using model predictions, the predicted design point(s) of interest are identified. Due to model error, the identified point(s) are not necessarily the true points of interest. However, the true points of interest are likely to be close to the predicted points of interest. Thus a region of interest (ROI) is defined which contains the design points which are close to the predicted point(s) of interest, and additional simulation effort is concentrated towards this ROI to improve model fidelity in that region. Each iteration of the flow identifies χ new design points from the predicted ROI and queues them for simulation. Once the simulations are performed, the model is rebuilt and the process repeats. If the initial model mispredicts the ROI, additional simulation effort in the mispredicted region will reduce model residuals in that region and improve ROI prediction on the next iteration.
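The iterative flow described above can be outlined as follows. This is a sketch under assumptions, not the authors' code: `simulate`, `build_model` and `predict_roi` are hypothetical callables standing in for the simulation flow, the model builder and the ROI definition; the default value of `chi`, the random choice of points within the ROI, and stopping early when the ROI is exhausted are illustrative choices.

```python
import random


def directed_simulation(design_space, simulate, build_model, predict_roi,
                        eta=40, chi=10, zeta=80, seed=0):
    """Iteratively concentrate simulations in the predicted region of interest.

    design_space: list of hashable design points.
    eta: initial random samples; chi: new points per iteration;
    zeta: total simulation budget.
    """
    rng = random.Random(seed)
    # Initial model data: eta points sampled uniformly at random.
    simulated = {p: simulate(p) for p in rng.sample(design_space, eta)}
    while len(simulated) < zeta:
        model = build_model(simulated)
        # Candidate points: predicted ROI members not yet simulated.
        roi = [p for p in predict_roi(model, design_space) if p not in simulated]
        if not roi:
            break  # assumed behavior: stop when the ROI is fully simulated
        budget = min(chi, zeta - len(simulated), len(roi))
        for p in rng.sample(roi, budget):
            simulated[p] = simulate(p)
    return build_model(simulated), simulated


# Toy usage: one-dimensional design space with a single performance peak.
space = list(range(100))
truth = lambda p: -(p - 70) ** 2
roi_fn = lambda model, ds: [q for q in ds
                            if abs(q - max(model, key=model.get)) <= 5]
model, sims = directed_simulation(space, truth, dict, roi_fn,
                                  eta=10, chi=5, zeta=30)
print(len(sims), max(sims, key=sims.get))
```

In the toy usage the "model" is simply the table of simulated values, which is enough to show how each iteration pulls new samples towards the best point found so far.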
Stopping Criteria
Stopping criteria could involve reaching a maximum number of simulations, or a sustained convergence in predictions of ROI and/or point(s) of interest across multiple iterations. In order to evaluate the behavior of our algorithm we simply terminate when the total number of simulations reaches ζ. We investigate the trade-off between number of simulations and optimality of our selected design space in Section 5, and the point at which prediction convergence is achieved is observed post hoc.
EXPERIMENTAL SETUP
In this section we describe the experimental setup to evaluate the proposed modeling and simulation technique (Section 3). In the following subsections we introduce the 3D CPU design space, the discovery metrics and associated ROI definitions considered in our case studies and the metrics we use to measure the effectiveness of our approach. Results are presented and discussed in Section 5.
Architectural Design Space
Our study searches the architectural design space given in Table 1 which contains 4374 unique design points. We consider 3D CPUs with stacked DRAM which is considered to be one of the primary advantages of 3D CPUs [10, 14] . By integrating the DRAM on chip using through silicon vias (TSVs), the core-memory bandwidth can be increased drastically due to increased memory bus width, more memory controllers and faster bus speed [15, 18] .
Each architectural design point is evaluated using a set of software workloads from the SPLASH-2 [23] and PARSEC [1] benchmark suites. The performance of each workload is normalized to the baseline architecture (Table 1) to allow unbiased averaging across workloads. Maximum temperature for a design point is the maximum across all workloads.
Discovery Metrics
The goal of our DSE study is to identify the design point(s) of interest as defined by the discovery metric chosen by the designer. Two discovery metrics are considered as case studies in this paper, but our proposed methodology is applicable to any arbitrary discovery metric. In this study the modeled design parameters are performance and temperature and the discovery metrics are:
• "Optimal": design point with highest normalized performance subject to thermal constraint $temp_p < T_{violation}$.
• "Pareto": Pareto optimal set of design points in thermal-performance space.
Each discovery metric defines an accompanying ROI of radius $\phi = (\phi_{perf}, \phi_{temp})$. The ROIs for the "Optimal" and "Pareto" discovery metrics are given in Equations (3) and (4)* respectively, where $perf_i$ and $temp_i$ are the performance and temperature of design point $i$ and Ω is the design space. Design point $p$ is the predicted optimal feasible point for the discovery metric "Optimal". The defined ROI is the set of points within distance φ of the identified point(s) of interest.
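The two ROI definitions can be illustrated in code. The sketch below is an interpretation, not a transcription of Equations (3) and (4): it assumes a per-axis (box) distance for the "Optimal" ROI, a strict "better by more than φ in both metrics" test for the φ-relaxed Pareto ROI, and a performance radius expressed in the same normalized units as the performance values. The function names and the example data are hypothetical.

```python
def optimal_roi(points, p, phi_perf, phi_temp):
    """Points within the per-axis radius of the predicted optimal point p.

    points: dict mapping design point -> (perf, temp).
    """
    perf_p, temp_p = points[p]
    return [i for i, (perf, temp) in points.items()
            if abs(perf - perf_p) <= phi_perf and abs(temp - temp_p) <= phi_temp]


def pareto_roi(points, phi_perf, phi_temp):
    """Points that no other point beats by more than phi in both metrics.

    Higher performance and lower temperature are treated as better.
    """
    roi = []
    for i, (perf_i, temp_i) in points.items():
        dominated = any(
            perf_j > perf_i + phi_perf and temp_j < temp_i - phi_temp
            for j, (perf_j, temp_j) in points.items() if j != i
        )
        if not dominated:
            roi.append(i)
    return roi


# Made-up (normalized performance, temperature) values for four designs.
pts = {"a": (1.0, 60.0), "b": (1.2, 70.0), "c": (0.9, 66.0), "d": (0.5, 80.0)}
print(pareto_roi(pts, 0.0, 0.0))      # strict Pareto set
print(pareto_roi(pts, 0.2, 3.0))      # relaxed set admits near-optimal points
print(optimal_roi(pts, "a", 0.15, 7.0))
```

With a radius of zero `pareto_roi` reduces to the ordinary Pareto set; increasing the radius admits points that are only slightly dominated.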
Modeling and Simulation Parameters
The modeling and simulation technique introduced in Section 3 can be parametrized to make trade-offs between simulation time and optimality of the selected design point. In this study we use the following parameters which were found experimentally to offer favorable trade-offs:
• We sample η = 40 simulation points at random from the design space to build the initial model. This parameter should be large enough to generate an initial model with reasonable accuracy but small enough to avoid degeneration towards random sampling.
• The threshold for accepting new model terms is $\bar{R}^2_{new} - \bar{R}^2_{current} > \theta = 0$. By increasing θ, the model complexity could be reduced at the expense of quality.
• We use ROI radius of φ = (8%, 4 °C) for discovery metric "Optimal" and φ = (5%, 3 °C) for discovery metric "Pareto". Larger values of φ reduce the probability of convergence to local minima, but generally increase simulation time.
Evaluation Metrics
The goal of the experiment is to identify the ROI defined by the discovery metric, while minimizing the total number of simulations performed. Thus the primary metrics used to evaluate the quality of our technique will be accuracy of identification, number of simulations and run-time overhead of the model building.
When the discovery metric is "Optimal", the distance between the identified point and the true solution is quantified as optimality, which is the ratio $perf_p / perf_o$, where $p$ is the predicted optimal feasible point and $o$ is the true optimal feasible point (determined by exhaustive simulation for evaluation).
* Equation (4) presents a φ-relaxed definition of Pareto optimality that includes all points such that no other point is better by a degree of φ in all metrics of interest.
When the discovery metric is "Pareto", the distance between the identified points and the true Pareto set is quantified as accuracy, which is the average Pareto optimality of the predicted Pareto set. The Pareto optimality of design point k is determined by finding the smallest value of φ such that k is included in the ROI. Specifically, the Pareto optimality of k is α k and the smallest value of φ that includes k in the ROI is φ
In general the optimality/accuracy of the predicted point(s) will increase as more simulations are performed, eventually degenerating into exhaustive simulation. The net speedup of our technique is the time saved by the reduction in the total number of simulations minus the run-time overhead of building the models.
Comparison to Other Techniques
The rudimentary technique to which our technique is compared is a random sampling methodology where some portion of the solution space is sampled at random and the best design amongst the sampled designs is selected‡. Additionally, we consider a modeling-only version of our proposed technique that uses SS-ANOVA model building to predict the design point(s) of interest, but simply uses random sampling to provide data to the model builder. The modeling-only approach is representative of design space modeling techniques proposed in past work [8, 7, 11, 9]. In Section 5 we compare the trade-off curves of simulation count vs. quality for the three aforementioned techniques:
• Proposed: modeling and directed simulation
• Modeling-Only: modeling and random simulation§
• Random Sampling: random simulation (no modeling)
RESULTS
In this section we first provide some characterization of the design space explored in our study, and then compare the quality of the investigated methodologies (Section 4.5) for the "Optimal" and "Pareto" discovery metrics.
Design Space Characterization
We begin by examining the properties of the design space. Exhaustive simulation was performed for the purpose of evaluation. Exhaustive simulation took weeks to perform using university servers with over 100 cores, further motivating the need for techniques such as the one proposed in this paper to reduce simulation time. We provide some statistics of the design space properties in order to give context for the results of this study. Figure 3 shows the distribution of normalized performance across all architectural design points. We can see that the design space is biased heavily towards the low-performance region. Furthermore, thermal feasibility constraints bias the design space even further as the constraints tighten (i.e., $T_{violation}$ is reduced). This implies that random sampling is not a very good technique for discovering the "Optimal" design point since the probability of randomly sampling a high-performance thermally-feasible design point is low.
Figure 4 shows a scatter plot of the performance and temperature of each design point in the design space. We can see that identification of both the optimal feasible design point and the Pareto optimal design set without exhaustive simulation is non-trivial. The vast majority of design points in the design space are far from the point(s) of interest using either discovery metric. Moreover, the correlation between performance and temperature is weak, motivating the need for independent models of each design property.
† 60 °C was roughly the thermal range of the design space considered in this work, as shown in Figure 4.
‡ Exhaustive simulation is a degenerate case of random sampling.
§ Representative of past work [8, 7, 11, 9].
"Optimal" Discovery
There exists a fundamental trade-off between the number of simulations and the quality of the identified solution. We compare the random sampling and modeling-only techniques to our proposed modeling and simulation technique and show that our technique significantly reduces the number of simulations required to discover the region of interest in the design space. First we evaluate the techniques using the "Optimal" discovery metric. Figure 5 plots the number of simulations required to identify a design point of interest with a specified optimality. We observe that modeling alone is a large contributor to the optimality of the identified point. With less than 1% of the solution space sampled (40 points), the two modeling techniques can already identify a solution within 90% of the optimal. However, the true power of the proposed technique becomes clear as the optimality target increases. Optimality can be improved significantly for both modeling techniques by adding only a nominal number of additional simulation samples. However, the modeling-only technique begins to degenerate towards random sampling for optimality targets beyond 98%, whereas the proposed technique shows no such degradation. By using models to direct simulation effort on each iteration towards the ROI, the technique is able to make roughly linear improvements to prediction accuracy for each additional simulation. Our proposed technique is able to identify the optimal feasible design point while simulating less than 2% (roughly 80 points) of the entire design space, saving hundreds of simulation hours compared to the other two techniques.
Robustness to Constraint Tightness: The previous results were evaluated at $T_{violation}$ = 85 °C. However, as Figure 3 shows, reducing $T_{violation}$ to 65 °C significantly reduces the size of the thermal feasibility region. It is expected that a smaller feasibility region will reduce the quality of the random sampling technique significantly, but it is unclear how it would affect the techniques that use model building. We exercise the robustness of the investigated techniques to constraint tightness by lowering the thermal constraint. However, the resulting observations would be extendable to analogous situations such as using a lower quality heatsink or a more power-dense manufacturing technology. Figure 6 plots the number of additional simulations required to accommodate the reduced thermal constraint. We notice that the number of additional simulations required for our proposed method is less than 30 (<1% of the entire design space) and moreover, remains roughly constant as the optimality target is tightened. On the other hand, random sampling and modeling-only both require superlinearly increasing amounts of additional simulations in order to meet optimality targets. Although modeling-only scales reasonably well for small optimality targets, the point at which the technique begins to degenerate is significantly reduced to roughly 95%. Thus we show that compared to a modeling-only technique, our directed simulation technique is significantly more robust to design space feasibility constraints.
"Pareto" Discovery
Figure 7 shows the accuracy of the considered methods when the "Pareto" discovery metric is applied. Although the general trends and relative ordering of the method results are similar to the "Optimal" case, there are some significant differences. The quality of both model-based techniques is reduced because identification of a set of Pareto points is a more challenging problem which inherently requires more simulation.
However, the relative improvement of our proposed technique vs. the modeling-only technique is substantially more obvious under this discovery metric. This indicates that, similar to constraint tightness, the proposed technique is significantly more robust to increased problem complexity.
Likewise, the modeling-only technique degenerates into random sampling at an even lower accuracy target than was observed in the previous studies. The conclusion here is that models built with random sampling can approximate a single design much better than the relative ordering of all design points. Directed simulation towards the ROI is of utmost importance for estimation of the Pareto design set, even for rather loose accuracy targets.
Overhead of modeling approach
There is some run-time overhead associated with model building in the investigated modeling approaches. We observed that the time consumed building models in our proposed approach was less than the time consumed to simulate a single design point (< 0.025% of the design space). Our results show that such overhead is negligible compared to the savings in number of required simulations compared to random sampling.
CONCLUSIONS
In this paper we propose a comprehensive technique for 3D CPU architectural design space exploration subject to physical constraints. We use smoothing spline regression modeling to efficiently direct our simulation effort. We demonstrate our technique using two case studies. Our technique identifies the optimal feasible design point (Pareto optimal design set) with 100% (98%) accuracy while simulating less than 2% (5%) of the design space.
¶ Data is plotted on a log-log axis, so polynomial relationships appear as straight lines with slope proportional to degree.
