Abstract

Parameter variation due to manufacturing error will be an unavoidable consequence of technology scaling in future generations. Random variation in physical factors such as gate length and interconnect spacing will have a profound impact not only on the performance of chips but also on their power behavior. While circuit-level techniques such as adaptive body-biasing can help to mitigate mal-fabricated chips, they cannot completely alleviate the severe within-die variations forecasted for near-future designs. Despite the large impact that power variability will have on future designs, there is a lack of published work that examines the architectural implications of this phenomenon. In this work, we develop architecture-level models of power variability due to manufacturing error and examine its influence on multicore designs. We introduce VariPower, a tool for modeling power variability based on a microarchitectural description and floorplan of a chip. In particular, our models are based on layout-level SPICE simulations and project power variability for different microarchitectural blocks using statistical analysis. Using VariPower, we (1) characterize power variability for multicore processors, (2) explore application sensitivity to power variability, and (3) examine clustering techniques that can appropriately classify groups of processors and chips that have similar variability characteristics.

1 Introduction

In future technology generations, manufacturing variation will have a profound impact on the reliability, performance, and power consumption of microprocessor designs. Manufacturing deviations due to both systematic fabrication errors and random statistical variations affect gate size, dopant concentration, interconnect width, spacing, and thickness. This translates directly to chips that miss critical circuit design targets, including latency, power, and resilience to noise. In current designs, foundry-induced physical deviations already produce significant die-to-die variation. In particular, industry data for a high-performance processor in a 130nm technology shows that individual dies produced with the same fabrication equipment can have as much as a 30% die-to-die frequency variation and a 20x leakage power variation [7]. ITRS predicts that manufacturing variability will have an increasing prominence in future designs.

The large magnitude of the power variability present in current chips is expected to worsen in future scaled process technologies. This is primarily due to the exponential relationship between transistor gate length and subthreshold leakage current [28] and the growing share of leakage in total power. Consequently, very small deviations in this critical parameter can have detrimental effects on the overall power profile of a chip. Statistical variations in other transistor parameters, such as gate width, can also have a significant impact on power consumption. Projective studies have shown that physical variations in interconnects will have an increasingly important influence on overall chip performance and will eventually overtake devices as a dominant source of performance variability [26].

The net effect of these manufacturing errors is that components and chips will be increasingly prone to fabrication-induced asymmetry, where physical instantiations of cores, interconnection components, and caches on the same chip may differ widely although they have identical schematic descriptions. In contrast to architected asymmetry [21], which can be artfully constructed to balance power, throughput, latency, and area goals for a target workload, fabrication asymmetry is considerably more nettlesome. The major difficulty is that many fundamental characteristics of microarchitectural structures, such as circuit power and latency, are no longer constant. They are subject to deviations due to imperfections in the materials and equipment used to fabricate the chip, as well as to unavoidable statistical variance. Furthermore, the Semiconductor Industry Association (SIA), whose forecast anticipates improvements in device and fabrication processes, still paints a grim picture for parameter control in deep submicron technology nodes [29].

Microarchitecture can have a significant impact on parameter variation. Pipeline depth and chip organization can influence the susceptibility of a design to parameter variation [7]. Furthermore, by choosing structures that can be configured on a per-instance basis after fabrication, and design styles that are more robust to the power, performance, and reliability consequences of parameter variation, designers can mitigate variability. In addition, cooperative strategies that consider both circuit-level implementation and architectural organization are promising because they allow for tradeoffs at many levels of the design. To properly understand these tradeoffs it is imperative that architects have access to models that con[…]tion of a high-performance microprocessor design. We target power variability as an initial target for architectural manufacturability studies due to the emergence of power as a first-class design constraint [29] and the large amount of power variation already seen in commercially available chips [7]. We focus our study on the impact of power variability on a chip multiprocessor (CMP) design composed of schematically homogeneous cores and caches. Due to within-die parameter variation, these components may have fabrication-induced asymmetry in power consumption. Furthermore, we argue that architects will also need high-level strategies for reasoning about statistical variation and for classifying cores and chips with respect to their variation.

This paper makes the following principal contributions:

• […] representative groups of cores and chips that have considerable parameter variation.

Overall, this work is one of the first to consider architecture-level models for manufacturing variability. In addition, we offer approaches for characterizing power asymmetry due to process variation and illustrate the potential for variation-aware management.

[…] experimental methodology and a series of case studies that explore power variation in a multicore design. We offer a discussion and comparison to existing work in Section 7, and finally we conclude in Section 8.

2 Background

[…] [6, 7].

Some manufacturing processes, such as lithography and chemical mechanical polishing, are fundamentally more difficult to control with current technology. Consequently, they introduce variation in the physical dimensions of devices and interconnects. The operational characteristics of MOS transistors are heavily determined by physical parameters such as gate length, gate oxide thickness, and dopant density, so manufacturing steps that influence these parameters have a larger bearing on the final result. As we transition into the nanoscale era, the problem will worsen [25, 26]. This will have a direct impact on both the yield and the quality of the final products.

In general, manufacturing errors fall into two categories: systematic and random [15]. Systematic variations can affect […] will affect all the on-die devices in the same way, while random within-die variations will produce parameter differences that change on a device-to-device basis on a single die.

In some cases, within-die variations also have spatial correlation patterns [18]. For two transistors on the same die, it has been shown that the correlation between their gate lengths decreases linearly with the distance between them. This correlation has an important implication: neighboring devices are more likely to share common properties. Consequently, a leaky transistor is likely to be surrounded by other leaky transistors. As a result, regional clus[…]

[…] both the leakage variation and leakage current. Different circuit structures may consequently have very different leakage power [1]. Detailed leakage power modeling under the stack effect requires analysis to identify a set of prime inputs and their corresponding probability of appearing. However, one very common building block, the 6-T SRAM cell, does not create a stack effect concern because it has no chained tran[sistors …]

[…] would only cause an approximately linear variation in dynamic power, instead of an exponential one as in the case of static power. As a result, dynamic power variation is significantly smaller than static power variation, a result confirmed by our SPICE simulations in the following sections. Though the variation is limited, dynamic power is still a primary source of chip power dissipation. For the purposes of completeness and comparison, we continue to include dynamic power in the models that we propose.

However, architectural design options such as pipeline depth and cache sizing often dictate the overall performance and power characteristics of a processor. Furthermore, they frequently place bounds on which circuits can be used to implement functionality. In addition, pure circuit approaches such as adaptive body-biasing (ABB) [33] have a […]

[…] within-die as well as die-to-die power variability via Monte Carlo simulation and can project the probability distributions for power consumption under parameterized architectural models and application usage profiles. This flexibility allows VariPower to predict the severity of fabrication-induced power asymmetry for a design and its consequences for different classes of workloads and power management policies.

Figure 1

In fabricated chips, parameter variation is partially dependent on spatial properties of the circuit blocks [18, 27]. Typically, nearby circuits tend to have strong parameter correlations, and as the distance between circuits increases, the correlations decrease. The physical geometry of microarchitectural structures will have a significant impact on their statistical […]

[…] be re-instantiated and stamped down anywhere on chip. This allows for easy representation of tiled architectures, as depicted in Figure 2. In this example, a simple processor core is defined in terms of its major subcomponents: caches, router, and execution pipeline. The components are grouped to form a core. The core is then replicated many times to describe a […]

[…] amps, decoders, and drivers. Each of these functional subsections could be represented by its own circuit macro block within the larger cache structure.

In our circuit characterizations, we assume that within a circuit macro, the dimensional parameters of wires are perfectly correlated, and that the dimensional parameters of devices are also perfectly correlated. This simplifying assumption is reasonable because the physical parameters of neighboring circuit structures have been shown to be strongly correlated [18, 27]. This is a result of imperfections in a manufacturing step, for ex[ample …]

[…] where the overall variance of a parameter is a function of its location on the chip. The global variance is determined by σ_D2D, and it relates to the overall deviations that are present across all fabricated dies. The second term expresses the spatial correlation of the physical parameter. Empirical studies have shown that critical parameters can have strong positive correlation for two neighboring points [18, 27]. With this simple yet flexible model, VariPower can effectively model common types of statistical parameter variation.

[…] The procedure for generating these random, correlated values is simple yet flexible. We can change the rate at which the correlation approaches zero by changing the size of the convolution sum. By changing the aspect ratio of the convolution box, we can also change the horizontal and vertical correlation factors. In addition, we can model non-monotonic correlations with more irregular shapes. Furthermore, by allowing the convolution sum to "wrap around", we can model concave correlation patterns [18].
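The convolution-based generation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not VariPower's implementation: the function name `correlated_field`, the use of i.i.d. Gaussian noise, and the centered box kernel are our own choices. The box height and width stand in for the size and aspect ratio of the convolution sum, and `wrap=True` lets the window wrap around the die edges.

```python
import random

def correlated_field(rows, cols, box_h, box_w, sigma, wrap=True, seed=None):
    """Zero-mean, spatially correlated parameter deviations on a rows x cols grid.

    Each cell sums i.i.d. N(0, 1) noise over a box_h x box_w window centered on
    that cell. A larger box makes correlation decay more slowly with distance;
    box_h != box_w gives different vertical and horizontal correlation. With
    wrap=True the window wraps around the die edges (circular convolution).
    """
    rng = random.Random(seed)
    noise = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    half_h, half_w = box_h // 2, box_w // 2
    field = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total, n = 0.0, 0
            for dr in range(-half_h, box_h - half_h):
                for dc in range(-half_w, box_w - half_w):
                    rr, cc = r + dr, c + dc
                    if wrap:
                        rr, cc = rr % rows, cc % cols
                    elif not (0 <= rr < rows and 0 <= cc < cols):
                        continue  # window falls off the die edge
                    total += noise[rr][cc]
                    n += 1
            # A sum of n unit-variance samples has variance n; rescale to sigma.
            row.append(sigma * total / n ** 0.5)
        field.append(row)
    return field
```

For example, `correlated_field(32, 32, box_h=8, box_w=4, sigma=0.03, seed=1)` yields a die whose deviations stay correlated over longer distances vertically than horizontally. If deviations are expressed as fractions of nominal, `sigma=0.03` corresponds to a 3σ spread of 9%.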
The last step in generating P is to add a single random number (mean μ = 0, standard deviation σ = σ_D2D) to all elements of the matrix. This represents the global, sample-wide parameter deviation. The entire process is repeated to generate variation matrices for all physical parameters that VariPower models.
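The final step, adding one die-wide offset to every matrix element and repeating the process for each modeled parameter, can be sketched as follows. The names `add_global_offset`, `sample_die`, and the parameter keys are hypothetical; the offset is assumed to be Gaussian, and the local term here is plain i.i.d. noise standing in for the spatially correlated field produced by the convolution step.

```python
import random

# The five process parameters varied in this study (key names are our own labels).
PARAMETERS = ("gate_width", "gate_length", "wire_length", "wire_height", "wire_spacing")

def add_global_offset(local_field, sigma_d2d, rng):
    """Return P = local field + one die-wide offset drawn from N(0, sigma_d2d).

    The same offset is added to every element, so it shifts the whole die
    without changing its within-die pattern.
    """
    offset = rng.gauss(0.0, sigma_d2d)
    return [[v + offset for v in row] for row in local_field]

def sample_die(rows, cols, sigma_local, sigma_d2d, seed=None):
    """One Monte Carlo die: an independent variation matrix per parameter."""
    rng = random.Random(seed)
    die = {}
    for name in PARAMETERS:
        # Stand-in for the spatially correlated local field.
        local = [[rng.gauss(0.0, sigma_local) for _ in range(cols)] for _ in range(rows)]
        die[name] = add_global_offset(local, sigma_d2d, rng)
    return die
```

Setting `sigma_d2d=0.0` gives a purely within-die study, in which every sampled die shares the same global mean.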
[…] is very similar to 2D convolution kernels used in image manipulation.

To compute the local dynamic and static power variation […]

[…] models for each core to generate a cross product of usage patterns and cores. From this cross product it can select entries that represent interesting user-defined scenarios.

For architecture-level power models, the emphasis is traditionally placed on fidelity rather than absolute accuracy. In this way, architectural models can help guide early-stage design decisions without the complexity and detail that an absolute accuracy requirement would demand. VariPower is designed to produce high-fidelity projections of power variability.

[…] portion of an adder implemented in dynamic logic.

[…] in the PowerPC 603 [32]. As we continue the development of VariPower, we hope to extend this list to cover more circuits. We anticipate that SPICE simulations with interconnect resistance and capacitance extracted from actual layouts would provide sufficient accuracy for these blocks. Figure 5 shows the layout for two macro blocks used by VariPower. In the process of assembling these models, we sanity-checked each block for correct functional operation at the target clock frequency.

[…] models to project the deviation under parameter variation. In the evaluations in this paper, we apply the latter mechanism. At present, VariPower does not have enough representative circuit blocks to provide absolute, overall power projections for an entire processor. We therefore use a slightly modified version of Wattch [9] […]

Figure 7 shows the resulting gate length correlation. Note that our convolution-based parameter generation is capable of producing a close facsimile of the empirical findings in [18]. In Figure 6, we present two samples of a four-core CMP modeled using our correlation method. The two chips have very different gate length variation patterns. This underscores the impact that local parameter variation will have on multicore power.

Our second validation examines chip-wide power.
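The linear drop in correlation with distance that Figure 7 depicts follows directly from box-kernel convolution: two windows of width w whose positions differ by d cells share w − d noise samples, so their correlation is ρ(d) = max(0, 1 − d/w). The same formula holds along either axis of a square 2D box. The sketch below checks this numerically in one dimension; the helper names and the one-dimensional simplification are our own.

```python
import random

def box_filtered_line(n, w, seed=None):
    """Circular moving sum of w i.i.d. N(0, 1) samples along a line of n cells."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(noise[(i + k) % n] for k in range(w)) for i in range(n)]

def autocorrelation(xs, lag):
    """Sample correlation between cells `lag` apart (circular indexing)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[(i + lag) % n] - mean) for i in range(n)) / n
    return cov / var

def expected_correlation(lag, w):
    """Shared fraction of two width-w windows: falls linearly to 0 at lag w."""
    return max(0.0, 1.0 - lag / w)
```

Comparing `autocorrelation(box_filtered_line(40_000, 8, seed=11), d)` against `expected_correlation(d, 8)` for d = 0, 2, 4, 6, 8, 12 traces out the same straight-line decay; widening the box (larger `w`) pushes the zero-correlation distance further out.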
VariPower allows us to model both dynamic and static power variation. In the literature, we could not find many reported figures for dynamic power variation; Nassif notes that the impact of manufacturing variation on this topic has not received much attention [15]. One of the benefits of VariPower is its ability to give a comprehensive power projection. In Figure 9, we present our estimates of dynamic and static power variation for a four-core chip multiprocessor, which we describe in detail in Section 5. Our results focus on within-chip parameter variation. We note first that dynamic power variation is limited in comparison to static power variation. In addition, the leakage distribution is skewed, with a small number of chips that have very large leakage factors. We also see approximately a 4x variation in leakage power. These results are all comparable to those reported in [4]. However, the relative spread in leakage is much smaller than the 20x variation described in [7]. We still believe that our projections are reasonable, because they reflect only within-die parameter variation, while the samples studied by Borkar also had substantial die-to-die parameter variation. Die-to-die variation is known to make a major contribution to total variation [8].

Figure 6: Gate length variation in two chip samples. The chips represent two physical instances of the four-core CMP described in Section 5.

VariPower can generate power estimates under variation using two different mechanisms. Under the first mechanism, we directly apply the block-level power estimates to calculate absolute power for a given processor model. The benefit of this approach is that the power variations and the absolute power are tied to the same underlying circuit-level implementation. Under the second mechanism, we use existing power simulators to form a baseline power estimate and apply VariPower […]

5 Experimental Methodology

[…] VariPower on a single CMP design.

[…] which includes detailed models of pipelines, caches, buses, and off-chip memory. We extend M5 by modeling nominal power under parameter variation as described in Section 3.

Table 1: Processor Parameters
Power Parameters
VDD: […]
Clock: […] GHz
Feature Size: 65nm

5.2 Workloads

To evaluate the efficacy of VariPower, we use several workloads that showcase a variety of hardware usage patterns. Individual applications are taken from the SPEC CPU2000 benchmark suite. To reduce the total number of simulations, we identify a subset of SPEC applications that exhibit a range of power and performance characteristics and then focus our case studies on these benchmarks.
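To illustrate why Figure 9 shows limited dynamic-power variation but a wide, right-skewed leakage distribution, the toy Monte Carlo below perturbs only gate length, using the 3σ = 9% gate-length assumption from the processor model description. It assumes dynamic power scales linearly with gate length (through gate capacitance) while leakage scales exponentially; `LEAK_SENSITIVITY` is a hypothetical constant, and the model is per-device and uncalibrated, so its numbers are not VariPower results and do not reproduce the 4x chip-level figure.

```python
import math
import random
import statistics

SIGMA_L = 0.09 / 3       # 3-sigma of 9% of nominal gate length
LEAK_SENSITIVITY = 0.10  # hypothetical: leakage grows by a factor e for a 10% shorter gate

def sample_power(n, seed=None):
    """Per-device dynamic and leakage power, normalized to nominal (= 1.0)."""
    rng = random.Random(seed)
    dynamic, leakage = [], []
    for _ in range(n):
        dl = rng.gauss(0.0, SIGMA_L)                       # relative gate-length deviation
        dynamic.append(1.0 + dl)                           # gate capacitance: linear in L
        leakage.append(math.exp(-dl / LEAK_SENSITIVITY))   # subthreshold leakage: exponential in L
    return dynamic, leakage

def summarize(xs):
    mean = statistics.fmean(xs)
    return {
        "stdev_over_mean": statistics.pstdev(xs) / mean,
        "max_over_min": max(xs) / min(xs),
        "mean_over_median": mean / statistics.median(xs),  # > 1 indicates a right skew
    }
```

Running `summarize` on both outputs of `sample_power(10_000, seed=42)` shows the contrast: analytically, the dynamic stdev/mean is about 0.03, while the lognormal leakage has a stdev/mean of about 0.31 and a mean noticeably above its median.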
Processor Model

Our experiments model power variability and performance of 4-core homogeneous chip multiprocessors for a 65nm pro[cess …] […] in these case studies. The floorplan itself is borrowed from Skadron et al. [31] and is a rough approximation of an Alpha 21264 processor core. We also base the floorplan for our four-core CMP on work by Kumar [22], as shown in Figure […].

In VariPower, Monte Carlo analysis is used to simulate the variations of five process parameters: gate width, gate length, wire length, wire height, and inter-wire distance. In this study, we focus on within-die variation and include no additional die-to-die variation. For the 65nm predictive technology model [13] used in our SPICE simulations, we assume a 3σ variation of 9% of nominal values for gate width and gate length, and a 3σ variation of 15% of nominal values for the remaining process parameters. The whole chip is divided into a 1024 x 1024 grid. Devices in the same grid region are assumed to be perfectly correlated, and the correlation between devices in different grid regions drops linearly as their separation increases, as illustrated in Figure 7. The Monte Carlo simulations produce 10,000 samples.

6 Results

In this section, we conduct a series of case studies using VariPower. These studies serve as examples of the kinds of early-stage studies that VariPower can perform.

6.1 Case Study 1: Core-To-Core Power Variations

As an example of how variability affects microarchitectural structures within a core, we compare the static power of floating-point resources in each core of our CMP. We choose floating-point resources because they are not used by integer applications and hence are likely candidates for leakage management techniques such as standby power modes or power-gating [16, 20]. The insight is that if there is significant variation in power across cores for a given functional unit, we may benefit from selecting specific applications to run on appropriate cores. For example, an application that does not utilize floating-point resources could be run on a core with leaky FPUs, because any reasonable leakage power management strategy would transition those FPUs into a low-power state. Note that this assumes there is little or no difference in maximum operating frequency for the chosen core. Given the effect that a large number of critical paths has on circuit delay [8], this is a reasonable assumption, and it is confirmed by other high-level models [17].

Table 3 presents the normalized mean leakage for floating-point resources across all the cores in our CMP design. For each resource, we rank the structures by decreasing leakage power. Overall, for a given resource type, the leakiest structure is considerably leakier than the least leaky one. On average, the most power-hungry resource uses 16% more power than the corresponding least power-hungry resource of the same type. This suggests that there might be some opportunity for assigning application threads to cores based on their resource usage and the chip leakage profile. We also note that, in general, the ratio of leakage power across cores decreases in the same fashion for all the functional resource types that we study. What is not evident in the table is that when a given structure suffered from a higher leakage factor, other structures in the same core did as well. This is expected given the strong spatial correlation discussed in Section 3.

Table 3: The power distribution of the same microarchitectural structures in different cores (cores ranked by decreasing leakage; means normalized to the least leaky core).

Structure   Statistic    Rank 1   Rank 2   Rank 3   Rank 4
[…]         stdev/mean   0.1105   0.0729   0.0552   0.0467
FPReg       mean         1.154    1.079    1.032    1.000
FPReg       stdev/mean   0.1258   0.0708   0.0497   0.0412
FPAdd       […]

Under power variability, each core may have its own leakage profile for a given resource. Consequently, different assignments of threads to cores may yield different power savings when leakage management is applied. This constitutes an opportunity for core-to-core savings under leakage asymmetry.

[…] other programs. As stated in Section 6.1, the static power of a microarchitectural structure normally has much larger variation, so a program that uses less dynamic power is expected to have a better chance of achieving larger power savings, as illustrated in Figure 10.
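The thread-to-core assignment opportunity from Case Study 1 can be framed as a small search problem. The sketch below assumes one application per core, that units an application does not use are power-gated, and that a gated unit retains a hypothetical residual fraction of its leakage (`GATED_RESIDUAL`); all function names and inputs are illustrative. The per-core FPU values reuse Table 3's normalized FPReg means purely as example inputs. With four cores, exhaustive search over the 4! = 24 assignments is trivial.

```python
from itertools import permutations

GATED_RESIDUAL = 0.1  # hypothetical: a power-gated unit keeps 10% of its leakage

def core_leakage(unit_leakage, used_units):
    """Leakage of one core when units the running app does not use are gated."""
    return sum(leak if unit in used_units else GATED_RESIDUAL * leak
               for unit, leak in unit_leakage.items())

def best_assignment(cores, apps):
    """Try every one-app-per-core mapping; return (total leakage, {app: core})."""
    assert len(cores) == len(apps), "sketch assumes one application per core"
    best = None
    for perm in permutations(range(len(cores))):
        total = sum(core_leakage(cores[core], apps[i][1]) for i, core in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, {apps[i][0]: core for i, core in enumerate(perm)})
    return best

# Illustrative inputs: FPU leakage per core from Table 3's FPReg row;
# integer-unit leakage is held equal across cores for simplicity.
cores = [{"fpu": f, "int": 1.0} for f in (1.154, 1.079, 1.032, 1.000)]
apps = [("fp_app", {"fpu", "int"}), ("int_a", {"int"}), ("int_b", {"int"}), ("int_c", {"int"})]
```

Here `best_assignment(cores, apps)` places `fp_app` on core 3, the core with the least leaky FPUs, and parks the integer-only programs on the leakier cores, whose gated FPUs then contribute only their residual leakage.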
For a given core, there are typically functionally identical structures that may be available to a thread. If the current application does not require all of those resources, there may be a choice of which resources to use and which to transition into a low-power standby mode. Selective cache ways is an example of a power saving strategy to which this may apply [3]. Traditionally, such structures are considered equivalent from a power savings standpoint. However, under parameter variation, there may be a considerable difference in leakage power between two structures that provide identical functionality. For example, one cache sub-array may be leakier than a neighboring sub-array. This is an example of within-core savings under leakage asymmetry.

We used VariPower to model the impact that within-core and core-to-core resource selection can have on power. […] each benchmark, the left bar corresponds to the best situation, in which the application is assigned to the core that consumes minimal power, and the right bar is the opposite scenario, which exhibits the worst result. The central bar shows the average power usage when the application is randomly assigned to a core. All the results are normalized with respect to the […]

When combined with other power management mechanisms, in-core power variations also provide power saving opportunities if proper assignment can be made. A focused power control strategy would choose to power-gate the right microarchitectural units to minimize performance loss, obtaining a better power-performance balance. We study such an example using the benchmark twolf in the remainder of Case Study 2.

Detailed simulation shows that closing 2 of the 16 ways of the L2 cache and half of the L1 D-cache would cause only a 2% performance loss for twolf. In Figure 11, the three bars show the power savings achieved under three scenarios: (1) selecting the most power-efficient, (2) random, and (3) most […] the scenarios in which twolf runs on the most power-efficient core, a randomly selected core, and the most power-hungry core. On average, the best selection achieves 12% more power savings than a random selection and 23% more than the worst choice. Additionally, we see that the benefit is larger when within-core resource selection is used on cores that have higher overall power.

[…] binning, our goal is to identify a set of chip instances that have similar core-level profiles. In the power variability studies that we explore in this paper, we want to identify groups of chips that appear the same with respect to their core power consumption under variation. This knowledge could be used to partition a sample space for further study. In essence, multicore binning can be thought of as a way to bring some order to the mountain of data that emerges from Monte Carlo simulation.

In this case study, we explore the use of clustering, a statistical data mining approach that groups and organizes multidimensional data. In particular, we apply the k-means al[gorithm …]. For a given N, the k-means algorithm groups the given multidimensional data into N clusters. Each cluster contains a collection of data items that share some similarity, usually measured by a distance function (e.g., Manhattan or Euclidean distance). We applied k-means clustering to identify chips that have similar core leakage profiles, using Euclidean distance as the similarity criterion. For each chip in our sample population, we first sort the core leakage values in ascending order. This allows us to compare the cores from different chips using a consistent rank. We explored the benefit of clustering for three values of N: 3, 5, and 7.

Figure 12 summarizes the clustering results by presenting a centroid for each cluster. For N=3, there are two large clusters which comprise almost 90% of the population. The third cluster, which represents 9.95% of the population, is distinguished by a much higher overall leakage figure (reaching 22W) and features a very large leakage value for one of its cores. For N=5, the centroid with the largest total leakage values […]

Figure 12: Core leakage power binning using k-means clustering. N denotes the number of clusters. Bars represent core and chip leakage power for the centroid of each cluster. Percentages represent the size of the cluster relative to the entire sample population (10,000 chips).

7 Discussion and Related Work

Process variations and their impact on system performance and reliability have gained much attention in the research community in recent years. Borkar et al. [7] discuss common parameter variations observed in today's industry and their impact on circuit and microarchitecture. This work also describes current challenges at the circuit level and offers opportunities for architects to help.

In an effort to better understand and describe the underlying physical mechanisms behind parameter variation, recent work has examined the use of statistical models. […] et al. developed a model to estimate the variation of chip leakage current due to gate length process variation. In [14], the authors established a similar model with additional consideration of oxide thickness variability and process parameter correlations. In [1], random dopant fluctuation is further included in estimating the leakage variation. Bowman et al. [8] developed a model describing the maximum clock frequency distribution of processors. This model was demonstrated to be extremely accurate when compared with wafer sort data. Recent research [18, 34] has taken a closer look at calibrating models against real, fabricated chips. The authors of [18] physically measured the critical dimensions on an industrially processed wafer using ELM and observed the strong correlation of gate lengths. In [34], the authors implemented special testing structures and electronically measured leakage currents. As work in this area continues, we will benefit from higher-fidelity parameter variation models.

While much progress has been made on modeling and addressing variation problems at both the device and circuit levels, microarchitects are only beginning to examine the problem. Humenay et al. develop a model for power and performance variability for multicore chips [17]. The major differences between their power model and ours are that we build on SPICE-level macro blocks, and that we also model interconnect-related variations and dynamic power. In addition, we have augmented VariPower with a very flexible model for model[ing …]

[…] partial validation of our model against published results. Finally, we provide a series of case studies that explore the potential for power variability analysis at the microarchitecture level.
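The binning procedure from the clustering case study (sort each chip's per-core leakage ascending, then run k-means with Euclidean distance) can be sketched with Lloyd's algorithm. Lloyd's iteration and random-sample initialization are our assumptions, since the text does not say which k-means variant or initialization was used; the function names are hypothetical.

```python
import math
import random

def leakage_profile(core_leakages):
    """Sort a chip's per-core leakage ascending so chips are compared rank by rank."""
    return sorted(core_leakages)

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's k-means with Euclidean distance; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = None
    for _ in range(iters):
        # Assignment step: each chip joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # converged: no chip changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def cluster_shares(labels, k):
    """Fraction of the sample population in each cluster (cf. Figure 12 percentages)."""
    return [labels.count(j) / len(labels) for j in range(k)]
```

In the setting described above, each point would be one chip's four sorted core-leakage values from the 10,000-sample Monte Carlo population, clustered with k set to N = 3, 5, or 7. Sorting first matters because it makes a chip with one very leaky core look the same regardless of which physical core is the leaky one.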
