Abstract
Introduction
As silicon technology scales to 45nm and beyond, variations in manufacturing process parameters manifest themselves in many forms. These process variations cause inter-die, or die-to-die (D2D), variations and intra-die, or within-die (WID), variations. Both kinds comprise systematic (correlated) and non-systematic (random) components. In general, process variations can have an unpredictable impact on the power, performance, and reliability of a system.
Literature surveys [4, 11] reveal that process variations have a profound effect on the power dissipation and performance of chips. Recent studies show that even in a relatively mature technology like 130nm, these variations can result in as much as a 30% decrease in maximum frequency and a 500% increase in leakage power [9]. For newer technologies, the effect can be even larger: a 20-fold increase in leakage has been reported for 90nm technology [4]. A direct consequence is reduced chip yield: a chip that under-performs or dissipates more power than a given threshold may be discarded, resulting in an effective yield loss. Another important consequence is the variation in power among the remaining chips. In this work, we optimize dynamic power under process variations in chip multiprocessors (CMPs).
CMPs are the latest trend in the chip industry. With advances in process technology, multiple cores are placed on the same die to exploit parallelism in applications and provide higher performance. Like their single-processor counterparts, CMPs are subject to D2D and WID variations. In fact, parameter variability is more acute in CMPs because WID variations may result in core-to-core (C2C) variations [16]. As a result, the performance of certain cores drops below the expected level, and the nominal operating frequency is set to the frequency of the slowest core. In addition, D2D variations cause chips to differ from each other. Under such circumstances, using a single V_dd level for all manufactured chips is power-inefficient, since chips manufactured in the same batch vary significantly. Intuitively, power can be saved by setting a customized V_dd for each chip, or for a set of cores within a chip, thus forming one or more voltage islands. In this paper, we investigate the impact of process variations on multicore chips and mitigate the resulting power inefficiencies by using single or multiple voltage islands.
Particularly, we make the following contributions:
• We develop an extensive model, which encompasses process variations for a CMP using statistical estimations and the detailed floorplan for Alpha EV7-like cores.
• We develop a variation-aware scheme for power optimization using single/multiple voltage islands across different cores in a CMP.
• We analyze varying voltage island granularities and show that depending on the technology, even a single voltage island can reduce the power consumption significantly.
• Finally, we formulate an analytical model that can be used to estimate the advantages of voltage islands for different manufacturing processes.
Overall, our results show that the multiple voltage island scheme results in up to 36.2% power reduction in our target architectures. A single voltage island, on the other hand, can save up to 31.5% of the dynamic power. The remainder of the paper is organized as follows. Section 2 describes the models for power estimation under process variations. In Section 3, we present the multiple voltage island scheme aimed at dynamic power optimization. Section 4 presents the experimental results, followed by a discussion of related work in Section 5. Section 6 concludes the paper with a summary.
Architecture and Process Models

Architecture Modeling
Our goal in this study is to evaluate the impact of voltage islands in a CMP system, where each core or set of cores can set its supply voltage so that it continues to operate correctly with optimal power consumption. To do so, we need to examine how changing the supply voltage affects the latency of the critical path of the processor. To model the critical path, we consider the 7-stage pipeline of an Alpha-21364 (EV7) processor. The main components of our processor model are the issue queue, the register file, the integer execution units, and the memory hierarchy, consisting mainly of the L1 data cache. These structures are known to be the critical components in high-performance microprocessors and are therefore included in our study, whereas the remaining structures are omitted to reduce simulation time. A detailed description of how the individual microarchitectural units are modeled can be found in [8].
To understand the impact of variations on our processor architecture, we have also analyzed the latency distribution of different architectural components under process variations (the variation models are described in the next section). In Figure 1 we plot the latency distributions for different architectural units under process variations. The figure also shows the cumulative latency distribution (i.e., the latency of the entire chip) determined by the longest latency component for each modeled chip. We must highlight that the studied components have equal latency before the introduction of process variations. The results clearly reveal that the latency distribution of the cache dominates the cumulative latency distribution of the chips. For a set of 2000 chips we simulated, 58.9% of the critical paths were found to lie in the L1 cache.
Several factors explain why caches are the most vulnerable to unit-to-unit variations. First, level 1 caches have a high frequency requirement and consequently tend to use low threshold voltages [3]. Second, according to the "FMAX" model introduced by Bowman et al. [5], the number of independent critical paths (N_cp) and the critical path logic depth (L_cp) are two factors determining the criticality of a component. A unit with a high N_cp to L_cp ratio has a larger variance-to-mean ratio in its delay distribution, which increases its susceptibility to process effects. SRAM structures have many critical paths with low logic depth, making them highly susceptible to process variations. In fact, Humenay et al. [17] have demonstrated that unit-to-unit variations will be dominated by SRAM structures under process variations. Figure 2 depicts the frequency slowdown of several microarchitectural components with different N_cp values due to random process variations. It is clear that SRAM structures with a high N_cp value have a significantly higher chance of containing the critical path of the processor. This analysis shows that designers can employ voltage scaling techniques that treat the cache separately.
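The effect of N_cp and L_cp described above can be illustrated with a small Monte Carlo sketch. It is not the FMAX model of [5] itself: it simply assumes that a unit's delay is the slowest of N_cp independent paths, and that random per-gate variation averages out along a path so that the per-path standard deviation shrinks as 1/sqrt(L_cp). The function names, the 10% per-gate sigma, and the two example units are hypothetical.

```python
import math
import random
import statistics

def unit_delay(n_cp, l_cp, rng, gate_sigma=0.10):
    """Normalized delay of one unit: the slowest of n_cp independent paths.

    Assumption: random per-gate variation averages out along a path, so the
    per-path sigma shrinks as 1/sqrt(l_cp). Nominal path delay is 1.0.
    """
    path_sigma = gate_sigma / math.sqrt(l_cp)
    return max(rng.gauss(1.0, path_sigma) for _ in range(n_cp))

def mean_unit_delay(n_cp, l_cp, trials=300, seed=0):
    """Average unit delay over a number of simulated chips."""
    rng = random.Random(seed)
    return statistics.mean(unit_delay(n_cp, l_cp, rng) for _ in range(trials))

# Hypothetical units: deep logic with few paths vs. an SRAM-like array
# with many shallow paths.
logic_like = mean_unit_delay(n_cp=10, l_cp=16)
sram_like = mean_unit_delay(n_cp=2000, l_cp=4)
print(f"logic-like: {logic_like:.3f}  SRAM-like: {sram_like:.3f}")
```

Under these assumptions the SRAM-like unit ends up slower on average even though both units have the same nominal delay, which matches the qualitative argument in the text.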
We must note that a change in the supply voltage will not affect the cache only. Regardless of the critical path selected, if the supply voltage of a core can be reduced, the power consumption of the whole core will be reduced.
Modeling Process Variations
Process variations can be defined as statistical variations in circuit parameters, such as gate-oxide thickness, channel length, and Random Dopant Effects (RDE), caused by shrinking process geometries [4]. They consist of D2D and WID variations. D2D variation refers to the variation in process parameters across dies and wafers, whereas WID variation is the variation in device features within a single die, causing non-uniform characteristics inside a chip. Independent of their type, process variations generally fall into two categories: spatially correlated variations, where devices close to each other have a higher probability of observing a similar variation level, and random variations, which cause random differences between devices within a die. To measure the impact of process variations on our processor model, we considered five variation parameters: interconnect metal thickness (T), inter-layer dielectric thickness (ILD-T or H), line width (W) of interconnects, gate length (L_gate), and threshold voltage (V_th) of the MOS devices. The variation limits for these parameters are given by Nassif [20]. The mean (µ) and 3σ values for each source of variation are listed in Table 1.
We model both spatially correlated and random process variations for our architecture. To take spatial correlation into account, we use a range factor (φ) in the 2D layout of the chip. Each process parameter x can thus be expressed as a function of its mean (µ), variation (σ), and range (φ) values, as shown in Equation 1.
The range parameter φ determines the correlation coefficient. If two points x_i and y_i in a 2D plane are separated by a distance d_i, then the spatial correlation factor C_i between them can be thought of as an inverse linear function of φ and d_i. Note that there is no correlation between two spatial points that are more than φ units apart. With this background, we have generated a spatial map of the parameter values using the R statistical tool [2]. To generate process parameters for a multicore chip, we have replicated the Alpha EV7 floorplan described in the HotSpot tool [25] to form a 16-core CMP. Random variations caused by RDE mainly manifest themselves as random changes in V_th. Hence, to model random variations, we augment the spatially correlated values with values drawn from a uniform random distribution. The amount of random variation can differ for each parameter and is set according to the results presented in previous work [15]. Compared to spatially correlated variations, the magnitude of random variations remains lower and generally does not exceed 30% of the overall variation. Note from Figures 3(a) and 3(b) that in our model the random variations depend on the φ value. For a small φ value (0.3), the parameters are highly random, whereas for a large φ value such as 0.7, the parameters are highly correlated. A φ value of zero would imply totally random variations.
A set of values is generated for each process parameter and fed into the parameterized SPICE models described in Section 2.1. A batch of 2000 chips is simulated from these models: once the simulation of a chip (more precisely, of the 16 cores within a single chip) is completed, we generate a new set of parameter values corresponding to another chip. This automated procedure simulates the WID variations for each chip. Since we pick the initial parameters from a normal distribution, the effects of D2D variations are also captured.
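The role of the range parameter can be sketched in a few lines of Python. The exact expression used in the paper is not reproduced here; the sketch assumes the simplest form consistent with the description, a correlation that falls linearly from 1 at distance 0 to 0 at distance φ. The helper `neighbour_value` is hypothetical and uses a standard conditional-Gaussian draw rather than the uniform random component mentioned in the text.

```python
import math
import random

def spatial_correlation(d, phi):
    """Assumed form: linear decay, no correlation at or beyond distance phi."""
    if phi <= 0:
        return 0.0  # phi = 0 means totally random variation
    return max(0.0, 1.0 - d / phi)

def neighbour_value(x_ref, d, phi, mu, sigma, rng):
    """Draw a parameter value at distance d from a point whose value is x_ref.

    The result has mean mu, standard deviation sigma, and correlation
    spatial_correlation(d, phi) with x_ref (if x_ref ~ N(mu, sigma)).
    """
    c = spatial_correlation(d, phi)
    noise = rng.gauss(0.0, sigma)
    return mu + c * (x_ref - mu) + math.sqrt(1.0 - c * c) * noise

rng = random.Random(0)
# Hypothetical normalized parameter with mean 1.0 and sigma 0.1.
near = neighbour_value(1.2, d=0.05, phi=0.7, mu=1.0, sigma=0.1, rng=rng)
far = neighbour_value(1.2, d=0.80, phi=0.7, mu=1.0, sigma=0.1, rng=rng)
print(near, far)
```

With a large φ most pairs of points on the die fall within range and track each other closely; with a small φ most pairs are uncorrelated, which is why the small-φ maps look more random.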
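The batch procedure above can be sketched as a two-level draw. This is a simplified illustration, not the authors' flow: it draws one die-level value per chip from a normal distribution (D2D) and then one value per core around it (WID), ignoring spatial correlation between cores and skipping the SPICE step. All names and sigma values are hypothetical.

```python
import random
import statistics

def generate_batch(n_chips=2000, n_cores=16, mu=1.0,
                   sigma_d2d=0.05, sigma_wid=0.03, seed=1):
    """Return n_chips lists, each holding one parameter value per core."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n_chips):
        die_mean = rng.gauss(mu, sigma_d2d)        # D2D: one draw per chip
        cores = [rng.gauss(die_mean, sigma_wid)    # WID: one draw per core
                 for _ in range(n_cores)]
        batch.append(cores)
    return batch

batch = generate_batch()
chip_means = [statistics.mean(cores) for cores in batch]
d2d_spread = statistics.stdev(chip_means)                      # chip-to-chip
wid_spread = statistics.mean(statistics.stdev(c) for c in batch)  # within chip
print(f"D2D spread ~ {d2d_spread:.3f}, WID spread ~ {wid_spread:.3f}")
```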
Power-Aware Multiple Voltage Islands
Methodology
This section presents an overview of the power-aware multiple voltage island scheme for CMPs. Manufactured processors operate at the nominal voltage, a supply level chosen at design time. Under process variations, however, using one fixed level for all manufactured chips is considerably inefficient. First, different chips will have different latency slacks, which can be exploited by customizing the voltage level for each chip. In addition, if we consider C2C variations, different cores will tend to have different latencies. In such cases, the operating frequency of the whole chip is determined by the maximum latency across all cores. It is known that the dependency of delay (D), or latency, on the supply voltage is given by:

D ∝ V_dd / (V_dd − V_th)^α    (3)
where V_th is the threshold voltage and α is a technology constant varying between 1 and 2. Equation 3 implies that cores with a latency lower than that of the slowest core (the nominal delay) can tolerate a higher latency, so their V_dd can be scaled down in steps until it reaches some minimum value. We refer to this voltage as the minimum stable supply voltage (V_opt). Beyond this point, circuit operation fails. The nominal supply voltage, on the other hand, is the voltage set at design time that gives the desired latency for the set of manufactured chips. Our experiments indicate that in most cases the supply of one or more cores can be reduced below the nominal V_dd value. This optimization can significantly cut static and dynamic power dissipation, lowering the energy of the whole system. Thus, in a multicore system, a single core or a group of cores can be clustered on the basis of critical latencies and assigned a custom supply voltage. Such clusters with a customized V_dd are referred to as voltage islands. In general, if there are k voltage islands in a system with a nominal clock frequency of f_clk and a corresponding supply voltage V_dd, the dynamic and leakage power savings can be denoted by:
where I_leak,k and C_load represent the average leakage current and load capacitance for each voltage island. Since our target CMP includes 16 cores, k can take values from 1 to 16. In the former case, the entire multicore system operates on a single customized V_dd, while in the latter case each core has its own supply. One way of implementing the multiple voltage island scheme is to place a configurable DC-DC voltage converter in each voltage island. Once the chip has been tested, the voltage level for each island is set once, so dynamic adjustment is avoided. In addition, several software tools allow user-level control of the supply voltage, particularly in mobile processors [1]. This concept can similarly be extended to multicore systems. The extra overhead takes the form of a supply voltage table that keeps the voltage specification for each island or core; the kernel uses this table during boot. Alternatively, hardware mechanisms like [12, 13, 14] can be easily adapted for this purpose.
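The scheme in this section can be sketched as follows: lower each core's supply in steps while the alpha-power-law delay still meets the target latency, then run every island at the highest V_opt among its cores. This is a minimal sketch under stated assumptions, not the authors' SPICE-based flow. The threshold voltage (300mV), α (1.3), the 5mV step, and the example latencies are hypothetical; the 900mV nominal and 615mV floor come from the text. The savings figure covers dynamic power only and assumes equal load capacitance and clock frequency for every core.

```python
def delay(vdd_mv, vth_mv=300, alpha=1.3):
    """Alpha-power-law delay D ~ Vdd / (Vdd - Vth)^alpha, up to a constant."""
    return vdd_mv / (vdd_mv - vth_mv) ** alpha

def min_stable_vdd(core_latency, target_latency,
                   v_nom=900, v_floor=615, step=5):
    """Lowest supply (mV) at which a core still meets target_latency.

    core_latency is the core's latency at v_nom; it fixes the per-core
    constant in front of the alpha-power law.
    """
    scale = core_latency / delay(v_nom)
    v = v_nom
    while v - step >= v_floor and scale * delay(v - step) <= target_latency:
        v -= step
    return v

def island_voltages(core_vopt, islands):
    """Each island must run at the highest V_opt among its cores."""
    return [max(core_vopt[c] for c in island) for island in islands]

def dynamic_saving(core_vopt, islands, v_nom=900):
    """Fraction of dynamic power saved relative to all cores at v_nom."""
    v_isl = island_voltages(core_vopt, islands)
    n_cores = sum(len(island) for island in islands)
    used = sum(len(island) * v ** 2 for island, v in zip(islands, v_isl))
    return 1.0 - used / (n_cores * v_nom ** 2)

# Hypothetical 4-core example: latencies at 900 mV, target latency 10.0.
latencies = [9.6, 9.0, 8.0, 9.5]
target = 10.0
vopt = [min_stable_vdd(l, target) for l in latencies]
one_island = dynamic_saving(vopt, [[0, 1, 2, 3]])
two_islands = dynamic_saving(vopt, [[0, 3], [1, 2]])
per_core = dynamic_saving(vopt, [[0], [1], [2], [3]])
print(vopt, one_island, two_islands, per_core)
```

Because an island is limited by its slowest core, finer island granularity can never save less than a coarser one in this sketch; how much more it saves depends on how different the cores are.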
Modeling Power Optimization
In this section, we develop a model that can predict the amount of power savings for a given manufacturing technology. At the core of the model lies the observation that latency and voltage levels are correlated. For example, if a circuit has a latency of 8ns and the target frequency requires only 10ns operation, we can reduce the supply voltage until the latency reaches 10ns. Thus, for a particular initial latency value (l), there exists a corresponding minimum stable voltage level (henceforth called the optimal supply voltage, V_opt) that guarantees correct operation and results in minimal power consumption. Since this metric depends on the latency, we first have to extract the relation between the latency and the optimal supply voltage: V_opt = h(l). Note that this function is circuit-specific. For our target architecture, we first plotted the latency (l) versus the corresponding V_opt values, as shown in Figure 5. Then, using curve fitting techniques, the function h is found to be:
There are two important aspects of function h (Equation 6), which is defined piecewise. First, our analysis revealed that the circuit does not work below 615mV (note that the nominal voltage level is 900mV). Second, h depends on the cutoff point, which corresponds to the frequency at which the processor will run and is set by the designer. Thus, using function h, we can compute the value of V_opt. Since dynamic power is proportional to the square of the supply voltage, from Equation 4 we get the dynamic power savings for a chip with latency l as:
We also need the latency distribution of the batch of manufactured chips to understand the advantages of a voltage island scheme. Assuming that this distribution is Gaussian (Figure 6), we can formulate the probability of a chip having a certain latency by g such that:

g(l) = (1 / (σ √(2π))) · exp(−(l − µ)² / (2σ²))    (8)
where µ and σ are the mean and standard deviation of the latency (l) distribution. Note that these values can be estimated for a given manufacturing technology. Thus the average dynamic power dissipation (P) for a batch of chips with latency distribution g(l) can be given as:
Hence, given the µ and σ of a distribution, Equation 9 can be used as an analytical model to compute the dynamic power consumption with V_opt. In Section 4, we show that this model estimates the optimized power consumption levels with high accuracy for the manufacturing technologies we studied. Note that for a different technology, the model only requires the µ, σ, and cutoff values, which are easily available.
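The analytical model can be evaluated numerically once h, µ, σ, and the cutoff are known. The sketch below is only an illustration of that calculation: it assumes the average power is the integral of g(l) times the dynamic power at h(l), normalized to the power at the nominal 900mV. The fitted h(l) is not reproduced here, so the `h` below is a hypothetical linear stand-in clipped at the 615mV floor; the slope and the example µ, σ, and cutoff values are placeholders.

```python
import math

V_NOM = 900.0    # mV, nominal supply from the text
V_FLOOR = 615.0  # mV, lowest working supply from the text

def h(l, cutoff, slope=60.0):
    """HYPOTHETICAL stand-in for the fitted latency -> V_opt function."""
    if l >= cutoff:
        return V_NOM                      # no slack: stay at nominal
    return max(V_FLOOR, V_NOM - slope * (cutoff - l))

def g(l, mu, sigma):
    """Gaussian latency density with mean mu and standard deviation sigma."""
    z = (l - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def expected_relative_power(mu, sigma, cutoff, n=2000):
    """Trapezoidal estimate of the integral of g(l) * (h(l)/V_NOM)^2
    over mu +/- 6 sigma, i.e. average dynamic power relative to nominal."""
    lo, hi = mu - 6.0 * sigma, mu + 6.0 * sigma
    dl = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        l = lo + i * dl
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * g(l, mu, sigma) * (h(l, cutoff) / V_NOM) ** 2
    return total * dl

# Placeholder latency statistics in ns.
p_rel = expected_relative_power(mu=8.0, sigma=0.5, cutoff=10.0)
print(f"average dynamic power relative to nominal: {p_rel:.3f}")
```

The predicted saving is then 1 minus this value. Swapping in the real fitted h and the measured µ and σ of a technology is all that is needed to reuse the calculation.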
Results
In this section we present the power optimization results for different voltage island schemes and analyze how accurately our model predicts the power consumption with voltage islands. We conducted SPICE simulations on the circuit model described in Section 2. Figure 7 illustrates the power savings for different voltage island schemes with different amounts of randomness in the variation. It shows the percentage of power saved compared to a processor that uses the nominal voltage (900mV) in all its cores. For a highly random case (φ = 0.3), the dynamic power improvement lies between 13.5% and 35.1%. For the most highly correlated variations (φ = 0.7), on the other hand, the improvements range between 31.5% and 36.2%. For the φ = 0.5 model, the dynamic power reduction is between 30.5% and 36.2%.
We can reach two important conclusions from these results. First, customized supply voltage levels are an attractive means of reducing the power consumption of CMPs under process variations. In particular, the dynamic power consumption of the chip can be reduced by as much as 36.2% on average, which is achieved when each core is individually controlled (16 voltage islands). Second, depending on the manufacturing technology, even a single customized voltage for the whole chip can reduce the power consumption significantly. For the φ = 0.5 and φ = 0.7 models, a single customized voltage level reduces the power consumption by 30.5% and 31.5%, respectively. Only when the spatial correlation diminishes (φ = 0.3) do we need individual control of the cores: for the φ = 0.3 model, a scheme that uses 16 voltage islands saves 35.1% of the dynamic power, while the single voltage island scheme reduces the power consumption by only 13.5%.
Another interesting trend in the results is that voltage island configurations with the same number of cores per island achieve almost the same savings. For example, the 2-vertical and 2-horizontal voltage island configurations have similar power savings.
Accuracy of the model:
We compare the results obtained from the analytical model (Equation 9) with the empirical data from our experiments. The average errors in P for φ values of 0.3, 0.5, and 0.7 are 0.01%, 0.30%, and 0.44%, respectively. Thus our model gives highly accurate estimates of the dynamic power consumption.
Related Work
Parameter variations have lately been a topic of interest in both industry and academia. Power optimizations under process variations have also been studied by several researchers. Previous works have proposed several circuit-level techniques to counter the negative effects of process variations [4, 6, 10]. The inter- and intra-die process variations and their effects on circuit leakage are studied in detail by Rao et al. [23]. In another work, Rao et al. [24] analyze the impact of process variations on circuit leakage and propose methods to reduce it. Most of these techniques focus on analyzing the design statistically or with static timing analysis, and then modifying the parts of the circuit that are most susceptible to variations. Ozdemir et al. [22] have proposed architectural techniques to improve chip yield under process variation effects. In addition, many gate-sizing strategies have been applied to the critical or near-critical regions of the circuit in order to reduce the effective latency [9].
Variable Voltage/Frequency Islands (VFIs) have been used previously by other researchers [7, 18, 19]. Marculescu et al. [18] show that VFI-based latency-constrained systems are more likely to meet timing constraints than Single Clock, Single Frequency (SSV) based systems. In another work, Marculescu et al. [19] have suggested a GALS-like architecture with multiple voltage islands for energy awareness under parameter variations. Dhar et al. [12] have designed a controller-based adaptive supply voltage scaling (AVS) mechanism for standard cell ASICs. Niyogi et al. [21] have addressed the use of multiple VFIs for energy optimization in media and signal processing applications. These works, although important in showing the advantages of customized voltage islands, do not study CMPs but concentrate on application-specific processors. In a recent work, Humenay et al. [16] have studied the effects of core-to-core (C2C) variations on the power dissipation and yield of chip multicore processors. The authors have investigated the effects of systematic variations on dense and distributed floorplans of a CMP, and used Adaptive Voltage Scaling (AVS) techniques to boost the performance of slow cores. Our work, on the other hand, emphasizes the importance of multiple voltage islands in CMPs for reducing power dissipation, and performs a detailed analysis of the advantages of various voltage island formations. To the best of our knowledge, there has been no previous work analyzing the impact of multiple voltage islands on CMPs under process variation.
Conclusion
In this work we analyzed the effects of parameter variations on CMPs with an emphasis on power dissipation. We presented a variation modeling technique that involves five variation parameters affected by both systematic and random variations. We also described an accurate model that can be used to estimate the advantages of forming voltage islands. Our simulations indicate that a custom supply voltage is more effective than a predetermined nominal V_dd for the entire chip. In particular, applying multiple voltage islands under a latency constraint cuts the power dissipation of CMPs by as much as 36.2%. We also show that for most manufacturing technologies, even a single customized supply voltage for the whole chip can reduce the power consumption substantially.
