Abstract-Near-threshold computing (NTC) has the potential to significantly improve efficiency in high-throughput architectures, such as general-purpose computing on graphics processing units (GPGPU). Nevertheless, NTC is more sensitive to process variation (PV), and it complicates power delivery. We propose GPU stacking, a novel method based on voltage stacking, to manage the effects of PV and improve the power delivery simultaneously. To evaluate our methodology, we first explore the design space of GPGPUs in the NTC to find a suitable baseline configuration and then apply GPU stacking to mitigate the effects of PV. Compared with an equivalent NTC GPGPU without PV management, we achieve 37% more performance on average. When considering high production volume, our approach shifts all the chips closer to the nominal non-PV case, delivering on average (across chips) ≈80% of the performance of nominal NTC GPGPU, whereas without our technique, chips would have ≈50% of the nominal performance. We also show that our approach can be applied on top of multifrequency domain designs, improving the overall performance.
I. INTRODUCTION
NEAR-threshold computing (NTC) is a circuit design technique that reduces the power envelope of a design, and thus improves energy efficiency, by lowering the operating voltage to near the threshold voltage of the transistors employed [1]. Although more energy efficient, NTC devices usually do not deliver as much performance as their non-near-threshold counterparts. However, it has been shown that the performance impact resulting from NTC can be mitigated through parallelism. An ideal candidate for such operation is a GPU [1]-[3]. Nonetheless, NTC makes the system more sensitive to process variation (PV) [4].
To manage the additional sensitivity to PV introduced by NTC, some researchers have proposed frequency scaling [5] or having multiple voltage domains [6] . Having multiple voltage domains requires additional power rails, which will further exacerbate the current delivery problem. Multifrequency domains [7] have been shown to moderately mitigate the problem of PV effects on energy and performance in many-core NTC designs.
Voltage stacking improves the efficiency of power delivery [8] . When n units are stacked, they are placed in a series fashion, rather than the conventional parallel scheme. Thus, the current in the power delivery network (PDN) is reduced by a factor of n in a system. This allows voltage regulators (VRs) with increased efficiency, smaller areas, and fewer package pins dedicated to power [9] . Voltage stacking has been applied to CPUs [8] , GPUs [10] , and SRAMs [11] .
However, in this paper, we look at voltage stacking from a different perspective. We note that voltage stacking can compensate for PV effects. The proposed GPU Stacking methodology lets the voltage node between the stacked elements (V mid) float. This floating node is the key to PV compensation. GPU Stacking alleviates the current delivery challenges and intrinsically mitigates PV effects without requiring multiple voltage domains. GPU Stacking automatically creates a voltage domain per level in the stack without the cost of multiple power rails. We build on top of this premise and discuss how it can be leveraged for managing PV.
Voltage stacking of many cores has its own challenges, among which is the load mismatch between the stacked cores [9]. Cores go through different phases while running applications, which can result in a transient impedance mismatch between the stacked cores and yield timing failures. As a result, stacking is most successful when the cores have matching workloads. General-purpose graphics processing units (GPGPUs) are instances of such designs. Not only are the cores identical, but the applications running on them are roughly homogeneous.
The evaluation of GPU Stacking is carried out in the near-threshold region. Although we propose using this method in the near-threshold region, this is not a requirement. The use of NTC in this paper is justified by the increased sensitivity of NTC to PV effects. The first part of the evaluation consists of finding the ideal GPU configuration for the NTC region. By carefully sizing the GPU for NTC, power consumption is reduced by 43% with only 4.8% performance degradation compared with the baseline.
Based on our experiments, there is a potential for self-balancing in the stacked configuration. We observe that stacking cores with opposing PV trends gives the best self-balancing results. This paper proposes the stacking of SIMD lanes, because having a large number of lanes provides more opportunities for PV compensation. Analyzing several PV maps, we show that GPU Stacking delivers, on average, ≈80% of the nominal performance in a multifrequency domain scenario (as opposed to ≈60% in a nonstacked configuration). This represents a reduction of about 30% in the effects of PV even after using a multifrequency-domain-based compensation of PV effects. When not considering the multifrequency domain scenario, the reduction in PV impact is even larger, and around 37% of the PV effects are compensated. This compensation is mainly due to an uneven voltage distribution that accounts for the PV of each cluster of lanes. For a nominal voltage of 0.6 V, the observed voltage was in the range of 0.602-0.65 V for lanes negatively affected by PV and 0.55-0.598 V for lanes positively affected by PV.
The floating V mid node could be a source of problems in cases of extreme load mismatch or high temperatures. Thus, we propose the use of dummy activity (DA) and a small VR to keep the voltage within safe operational margins (i.e., to avoid voltage starvation in one of the levels). We use SPICE simulations to verify the reliability of the PDN and show that GPU Stacking does not incur extra voltage noise. In our simulations, DA and the additional VR were not required, given the stability of the operational voltage.
The main contributions of this paper are as follows: 1) the first study to show how voltage stacking alleviates PV effects; 2) the first study proposing stacking with an uneven voltage division (V mid); and 3) a technique that makes the postsilicon configuration of the design feasible.
The remainder of this paper is organized as follows. In Section II, we discuss the related work on a few different topics: voltage stacking, mitigation of PV effects, NTC, and energy efficiency in GPUs. Then, in Section III, we briefly present the background information needed to understand this paper. We present the GPU stacking model in Section IV. Finally, we discuss the experimental setup in Section V and the results in Section VI. We wrap up in Section VII.
II. RELATED WORK
We divide our related work into a few categories. We start by discussing prior approaches to voltage stacking. Then, we present other techniques to mitigate PV effects. Finally, we discuss work on NTC and energy efficiency for GPUs.
A. Voltage Stacking
As supply voltage decreases, the efficiency of power delivery components degrades [9] . On-chip VRs [12] have been proposed to increase the PDN efficiency, as well as the series configurations of units rather than parallel [8] , [13] , [14] . Such configurations are known as multistory power delivery [13] , charge recycling [14] , or voltage-stacked systems [8] . Our proposed technique, GPU Stacking, depends on a series configuration of cores, yet for a different purpose. Note that no previous research on voltage stacking exploits the stacking method to control or neutralize PV effect.
B. Process Variation Mitigation
PV increases as feature size shrinks. In addition, lowering V dd , as a power management technique, further exacerbates the PV effects. Lee et al. [15] study the impact of frequency variation on the throughput of a GPGPU. Adaptive body bias (ABB) leverages the power-performance tradeoff to manage PV effects [16] . Slower devices due to PV can run faster by consuming more power, and vice versa. Adaptive supply voltage (ASV) is another technique, where supply voltage of a region in the design is adjusted to compensate for performance loss due to variation. ABB and GPU Stacking are orthogonal techniques, and could be used together, although ABB efficiency is expected to reduce with technology scaling [7] . Similar to ASV, our technique provides a custom supply voltage to each stacking cluster to compensate for PV.
C. NTC
An extensive amount of research targets increasing the power/energy efficiency of processors. Apart from the many proposed techniques for reducing the utilization of resources [17], [18] or promoting the use of more power-efficient structures [19], [20], there are a significant number of proposals that attempt to exploit the power-performance tradeoff. DVFS and power gating are among the techniques widely studied at an architectural level for the same purpose [21], [22]. Intel Turbo Boost technology is another example of utilizing voltage and frequency scaling to adapt to runtime conditions. Dreslinski et al. [1] study devices for near-threshold operation, and Chang et al. propose the optimization of device parameters for NTC [2]. They propose a slightly modified SRAM cell to address the stability challenges introduced in near-threshold regions. Lower V dd exacerbates the effects of PV. Another study argues in favor of fine-grained core assignment and DVFS when using NTC in many-core systems [7].
D. Energy Efficiency in GPGPUs
With the rising popularity of GPGPUs, several research groups discuss strategies to make them more energy efficient [23]-[26]. Lee et al. [15] study the impact of frequency variation on the throughput of a GPGPU. This methodology, however, is the first to extend the evaluation to NTC tradeoffs, and it is an extension of previous research addressing NTC challenges. The massive data parallelism and extremely repetitive nature of GPGPU applications are leveraged to adapt the operational region and configuration to the runtime application demand.
III. BACKGROUND
This section briefly describes the background concepts related to voltage stacking, PV, and NTC that are needed to better understand this paper. We also go over the microarchitecture of a GPU, giving special attention to the details relevant to this paper. Fig. 1 shows two power delivery schemes. The conventional power delivery scheme [Fig. 1(a)] has all the elements in parallel. The elements can be interpreted as individual gates, functional units within a core, a whole core, and so forth. The stacked power delivery scheme has elements in series; in the case shown in the figure, there are two stacking levels, since two elements are stacked. Voltage stacking in general is not limited to two levels, but in this paper, we only consider two. In a voltage-stacked system, the power delivered is the same as in the conventional case, but the delivery voltage is multiplied by n, the number of stack levels, and the current is divided by n, on average [8].
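The relation between delivery voltage and current stated above can be illustrated with a minimal sketch. It assumes identical units and ideal (lossless) delivery; the function name and the example values (32 units of 0.5 W at 0.6 V) are hypothetical and chosen only for illustration.

```python
def pdn_operating_point(unit_power_w, unit_voltage_v, n_units, stack_levels=1):
    """Return (supply_voltage, supply_current) seen by the PDN.

    Assumes identical units and lossless delivery (illustrative only).
    """
    if n_units % stack_levels != 0:
        raise ValueError("n_units must be a multiple of stack_levels")
    total_power = unit_power_w * n_units            # same power in both schemes
    supply_voltage = unit_voltage_v * stack_levels  # series levels add up
    supply_current = total_power / supply_voltage   # current shrinks by stack_levels
    return supply_voltage, supply_current


# Hypothetical example: 32 units, 0.5 W each, 0.6 V per unit.
conventional = pdn_operating_point(0.5, 0.6, n_units=32, stack_levels=1)
stacked = pdn_operating_point(0.5, 0.6, n_units=32, stack_levels=2)
print("conventional (V, A):", conventional)
print("stacked      (V, A):", stacked)
```

With two stack levels, the supply voltage doubles and the PDN current is halved for the same delivered power.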
Assuming that the power consumed by each element is the same, the voltage across each element is equal to the nominal V dd . Because roughly half the current flows through the system, the power delivery subsystem could operate more efficiently [9] . However, this is not always the case. When a full core is stacked on top of another, the activity on each core will depend on the program running and will create a mismatch between the stack levels and make V mid shift from V dd . This problem has been solved in different ways in the literature, for instance, by inserting an extra VR [8] , partitioning a core in units that have correlated power consumption [9] , or by stacking memory arrays for SRAM application [11] .
In this paper, we use GPGPUs that contain a large number of identical cores operating in lockstep, running the same program. This provides a very good starting point for voltage stacking. We note that "divergence" can occur and we address divergence later in this paper.
A. General Purpose GPU
In this section, we explore the microarchitecture of a GPU. General-purpose computing on graphics processing units (GPGPU) is becoming pervasive, as it provides excellent computing power for massively parallel applications. GPGPUs are mainly designed as a cluster of simple processors, as shown in Fig. 2. Identical simple processors (lanes) operate in lockstep inside a stream multiprocessor (SM) cluster, running identical threads (though processing different data).
The homogeneous structure of GPGPUs, both in hardware and application, makes them suitable for voltage stacking in order to manage PV. GPU Stacking stacks lanes inside SMs. All the other structures remain in a conventional configuration. The choice of stacking lanes provides room for more configurability due to the larger number of lanes. It also provides a fine-grained mitigation of PV, while at the SM level, techniques such as multiclock and multivoltage domains remain possible.
B. Process Variation Model and Effects
So far, we have mentioned PV without specifying how PV can affect a chip. PV is the deviation from the nominal specifications of a chip due to the difficulty of precisely controlling the fabrication process, especially at small feature sizes; sources include, e.g., the lithographic lens and dopant density fluctuations. Variation can be divided into two categories: systematic and random. From a design point of view, they differ in the granularity at which they occur. Random variation occurs at the transistor level. Systematic variation occurs at a much coarser granularity: within-die (WID) and die-to-die (D2D). Significant variation can be seen on the order of half the chip length [4]. Overall, random variation ends up being averaged out across a few gates in a path, since statistically, positive and negative variations will be observed in that path.
D2D variation cannot be addressed with on-die techniques, since the whole die is biased toward a process corner. Therefore, this paper aims to address WID variation. WID variation affects blocks in different regions of the design differently. As a result, each clock region is limited by the slowest block within it.
For a MOS transistor, a gate delay can be formulated as follows, where V dd is the supply voltage to the core, L eff is the effective channel length, and K and M are the fabrication constants [4] :
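A form consistent with these definitions is sketched below. It assumes the EKV-based delay model on which VARIUS-NTV [4] builds, with the device-level constants folded into K and M; this is an assumption made for illustration and not a quotation of the original equation.

```latex
t_g \;\propto\; \frac{V_{dd}\, L_{eff}}{K \,\ln^{2}\!\left(e^{\,M\,(V_{dd}-V_{th})} + 1\right)}
```

In this form, the delay is linear in L_eff, while V_th enters only through the exponential term in (V_dd − V_th).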
Due to the exponential component with the (V dd −V th ) term, the gate delay exhibits more sensitivity to V th variation at supply voltages close to V th . The delay variation with regard to L eff , however, is linear. This is particularly important for NTC applications, since V th will have more effect in those cases.
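To make the sensitivity argument concrete, the following sketch evaluates a delay model of the assumed form t_g ∝ V_dd·L_eff / (K·ln²(e^{M(V_dd − V_th)} + 1)). The form itself, the constants K = 1 and M = 12.8 V⁻¹, and the threshold voltage of 0.45 V are all hypothetical values picked for illustration, not parameters taken from the paper.

```python
import math


def gate_delay(vdd, vth, leff=1.0, k=1.0, m=12.8):
    """Relative gate delay under the assumed EKV-style form (illustrative)."""
    return vdd * leff / (k * math.log(math.exp(m * (vdd - vth)) + 1.0) ** 2)


def delay_increase(vdd, vth, vth_shift=0.0, leff_scale=1.0):
    """Fractional delay increase caused by a Vth shift and/or an Leff scaling."""
    nominal = gate_delay(vdd, vth)
    varied = gate_delay(vdd, vth + vth_shift, leff=leff_scale)
    return varied / nominal - 1.0


VTH = 0.45  # hypothetical threshold voltage in volts

for vdd in (1.0, 0.6):
    d_vth = delay_increase(vdd, VTH, vth_shift=0.03)    # +30 mV on Vth
    d_leff = delay_increase(vdd, VTH, leff_scale=1.10)  # +10% on Leff
    print(f"Vdd={vdd:.1f} V: Vth shift -> {d_vth:+.1%}, Leff +10% -> {d_leff:+.1%}")
```

Under these assumptions, the same V_th shift costs noticeably more delay at 0.6 V than at 1.0 V, whereas the L_eff penalty stays at 10% at both supply voltages.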
C. NTC and Power-Performance Tradeoff
The main argument for the adoption of NTC is its energy efficiency in terms of energy per operation [4]. This is because frequency has a linear relationship with the voltage down to the near-threshold region, while power is reduced in a cubic relationship with voltage. Thus, for instance, going from 1 to 0.5 V, the device delay increases by a factor of 2, while the power is lowered to (1/2)^3 of the original value, which results in a reduction of energy consumption, as a product of power and delay, to (1/2)^2 of its original value. The performance loss can be compensated with extra resources to support more parallelism. If we consider two times more resources to compensate for the two times increase in delay, the power would increase by a factor of 2, and the delay would decrease by half, leaving the energy reduction unchanged. So ideally, without considering the impact of PV and faults, energy consumption can be cut to (1/2)^2 for the same performance target.
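The arithmetic in this paragraph can be checked with a short sketch. It uses only the first-order relations stated in the text (frequency linear in voltage, power cubic in voltage, and extra resources trading power for delay); the function name and argument names are hypothetical.

```python
def ntc_scaling(voltage_ratio, resource_factor=1.0):
    """First-order scaling relative to the nominal design.

    Returns (delay, power, energy), each relative to the nominal value.
    Assumes f ~ V and P ~ V^3, and that extra resources scale power up
    and delay down by the same factor (ideal parallelism).
    """
    delay = (1.0 / voltage_ratio) / resource_factor
    power = voltage_ratio ** 3 * resource_factor
    energy = power * delay
    return delay, power, energy


print(ntc_scaling(0.5))       # (2.0, 0.125, 0.25)
print(ntc_scaling(0.5, 2.0))  # (1.0, 0.25, 0.25)
```

Doubling the resources restores the original delay while the energy stays at one quarter of the nominal value, matching the (1/2)^2 figure.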
However, as mentioned, the effects of PV, in particular of V th variation, are exponential with (V dd − V th ), meaning that the effect of variation is more significant at lower voltages. Therefore, our approach is particularly useful in NTC applications.
IV. GPU STACKING
In this section, we take a bottom-up approach to construct the GPU stacking model. We start by analyzing the fundamentals of how stacking can help mitigate PV effects, and build upon that to get to the stacked GPU.
A. Process Variation Compensation With Voltage Stacking
GPU Stacking provides a unique opportunity to manage the effects of PV. An increased channel length (L eff ) or an increased threshold voltage (V th ) due to PV will result in higher impedance of the channel and slower device.
In a conventional parallel power delivery system, adverse PV results in a lower current I_in,i through core i, as I_in,i = V dd/Z_i. Since the gate delay is inversely proportional to I_in,i, this results in a higher delay and a slower core. To compensate for the lower I_in,i, a higher voltage can be applied to the core, or the body bias can be adjusted to reduce the gate delay. In short, the adverse effect of PV can be compensated by delivering a higher voltage.
In a stacked configuration, the same current I c passes through different stack levels. Therefore, higher impedance of core i, due to adverse PV, results in a higher voltage across it:
$$V_i = I_c \, Z_i \tag{2}$$
Equation (2) shows the voltage across each core. Index i is the core or lane number. Depending on the switching activity of the circuit, the equivalent supply-to-ground impedance of a core changes during execution. This will be referred to as Z(t). PV will bias Z according to the magnitude of the variation. Switching activity (or the running application) will change the transient aspect of Z. Using a performance and power simulator, we can obtain the core power traces, and when the physical dimensions of the PDN are given, we can use a circuit simulator to analyze the impedance change over time [28]. In our experiments, since the power consumption changes over time, the impedance of the circuit also varies over time.
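A minimal sketch of this flow is given below. It assumes a purely resistive approximation, Z(t) = V²/P(t), to turn a power trace into an equivalent impedance, and then divides the stack supply between two series levels. The approximation, the function names, and the sample power values are assumptions for illustration; a circuit simulator, as used in the paper, would capture effects this ignores.

```python
def equivalent_impedance(power_trace_w, voltage_v):
    """Z(t) = V^2 / P(t): resistive approximation of a core (assumption)."""
    return [voltage_v ** 2 / p for p in power_trace_w]


def stack_voltages(z_head, z_foot, supply_v):
    """Split the supply across two series levels sharing one current.

    Returns (v_head, v_foot, current); v_foot is also the V_mid node voltage.
    """
    current = supply_v / (z_head + z_foot)
    return current * z_head, current * z_foot, current


# Hypothetical power traces (watts) sampled at a nominal 0.6 V per level.
head_z = equivalent_impedance([0.50, 0.55, 0.52], 0.6)
foot_z = equivalent_impedance([0.45, 0.50, 0.47], 0.6)

for zh, zf in zip(head_z, foot_z):
    v_head, v_foot, _ = stack_voltages(zh, zf, supply_v=1.2)
    print(f"V_head={v_head:.3f} V, V_foot={v_foot:.3f} V")
```

The level that draws less power at nominal voltage presents the higher impedance and therefore receives the larger share of the supply.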
Utilizing this inherent feature of stacking is a key contribution of this paper. The core with higher impedance due to adverse PV will have a higher voltage drop across its power terminals. This results in a core speedup, relative to its speed without the higher voltage and with respect to a conventional power supply system, as delay is inversely proportional to the voltage. This, of course, comes at the cost of a lower speed for the other core in the stack. Therefore, stacking enables the slower core to run faster, relative to its speed in a conventional configuration, at the cost of the faster core running slower. In other words, the effects of PV are intrinsically balanced in a stacked configuration.
Ideally, the variation effects in a stacked configuration converge to an average variation. For this simplistic example, let us assume linear effects of PV; then one can expect the frequency of the stacked cores to converge to the average of the two cores in a conventional configuration. For example, a core with 10% variation compared to the nominal value runs at 0.9 f in a conventional configuration, and a core with −10% variation runs at 1.1 f. Stacking these cores would result in both cores exhibiting delay properties similar to the nominal values and running at about the nominal frequency, (0.9 f + 1.1 f)/2 = f.
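The idealized averaging argument can be written as a short sketch. It assumes linear PV effects, as the text does, and that nonstacked cores sharing a clock are limited by the slowest one; the function name is hypothetical and frequencies are expressed relative to the nominal f.

```python
def shared_frequency(core_freqs, stacked):
    """Operating frequency (relative to nominal f) of cores sharing a clock.

    Assumes linear PV effects: stacked cores converge to the mean, while
    nonstacked cores must run at the speed of the slowest core.
    """
    if stacked:
        return sum(core_freqs) / len(core_freqs)
    return min(core_freqs)


pair = [0.9, 1.1]  # the symmetric example from the text
print("not stacked:", shared_frequency(pair, stacked=False))
print("stacked:    ", shared_frequency(pair, stacked=True))
```

The same function can be applied to asymmetric pairs, where the stacked frequency falls below nominal but remains above that of the slowest core.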
After all, the nominal value is nothing more than the mean properties across all the samples. A nonsymmetrical example would have nonideal compensation, e.g., c 1 = 0.8 f and c 2 = 1.1 f would result in an average of 0.95 f, which is still better, since both cores would have to run at 0.8 f if stacking was not applied.
Now, we apply this reasoning to a SPICE simulation of a toy circuit, to test how this compensation works in a simple circuit. SPICE simulations (at 45-nm technology [29]) are performed for an example case where inverters are configured in conventional and series arrangements, the latter with two stack levels. The test circuit consists of an inverter driving 4 FO4 to exacerbate the delay effects with such a small circuit. The stacked configuration is supplied with 1.2 V. We test three configurations: nominal (ref), not stacked with PV (PV), and stacked with PV (s-xxxx). For the stacked configuration, the variation is set to affect the header inverter positively (i.e., shorter L eff, thus faster), and the footer negatively (i.e., longer L eff, thus slower).
In this example, there is a 20% PV in effective channel length (L eff ). If the channel length increases, the response time will be slower than the nominal nonvariation case (lines PV and ref at the bottom of Fig. 3 ). And if the channel shortens, the opposite effect will be observed (PV will be faster than ref at the top of Fig. 3 ). However, this also implies different power consumption for each, as seen earlier.
The simulation results show that with voltage stacking, the voltage rail between the two levels of the stack (V mid-pv) settles around 0.63 V. This is because the header transistors have less resistance due to the shorter L eff. The header transistors are effectively supplied with 1.2 V − 0.63 V = 0.57 V and the footer transistors with 0.63 V. In this case, there is a mitigation of the delay variation, shown by S-head and S-foot, which are closer to the scenario without PV. This is a reduction of more than half the delay variation introduced by PV.
Next, we evaluate the effects of the V th variation on the inverter delays. The same scenarios are simulated, except that this time the variation is on V th. The transient response for this case is similar to that of Fig. 3, and thus, the graph is not included. Fig. 4 summarizes all the experiments for different variation values from −20% to +20%. We evaluate both L eff and V th. Note how the delay variation is smaller with the use of stacking. Only one line is presented for the stacked configuration, since the stack position (header or footer) does not change the result.
The SPICE model confirms the premise of intrinsic mitigation of PV in the stacked configuration. Throughout the text, "variation" refers to the combination of the effects of different sources of PV, and its total effect on the frequency is measured, unless otherwise specified. However, as could be expected, the compensation is not ideal, and it is especially suboptimal for V th, due to the nonlinear dependence between delay and V th.
B. Which Lanes to Stack?
Now that we have established how PV effects can be compensated by voltage stacking, we look into the problem of deciding which GPU lanes should be stacked for optimal results. Stacking neighboring lanes can mitigate the PV effects regionally. Consider, for example, the die shown in Fig. 5. For the best results, however, the stacking configuration has to be determined based on the observed variation, i.e., by stacking a lane adversely affected by variation with a lane positively affected. In that die, stacking lane 1 and lane 4 together and lane 2 and lane 3 as another stack would result in each running at about the nominal frequency.
The best strategy is to cluster, on each side of the stack, lanes so as to minimize the standard deviation of variation within a cluster (the rationale is that lanes with a similar PV require approximately the same compensation). Then, the cluster with the maximum average PV in the header should be stacked with the cluster with the most negative average PV in the footer. However, this might not be trivial to find, since the number of possible combinations is large. A simpler approach, used in this paper, is to have the same number (for instance, N) of lanes in all the clusters. The clustering is made by simply picking the N lanes with the maximum variation in the header and stacking them with the N lanes with the most negative variation in the footer. The process is repeated for the remaining lanes. This simpler approach works well at the level of lanes, since their spatial proximity causes similar variations. Note that although this example shows an optimal case with opposite variations having the same magnitude, this is not a requirement.
Since the variation is not known until after fabrication, a configurable fabric is needed to group and stack the lanes based on the observed variation. We adapt the idea of a configurable power delivery fabric [14], [30] to allow the connection of lanes that are not neighbors, because in terms of variation management, neighboring lanes are most likely affected similarly. Our proposed fabric capitalizes on the fact that GPU Stacking is actually better suited for connecting logic with opposite variation effects. Note that this may cause a problem, given that it increases the path from V dd to Gnd. Thus, there is a design tradeoff here: on one hand, the compensation would be better if the stacked logic was farther apart in the chip; on the other hand, if this distance is larger, the voltage droop due to voltage rails and switches increases. That is another reason why stacking lanes is a good design choice as opposed to stacking SMs. Lanes are confined in space within the SM, while SMs are farther apart in the chip. Still, an SM is large enough to be used for the purposes of GPU Stacking, i.e., there is enough variation within the SM to allow for the type of compensation we aim for, as observed in our evaluation.
To simplify the design, we propose to cluster the stacking of lanes. The clustering is a tradeoff between cost and complexity. To cluster lanes for stacking, we define a shared net. A shared net is a common net that connects a number of lanes. For example, V dd is conceptually a shared net. However, we specifically use the term shared net for an intermediate net that connects a number of lanes in a stacked configuration (V mid). The number of shared nets is a design parameter and changes the tradeoff between area and compensation granularity (more shared nets result in a more fine-grained compensation). Fig. 6(a) shows a two-level stacked configuration of six cores with no shared nets. The stacking configuration is static, and is determined at design time. As the variation is not known at design time, and there is a spatial correlation in the variability, this scheme is not likely to provide compensation opportunities. Fig. 6(b) shows the design with three shared nets. To simplify the proposed design, each lane is fixed to either header or footer at design time. This choice certainly reduces the freedom of the system, but because there are several lanes per SM, the loss is minimal. The resistance of the shared nets in Fig. 6(b) is not modeled in this paper. The scheme is similar to a multipower domain case, where multiple voltage rails are present and each part of the circuit connects to a different rail. The configurability only applies to the intermediate nets.
The number of shared nets is determined at design time based on the expected variation of the technology, or on the severity of variation effects on the design metrics and the level to which they need to be managed. This decision has to be based on the variation profile for the fabrication technology. Given the knowledge of the variation, it is possible to calculate the expected power consumption for a block (in relative terms), e.g., using Varius-NTV models [4]. The power information can then be used to calculate the expected V mid voltage for a given stacking configuration, which can finally be used to calculate the expected performance. The decision on the number of shared nets is then a tradeoff between the cost of adding an extra shared net and the extra performance boost gained. This is explored in our evaluation.
During postsilicon testing, each lane can be tested to characterize the observed variation for that particular lane. This increases binning time, but current chips already undergo this type of testing for speed-grade purposes. Once the effect of variation is known for each die, the lanes can join a cluster based on their observed variation. Note that the clustering is static and done once in the lifetime of the chip in a calibration step, right after fabrication. To allow for the postsilicon configurability, an array of power transistors or fuses can be used between each lane and each shared net. Power transistors are present in modern designs for power gating purposes. Although we did not consider them in this design, the reduced current passing through them (due to voltage stacking) will largely reduce their impact on the circuit.
When multiple shared nets are present, the clustering is done by stacking the same number of lanes (namely, n) on each side (foot or head) of the shared net. The lanes are sorted by variation (minimum delay) for both head and foot groups. The n first lanes in the head group are stacked with the n last lanes in the foot group. The next n lanes in each group are stacked together, and so forth. This configuration will have the maximum compensation within each SM, as lanes with opposed variation trends tend to be in opposite sides of the same stack, replicating the behavior observed in Section IV-A.
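The clustering procedure described above can be sketched as follows. The function name, the dictionary-based representation of lanes, and the sample delay values are hypothetical; measured relative delay stands in for the observed variation, and each returned pair corresponds to one shared net.

```python
def cluster_lanes(head_delays, foot_delays, n):
    """Pair clusters of n header lanes with clusters of n footer lanes.

    head_delays / foot_delays map a lane id to its measured relative delay.
    The fastest header lanes are stacked with the slowest footer lanes,
    then the next n of each, and so forth. Returns one
    (head_cluster, foot_cluster) tuple per shared net.
    """
    if len(head_delays) != len(foot_delays) or len(head_delays) % n != 0:
        raise ValueError("both groups need the same size, a multiple of n")
    head = sorted(head_delays, key=head_delays.get)                # fastest first
    foot = sorted(foot_delays, key=foot_delays.get, reverse=True)  # slowest first
    return [(head[i:i + n], foot[i:i + n]) for i in range(0, len(head), n)]


# Hypothetical postsilicon measurements (delay relative to nominal).
head = {"h0": 0.95, "h1": 1.10, "h2": 1.00, "h3": 0.90}
foot = {"f0": 1.05, "f1": 0.92, "f2": 1.12, "f3": 0.98}

for head_cluster, foot_cluster in cluster_lanes(head, foot, n=2):
    print(head_cluster, "<->", foot_cluster)
```

Sorting the footer group in descending order and taking the first n lanes is equivalent to taking the last n lanes of an ascending sort, as the text describes.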
C. Divergence and Extreme Conditions
The last component that needs to be addressed is the presence of diverging code executing in each stack level, which has the potential to cause power mismatch and make V mid diverge too far from safe levels.
However, V mid cannot be fixed by means of an extra traditional VR, as is usually the case in voltage stacking designs. The PV compensation comes from the fact that V mid is "floating," i.e., it is not held at a fixed voltage. Footer and header groups have different voltages instead. Nevertheless, it is possible that due to load imbalance (caused by GPU divergence) or, e.g., extreme temperature conditions, the voltage difference is such that one of the levels does not have enough voltage to guarantee correctness. We call this "voltage starvation." We propose different mechanisms to handle such scenarios.
DA is inserted through the activation of parts of the lane that are not being used. For instance, if divergence is observed for long periods of time, this could shift the voltage toward the most active lanes. This shift can be canceled by adding activity in inactive lanes or inactive parts of lanes. When DA is inserted, the lane does not commit any change to the architectural state, nor does it execute stores, so that program correctness is preserved.
"Lane turn-off" is a more drastic measure for extreme cases. In this case, nothing is scheduled on one of the lanes in the level consuming more power than expected. This can only be done in architectures where each lane within the SM can execute different code, and it requires awareness in the scheduler to maintain correctness. The actual "turning off" may be done through power gating, which requires lane-level power gating, or through scheduling/clock gating. If there is still deviation after the first lane is turned off, another lane is turned off. This could have a large impact on performance, and thus, DA is the preferable solution whenever possible.
Additional VRs are used for extreme cases. In the stacked logic, V mid is expected to float, but within a range that guarantees functionality. Two small integrated VRs are used, one pull-up and one pull-down, which are activated when V mid falls below 0.4 V or rises above 0.8 V, respectively, to guarantee correct behavior and avoid bit flips. Since these regulators only cap extreme cases, they are not used in regular operation; therefore, they can be small, and their potential inefficiency is not problematic in the overall design. Since DA and lane turn-off have to be implemented at the microarchitecture level, they are naturally slower, and thus, the extra VRs are only used to give those mechanisms enough time to be activated.
The natural candidate for triggering these mechanisms is V mid , so in the case of the architectural mechanisms, the scheduler needs to be aware of the voltage during the regular operation of the GPU. VR is always connected and is triggered without any architectural intervention. In our experiments (details in Section VI), those mechanisms were never activated for any of the benchmarks tested, even when divergence was observed (for instance, in BFS).
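The interplay between the always-connected regulators and the slower architectural mechanisms can be sketched as a simple decision function. The 0.4 V and 0.8 V thresholds come from the text; the nominal V mid of 0.6 V assumes the 1.2-V two-level stack used elsewhere in the paper, and the two margins that trigger DA and lane turn-off are hypothetical tuning values, since the paper does not specify them.

```python
V_MID_MIN = 0.4  # volts: below this the pull-up VR acts (from the text)
V_MID_MAX = 0.8  # volts: above this the pull-down VR acts (from the text)


def regulator_action(v_mid):
    """Which small VR, if any, acts on the floating node."""
    if v_mid < V_MID_MIN:
        return "pull-up"
    if v_mid > V_MID_MAX:
        return "pull-down"
    return "idle"


def scheduler_action(v_mid, nominal=0.6, da_margin=0.10, turn_off_margin=0.15):
    """Architectural response chosen by a voltage-aware scheduler.

    Both margins are hypothetical; DA is tried before the more drastic
    lane turn-off, as the text recommends.
    """
    deviation = abs(v_mid - nominal)
    if deviation >= turn_off_margin:
        return "lane-turn-off"
    if deviation >= da_margin:
        return "dummy-activity"
    return "none"


for v in (0.62, 0.72, 0.85):
    print(v, regulator_action(v), scheduler_action(v))
```

In this sketch, the regulators respond only outside the safe range, while the scheduler reacts earlier to steer V mid back before the regulators are needed.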
D. Final Design
In this section, we present our final GPU stacking design. GPU Stacking does not change GPU organization (Fig. 2), nor does it affect the placement. GPU Stacking divides the GPU into different power domains, one for nonstacked structures (e.g., caches and shared memory) and one "super" power domain with the stacked lanes (and associated register files). It is a "super" power domain because, to be precise, each stack is a power domain of its own, but this is not known before fabrication. Fig. 7 shows the proposed microarchitecture from a power delivery perspective, for a solution with four shared nets.
We adopt the stacked SRAMs previously proposed [11] (RF in Fig. 7) . Note that the SRAM-stacked design has read and write ports at nominal voltage, and thus, level shifters need to be added for read and write to the register file. There are multiple options available for level shifters suitable for voltage stacking [31] with different tradeoffs between area, power, and performance.
One concern about GPU Stacking is the area overhead due to shared nets. Introducing extra shared nets could increase the total amount of metal dedicated to power rails. On the other hand, GPU Stacking decreases the overall current of the chip and metal from V dd and Gnd rails could be reduced.
Let m b be the total metal budget for power rails in a chip. For a two-level-stacked system (such as the one proposed here), the overall current is reduced by half, I stack = I base /2. Thus, V dd and Gnd could each have roughly half the metal they need in the baseline (m b /4 for each). For a system with n shared nets, each shared net takes I stack /n current. Thus, each shared net needs ≈n times less metal than V dd (m b /4/n). Consequently, in GPU Stacking, the total metal budget for power rails is
$$m_{stack} = \frac{m_b}{4} + \frac{m_b}{4} + n \cdot \frac{m_b}{4n} = \frac{3}{4} m_b.$$
In other words, in a GPU Stacking configuration, it is possible to reduce the amount of metal dedicated to power rails. In our evaluation, we assume that the metal budget is kept constant instead (i.e., V dd and Gnd metal is reduced by 1/3 only), which in turn helps improve the PDN by reducing the resistance.
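The metal accounting in the preceding paragraphs can be checked numerically. The sketch below assumes, as the text does, that the metal a rail needs is proportional to the current it carries; the function name `stacked_metal_budget` is hypothetical and exact fractions are used only to keep the arithmetic readable.

```python
from fractions import Fraction


def stacked_metal_budget(m_b, n_shared_nets):
    """Metal per rail for a two-level stack, assuming metal scales with current."""
    vdd = m_b / 4                             # half of the baseline share m_b/2
    gnd = m_b / 4                             # same reasoning as for V_dd
    per_shared_net = m_b / 4 / n_shared_nets  # each net carries I_stack / n
    total = vdd + gnd + n_shared_nets * per_shared_net
    return vdd, gnd, per_shared_net, total


vdd, gnd, per_net, total = stacked_metal_budget(Fraction(1), 4)
shared_share = 4 * per_net / total  # fraction of the stacked budget used by shared nets
```

Under these assumptions the total is 3/4 of the baseline budget for any n, and the shared nets together take one third of the metal actually used, which matches keeping the budget constant and reducing V dd and Gnd metal by 1/3.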
This solution is suitable for systems not bound by power density. However, if power density becomes an issue, a more elaborate PDN would be needed. This could be achieved by exploiting the observation that each lane only taps either V dd and a shared net or Gnd and a shared net. Thus, it is possible to reduce the number of tracks dedicated to the unused rail on top of each lane and increase the density of the relevant rails.
In any case, a more thorough analysis of the area overhead would be required when implementing a design that uses GPU Stacking. However, since area overhead is usually not a major concern in modern chip designs, we do not put much emphasis on this analysis and leave it as future work.
V. EXPERIMENTAL SETUP
To evaluate GPU stacking, we start by determining what is the optimal configuration in terms of SMs and lanes for an NTC GPU. Non-NTC commercial GPUs vary in size from 2 to 16 SMs, and can have between 16 and 64 lanes per SM. After deciding on the baseline GPU for our experiments, we evaluate the potential of GPU Stacking to compensate for PV both in performance and power. Then, we evaluate the reliability of the PDN, as it is a main concern in voltage stacking proposals.
A. Baseline GPU Choice
A modified version of ESESC [32] is used to simulate a GPGPU. For power estimation, we use a GPGPU model for McPAT [33], very similar to GPUSimPow [34]. McPAT takes the microarchitectural activity statistics from ESESC and calculates the power consumption of each component. This simulation setup provides both dynamic and leakage power and performance for the applications. The temperature dependence of leakage is also taken into account by ESESC.
We simulate GPGPUs with a range of configurations, created either by varying the number of SMs or the structures within each (e.g., the number of lanes in each SM). This is summarized in Table I. The McPAT [33] tool estimates the power consumption of the GPGPU model, and only the on-chip structures are modeled for this experiment. Since the stacking is applied within the SMs, the number of SMs does not affect the result per SM.
Benchmarks used are from popular suites Rodinia [35] , Parboil [36] , and CUDA SDK (bfs, cfd, convolution, hotspot, backprop, lbm, transpose, srad, and sgemm).
B. Process Variation Modeling
To evaluate the ability of GPU Stacking to compensate for the performance loss due to PV, we model PV following an existing methodology [37]. Briefly, VARIUS-NTV [4] (planar) and VARIUS-TC [38] (FinFET) are used to generate PV maps. VARIUS models both fine-grained (WID) and coarser grained (D2D) variation, each with systematic and random components [39]. The systematic component is modeled using a multivariate distribution with a spherical spatial correlation structure, and the random component, which occurs at the transistor level, is modeled analytically. VARIUS divides the chip into n small equally sized rectangles. Each grid point has a systematic variation of L eff and V th that is assumed to follow a normal distribution. The random variation of L eff and V th is treated differently because of the level of granularity at which it occurs, and it is assumed to be distributed normally and without any correlation [39].
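To make the "spherical spatial correlation structure" concrete, the sketch below evaluates the standard spherical correlogram from geostatistics on a small grid. It is not the VARIUS code: the function names, the unit-square chip, the 4x4 grid, and the range parameter `phi = 0.5` are all assumptions chosen for illustration.

```python
import math


def spherical_correlation(r, phi):
    """Spherical correlogram: 1 at r = 0, decaying to 0 for r >= phi."""
    if r >= phi:
        return 0.0
    x = r / phi
    return 1.0 - 1.5 * x + 0.5 * x ** 3


def grid_points(nx, ny):
    """Centers of an nx-by-ny grid of equal rectangles on a unit-square chip."""
    return [((i + 0.5) / nx, (j + 0.5) / ny) for i in range(nx) for j in range(ny)]


def correlation_matrix(points, phi):
    """Pairwise correlation of the systematic component between grid points."""
    return [[spherical_correlation(math.hypot(ax - bx, ay - by), phi)
             for (bx, by) in points]
            for (ax, ay) in points]


corr = correlation_matrix(grid_points(4, 4), phi=0.5)
```

A matrix like `corr` would serve as the correlation structure of the multivariate normal from which the systematic L eff and V th values are drawn; nearby rectangles are strongly correlated and rectangles farther apart than `phi` are uncorrelated.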
Given the GPU floorplan as the input, VARIUS-NTV provides die maps, each with a specific PV case. The goal is to use VARIUS-NTV to identify the worst PV cases across die maps and to understand how V mid behaves under extreme PV effects or application behavior. VARIUS-NTV also outputs normalized delay, normalized L eff , and effective V th for each component of the GPU die (lanes, caches, register files, and so on). The information from VARIUS-NTV is then used to calculate the expected power for each element in the stack and then the expected voltage on each shared net. The calculated V mid is then fed back into VARIUS-NTV to calculate the delay and power after compensation in the stacked configurations.
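The step from expected power to expected shared-net voltage can be illustrated with a first-order model. The sketch below assumes each stack level behaves as a fixed resistor derived from its expected power at a common nominal voltage, so the stack reduces to a voltage divider. This is a simplification introduced here for illustration (the flow described above iterates through VARIUS-NTV instead), and the function name is hypothetical.

```python
def v_mid_from_power(vdd, p_top, p_bottom):
    """Voltage across the bottom level of a two-level stack.

    Assumes each level acts as a resistor R = V_nom^2 / P, so that
    V_mid = Vdd * R_bottom / (R_top + R_bottom).
    """
    r_top = 1.0 / p_top        # the common V_nom^2 factor cancels in the ratio
    r_bottom = 1.0 / p_bottom
    return vdd * r_bottom / (r_top + r_bottom)
```

With equal expected power the divider settles at V dd /2. If the top level would draw more power than the bottom one, V mid rises, so the lower-power (typically slower) level receives the larger share of the supply, which is the compensation effect the stacking relies on.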
This experiment is performed for a different number of shared nets, where we compare our scheme with a conventional nonstacked baseline and against multifrequency domain, which has been shown to have promising results in mitigating PV effects [7] .
C. Power Delivery Simulation and Technology Node
To evaluate the V mid noise and voltage noise behavior in the GPGPU PDN, we adapt the methodology proposed by Leng et al. [28]. In short, an off-chip and on-chip PDN is simulated with cores modeled as current-controlled current sources, where the transient current is estimated by cycle-accurate microarchitectural simulation for each core. This type of approach has been used in multiple studies of this sort [37], [40], [41]. Thus, we use the power traces from ESESC for each lane.
Our SPICE simulations model the printed circuit board, the package [28], and the on-chip PDN using the IBM Power Grid benchmark (ibmpg1t) [42]. Fig. 8 summarizes the complete PDN with the simulation parameters. The grid is represented by the four resistors in the box named "PDN," but the simulation is performed using the full grid. On-package capacitors (C p ) are used to stabilize the voltage on shared nets, and smaller on-die capacitors (C d ) are used to eliminate fast transient response due to mismatch. Although the figure only shows one set of on-package caps with the respective C4 bump, we use one set per shared net. The cores are modeled as variable resistances based on the power traces from ESESC. This methodology is compatible with current industry practices, and short of fabricating a chip, it is the best available method for this type of low-level analysis. For the technology node, we use planar CMOS at 45 nm [29] and FinFET devices at 15 nm [43].
For between-level communication, we use the level shifter proposed in [13], which has been shown to provide good performance/energy/area tradeoffs [31]. Our SPICE simulations show that this circuit has one FO4 delay overhead at the NTC voltage when communicating between different stacks; however, these level shifters can substitute buffers that were present in the design, minimizing the performance impact. Memory and caches are not stacked in this paper, and since PV is mitigated by the stacking in the cores, the cores are expected to end up achieving a higher frequency than memory. To keep the improvement on the logic side, we consider an increase in the number of access cycles rather than reducing the frequency of the core.
We consider both single-clock domain and multiclock domain, in which case, each SM runs at a single frequency. In order to meet timing, this will be the frequency of the slowest lane (after compensation). It is possible to have different frequencies for different SMs, but in this paper, only one frequency is considered for the whole GPU.
VI. EVALUATION
We first look into the baseline selection results, since all other results will be built upon this GPU. Then, we present the main results on how GPU stacking can mitigate PV effects and present other advantages of our approach. Next, we evaluate different aspects related to the stability of our approach to get a comprehensive idea of how GPU Stacking affects the power delivery of a chip. Finally, we investigate how the tradeoffs presented by GPU stacking change when FinFETs are used and finish our evaluation by discussing design and fabrication aspects of this new approach.
A. GPU Sizing and Baseline
We start our evaluation with a careful analysis of performance, energy, and area tradeoffs of GPUs using NTC. We consider different voltages and numbers of SMs. Fig. 9 summarizes energy-delay (ED) and energy-delay-area (EDA) products for 1x with a different number of SMs. The y-axis shows the normalized value for each metric with reference to the 1x/4 SM configuration at 1 V supply. The x-axis is V dd . In general, energy decreases as V dd approaches the near-threshold region. Then, the delay starts to degrade more rapidly, increasing the energy consumption, mainly due to clocked logic and leakage.
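For reference, the two figures of merit can be written as a short helper that normalizes against a reference configuration, as done on the y-axis of Fig. 9. The function name is hypothetical, and the input values in the usage line are placeholders chosen for easy arithmetic, not measurements from this evaluation.

```python
def normalized_products(energy, delay, area, ref_energy, ref_delay, ref_area):
    """Return (ED, EDA), each normalized to a reference configuration."""
    ed = (energy * delay) / (ref_energy * ref_delay)
    eda = (energy * delay * area) / (ref_energy * ref_delay * ref_area)
    return ed, eda


# Placeholder inputs (not measured values): half the energy, equal delay, twice the area.
ed, eda = normalized_products(0.5, 1.0, 2.0, 1.0, 1.0, 1.0)
```

In this placeholder case the configuration wins on ED but only ties on EDA, which is the kind of tradeoff the area term is meant to expose when more SMs are added.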
To understand the effect of an NTC GPU, let us examine the 4SM and 8SM configurations at both 1.0 V and the NTC voltage of 0.6 V. Clearly, 8SM has a higher EDA than 4SM at 1.0 V. However, as V dd approaches the near-threshold region, the EDA of both configurations decreases. Moreover, at around 0.6 V, the 8SM configuration has a lower EDA than 4SM. This means that the 8SM configuration is more efficient at lower voltages. In addition, the delay of the 8SM configuration at this point is nearly the same as the delay of the baseline 4SM running at 1.0 V. This shows that the investment in extra resources pays off as V dd approaches the near-threshold region: performance stays within 4.8% of the 1x/4 SM configuration operating at 1.0 V, while power consumption is reduced to about 43% of the baseline, at the cost of more area.
Another observation is that the optimal configuration for different metrics changes as V dd changes. For example, Fig. 10(a) shows the design space for ED at 1.0 V. The optimal configuration is 1x/6 SM. Bigger structure sizes for cache or number of lanes could increase the performance, but the increased power makes such a tradeoff less desirable due to power budget constraints and possible thermal issues. Fig. 10(b) shows the design space for the same metrics at 0.6 V. Under this condition, the optimal configuration is 8SM. In addition, the relative efficiency of bigger structure sizes (e.g., 1.5×) increases. This implies that the architectural parameters, such as cache or RF size, should be reconsidered for maximum efficiency as the operating voltage changes.
B. Benefits of GPU Stacking on PV
Now that we have established the baseline configuration of the NTC GPU, we look into the positive effect of GPU stacking on PV. We used the variation maps generated by VARIUS-NTV, where each map corresponds to a die, and estimated performance and power for each die. Each die presents a unique PV, and thus, GPU Stacking will have different effects on each. We also look at the effect of different numbers of shared nets in the design. Fig. 11 shows the histograms for performance and power, i.e., the y-axis is the percentage of chips (out of 10k chips) and the x-axis is the performance/power. Performance and power are normalized to the value obtained in the case with no PV. Both nonstacked and stacked (xSN, where x is the number of shared nets) methods are shown. The nominal case, i.e., with no PV, would be at the x = 1 mark. We notice that without stacking and with PV, the average performance is around 50% of the nominal. Using one shared net per SM does not improve much on the nonstacked case, since all the lanes share the same shared net. Two, four, and eight shared nets provide good design solutions, with eight being very close to four in terms of performance, but with slightly lower power. The best configuration is arguably eight shared nets in this case, but two shared nets present a good tradeoff between design complexity and results.
Overall, GPU Stacking delivers about 75% of the performance with 75% of the power compared with the no-variation conditions. This represents a reduction in the degradation due to PV: 37% in performance and 39% in power compared with the conventional configuration. The increase in power is due to the increased frequency, but energy per instruction remains roughly the same. This may seem like a no-gain approach, but it means that GPU Stacking is able to reduce the effects of PV, delivering a chip that is closer to ideal scaling. For the sake of comparison, ASV with four power domains would deliver only ≈62% of nominal performance, on average, whereas four shared nets would deliver ≈70% of nominal performance, on average.
Another way of mitigating the effects of PV is to use multiclock domains [15]. Since each SM in GPUs operates in lockstep, the maximum number of clock domains possible is equal to the number of SMs. However, our technique is orthogonal to multiclock domains and could be applied in combination with them. Thus, we repeat the previous experiment, but considering multiclock domains. Our results (Fig. 12) show that, by using multiclock domains alone, it is only possible to improve the performance slightly, to close to 60% of the nominal performance. When also applying GPU stacking, there is a small improvement with 2SN; 4SN and 8SN have very similar performance, but four shared nets have better power consumption. Thus, 4SN seems to be the best configuration, with two shared nets providing a good tradeoff point. In summary, combining multiclock domains and GPU Stacking delivers about 80% of the nominal performance with 70% of the power, which is an improvement of 20 percentage points over multiclock domains only.
C. Other Advantages of GPU Stacking
Besides compensating for PV effects, GPU Stacking has other advantages. The increase in the supply voltage has some indirect advantages, some of which are quantified in this section. They share two main causes: VRs are more efficient at higher voltages, and the overall current in the circuit is reduced to roughly half [8], [13]. Using results from Hong and Kim [44], we estimate memory power as 17% of the total GPU power. The memory power is not affected by GPU Stacking. Some of the advantages of GPU Stacking are related to VRs: VR area is reduced by roughly two times with half the current [45]. In the Intel Haswell integrated VR, each 2.8-mm² cell can deliver a 25-A maximum current. A modern low-end GPU consumes ≈55 W [46]. VR efficiency is a function of both V dd and output current [47]. Considering both the increase in V dd from ≈0.6 V (in the near-threshold region) to ≈1.2 V and the halving of the total current, we expect an improvement in VR efficiency of 12%. Also, the ≈50% reduction in current drawn by the stacked logic can yield a ≈41% reduction in the VR area.
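The ≈41% figure appears consistent with a simple weighting of the two numbers given above: only the non-memory share of the chip sees its current halved. The sketch below reproduces that arithmetic under the assumption that current share tracks power share; the function name is hypothetical.

```python
MEMORY_POWER_FRACTION = 0.17   # memory share of total GPU power, from the text
LOGIC_CURRENT_REDUCTION = 0.5  # stacked logic draws roughly half the current


def total_current_reduction(memory_fraction, logic_reduction):
    """Chip-level current reduction when only the non-memory part is stacked."""
    return (1.0 - memory_fraction) * logic_reduction


reduction = total_current_reduction(MEMORY_POWER_FRACTION, LOGIC_CURRENT_REDUCTION)
```

Here `reduction` evaluates to 0.415, i.e., roughly the 41% quoted for VR area, on the further assumption that VR area scales linearly with the delivered current.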
The number of pins and pads is mainly determined by the total amount of current flowing through them. To keep the current per pin constant, it is now possible to reduce the number of pins. Again, the current related to the logic decreases to roughly half, but the current in memories is the same. Once more, this yields a reduction of 41% in the total number of power pins. Note that the number of pads dedicated to V dd can also be decreased, but pads are now needed for shared nets, since on-package decaps are used. The overall number of pads is not expected to change. Table II summarizes the expected variation in multiple chip parameters due to GPU Stacking.
D. Analysis of PDN Stability
We also examined the PDN stability from different perspectives to make sure that GPU Stacking can reliably execute different applications while keeping V mid stable, and we looked at power distribution concerns across the chip.
The main concern is that divergence between application threads could cause V mid to diverge from acceptable levels. In typical GPGPU applications, threads exhibit very similar activity rates. This minimizes the possibility of a load mismatch in the stack. In our experiments, we observe that the power consumption of lanes is within 5% of each other 98% of the time and within 10% of each other 99.2% of the time. Power lane1 /Power lane2 averages 1.000024 with a standard deviation of 0.070. The sampling rate for our measurement is on the order of 1-10 MHz.
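Statistics of this kind can be computed from two sampled lane power traces along the following lines. This is a hedged sketch, not the measurement script used here: the function name, the dictionary keys, and the choice to define "within x%" as |ratio − 1| ≤ x are assumptions, and the trace in the usage lines is made-up data for illustration only.

```python
from statistics import mean, pstdev


def mismatch_stats(power_lane1, power_lane2):
    """Summarize the sample-by-sample power ratio between two stacked lanes."""
    ratios = [a / b for a, b in zip(power_lane1, power_lane2)]

    def fraction_within(tolerance):
        return sum(abs(r - 1.0) <= tolerance for r in ratios) / len(ratios)

    return {
        "mean_ratio": mean(ratios),
        "std_ratio": pstdev(ratios),
        "within_5pct": fraction_within(0.05),
        "within_10pct": fraction_within(0.10),
    }


# Made-up four-sample trace, for illustration only.
stats = mismatch_stats([1.0, 1.04, 1.08, 1.3], [1.0, 1.0, 1.0, 1.0])
```

For the toy trace, half of the samples fall within 5% and three quarters within 10%.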
One source of concern is the peak difference, which is as high as 40% in our experiments. We observe that mismatches higher than 30% occurred only for the backprop benchmark, and only 0.1% of the time for that benchmark. Since the elevated mismatches are observed over very short intervals, they can be handled by decoupling capacitors.
To examine whether the achievable range of decoupling capacitance in the design is enough for the observed mismatches, we ran an SM-like design through synthesis and back-end design down to GDS. We then extracted the capacitance of the supply nets. Our experiments show that such a design using standard cell decoupling capacitance cells (dcaps) would have a total capacitance of 1.4 nF/mm². In the superthreshold region, with a power density of about 1 W/mm², the time constant for the power supply RC circuit would be on the order of a couple of nanoseconds. Such a small time constant is not enough to sustain the transient mismatches, which appear on the order of 100 ns to 1 µs. In the near-threshold region, with smaller power density, the time constant would be on the order of 10 ns, which is still not enough. Therefore, we make use of on-package capacitors, which are commonly used for PDN stabilization. Adding capacitors for the shared nets will not increase the number of package pins [9].
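The order of magnitude of the on-die time constant can be reproduced with a first-order estimate. The sketch assumes the load per unit area looks like a resistance V²/P and that the superthreshold supply is about 1 V; both are assumptions made for this illustration, and the function name is hypothetical.

```python
def supply_time_constant(v_supply, power_density_w_mm2, cap_density_f_mm2):
    """First-order RC time constant of the on-die supply, in seconds."""
    r_eff = v_supply ** 2 / power_density_w_mm2  # effective load, ohm * mm^2
    return r_eff * cap_density_f_mm2             # (ohm * mm^2) * (F / mm^2) = s


# Superthreshold case: ~1 V (assumed), 1 W/mm^2 and 1.4 nF/mm^2 from the text.
tau_super = supply_time_constant(1.0, 1.0, 1.4e-9)
```

This gives `tau_super` of about 1.4 ns, in line with "a couple of nanoseconds" and far below the 100 ns to 1 µs duration of the observed mismatches.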
In the PV case, V mid is expected to deviate from the nominal voltage, and this behavior is desired, since it is the source of the PV compensation. We evaluate how the load mismatch affects the voltage available for both levels of the stack by carrying out a SPICE simulation of the model presented in Fig. 8. Each stack contains 16 lanes, 8 in the header and 8 in the footer. A total of 4 stacks (2 SMs with 32 lanes each) are hooked into the grid, equally distributed, and each V mid has two 5 µF on-package decoupling capacitors (one between V mid and Gnd and the other between V mid and V dd ). BFS and backprop are run, one in each SM (these were the two benchmarks with the highest mismatch between lanes). Fig. 13 shows the on-chip transient voltage for each stack level during execution; one of the stacks is omitted for clarity. The voltage source is 1.2 V. Instead of plotting the voltage in relation to the global Gnd, we plot the local voltage difference, which is more meaningful. The voltage for each stack level stays within 10% of the expected voltage for compensation, showing a very good balance. This also implies that the minimum voltage for our technique to work is V th + 10%.
Another concern regarding the PDN is lateral current and IR drop. The main concern here is that if stacked lanes are further apart within the chip, there will be increased resistance for current to traverse that distance.
The total current for V dd in GPU Stacking is roughly half that of the conventional case, and therefore, one would expect decreased voltage droop. On the other hand, given a fixed budget for power delivery, the insertion of new power rails requires that some of the resources used by V dd and Gnd in the conventional scenario be used for shared nets instead. Since the current in these rails is expected to be much smaller, the resource reduction in V dd and Gnd is small. We consider that 1/3 of the metal used for power delivery is allocated to shared nets. Fig. 14 shows a 2-D color map of the IR drop for V dd , both in the conventional configuration and in GPU Stacking, for the BFS benchmark from our SPICE simulations, already considering the reduced resources for V dd . The contour lines are traced every 0.02 V. The conventional configuration was scaled up from 0.6 to 1.2 V for better comparison.
For V mid , we start by finding the worst case from the variation maps, i.e., the case with the maximum power difference between the stack levels. We note that, since we are stacking lanes within an SM, and given the spatial correlation of the variation, in most cases, the variation is well below this maximum. To estimate the resistance, we use the IBM Power Grid Benchmark (ibmpg1t), which is scaled to estimate the equivalent grid of one shared net. The transient simulation considers the maximum variation case. The maximum observed IR drop in V mid was 65 mV in our simulations for all benchmarks, which is ≈5.4% of the whole supply voltage (1.2 V) or ≈10.8% of the NTC voltage of 0.6 V, and since we operate more than 200 mV above V th , this is well within acceptable margins.
E. Stacking FinFETs Versus Planar CMOS
Now that we have looked into the benefits of GPU Stacking and analyzed its stability and impact on the PDN, we look into how FinFETs affect the tradeoffs observed for GPU Stacking. We want to make sure that our proposal is still useful in newer technology nodes.
The first observation is that FinFETs seem to be less sensitive to PV effects than planar CMOS devices. Fig. 15 shows the energy versus delay for both planar CMOS and FinFETs with L eff variation from −15% to 15%. As we increase L eff variation in FinFETs, the effect on the ED is much smaller than in planar CMOS, which indicates that the effect of PV on the energy consumption of FinFETs is not as significant as it is in planar CMOS. Since GPU Stacking relies on the difference in power to shift V mid and mitigate PV, it cannot be directly applied in that case. In addition, there is intrinsically less PV to compensate in this case. Nevertheless, it is still possible to use GPU Stacking to compensate for PV effects on FinFET devices. To do so, we propose the use of the extra VRs to force the V mid voltage to a beneficial level. GPU Stacking is designed and evaluated in the same way, but during the binning process, a voltage level is chosen for each shared net based on the variation of the lanes assigned to it.
Now, we look into how much PV compensation can be obtained by applying GPU Stacking to a FinFET-based GPU design. We start from the variation maps and calculate the expected performance and power for each chip, considering the best forced voltage for each shared net. Fig. 16 shows the histograms for performance and power. The y-axis is the percentage of chips (out of 50k) and the x-axis is performance/power.
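Choosing a forced voltage per shared net during binning can be pictured as a one-dimensional search. The sketch below is not the procedure used in this evaluation: it assumes an alpha-power-law frequency model with a hypothetical exponent of 1.3 (a rough approximation near threshold), a single representative V th per stack level, and a 10 mV search step.

```python
def relative_frequency(v, vth, alpha=1.3):
    """Alpha-power-law estimate of achievable frequency (arbitrary units)."""
    if v <= vth:
        return 0.0
    return (v - vth) ** alpha / v


def best_forced_vmid(vdd, vth_top, vth_bottom, step_mv=10):
    """Grid search for the V_mid that maximizes the slower level's frequency.

    The bottom level sees V_mid and the top level sees Vdd - V_mid.
    """
    best_v, best_f = None, -1.0
    for mv in range(step_mv, int(round(vdd * 1000)), step_mv):
        v_mid = mv / 1000.0
        f = min(relative_frequency(vdd - v_mid, vth_top),
                relative_frequency(v_mid, vth_bottom))
        if f > best_f:
            best_v, best_f = v_mid, f
    return best_v, best_f


v_balanced, _ = best_forced_vmid(1.2, 0.30, 0.30)
v_skewed, _ = best_forced_vmid(1.2, 0.30, 0.35)
```

With matched thresholds the search returns V dd /2, and when the bottom level has the higher V th the chosen V mid moves above V dd /2, giving the slower level more headroom.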
In Fig. 16, performance and power are normalized to the values of the no-PV case. We note that in this case, the nonstacked version already presents performance and power numbers very close to nominal (≈90%). Still, GPU Stacking is able to shift the curve up to ≈95% performance, on average. Although these results are less impressive than the planar CMOS counterpart, the compensation effect is still observed with FinFET devices. The multiclock domain experiment yielded very similar results and is omitted for the sake of space.
F. GPU Stacking Practicality Issues
Finally, we finish our evaluation by looking into some practical aspects regarding fabrication, design, and power gating. We briefly discuss those issues in this section.
1) Extreme Variation Cases:
In some extreme cases, the difference in variation between the stack levels may be significantly elevated, reducing the voltage in one of the levels to unacceptable levels (compared with V th ). After fabrication, chips are divided into performance/power grades or discarded according to the variation. The same applies to GPU Stacking. In practice, this is done during binning, where timing characteristics of the chip are assessed. In this paper, to evaluate how GPU Stacking affects yield, we use VARIUS-NTV [4] to calculate the expected V mid for each stack. We then compare this value to the V th of each lane in the stack. If the supply voltage of at least one lane is smaller than 1.2·V th , we consider the chip discarded. Our simulations show that, in such a scenario, 1.8% of the chips would be discarded. A more aggressive approach would be to disable lanes that do not meet this requirement, but we do not consider this case.
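The discard rule can be written down directly. In the sketch below, the data layout (one tuple per stack holding V mid and the V th values of its top and bottom lanes) and the function names are assumptions for illustration; only the 1.2·V th criterion and the 1.2 V supply come from the text.

```python
def chip_passes(stacks, vdd=1.2, margin=1.2):
    """Return False if any lane's supply is below margin * V_th.

    stacks: list of (v_mid, vth_top_lanes, vth_bottom_lanes) tuples, where top
    lanes see Vdd - V_mid and bottom lanes see V_mid.
    """
    for v_mid, vth_top_lanes, vth_bottom_lanes in stacks:
        if any(vdd - v_mid < margin * vth for vth in vth_top_lanes):
            return False
        if any(v_mid < margin * vth for vth in vth_bottom_lanes):
            return False
    return True


def discard_rate(chips):
    """Fraction of chips that fail the V_th margin check."""
    return sum(not chip_passes(stacks) for stacks in chips) / len(chips)
```

A usage example would pass one list of stacks per simulated die; the lane-disabling alternative mentioned above would instead drop only the offending lanes rather than return `False` for the whole chip.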
2) In-Rush Current and Power Gating: GPU Stacking reduces the current drawn from the PDN by roughly half during the regular operation and start-up phase. Our SPICE models were simulated for both stacked and nonstacked configurations, and showed that in-rush is indeed smaller in GPU Stacking (data are not included).
Power gating can be applied in addition to the GPU Stacking. The only requirement is that pairs of lanes need to be powered off together (one in the head and one in the foot group), but there is no need to power gate one entire SM at a time.
3) Implementation Flow: GPU Stacking operates at the physical level only and does not make changes at the register transfer level. However, the physical implementation flow needs to be altered. GPU Stacking creates voltage rails that will be connected to the "local" GND rail of some portions of logic and to the "local" V dd rail of others. The idea of having multiple power rails is similar to multiple power domains. Power transistors can then be used to open or close the circuit and choose which shared net each core will tap. Another option is to permanently open some connections, since the clustering decision does not change after fabrication.
GPU Stacking also requires isolated wells to avoid unwanted body bias effects between stack levels, and thus, either fully depleted SOI (FD-SOI) or triple-well technology is needed [8] . Although this is restrictive, the FD-SOI has high availability and triple-well only requires an extra mask during fabrication.
Sign-off of the stacked configuration may also be a concern. However, we note that a corner-based approach can be used to determine the possible ranges of power and timing of a stacked system. The procedure should consider what happens when stacking clusters of different types, i.e., FF and SS, FF and FF, and SS and SS. Although this increases the effort for sign-off, it can be trivially implemented in EDA flows. This may be a bit of a pessimistic approach, since it will show worst case configurations that will likely not happen.
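The corner pairs listed above can be enumerated mechanically, which is one way such a check could be scripted in a flow. This is a small sketch with hypothetical names; it treats a pair as unordered, so if the position in the stack (header versus footer) matters for a given analysis, the ordered pairs from `itertools.product` would be needed instead.

```python
from itertools import combinations_with_replacement

CORNERS = ("FF", "SS")

# Unordered pairs of process corners for the two levels of a stack.
stack_corner_pairs = list(combinations_with_replacement(CORNERS, 2))
```

The resulting list contains the three combinations named in the text: FF with FF, FF with SS, and SS with SS.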
VII. CONCLUSION
We present GPU stacking as a method to manage the effects of PV. We show that stacking cores with opposite variations tends to balance the variation effects that would have appeared in a conventional configuration. To maximize the balancing effect, cores with opposite variations should be stacked, which requires postsilicon configurability. We propose a clustering technique to make such configurability feasible. The homogeneous nature of GPGPU applications makes them suitable candidates for GPU stacking. Previous voltage stacking publications only analyzed multicores and required complex circuitry to stabilize the voltage. This paper is the first to use stacking in the context of PV and to propose a floating middle rail. We show that the stable nature of GPGPUs allows for the use of only decoupling capacitors to stabilize the power delivery.
This paper provides a detailed evaluation of NTC with GPGPUs and the idea of GPU stacking. We first carefully size a GPU for NTC operation, achieving 43% power savings with only 4.8% performance degradation. We then apply GPU Stacking to manage PV, which impacts NTC circuits more than circuits in the superthreshold region. The homogeneous nature of GPGPU architectures and applications makes them a very interesting candidate for exploration in the extreme domains with both low voltages and small feature sizes. We show that stacking can increase the performance under PV at near threshold, on average, by 37% compared with the traditional (nonstacked) configuration, delivering 80% of the performance compared with the no-variation (ideal) conditions. Even when using multifrequency domains, GPU Stacking is able to further improve PV compensation by about 30%. Although this technique is more suited to GPUs due to their homogeneous nature, it could be adapted for use in a scheme similar to CoreUnfolding [9], where the authors leverage power consumption correlation among parts of a single core.
ACKNOWLEDGMENT
Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the funding institutions.
